A Careful Control Loop
For instructions, goals, and constraints, read the figure as an interface check: identify the language input, grounding evidence, action representation, safety gate, and logged result before accepting the agent behavior described below.
Build And Evaluation Checklist
Scope. This section distinguishes a free-form instruction from the executable goal and constraint objects that planners consume. Readers should finish knowing which parts of a sentence are optimization targets, which are hard constraints, and which are preferences that can be traded off.
Production and evaluation contract. A publishable artifact here records the instruction parse, the goal predicate, the forbidden predicates, and the scalar objective used during planning. Without that split, two systems can appear comparable while optimizing different notions of success.
For instructions, goals, and constraints, name the language interface, grounded world state, executable action contract, and evidence artifact before trusting any claimed improvement.
For instructions, goals, and constraints, write one evidence row recording the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.
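One way to make the evidence row concrete is a small record type serialized as one JSON line. The sketch below is illustrative only: the class name `EvidenceRow`, its field names, and the example values are assumptions chosen to mirror the five items listed above, not a prescribed logging format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class EvidenceRow:
    # Hypothetical field names mirroring the five items in the text.
    instruction: str                      # raw language input
    world_state_estimate: dict            # grounded state the planner saw
    chosen_action: str                    # executable action that was selected
    verifier_passed: bool                 # result of the safety/constraint check
    failure_label: Optional[str] = None   # e.g. "parse_error"; None on success

row = EvidenceRow(
    instruction="bring the red mug, keep it upright, avoid the left shelf",
    world_state_estimate={"red_mug": "on_counter", "upright": True},
    chosen_action="pick(red_mug)",
    verifier_passed=True,
)

# One JSON line per decision keeps the log easy to diff and audit.
line = json.dumps(asdict(row), sort_keys=True)
print(line)
```

Keeping the raw instruction next to the chosen action in the same row makes it possible to compare what was said with what was done without consulting a second log.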
Instructions, goals, and constraints are where language becomes a planning problem. The words are valuable only after the system separates what must happen, what must never happen, and what would be nice if time permits.
This section turns natural-language directives into a control objective that a symbolic planner, a model predictive control (MPC) stack, or a learned policy can actually optimize.
The practical question is which clauses in the instruction should become equalities, inequalities, or preference weights in the downstream planner.
A planner is only as safe as the strongest constraint it refuses to violate. Preferences can slide; forbidden states cannot.
Theory
Suppose an instruction $x$ induces a goal variable $g$, a set of hard constraints $\mathcal C$, and a preference score $r_\text{pref}$. A planner can then search over trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$ by solving $$\max_{\tau} \; \mathbb E\left[\sum_t \bigl( r(s_t, a_t; g) + \lambda \, r_\text{pref}(s_t, a_t, x) \bigr)\right] \quad \text{s.t.} \quad c_k(s_t, a_t, x) \le 0 \; \forall k \in \mathcal C, \; \forall t,$$ where $\lambda \ge 0$ weights the preference term against the task reward. The language front end decides what enters the reward and what enters the constraint set.
This distinction matters because optimization behaves differently under each choice. If 'do not tip the cup' is encoded as a mild reward penalty, a planner may accept spills when the goal is otherwise attractive. If it is encoded as a hard constraint or shield, the system must seek an alternative path or ask for clarification.
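The difference between the two encodings can be seen on a toy problem. The sketch below assumes two hypothetical candidate plans with made-up task rewards and a made-up penalty weight; it only illustrates the selection logic, not any particular planner.

```python
# Toy candidates: (name, task_reward, tips_cup). All values are illustrative.
plans = [
    ("direct", 10.0, True),    # attractive goal progress, but tips the cup
    ("detour", 6.0, False),    # less reward, keeps the cup upright
]

def pick_with_penalty(plans, penalty):
    # 'Do not tip the cup' encoded as a reward penalty on violating plans.
    return max(plans, key=lambda p: p[1] - (penalty if p[2] else 0.0))[0]

def pick_with_constraint(plans):
    # The same rule encoded as a hard constraint: violators are removed first.
    feasible = [p for p in plans if not p[2]]
    if not feasible:
        return None  # no safe plan: replan or ask for clarification
    return max(feasible, key=lambda p: p[1])[0]

soft_choice = pick_with_penalty(plans, penalty=2.0)
hard_choice = pick_with_constraint(plans)
print(soft_choice, hard_choice)  # direct detour
```

With a penalty of 2.0 the violating plan still scores 8.0 against 6.0, so the soft encoding accepts the spill. The hard encoding never lets the violating plan compete.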
A good parser emits typed slots such as `goal=deliver(red_mug, user)`, `constraint=keep_upright(red_mug)`, and `preference=avoid_left_shelf`. Those slots are much more stable engineering interfaces than raw prompts because verifiers and controllers can inspect them directly.
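A minimal sketch of such typed slots, using only standard-library dataclasses. The class names `Predicate` and `TaskSpec` are hypothetical, introduced here for illustration; the slot contents come from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    # Hypothetical typed predicate: a name plus positional arguments.
    name: str
    args: tuple

    def __str__(self):
        return f"{self.name}({', '.join(self.args)})"

@dataclass(frozen=True)
class TaskSpec:
    # Hypothetical task object with one slot per role in the instruction.
    goal: Predicate
    hard_constraints: tuple = ()   # must never be violated
    preferences: tuple = ()        # may be traded off against the goal

spec = TaskSpec(
    goal=Predicate("deliver", ("red_mug", "user")),
    hard_constraints=(Predicate("keep_upright", ("red_mug",)),),
    preferences=(Predicate("avoid", ("left_shelf",)),),
)

# Verifiers and controllers can read each slot without re-parsing text.
print(spec.goal, [str(c) for c in spec.hard_constraints])
```

Freezing the dataclasses is a design choice in this sketch: once a parse has been validated, downstream modules cannot silently move a rule from one slot to another.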
Worked Example
Code Fragment 1 shows a compact parser that turns a single sentence into hard and soft task elements. The important detail is not the string matching itself, but the separation between mandatory and negotiable parts of the instruction.
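```python
# Split one instruction into a goal, a hard constraint, and a soft preference.
# Real systems use learned parsing, but the typed output contract is the same.
# The planner should inspect these slots directly instead of re-reading the sentence.
instruction = "bring the red mug, keep it upright, avoid the left shelf"
goal, hard_constraints, preferences = None, [], []
for clause in (c.strip() for c in instruction.split(",")):
    if clause.startswith("bring"):        # mandatory outcome
        goal = "deliver(red_mug)"
    elif clause.startswith("keep"):       # must never be violated
        hard_constraints.append("keep_upright(red_mug)")
    elif clause.startswith("avoid"):      # negotiable, can be traded off
        preferences.append("avoid(left_shelf)")
print({"goal": goal, "hard": hard_constraints, "soft": preferences})
```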
The expected output is a three-field task object with exactly one goal slot, one hard-constraint list, and one soft-preference list. If the parser merged `keep_upright(red_mug)` into the soft field or omitted it entirely, the downstream planner would optimize the wrong problem even if the natural-language instruction still looked correct to a human reviewer.
Libraries such as Pydantic, JSON schema tool calling, and structured-output APIs turn the same pattern into a few lines by forcing the LLM to emit typed fields. They handle validation, missing keys, and schema checks internally, so the planner receives machine-readable goals rather than brittle free text.
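To show what those libraries take off the developer's hands, the sketch below hand-writes the same checks with the standard library only. It is a stand-in for illustration, not Pydantic's or any vendor's API; the function name `validate_task_json` and the `REQUIRED` table are assumptions, and the field names follow the worked example.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED = {"goal": str, "hard": list, "soft": list}

def validate_task_json(raw):
    """Return (task, errors); task is None when validation fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for key, expected in REQUIRED.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"{key} must be {expected.__name__}")
    return (data if not errors else None), errors

ok, ok_errors = validate_task_json(
    '{"goal": "deliver(red_mug)", "hard": ["keep_upright(red_mug)"], "soft": []}'
)
bad, bad_errors = validate_task_json(
    '{"goal": "deliver(red_mug)", "soft": "avoid(left_shelf)"}'
)
print(ok_errors, bad_errors)
```

The second payload is rejected for two reasons: the hard-constraint list is missing and the soft field is a bare string. A planner that accepted it would run with no hard rules at all.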
Practical Recipe
- Write one schema for goals, one for hard constraints, and one for preferences.
- Define a parser failure state for instructions that cannot populate the schema reliably.
- Make the verifier inspect hard constraints before any preference score is reported.
- Assign explicit units to every numeric threshold extracted from text, such as speed or distance.
- Treat underspecified slots as a clarification trigger, not as permission to improvise.
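Three of the recipe items (an explicit failure state, units on numeric thresholds, and clarification instead of improvisation) can be sketched together. Everything named below is hypothetical: the `Threshold` and `ParseResult` types, the status strings, and the toy clause format "stay under <number> <unit>" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Threshold:
    value: float
    unit: str   # explicit unit, e.g. "m/s" or "m"

@dataclass(frozen=True)
class ParseResult:
    status: str                           # "ok", "needs_clarification", "parse_failed"
    max_speed: Optional[Threshold] = None
    question: Optional[str] = None

def parse_speed_clause(clause: str) -> ParseResult:
    """Toy parser for clauses such as 'stay under 0.5 m/s'."""
    tokens = clause.split()
    if "under" not in tokens:
        return ParseResult(status="parse_failed")
    rest = tokens[tokens.index("under") + 1:]
    try:
        value = float(rest[0])
    except (IndexError, ValueError):
        return ParseResult(status="needs_clarification",
                           question="What speed limit should I use?")
    if len(rest) < 2:
        # A bare number is underspecified: ask rather than assume a unit.
        return ParseResult(status="needs_clarification",
                           question=f"Which unit applies to {value}?")
    return ParseResult(status="ok", max_speed=Threshold(value, rest[1]))

for text in ["stay under 0.5 m/s", "stay under 0.5", "go slowly"]:
    print(text, "->", parse_speed_clause(text).status)
```

The unitless clause is routed to a question instead of being silently read as metres per second, which is the behavior the last recipe item asks for.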
A common mistake is to overfit to clean lab instructions where every constraint is stated explicitly. Real instructions omit quantities, reference hidden user preferences, and conflict with the geometry of the scene. Silent default choices can look intelligent while actually violating the user's intent.
A home assistant that hears 'bring me the soup, but do not spill it and do not wake the baby' should parse one delivery goal, one fluid-stability constraint, and one acoustic preference. The last item may reshape route choice and speed even when the delivery target stays the same.
Natural language loves to hide a legal department inside one comma. 'Bring the mug, but not that mug, and be quick, but be careful' is still one sentence to the human and three optimization problems to the robot.
Recent work on structured outputs, semantic parsers for robotics, and constrained policy optimization increasingly blurs the line between natural-language tasking and formal task specifications. The open question is how much of the structure should be learned end to end and how much should stay explicit for safety and debugging.
If you remove the sentence and keep only the parsed task object, can the downstream planner still tell what is mandatory and what is merely preferred?
This section is where language touches control theory most directly. Once the utterance becomes a constrained optimization problem, the usual machinery of feasibility, receding-horizon planning, and safety filtering applies. The LLM is useful because it proposes the task object; it is not the final judge of whether the task object is physically or ethically valid.
The best engineering pattern is therefore asymmetric: let language be flexible at proposal time and rigid at execution time. Proposal modules may entertain multiple parses, but execution modules should consume one validated, typed contract whose semantics are stable across seeds, prompts, and model versions.
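The asymmetry can be sketched as a gate between proposal and execution: several candidate parses may arrive, but only one validated contract leaves. The function names and the dictionary layout below are assumptions for illustration, and the agreement rule (execute only when all valid parses coincide) is one possible policy among several.

```python
def is_valid(parse):
    # Minimal contract check: a goal string plus list-typed constraint slots.
    return (
        isinstance(parse.get("goal"), str)
        and isinstance(parse.get("hard"), list)
        and isinstance(parse.get("soft"), list)
    )

def canonical(parse):
    # Order-independent key so that equivalent parses compare equal.
    return (parse["goal"], tuple(sorted(parse["hard"])), tuple(sorted(parse["soft"])))

def select_contract(candidates):
    """Return ('execute', contract) only when all valid parses agree."""
    valid = {canonical(p) for p in candidates if is_valid(p)}
    if len(valid) == 1:
        goal, hard, soft = next(iter(valid))
        return "execute", {"goal": goal, "hard": list(hard), "soft": list(soft)}
    return "clarify", None

strict = {"goal": "deliver(red_mug)", "hard": ["keep_upright(red_mug)"], "soft": ["avoid(left_shelf)"]}
softened = {"goal": "deliver(red_mug)", "hard": [], "soft": ["avoid(left_shelf)", "keep_upright(red_mug)"]}
malformed = {"goal": "deliver(red_mug)"}

print(select_contract([strict, dict(strict), malformed])[0])  # execute
print(select_contract([strict, softened])[0])                 # clarify
```

In the second call the two parses disagree about whether keeping the mug upright is mandatory, so the gate refuses to pick one and defers to clarification.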
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Pydantic or dataclasses | Typed task-object validation. | Use them to reject malformed parses before the planner sees them. |
| OpenAI or Anthropic structured outputs | Schema-constrained LLM parsing. | Use them when free-form prompts are too brittle for production tasking. |
| BehaviorTree.CPP | Execution logic with explicit success and failure branches. | Use it when a parsed constraint should trigger fallback or clarification instead of silent retries. |
| MoveIt Task Constructor | Constraint-aware manipulation planning. | Use it when language specifies goal poses, collision exclusions, or grasp requirements. |
| ROS 2 actions | Long-running skill invocation with cancelation and feedback. | Use actions when language goals may be revised mid-execution. |
Code Fragment 2 scores candidate plans against one hard constraint and one preference to make the distinction visible numerically. The hard constraint prunes infeasible plans first; only then does the preference score choose between survivors.
- Generate several candidate plans from the same typed instruction object.
- Reject every candidate that violates a hard rule before computing preference scores.
- Score the surviving candidates with a transparent preference model.
- If no candidate is feasible, ask a clarification question tied to the missing slot or impossible constraint.
- Save both the feasible set and the rejected set so later audits can separate parser and planner failures.
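A minimal sketch of the filter-then-rank steps listed above. The plan names `short_path` and `quiet_route` and the `upright` flag follow the surrounding text; the third candidate, the `noise_cost` field, and all numeric values are illustrative assumptions.

```python
# Candidate plans generated from the same typed instruction object.
# The third plan and every cost value are assumptions for illustration.
candidates = [
    {"name": "short_path", "upright": False, "noise_cost": 1.0},
    {"name": "quiet_route", "upright": True, "noise_cost": 2.0},
    {"name": "hallway_route", "upright": True, "noise_cost": 5.0},
]

def violates_hard_rule(plan):
    # Hard rule from the instruction: the mug must stay upright.
    return not plan["upright"]

# Step 1: feasibility filtering, before any preference score is computed.
feasible = [p for p in candidates if not violates_hard_rule(p)]
rejected = [p for p in candidates if violates_hard_rule(p)]

# Step 2: preference ranking among survivors only (lower cost is better).
if feasible:
    outcome = min(feasible, key=lambda p: p["noise_cost"])["name"]
else:
    outcome = "clarify: no candidate keeps the mug upright"

print([p["name"] for p in feasible])
print(outcome)

# Step 3: keep both sets so audits can separate parser and planner failures.
audit = {"feasible": feasible, "rejected": rejected}
```

Note that `short_path` has the lowest cost of all three candidates, so ranking before filtering would have selected the plan that tips the mug.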
The expected output is a feasible-set list that excludes `short_path`, followed by the name of the lowest-cost surviving plan, `quiet_route`. Reading the trace should make the algorithmic order obvious: feasibility filtering happens first, preference optimization happens second, and any output that still contains an `upright=False` plan indicates that the hard-rule gate failed.
If execution violates intent, inspect the failure in order: parsing, constraint typing, feasibility filtering, then preference ranking. Many so-called planning errors are actually parse errors where a soft preference was accidentally promoted or a hard rule was accidentally softened.
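The inspection order can be written down as a short triage routine. This is a sketch under assumptions: the stage names, the `checks` dictionary, and the helper `constraint_typing_ok` are hypothetical, and the example reproduces the case where a hard rule was accidentally softened.

```python
# Inspection order taken from the text; the stage names are hypothetical labels.
STAGES = ["parsing", "constraint_typing", "feasibility_filtering", "preference_ranking"]

def first_failure(checks):
    """checks maps stage name -> bool (True means the stage passed inspection)."""
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None

def constraint_typing_ok(expected_hard, parsed):
    # Every rule the reviewer considers mandatory must sit in the hard slot.
    return set(expected_hard) <= set(parsed["hard"])

# A parse in which the hard rule was accidentally softened.
parsed = {"goal": "deliver(red_mug)", "hard": [], "soft": ["keep_upright(red_mug)"]}

checks = {
    "parsing": parsed["goal"] is not None,
    "constraint_typing": constraint_typing_ok(["keep_upright(red_mug)"], parsed),
    "feasibility_filtering": True,
    "preference_ranking": True,
}
print(first_failure(checks))  # constraint_typing
```

Because the routine stops at the earliest failing stage, a softened rule is reported as a typing error and is not misattributed to the planner.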
Language-guided planning improves when instructions are converted into typed goals and constraints whose semantics survive the transition from text to control.
Take one household instruction with at least two clauses and express it as a typed goal object with one hard rule and one soft preference. Then explain how your planner should behave when the hard rule makes all current plans infeasible.
Ahn et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv.
SayCan is a key example of separating linguistic plausibility from executable feasibility, which is exactly the distinction between semantic intent and constraint satisfaction.
ROS 2 Documentation. "Creating an action."
ROS 2 actions illustrate how long-running goals become typed contracts with feedback, cancelation, and result states.
BehaviorTree.CPP Documentation. "Integration with ROS2."
Behavior trees provide a practical execution language for turning parsed constraints into retry, fallback, and verification structure.