Section 31.2: Instructions, goals, constraints | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 31.2A: Instructions, goals, and constraints encoded as three separate signals to a robot: the instruction names the task, the goal specifies the terminal state, and a constraint channel marks regions and actions that are forbidden.

Read the figure as an interface check: identify the language input, grounding evidence, action representation, safety gate, and logged result before accepting the agent behavior described below.

Figure 31.2: A closed-loop map for instructions, goals, and constraints. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Depth and self-containment. This section distinguishes a free-form instruction from the executable goal and constraint objects that planners consume. By the end, readers should know which parts of a sentence are optimization targets, which are hard constraints, and which are preferences that can be traded off.

Production and evaluation contract. A publishable artifact here records the instruction parse, the goal predicate, the forbidden predicates, and the scalar objective used during planning. Without that split, two systems can appear comparable while optimizing different notions of success.
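One way to make that contract concrete is a small serializable record. The sketch below is illustrative only: the class name `PlanningRecord`, its field names, and the example values (including the `tipped(red_mug)` predicate and the objective string) are assumptions, not a format defined in this book.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PlanningRecord:
    """Hypothetical evaluation artifact; field names are illustrative."""
    instruction: str             # raw sentence given to the system
    goal_predicate: str          # what must hold at the end
    forbidden_predicates: tuple  # what must never hold
    objective: str               # scalar objective the planner optimized


record = PlanningRecord(
    instruction="bring the red mug, keep it upright, avoid the left shelf",
    goal_predicate="deliver(red_mug)",
    forbidden_predicates=("tipped(red_mug)",),
    objective="path_length + 0.5 * near(left_shelf)",
)

# Serialize with sorted keys so two systems can be compared field by field.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

If two systems publish records in this shape, a difference in `objective` or `forbidden_predicates` is visible before any success rates are compared.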

Checklist Memory Anchor

Before trusting any claimed improvement in instruction following, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.

Mini Audit Exercise

Write one evidence row that records the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.

Big Picture

Instructions, goals, and constraints are where language becomes a planning problem. The words are valuable only after the system separates what must happen, what must never happen, and what would be nice if time permits.

This section turns natural-language directives into a control objective that a symbolic planner, MPC stack, or policy can actually optimize.

The practical question is which clauses in the instruction should become equalities, inequalities, or preference weights in the downstream planner.

Action Is The Test

A planner is only as safe as the constraints it refuses to violate. Preferences can slide; forbidden states cannot.

Theory

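Suppose an instruction induces a goal variable $g$, a set of hard constraints $\mathcal C$, and a preference score $r_{\text{pref}}$. A planner can then solve

$$\max_{\tau} \; \mathbb E\left[\sum_t r(s_t, a_t; g) + \lambda \, r_{\text{pref}}(s_t, a_t, x)\right] \quad \text{s.t.} \quad c_k(s_t, a_t, x) \le 0 \quad \forall t, \; \forall k \in \mathcal C.$$

Here $\tau$ is the planned trajectory of states $s_t$ and actions $a_t$, $\lambda \ge 0$ weights the preference term against the task reward, and $x$ is the instruction context that the preference and constraint functions condition on. The language front end decides what enters the reward and what enters the constraint set.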

This distinction matters because optimization behaves differently under each choice. If 'do not tip the cup' is encoded as a mild reward penalty, a planner may accept spills when the goal is otherwise attractive. If it is encoded as a hard constraint or shield, the system must seek an alternative path or ask for clarification.
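A small numeric sketch makes the difference visible. The two candidate plans, their rewards, and the penalty weights are made-up numbers chosen only for illustration.

```python
# Two hypothetical plans: a fast one that tips the cup and a slow one that does not.
plans = {
    "fast_tilted": {"task_reward": 10.0, "tips_cup": True},
    "slow_level": {"task_reward": 6.0, "tips_cup": False},
}


def score(plan, penalty):
    """Task reward minus a soft penalty if the cup is tipped."""
    return plan["task_reward"] - (penalty if plan["tips_cup"] else 0.0)


def best_with_soft_penalty(plans, penalty):
    """Treat tipping as a reward penalty and pick the highest-scoring plan."""
    return max(plans, key=lambda name: score(plans[name], penalty))


def best_with_hard_constraint(plans):
    """Drop every plan that tips the cup, then maximize reward over the rest."""
    feasible = {n: p for n, p in plans.items() if not p["tips_cup"]}
    if not feasible:
        return None  # no safe plan: replan or ask for clarification
    return max(feasible, key=lambda name: feasible[name]["task_reward"])


soft_choice = best_with_soft_penalty(plans, penalty=2.0)
hard_choice = best_with_hard_constraint(plans)
print(soft_choice)  # fast_tilted, because 10 - 2 = 8 still beats 6
print(hard_choice)  # slow_level
```

With a mild penalty the spill is simply priced in. Raising the penalty eventually flips the choice, but only the hard-constraint version guarantees the outcome regardless of how attractive the goal reward becomes.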

Mechanism

A good parser emits typed slots such as `goal=deliver(red_mug, user)`, `constraint=keep_upright(red_mug)`, and `preference=avoid(left_shelf)`. Those slots are much more stable engineering interfaces than raw prompts because verifiers and controllers can inspect them directly.
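To illustrate what "inspect them directly" can mean, the following sketch turns a slot string into a name and an argument tuple. The `Predicate` type and `parse_predicate` helper are hypothetical names introduced here, and the flat `name(arg, ...)` grammar is an assumption; nested predicates are deliberately rejected.

```python
import re
from typing import NamedTuple


class Predicate(NamedTuple):
    name: str
    args: tuple


def parse_predicate(text: str) -> Predicate:
    """Parse a string like 'deliver(red_mug, user)' into a name and arguments."""
    match = re.fullmatch(r"\s*(\w+)\(([^()]*)\)\s*", text)
    if match is None:
        raise ValueError(f"not a predicate: {text!r}")
    args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
    return Predicate(match.group(1), args)


slots = {
    "goal": parse_predicate("deliver(red_mug, user)"),
    "constraint": parse_predicate("keep_upright(red_mug)"),
}

# A verifier can now check which object a constraint binds to
# without re-reading the original prompt.
print(slots["goal"])
print(slots["constraint"].args)
```

Free text such as `"bring the mug"` raises a `ValueError` here, which is the behavior a downstream controller wants: malformed slots fail loudly instead of being interpreted.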

Worked Example

Code Fragment 1 shows a compact parser that turns a single sentence into hard and soft task elements. The important detail is not the string matching itself, but the separation between mandatory and negotiable parts of the instruction.

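# Split one instruction into a goal, a hard constraint, and a soft preference.
# Real systems use learned parsing, but the typed output contract is the same.
# The planner should inspect these slots directly instead of re-reading the sentence.
instruction = "bring the red mug, keep it upright, avoid the left shelf"

# Toy string matching: each comma-separated clause fills one typed slot.
goal, hard_constraints, preferences = None, [], []
for clause in (part.strip() for part in instruction.split(",")):
    if clause.startswith("bring"):
        goal = "deliver(red_mug)"  # mandatory outcome
    elif clause.startswith("keep"):
        hard_constraints.append("keep_upright(red_mug)")  # must never be violated
    elif clause.startswith("avoid"):
        preferences.append("avoid(left_shelf)")  # negotiable

print({"goal": goal, "hard": hard_constraints, "soft": preferences})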

{'goal': 'deliver(red_mug)', 'hard': ['keep_upright(red_mug)'], 'soft': ['avoid(left_shelf)']}

The expected output is a three-field task object with exactly one goal slot, one hard-constraint list, and one soft-preference list. If the parser merged `keep_upright(red_mug)` into the soft field or omitted it entirely, the downstream planner would optimize the wrong problem even if the natural-language instruction still looked correct to a human reviewer.

Code Fragment 1: This parser emits a typed contract that later modules can inspect without guessing which clauses are negotiable. The crucial distinction is that `keep_upright(red_mug)` is preserved as a hard rule, while `avoid(left_shelf)` remains a soft preference that a planner may relax only if necessary.

Library Shortcut

Tools such as Pydantic, JSON Schema tool calling, and structured-output APIs reduce the same pattern to a few lines by forcing the LLM to emit typed fields. They validate types, flag missing keys, and enforce the schema, so the planner receives machine-readable goals rather than brittle free text.
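To show what such validation buys without depending on any particular library, here is a standard-library sketch. The `TaskSpec` class, the `load_task` helper, and the JSON payload are hypothetical; a Pydantic model or a structured-output schema would play the same role with less code.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical typed task object; not tied to any specific LLM API."""
    goal: str
    hard: list = field(default_factory=list)
    soft: list = field(default_factory=list)

    def __post_init__(self):
        if not isinstance(self.goal, str) or not self.goal:
            raise ValueError("goal must be a non-empty string")
        for name in ("hard", "soft"):
            items = getattr(self, name)
            if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
                raise ValueError(f"{name} must be a list of strings")


def load_task(raw_json: str) -> TaskSpec:
    """Reject malformed model output before the planner sees it."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("task payload must be a JSON object")
    unknown = set(data) - {"goal", "hard", "soft"}
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    if "goal" not in data:
        raise ValueError("missing required key: goal")
    return TaskSpec(**data)


# Stand-in for text returned by a model.
raw = ('{"goal": "deliver(red_mug)", '
       '"hard": ["keep_upright(red_mug)"], "soft": ["avoid(left_shelf)"]}')
task = load_task(raw)
print(task.hard)
```

A payload with a missing `goal`, an extra key, or a string where a list is expected raises `ValueError`, so the failure surfaces at the parsing boundary rather than inside the planner.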

Practical Recipe

Write one schema for goals, one for hard constraints, and one for preferences.
Define a parser failure state for instructions that cannot populate the schema reliably.
Make the verifier inspect hard constraints before any preference score is reported.
Assign explicit units to every numeric threshold extracted from text, such as speed or distance.
Treat underspecified slots as a clarification trigger, not as permission to improvise.
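Parts of this recipe can be sketched in a few lines: an explicit failure state, units attached to thresholds, and a clarification trigger for underspecified slots. All names below (`ParseStatus`, `Threshold`, `ParseResult`, `check_slots`) and the speed-limit scenario are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ParseStatus(Enum):
    OK = "ok"
    NEEDS_CLARIFICATION = "needs_clarification"
    FAILED = "failed"


@dataclass(frozen=True)
class Threshold:
    value: float
    unit: str  # e.g. "m/s" or "m"; never left implicit


@dataclass(frozen=True)
class ParseResult:
    status: ParseStatus
    goal: Optional[str] = None
    max_speed: Optional[Threshold] = None
    question: Optional[str] = None


def check_slots(goal: Optional[str], speed_value: Optional[float],
                speed_unit: Optional[str]) -> ParseResult:
    """Turn raw extracted slots into OK, clarification, or failure."""
    if goal is None:
        # Without a goal the schema cannot be populated at all.
        return ParseResult(ParseStatus.FAILED)
    if speed_value is not None and speed_unit is None:
        # A number without a unit is underspecified, so ask instead of guessing.
        return ParseResult(
            ParseStatus.NEEDS_CLARIFICATION, goal=goal,
            question=f"Which unit applies to the speed limit {speed_value}?")
    limit = None if speed_value is None else Threshold(speed_value, speed_unit)
    return ParseResult(ParseStatus.OK, goal=goal, max_speed=limit)


print(check_slots("deliver(red_mug)", 0.5, None).status)
print(check_slots("deliver(red_mug)", 0.5, "m/s").max_speed)
```

The design choice worth noting is that `NEEDS_CLARIFICATION` is a first-class result rather than an exception, so the executive layer can route it to a dialogue turn instead of a retry.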

Common Failure Mode

A common mistake is to overfit to clean lab instructions where every constraint is stated explicitly. Real instructions omit quantities, reference hidden user preferences, and conflict with the geometry of the scene. Silent default choices can look intelligent while actually violating the user's intent.

Practical Example

A home assistant that hears 'bring me the soup, but do not spill it and do not wake the baby' should parse one delivery goal, one fluid-stability constraint, and one acoustic preference. The last item may reshape route choice and speed even when the delivery target stays the same.

Memory Hook

Natural language loves to hide a legal department inside one comma. 'Bring the mug, but not that mug, and be quick, but be careful' is still one sentence to the human and three optimization problems to the robot.

Research Frontier

Recent work on structured outputs, semantic parsers for robotics, and constrained policy optimization increasingly blurs the line between natural-language tasking and formal task specifications. The open question is how much of the structure should be learned end to end and how much should stay explicit for safety and debugging.

Self Check

If you remove the sentence and keep only the parsed task object, can the downstream planner still tell what is mandatory and what is merely preferred?

This section is where language touches control theory most directly. Once the utterance becomes a constrained optimization problem, the usual machinery of feasibility, receding-horizon planning, and safety filtering applies. The LLM is useful because it proposes the task object; it is not the final judge of whether the task object is physically or ethically valid.

The best engineering pattern is therefore asymmetric: let language be flexible at proposal time and rigid at execution time. Proposal modules may entertain multiple parses, but execution modules should consume one validated, typed contract whose semantics are stable across seeds, prompts, and model versions.
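One possible way to realize this asymmetry is to sample several parses and execute only when enough of them agree. The sketch below is an assumption-laden illustration, not a recommended algorithm: the agreement rule, the `min_agreement` value, and the helper names `has_goal` and `select_contract` are all hypothetical.

```python
from collections import Counter


def has_goal(parse):
    """Hypothetical validator: a parse is usable only if it names a goal."""
    return bool(parse.get("goal"))


def select_contract(candidates, validate, min_agreement=0.6):
    """Return one validated parse for execution, or None to ask for clarification."""
    if not candidates:
        return None
    valid = [p for p in candidates if validate(p)]
    # Freeze each parse into a hashable key so identical parses can be counted.
    counts = Counter(
        (p["goal"], tuple(sorted(p.get("hard", []))), tuple(sorted(p.get("soft", []))))
        for p in valid
    )
    if not counts:
        return None
    (goal, hard, soft), votes = counts.most_common(1)[0]
    if votes / len(candidates) < min_agreement:
        return None  # proposals disagree too much to execute
    return {"goal": goal, "hard": list(hard), "soft": list(soft)}


proposals = [
    {"goal": "deliver(red_mug)", "hard": ["keep_upright(red_mug)"],
     "soft": ["avoid(left_shelf)"]},
    {"goal": "deliver(red_mug)", "hard": ["keep_upright(red_mug)"],
     "soft": ["avoid(left_shelf)"]},
    # A divergent parse that softened the hard rule.
    {"goal": "deliver(red_mug)", "hard": [],
     "soft": ["keep_upright(red_mug)", "avoid(left_shelf)"]},
]
contract = select_contract(proposals, has_goal)
print(contract)
```

Here two of three proposals agree, so a single contract reaches execution. With only the last two proposals, agreement drops to one half and the function returns `None`, which should trigger a clarification rather than a guess.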

Tool Choices For Typed Instruction Interfaces

| Tool or Library | Role in the Topic | Builder Advice |
| --- | --- | --- |
| Pydantic or dataclasses | Typed task-object validation. | Use them to reject malformed parses before the planner sees them. |
| OpenAI or Anthropic structured outputs | Schema-constrained LLM parsing. | Use them when free-form prompts are too brittle for production tasking. |
| BehaviorTree.CPP | Execution logic with explicit success and failure branches. | Use it when a parsed constraint should trigger fallback or clarification instead of silent retries. |
| MoveIt Task Constructor | Constraint-aware manipulation planning. | Use it when language specifies goal poses, collision exclusions, or grasp requirements. |
| ROS 2 actions | Long-running skill invocation with cancelation and feedback. | Use actions when language goals may be revised mid-execution. |

Code Fragment 2 scores candidate plans against one hard constraint and one preference to make the distinction visible numerically. The hard constraint prunes infeasible plans first; only then does the preference score choose between survivors.

Generate several candidate plans from the same typed instruction object.
Reject every candidate that violates a hard rule before computing preference scores.
Score the surviving candidates with a transparent preference model.
If no candidate is feasible, ask a clarification question tied to the missing slot or impossible constraint.
Save both the feasible set and the rejected set so later audits can separate parser and planner failures.
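The steps above can be sketched as follows. Treat this as an illustrative reconstruction under stated assumptions rather than the chapter's own listing: `short_path`, `quiet_route`, and the `upright` flag come from the surrounding text, while `wide_arc`, the `pref_cost` field, and every number are invented for the example.

```python
# Candidate plans generated from one typed instruction object.
# wide_arc and all pref_cost values are assumptions for illustration.
candidates = [
    {"name": "short_path", "upright": False, "pref_cost": 1.0},
    {"name": "quiet_route", "upright": True, "pref_cost": 2.0},
    {"name": "wide_arc", "upright": True, "pref_cost": 3.5},
]

# Step 1: the hard rule prunes first, regardless of cost.
feasible = [p for p in candidates if p["upright"]]
rejected = [p for p in candidates if not p["upright"]]

best = None
if feasible:
    # Step 2: the preference cost ranks only the survivors.
    best = min(feasible, key=lambda p: p["pref_cost"])
    print([p["name"] for p in feasible])
    print(best["name"])
else:
    # Step 3: nothing is feasible, so ask rather than relax the hard rule.
    print("clarify: no candidate keeps the mug upright")

# Keep both sets so an audit can separate parser failures from planner failures.
audit_log = {
    "feasible": [p["name"] for p in feasible],
    "rejected": [p["name"] for p in rejected],
}
```

Note that `short_path` has the lowest preference cost of all three candidates, yet it never competes, because it is removed before any cost is compared.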

The expected output is a feasible-set list that excludes `short_path`, followed by the name of the lowest-cost surviving plan, `quiet_route`. Reading the trace should make the algorithmic order obvious: feasibility filtering happens first, preference optimization happens second, and any output that still contains an `upright=False` plan indicates that the hard-rule gate failed.

Code Fragment 2: This ranking stage makes the constraint hierarchy explicit. `short_path` is discarded before the preference score is even considered, and the final choice comes from the feasible set rather than from the global minimum cost over invalid actions.

If execution violates intent, inspect the failure in order: parsing, constraint typing, feasibility filtering, then preference ranking. Many so-called planning errors are actually parse errors where a soft preference was accidentally promoted or a hard rule was accidentally softened.
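The constraint-typing step of that inspection can be checked mechanically when a human-reviewed reference parse exists. The sketch below assumes such a reference is available; the function name `typing_errors` and the dictionary layout are hypothetical.

```python
def typing_errors(parsed, reference):
    """Compare a parsed task object against a human-reviewed reference parse."""
    ref_hard, ref_soft = set(reference["hard"]), set(reference["soft"])
    got_hard, got_soft = set(parsed["hard"]), set(parsed["soft"])
    return {
        # Hard rules that the parser demoted to preferences.
        "softened": sorted(ref_hard & got_soft),
        # Preferences that the parser promoted to hard rules.
        "promoted": sorted(ref_soft & got_hard),
        # Clauses that vanished from the parse entirely.
        "dropped": sorted((ref_hard | ref_soft) - got_hard - got_soft),
    }


reference = {"hard": ["keep_upright(red_mug)"], "soft": ["avoid(left_shelf)"]}
parsed = {"hard": [], "soft": ["keep_upright(red_mug)", "avoid(left_shelf)"]}

report = typing_errors(parsed, reference)
print(report)
```

A non-empty `softened` or `promoted` list points at the parser, so there is no need to debug the planner until those lists are empty.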

Key Takeaway

Language-guided planning improves when instructions are converted into typed goals and constraints whose semantics survive the transition from text to control.

Exercise 31.2.1

Take one household instruction with at least two clauses and express it as a typed goal object with one hard rule and one soft preference. Then explain how your planner should behave when the hard rule makes all current plans infeasible.

Bibliography and Further Reading

Primary Sources and Tools

Ahn et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv.

SayCan is a key example of separating linguistic plausibility from executable feasibility, which is exactly the distinction between semantic intent and constraint satisfaction.


ROS 2 Documentation. "Creating an action."

ROS 2 actions illustrate how long-running goals become typed contracts with feedback, cancelation, and result states.


BehaviorTree.CPP Documentation. "Integration with ROS2."

Behavior trees provide a practical execution language for turning parsed constraints into retry, fallback, and verification structure.
