Section 34.7: Prompting and conditioning embodied policies | Building Embodied AI: From Perception to Autonomous Action

"A robot policy is a promise about the next second of the world."
A Grounded AI Agent

Library Shortcut

Use prompt suites as data files, not prose buried in notebooks. A simple CSV or JSON prompt panel lets every model variant run on the same goal, object, constraint, and stop-condition cases.

Big Picture

Prompting changes robot behavior because it narrows the policy distribution at runtime. In a VLA, the prompt is not presentation text. It is part of the control contract that links language, scene understanding, and action execution.

Prompting Is Runtime Conditioning

In a language model, a prompt can ask for a style or answer format. In a VLA, a prompt can change motor behavior. That makes prompt design part of the control interface. The prompt must specify the goal, relevant object, constraints, and stop condition without asking the low-level policy to reason beyond its grounding.

Conditioning can also come from goal images, robot state, task embeddings, skill identifiers, or structured plans from an LLM planner. The important point is that conditioning narrows the policy distribution. A vague instruction such as "clean this up" may belong to a planner in Chapter 33, while the VLA needs a grounded command such as "pick up the red block and place it in the blue tray."

Prompt As Contract

A good VLA prompt is not poetic. It is a compact contract between the human, the perception system, and the action policy.

Prompt Patterns

Pattern	Example	Use
Goal only	"put the cup on the coaster"	Simple familiar tasks
Goal plus constraint	"move the cup without touching the plate"	Safety or clutter
Goal plus object attributes	"pick the smaller red block"	Visual grounding tests
Goal plus stop condition	"release when the block is centered on the tray"	Precise completion

Prompt Injection Has A Physical Version

If a robot sees text in the scene or hears competing instructions, the policy needs an instruction hierarchy. System constraints, operator commands, perception labels, and environmental text should not have equal authority.

Practical Recipe

Maintain a prompt test set just like a visual test set. Include synonyms, distractor objects, ambiguous references, negations, and safety constraints. Run every policy variant on the same prompt set and save videos for the first 20 failures.

Hands-On Lab: Build A VLA Dataset Card And Fine-Tuning Plan

Duration: about 75 minutesIntermediate

Objective

Build a practical VLA adaptation plan for one tabletop task using a LeRobot-style dataset schema, prompt templates, evaluation splits, and a small-policy shortcut.

What You'll Practice

Defining observation, state, action, and language fields for VLA training.
Writing prompt templates that constrain robot behavior.
Creating construct-matched evaluation panels.
Using a maintained open VLA toolchain instead of custom loaders.

Setup

The code below is designed for a notebook or Colab-like environment. Use the current LeRobot install instructions before running because package extras change.

# Install the open robot-learning toolkit and common notebook dependencies.
# Check the LeRobot repository for the current extras before a real fine-tune.
pip install lerobot numpy pandas

Code Fragment 1: The setup command installs LeRobot plus lightweight analysis dependencies. It intentionally avoids model downloads so the first pass can run on modest hardware.

Figure 34.7 gives this page a compact map of the interface. Read it left to right, then check whether the surrounding prose names the same observation, action, and evidence contract.

Figure 34.7: A closed-loop map for Prompting and conditioning embodied policies. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Curriculum, depth, and self-containment. Prompting and conditioning narrow the policy distribution. Good prompts specify task, object, scene, robot state, and success condition. For Prompting and conditioning embodied policies, the practical reading is to pin down the interface, assumptions, concrete example, and failure mode before comparing methods.

Production and evaluation contract. Conditioning should remove ambiguity that the action head cannot safely resolve. For Prompting and conditioning embodied policies, treat the diagram, code, table, exercise, warning, and references as one evidence packet: boundary, artifact, tool choice, transfer check, failure mode, and source grounding.

Checklist Memory Anchor

Before accepting a Prompting and conditioning embodied policies result, name the loop variable that changed, the tool that makes it reproducible, the failure that would fool the metric, and the source that backs the claim.

Mini Audit Exercise

For this section, write one evidence row with observation, action, metric, dataset or robot, seed, and failure label. Then explain why comparing that row with a result from a different setup would be invalid.

Steps

Step 1: Define the dataset card

Create a structured card before touching model code. The reader-fill fields force you to name the contract that the policy will learn.

# Dataset card: record the robot, sensors, action space, and task language.
# Fill the reader fields before collecting or fine-tuning any demonstrations.
from dataclasses import dataclass

@dataclass
class VLADatasetCard:
    robot: str
    cameras: list[str]
    action_space: str
    control_hz: int
    prompt_template: str
    success_metric: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

card = VLADatasetCard(
    robot="aloha_static",
    cameras=["wrist_rgb", "front_rgb"],
    action_space="7D end-effector delta plus gripper state",
    control_hz=10,
    prompt_template="pick up the {object} and place it on the {target}",
    success_metric="object center lies inside tray after release",
)
print(card)

Code Fragment 2: The VLADatasetCard object captures the practical fields that determine whether a VLA fine-tune is reproducible. The reader-fill values should be completed before model training begins.

Hint

For a pick-and-place task, use one wrist camera, one third-person camera, a 7D end-effector action, and a success metric based on object pose after release.

Step 2: Write prompt variants

Prompting a robot policy is not creative writing. It is interface design for goal, object, constraint, and stop condition.

# Prompt variants: test whether wording changes the intended task semantics.
# Keep the action goal stable while varying object names and constraints.
templates = [
    "pick up the {object} and place it on the {target}",
    "move the {object} to the {target} without touching the distractor",
    "grasp the {object}, lift it, then release it over the {target}",
]
for template in templates:
    print(template.format(object="red block", target="blue tray"))

Code Fragment 3: The templates list separates goal wording from object and target slots. This makes prompt sensitivity visible before it becomes a robot failure.

Hint

Keep one variable fixed at a time. If object and target both change, you cannot tell which phrase caused the behavior shift.

Step 3: Build a construct-matched evaluation panel

Evaluation episodes must be shared across policy variants. This step creates the same panel for all comparisons.

# Evaluation panel: all policies must run on these same scenarios and seeds.
# Add perturbations that test language, perception, and control separately.
import pandas as pd

panel = pd.DataFrame([
    {"episode": 1, "object": "red block", "target": "blue tray", "lighting": "normal", "seed": 11},
    {"episode": 2, "object": "red block", "target": "blue tray", "lighting": "dim", "seed": 12},
    {"episode": 3, "object": "red cube", "target": "blue tray", "lighting": "normal", "seed": 13},
])
print(panel)

Code Fragment 4: The panel dataframe defines shared scenarios for prompt, visual, and lighting perturbations. It prevents comparing one policy on easy episodes with another policy on hard episodes.

Hint

Add only one perturbation per row when diagnosing a failure. Combined perturbations are useful later, after isolated tests pass.

Step 4: Use the library shortcut

After the schema is clear, hand the repetitive data and training mechanics to the toolchain.

# LeRobot shortcut: inspect a dataset schema before choosing a policy class.
repo_id = "lerobot/aloha_static_coffee"
policy_options = ["act", "diffusion_policy", "openvla_adapter"]
selected = policy_options[0]
command = f"python -m lerobot.scripts.train configs/{selected}.yaml"
print({"repo_id": repo_id, "policy_options": policy_options, "initial_policy": selected, "command": command})

Code Fragment 5: The repo_id field marks the transition from planning to a maintained training command. LeRobot handles dataset indexing, transforms, batching, and checkpointing once the schema is valid.

Hint

Do not start a fine-tune until your converted dataset opens and one episode can be visualized from start to finish.

Expected Output

You should finish with a filled dataset card, three prompt variants, a three-episode evaluation panel, and a concrete LeRobot or SmolVLA command path. The artifact is a fine-tuning plan that another reader could review before compute is spent.

Stretch Goals

Add one safety constraint to the prompt template and one metric that detects violations.
Convert one real or simulated demonstration into the LeRobot dataset format.
Run a tiny baseline policy and compare it against a SmolVLA fine-tuning plan using the same panel.

Complete Solution

# Complete lab solution: dataset card, prompt variants, and evaluation panel.
# This is a planning artifact that should run before any expensive VLA fine-tune.
from dataclasses import dataclass
import pandas as pd

@dataclass
class VLADatasetCard:
    robot: str
    cameras: list[str]
    action_space: str
    control_hz: int
    prompt_template: str
    success_metric: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

card = VLADatasetCard(
    robot="ALOHA-style dual-arm tabletop robot",
    cameras=["front_rgb", "left_wrist_rgb", "right_wrist_rgb"],
    action_space="14D joint position targets plus two gripper commands",
    control_hz=20,
    prompt_template="pick up the {object} and place it on the {target}",
    success_metric="object center is inside target region after release",
)

templates = [
    "pick up the {object} and place it on the {target}",
    "move the {object} to the {target} without touching the distractor",
    "grasp the {object}, lift it, then release it over the {target}",
]
prompts = [template.format(object="red block", target="blue tray") for template in templates]

panel = pd.DataFrame([
    {"episode": 1, "object": "red block", "target": "blue tray", "lighting": "normal", "seed": 11},
    {"episode": 2, "object": "red block", "target": "blue tray", "lighting": "dim", "seed": 12},
    {"episode": 3, "object": "red cube", "target": "blue tray", "lighting": "normal", "seed": 13},
])

print(card)
print(prompts)
print(panel)

Code Fragment 6: The solution fills the VLADatasetCard, generates prompt variants, and builds the shared panel. These three artifacts are the minimum review package before running a VLA fine-tune.

Expected output: Prompting and conditioning embodied policies should leave a reproducible VLA evidence trace with checkpoint, action representation, robot interface, metric, and failure label.

Memory Hook

For prompting and conditioning embodied policies, the useful test is simple: could a teammate point to the log line, plot, or trace that proves the idea changed the agent's next action?

Self Check

Rewrite "tidy the table" as three VLA-ready prompts: one for picking, one for placing, and one with a safety constraint.

Research Frontier

The open frontier is compositional conditioning: combining language, goal images, memory, task graphs, and safety policies without confusing the action head. Gemini Robotics and GR00T-style systems point toward this direction, but reproducible open evaluation remains essential.

Key Takeaway

Prompting a VLA is interface design for physical behavior. The prompt should make the intended action distribution narrower, safer, and easier to evaluate.

Exercise 34.7

Create five prompts for one task: plain goal, synonym variant, attribute variant, constraint variant, and stop-condition variant. Predict which one is most likely to fail and why.

What's Next?

Section 34.8 closes the chapter with evaluation, limitations, and open problems.

Bibliography and Further Reading

Foundational Papers and Reports

Hugging Face. "LeRobot." GitHub.

LeRobot is the practical open-source toolkit used here for datasets, policy training, evaluation, and low-cost robot workflows. Engineers should start here before writing custom data loaders or training loops.

Tool

Hugging Face (2025). "SmolVLA: Efficient Vision-Language-Action Model trained on LeRobot Community Data." Hugging Face Blog.

SmolVLA is a compact open VLA designed to run on more accessible hardware and fine-tune on LeRobot datasets. It is the best fit for the chapter hands-on lab because it lowers the barrier to experimentation.

Tool

Kim et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv.

OpenVLA connects open VLM backbones to robot action generation and provides a practical codebase for fine-tuning. Practitioners should read it alongside the GitHub repository before adapting an open VLA to a new robot.

Paper

Google DeepMind (2025). "Gemini Robotics 1.5 brings AI agents into the physical world." Google DeepMind Blog.

Gemini Robotics 1.5 is described by Google DeepMind as a VLA model that maps visual information and instructions into motor commands. It is important for frontier context, but readers should distinguish official demonstrations from independently replicated results.

📝 Blog Post

Tools, Libraries, and Frontier Notes

Bjorck et al. (2025). "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." arXiv.

GR00T N1 frames humanoid control as a dual-system VLA architecture with reasoning and fast action generation. It prepares the transition from Chapter 34 into Chapter 35 and the later humanoid chapter.

Paper

Physical Intelligence (2025). "pi-zero point five: a Vision-Language-Action Model with Open-World Generalization." arXiv.

Pi-zero point five extends pi-zero through heterogeneous co-training for broader open-world generalization. It is useful for readers studying the frontier between task-specific robot policies and household-scale generalist behavior.

Paper