Part VII: Language, Vision, and Action

Part Overview

This part covers language-guided agents, vision-language models (VLMs), LLM planners, vision-language-action models (VLAs), and cross-embodiment foundation models. It connects formal ideas with the tools and labs needed to build working systems.

This part has five chapters (31 to 35). Each chapter includes theory, recipes, practical code, a library shortcut, and exercises.

Why This Part Matters

Language, Vision, and Action gives the reader a working layer of the embodied AI stack. Later chapters assume this layer when agents must perceive, plan, act, and recover from mistakes.

Chapter 31 develops language-guided embodied agents as part of the embodied AI stack.

  • 31.1 Why language matters in embodied AI
  • 31.2 Instructions, goals, constraints
  • 31.3 Grounding language in perception; referring expressions
  • 31.4 Object- and region-centric grounding
  • 31.5 Task planning from language; ambiguity and clarification
  • 31.6 Human-agent interaction

Figure VII is a compact map of the interface used throughout this part. Read it left to right, then check whether the surrounding prose names the same observation, action, and evidence contract.

[Figure VII: closed-loop interface with four stages, Vision, VLA Core, Action Head, and Controller. Observe, decide, act, measure, then feed failure evidence back into the next decision.]
Figure VII: A closed-loop map for Part VII Language, Vision, and Action. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.
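
The loop in Figure VII can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not an implementation from any chapter: the task is a hypothetical 1-D integer reach task, and the names `Evidence`, `vision`, `core`, `action_head`, `controller`, and `run_loop` are illustrative stand-ins for the four stages and the evidence record. Failure evidence is fed back in the simplest possible way: a failed step triggers another pass through the loop with a fresh observation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One record of what was observed, what was done, and whether it worked."""
    observation: int
    action: int
    success: bool


def vision(state: int) -> int:
    """Observe: toy sensor that reads the 1-D position exactly (assumption)."""
    return state


def core(observation: int, goal: int) -> int:
    """Decide: desired displacement toward the goal."""
    return goal - observation


def action_head(intent: int, max_step: int = 3) -> int:
    """Map the intent to a bounded action the controller accepts."""
    return max(-max_step, min(max_step, intent))


def controller(state: int, action: int) -> int:
    """Act: toy plant that applies the action exactly (assumption)."""
    return state + action


def run_loop(start: int, goal: int, max_steps: int = 20) -> List[Evidence]:
    """Observe, decide, act, measure; repeat while the evidence reports failure."""
    state = start
    log: List[Evidence] = []
    for _ in range(max_steps):
        obs = vision(state)
        action = action_head(core(obs, goal))
        state = controller(state, action)
        # Measure: record the outcome so the next pass can react to it.
        log.append(Evidence(obs, action, success=(state == goal)))
        if log[-1].success:
            break
    return log


if __name__ == "__main__":
    for record in run_loop(start=0, goal=10):
        print(record)
```

Even in this toy, the four items the caption asks for are explicit: the input (`vision`), the model boundary (`core`), the action interface (`action_head` and `controller`), and the evidence record (`Evidence`).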

Chapter 32 develops vision-language models for embodiment as part of the embodied AI stack.

  • 32.1 From image-text models to embodied perception
  • 32.2 CLIP, SigLIP, DINOv2 representations
  • 32.3 Vision-language encoders and open-vocabulary detection
  • 32.4 Visual question answering and scene description in environments
  • 32.5 Multimodal memory
  • 32.6 Limits of static VLMs in dynamic worlds

Chapter 33 develops LLMs as planners and controllers as part of the embodied AI stack.

  • 33.1 What LLMs can and cannot do in embodied tasks
  • 33.2 SayCan: affordance-grounded planning
  • 33.3 Code as Policies: LLMs that write robot code
  • 33.4 VoxPoser: composing 3D value maps
  • 33.5 ReKep: relational keypoint constraints
  • 33.6 Tool use, action APIs, plan verification, replanning
  • 33.7 Memory, state tracking, and hallucination in physical tasks
  • 33.8 Safe LLM-agent interfaces

Chapter 34 develops vision-language-action models as embodied policies, not captioners with robot arms.

  • 34.1 From VLMs to VLAs: the core idea
  • 34.2 The lineage: RT-1, RT-2, RT-X / Open X-Embodiment
  • 34.3 Open generalist policies: Octo, OpenVLA
  • 34.4 Diffusion/flow VLAs: RDT-1B, π0, π0-FAST, π0.5
  • 34.5 Action tokenization vs. continuous heads; the FAST tokenizer
  • 34.6 Co-training with web data for semantic generalization
  • 34.7 Prompting and conditioning embodied policies
  • 34.8 Evaluating VLA behavior; limitations and open problems
  • 34.9 Action representations in VLA systems

Chapter 35 develops robot foundation models and cross-embodiment learning as part of the embodied AI stack.

  • 35.1 Why foundation models matter for robotics
  • 35.2 Cross-embodiment training and transfer
  • 35.3 Dual-system architectures: GR00T N1.5, Helix, Gemini Robotics (with Frontier Watch caveats)
  • 35.4 Large behavior models and rigorous evaluation
  • 35.5 Adapting to new robots; prompting and conditioning
  • 35.6 Data scale, compute, and the open-vs-closed divide
  • 35.7 Limitations and open questions
  • 35.8 Serving, fine-tuning, and evaluating open robot foundation models

What's Next?

After this part, Part VIII: World Models and Model-Based Embodied AI extends the stack.