Part Overview
This part covers language-guided agents, VLMs, LLM planners, VLAs, and cross-embodiment foundation models. It connects formal ideas with the tools and labs needed to build working systems.
Chapters: 5. Each chapter includes theory, recipes, practical code, a library shortcut, and exercises.
Language, Vision, and Action gives the reader a working layer of the embodied AI stack. Later chapters assume this layer when agents must perceive, plan, act, and recover from mistakes.
Chapter 31 develops language-guided embodied agents as part of the embodied AI stack.
- 31.1 Why language matters in embodied AI
- 31.2 Instructions, goals, constraints
- 31.3 Grounding language in perception; referring expressions
- 31.4 Object- and region-centric grounding
- 31.5 Task planning from language; ambiguity and clarification
- 31.6 Human-agent interaction
Figure VII gives this page a compact map of the interface. Read it left to right, then check whether the surrounding prose names the same observation, action, and evidence contract.
Chapter 32 develops vision-language models for embodiment as part of the embodied AI stack.
- 32.1 From image-text models to embodied perception
- 32.2 CLIP, SigLIP, DINOv2 representations
- 32.3 Vision-language encoders and open-vocabulary detection
- 32.4 Visual question answering and scene description in environments
- 32.5 Multimodal memory
- 32.6 Limits of static VLMs in dynamic worlds
Chapter 33 develops LLMs as planners and controllers as part of the embodied AI stack.
- 33.1 What LLMs can and cannot do in embodied tasks
- 33.2 SayCan: affordance-grounded planning
- 33.3 Code as Policies: LLMs that write robot code
- 33.4 VoxPoser: composing 3D value maps
- 33.5 ReKep: relational keypoint constraints
- 33.6 Tool use, action APIs, plan verification, replanning
- 33.7 Memory, state tracking, and hallucination in physical tasks
- 33.8 Safe LLM-agent interfaces
Chapter 34 develops vision-language-action models as embodied policies, not captioners with robot arms.
- 34.1 From VLMs to VLAs: the core idea
- 34.2 The lineage: RT-1, RT-2, RT-X / Open X-Embodiment
- 34.3 Open generalist policies: Octo, OpenVLA
- 34.4 Diffusion/flow VLAs: RDT-1B, π0, π0-FAST, π0.5
- 34.5 Action tokenization vs. continuous heads; the FAST tokenizer
- 34.6 Co-training with web data for semantic generalization
- 34.7 Prompting and conditioning embodied policies
- 34.8 Evaluating VLA behavior; limitations and open problems
- 34.9 Action representations in VLA systems
Chapter 35 develops robot foundation models and cross-embodiment learning as part of the embodied AI stack.
- 35.1 Why foundation models matter for robotics
- 35.2 Cross-embodiment training and transfer
- 35.3 Dual-system architectures: GR00T N1.5, Helix, Gemini Robotics (with Frontier Watch caveats)
- 35.4 Large behavior models and rigorous evaluation
- 35.5 Adapting to new robots; prompting and conditioning
- 35.6 Data scale, compute, and the open-vs-closed divide
- 35.7 Limitations and open questions
- 35.8 Serving, fine-tuning, and evaluating open robot foundation models
What's Next?
After this part, Part VIII: World Models and Model-Based Embodied AI extends the stack.