Part VII: Language, Vision, and Action | Building Embodied AI: From Perception to Autonomous Action

Part Overview

This part covers language-guided agents, VLMs, LLM planners, VLAs, and cross-embodiment foundation models. It connects formal ideas with the tools and labs needed to build working systems.

This part has five chapters (31 to 35). Each chapter includes theory, recipes, practical code, a library shortcut, and exercises.

Why This Part Matters

This part supplies the layer of the embodied AI stack that connects language and vision to action. Later chapters assume this layer when agents must perceive, plan, act, and recover from mistakes.

Chapter 31 Language-Guided Embodied Agents

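This chapter covers how embodied agents interpret instructions, goals, and constraints, ground language in perception, plan tasks from language, and ask for clarification when a request is ambiguous.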

31.1 Why language matters in embodied AI
31.2 Instructions, goals, constraints
31.3 Grounding language in perception; referring expressions
31.4 Object- and region-centric grounding
31.5 Task planning from language; ambiguity and clarification
31.6 Human-agent interaction

Figure VII is a compact map of the interface covered in this part. Read it left to right, then check whether the surrounding prose names the same observation, action, and evidence contract.

Figure VII: A closed-loop map for Part VII Language, Vision, and Action. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.
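
The contract in Figure VII can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the book: `Observation`, `Action`, `EvidenceRecord`, `ToyEnv`, `toy_policy`, and `run_closed_loop` are all hypothetical names, and the rule-based policy stands in for whatever VLM, LLM planner, or VLA sits inside the model boundary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Observation:
    """Input: what the agent currently sees, plus the instruction."""
    visible_objects: Tuple[str, ...]
    instruction: str


@dataclass(frozen=True)
class Action:
    """Action interface: the only thing the model may emit."""
    name: str
    target: str = ""


@dataclass
class EvidenceRecord:
    """Evidence record: every (observation, action) pair, in order."""
    steps: List[Tuple[Observation, Action]] = field(default_factory=list)

    def log(self, obs: Observation, act: Action) -> None:
        self.steps.append((obs, act))


# Model boundary: anything that maps an Observation to an Action.
Policy = Callable[[Observation], Action]


@dataclass
class ToyEnv:
    """Hypothetical environment: a table holding named objects."""
    objects: List[str]

    def observe(self, instruction: str) -> Observation:
        return Observation(tuple(self.objects), instruction)

    def apply(self, action: Action) -> None:
        # Picking an object removes it from the table.
        if action.name == "pick" and action.target in self.objects:
            self.objects.remove(action.target)


def toy_policy(obs: Observation) -> Action:
    """Stand-in model: pick the object named by the instruction's last word."""
    target = obs.instruction.split()[-1]
    if target in obs.visible_objects:
        return Action("pick", target)
    return Action("done")


def run_closed_loop(policy: Policy, env: ToyEnv, instruction: str,
                    max_steps: int = 5) -> EvidenceRecord:
    """Observe, act, log, and let the action change the next observation."""
    record = EvidenceRecord()
    for _ in range(max_steps):
        obs = env.observe(instruction)
        act = policy(obs)
        record.log(obs, act)
        if act.name == "done":
            break
        env.apply(act)
    return record


env = ToyEnv(["cup", "block"])
record = run_closed_loop(toy_policy, env, "pick up the cup")
for obs, act in record.steps:
    print(obs.visible_objects, "->", act.name, act.target)
```

The loop is closed because `env.apply` changes what the next `observe` call returns, so the policy's second decision depends on the effect of its first. Keeping the evidence record separate from the policy means the log can be inspected without trusting the model's own account of what it did.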

Chapter 32 Vision-Language Models for Embodiment

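This chapter moves from image-text models to embodied perception: pretrained representations (CLIP, SigLIP, DINOv2), open-vocabulary detection, visual question answering, multimodal memory, and the limits of static VLMs in dynamic worlds.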

32.1 From image-text models to embodied perception
32.2 CLIP, SigLIP, DINOv2 representations
32.3 Vision-language encoders and open-vocabulary detection
32.4 Visual question answering and scene description in environments
32.5 Multimodal memory
32.6 Limits of static VLMs in dynamic worlds

Chapter 33 LLMs as Planners and Controllers

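This chapter examines what LLMs can and cannot do as planners and controllers, surveys SayCan, Code as Policies, VoxPoser, and ReKep, and then turns to tool use, plan verification, replanning, memory, hallucination, and safe interfaces.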

33.1 What LLMs can and cannot do in embodied tasks
33.2 SayCan: affordance-grounded planning
33.3 Code as Policies: LLMs that write robot code
33.4 VoxPoser: composing 3D value maps
33.5 ReKep: relational keypoint constraints
33.6 Tool use, action APIs, plan verification, replanning
33.7 Memory, state tracking, and hallucination in physical tasks
33.8 Safe LLM-agent interfaces

Chapter 34 Vision-Language-Action Models

This chapter develops vision-language-action models as embodied policies, not captioners with robot arms.

34.1 From VLMs to VLAs: the core idea
34.2 The lineage: RT-1, RT-2, RT-X / Open X-Embodiment
34.3 Open generalist policies: Octo, OpenVLA
34.4 Diffusion/flow VLAs: RDT-1B, π0, π0-FAST, π0.5
34.5 Action tokenization vs. continuous heads; the FAST tokenizer
34.6 Co-training with web data for semantic generalization
34.7 Prompting and conditioning embodied policies
34.8 Evaluating VLA behavior; limitations and open problems
34.9 Action representations in VLA systems

Chapter 35 Robot Foundation Models and Cross-Embodiment Learning

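This chapter covers robot foundation models and cross-embodiment learning: training and transfer across robots, dual-system architectures, rigorous evaluation, adaptation to new robots, and serving and fine-tuning open models.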

35.1 Why foundation models matter for robotics
35.2 Cross-embodiment training and transfer
35.3 Dual-system architectures: GR00T N1.5, Helix, Gemini Robotics (with Frontier Watch caveats)
35.4 Large behavior models and rigorous evaluation
35.5 Adapting to new robots; prompting and conditioning
35.6 Data scale, compute, and the open-vs-closed divide
35.7 Limitations and open questions
35.8 Serving, fine-tuning, and evaluating open robot foundation models

What's Next?

After this part, Part VIII: World Models and Model-Based Embodied AI extends the stack.