Section 1.8: Map of the book | Building Embodied AI: From Perception to Autonomous Action

"A good map does not tell you everything. It tells you where the next failure belongs."
A Careful Control Loop

Big Picture

This book is organized around the closed loop introduced in Section 1.2: an agent senses, decides, acts, and inherits the consequence of its action as the next observation. The twelve parts trace that loop from the inside out and then forward in time. Parts I-III build the substrate (the formal interface, the mathematics of bodies and control, and the simulation stack on which everything is measured). Parts IV-V supply the two ways a policy is learned (reward-driven and demonstration-driven). Parts VI-VIII deepen the three contracts the loop must satisfy to scale (perception that recovers state, language that supplies intent, and world models that forecast consequence). Part IX recombines all of it into physical skills, Part X opens the loop to other agents and to people, and Parts XI-XII close it with evaluation, safety, deployment, and the open frontier. Read linearly and the dependencies resolve in order; read by need and the part titles below tell you where each contract lives.

Figure 1.8. The book expands the closed loop into twelve parts: a substrate (interface, mathematics of bodies, simulation), two learning routes (reinforcement and imitation), three scaling contracts (perception, language, world models), then recombination into skills, multi-agent and human settings, and finally evaluation, safety, deployment, and frontier.

The twelve parts and how they depend on each other

Part I, Foundations of Embodied AI (Chapters 1-3) is the part you are reading. It fixes the vocabulary: the static-versus-embodied distinction, the agent-environment interface as a controlled Markov process, and the space of system architectures from monolithic policies to modular perception-planning-control stacks. Everything later assumes this framing.

Part II, Mathematical, Robotics, and Control Foundations (Chapters 4-8) supplies the mathematics of a body that moves: coordinate frames and rigid transforms, forward and inverse kinematics, rigid-body dynamics, control for practitioners, and the sensor and state-estimation chain that turns raw measurements into usable state. This part is the prerequisite for any chapter where a real or simulated body executes a command.

Part III, Simulation, Tooling, and the Modern Stack (Chapters 9-13) is where measurement becomes repeatable: why simulation is central, the Gymnasium and PettingZoo environment contracts, the physics simulators (MuJoCo, MJX, Isaac Lab, Genesis), benchmark and task suites, and domain randomization for synthetic data. It is depended on by almost every later part, because reinforcement learning, manipulation, locomotion, driving, and safety all run their experiments here first.

Part IV, Reinforcement Learning for Embodied Agents (Chapters 14-20) develops the reward-driven route to a policy: the RL refresher, policy-gradient methods and PPO, value-based and off-policy methods, massively parallel GPU RL, reward and goal design, exploration in embodied worlds, and sim-to-real transfer. It assumes Parts II and III.

Part V, Imitation Learning, Demonstrations, and Robot Data (Chapters 21-26) develops the demonstration-driven route: imitation foundations and the compounding-error problem from Section 1.1, action chunking with diffusion and flow-matching policies, teleoperation and data collection, robot datasets and data-scaling laws, offline RL, and skills with hierarchy and task decomposition. Parts IV and V are siblings, two answers to "where does the policy come from," and later parts draw on both.

Part VI, Embodied Perception (Chapters 27-30) deepens the contract that recovers state from observation: visual perception for action, 3D perception and neural scene representations, localization and mapping, and navigation and path planning. This is the loop's sensing edge made rigorous.

Part VII, Language, Vision, and Action (Chapters 31-35) supplies intent: language-guided agents, vision-language models for embodiment, LLMs as planners and controllers, vision-language-action models, and robot foundation models with cross-embodiment learning. It depends on Part VI for grounded perception and on Parts IV-V for the action interface it drives.

Part VIII, World Models and Model-Based Embodied AI (Chapters 36-41) supplies forecast: predicting the future, model-based RL and MPC, latent world models, generative and video world models, predictive and self-supervised representations, and diffusion-based generative planning. It closes the consequence edge of the loop, letting an agent reason about what its action will produce before committing.

Part IX, Manipulation, Locomotion, and Embodied Skills (Chapters 42-48) recombines the contracts into physical capability: manipulation, grasping and dexterity, tactile and visuo-tactile learning, locomotion and mobility, humanoids and whole-body control, aerial robots, and autonomous driving as embodied AI. These chapters integrate perception, learning, control, and sometimes language and world models into one working system.

Part X, Multi-Agent and Human-Centered Embodiment (Chapters 49-51) opens the loop beyond a single agent: multi-agent embodied AI, human-robot interaction, and open-world and lifelong embodiment. The state now includes other decision-makers and people.

Part XI, Evaluation, Safety, Robustness, and Deployment (Chapters 52-55) is the part that should not wait until the end: evaluating embodied systems, robustness and uncertainty, safety, and deployment architecture. Every earlier part needs the same-config evidence standards this part defines, which is why a builder consults it early and often.

Part XII, Frontiers, Capstones, and Course Design (Chapters 56-60) ends with embodied agents with memory, continual and lifelong learning, frontier and open problems (Chapter 58), capstone projects, and a guide to teaching with the book. It is where the loop is pushed past current practice.

Read the loop, not the index

The order of the parts is not arbitrary taxonomy: it follows the loop. Substrate first (Parts I-III), then the two ways a policy is learned (IV-V), then the three contracts that let it scale (VI perception, VII intent, VIII forecast), then integration (IX), then opening the loop to others (X), then closing it with trust (XI) and the frontier (XII). When a system breaks, the part that owns the broken edge of the loop is where to look.

Two reading paths

Few readers go front to back on first contact. The two paths below are concrete orderings for the two audiences this book serves; both assume Part I as shared ground.

Choosing a path through the parts

Practitioner building a system. Read for the shortest route to a policy on hardware. Part I (Chapters 1-3) for framing, then Part II (4-8) for the body and state, then Part III (9-13) for the simulation and benchmark stack. Pick one learning route to start: Part V (21-26) if you have demonstrations or a teleoperation rig, Part IV (14-20) if you have a reward and a fast simulator. Add Part VI (27-30) for perception and, if instructions matter, Part VII (31-35) for language and VLA control. Then jump straight to your task in Part IX (manipulation 42-44, locomotion 45-46, driving 48). Consult Part XI (52-55) from the start, not at the end: define evaluation, robustness, and safety before the first deployment, and treat Chapter 55 as the deployment-architecture checklist.

Researcher. Read for the contract boundaries and open problems. Part I (1-3) and Part II (4-8) for shared formalism, then the methodological cores: Part IV (14-20) and Part V (21-26) for the learning theory, Part VIII (36-41) for world models, and Part VI-VII (27-35) for the perception and language contracts. Track the frontier chapters specifically: vision-language-action and robot foundation models in Chapters 34-35, the world-model frontier across 36-41, embodied memory and continual learning in 56-57, and the consolidated open-problems chapter 58. Pair every method chapter with Part XI (52-55) so claims are stated against same-config evidence, and use Chapter 60 if you are designing a course around the material.

The appendices and when to consult them

Nine appendices carry the prerequisite and reference material that would otherwise interrupt the chapters. Reach for them as follows. A. Linear Algebra and 3D Geometry Refresher and B. Probability, Estimation, and Optimization Refresher are the math refreshers; consult A before Part II (frames and transforms) and B before Parts IV-V and VIII (estimation and optimization). C. The Embodied AI Toolbox is the catalog of maintained libraries the book builds on. D. PyTorch and JAX for Embodied AI covers the two frameworks used throughout, including the JAX path behind GPU-parallel RL in Part IV. E. Compute Recipes gives concrete configurations for training runs and parallel simulation. F. Datasets and Benchmarks Catalog is the reference for the robot data and task suites named in Parts III, V, and IX. G. Reproducibility and Experiment Hygiene, H. Notation and Glossary, and I. Citing the Frontier support the evidence discipline that Part XI demands across the whole book.

Key Takeaway

The twelve parts are the closed loop expanded and laid flat: substrate, then the two learning routes, then the perception, language, and world-model contracts, then integration into skills, multi-agent and human settings, and finally evaluation, safety, deployment, and the frontier. Read linearly for the dependencies; read by need by matching the broken edge of the loop to the part that owns it, and lean on the appendices for the math and tooling the chapters assume.

Exercise 1.8.1

For a system you intend to build, write the shortest part ordering that gets you to a working policy, then mark which two earlier parts you would have to backtrack to if perception or control turned out to be the weak link. Add the one appendix you would consult before Part II and the one before Part IV.

What's Next?

Chapter 2 begins the substrate by formalizing the agent-environment interface as a controlled Markov process.

Bibliography & Further Reading

Foundational References For This Section

Sutton, R. S., and Barto, A. G.. "Reinforcement Learning: An Introduction." (2018). http://incompleteideas.net/book/the-book-2nd.html

The durable reference for interaction, return, policies, and episode-level evaluation.

Brohan, A. et al.. "RT-1: Robotics Transformer for real-world control at scale." (2022). https://arxiv.org/abs/2212.06817

A useful anchor for large-scale robot policy learning from real interaction data.

Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." (2023). https://arxiv.org/abs/2310.08864

Shows why cross-embodiment data matters for the Physical AI framing.