What This Book Covers | Building Embodied AI: From Perception to Autonomous Action

Big Picture

This book covers embodied AI as a field, not a single technique. Embodied intelligence begins when an agent's computation is inseparable from its body: the sensors determine what the world looks like, the actuators determine what can be changed, and the physics determine what the consequences will be. That inseparability is the subject. The world in which it plays out may be a factory cell, a public road, a cluttered kitchen, an underwater pipeline, or a GPU-resident simulator. The scope is wide on purpose, because the field is wide; the spine that holds it together is the closed perception-action loop, treated as the invariant while morphology, sensing, and time constants vary.

The Scope of the Field

"Embodied" is not a synonym for "humanoid robot," and embodied AI is not a narrow contrast between prediction and interaction. The field spans the full embodiment spectrum: fixed manipulators on an assembly line; wheeled and tracked mobile bases; autonomous road vehicles; aerial and underwater vehicles; legged robots, including quadrupeds and bipedal humanoids; soft and continuum robots; wearables, prosthetics, and exoskeletons that share a body with a person; micro-robots and swarms; and purely simulated agents that may never touch hardware. These bodies differ in actuation, sensing, time constant, and the cost of a mistake, yet every one of them closes the same sense, decide, act, observe loop.

What makes the field coherent is a disciplinary confluence. Embodied AI inherits feedback and stability from control theory; geometry, kinematics, and actuation from robotics and mechatronics; learning through interaction from reinforcement learning; scene understanding from computer vision; instruction following and planning from language models; and, from embodied cognition in cognitive science, the idea that a body shapes cognition. The lineage runs through Wiener's cybernetics, Brooks' behavior-based critique of pure sense-plan-act, Moravec's paradox (sensorimotor skill is harder to automate than abstract reasoning), and morphological computation (the body itself carries part of the control). No single parent field owns the closed loop; embodied AI is the seam between them.

The Invariant

Across every chapter, the one thing that does not change is the closed loop: the agent's own actions generate its future observations, the world enforces physics and time, mistakes change the state rather than ending an example, and competence is a property of behavior over a horizon rather than accuracy on a fixed test set. The book organizes the spectrum around that invariant and treats morphology as a parameter.
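As a rough illustration of the invariant, the sketch below runs one sense, decide, act loop over two hypothetical bodies that differ only in mass. Everything in it is an assumption made for illustration and not a method taken from later chapters: the `Body` dataclass, the 1D point-mass physics, the perfect sensor, and the PD gains `kp` and `kd`. The state is carried from step to step, so an early mistake shapes every later observation, and competence is scored as error accumulated over the horizon rather than as per-step accuracy.

```python
from dataclasses import dataclass


@dataclass
class Body:
    """Hypothetical morphology parameters for a 1D point mass."""
    mass: float       # kg
    max_force: float  # N, actuator saturation limit
    dt: float         # s, control period


def sense(state):
    """Return an observation of the state (perfect sensing is assumed)."""
    position, velocity = state
    return position, velocity


def decide(observation, target, kp=4.0, kd=3.0):
    """PD controller: the gains are illustrative, not tuned values."""
    position, velocity = observation
    return kp * (target - position) - kd * velocity


def act(state, force, body):
    """Apply the clipped force and advance the physics one step."""
    position, velocity = state
    force = max(-body.max_force, min(body.max_force, force))
    velocity += (force / body.mass) * body.dt  # semi-implicit Euler
    position += velocity * body.dt
    return position, velocity


def run_episode(body, target=1.0, steps=200):
    """Closed loop: each action determines the next observation."""
    state = (0.0, 0.0)
    total_error = 0.0
    for _ in range(steps):
        observation = sense(state)
        force = decide(observation, target)
        state = act(state, force, body)  # errors persist in the state
        total_error += abs(target - state[0]) * body.dt
    return state, total_error


if __name__ == "__main__":
    # Same loop, same controller, two morphologies.
    for body in (Body(1.0, 10.0, 0.05), Body(5.0, 10.0, 0.05)):
        final_state, error = run_episode(body)
        print(body, final_state, error)
```

The loop body is identical for both bodies; only the `Body` parameters change, which is the sense in which morphology is a parameter here. With these assumed gains, the heavier body should accumulate more error over the same horizon, because the same controller is less well damped on it.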

What the Twelve Parts Cover

The book is organized into twelve parts.

Part I, Foundations of Embodied AI. The structural break from static prediction to embodied interaction, the agent-environment interface, and embodied system architectures.
Part II, Mathematical, Robotics, and Control Foundations. Spatial representation and coordinate frames, kinematics, dynamics and simulation math, control for AI practitioners, and sensors, perception hardware, and state estimation.
Part III, Simulation, Tooling, and the Modern Stack. Why simulation is central, environments with Gymnasium and PettingZoo, the physics simulators (MuJoCo, MJX, Isaac Lab, Genesis), benchmarks and task suites, and domain randomization and synthetic data.
Part IV, Reinforcement Learning for Embodied Agents. An RL refresher, policy gradients and PPO, value-based and off-policy methods, massively parallel GPU RL, reward design, exploration in embodied worlds, and sim-to-real transfer.
Part V, Imitation Learning, Demonstrations, and Robot Data. Imitation learning foundations, action chunking, diffusion policy and flow matching, teleoperation and data collection, robot datasets and data scaling laws, offline RL, and skills and task decomposition.
Part VI, Embodied Perception. Visual perception for action, 3D perception and neural scene representations, localization and mapping, and navigation and path planning.
Part VII, Language, Vision, and Action. Language-guided agents, vision-language models for embodiment, LLMs as planners and controllers, vision-language-action models, and robot foundation models and cross-embodiment learning.
Part VIII, World Models and Model-Based Embodied AI. Predicting the future, model-based RL and MPC, latent world models, generative and video world models, predictive self-supervised representations, and diffusion and generative planning.
Part IX, Manipulation, Locomotion, and Embodied Skills. Robotic manipulation, grasping and dexterous manipulation, tactile and visuo-tactile learning, locomotion and mobility, humanoid whole-body control, drones and aerial embodied AI, and autonomous driving as embodied AI.
Part X, Multi-Agent and Human-Centered Embodiment. Multi-agent embodied AI, human-robot interaction, and open-world and lifelong embodiment.
Part XI, Evaluation, Safety, Robustness, and Deployment. Evaluating embodied systems, robustness and uncertainty, safety, and deployment architecture.
Part XII, Frontiers, Capstones, and Course Design. Embodied agents with memory, continual and lifelong learning, frontier and open problems, capstone projects, and teaching with the book.

Nine appendices carry the prerequisite refreshers (linear algebra and 3D geometry; probability, estimation, and optimization), an embodied AI toolbox, PyTorch and JAX usage, compute recipes, a datasets and benchmarks catalog, reproducibility hygiene, notation and glossary, and guidance on citing the frontier.

What This Book Does Not Cover

This is not a first course in programming, machine learning, or deep learning, and it does not re-teach those prerequisites inline; the appendices refresh the specific math the chapters lean on, and nothing more. It is not a mechanical engineering, electronics, or hardware-design text: actuator design, PCB layout, and mechanism fabrication are out of scope, and hardware is treated only where it constrains the learning and control problem. It is not a manual for one robot platform or one vendor SDK, and it is not a general AI survey; topics with no path to the perception-action loop (pure language modeling, recommendation, tabular ML) are left to other books in the series.

Current as of 2026

The book is written to the post-2023 state of the field. It covers vision-language-action models and robot foundation models, world models (including generative and video world models) used for planning and prediction, GPU-parallel simulation that trains policies across many environments at once, cross-embodiment data and transfer (the Open X-Embodiment line of work), and the maintained open-source stack practitioners actually use, including the LeRobot ecosystem for data, policies, and teleoperation. Version caveats and deprecated tools are marked where they matter, so the material stays usable past the print date.