Chapter 3: Embodied System Architectures | Building Embodied AI: From Perception to Autonomous Action

"An agent becomes interesting at the exact moment the world refuses to be a dataset."
A Patient Embodied AI Agent

Big Picture

The architecture of an embodied system matters because embodied intelligence is a closed loop. The agent must turn partial observations into useful state, choose actions under uncertainty, and learn from the consequences in a physical or simulated world.

Remember This Chapter

The core move is to tie the way perception, estimation, planning, learning, and control are arranged into a system to artifacts that can be tested. A static model can be accurate and still be useless if it cannot support timely, safe, and recoverable behavior.

Chapter Overview

Chapter 3 develops Embodied System Architectures as a working piece of the embodied AI stack. The chapter starts with the role this topic plays in the sense, represent, predict, decide, act, observe, and learn loop, then turns that role into a concrete implementation pattern.

The practical thread uses Hugging Face Transformers, open VLMs, OpenVLA, openpi, LeRobot, and tool-calling planners where appropriate, while the theory thread keeps the mechanism visible. The reader should leave with both a mental model and a build path.
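The loop named in this overview can be made concrete with a toy example. The sketch below is an illustration only, not the chapter's reference implementation: it assumes a hypothetical 1-D world in which the agent sees a noisy position reading, smooths it into an internal estimate, and applies a proportional action toward a target. The function name `run_closed_loop` and all parameter values are assumptions chosen for readability, and the predict and learn stages are deliberately left out.

```python
import random


def run_closed_loop(steps=50, target=5.0, noise_std=0.1, gain=0.5, alpha=0.5, seed=0):
    """Drive a 1-D point toward `target` using only noisy position readings.

    Hypothetical toy world: every name and constant here is illustrative.
    """
    rng = random.Random(seed)
    position = 0.0   # true world state, never read directly by the agent
    estimate = 0.0   # the agent's internal state
    trace = []
    for t in range(steps):
        observation = position + rng.gauss(0.0, noise_std)       # sense
        estimate = (1 - alpha) * estimate + alpha * observation  # represent (smoothing)
        action = gain * (target - estimate)                      # decide
        position += action                                       # act
        trace.append({"t": t, "obs": observation, "est": estimate,
                      "action": action, "pos": position})        # record for later review
    return trace


trace = run_closed_loop()
print(f"final position: {trace[-1]['pos']:.2f}")
```

Even at this size the trace shows the point of the loop: the agent never touches `position` directly, so every decision depends on how well `estimate` tracks the world.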

Prerequisites

Readers should be comfortable with Python, tensors, and the perception-action loop. When the chapter uses geometry, control, or probability, the relevant appendices provide a compact refresher.

Chapter Roadmap

3.1 The canonical stack: sense, perceive, estimate, predict, plan, control, act
    Maps the embodied stack from sensors through action and feedback.
3.2 Classical modular robotics pipeline
    Explains perception, mapping, planning, and control as separately engineered modules with explicit contracts.
3.3 End-to-end learned policy pipeline
    Shows how learned policies collapse parts of the stack and shift the burden to data, architecture, and evaluation.
3.4 Hybrid and hierarchical architectures
    Combines learned skills, symbolic structure, planners, and controllers across temporal scales.
3.5 Reactive vs. deliberative agents
    Contrasts fast stimulus-response behavior with slower planning and search.
3.6 Dual-system (System 1 / System 2) designs and where they come from
    Frames fast learned policies and slower reasoning/planning layers as cooperating systems.
3.7 Where LLMs, VLMs, and VLAs sit in the stack
    Locates language, vision-language, and vision-language-action models as planners, perception modules, policy components, or tool users.
3.8 Failure modes of each architecture
    Catalogs where modular, learned, hybrid, reactive, deliberative, and foundation-model architectures break.

Tooling Note

This chapter uses the right-tool principle. Build the mechanism once, then reach for maintained tools such as Hugging Face Transformers, open VLMs, OpenVLA, openpi, LeRobot, and tool-calling planners when the task moves from learning exercise to working system.

Hands-On Lab: Build the Chapter Evidence Artifact

Duration: ~75 minutes
Difficulty: Intermediate

Objective

Turn Chapter 3's main idea into a reproducible evidence artifact with a hand-built baseline, a maintained-tool shortcut, one perturbation, and a short postmortem.

What You'll Practice

Write an interface contract for how perception, estimation, planning, learning, and control are arranged into a system
Build a minimal baseline before using a library shortcut
Record one same-config comparison artifact
Explain the most informative failure mode
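One way to read "interface contract" in the first practice item is as a set of typed messages plus the method signatures each module must honor. The sketch below is a minimal illustration under that assumption. The message types (`Observation`, `StateEstimate`, `Command`), the three protocols, and the toy implementations are all hypothetical names invented for this example. Perception and learning are omitted to keep it short.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical message types: the fields are illustrative, not a standard.
@dataclass(frozen=True)
class Observation:
    timestamp_s: float
    range_m: float          # a single 1-D range reading


@dataclass(frozen=True)
class StateEstimate:
    timestamp_s: float
    distance_m: float


@dataclass(frozen=True)
class Command:
    velocity_mps: float


# The contract: each module promises one method with a fixed signature.
class Estimator(Protocol):
    def update(self, obs: Observation) -> StateEstimate: ...


class Planner(Protocol):
    def plan(self, state: StateEstimate) -> float: ...   # returns a goal distance in meters


class Controller(Protocol):
    def act(self, state: StateEstimate, goal_m: float) -> Command: ...


class PassThroughEstimator:
    def update(self, obs: Observation) -> StateEstimate:
        return StateEstimate(obs.timestamp_s, obs.range_m)


class FixedGoalPlanner:
    def __init__(self, goal_m: float) -> None:
        self.goal_m = goal_m

    def plan(self, state: StateEstimate) -> float:
        return self.goal_m


class ProportionalController:
    def __init__(self, gain: float, max_speed_mps: float) -> None:
        self.gain = gain
        self.max_speed_mps = max_speed_mps

    def act(self, state: StateEstimate, goal_m: float) -> Command:
        velocity = self.gain * (state.distance_m - goal_m)
        # Clamp so the safety limit is part of the contract, not an afterthought.
        velocity = max(-self.max_speed_mps, min(self.max_speed_mps, velocity))
        return Command(velocity)


def step(obs: Observation, estimator: Estimator, planner: Planner,
         controller: Controller) -> Command:
    """One pass through the arrangement: estimate, then plan, then control."""
    state = estimator.update(obs)
    return controller.act(state, planner.plan(state))


command = step(Observation(0.0, 2.0), PassThroughEstimator(),
               FixedGoalPlanner(0.5), ProportionalController(1.0, 0.5))
print(command)
```

Because `step` depends only on the protocols, any module can be swapped for a learned or library-backed version without touching the others, which is the property the lab comparison relies on.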

Setup

pip install numpy pandas

Code Fragment 3.L1 installs the small packages used for the chapter evidence artifact.

Steps

Define observations, actions, success, failure, and safety fields.
Implement the smallest baseline that produces a trace.
Run the equivalent maintained-tool version with the same schema.
Add one perturbation that targets the chapter's main failure mode.
Save metrics, configuration, seed, and notes in one folder.
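The first, second, and last steps can be sketched together. The example below is a hedged illustration, not the lab's official solution: it assumes a hypothetical 1-D reaching task, a record type `StepRecord` carrying the observation, action, success, failure, and safety fields, and an artifact layout of four files. It uses only the standard library so that it runs without the lab packages. The file names and config keys are assumptions.

```python
import json
import random
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class StepRecord:
    t: int
    observation: float
    action: float
    success: bool           # within tolerance of the target after this step
    failure: bool           # step budget exhausted without success
    safety_violation: bool  # requested action exceeded the limit before clamping


def run_baseline(config, seed):
    """Smallest baseline that produces a trace (hypothetical 1-D reaching task)."""
    rng = random.Random(seed)
    position, records = 0.0, []
    for t in range(config["max_steps"]):
        observation = position + rng.gauss(0.0, config["noise_std"])
        action = config["gain"] * (config["target"] - observation)
        unsafe = abs(action) > config["max_action"]
        action = max(-config["max_action"], min(config["max_action"], action))
        position += action
        success = abs(position - config["target"]) < config["tolerance"]
        failure = (t == config["max_steps"] - 1) and not success
        records.append(StepRecord(t, observation, action, success, failure, unsafe))
        if success:
            break
    return records


def save_artifact(folder, config, seed, records, notes):
    """Write configuration, seed, trace, metrics, and notes into one folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "config.json").write_text(json.dumps({"seed": seed, **config}, indent=2))
    (folder / "trace.jsonl").write_text("\n".join(json.dumps(asdict(r)) for r in records))
    metrics = {"steps": len(records),
               "success": records[-1].success,
               "safety_violations": sum(r.safety_violation for r in records)}
    (folder / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (folder / "notes.md").write_text(notes)
    return metrics


config = {"target": 5.0, "gain": 0.5, "noise_std": 0.05,
          "max_action": 1.0, "tolerance": 0.1, "max_steps": 50}
with tempfile.TemporaryDirectory() as tmp:
    records = run_baseline(config, seed=0)
    metrics = save_artifact(Path(tmp) / "ch3_artifact", config, 0, records, "Baseline run.")
    saved = sorted(p.name for p in (Path(tmp) / "ch3_artifact").iterdir())
print(metrics, saved)
```

Keeping the schema in one dataclass means the maintained-tool version in the third step can be held to the same fields, so the two traces stay directly comparable.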

Expected Output

The finished lab produces one table and one short postmortem explaining what changed between the baseline and the library shortcut.

Stretch Goals

Add a second seed set and verify that compared metrics are co-computed in one pass.
Add a one-page data card for the failure cases.
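For the first stretch goal, one reading of "co-computed in one pass" is that every compared variant is evaluated inside the same loop over the same seeds, so no metric can silently come from a different configuration. The sketch below illustrates that reading. The `rollout` task is a hypothetical toy, and the two variants differ only in a gain value that stands in for the baseline and the library shortcut. None of these names come from the lab itself.

```python
import random
import statistics


def rollout(gain, seed, target=5.0, noise_std=0.1, steps=30):
    """Final absolute error of a proportional controller on a toy 1-D task."""
    rng = random.Random(seed)
    position = 0.0
    for _ in range(steps):
        observation = position + rng.gauss(0.0, noise_std)
        position += gain * (target - observation)
    return abs(position - target)


def compare(variants, seeds):
    """Evaluate every variant on every seed in a single pass."""
    rows = []
    for seed in seeds:
        row = {"seed": seed}
        for name, gain in variants.items():
            row[name] = rollout(gain, seed)   # same seed, same config, same loop
        rows.append(row)
    summary = {name: statistics.mean(r[name] for r in rows) for name in variants}
    return rows, summary


# Hypothetical stand-ins: a real lab would plug in the two actual systems here.
variants = {"baseline": 0.5, "shortcut": 0.9}
seed_sets = {"A": [0, 1, 2], "B": [10, 11, 12]}

for label, seeds in seed_sets.items():
    rows, summary = compare(variants, seeds)
    # Verification: every row carries a value for every variant at the same seed.
    assert all(set(row) == {"seed", *variants} for row in rows)
    print(label, {name: round(value, 3) for name, value in summary.items()})
```

If the two seed sets rank the variants differently, that disagreement is itself worth a line in the postmortem.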

What's Next?

Continue with Section 3.1: The canonical stack: sense, perceive, estimate, predict, plan, control, act, where the chapter moves from motivation to the first concrete idea.

This chapter is written for readers who want theory and a working build path in the same pass. Read each section twice: first for the mechanism, then for the artifact you would save if you had to reproduce the result six months later.

Chapter Tool Map

Tool or Library	Where It Pays Off
ROS 2	separates system modules while preserving message contracts and timing
MuJoCo	gives architecture choices a repeatable simulated world for stress tests
LeRobot	anchors modern policy architectures in reusable datasets and policy APIs

Chapter Lab Extension

Extend the lab by adding one perturbation that targets how perception, estimation, planning, learning, and control are arranged into a system. Save configuration, logs, summary metrics, and two representative failure cases in the same artifact folder.
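A perturbation that targets the arrangement itself, rather than any single module, is latency between sensing and control. The sketch below is an assumed example of such a study, not a prescribed one: the controller acts on a reading that is `delay_steps` old, a run counts as failed when the error is still large late in the episode, and the mildest and most severe failures are saved as the two representative cases. The task, thresholds, and file names are all hypothetical.

```python
import json
import random
import tempfile
from collections import deque
from pathlib import Path


def rollout(delay_steps, seed, gain=0.7, target=5.0, noise_std=0.02,
            steps=60, tolerance=0.25):
    """Proportional control on a toy 1-D task with stale observations."""
    rng = random.Random(seed)
    position = 0.0
    # Fixed-length buffer: index 0 always holds the reading from delay_steps ago.
    readings = deque([0.0] * (delay_steps + 1), maxlen=delay_steps + 1)
    errors = []
    for _ in range(steps):
        readings.append(position + rng.gauss(0.0, noise_std))
        position += gain * (target - readings[0])   # controller sees old data
        errors.append(abs(position - target))
    late_peak = max(errors[-10:])                   # worst error near the end
    return {"delay_steps": delay_steps, "seed": seed,
            "late_peak_error": late_peak, "failed": late_peak > tolerance}


def run_perturbation_study(folder, delays=(0, 1, 2, 3), seeds=(0, 1, 2)):
    results = [rollout(d, s) for d in delays for s in seeds]
    failures = sorted((r for r in results if r["failed"]),
                      key=lambda r: r["late_peak_error"])
    # Two representative failures: the mildest and the most severe.
    failure_cases = [failures[0], failures[-1]] if len(failures) >= 2 else failures
    summary = {str(d): sum(r["failed"] for r in results if r["delay_steps"] == d) / len(seeds)
               for d in delays}                     # failure rate per delay
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "config.json").write_text(json.dumps({"delays": list(delays), "seeds": list(seeds)}))
    (folder / "log.jsonl").write_text("\n".join(json.dumps(r) for r in results))
    (folder / "summary.json").write_text(json.dumps(summary, indent=2))
    (folder / "failure_cases.json").write_text(json.dumps(failure_cases, indent=2))
    return results, summary, failure_cases


with tempfile.TemporaryDirectory() as tmp:
    results, summary, failure_cases = run_perturbation_study(Path(tmp) / "ch3_extension")
    files = sorted(p.name for p in (Path(tmp) / "ch3_extension").iterdir())
print(summary)
print(files)
```

With these particular constants, a one-step delay is tolerated while longer delays make the loop oscillate with growing amplitude. The failure is a property of how the modules are wired together, since neither the sensor model nor the controller changed.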

The chapter can be used as a self-contained reading unit or as the basis for an undergraduate or graduate teaching week. The recommended pattern is concept, minimal implementation, library shortcut, diagnostic exercise, then reflection on failure modes. This keeps the mathematical idea attached to a concrete system artifact rather than letting it float as notation.

For Embodied System Architectures, the practical stack should be introduced as a set of choices rather than a shopping list. Each tool earns its place only when it shortens the working path, improves reproducibility, or exposes a standard interface that readers will meet in real embodied systems.

Readiness Check

Before leaving the chapter, the reader should be able to state one theory claim, one implementation claim, one evaluation claim, and one realistic failure mode. If any of those four is missing, the chapter should be revisited through the lab.

Teaching Takeaway

A strong chapter session ends with an artifact: a small script, a plotted trace, a simulator run, a data card, or a reproducible evaluation panel. The artifact is what turns reading into embodied-system-building practice.

Fun Note

Chapter 3 treats architecture as something readers can test, not a poster to admire. If a diagram cannot produce a trace, it still owes the reader a better contract.

Bibliography & Further Reading

Foundational Papers, Tools, and References

Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." (2018). http://incompleteideas.net/book/the-book-2nd.html

A foundation for value functions, policy gradients, exploration, and the RL framing used throughout the book.

Todorov, E., Erez, T., and Tassa, Y. "MuJoCo: A physics engine for model-based control." (2012). https://mujoco.org/

The simulator lineage behind much modern robot learning, now extended through MJX and Warp workflows.

Brohan, A. et al. "RT-1: Robotics Transformer for real-world control at scale." (2022). https://arxiv.org/abs/2212.06817

A landmark in large-scale robot policy learning with transformer policies.

Brohan, A. et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." (2023). https://arxiv.org/abs/2307.15818

A central reference for connecting web-scale VLM knowledge to robot actions.

Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." (2023). https://arxiv.org/abs/2310.08864

The cross-embodiment data and transfer reference used by the data chapters.

Chi, C. et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." (2023). https://arxiv.org/abs/2303.04137

The practical diffusion policy reference for imitation learning and continuous action generation.

Hafner, D. et al. "Mastering Diverse Domains through World Models." (2023). https://arxiv.org/abs/2301.04104

DreamerV3, a modern reference for latent world models and imagination-based control.

Hugging Face. "LeRobot." (2024). https://github.com/huggingface/lerobot

The open robot-learning stack used for datasets, policies, demos, and low-cost embodied AI workflows.

Official documentation and source repositories for the tools used in this chapter.

Use official docs to check install commands, current APIs, and version caveats before applying these tools in a lab or project.