Part VI: Embodied Perception | Building Embodied AI: From Perception to Autonomous Action

Part Overview

Part VI builds the perception layer that later chapters use for language-guided action, world models, manipulation, evaluation, and deployment. The part treats perception as action support: every mask, depth map, point cloud, scene representation, pose estimate, map, and route must make a robot decision, visual servoing update, or recovery action safer or easier to debug.

The four chapters move from visual perception for action and visual servoing, to 3D scene state, to SLAM, to navigation and path planning. The practical stack includes OpenCV, Open3D, PyTorch, Segment Anything, DINOv2, Gaussian Splatting workflows, ROS 2, Nav2, and Habitat-style evaluation.

Why This Part Matters

Embodied perception is the book's bridge from seeing to doing. A model output becomes useful only after it is grounded in coordinate frames, uncertainty estimates, timing, action constraints, and evaluation matched to the construct being measured.

Part VI Rule

Do not ask only what the agent sees. Ask what the agent can safely do because it saw it, and what evidence would convince you, after the rollout, that the action was justified.

Chapter 27 Visual Perception for Action

A robot that can name every object on a table can still knock over the cup if its visual system never answers the control question: where can I move next?

Chapter 28 3D Perception and Neural Scene Representations

A flat image can tell the agent what is visible. A 3D scene representation tells it what space it can occupy, what it can touch, and what might be hidden behind the next move.

Chapter 29 Localization and Mapping (SLAM)

A robot that does not know where it is will turn every good plan into a guess. SLAM is the discipline of making that guess explicit, updateable, and testable.

29.1 Where am I, and what does the world look like?
29.2 Odometry and dead reckoning
29.3 Localization (Monte Carlo / particle filters)
29.4 Mapping and occupancy grids
29.5 SLAM: graph-based and visual SLAM
29.6 Neural and Gaussian-splat SLAM
29.7 Map uncertainty
29.8 Modern SLAM systems and failure modes

Chapter 30 Navigation and Path Planning

Navigation is where perception becomes a commitment. The agent has to choose a route, spend time, avoid contact, and still recover when the world refuses to match the map.

Running Example

Across the part, imagine a mobile manipulator entering a cluttered room, finding a target object, building a map, planning a path, and recovering when a mask, pose, or local costmap becomes unreliable.

What's Next?

Part VII: Language, Vision, and Action follows this part. It adds language goals, vision-language model (VLM) grounding, and vision-language-action (VLA) policies on top of the perception stack built here.