Part VI: Embodied Perception

Part Overview

Part VI builds the perception layer that later chapters use for language-guided action, world models, manipulation, evaluation, and deployment. The part treats perception as action support: every mask, depth map, point cloud, scene representation, pose estimate, map, and route must make a robot decision, visual servoing update, or recovery action safer or easier to debug.

The four chapters move from visual perception for action and visual servoing, to 3D scene state, to SLAM, to navigation and path planning. The practical stack includes OpenCV, Open3D, PyTorch, Segment Anything, DINOv2, Gaussian Splatting workflows, ROS 2, Nav2, and Habitat-style evaluation.

Why This Part Matters

Embodied perception is the book's bridge from seeing to doing. A model output becomes useful only after it is grounded in coordinate frames, uncertainty, timing, action constraints, and construct-matched evaluation.

Part VI Rule

Do not ask only what the agent sees. Ask what the agent can safely do because it saw it, and what evidence would convince you of that after the rollout.

A robot that can name every object on a table can still knock over the cup if its visual system never answers the control question: where can I move next?

A flat image can tell the agent what is visible. A 3D scene representation tells it what space it can occupy, what it can touch, and what might be hidden behind the next move.
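As a small illustration of the step from a flat image to occupiable space, the sketch below back-projects a single depth pixel into a 3D point in the camera frame using the standard pinhole model. The `Intrinsics` class, the `backproject` helper, and the intrinsic values are hypothetical and chosen only for illustration; a real pipeline would take calibrated intrinsics from the camera and would also handle lens distortion and invalid depth readings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative values only)."""
    fx: float  # focal length along the image x axis
    fy: float  # focal length along the image y axis
    cx: float  # principal point, x
    cy: float  # principal point, y


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> tuple:
    """Map pixel (u, v) with metric depth to (x, y, z) in the camera frame."""
    if depth_m <= 0.0:
        # Zero or negative depth usually marks a missing reading.
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Assumed intrinsics for a 640x480 image; not taken from any real sensor.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# A pixel 100 columns right of the image center, observed 2 m away.
point = backproject(420.0, 240.0, 2.0, k)
print(point)
```

The point is still expressed in the camera frame. Before it can support a motion decision it has to be transformed into the robot or map frame, which is where calibration and pose estimates enter.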

A robot that does not know where it is will turn every good plan into a guess. SLAM is the discipline of making that guess explicit, updateable, and testable.
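One way to read "explicit, updateable, and testable" is through a one-dimensional Kalman filter, sketched below. This is a deliberately reduced stand-in for illustration, not the SLAM method the part develops: the position estimate is a mean and a variance (explicit), motion and measurements revise it (updateable), and a gate on the measurement residual flags readings that disagree with the estimate (testable). All function names and numbers are hypothetical.

```python
import math


def predict(mean: float, var: float, motion: float, motion_var: float) -> tuple:
    """Apply a motion command; uncertainty grows by the motion noise."""
    return mean + motion, var + motion_var


def update(mean: float, var: float, z: float, z_var: float) -> tuple:
    """Fuse a position measurement z; uncertainty shrinks."""
    gain = var / (var + z_var)
    return mean + gain * (z - mean), (1.0 - gain) * var


def is_consistent(mean: float, var: float, z: float, z_var: float,
                  gate_sigmas: float = 3.0) -> bool:
    """Check whether z lies within gate_sigmas of the predicted measurement."""
    return abs(z - mean) <= gate_sigmas * math.sqrt(var + z_var)


# Start at 0 m with variance 1.0, then command a 1 m move.
mean, var = 0.0, 1.0
mean, var = predict(mean, var, motion=1.0, motion_var=0.5)

# A plausible measurement passes the gate and is fused.
ok_near = is_consistent(mean, var, z=1.2, z_var=0.5)
# A wildly different measurement fails the gate and would be rejected.
ok_far = is_consistent(mean, var, z=9.0, z_var=0.5)

if ok_near:
    mean, var = update(mean, var, z=1.2, z_var=0.5)

print(mean, var, ok_near, ok_far)
```

After the update the variance is smaller than it was after the prediction step, which is the sense in which a measurement turns a guess into a better guess. The rejected reading is the kind of event a recovery behavior would need to handle.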

Navigation is where perception becomes a commitment. The agent has to choose a route, spend time, avoid contact, and still recover when the world refuses to match the map.

Running Example

Across the part, imagine a mobile manipulator entering a cluttered room, finding a target object, building a map, planning a path, and recovering when a mask, pose, or local costmap becomes unreliable.

What's Next?

After this part, Part VII: Language, Vision, and Action adds language goals, VLM grounding, and VLA policies on top of the perception stack built here.