Part Overview
Part VI builds the perception layer that later chapters use for language-guided action, world models, manipulation, evaluation, and deployment. The part treats perception as action support: every mask, depth map, point cloud, scene representation, pose estimate, map, and route must make a robot decision, visual servoing update, or recovery action safer or easier to debug.
The four chapters (27 through 30) move from visual perception for action and visual servoing, to 3D scene state, to SLAM, to navigation and path planning. The practical stack includes OpenCV, Open3D, PyTorch, Segment Anything, DINOv2, 3D Gaussian Splatting workflows, ROS 2, Nav2, and Habitat-style evaluation.
Embodied perception is the book's bridge from seeing to doing. A model output becomes useful only after it is grounded in frames, uncertainty, timing, action constraints, and construct-matched evaluation.
Do not ask only what the agent sees. Ask what the agent can safely do because it saw it, and what evidence would convince you, after the rollout, that the decision was sound.
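One way to make this grounding concrete is to attach the action-relevant context to every perception output and check it before acting. The sketch below is illustrative only: the `GroundedEstimate` schema, the `is_actionable` helper, the `base_link` frame name, and the thresholds are all assumptions, not an interface defined in this book, and it reads "frames" as coordinate frames.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GroundedEstimate:
    """A perception output plus the context an action needs (hypothetical schema)."""
    label: str                              # what was perceived, e.g. "cup"
    frame_id: str                           # coordinate frame of the position
    stamp_s: float                          # measurement time, in seconds
    position_m: Tuple[float, float, float]  # position in frame_id, in metres
    sigma_m: float                          # 1-sigma position uncertainty, in metres


def is_actionable(
    est: GroundedEstimate,
    now_s: float,
    expected_frame: str = "base_link",  # assumed control frame
    max_age_s: float = 0.2,             # assumed staleness limit
    max_sigma_m: float = 0.02,          # assumed uncertainty limit
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). The reasons are kept so a rejection can be debugged."""
    reasons: List[str] = []
    if est.frame_id != expected_frame:
        reasons.append(f"frame {est.frame_id!r} is not {expected_frame!r}")
    age_s = now_s - est.stamp_s
    if age_s > max_age_s:
        reasons.append(f"estimate is {age_s:.2f} s old (limit {max_age_s} s)")
    if est.sigma_m > max_sigma_m:
        reasons.append(f"sigma {est.sigma_m} m exceeds {max_sigma_m} m")
    return (not reasons, reasons)
```

Returning the list of reasons, rather than a bare boolean, is a deliberate choice in this sketch: it is the kind of evidence that can be logged during a rollout and inspected afterwards.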
A robot that can name every object on a table can still knock over the cup if its visual system never answers the control question: where can I move next?
A flat image can tell the agent what is visible. A 3D scene representation tells it what space it can occupy, what it can touch, and what might be hidden behind the next move.
- 28.1 Why 3D matters for manipulation and navigation
- 28.2 Point clouds and depth maps
- 28.3 3D detection and scene reconstruction
- 28.4 Occupancy grids and voxel maps
- 28.5 NeRF: implicit radiance fields
- 28.6 3D Gaussian Splatting: explicit, editable, real-time
- 28.7 Scene representations for robotics: SLAM, real2sim, manipulation
A robot that does not know where it is will turn every good plan into a guess. SLAM is the discipline of making that guess explicit, updateable, and testable.
Navigation is where perception becomes a commitment. The agent has to choose a route, spend time, avoid contact, and still recover when the world refuses to match the map.
- 30.1 Navigation as embodied intelligence
- 30.2 Graph search: BFS, Dijkstra, A*
- 30.3 Sampling-based planning: RRT, RRT*, PRM
- 30.4 Local planning and obstacle avoidance (DWA, potential fields)
- 30.5 Learned navigation policies
- 30.6 Language- and image-goal navigation
- 30.7 Field navigation under degraded sensing
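As a small preview of the graph-search material listed above, here is a minimal A* sketch on a 4-connected occupancy grid with unit step cost. The grid encoding (0 for free, 1 for occupied), the function name `astar`, and the Manhattan heuristic are choices made for this illustration, not the chapter's reference implementation.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (occupied) cells.

    Returns a list of (row, col) cells from start to goal, or None if the
    goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, cell)
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if g > best_cost[current]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
                continue  # outside the map or occupied
            new_g = g + 1
            if new_g < best_cost.get((nr, nc), float("inf")):
                best_cost[(nr, nc)] = new_g
                came_from[(nr, nc)] = current
                heapq.heappush(open_heap, (new_g + heuristic((nr, nc)), new_g, (nr, nc)))
    return None
```

Calling `astar` again after the grid changes is the simplest form of recovery when the world stops matching the map. A `None` result tells the agent that no route exists on the current map, so it should fall back to another behavior rather than commit to a stale plan.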
Across the part, imagine a mobile manipulator entering a cluttered room, finding a target object, building a map, planning a path, and recovering when a mask, pose, or local costmap becomes unreliable.
What's Next?
After this part, Part VII: Language, Vision, and Action adds language goals, vision-language model (VLM) grounding, and vision-language-action (VLA) policies on top of the perception stack built here.