"An agent becomes interesting at the exact moment perception changes what it dares to do next."
A Patient Embodied AI Agent
3D Perception and Neural Scene Representations turns perception into action-ready state. A flat image can tell the agent what is visible. A 3D scene representation tells it what space it can occupy, what it can touch, and what might be hidden behind the next move.
The durable test is not whether a model looks impressive. The test is whether it improves a robot's next action while leaving a clear evidence trail for debugging.
Chapter Overview
Chapter 28 develops 3D Perception and Neural Scene Representations as a working piece of the embodied AI stack. It connects visual or spatial evidence to state estimates, action choices, visual servoing loops, timing budgets, and failure labels.
The chapter follows the right-tool rhythm used across the book: build the mechanism once, then move to maintained tools such as Open3D, PyTorch, OpenCV, and Gaussian Splatting workflows.
Prerequisites
Readers should be comfortable with Python, tensors, coordinate frames, sensor noise, and the perception-action loop. Useful refreshers appear in Chapter 4, Chapter 8, and Chapter 13.
Chapter Roadmap
- 28.1 Why 3D matters for manipulation and navigation. 3D is the bridge from pixels to reachable, traversable, and occluded space.
- 28.2 Point clouds and depth maps. Depth maps become point clouds when pixel coordinates are lifted through camera intrinsics.
- 28.3 3D detection and scene reconstruction. The agent needs object hypotheses that live in world coordinates and survive across views.
- 28.4 Occupancy grids and voxel maps. Occupancy models store where the world is free, occupied, or unknown.
- 28.5 NeRF: implicit radiance fields. A NeRF stores a scene as a function from 3D position and view direction to density and color.
- 28.6 3D Gaussian Splatting: explicit, editable, real-time. Gaussian splats store scene content as explicit ellipsoids that can render quickly and be edited locally.
- 28.7 Scene representations for robotics: SLAM, real2sim, manipulation. A robot-ready scene representation must support pose tracking, action planning, and updates after contact.
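As a preview of the lift named in item 28.2, the following is a minimal sketch of pinhole back-projection using only the standard library. The intrinsics (`fx`, `fy`, `cx`, `cy`), the tiny depth map, and the convention that a zero depth marks a missing reading are illustrative assumptions, not values from a real sensor.

```python
# Back-project a depth map into camera-frame 3D points with the pinhole model.
# All numbers below are made up for illustration.

def lift_pixel(u, v, z, fx, fy, cx, cy):
    """Return (X, Y, Z) in the camera frame for pixel (u, v) at depth z in metres."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def depth_to_points(depth, fx, fy, cx, cy):
    """Lift every pixel with a positive depth; skip missing returns."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0.0:  # assumption: zero marks a missing depth reading
                points.append(lift_pixel(u, v, z, fx, fy, cx, cy))
    return points


depth_map = [
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
]
cloud = depth_to_points(depth_map, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
for point in cloud:
    print(point)
```

The points stay in the camera frame; placing them in a world or robot frame needs the camera pose as a separate step.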
This chapter uses the right-tool principle. The teaching baseline exposes units, frames, uncertainty, and logging. The shortcut stack uses maintained tools to handle optimized kernels, visualization, data formats, simulation hooks, and deployment interfaces.
Hands-On Lab: Build A Scene-Representation Query Router
Objective
Build a small router that sends localization, collision, contact, rendering, and real2sim queries to representations with matching semantics.
What You'll Practice
- Separating visualization memory from safety-critical geometry.
- Choosing between point clouds, occupancy, object graphs, NeRFs, and Gaussian splats.
- Attaching provenance, frame, and update-rate requirements to each query.
- Designing a failure label for stale or mismatched scene memory.
Setup
Start with Python's standard data structures. Add Open3D, Nerfstudio, or ROS 2 only after the query ownership table is clear.
# Optional tools for extending the lab after the baseline router works.
python -m pip install numpy open3d
Steps
Step 1: List The Robot Queries
Write queries for localize, avoid collision, plan grasp, render operator view, and export a real2sim scene.
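One way to write the queries down is as small records, so each one carries its question and a risk flag from the start. This is a sketch: the `SceneQuery` class, the question wording, and the `safety_critical` flags are hypothetical choices for illustration, and a real system might classify them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneQuery:
    name: str              # short identifier used as the routing key
    question: str          # what the robot is actually asking
    safety_critical: bool  # illustrative judgment, not a standard


QUERIES = [
    SceneQuery("localize", "Where is the robot in the map frame?", True),
    SceneQuery("avoid_collision", "Is this swept volume free of obstacles?", True),
    SceneQuery("plan_grasp", "Where can the gripper make stable contact?", True),
    SceneQuery("render_operator_view", "What would a camera see from here?", False),
    SceneQuery("export_real2sim", "What geometry and objects go to the simulator?", False),
]

for q in QUERIES:
    tag = "safety-critical" if q.safety_critical else "best-effort"
    print(f"{q.name} [{tag}]: {q.question}")
```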
Step 2: Route Each Query
Assign each query to a representation that can answer it with the right semantics and risk level.
# Route scene queries to representations with matching safety semantics.
# Rendering and collision checking deliberately use different owners.
routes = {
    "localize": "visual_inertial_slam",
    "avoid_collision": "inflated_occupancy_or_esdf",
    "plan_grasp": "object_pose_plus_contact_geometry",
    "render_operator_view": "nerf_or_gaussian_splats",
    "export_real2sim": "mesh_plus_object_scene_graph",
}
for query, owner in routes.items():
    print(f"{query}: {owner}")
Step 3: Add Freshness Requirements
Attach maximum age, frame, and provenance fields to each route so stale scene memory can be rejected.
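A minimal sketch of such a check is given here, assuming each route is a small record and each incoming scene snapshot reports its age and frame. The `Route` class, the `is_fresh` helper, the frame name `base_link`, and the provenance string are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str        # representation that answers the query
    frame: str        # coordinate frame the answer must be expressed in
    max_age_ms: int   # oldest scene data the query may rely on
    provenance: str   # sensor or pipeline that produced the data


def is_fresh(route, data_age_ms, data_frame):
    """Accept scene data only if it is recent enough and in the expected frame."""
    return data_age_ms <= route.max_age_ms and data_frame == route.frame


collision_route = Route(
    owner="inflated_occupancy_or_esdf",
    frame="base_link",
    max_age_ms=100,
    provenance="front_depth_camera",
)

print(is_fresh(collision_route, data_age_ms=40, data_frame="base_link"))   # recent, right frame
print(is_fresh(collision_route, data_age_ms=350, data_frame="base_link"))  # too old
print(is_fresh(collision_route, data_age_ms=40, data_frame="map"))         # wrong frame
```

Keeping the check as a pure function of the route and the data's metadata makes a rejected query easy to log and replay.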
Step 4: Add A Failure Case
Create one case where a rendered scene is visually plausible but too stale or too non-conservative for collision checking.
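The failure case might be sketched as a labeling function that rejects a collision request for either reason. The helper `label_collision_request`, the set `CONSERVATIVE_OWNERS`, and the label `stale_scene_memory` are hypothetical; the 100 ms limit and the label `wrong_representation_for_collision` mirror the baseline solution in this lab.

```python
# Owners assumed to give conservative geometry suitable for collision checks.
CONSERVATIVE_OWNERS = {"inflated_occupancy_or_esdf"}
COLLISION_MAX_AGE_MS = 100


def label_collision_request(owner, age_ms):
    """Return a failure label for a collision query, or 'none' if it is acceptable."""
    if owner not in CONSERVATIVE_OWNERS:
        # A rendered scene can look right and still miss thin or unseen obstacles.
        return "wrong_representation_for_collision"
    if age_ms > COLLISION_MAX_AGE_MS:
        return "stale_scene_memory"
    return "none"


print(label_collision_request("nerf_or_gaussian_splats", age_ms=20))
print(label_collision_request("inflated_occupancy_or_esdf", age_ms=500))
print(label_collision_request("inflated_occupancy_or_esdf", age_ms=50))
```

The representation check runs first, so a fresh but non-conservative scene is still rejected.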
Step 5: Replace One Route With A Tool
Use Open3D for a point-cloud or voxel route, or Nerfstudio for a visual replay route, while preserving the query contract.
Expected Output
A query table that maps each robot question to a representation, required metadata, freshness limit, and failure label.
Stretch Goals
- Create a tiny Open3D point cloud and route collision queries to it.
- Add a Gaussian-splat route for visualization with an explicit no-control warning.
- Export the routing table as JSON for a simulator or ROS 2 node.
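For the JSON export goal, a minimal sketch with the standard `json` module might look like the following. The `control_use` field is a hypothetical way to carry the no-control warning for the rendering route; a simulator or ROS 2 node would need to agree on its own schema.

```python
import json

# Routing table with an explicit marker that rendering memory must not drive control.
routing_table = {
    "avoid_collision": {
        "owner": "inflated_occupancy_or_esdf",
        "max_age_ms": 100,
        "control_use": "allowed",
    },
    "render_operator_view": {
        "owner": "nerf_or_gaussian_splats",
        "max_age_ms": 2000,
        "control_use": "forbidden",
    },
}

# sort_keys keeps the output stable, which helps when diffing exported tables.
payload = json.dumps(routing_table, indent=2, sort_keys=True)
print(payload)

# Round-trip check: the consumer should recover the same table.
restored = json.loads(payload)
print(restored == routing_table)
```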
Complete Solution
# Complete baseline for the scene-representation query router.
# It flags unsafe attempts to use rendering memory for collision checks.
routes = {
    "avoid_collision": {"owner": "inflated_occupancy_or_esdf", "max_age_ms": 100},
    "render_operator_view": {"owner": "nerf_or_gaussian_splats", "max_age_ms": 2000},
}
# Failure case: a collision query is answered from the rendering representation.
query = "avoid_collision"
requested_owner = routes["render_operator_view"]["owner"]
# Compare against the owner the query is routed to, not a substring of its name.
mismatch = requested_owner != routes[query]["owner"]
failure_label = "wrong_representation_for_collision" if mismatch else "none"
print(failure_label)
Use this chapter as a complete teaching unit for scene memory that a robot can query: point clouds, object states, occupancy, neural fields, Gaussian splats, SLAM layers, and real2sim exports. The central question is which representation can answer the action query safely, not which representation renders the most impressive image.
| Tool or Library | Where It Pays Off |
|---|---|
| Open3D | RGB-D conversion, point clouds, voxelization, registration, normals, and geometry inspection. |
| OpenCV | Camera intrinsics, stereo reconstruction, pose estimation, and calibration checks before 3D fusion. |
| PyTorch | Learned scene encoders, neural field components, uncertainty heads, and differentiable geometry prototypes. |
| Nerfstudio | NeRF and Splatfacto workflows for neural scene training, inspection, and export experiments. |
| gsplat and 3DGS workflows | CUDA-accelerated Gaussian splat rendering and explicit scene-element experimentation. |
| ROS 2 and SLAM stacks | Pose tracking, map publication, diagnostics, replay, and integration with navigation or manipulation. |
Before leaving the chapter, the reader should be able to choose a representation for localization, collision checking, contact planning, visual replay, object reasoning, and simulation export.
A strong chapter session ends with a query map: every scene representation is tied to the robot query it can answer and the safety checks it cannot replace.
What's Next?
Start with Section 28.1: Why 3D matters for manipulation and navigation. After this chapter, continue to Chapter 29: Localization and Mapping (SLAM).
Bibliography & Further Reading
Foundational Papers, Tools, and References
Mildenhall, B. et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV, 2020. https://arxiv.org/abs/2003.08934
The foundational neural radiance field paper behind implicit scene representations.
Kerbl, B. et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM TOG, 2023. https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
The reference for explicit Gaussian scene elements and real-time rendering.
Zhou, Q.-Y., Park, J., and Koltun, V. "Open3D: A Modern Library for 3D Data Processing." arXiv, 2018. https://www.open3d.org/
A practical library reference for point clouds, meshes, registration, and visualization.
Tancik, M. et al. "Nerfstudio: A Modular Framework for Neural Radiance Field Development." SIGGRAPH, 2023. https://docs.nerf.studio/
A maintained workflow for training, inspecting, and exporting neural scene models.