"Perception earns its keep when the next action gets safer, faster, or easier to debug."
A Patient Embodied AI Agent
This section is about choosing the right memory format for the job. SLAM needs pose and map consistency, real2sim needs editable geometry and materials, and manipulation needs local contact and object state that survive interaction.
Problem First: Why This Representation Exists
A static computer-vision system can stop when it names an object or produces a clean visualization. An embodied system cannot. The robot needs a representation that is tied to coordinates, uncertainty, latency, and an action interface, because a late or uncalibrated result can be more dangerous than no result at all.
For this section, the useful mental model is an action contract. The perception module receives sensor evidence, estimates a compact state, exposes confidence and timing, and lets a planner or controller decide whether the next action is allowed. This is the bridge from coordinate frames and sensor estimation to the closed-loop evaluation discipline used in sim-to-real transfer.
A perception output becomes embodied knowledge only when it can change an admissible action, a recovery choice, or a safety margin. If the same command is issued with and without the representation, the representation is not yet part of the control loop.
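The action contract can be made concrete with a small gate between perception and the planner. The sketch below is illustrative only: the `PerceptionEstimate` schema, the `action_allowed` function, the `base_link` frame name, and the thresholds are all assumptions chosen for the example, not values from any particular robot stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptionEstimate:
    """Compact state handed from perception to the planner (hypothetical schema)."""
    frame_id: str        # coordinate frame the estimate is expressed in
    position_m: tuple    # (x, y, z) in metres
    confidence: float    # 0.0 to 1.0, as reported by the perception module
    stamp_s: float       # capture time of the sensor evidence, in seconds


def action_allowed(est, now_s, expected_frame="base_link",
                   min_confidence=0.8, max_age_s=0.2):
    """Return (allowed, reason). Thresholds are illustrative assumptions."""
    if est.frame_id != expected_frame:
        return False, "wrong_frame"
    if now_s - est.stamp_s > max_age_s:
        return False, "stale"
    if est.confidence < min_confidence:
        return False, "low_confidence"
    return True, "ok"


fresh = PerceptionEstimate("base_link", (0.4, 0.0, 0.1), 0.93, stamp_s=10.00)
late = PerceptionEstimate("base_link", (0.4, 0.0, 0.1), 0.93, stamp_s=9.50)

print(action_allowed(fresh, now_s=10.05))  # (True, 'ok')
print(action_allowed(late, now_s=10.05))   # (False, 'stale')
```

The two estimates are identical except for their timestamps, yet only one permits the action. That difference is what makes the representation part of the control loop rather than a visualization.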
Read Figure 28.7.1 as the handoff diagram for this section: sensor evidence, geometric representation, uncertainty, latency, and action consumer are separate failure points.
Mathematical Core
A robot scene memory is best viewed as a set of query functions, not a single universal format.
$\mathcal M=\{q_{\mathrm{pose}},q_{\mathrm{free}},q_{\mathrm{contact}},q_{\mathrm{object}},q_{\mathrm{render}}\}$
Different tasks ask different queries. A renderer asks for appearance, a planner asks for free space, a manipulator asks for contact geometry, and a language planner asks for object relations. A mature system routes each query to the representation that can answer it safely.
- Write the downstream queries before choosing the map format.
- Separate safety-critical geometry from visualization-only memory.
- Keep object state updateable after contact, occlusion, or task progress.
- For real2sim, store provenance so synthetic scenes can be traced back to capture data and edits.
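One way to read $\mathcal M$ as a set of query functions is a registry that records, for each query, which representation answers it, whether that answer may gate motion, and where the data came from. The following is a minimal sketch under those assumptions; `SceneMemory`, `QueryBackend`, the toy occupancy set, and the `capture_001` provenance tag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class QueryBackend:
    name: str                    # which representation answers the query
    answer: Callable[..., Any]   # the query function itself
    safety_critical: bool        # may this answer gate robot motion?
    source: str                  # provenance tag (capture session or edit id)


class SceneMemory:
    """Routes each named query to the backend registered for it."""

    def __init__(self) -> None:
        self._backends: Dict[str, QueryBackend] = {}

    def register(self, query: str, backend: QueryBackend) -> None:
        self._backends[query] = backend

    def ask(self, query: str, *args, for_control: bool = False):
        backend = self._backends[query]
        # Refuse to let visualization-only memory answer a control query.
        if for_control and not backend.safety_critical:
            raise PermissionError(
                f"{backend.name} is visualization-only; "
                f"it cannot answer '{query}' for control"
            )
        return backend.answer(*args)


occupied = {(1, 0), (1, 1)}  # toy 2D occupancy cells
memory = SceneMemory()
memory.register("free", QueryBackend(
    "occupancy_grid", lambda cell: cell not in occupied, True, "capture_001"))
memory.register("render", QueryBackend(
    "splat_model", lambda view: f"image@{view}", False, "capture_001"))

print(memory.ask("free", (0, 0), for_control=True))  # True
print(memory.ask("free", (1, 1), for_control=True))  # False
print(memory.ask("render", "front"))                 # image@front
```

Asking the `render` backend with `for_control=True` raises `PermissionError`, which is the code-level version of keeping safety-critical geometry separate from visualization-only memory.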
| Use When | Design Choice | Control Risk |
|---|---|---|
| Pose tracking | SLAM graph, visual-inertial odometry | Map inconsistency corrupts every downstream query. |
| Collision planning | Occupancy, ESDF, mesh, verified cloud | Rendering fields may be non-conservative. |
| Task reasoning | Object-centric scene graph | Relations can become stale after manipulation. |
| Visual replay | NeRF or 3DGS | Photorealism can hide missing control semantics. |
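The stale-relation risk in the task-reasoning row can be handled by invalidating relations whenever the robot touches an object. The sketch below assumes a very small object-centric graph; `SceneGraph`, `Relation`, and `mark_manipulated` are hypothetical names, and a real system would also re-observe the object to restore its relations.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str
    observed_at_s: float
    valid: bool = True


@dataclass
class SceneGraph:
    relations: list = field(default_factory=list)

    def observe(self, subject, predicate, obj, stamp_s):
        self.relations.append(Relation(subject, predicate, obj, stamp_s))

    def mark_manipulated(self, object_name):
        """Invalidate every relation that mentions the touched object."""
        for rel in self.relations:
            if object_name in (rel.subject, rel.obj):
                rel.valid = False

    def query(self, subject, predicate):
        """Return targets of relations that have not been invalidated."""
        return [r.obj for r in self.relations
                if r.valid and r.subject == subject and r.predicate == predicate]


graph = SceneGraph()
graph.observe("mug", "on", "counter", stamp_s=1.0)
graph.observe("plate", "on", "counter", stamp_s=1.0)

graph.mark_manipulated("mug")      # the gripper has just picked up the mug
print(graph.query("mug", "on"))    # []
print(graph.query("plate", "on"))  # ['counter']
```

Returning an empty answer for the mug is deliberate: a task planner that receives no relation must request a new observation instead of acting on a belief that the manipulation has already falsified.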
Worked Miniature
Read the table above as a routing table that divides semantic labor across representations. The key lesson is that a visually rich model can own operator rendering, while collision checking and grasp planning still demand geometry with stricter safety meaning.
ROS 2, SLAM systems, Open3D, Nerfstudio, and simulator import pipelines already provide pieces of this routing. The engineering task is to keep provenance, timestamps, frame transforms, and query ownership explicit.
A real2sim scene that looks correct can still be physically wrong if mass, friction, joint limits, collision geometry, or object poses are not audited.
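Such an audit can start as a plain checklist run over every exported object before the scene is trusted for physics. The sketch below assumes each asset is a dictionary with hypothetical field names (`mass_kg`, `friction`, `collision_geometry`, `pose`, `joint_limits`); it is not the import schema of any specific simulator.

```python
REQUIRED_FIELDS = ("mass_kg", "friction", "collision_geometry", "pose")


def audit_asset(asset):
    """Return a list of problems found for one simulated object."""
    problems = [f"missing:{key}" for key in REQUIRED_FIELDS
                if asset.get(key) is None]

    mass = asset.get("mass_kg")
    if mass is not None and mass <= 0.0:
        problems.append("nonpositive_mass")

    friction = asset.get("friction")
    if friction is not None and friction < 0.0:
        problems.append("negative_friction")

    # Joint limits only apply to articulated objects, so they are optional.
    for name, (lower, upper) in asset.get("joint_limits", {}).items():
        if lower >= upper:
            problems.append(f"bad_joint_limits:{name}")

    return problems


drawer = {
    "mass_kg": 2.5,
    "friction": None,  # never measured during capture
    "collision_geometry": "drawer_convex.obj",
    "pose": (0.6, 0.0, 0.8, 0.0, 0.0, 0.0),
    "joint_limits": {"slide": (0.0, 0.35)},
    "source": "capture_001",
}

print(audit_asset(drawer))  # ['missing:friction']
```

The drawer in this example would render correctly, but the audit flags that its friction was never set, which is exactly the kind of gap that a visual check cannot reveal.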
A Boston Dynamics-style inspection robot might use visual-inertial SLAM for pose, an ESDF for safe footstep or body clearance, object memory for task state, and splats or NeRFs for operator visualization.
For this section, the perception result must answer three questions: what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.
Debugging And Evaluation
Evaluate the representation inside the same action loop that will use it. The report should include the sensor stream, calibration version, frame transform, model checkpoint or library version, latency distribution, action candidate set, chosen action, and failure label. This keeps the comparison matched: the baseline and any shortcut variant are judged by the same script on the same test panel.
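The report fields listed above map naturally onto one record per rollout. The following is a minimal sketch of such a record; the `RolloutReport` class, its field names, and the example values are assumptions made for illustration, and the latency summary uses only the median and maximum to stay short.

```python
import json
import statistics
from dataclasses import dataclass, asdict


@dataclass
class RolloutReport:
    sensor_stream: str
    calibration_version: str
    frame_transform: str
    model_version: str
    latencies_ms: list
    action_candidates: list
    chosen_action: str
    failure_label: str  # "none" when the rollout succeeded

    def latency_summary(self):
        """Summarize the latency distribution with median and worst case."""
        return {
            "median_ms": statistics.median(self.latencies_ms),
            "max_ms": max(self.latencies_ms),
        }

    def to_json(self):
        record = asdict(self)
        record["latency_summary"] = self.latency_summary()
        return json.dumps(record, sort_keys=True)


report = RolloutReport(
    sensor_stream="front_rgbd",            # hypothetical stream name
    calibration_version="calib-v3",        # hypothetical version tag
    frame_transform="camera_to_base_link",
    model_version="detector-0.1",
    latencies_ms=[18.0, 22.0, 19.0, 95.0, 21.0],
    action_candidates=["grasp", "regrasp", "stop"],
    chosen_action="stop",
    failure_label="latency_spike",
)

print(report.latency_summary())  # {'median_ms': 21.0, 'max_ms': 95.0}
print(report.to_json())
```

Recording the full latency list rather than a single mean matters here: the median looks healthy while the maximum shows the spike that explains the `stop` decision.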
A good debugging run varies one factor at a time. Perturb lighting, occlusion, calibration, motion blur, viewpoint, object pose, or update rate, then record whether the action changed for the right reason. That single-factor habit is what turns a failed rollout into a useful engineering artifact.
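The single-factor habit can be enforced by generating test conditions that differ from the baseline in exactly one key. In the sketch below, `toy_policy` is a hypothetical stand-in for the real perception and planning loop, and the factor names and values are assumptions chosen for the example.

```python
BASELINE = {"lighting": "nominal", "occlusion": 0.0, "update_rate_hz": 30}

PERTURBATIONS = {
    "lighting": ["dim"],
    "occlusion": [0.5],
    "update_rate_hz": [5],
}


def single_factor_conditions(baseline, perturbations):
    """Yield (factor, condition) pairs that change exactly one baseline key."""
    for factor, values in perturbations.items():
        for value in values:
            condition = dict(baseline)
            condition[factor] = value
            yield factor, condition


def toy_policy(condition):
    """Stand-in for the real perception-plus-planner loop (hypothetical)."""
    if condition["occlusion"] > 0.3 or condition["update_rate_hz"] < 10:
        return "stop"
    return "grasp"


baseline_action = toy_policy(BASELINE)
for factor, condition in single_factor_conditions(BASELINE, PERTURBATIONS):
    action = toy_policy(condition)
    status = "changed" if action != baseline_action else "same"
    print(factor, condition[factor], action, status)
# lighting dim grasp same
# occlusion 0.5 stop changed
# update_rate_hz 5 stop changed
```

Because each condition changes one factor, a changed action can be attributed to that factor alone, which is what lets the log say whether the action changed for the right reason.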
The current frontier is hybrid scene memory: SLAM for pose, neural fields or splats for dense visual memory, object-centric graphs for reasoning, and verified geometry for control. The hard part is keeping those layers synchronized as the robot acts.
Chapter 29 takes scene representation one step further into simultaneous localization and mapping, where the robot must build and use the same map at the same time while its own pose remains uncertain.
Section References
NVIDIA. Isaac ROS Visual SLAM documentation. https://nvidia-isaac-ros.github.io/repositories_and_packages/isaac_ros_visual_slam/index.html
Practical visual-inertial odometry component for robotics navigation.
Nerfstudio documentation. https://docs.nerf.studio/
Maintained framework for neural scene representations used in real2sim and visualization workflows.
Open3D. Geometry and pipelines documentation. https://www.open3d.org/docs/release/
Practical geometry processing reference for robotics scene memory.
Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for each scene representation you plan to use? If any one is missing, the design is not yet ready for a robot replay log.
There is no universal scene representation for robotics. Strong systems route each query to the representation whose assumptions match the action and risk.
Design a scene-memory stack for a mobile manipulator in a kitchen. Assign separate representations for localization, collision checking, object reasoning, visual replay, and simulation export.