Section 28.7: Scene representations for robotics: SLAM, real2sim, manipulation | Building Embodied AI: From Perception to Autonomous Action

"Perception earns its keep when the next action gets safer, faster, or easier to debug."
A Patient Embodied AI Agent

Big Picture

This section is about choosing the right scene memory format for the job. SLAM needs pose and map consistency, real2sim needs editable geometry and materials, and manipulation needs local contact and object state that survive interaction.

Problem First: Why This Representation Exists

A static computer-vision system can stop when it names an object or produces a clean visualization. An embodied system cannot. The robot needs a representation that is tied to coordinates, uncertainty, latency, and an action interface, because a late or uncalibrated result can be more dangerous than no result at all.

For this section, the useful mental model is an action contract. The perception module receives sensor evidence, estimates a compact state, exposes confidence and timing, and lets a planner or controller decide whether the next action is allowed. This is the bridge from coordinate frames and sensor estimation to the closed-loop evaluation discipline used in sim-to-real transfer.
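The action contract can be made concrete with a small gate between perception and planning. The following is a minimal sketch, not a reference implementation: the `PerceptionEstimate` schema, the `action_allowed` helper, the `base_link` frame name, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptionEstimate:
    """Compact state handed from perception to a planner (hypothetical schema)."""
    frame_id: str      # coordinate frame the state is expressed in
    stamp_s: float     # sensor capture time, in seconds
    state: tuple       # e.g. (x, y, z) of a target, expressed in frame_id
    confidence: float  # 0..1, as reported by the estimator


def action_allowed(est, now_s, expected_frame="base_link",
                   max_age_s=0.2, min_confidence=0.8):
    """Return (allowed, reason). Thresholds are illustrative, not tuned."""
    if est.frame_id != expected_frame:
        return False, "wrong_frame"
    if now_s - est.stamp_s > max_age_s:
        return False, "stale"
    if est.confidence < min_confidence:
        return False, "low_confidence"
    return True, "ok"


fresh = PerceptionEstimate("base_link", 10.00, (0.4, 0.0, 0.2), 0.93)
late = PerceptionEstimate("base_link", 9.50, (0.4, 0.0, 0.2), 0.93)
print(action_allowed(fresh, now_s=10.05))  # (True, 'ok')
print(action_allowed(late, now_s=10.05))   # (False, 'stale')
```

The gate returns a reason string along with the decision so that a refused action leaves a label in the log rather than a silent no-op.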

Action Is The Unit Of Meaning

A perception output becomes embodied knowledge only when it can change an admissible action, a recovery choice, or a safety margin. If the same command is issued with and without the representation, the representation is not yet part of the control loop.
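One way to test this is a simple ablation: run the policy with and without the representation and check whether any action differs. The sketch below uses a toy policy; `choose_speed` and `representation_influences_action` are hypothetical names introduced only for illustration.

```python
def choose_speed(obstacle_distance_m=None):
    """Toy policy: slow down when a known obstacle is close (hypothetical)."""
    if obstacle_distance_m is not None and obstacle_distance_m < 1.0:
        return "slow"
    return "cruise"


def representation_influences_action(policy, observations):
    """True if withholding the representation changes at least one action."""
    return any(policy(obs) != policy(None) for obs in observations)


print(representation_influences_action(choose_speed, [0.4, 2.5]))  # True
print(representation_influences_action(choose_speed, [2.5, 3.0]))  # False
```

The second call shows the case the text warns about: on those scenarios the representation never changes the command, so it is not yet exercised by the control loop.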

Figure 28.7.1 should be read as a handoff diagram for scene representations: sensor evidence, geometric representation, uncertainty, latency, and action consumer are separate failure points.

Figure 28.7.1: Choosing scene memory for control, simulation, and manipulation. The dashed feedback path reminds the reader that perception quality is judged by action consequences and replayable diagnostics.

Mathematical Core

A robot scene memory is best viewed as a set of query functions, not a single universal format.

Formal Object

$\mathcal M=\{q_{\mathrm{pose}},q_{\mathrm{free}},q_{\mathrm{contact}},q_{\mathrm{object}},q_{\mathrm{render}}\}$

Different tasks ask different queries. A renderer asks for appearance, a planner asks for free space, a manipulator asks for contact geometry, and a language planner asks for object relations. A mature system routes each query to the representation that can answer it safely.
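The query-function view can be sketched as a small registry in which each named query is owned by exactly one backend. This is an illustration under stated assumptions: the `SceneMemory` class, its `ask` method, and the toy backends are hypothetical stand-ins, not an existing library interface.

```python
from typing import Callable, Dict


class SceneMemory:
    """Scene memory as named query functions, mirroring the set M above."""

    def __init__(self, queries: Dict[str, Callable]):
        self._queries = dict(queries)

    def ask(self, name, *args):
        # Fail loudly instead of answering from an unrelated representation.
        if name not in self._queries:
            raise LookupError(f"no representation owns query '{name}'")
        return self._queries[name](*args)


# Toy backends, standing in for a SLAM graph and an occupancy grid.
occupied_cells = {(2, 3), (2, 4)}
memory = SceneMemory({
    "pose": lambda: (1.0, 0.5, 0.0),  # x, y, yaw in the map frame
    "free": lambda cell: cell not in occupied_cells,
})

print(memory.ask("pose"))          # (1.0, 0.5, 0.0)
print(memory.ask("free", (2, 3)))  # False
print(memory.ask("free", (0, 0)))  # True
```

Asking for an unregistered query such as `"render"` raises `LookupError`, which is the intended behavior: an unanswered query is safer than one answered by the wrong representation.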

Representation selection procedure

Write the downstream queries before choosing the map format.
Separate safety-critical geometry from visualization-only memory.
Keep object state updateable after contact, occlusion, or task progress.
For real2sim, store provenance so synthetic scenes can be traced back to capture data and edits.
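The provenance step in particular is easy to skip and cheap to implement. A minimal sketch, assuming a hypothetical `SimAsset` record and made-up capture and edit names:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimAsset:
    """A simulated object that remembers where it came from (hypothetical)."""
    name: str
    capture_id: str  # which capture session produced the geometry
    edits: List[str] = field(default_factory=list)

    def apply_edit(self, description):
        # Record every manual or automatic change, in order.
        self.edits.append(description)

    def lineage(self):
        return " -> ".join([f"capture:{self.capture_id}", *self.edits])


mug = SimAsset("mug_01", capture_id="kitchen_scan_A")
mug.apply_edit("decimated collision mesh")
mug.apply_edit("friction set by hand")
print(mug.lineage())
# capture:kitchen_scan_A -> decimated collision mesh -> friction set by hand
```

With the lineage attached to the asset, a surprising simulation result can be traced to either the capture or a specific edit.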

Which Representation Should Own The Query

Query	Owning Representation	Control Risk
Pose tracking	SLAM graph, visual-inertial odometry	Map inconsistency corrupts every downstream query.
Collision planning	Occupancy, ESDF, mesh, verified cloud	Rendering fields may be non-conservative.
Task reasoning	Object-centric scene graph	Relations can become stale after manipulation.
Visual replay	NeRF or 3DGS	Photorealism can hide missing control semantics.

Worked Miniature

The routing table in this example should be read as a division of semantic labor across representations. The key lesson is that a visually rich model can own operator rendering, while collision checking and grasp planning still demand geometry with stricter safety semantics.

Code Fragment 28.7.1: The routing table prevents one representation from being used for every job. `avoid_collision`, `render_operator_view`, and `plan_grasp` each demand different semantics and failure checks.
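A minimal sketch of what such a routing table could look like is given here. Only the three query names come from the caption; the representation names (`esdf`, `object_mesh`, `3dgs`), the `RENDER_ONLY` set, and the `route` helper are illustrative assumptions, not any library's API.

```python
# Representations treated as visualization-only in this sketch (an assumption).
RENDER_ONLY = {"nerf", "3dgs"}

ROUTES = {
    "avoid_collision":      {"owner": "esdf",        "safety_critical": True},
    "plan_grasp":           {"owner": "object_mesh", "safety_critical": True},
    "render_operator_view": {"owner": "3dgs",        "safety_critical": False},
}


def route(query, routes=ROUTES):
    """Return the representation that owns a query, or refuse unsafe routing."""
    entry = routes[query]  # KeyError if no representation owns the query
    if entry["safety_critical"] and entry["owner"] in RENDER_ONLY:
        raise ValueError(f"{query} must not be answered by {entry['owner']}")
    return entry["owner"]


for query in ROUTES:
    print(f"{query} -> {route(query)}")
```

The check inside `route` encodes the lesson of the table: a rendering field may own `render_operator_view`, but a table that assigns it to `avoid_collision` is rejected before any plan is computed.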

Library Shortcut

ROS 2, SLAM systems, Open3D, Nerfstudio, and simulator import pipelines already provide pieces of this routing. The engineering task is to keep provenance, timestamps, frame transforms, and query ownership explicit.

Failure Mode To Test

A real2sim scene that looks correct can still be physically wrong if mass, friction, joint limits, collision geometry, or object poses are not audited.
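This audit can start as a simple completeness check before any physics validation. The sketch below assumes assets are plain dictionaries; the field names in `REQUIRED_PHYSICS` and the example drawer values are hypothetical.

```python
# Fields that must be set before an asset is trusted in simulation (assumed names).
REQUIRED_PHYSICS = ("mass_kg", "friction", "joint_limits",
                    "collision_geometry", "pose")


def audit_asset(asset):
    """Return the physical fields that are missing or unset for one asset."""
    return [key for key in REQUIRED_PHYSICS if asset.get(key) is None]


drawer = {
    "mass_kg": 2.1,
    "friction": None,  # never measured
    "collision_geometry": "drawer_hull.obj",
    "pose": (0.6, 0.0, 0.4),
    # "joint_limits" was never exported
}
print(audit_asset(drawer))  # ['friction', 'joint_limits']
```

A completeness check does not show that the values are right, only that nobody left them at a simulator default by accident.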

Practical Example

A Boston Dynamics-style inspection robot might use visual-inertial SLAM for pose, an ESDF for safe footstep or body clearance, object memory for task state, and splats or NeRFs for operator visualization.

Memory Hook

For any scene representation in this section, the perception result must answer three questions: what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.

Debugging And Evaluation

Evaluate the representation inside the same action loop that will use it. The report should include the sensor stream, calibration version, frame transform, model checkpoint or library version, latency distribution, action candidate set, chosen action, and failure label. This makes the comparison like-for-like: the baseline and the library shortcut are judged by the same script on the same test panel.
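The report fields listed above map naturally onto one record per rollout. The following sketch assumes a hypothetical `RolloutRecord` schema with made-up values; the median and maximum are only a coarse stand-in for a full latency distribution.

```python
import json
import statistics
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RolloutRecord:
    """One replayable evaluation record (hypothetical schema)."""
    sensor_stream: str
    calibration_version: str
    frame_transform: str
    model_version: str
    latencies_ms: List[float]
    action_candidates: List[str]
    chosen_action: str
    failure_label: str

    def latency_summary(self):
        return {"median_ms": statistics.median(self.latencies_ms),
                "max_ms": max(self.latencies_ms)}


record = RolloutRecord(
    sensor_stream="bags/run_017",  # hypothetical path
    calibration_version="calib-v3",
    frame_transform="camera->base_link",
    model_version="detector-0.4.1",
    latencies_ms=[12.0, 15.0, 14.0, 40.0, 13.0],
    action_candidates=["grasp_top", "grasp_side", "wait"],
    chosen_action="wait",
    failure_label="none",
)

# One JSON line per rollout keeps baseline and candidate runs comparable.
line = json.dumps({**asdict(record), **record.latency_summary()})
print(line)
```

Writing the baseline and the candidate through the same record type is what keeps the two runs comparable field by field.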

A good debugging run varies one factor at a time. Perturb lighting, occlusion, calibration, motion blur, viewpoint, object pose, or update rate, then record whether the action changed for the right reason. That single-factor habit is what turns a failed rollout into a useful engineering artifact.
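The single-factor habit can be enforced by generating the test configurations rather than writing them by hand. This is a sketch with hypothetical factor names and a toy policy standing in for a full rollout.

```python
BASELINE = {"lighting": "nominal", "occlusion": 0.0, "update_rate_hz": 30}
PERTURBATIONS = {"lighting": "dim", "occlusion": 0.5, "update_rate_hz": 5}


def single_factor_runs(baseline, perturbations):
    """Yield (factor, config) pairs that differ from the baseline in one key."""
    for factor, value in perturbations.items():
        config = dict(baseline)
        config[factor] = value
        yield factor, config


def toy_policy(config):
    # Stand-in for a full rollout: stop when the view is heavily occluded.
    return "stop" if config["occlusion"] >= 0.5 else "go"


baseline_action = toy_policy(BASELINE)
results = {}
for factor, config in single_factor_runs(BASELINE, PERTURBATIONS):
    results[factor] = toy_policy(config) != baseline_action
    print(factor, "changed action" if results[factor] else "no change")
```

Because each configuration differs from the baseline in exactly one key, any change in the chosen action can be attributed to that factor.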

Research Frontier

The current frontier is hybrid scene memory: SLAM for pose, neural fields or splats for dense visual memory, object-centric graphs for reasoning, and verified geometry for control. The hard part is keeping those layers synchronized as the robot acts.
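A first step toward synchronization is simply knowing which layers have not been refreshed since the robot last touched the scene. A minimal sketch, assuming each layer exposes a last-update timestamp; the layer names and the `stale_layers` helper are hypothetical.

```python
def stale_layers(layer_stamps, last_contact_s):
    """Names of layers last updated before the most recent contact event."""
    return sorted(name for name, stamp in layer_stamps.items()
                  if stamp < last_contact_s)


# Last update time of each layer, in seconds (made-up values).
layer_stamps = {
    "slam_pose": 42.9,
    "esdf": 42.7,
    "object_graph": 41.2,
    "splats": 30.0,
}
print(stale_layers(layer_stamps, last_contact_s=42.0))
# ['object_graph', 'splats']
```

A timestamp comparison is only a freshness check. It does not establish that the updated layers agree with one another.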

What's Next

Chapter 29 takes scene representation one step further into simultaneous localization and mapping, where the robot must build and use the same map at the same time while its own pose remains uncertain.

Section References

NVIDIA. Isaac ROS Visual SLAM documentation. https://nvidia-isaac-ros.github.io/repositories_and_packages/isaac_ros_visual_slam/index.html

Practical visual-inertial odometry component for robotics navigation.

Nerfstudio documentation. https://docs.nerf.studio/

Maintained framework for neural scene representations used in real2sim and visualization workflows.

Open3D. Geometry and pipelines documentation. https://www.open3d.org/docs/release/

Practical geometry processing reference for robotics scene memory.

Self Check

Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for each scene representation in your stack? If any one is missing, the design is not yet ready for a robot replay log.

Key Takeaway

There is no universal scene representation for robotics. Strong systems route each query to the representation whose assumptions match the action and risk.

Exercise 28.7.1

Design a scene-memory stack for a mobile manipulator in a kitchen. Assign separate representations for localization, collision checking, object reasoning, visual replay, and simulation export.