Section 8.2: Cameras, depth (stereo/structured light/ToF), LiDAR | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 8.2A: Camera, structured-light depth, time-of-flight, and spinning LiDAR compared on the same indoor scene, highlighting each sensor's range, resolution, and failure modes under bright ambient light.

Big Picture

Cameras, depth sensors (stereo, structured light, ToF), and LiDAR are one lens on sensors, perception hardware, and state estimation. We study them because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments.

This section develops the technical contract for Cameras, depth (stereo/structured light/ToF), LiDAR into a usable mental model. First we define the object of study, then we connect it to the agent loop, then we test it with a compact implementation.

The key question for cameras, depth sensors, and LiDAR is practical: what must the agent know, what can it observe, what action is available, and what evidence shows that the action worked under the stated conditions?

Action Is The Test

A representation earns its place when it measurably changes what the agent does. For each sensor in this section, keep asking which decision becomes easier, safer, or more reliable.

Theory

For Cameras, depth (stereo/structured light/ToF), LiDAR, the practical design rule is to make the interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.

Mechanism

The mechanism for cameras, depth sensors, and LiDAR is the contract between representation and action. Name what enters the module, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.

Worked Example: Stereo Depth and the Point Cloud

A stereo pair recovers depth from disparity. The same world point projects to slightly different horizontal pixels in the left and right images, and that shift in pixels is the disparity $d$. With baseline $b$ (the distance between the two camera centers) and focal length $f$ in pixels, the depth is

$$z = \frac{b\,f}{d}$$

This relation carries two practical consequences. First, depth resolution degrades with the square of distance: since $z \propto 1/d$, a one-pixel disparity error causes a depth error that grows like $z^2/(bf)$, which is why stereo rigs are accurate up close and vague far away, and why a wider baseline buys range. Second, zero disparity means infinite depth, so distant or textureless regions where disparity cannot be measured produce holes. LiDAR sidesteps disparity entirely by timing a laser pulse, giving direct metric range with near-uniform accuracy across distance, at the cost of sparser angular sampling and moving parts. Once depth per pixel is known, back-projection through the intrinsics turns the depth image into a 3D point cloud.
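The quadratic growth can be checked by differentiating the depth relation with respect to disparity. Here $\delta d$ and $\delta z$ denote small errors in disparity and depth; they are symbols introduced for this sketch, and the numbers that follow use illustrative values rather than a specific rig.

$$\frac{\partial z}{\partial d} = -\frac{b\,f}{d^{2}} = -\frac{z^{2}}{b\,f}, \qquad |\delta z| \approx \frac{z^{2}}{b\,f}\,|\delta d|$$

Taking $b = 0.12$ m and $f = 700$ px (so $bf = 84$ m·px), a one-pixel disparity error gives $|\delta z| \approx 0.034$ m at $z = 1.68$ m and $|\delta z| \approx 0.21$ m at $z = 4.2$ m. Doubling $b$ halves both figures, which is the sense in which a wider baseline buys range.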

# Stereo depth z = b*f/d, then back-project a depth image to a 3D point cloud.
import numpy as np

b = 0.12      # baseline between the two cameras, meters
f = 700.0     # focal length, pixels
cx, cy = 320.0, 240.0   # principal point, pixels

# A tiny 2x3 disparity map (pixels). Larger disparity => nearer surface.
disparity = np.array([[40.0, 35.0, 30.0],
                      [50.0, 45.0, 20.0]])
H, W = disparity.shape
us, vs = np.meshgrid(np.arange(W), np.arange(H))

z = b * f / disparity            # metric depth per pixel
x = (us - cx) * z / f            # back-project to the camera frame
y = (vs - cy) * z / f
cloud = np.stack([x, y, z], axis=-1).reshape(-1, 3)

print("depth range (m):", round(z.min(), 3), "to", round(z.max(), 3))
print("first 3 points (m):")
print(np.round(cloud[:3], 3))

Code Fragment 8.2.1 turns a disparity map into a metric point cloud. The nearest pixel (disparity 50) sits at 1.68 m and the farthest (disparity 20) at 4.2 m, making the inverse $z \propto 1/d$ relationship concrete: halving the disparity doubles the depth and roughly quadruples the depth uncertainty.
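To see how far the linearized uncertainty can be trusted, the following minimal sketch compares the first-order estimate $z^2/(bf)$ with the exact depth change when the matcher reports a disparity one pixel too small. The function names are hypothetical, the focal length and the 0.12 m baseline are taken from Code Fragment 8.2.1, and the 0.24 m baseline is an assumed alternative added only for comparison.

```python
# Sketch: linearized vs exact stereo depth error for a one-pixel disparity error.
# Function names are illustrative; f and the 0.12 m baseline come from the
# worked example, the 0.24 m baseline is an assumption for comparison.

def depth_from_disparity(b, f, d):
    """Stereo depth z = b*f/d (b in meters, f and d in pixels)."""
    return b * f / d


def first_order_depth_error(b, f, z, delta_d=1.0):
    """Linearized depth error |dz| ~= z^2 / (b*f) * |dd|."""
    return z * z / (b * f) * delta_d


f = 700.0
rows = []
for b in (0.12, 0.24):
    for z in (1.0, 2.0, 4.0, 8.0):
        d = b * f / z  # true disparity at this depth
        # Exact depth change when the measured disparity is one pixel short.
        exact = depth_from_disparity(b, f, d - 1.0) - z
        approx = first_order_depth_error(b, f, z)
        rows.append((b, z, d, approx, exact))

print("baseline_m  depth_m  disparity_px  approx_err_m  exact_err_m")
for b, z, d, approx, exact in rows:
    print(f"{b:10.2f}  {z:7.1f}  {d:12.1f}  {approx:12.3f}  {exact:11.3f}")
```

The exact error is always somewhat larger than the linearized one when disparity is underestimated, and the gap widens as disparity shrinks, so the first-order figure is best read as an optimistic bound at long range.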

Library Shortcut

The fragment should keep modality-specific noise visible: pixel ray, depth sample, point coordinate, timestamp, and confidence. OpenCV and point-cloud tools scale the workflow once the geometry is correct.
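One way to keep those quantities together is a small record type that travels with every depth reading. The sketch below is a hypothetical schema, not a type from OpenCV or any point-cloud library; the field names, the `frame_id` string, and the 0-to-1 confidence convention are all assumptions made for illustration.

```python
# Sketch of a per-sample record (hypothetical schema, not a library type).
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DepthSample:
    u: float                   # pixel column
    v: float                   # pixel row
    depth_m: Optional[float]   # None means "not measured", never "free space"
    stamp_s: float             # acquisition time in seconds
    frame_id: str              # coordinate frame the sample is expressed in
    confidence: float          # assumed 0..1; meaning is set by the sensor driver

    def ray(self, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
        """Direction through the pixel, scaled so that its z component is 1."""
        return ((self.u - cx) / fx, (self.v - cy) / fy, 1.0)

    def point(self, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float, float]]:
        """Back-projected 3D point in the camera frame, or None for a hole."""
        if self.depth_m is None:
            return None
        rx, ry, _ = self.ray(fx, fy, cx, cy)
        return (rx * self.depth_m, ry * self.depth_m, self.depth_m)


hit = DepthSample(u=390.0, v=240.0, depth_m=2.0, stamp_s=12.345,
                  frame_id="camera_left", confidence=0.9)
hole = DepthSample(u=10.0, v=10.0, depth_m=None, stamp_s=12.345,
                   frame_id="camera_left", confidence=0.0)

print(hit.point(700.0, 700.0, 320.0, 240.0))
print(hole.point(700.0, 700.0, 320.0, 240.0))
```

Representing a missing reading as `None` rather than zero forces downstream code to handle holes explicitly instead of interpreting them as a surface at the camera origin.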

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.

Common Failure Mode

The common mistake with cameras, depth sensors, and LiDAR is to celebrate the component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.

Practical Example

A robotics team using cameras, depth sensors, or LiDAR should log not only final success, but also intermediate observations, chosen actions, controller status, and recovery events. The logs reveal whether the method is solving the task or merely passing the easiest episodes.

Memory Hook

For cameras, depth sensors (stereo, structured light, ToF), and LiDAR, the useful test is simple: could a teammate point to the log line, plot, or trace that proves the idea changed the agent's next action?

Research Frontier

For Cameras, depth (stereo/structured light/ToF), LiDAR, treat frontier claims as hypotheses until they expose enough detail to reproduce the result: data boundary, embodiment, controller interface, evaluation panel, and failure cases.

Two recent results significantly advance monocular metric depth, which has historically been the weakest depth source for robotics due to scale ambiguity. Depth Anything v2 (Yang et al., 2024) achieves metric monocular depth estimation up to 70 m on a ViT-L backbone by combining large-scale synthetic pretraining with carefully curated real-image pseudo-labels, improving both range and fine-grained edge accuracy over its predecessor. Metric3D v2 (Hu et al., 2024) addresses the calibration dependency problem by canonicalizing depth prediction across arbitrary camera intrinsics, enabling zero-shot metric depth on robots that use multiple lens configurations or lack accurate calibration files. Both results move monocular depth closer to the actionable, scale-certified evidence that collision checking and contact planning require.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode for Cameras, depth (stereo/structured light/ToF), LiDAR? If not, the system boundary is still too vague.

Production Pattern

This section sits inside the Part II robotics contract: geometry defines where things are, kinematics defines what motion is possible, dynamics defines what motion costs, control defines how errors are corrected, and sensing defines what the agent can know on time.

Camera, depth, and LiDAR pipelines should carry intrinsics, extrinsics, range limits, and missing-data behavior. This makes the section useful to students, builders, and researchers at the same time: the idea has an intuitive role, a formal interface, a runnable check, and a failure mode that can be reproduced.

Mechanism To Watch

For Cameras, depth (stereo/structured light/ToF), LiDAR, state estimation converts imperfect observations into a belief usable by control. Preserve calibration, covariance, timestamp, frame, dropout behavior, and latency.

Library Choices And Verification Checks

Tool or Library	What It Handles	Verification Check
OpenCV	handles camera models, calibration, projection, and vision preprocessing	Verify intrinsics, distortion, image timestamp, and frame-to-camera transform.
ROS 2 robot_localization	fuses odometry, IMU, GPS, pose, and twist streams through ROS estimation nodes	Verify covariance, frame IDs, timestamps, and rejected measurement counts.
FilterPy	teaches and prototypes Kalman, extended Kalman, unscented, and particle filters	Verify process noise, measurement noise, innovation, and covariance growth.
Kalibr	calibrates multi-camera and camera-IMU rigs, including intrinsics, extrinsics, and the camera-IMU time offset	Verify reprojection error and compare the estimated extrinsics against a hand-measured baseline on one small case.
Open3D	processes point clouds and RGB-D images, including I/O, downsampling, registration, and visualization	Verify the library point cloud against the hand-built back-projection on one small case.

Use this recipe when turning a camera, depth, or LiDAR pipeline into code, a simulator experiment, or a robot diagnostic. The point is not to use every library. The point is to keep the hand-built baseline and the maintained-tool path comparable.

Define each sensor message with units, frame, timestamp source, calibration file, and covariance meaning.
Run a static test, a slow-motion test, and a dropout test before fusing streams.
Compare the hand filter with FilterPy or ROS 2 robot_localization using identical measurements and noise settings.
Log innovation, covariance, delayed messages, rejected measurements, and downstream control effect.
Treat perception output as a belief with uncertainty, not as ground truth handed to the controller.
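The recipe above can be sketched with a hand-built scalar Kalman filter on a single range reading. This is a teaching baseline under stated assumptions (a static target, Gaussian noise, a 3-sigma innovation gate), not the FilterPy or robot_localization API; the function name, the log keys, and the measurement values are invented for illustration.

```python
# Hand-built scalar Kalman filter for one range stream (illustrative baseline).
# Assumptions: static target, process noise q, measurement noise r, 3-sigma gate.

def kalman_range_step(x, P, z, q, r, gate=3.0):
    """One predict/update step. z is a range in meters, or None for a dropout."""
    P = P + q  # predict: the target is static, only the uncertainty grows
    if z is None:
        return x, P, {"status": "dropout", "innovation": None, "P": P}
    innovation = z - x
    S = P + r  # innovation variance
    if innovation * innovation > gate * gate * S:
        # Reject outliers instead of letting them drag the estimate.
        return x, P, {"status": "rejected", "innovation": innovation, "P": P}
    K = P / S  # Kalman gain
    x = x + K * innovation
    P = (1.0 - K) * P
    return x, P, {"status": "fused", "innovation": innovation, "P": P}


# Placeholder measurements in meters: one dropout and one gross outlier.
measurements = [2.02, 1.98, None, 2.01, 5.50, 2.00]
x, P = 2.0, 0.25          # initial belief: 2 m with 0.5 m standard deviation
log = []
for z in measurements:
    x, P, entry = kalman_range_step(x, P, z, q=1e-4, r=0.02 ** 2)
    log.append(entry)

for z, entry in zip(measurements, log):
    print(z, entry["status"], entry["innovation"], round(entry["P"], 6))
print("final estimate (m):", round(x, 3))
```

Feeding the same measurements and noise settings to a maintained library and comparing the per-step log is one way to perform the comparison the recipe asks for; the dropout and rejected entries are the rows most worth checking.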

Evidence Gate

For Cameras, depth (stereo/structured light/ToF), LiDAR, compare methods only through one saved artifact that preserves the inputs, outputs, units, timestamps, latency budget, configuration, seed, metric definition, and failure labels relevant to this section. The comparison is meaningful only when the same script evaluates the same panel.
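A saved artifact of this kind could be as small as one JSON record per method. The sketch below is hypothetical: the field names, method labels, panel identifiers, and metric values are placeholders chosen for illustration, not measurements and not a schema defined elsewhere in this book.

```python
# Sketch of a comparison artifact (hypothetical fields and placeholder values).
import hashlib
import json

# Two records are comparable only if these fields match exactly.
COMPARE_KEYS = ("panel_id", "metric_name", "units", "latency_budget_ms")


def make_record(method, panel_id, metric_name, metric_value, units,
                latency_budget_ms, seed, config, failure_labels):
    record = {
        "method": method,
        "panel_id": panel_id,
        "metric_name": metric_name,
        "metric_value": metric_value,
        "units": units,
        "latency_budget_ms": latency_budget_ms,
        "seed": seed,
        "config": config,
        "failure_labels": sorted(failure_labels),
    }
    # Hash the canonical JSON form so later edits to the record are detectable.
    canonical = json.dumps(record, sort_keys=True)
    record["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


def comparable(a, b):
    """True when both records were scored on the same panel with the same metric."""
    return all(a[k] == b[k] for k in COMPARE_KEYS)


stereo = make_record("stereo_baseline", "indoor_panel_v1", "mean_abs_range_error",
                     0.034, "m", 50, 0,
                     {"baseline_m": 0.12, "focal_px": 700.0}, ["low_texture"])
lidar = make_record("lidar_baseline", "indoor_panel_v1", "mean_abs_range_error",
                    0.012, "m", 50, 0,
                    {"channels": 16}, ["specular_return"])

print(json.dumps(stereo, indent=2))
print("comparable:", comparable(stereo, lidar))
```

Refusing to compare records whose panel, metric, units, or latency budget differ is a cheap guard against the most common way two results end up side by side without being measured the same way.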

Exercise Extension

Extend the section exercise by adding one perturbation specific to Cameras, depth (stereo/structured light/ToF), LiDAR and one latency or uncertainty check. Save the result in the EvidenceRecord schema, then explain which library output you trust and why.

Vision and depth failures often come from exposure, rolling shutter, stereo texture, ToF multipath, lidar sparsity, reflective surfaces, or occlusion. Reproduce one calibration or range case by hand before retraining a perception model.

Technical Core

Cameras, depth sensors, and LiDAR all turn light into geometry, but they fail for different physical reasons. RGB cameras give dense appearance at low cost, stereo estimates depth from disparity, structured light projects a known pattern, time-of-flight estimates range from travel time, and LiDAR measures range by active scanning. Figure 8.2.T summarizes the chain this section must preserve when moving from a teaching example to a real embodied system.

Figure 8.2.T: The technical core for Cameras, depth (stereo/structured light/ToF), LiDAR connects assumptions, model, algorithm, evidence, and failure analysis.
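Of the active sensors above, pulsed time-of-flight has the simplest range model: the pulse travels to the surface and back, so range is half the round-trip time multiplied by the speed of light. The sketch below assumes an ideal single return in vacuum and ignores multipath, modulation, and detector effects; the function names are illustrative.

```python
# Pulsed time-of-flight range model (idealized: single return, no multipath).
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def range_from_round_trip(t_s):
    """Range in meters from a round-trip travel time in seconds."""
    return 0.5 * C_M_PER_S * t_s


def round_trip_for_range(r_m):
    """Round-trip travel time in seconds for a target at r_m meters."""
    return 2.0 * r_m / C_M_PER_S


for r in (0.5, 2.0, 10.0, 50.0):
    t_ns = round_trip_for_range(r) * 1e9
    print(f"range {r:5.1f} m -> round trip {t_ns:8.2f} ns")

# Range error caused by a 1 ns timing error, independent of distance.
print("1 ns timing error (m):", round(range_from_round_trip(1e-9), 4))
```

Because the timing error maps to range error through a constant factor, the error in this idealized model does not grow with distance, in contrast to the quadratic growth of stereo depth error.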

Formal Object

For a pinhole camera, $u=f_xX/Z+c_x$ and $v=f_yY/Z+c_y$. Stereo depth uses $Z=fB/d$, where $B$ is the baseline, $d$ is the disparity in pixels, and $f$ is the focal length in pixels (taking $f_x=f_y=f$); this is the same relation as $z=bf/d$ in the worked example, with $B=b$ and $Z=z$. These equations explain the practical failure mode: small disparity errors at long range produce large depth errors, so a far obstacle can have a confident-looking pixel location and a weak range estimate.
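A short round trip through these equations makes the asymmetry concrete. The sketch below projects a hypothetical obstacle 20 m away, then recovers its depth from a disparity that is one pixel too small. The intrinsics and baseline are borrowed from the worked example earlier in the section; the obstacle position and function names are assumptions for illustration.

```python
# Pinhole projection and stereo depth for one far obstacle (illustrative values).

def project(X, Y, Z, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    if Z <= 0.0:
        raise ValueError("point must be in front of the camera")
    return fx * X / Z + cx, fy * Y / Z + cy


def back_project(u, v, Z, fx, fy, cx, cy):
    """Inverse of project() when the depth Z is known."""
    return (u - cx) * Z / fx, (v - cy) * Z / fy, Z


fx = fy = 700.0
cx, cy = 320.0, 240.0
B = 0.12  # stereo baseline, meters

X, Y, Z = 0.5, -0.2, 20.0            # hypothetical obstacle position, meters
u, v = project(X, Y, Z, fx, fy, cx, cy)
d_true = fx * B / Z                  # true disparity in pixels
Z_noisy = fx * B / (d_true - 1.0)    # depth if disparity is one pixel short

print("pixel (u, v):", (round(u, 2), round(v, 2)))
print("true disparity (px):", round(d_true, 2))
print("depth with 1 px disparity error (m):", round(Z_noisy, 2))
print("recovered point (m):",
      tuple(round(c, 2) for c in back_project(u, v, Z_noisy, fx, fy, cx, cy)))
```

The pixel location is unaffected by the disparity error, but the recovered point slides along the viewing ray by several meters, which is why a planner should carry range uncertainty separately from bearing.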

Depth sensor calibration and validation

Calibrate intrinsics, distortion, and extrinsics for every camera or range sensor.
Measure range error at near, mid, and far distances with matte, shiny, dark, and transparent objects.
Record missing-depth masks rather than filling holes silently.
Align RGB, depth, and LiDAR timestamps before projecting points into a shared frame.
Validate with a known target, then repeat after temperature change or mechanical remounting.
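The validation steps above can be summarized as a small range-error table that counts missing readings instead of hiding them. The trial values below are placeholders invented for illustration, not measurements from any sensor, and the function and key names are hypothetical.

```python
# Range-error table with explicit missing-data counts (placeholder trial data).
from statistics import mean

# (true range in m, surface type, measured range in m or None if no depth returned)
trials = [
    (0.5, "matte", 0.502), (0.5, "shiny", 0.510), (0.5, "transparent", None),
    (2.0, "matte", 2.010), (2.0, "dark", 2.060), (2.0, "shiny", None),
    (5.0, "matte", 5.120), (5.0, "dark", None), (5.0, "transparent", None),
]


def summarize(trials):
    """Group trials by true range; report mean absolute error and missing fraction."""
    grouped = {}
    for true_r, _surface, measured in trials:
        row = grouped.setdefault(true_r, {"errors": [], "missing": 0, "n": 0})
        row["n"] += 1
        if measured is None:
            row["missing"] += 1          # keep the hole visible in the summary
        else:
            row["errors"].append(abs(measured - true_r))
    return {
        r: {
            "mean_abs_err_m": mean(row["errors"]) if row["errors"] else None,
            "missing_frac": row["missing"] / row["n"],
        }
        for r, row in sorted(grouped.items())
    }


summary = summarize(trials)
print("range_m  mean_abs_err_m  missing_frac")
for r, row in summary.items():
    err = "n/a" if row["mean_abs_err_m"] is None else f"{row['mean_abs_err_m']:.3f}"
    print(f"{r:7.1f}  {err:>14}  {row['missing_frac']:12.2f}")
```

Reporting the missing fraction next to the error keeps a sensor from looking accurate at long range simply because it returned nothing for the hard surfaces.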

Technical Contract For Cameras, Depth, And LiDAR

Sensor Type	Strength	Failure Mode To Diagnose
RGB camera	Dense texture, color, and semantic cues.	Lighting shifts, motion blur, lens distortion, and scale ambiguity.
Stereo depth	Passive depth when texture and baseline are adequate.	Low texture, repeated patterns, reflective surfaces, and long-range disparity noise.
Structured light	Good short-range geometry for manipulation and inspection.	Sunlight, transparent objects, interference, and limited working distance.
Time-of-flight	Compact active depth with direct range measurement.	Multipath reflections, flying pixels near edges, ambient light, and mixed pixels.
LiDAR	Accurate range over larger spaces and outdoor navigation.	Sparse vertical resolution, rolling scan distortion, specular returns, and calibration drift.

Expected output is a range error table and a missing-data mask, not only a pretty point cloud. The mask matters because planners fail differently when an obstacle is measured as far away versus not measured at all.

Failure Mode To Test

A depth pipeline fails when it treats invalid pixels as free space, projects points through stale extrinsics, or compares RGB and range streams with mismatched timestamps.
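Two of these failures can be guarded with a few lines of code: a three-way classification that keeps invalid depth separate from free space, and a timestamp check before RGB and range data are combined. The sketch below is a minimal illustration; the range limits, stop distance, 20 ms tolerance, and function names are assumed values, not recommendations for a particular sensor.

```python
# Guards against two depth-pipeline failures (thresholds are illustrative).
import math


def classify_cell(depth_m, min_range_m, max_range_m, stop_distance_m):
    """Map one depth reading to 'unknown', 'occupied', or 'free'."""
    if depth_m is None or math.isnan(depth_m) or depth_m <= 0.0:
        return "unknown"               # invalid pixel: never report it as free
    if depth_m < min_range_m or depth_m > max_range_m:
        return "unknown"               # outside the sensor's rated range
    return "occupied" if depth_m <= stop_distance_m else "free"


def streams_in_sync(rgb_stamp_s, depth_stamp_s, tolerance_s=0.02):
    """True when two timestamps are close enough to fuse the frames."""
    return abs(rgb_stamp_s - depth_stamp_s) <= tolerance_s


readings = [0.8, 3.0, 0.0, float("nan"), None, 12.0]
for d in readings:
    print(d, "->", classify_cell(d, min_range_m=0.3, max_range_m=8.0,
                                 stop_distance_m=1.0))

print("in sync:", streams_in_sync(10.000, 10.015))
print("in sync:", streams_in_sync(10.000, 10.050))
```

A planner that receives `unknown` can slow down or request another view, whereas a planner that receives a silently filled value has no way to tell that the measurement never happened.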

Section References

Core references for Cameras, depth (stereo/structured light/ToF), LiDAR: Modern Robotics; Murray, Li, and Sastry; Siciliano et al.; LaValle; and official documentation for Drake, MuJoCo, Pinocchio, CasADi, python-control, GTSAM, ROS 2, and OpenCV as applicable.

Use these references to check notation, frame conventions, units, solver assumptions, and maintained-library behavior.

Yang, L. et al. (2024). Depth Anything V2. arXiv. https://arxiv.org/abs/2406.09414

Scales monocular metric depth to 70 m using a ViT-L backbone and improved pseudo-label training. Read to understand how backbone scale and synthetic data quality jointly determine metric accuracy for outdoor robotic ranges.

Hu, M. et al. (2024). Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation. arXiv. https://arxiv.org/abs/2404.15506

Achieves zero-shot metric depth across arbitrary camera intrinsics by separating canonical depth prediction from camera-specific projection. Read alongside the calibration material in this section to understand when metric depth can substitute for a hardware depth sensor.

Key Takeaway

A camera, depth sensor, or LiDAR is useful when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 8.2.1

Design a method-matched experiment for Cameras, depth (stereo/structured light/ToF), LiDAR. Specify the environment, observations, actions, metric, one perturbation, and the library output you would compare against the hand-built baseline.