Section 27.3: Depth estimation and metric scale | Building Embodied AI: From Perception to Autonomous Action

"Depth is useful when scale errors are smaller than the robot's clearance and contact margins."
A Patient Embodied AI Agent

Figure 27.3A: Depth becomes robotics evidence when pixels, intrinsics, scale, and clearance all agree. The illustration shows depth rays leaving a calibrated camera and landing on a tabletop object, turning pixels into metric clearance for a robot hand.

Big Picture

Depth estimation and metric scale turn visual evidence into distances that can support collision checking, grasp approach, landing, docking, and navigation. Depth is useful for action only when its scale, frame, and failure modes are explicit.

Problem First: Why This Representation Exists

A visually plausible depth map can be dangerous when metric scale, calibration, or uncertainty is wrong. Embodied use needs depth in a named frame with error bounds tied to the robot body.

The contract here maps pixels to metric state: depth source, camera intrinsics, extrinsics, uncertainty, timestamp, and the planner or controller that consumes the resulting geometry.

Action Is The Unit Of Meaning

Depth becomes embodied knowledge when an error bound changes clearance, grasp approach, foot placement, or stopping distance.

Figure 27.3.1 should be read as a metric-scale contract: depth source, scale calibration, frame transform, uncertainty, and action consumer determine whether the robot can reach or avoid safely.

Figure 27.3.1: From pixels and disparity to metric depth. The dashed feedback path reminds the reader that perception quality is judged by action consequences and replayable diagnostics.

Mathematical Core

The pinhole camera model lifts a depth pixel into a 3D camera-frame point.

Formal Object

$X_c=\frac{(u-c_x)z}{f_x},\quad Y_c=\frac{(v-c_y)z}{f_y},\quad Z_c=z,\quad z_{\mathrm{stereo}}=\frac{fB}{d}$

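The first three equations use metric depth $z$ directly. In the stereo equation, $f$ is the focal length in pixels, $B$ is the baseline in meters, and $d$ is the disparity in pixels. Differentiating gives $|\delta z| \approx \frac{z^2}{fB}|\delta d|$, so a fixed disparity error produces a depth error that grows with the square of distance. That is why far obstacles, where $d$ is small, deserve extra caution, as do reflective surfaces, where the disparity match itself is unreliable.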
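The sensitivity of stereo depth to disparity error can be checked numerically. The following is a minimal sketch, assuming a hypothetical rig with a 600-pixel focal length, a 0.10 m baseline, and a fixed half-pixel matching error; all three numbers and both function names are illustrative assumptions rather than the specification of any real sensor.

```python
# Illustrative stereo rig; these values are assumptions, not a real sensor spec.
F_PX = 600.0        # focal length in pixels
BASELINE_M = 0.10   # distance between the two cameras in meters
DISP_ERR_PX = 0.5   # assumed disparity matching error in pixels


def stereo_depth(disparity_px: float) -> float:
    """Depth from disparity: z = f * B / d."""
    if disparity_px <= 0.0:
        raise ValueError("disparity must be positive")
    return F_PX * BASELINE_M / disparity_px


def depth_error(disparity_px: float, err_px: float = DISP_ERR_PX) -> float:
    """Depth change when the true disparity is under-measured by err_px."""
    return stereo_depth(disparity_px - err_px) - stereo_depth(disparity_px)


# Sweep from a near target (large disparity) to a far one (small disparity).
for d in (60.0, 20.0, 6.0, 3.0):
    z = stereo_depth(d)
    print(f"d={d:5.1f} px  z={z:6.2f} m  error={depth_error(d):6.3f} m")
```

With these assumed numbers, the same half-pixel error costs under a centimeter at 1 m and about 4 m at 20 m, which is the kind of gap a clearance margin has to absorb.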

Depth-to-action validation

1. Check the camera intrinsics and depth units before any planning call.
2. Reject missing, saturated, or physically impossible depth pixels.
3. Back-project task-relevant pixels into the camera frame, then transform them into the robot frame.
4. Compare the resulting clearance against controller limits and uncertainty margins.
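The four validation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference pipeline: the depth scale, valid range, saturation code, extrinsics, three-sigma margin, and every function name are hypothetical choices made for the example.

```python
from dataclasses import dataclass

# All constants below are illustrative assumptions.
DEPTH_SCALE_M = 0.001          # raw depth unit in meters (1 mm per count)
Z_MIN_M, Z_MAX_M = 0.2, 4.0    # assumed valid sensor range
RAW_SATURATED = 65535          # assumed saturation code for 16-bit depth


@dataclass(frozen=True)
class Intrinsics:
    fx: float
    fy: float
    cx: float
    cy: float


def raw_to_meters(raw: int):
    """Steps 1 and 2: convert units, reject missing, saturated, or out-of-range depth."""
    if raw <= 0 or raw >= RAW_SATURATED:
        return None
    z = raw * DEPTH_SCALE_M
    return z if Z_MIN_M <= z <= Z_MAX_M else None


def back_project(u, v, z, k: Intrinsics):
    """Step 3a: pinhole back-projection into the camera frame."""
    return ((u - k.cx) * z / k.fx, (v - k.cy) * z / k.fy, z)


def to_robot_frame(p_c, rotation, translation):
    """Step 3b: p_r = R * p_c + t, with R given as three row tuples."""
    return tuple(
        sum(r * c for r, c in zip(row, p_c)) + t
        for row, t in zip(rotation, translation)
    )


def clearance_ok(clearance_m, required_m, sigma_m, k_sigma=3.0):
    """Step 4: accept only if clearance survives the uncertainty margin."""
    return clearance_m - k_sigma * sigma_m >= required_m


K = Intrinsics(615.0, 615.0, 320.0, 240.0)
# Assumed extrinsics: camera z (forward) maps to robot x, camera x (right)
# to robot -y, camera y (down) to robot -z; camera sits 0.10 m ahead, 0.50 m up.
R_RC = ((0.0, 0.0, 1.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0))
T_RC = (0.10, 0.0, 0.50)

z_m = raw_to_meters(1200)
point_r = to_robot_frame(back_project(350.0, 260.0, z_m, K), R_RC, T_RC)
forward_clearance_m = point_r[0]
print(tuple(round(c, 3) for c in point_r))
print(clearance_ok(forward_clearance_m, required_m=1.0, sigma_m=0.02))
```

Returning `None` for rejected pixels, rather than zero or a clamped value, forces the caller to handle absent depth explicitly instead of planning against a number that only looks valid.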

Depth Sources And Failure Modes

Depth Source	Use When	Control Risk
Stereo	Outdoor robots and textured scenes	Low texture and far range create unstable disparity.
RGB-D or time-of-flight	Indoor manipulation and tabletop mapping	Reflective, transparent, or black materials can corrupt depth.
Monocular depth	Semantic priors and fallback estimates	Metric scale may drift without calibration or known references.
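The monocular row's dependence on known references can be made concrete with a small scale-recovery sketch. It assumes that the monocular prediction is wrong by a single multiplicative factor and that a few pixels have trusted reference depths, for example from a marker of known size or a range sensor. The function name and the sample values are hypothetical, and a real system may also need an offset term or per-region correction.

```python
from statistics import median


def estimate_scale(predicted_m, reference_m):
    """Median ratio of reference to predicted depth (robust to one bad pair)."""
    ratios = [ref / pred for pred, ref in zip(predicted_m, reference_m) if pred > 0.0]
    if not ratios:
        raise ValueError("no valid reference pairs")
    return median(ratios)


# Hypothetical monocular predictions and trusted reference depths, in meters.
predicted = [0.80, 1.60, 2.40]
reference = [1.00, 2.02, 2.95]

scale = estimate_scale(predicted, reference)
corrected = [round(scale * p, 3) for p in predicted]
residuals = [round(r - c, 3) for r, c in zip(reference, corrected)]
print(scale, corrected, residuals)
```

The residuals that remain after rescaling are worth logging, because they give a first estimate of the uncertainty margin that the consuming planner should apply.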

Worked Miniature

Code Fragment 27.3.1 implements the back-projection equations with one pixel and one depth value. This is the smallest useful check before passing points into Open3D or a planner.

# Back-project one depth pixel through camera intrinsics.
# The output is a metric 3D point in the camera frame.
fx, fy = 615.0, 615.0
cx, cy = 320.0, 240.0
u, v, z_m = 350.0, 260.0, 1.20

x_m = (u - cx) * z_m / fx
y_m = (v - cy) * z_m / fy
point_c = (round(x_m, 3), round(y_m, 3), z_m)
print(point_c)

(0.059, 0.039, 1.2)

Read the output tuple as a metric point in the camera frame, not an image-space feature. The lateral offsets are about 6 cm and 4 cm: a 30-pixel and a 20-pixel offset look harmless in the image, but at 1.2 m they are large enough to change grasp clearance or foot placement.

Code Fragment 27.3.1: The variables `fx`, `fy`, `cx`, and `cy` turn pixel offsets into meters. Because every metric offset passes through these intrinsics, camera calibration, not only neural depth prediction, decides whether a robot can trust clearance.
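To see how much calibration matters, the same pixel can be back-projected with a deliberately wrong principal point. This is a minimal sketch: the 10-pixel error in `cx` is a hypothetical miscalibration chosen for illustration, and `back_project` is a helper name introduced here.

```python
def back_project(u, v, z_m, fx, fy, cx, cy):
    """Pinhole back-projection of one pixel into the camera frame (meters)."""
    return ((u - cx) * z_m / fx, (v - cy) * z_m / fy, z_m)


u, v, z_m = 350.0, 260.0, 1.20

# Nominal intrinsics from Code Fragment 27.3.1.
nominal = back_project(u, v, z_m, 615.0, 615.0, 320.0, 240.0)

# Hypothetical miscalibration: principal point off by 10 pixels in x.
shifted = back_project(u, v, z_m, 615.0, 615.0, 330.0, 240.0)

# Lateral disagreement between the two calibrations, in millimeters.
dx_mm = (shifted[0] - nominal[0]) * 1000.0
print(round(dx_mm, 1))
```

Under these assumptions the point moves sideways by roughly 2 cm even though the depth value is unchanged, and the shift grows linearly with depth.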

Library Shortcut

OpenCV and Open3D replace these hand-written equations with camera-calibration routines, RGB-D image types, and point-cloud constructors. The shortcut handles formats and vectorization, but it does not absolve the builder from checking units, missing depth, and transforms.

Failure Mode To Test

Depth maps often fail silently on transparent cups, glossy tabletops, thin chair legs, and motion blur. The robot must know when depth is absent or unreliable, not only when it is numerically present.
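One way to make absent depth explicit is to score the fraction of valid pixels in the region the action depends on. The sketch below is illustrative only: the valid range, the 80 percent threshold, the status labels, and the sample patch are assumptions, and the check catches dropouts but not depth that is plausible yet wrong, such as a reading taken through glass.

```python
import math


def valid_fraction(depth_rows, z_min_m=0.2, z_max_m=4.0):
    """Fraction of pixels with finite depth inside the assumed sensor range."""
    total = valid = 0
    for row in depth_rows:
        for z in row:
            total += 1
            if z is not None and math.isfinite(z) and z_min_m <= z <= z_max_m:
                valid += 1
    return valid / total if total else 0.0


def depth_status(depth_rows, min_valid=0.8):
    """Label a region so the planner never treats missing depth as free space."""
    return "usable" if valid_fraction(depth_rows) >= min_valid else "unreliable"


# Hypothetical 3x4 patch over a transparent cup: zeros and NaN mark dropouts.
patch = [
    [0.92, 0.93, 0.0, 0.0],
    [0.91, float("nan"), 0.0, 0.94],
    [0.92, 0.93, 0.93, 0.94],
]
print(round(valid_fraction(patch), 3), depth_status(patch))
```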

Practical Example

A drone landing system can accept monocular depth for exploratory terrain scoring, but final descent should require scale-checked stereo, lidar, or trusted altitude sensing with a conservative uncertainty margin.

Memory Hook

For depth estimation and metric scale, the perception result must answer what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.

Debugging And Evaluation

Evaluate depth through geometry-sensitive actions: record depth source, calibration version, transform, predicted clearance, selected trajectory, contact or collision outcome, and scale-error label.
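The fields listed above map naturally onto one log record per decision. The following is a sketch under stated assumptions: the class name, field names, and sample values are hypothetical, and a real system would likely store the full transform rather than a frame label.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DepthDecisionRecord:
    """One replayable record per depth-driven decision (hypothetical schema)."""
    timestamp_s: float
    depth_source: str
    calibration_version: str
    frame_transform: str
    predicted_clearance_m: float
    selected_trajectory: str
    outcome: str
    scale_error_label: str


# Sample values are invented for illustration.
record = DepthDecisionRecord(
    timestamp_s=1712.25,
    depth_source="rgbd",
    calibration_version="cal-v3",
    frame_transform="camera_to_base",
    predicted_clearance_m=0.042,
    selected_trajectory="approach_top_down",
    outcome="contact_ok",
    scale_error_label="none",
)

# One JSON object per line keeps the log easy to append to and replay.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```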

Vary surface texture, material reflectance, range, lighting, and calibration offsets, then check whether the planner margin catches the resulting depth error before execution.
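A perturbation test of this kind can be prototyped before touching hardware. The sketch below assumes that a depth-scale error multiplies the estimated clearance by the same factor, which holds for lateral offsets computed by pinhole back-projection. The function names, the 5 mm margin, and the clearance values are hypothetical.

```python
def planner_decision(true_clearance_m, scale_error, required_m, margin_m):
    """Return (accepted, safe): what the planner decides and what is actually true."""
    estimated_m = true_clearance_m * (1.0 + scale_error)
    accepted = estimated_m - margin_m >= required_m
    safe = true_clearance_m >= required_m
    return accepted, safe


def unsafe_acceptances(true_clearance_m, required_m, margin_m, scale_errors):
    """Scale errors for which the planner accepts a truly unsafe clearance."""
    return [
        e for e in scale_errors
        if planner_decision(true_clearance_m, e, required_m, margin_m) == (True, False)
    ]


# Hypothetical case: the true clearance (45 mm) is below the required 50 mm.
errors = [-0.1, 0.0, 0.1, 0.2, 0.3]
missed = unsafe_acceptances(0.045, required_m=0.05, margin_m=0.005, scale_errors=errors)
print(missed)
```

In this made-up case the margin absorbs scale errors up to 20 percent but lets a 30 percent overestimate through, which is the kind of boundary a replay log should record.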

Research Frontier

Foundation depth models are improving fast, but robotics still needs calibrated scale, uncertainty, and temporal consistency. The open problem is not producing pretty depth; it is certifying when depth is good enough for contact or collision decisions.

Two recent results push monocular depth toward metric reliability. Depth Anything V2 (Yang et al., 2024) is trained mainly for relative depth, using a teacher trained on synthetic images to pseudo-label unlabeled real images; its metric variants are obtained by fine-tuning on metric datasets, and the paper reports gains in fine-grained detail and robustness. Metric3D v2 (Hu et al., 2024) targets zero-shot metric depth across cameras with different intrinsics by predicting depth in a canonical camera space and mapping it back through the actual camera's focal length, which helps robots that change lenses but still assumes the intrinsics are known. Both results narrow the gap between depth estimation and the scale-certified evidence that planners and contact controllers require.

On the mapping side, 3D Gaussian-splatting representations are entering simultaneous localization and mapping. SplaTAM (Keetha et al., CVPR 2024) performs real-time 3D Gaussian-splatting SLAM with simultaneous tracking and map densification, yielding dense color and geometry at interactive rates. MonoGS (Matsuki et al., CVPR 2024) extends this to monocular cameras using photometric and depth loss, removing the need for a hardware depth sensor. The open problem shared across these systems is that Gaussian-splatting SLAM assumes a static scene; handling dynamic objects and moving cameras together in the same representation remains unsolved.

What's Next

Section 27.4 adds the time dimension: once metric depth is established, optical flow reveals how quickly regions move and whether the robot needs to react before a slower reconstruction pipeline can respond.

Section References

OpenCV. Camera calibration and 3D reconstruction documentation. https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html

Primary implementation reference for calibration, projection, stereo, and pose routines.

Open3D. RGB-D images and point cloud documentation. https://www.open3d.org/docs/release/tutorial/geometry/rgbd_image.html

Shows the practical library path from depth images to point clouds.

Yang, L. et al. (2024). Depth Anything V2. arXiv. https://arxiv.org/abs/2406.09414

Trains a teacher model on synthetic images, uses it to pseudo-label unlabeled real images, and trains student models on those labels; metric depth variants are obtained by fine-tuning. Read to understand how pseudo-label quality and backbone size jointly affect depth accuracy at robotic ranges.

Hu, M. et al. (2024). Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation. arXiv. https://arxiv.org/abs/2404.15506

Achieves zero-shot metric depth across cameras with different intrinsics by canonicalizing depth before decoding. Read to understand how decoupling depth prediction from focal length lets one model serve many cameras, given that the intrinsics are known.

Keetha, N. et al. (2024). SplaTAM: Splat, Track and Map 3D Gaussians for Dense RGB-D SLAM. CVPR 2024. https://arxiv.org/abs/2312.02126

Demonstrates real-time Gaussian-splatting SLAM with simultaneous tracking and map densification. Read alongside the depth estimation material to understand how dense metric depth feeds directly into the Gaussian map update step.

Matsuki, H. et al. (2024). Gaussian Splatting SLAM. CVPR 2024. https://arxiv.org/abs/2312.06741

Extends Gaussian-splatting SLAM to monocular input using photometric and depth loss. Read to understand how monocular depth quality limits map scale accuracy and why metric depth models like Depth Anything V2 matter for SLAM pipelines.

Self Check

Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for depth estimation and metric scale? If any one is missing, the section is not yet ready for a robot replay log.

Key Takeaway

Metric depth is a contract among pixels, intrinsics, units, transforms, and uncertainty; remove any one of those and the action estimate becomes suspect.

Exercise 27.3.1

Given a camera with $f_x=600$, $c_x=320$, pixel $u=380$, and depth $z=2.0$ m, compute $X_c$. Then explain how a 10 percent depth-scale error changes a grasp clearance estimate.