"Depth is useful when scale errors are smaller than the robot's clearance and contact margins."
A Patient Embodied AI Agent
Depth estimation and metric scale turn visual evidence into distances that can support collision checking, grasp approach, landing, docking, and navigation. Depth is useful for action only when its scale, frame, and failure modes are explicit.
Problem First: Why This Representation Exists
A visually plausible depth map can be dangerous when metric scale, calibration, or uncertainty is wrong. Embodied use needs depth in a named frame with error bounds tied to the robot body.
The contract here maps pixels to metric state: depth source, camera intrinsics, extrinsics, uncertainty, timestamp, and the planner or controller that consumes the resulting geometry.
Depth becomes embodied knowledge when an error bound changes clearance, grasp approach, foot placement, or stopping distance.
Figure 27.3.1 should be read as a metric-scale contract: depth source, scale calibration, frame transform, uncertainty, and action consumer determine whether the robot can reach or avoid safely.
Mathematical Core
The pinhole camera model lifts a depth pixel into a 3D camera-frame point.
$X_c=\frac{(u-c_x)z}{f_x},\quad Y_c=\frac{(v-c_y)z}{f_y},\quad Z_c=z,\quad z_{\mathrm{stereo}}=\frac{fB}{d}$
The first three equations use metric depth directly. The stereo equation shows why small disparity errors become large distance errors when disparity $d$ is small, which is why far obstacles and reflective surfaces deserve extra caution.
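To make the disparity sensitivity concrete: differentiating $z=fB/d$ gives $|\delta z|\approx\frac{z^2}{fB}|\delta d|$, so depth error grows with the square of range. The sketch below evaluates that first-order bound. The focal length, 10 cm baseline, and quarter-pixel matching error are illustrative assumptions, not values from a specific sensor, and the function names are hypothetical.

```python
def stereo_depth(f_px, baseline_m, disparity_px):
    """Depth from disparity: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px


def depth_error_first_order(f_px, baseline_m, z_m, disparity_err_px):
    """First-order depth error: |dz| ~= z^2 / (f * B) * |dd|."""
    return z_m ** 2 / (f_px * baseline_m) * abs(disparity_err_px)


# Assumed rig: 615 px focal length, 0.10 m baseline, 0.25 px matching error.
f_px, baseline_m, disparity_err_px = 615.0, 0.10, 0.25

for z_m in (1.0, 5.0, 10.0):
    err_m = depth_error_first_order(f_px, baseline_m, z_m, disparity_err_px)
    print(f"range {z_m:5.1f} m -> approx depth error {err_m:.3f} m")
```

Because the bound scales with $z^2$, the same quarter-pixel error costs 100 times more depth accuracy at 10 m than at 1 m under these assumptions.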
- Check the camera intrinsics and depth units before any planning call.
- Reject missing, saturated, or physically impossible depth pixels.
- Back-project task-relevant pixels into the camera frame, then transform them into the robot frame.
- Compare the resulting clearance against controller limits and uncertainty margins.
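The checklist above can be sketched as a small pipeline. This is a minimal illustration, not a production implementation: the helper names, the depth limits, the camera-to-robot rotation and offset, and the 3-sigma margin are all assumptions chosen for the example. The rotation assumes a camera optical frame (x right, y down, z forward) mounted on a robot base frame (x forward, y left, z up).

```python
import math


def is_valid_depth(z_m, z_min_m=0.1, z_max_m=10.0):
    """Reject missing (None, NaN, 0) and physically implausible depths."""
    return z_m is not None and math.isfinite(z_m) and z_min_m <= z_m <= z_max_m


def back_project(u, v, z_m, fx, fy, cx, cy):
    """Pinhole back-projection of one pixel into the camera frame."""
    return ((u - cx) * z_m / fx, (v - cy) * z_m / fy, z_m)


def to_robot_frame(p_c, R_rc, t_rc):
    """Apply p_r = R_rc * p_c + t_rc using plain Python sequences."""
    return tuple(
        sum(R_rc[i][j] * p_c[j] for j in range(3)) + t_rc[i] for i in range(3)
    )


def clearance_ok(distance_m, required_m, sigma_m, k=3.0):
    """Accept only if the distance minus a k-sigma margin still clears."""
    return distance_m - k * sigma_m >= required_m


# Hypothetical intrinsics and extrinsics (assumptions for illustration).
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0
R_rc = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
t_rc = (0.10, 0.0, 0.50)  # camera origin in the robot frame, meters

z_m = 1.20
if is_valid_depth(z_m):
    p_c = back_project(350.0, 260.0, z_m, fx, fy, cx, cy)
    p_r = to_robot_frame(p_c, R_rc, t_rc)
    forward_m = p_r[0]  # distance along the robot's forward axis
    print(p_r, clearance_ok(forward_m, required_m=1.0, sigma_m=0.05))
```

Keeping the validity check, the back-projection, the frame transform, and the margin test as separate functions makes each step individually loggable, which matters when a replay has to explain why a motion was accepted or refused.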
| Design Choice | Use When | Control Risk |
|---|---|---|
| Stereo | Outdoor robots and textured scenes | Low texture and far range create unstable disparity. |
| RGB-D or time-of-flight | Indoor manipulation and tabletop mapping | Reflective, transparent, or black materials can corrupt depth. |
| Monocular depth | Semantic priors and fallback estimates | Metric scale may drift without calibration or known references. |
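One common way to anchor monocular depth to known references is median ratio scaling: compare relative depths against a few points whose metric depth is trusted (for example, a fiducial at a measured distance) and take the median ratio as the scale. The sketch below assumes the monocular output differs from metric depth by a single scale factor with no offset, which does not hold for every model; the function name, sample values, and three-point minimum are hypothetical.

```python
import statistics


def estimate_scale(relative_depths, reference_depths_m):
    """Median ratio between trusted metric depths and relative depths.

    Returns (scale, spread), where spread is the largest fractional
    disagreement between any single ratio and the median.
    """
    ratios = [
        ref / rel
        for rel, ref in zip(relative_depths, reference_depths_m)
        if rel > 0 and ref > 0
    ]
    if len(ratios) < 3:
        raise ValueError("need at least three valid reference points")
    scale = statistics.median(ratios)
    spread = max(abs(r - scale) / scale for r in ratios)
    return scale, spread


# Illustrative values only: relative model output vs. measured distances.
relative = [0.50, 1.00, 1.50, 2.00]
reference_m = [1.02, 2.00, 3.03, 4.00]

scale, spread = estimate_scale(relative, reference_m)
print(f"scale {scale:.3f}, worst-case ratio disagreement {spread:.1%}")
```

The spread is worth logging next to the scale: a large disagreement among reference points suggests the single-scale assumption is failing and the scaled depth should not drive contact decisions.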
Worked Miniature
Code Fragment 27.3.1 implements the back-projection equations with one pixel and one depth value. This is the smallest useful check before passing points into Open3D or a planner.
```python
# Back-project one depth pixel through camera intrinsics.
# The output is a metric 3D point in the camera frame.
fx, fy = 615.0, 615.0          # focal lengths in pixels
cx, cy = 320.0, 240.0          # principal point in pixels
u, v, z_m = 350.0, 260.0, 1.20 # pixel coordinates and depth in meters
x_m = (u - cx) * z_m / fx
y_m = (v - cy) * z_m / fy
point_c = (round(x_m, 3), round(y_m, 3), z_m)
print(point_c)
```
The script prints (0.059, 0.039, 1.2). Interpret this tuple as a metric point in the camera frame, not an image-space feature. The lateral offsets are only about 6 cm and 4 cm, which is small enough to look harmless in pixels but large enough to change grasp clearance or foot placement.
OpenCV and Open3D replace these hand-written equations with camera-calibration, RGB-D image, and point-cloud constructors. The shortcut handles formats and vectorization, but it does not absolve the builder from checking units, missing depth, and transforms.
Depth maps often fail silently on transparent cups, glossy tabletops, thin chair legs, and motion blur. The robot must know when depth is absent or unreliable, not only when it is numerically present.
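One way to make "absent or unreliable" explicit is to report how much of a task-relevant region has usable depth before trusting any statistic from it. The sketch below assumes raw depth stored as integer millimeters with 0 meaning missing, a convention used by many RGB-D drivers but not all; the thresholds and the function name are hypothetical.

```python
import statistics


def roi_depth_report(raw_mm, depth_scale=0.001, z_min_m=0.2, z_max_m=5.0,
                     min_valid_fraction=0.8):
    """Summarize depth validity over a region of interest.

    raw_mm: flat list of raw integer depth values (assumed millimeters,
    0 = missing). Returns a dict with the valid fraction, the median of
    valid depths in meters (or None), and a usable flag.
    """
    valid_m = [
        r * depth_scale
        for r in raw_mm
        if r > 0 and z_min_m <= r * depth_scale <= z_max_m
    ]
    fraction = len(valid_m) / len(raw_mm) if raw_mm else 0.0
    return {
        "valid_fraction": fraction,
        "median_m": statistics.median(valid_m) if valid_m else None,
        "usable": fraction >= min_valid_fraction,
    }


# Illustrative region over a glossy surface: dropouts (0) and a saturated value.
roi = [1200, 1210, 0, 0, 0, 1195, 65535, 1205]
report = roi_depth_report(roi)
print(report)
```

Here half the pixels are missing or out of range, so the region is flagged as not usable even though a plausible median exists. That is the distinction between depth being numerically present and depth being trustworthy.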
A drone landing system can accept monocular depth for exploratory terrain scoring, but final descent should require scale-checked stereo, lidar, or trusted altitude sensing with a conservative uncertainty margin.
For depth estimation and metric scale, the perception result must answer what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.
Debugging And Evaluation
Evaluate depth through geometry-sensitive actions: record depth source, calibration version, transform, predicted clearance, selected trajectory, contact or collision outcome, and scale-error label.
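The fields listed above can be captured as one structured record per decision. The following is a sketch of such a record, assuming a JSON-lines log; the class name, field names, and sample values are hypothetical and should be adapted to the robot's own logging schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DepthDecisionRecord:
    """One replayable log entry for a depth-dependent action."""
    timestamp_s: float
    depth_source: str            # e.g. "stereo", "rgbd", "monocular"
    calibration_version: str
    transform_id: str            # camera-to-robot transform used
    predicted_clearance_m: float
    clearance_sigma_m: float
    selected_trajectory_id: str
    outcome: str                 # e.g. "clear", "contact", "collision"
    scale_error_label: str       # e.g. "none", "underestimate", "overestimate"


# Illustrative values only.
record = DepthDecisionRecord(
    timestamp_s=12.50,
    depth_source="stereo",
    calibration_version="calib-v3",
    transform_id="cam_front_to_base",
    predicted_clearance_m=0.42,
    clearance_sigma_m=0.03,
    selected_trajectory_id="traj-007",
    outcome="clear",
    scale_error_label="none",
)

line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Making the record immutable and serializing it with sorted keys keeps log lines stable across runs, which simplifies diffing two replays of the same scenario.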
Perturb textureless surfaces, reflective materials, range, lighting, and calibration offsets, then check whether the planner margin catches the depth error before execution.
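A perturbation sweep of this kind can be sketched for the simplest case, a pure depth-scale error. The code below asks which scale errors would let an unsafe motion through a fixed planner margin; the acceptance rule, the distances, and the function names are assumptions for illustration, not a description of any particular planner.

```python
def accepts(reported_m, required_m, margin_m):
    """Hypothetical planner rule: reported distance minus margin must clear."""
    return reported_m - margin_m >= required_m


def unsafe_acceptances(true_m, required_m, margin_m, scales):
    """Return the scale factors for which an unsafe plan is accepted.

    A plan is unsafe when the true distance is below the requirement,
    and it is accepted when the scaled (erroneous) depth passes the rule.
    """
    missed = []
    for s in scales:
        reported_m = s * true_m
        if true_m < required_m and accepts(reported_m, required_m, margin_m):
            missed.append(s)
    return missed


# Illustrative case: true gap 0.45 m, requirement 0.50 m, margin 0.05 m.
scales = [0.9, 1.0, 1.1, 1.2, 1.3]
missed = unsafe_acceptances(0.45, 0.50, 0.05, scales)
print("scale errors not caught by the margin:", missed)
```

With these numbers the margin absorbs overestimates up to 20 percent but not 30 percent, which turns "is the margin big enough" into a measurable property of the depth source rather than a guess.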
Foundation depth models are improving fast, but robotics still needs calibrated scale, uncertainty, and temporal consistency. The open problem is not producing pretty depth; it is certifying when depth is good enough for contact or collision decisions.
Two recent results push monocular depth toward metric reliability. Depth Anything v2 (Yang et al., 2024) achieves metric monocular depth estimation at ranges up to 70 m using a ViT-L backbone, with significant gains on fine-grained detail and robustness to challenging lighting. Metric3D v2 (Hu et al., 2024) enables zero-shot metric depth across arbitrary camera intrinsics by disentangling canonical depth prediction from camera-specific projection, making it practical for robots that change lenses or operate without precise calibration files. Both results narrow the gap between depth estimation and the scale-certified evidence that planners and contact controllers require.
On the mapping side, 3D Gaussian-splatting representations are entering simultaneous localization and mapping. SplaTAM (Keetha et al., CVPR 2024) performs real-time 3D Gaussian-splatting SLAM with simultaneous tracking and map densification, yielding dense color and geometry at interactive rates. MonoGS (Matsuki et al., CVPR 2024) extends this to monocular cameras using photometric and depth loss, removing the need for a hardware depth sensor. The open problem shared across these systems is that Gaussian-splatting SLAM assumes a static scene; handling dynamic objects and moving cameras together in the same representation remains unsolved.
Section 27.4 adds the time dimension: once metric depth is established, optical flow reveals how quickly regions move and whether the robot needs to react before a slower reconstruction pipeline can respond.
Section References
OpenCV. Camera calibration and 3D reconstruction documentation. https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html
Primary implementation reference for calibration, projection, stereo, and pose routines.
Open3D. RGB-D images and point cloud documentation. https://www.open3d.org/docs/release/tutorial/geometry/rgbd_image.html
Shows the practical library path from depth images to point clouds.
Yang, L. et al. (2024). Depth Anything V2. arXiv. https://arxiv.org/abs/2406.09414
Scales monocular metric depth estimation to 70 m using a ViT-L backbone and improved synthetic-to-real training. Read to understand how pseudo-label quality and backbone size jointly determine metric accuracy at outdoor robotic ranges.
Hu, W. et al. (2024). Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation. arXiv. https://arxiv.org/abs/2404.15506
Achieves zero-shot metric depth across arbitrary camera intrinsics by canonicalizing depth before decoding. Read to understand how decoupling camera-space geometry from intrinsic parameters enables deployment without precise calibration files.
Keetha, N. et al. (2024). SplaTAM: Splat, Track and Map 3D Gaussians for Dense RGB-D SLAM. CVPR 2024. https://arxiv.org/abs/2312.02126
Demonstrates real-time Gaussian-splatting SLAM with simultaneous tracking and map densification. Read alongside the depth estimation material to understand how dense metric depth feeds directly into the Gaussian map update step.
Matsuki, H. et al. (2024). Gaussian Splatting SLAM. CVPR 2024. https://arxiv.org/abs/2312.06741
Extends Gaussian-splatting SLAM to monocular input using photometric and depth loss. Read to understand how monocular depth quality limits map scale accuracy and why metric depth models like Depth Anything v2 matter for SLAM pipelines.
Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for depth estimation and metric scale? If any one is missing, the section is not yet ready for a robot replay log.
Metric depth is a contract among pixels, intrinsics, units, transforms, and uncertainty; remove any one of those and the action estimate becomes suspect.
Given a camera with $f_x=600$, $c_x=320$, pixel $u=380$, and depth $z=2.0$ m, compute $X_c$. Then explain how a 10 percent depth-scale error changes a grasp clearance estimate.