Section 48.2: Sensors and sensor fusion in AVs | Building Embodied AI: From Perception to Autonomous Action

An autonomous vehicle sees the world through three sensors that disagree and one fusion algorithm that must not.
On the perception stack

**Figure 48.2A**: Camera, LiDAR, and radar each measure a different physical quantity; fusion is the art of making their disagreements informative rather than dangerous.

Big Picture

The AV perception stack rests on three complementary modalities. Cameras give dense color and texture but no direct range. LiDAR gives precise 3D geometry but is sparse and weather-sensitive. Radar gives direct Doppler velocity and works in rain and fog but is angularly coarse. Sensor fusion combines them so that the weakness of one is covered by the strength of another, and 3D object detectors turn the fused evidence into the object detections that downstream tracking consumes.

This section develops the perception stack as a concrete contract: ingest synchronized and calibrated sensor streams, produce 3D objects with class, pose, extent, and velocity, and quantify the result against a labeled benchmark. The recurring discipline is calibration. Fusion is only meaningful when every sensor's measurement can be expressed in a shared coordinate frame, and the worked example projects a LiDAR point into the camera image to make that frame transformation explicit.

Theory

Sensor modalities

Modality	Variants and parameters	Strengths	Weaknesses
Camera	monocular, stereo (depth from disparity), fisheye (wide FOV, strong distortion)	dense texture, color, sign and lane reading	no direct range (mono), poor in low light, glare
LiDAR	64 to 128 beams, 10 to 20 Hz, mechanical or solid-state	accurate 3D geometry, range to ~200 m	sparse at distance, degraded by rain, snow, dust
Radar	77 GHz automotive, FMCW, direct Doppler	direct radial velocity, robust to weather and lighting	coarse angular resolution, clutter, multipath
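The "depth from disparity" entry for stereo cameras follows the standard rectified-stereo relation $Z = f B / d$. The sketch below assumes a rectified image pair; the focal length, baseline, and disparity values are illustrative and not taken from any particular rig.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by a rectified stereo pair.

    Returns None when the disparity is non-positive (no valid match,
    or a point effectively at infinity).
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px

# Illustrative rig: 1200 px focal length, 0.3 m baseline (assumed values).
for d in (36.0, 18.0, 9.0, 8.0):
    print(f"disparity {d:4.1f} px -> depth {depth_from_disparity(1200.0, 0.3, d):.1f} m")
```

Halving the disparity doubles the depth, and at 9 px a one-pixel matching error moves the estimate from 40 m to 45 m, which is why stereo range accuracy falls off with distance.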

Fusion levels

Fusion is categorized by where in the pipeline modalities are combined.

Sensor-level (low-level) fusion combines raw measurements, for example painting LiDAR points with camera pixels before detection. Maximum information, maximum calibration and synchronization sensitivity.
Early (feature-level) fusion combines learned features from each modality inside a single network, as in bird's-eye-view fusion.
Late (object-level) fusion runs an independent detector per modality and merges the resulting object lists. Robust and modular, but discards cross-modal cues that only exist before detection.
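As a rough illustration of late fusion, the sketch below merges two object lists by greedy nearest-center association. The detection format (a dict with `center` and `score`), the function name `late_fuse`, and the 2 m gate are assumptions made for this example; production systems typically use IoU or Mahalanobis gating with an optimal assignment step.

```python
import numpy as np

def late_fuse(dets_a, dets_b, max_dist=2.0):
    """Merge two object lists by greedy nearest-center association.

    Each detection is a dict with 'center' (x, y in metres, shared frame)
    and 'score'. A matched pair keeps the higher-scoring detection;
    unmatched detections from either list pass through unchanged.
    """
    fused, used_b = [], set()
    for a in sorted(dets_a, key=lambda d: -d["score"]):
        best_j, best_d = None, max_dist
        for j, b in enumerate(dets_b):
            if j in used_b:
                continue
            d = float(np.linalg.norm(np.subtract(a["center"], b["center"])))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            fused.append(a)                      # no partner within the gate
        else:
            used_b.add(best_j)
            b = dets_b[best_j]
            fused.append(a if a["score"] >= b["score"] else b)
    fused.extend(b for j, b in enumerate(dets_b) if j not in used_b)
    return fused

# Hypothetical per-modality outputs, already expressed in one shared frame.
camera_dets = [{"center": (10.0, 2.0), "score": 0.6},
               {"center": (30.0, -4.0), "score": 0.4}]
lidar_dets = [{"center": (10.3, 2.1), "score": 0.9},
              {"center": (55.0, 0.0), "score": 0.7}]
fused = late_fuse(camera_dets, lidar_dets)
print(len(fused), "fused objects")
```

The two detections near (10, 2) collapse into one object, while the camera-only and LiDAR-only detections both survive. Nothing in this merge can use a cue that neither detector reported, which is the cost of late fusion described above.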

3D object detection

A 3D detector consumes one or more of these sensor streams and outputs oriented 3D boxes. Representative families: PointPillars groups the LiDAR point cloud into vertical pillars and runs a 2D backbone for speed; CenterPoint detects object centers in a bird's-eye-view heatmap and regresses box and velocity; BEVFormer lifts multi-camera features into a shared bird's-eye-view grid using spatial and temporal attention. BEVFormer itself is camera-only, but the bird's-eye-view grid it produces is the same kind of representation in which camera and LiDAR features can be combined in one frame.
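The pillar idea can be sketched in a few lines: drop the height coordinate when choosing a cell, so every point falls into one vertical column of a bird's-eye-view grid. The code below only counts points per pillar as a stand-in for the learned per-pillar features that PointPillars computes; the grid extents and cell size are assumed values.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Count LiDAR points per vertical pillar on a bird's-eye-view grid.

    points: (N, 3) array of x, y, z in metres. Height z is ignored when
    choosing the pillar, which is what makes the grid 2D.
    """
    pts = np.asarray(points, dtype=float)
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((pts[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((pts[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)   # drop out-of-range points
    grid = np.zeros((nx, ny), dtype=int)
    np.add.at(grid, (ix[keep], iy[keep]), 1)               # accumulate duplicates correctly
    return grid

cloud = [[10.1, 0.1, -1.0],   # two points in the same pillar, different heights
         [10.2, 0.2, 0.5],
         [25.0, -3.0, 0.0],
         [100.0, 0.0, 0.0]]   # beyond the grid, discarded
grid = pillarize(cloud)
print(grid.shape, grid.sum())
```

The resulting array is an ordinary 2D image, which is why a fast 2D convolutional backbone can run on it.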

Calibration Is The Real Subject

Every fusion method assumes the extrinsic transform between sensors and the camera intrinsics are known and stable. A 1-degree extrinsic error at 50 m is roughly a 0.9 m lateral error: enough to place a LiDAR return for a pedestrian onto the empty road beside them. Most "fusion failures" are calibration or time-synchronization failures in disguise.
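The 0.9 m figure comes from simple trigonometry: a pure rotation error of angle $\theta$ displaces a point at range $r$ sideways by about $r \tan\theta$. A minimal check, with a hypothetical helper name:

```python
import math

def lateral_error_m(range_m, angle_error_deg):
    """Sideways displacement caused by a pure rotational extrinsic error."""
    return range_m * math.tan(math.radians(angle_error_deg))

print(round(lateral_error_m(50.0, 1.0), 2))   # 0.87
```

The error grows linearly with range, so a calibration that looks fine on nearby objects can still misplace distant ones.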

Mechanism

To express a LiDAR point in the camera image you chain three steps. The extrinsic matrix $T_{\text{cam}\leftarrow\text{lidar}}$ (a 4x4 homogeneous matrix holding a rotation and a translation) moves the point from the LiDAR frame to the camera frame; the intrinsic matrix $K$ (focal lengths $f_x, f_y$ and principal point $c_x, c_y$) maps the 3D camera-frame point to homogeneous pixel coordinates; finally you divide by depth (the perspective division). Points at or behind the camera ($z \le 0$ in the camera frame) must be discarded before division.
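Written out, the chain reads as follows. The camera-frame symbols $X_c, Y_c, Z_c$ are introduced here for illustration, and lens distortion is assumed to be zero or already removed.

$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
= T_{\text{cam}\leftarrow\text{lidar}}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad
T_{\text{cam}\leftarrow\text{lidar}} =
\begin{bmatrix} R & t \\ 0^\top & 1 \end{bmatrix}
$$

$$
u = f_x \frac{X_c}{Z_c} + c_x,
\qquad
v = f_y \frac{Y_c}{Z_c} + c_y,
\qquad
\text{valid only for } Z_c > 0 .
$$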

Worked Example

The example projects a LiDAR point $(x, y, z)$ into a camera image using the calibration matrices, the single most common operation in any fusion pipeline.

import numpy as np

# Camera intrinsics K (3x3): focal lengths and principal point in pixels.
K = np.array([[1200.0,    0.0, 960.0],
              [   0.0, 1200.0, 540.0],
              [   0.0,    0.0,   1.0]])

# Extrinsic: rotation R (3x3) and translation t (3,) mapping LiDAR -> camera frame.
# LiDAR axes: x forward, y left, z up. Camera axes: x right, y down, z forward.
# t is the LiDAR origin expressed in the camera frame: 0.3 m below and 1.6 m behind the camera.
R = np.array([[ 0.0, -1.0,  0.0],   # camera x (right)   = -LiDAR y (left)
              [ 0.0,  0.0, -1.0],   # camera y (down)    = -LiDAR z (up)
              [ 1.0,  0.0,  0.0]])  # camera z (forward) =  LiDAR x (forward)
t = np.array([0.0, 0.3, -1.6])

def project_lidar_to_image(point_lidar, R, t, K):
    """Project a LiDAR point (x, y, z) to pixel (u, v); return None if behind camera."""
    p_cam = R @ np.asarray(point_lidar, dtype=float) + t   # camera-frame 3D point
    if p_cam[2] <= 1e-6:                                    # behind the image plane
        return None
    uvw = K @ p_cam                                         # apply intrinsics
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]                 # perspective division
    return float(u), float(v)

# A LiDAR return 20 m ahead, 1 m to the left, at ground-ish height.
pt = (20.0, 1.0, -0.5)
pix = project_lidar_to_image(pt, R, t, K)
print("pixel (u, v):", None if pix is None else (round(pix[0], 1), round(pix[1], 1)))

Expected output: a pixel slightly left of and below the image center, (u, v) = (894.8, 592.2). Because the LiDAR sits 1.6 m behind the camera in this setup, any point whose LiDAR-forward distance is 1.6 m or less, including every negative value, lands at or behind the camera and the function returns None. This guard prevents such points from being projected onto the image.

Library Shortcut

In practice use the dataset SDKs that ship calibrated transforms: the nuScenes devkit and the Waymo Open Dataset tools expose per-sensor intrinsics and extrinsics along with the ego poses needed for ego-motion compensation. For detectors, MMDetection3D and OpenPCDet provide reference PointPillars and CenterPoint implementations, and the BEVFormer authors' public code builds on MMDetection3D. Keep the same projection convention (frame order, distortion model) across your pipeline.

Practical Recipe

Establish a shared coordinate frame and verify extrinsics by projecting LiDAR onto camera images for known objects.
Time-synchronize streams (hardware trigger or timestamp interpolation); record the residual time skew.
Choose a fusion level deliberately: late fusion to start (modular, debuggable), early or sensor-level once calibration is trusted.
Run a reference 3D detector and report mAP and velocity error against the benchmark.
Save one artifact: calibration, sync residuals, detector config, and overlaid projection images for visual audit.
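For the synchronization step, one simple software-side approach is to pair each camera frame with the nearest LiDAR sweep and log the signed residual. The sketch below assumes timestamps in seconds on a common clock and at least two sorted LiDAR sweeps; `match_nearest` is a hypothetical helper, not part of any dataset SDK.

```python
import numpy as np

def match_nearest(cam_ts, lidar_ts):
    """Pair each camera timestamp with the nearest LiDAR sweep.

    lidar_ts must be sorted ascending and contain at least two entries.
    Returns (indices into lidar_ts, signed skew in seconds), where
    skew = camera time minus matched LiDAR time.
    """
    cam_ts = np.asarray(cam_ts, dtype=float)
    lidar_ts = np.asarray(lidar_ts, dtype=float)
    idx = np.clip(np.searchsorted(lidar_ts, cam_ts), 1, len(lidar_ts) - 1)
    left, right = lidar_ts[idx - 1], lidar_ts[idx]
    idx = np.where(cam_ts - left <= right - cam_ts, idx - 1, idx)  # pick the closer neighbor
    return idx, cam_ts - lidar_ts[idx]

lidar_ts = [0.00, 0.10, 0.20, 0.30]   # 10 Hz sweeps (illustrative)
cam_ts = [0.02, 0.14, 0.31]
idx, skew = match_nearest(cam_ts, lidar_ts)
print(idx.tolist(), np.round(skew, 3).tolist())
```

Recording `skew` per frame gives the residual that the recipe asks you to keep alongside the calibration.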

Common Failure Mode

A small clock skew between camera and LiDAR makes a fast crossing pedestrian appear smeared or doubled after fusion. The detector confidence drops, the tracker stutters, and the planner over-brakes. Always log the per-frame time skew; it is the first thing to check when fused detections degrade only for fast objects.
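The size of the smear is just speed times skew, which is why slow objects hide the problem. A quick estimate, using assumed speeds and an assumed 50 ms skew:

```python
def skew_offset_m(speed_mps, skew_s):
    """Apparent displacement between two sensors' views of a moving object."""
    return speed_mps * skew_s

# Illustrative speeds in m/s; 0.05 s is an assumed camera-LiDAR skew.
for label, speed in [("walking pedestrian", 1.5),
                     ("running pedestrian", 4.0),
                     ("crossing cyclist", 8.0)]:
    print(f"{label}: {skew_offset_m(speed, 0.05):.2f} m")
```

At walking speed the offset stays under 10 cm and is easy to miss, while a cyclist at 8 m/s is displaced by about 0.4 m, enough to separate the LiDAR cluster from the camera box.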

Practical Example

A team adding radar to a camera-LiDAR stack should fuse radar Doppler at the object level first: associate radar tracks to detected boxes and use radial velocity to disambiguate stopped versus creeping vehicles. This recovers velocity in heavy rain where LiDAR returns thin out, without rebuilding the detector.
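A minimal sketch of that object-level association follows. It assumes radar returns are given as rows of (x, y, radial speed) in the same frame as the boxes and that the radial speeds are already compensated for ego motion; the gate radius, the 0.2 m/s threshold, and both function names are illustrative choices.

```python
import numpy as np

def radial_speed_for_box(box_center, radar_points, gate_m=1.5):
    """Median radial speed of radar returns within gate_m of a box center.

    radar_points: rows of (x, y, radial_speed). Returns None when no
    return falls inside the gate.
    """
    pts = np.asarray(radar_points, dtype=float)
    if pts.size == 0:
        return None
    dist = np.linalg.norm(pts[:, :2] - np.asarray(box_center, dtype=float), axis=1)
    near = pts[dist <= gate_m]
    if len(near) == 0:
        return None
    return float(np.median(near[:, 2]))

def motion_label(radial_speed, stopped_below=0.2):
    """Coarse label from radial speed alone (m/s)."""
    if radial_speed is None:
        return "unknown"
    return "stopped" if abs(radial_speed) < stopped_below else "moving"

radar = [[20.2, 0.1, 0.6], [19.8, -0.3, 0.5], [40.0, 3.0, -8.0]]
print(motion_label(radial_speed_for_box((20.0, 0.0), radar)))   # moving
print(motion_label(radial_speed_for_box((60.0, 0.0), radar)))   # unknown
```

Radar measures only the velocity component along the line of sight, so a vehicle crossing perpendicular to the sensor can read as near zero; treat the label as evidence to combine with the tracker, not as a final decision.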

Memory Hook

Camera knows what, LiDAR knows where, radar knows how fast. Fusion's job is to keep all three answers about the same object.

Research Frontier

Bird's-eye-view fusion (BEVFormer and successors) and occupancy networks are pushing toward a unified scene representation that detection, prediction, and planning can all read. The frontier question is robustness: do these dense representations degrade gracefully under sensor dropout and calibration drift, or do they fail silently?

Self Check

Can you state, for a single LiDAR point, the two matrices and the one division needed to place it in the image, and the condition under which the projection is invalid? If not, revisit the mechanism box before fusing modalities.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
nuScenes devkit, Waymo Open Dataset tools	Calibrated multi-sensor data and transforms	Use their extrinsics rather than re-deriving frames by hand.
MMDetection3D, OpenPCDet	Reference 3D detectors (PointPillars, CenterPoint; BEVFormer's public code builds on MMDetection3D)	Reproduce a published number before customizing.
Projection overlay script	Visual calibration audit	Overlay LiDAR on camera every release to catch extrinsic drift.

Cross-References

Section 48.1 frames perception inside the full loop, 48.3 consumes these detections for tracking and prediction, and 48.5 explores learned representations that fuse and predict jointly.

Mini Lab

Perturb the extrinsic rotation in the worked example by 1 degree and re-project a point at 50 m. Measure the pixel shift, convert it back to a ground-plane lateral error, and state whether that error would move a pedestrian off the sidewalk.

Section References

Lang et al., "PointPillars: Fast Encoders for Object Detection from Point Clouds," CVPR 2019. Yin et al., "Center-based 3D Object Detection and Tracking" (CenterPoint), CVPR 2021. Li et al., "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images," ECCV 2022.

These define the detector families and the bird's-eye-view fusion paradigm referenced above.

Key Takeaway

Three modalities disagree by design; fusion turns that disagreement into a more complete and robust scene, but only when calibration and time synchronization are correct. Master the LiDAR-to-camera projection first, because every fusion method depends on it.

Exercise 48.2.1

Extend the projection function to handle a batch of LiDAR points and to color each surviving pixel by depth. Then design a controlled experiment comparing late fusion (merge two detector outputs) against the painted-point sensor-level approach on identical frames, measuring mAP and the false-negative rate for pedestrians.