An autonomous vehicle sees the world through three sensors that disagree and one fusion algorithm that must not.
On the perception stack
The AV perception stack rests on three complementary modalities. Cameras give dense color and texture but no direct range. LiDAR gives precise 3D geometry but is sparse at distance and weather-sensitive. Radar gives direct Doppler velocity and works in rain and fog but is angularly coarse. Sensor fusion combines them so that the weakness of one is covered by the strength of another, and 3D object detectors turn the fused evidence into objects that downstream tracking can follow.
This section develops the perception stack as a concrete contract: ingest synchronized and calibrated sensor streams, produce 3D objects with class, pose, extent, and velocity, and quantify the result against a labeled benchmark. The recurring discipline is calibration. Fusion is only meaningful when every sensor's measurement can be expressed in a shared coordinate frame, and the worked example projects a LiDAR point into the camera image to make that frame transformation explicit.
Theory
Sensor modalities
| Modality | Variants and parameters | Strengths | Weaknesses |
|---|---|---|---|
| Camera | monocular, stereo (depth from disparity), fisheye (wide FOV, strong distortion) | dense texture, color, sign and lane reading | no direct range (mono), poor in low light, glare |
| LiDAR | 64 to 128 beams, 10 to 20 Hz, mechanical or solid-state | accurate 3D geometry, range to ~200 m | sparse at distance, degraded by rain, snow, dust |
| Radar | 77 GHz automotive, FMCW, direct Doppler | direct radial velocity, robust to weather and lighting | coarse angular resolution, clutter, multipath |
Fusion levels
Fusion is categorized by where in the pipeline modalities are combined.
- Sensor-level (low-level) fusion combines raw measurements, for example painting LiDAR points with camera pixels before detection. Maximum information, maximum calibration and synchronization sensitivity.
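- Early (feature-level) fusion combines learned features from each modality inside a single network, as in bird's-eye-view fusion. Some literature calls this mid-level or deep fusion and reserves "early" for the sensor-level case above.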
- Late (object-level) fusion runs an independent detector per modality and merges the resulting object lists. Robust and modular, but discards cross-modal cues that only exist before detection.
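As a rough illustration of the late (object-level) option in the list above, the sketch below merges two object lists by greedy nearest-center matching in the bird's-eye-view plane. The `Detection` type, the `merge_object_lists` helper, the 2 m gate, and the score-combination rule are all illustrative assumptions, not the method of any particular stack.

```python
from dataclasses import dataclass
import math


@dataclass
class Detection:
    """Hypothetical object-level detection in a shared ego frame (metres)."""
    x: float
    y: float
    label: str
    score: float


def merge_object_lists(primary, secondary, gate_m=2.0):
    """Greedy late fusion: pair each primary detection with the nearest
    unused secondary detection of the same class within gate_m metres."""
    merged, used = [], set()
    for det in primary:
        best_j, best_d = None, gate_m
        for j, other in enumerate(secondary):
            if j in used or other.label != det.label:
                continue
            d = math.hypot(det.x - other.x, det.y - other.y)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            merged.append(det)  # seen by one modality only
            continue
        used.add(best_j)
        other = secondary[best_j]
        # Assumed rule: keep the primary (e.g. LiDAR) position and combine
        # scores as if the two detectors were independent.
        fused_score = 1.0 - (1.0 - det.score) * (1.0 - other.score)
        merged.append(Detection(det.x, det.y, det.label, fused_score))
    # Secondary detections with no partner are kept as single-modality objects.
    merged.extend(o for j, o in enumerate(secondary) if j not in used)
    return merged


lidar_dets = [Detection(20.0, 1.0, "pedestrian", 0.6),
              Detection(35.0, -3.0, "car", 0.9)]
camera_dets = [Detection(20.4, 1.2, "pedestrian", 0.7),
               Detection(12.0, 4.0, "cyclist", 0.5)]

for det in merge_object_lists(lidar_dets, camera_dets):
    print(det)
```

The two pedestrian detections are merged into one object with a higher score, while the car and the cyclist survive as single-modality objects. Note what the merge cannot do: it never sees the raw pixels or points, which is exactly the cross-modal information late fusion gives up.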
3D object detection
Given sensor inputs, fused or single-modality, a detector outputs oriented 3D boxes. Representative families: PointPillars groups the LiDAR point cloud into vertical pillars, encodes them as a pseudo-image, and runs a 2D backbone for speed; CenterPoint detects object centers in a bird's-eye-view heatmap and regresses box and velocity; BEVFormer lifts multi-camera features into a shared bird's-eye-view grid using spatial and temporal attention, enabling camera-only detection in the same bird's-eye-view frame that LiDAR features occupy.
Every fusion method assumes the extrinsic transform between sensors and the camera intrinsics are known and stable. A 1-degree extrinsic rotation error displaces a point at 50 m by roughly 0.9 m laterally: enough to place a LiDAR return for a pedestrian onto the empty road beside them. Most "fusion failures" are calibration or time-synchronization failures in disguise.
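The lateral-error figure follows from simple geometry. As a quick check, assuming a pure rotation error $\delta\theta$ about the sensor's vertical axis and a target at range $d$ (both symbols introduced here for illustration):

$$
e_{\text{lat}} = d \tan(\delta\theta) = 50\,\text{m} \times \tan(1^\circ) \approx 50\,\text{m} \times 0.01746 \approx 0.87\,\text{m}
$$

The error grows linearly with range, so the same miscalibration that is barely visible at 10 m (about 0.17 m) becomes decisive at the distances where early detection matters most.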
To express a LiDAR point in the camera image you chain three steps. The extrinsic matrix $T_{\text{cam}\leftarrow\text{lidar}}$ (a 4x4 homogeneous matrix combining a rotation and a translation) moves the point from the LiDAR frame to the camera frame; the intrinsic matrix $K$ (focal lengths $f_x, f_y$ and principal point $c_x, c_y$) maps the 3D camera-frame point to homogeneous pixel coordinates; finally you divide by the camera-frame depth (the perspective division). Points at or behind the camera plane (camera-frame depth $\le 0$) must be discarded before the division.
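Written out, with $R$ and $t$ denoting the rotation and translation blocks of $T_{\text{cam}\leftarrow\text{lidar}}$ and the subscript $c$ (notation introduced here) marking camera-frame coordinates, the chain is:

$$
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
= R \begin{bmatrix} x \\ y \\ z \end{bmatrix} + t,
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
\qquad
u = f_x \frac{x_c}{z_c} + c_x,
\quad
v = f_y \frac{y_c}{z_c} + c_y,
\quad
z_c > 0 .
$$

This form assumes an ideal pinhole camera with zero skew and no lens distortion; a fisheye or strongly distorted lens needs its distortion model applied as well.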
Worked Example
The example projects a LiDAR point $(x, y, z)$ into a camera image using the calibration matrices, the single most common operation in any fusion pipeline.
```python
import numpy as np

# Camera intrinsics K (3x3): focal lengths and principal point in pixels.
K = np.array([[1200.0,    0.0, 960.0],
              [   0.0, 1200.0, 540.0],
              [   0.0,    0.0,   1.0]])

# Extrinsic: rotation R (3x3) and translation t (3,) mapping LiDAR -> camera frame.
# LiDAR axes: x forward, y left, z up. Camera axes: x right, y down, z forward.
# t is the LiDAR origin expressed in camera coordinates: 0.3 m below the
# camera (camera y points down) and 1.6 m behind it.
R = np.array([[0.0, -1.0,  0.0],   # camera x (right)   = -LiDAR y (left)
              [0.0,  0.0, -1.0],   # camera y (down)    = -LiDAR z (up)
              [1.0,  0.0,  0.0]])  # camera z (forward) =  LiDAR x (forward)
t = np.array([0.0, 0.3, -1.6])


def project_lidar_to_image(point_lidar, R, t, K):
    """Project a LiDAR point (x, y, z) to pixel (u, v); return None if behind camera."""
    p_cam = R @ np.asarray(point_lidar, dtype=float) + t  # camera-frame 3D point
    if p_cam[2] <= 1e-6:                                  # at or behind the camera plane
        return None
    uvw = K @ p_cam                                       # apply intrinsics
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]               # perspective division
    return float(u), float(v)


# A LiDAR return 20 m ahead, 1 m to the left, 0.5 m below the LiDAR origin.
pt = (20.0, 1.0, -0.5)
pix = project_lidar_to_image(pt, R, t, K)
print("pixel (u, v):", None if pix is None else (round(pix[0], 1), round(pix[1], 1)))
```
Expected output: a pixel coordinate slightly left of and below the image center at (960, 540), namely (u, v) = (894.8, 592.2). Pass a point with a negative LiDAR-forward coordinate, for example (-20.0, 1.0, -0.5), and the function returns None: this is the guard that prevents points behind the camera from being projected onto the image.
In practice use the dataset SDKs that ship calibrated transforms: the nuScenes devkit and the Waymo Open Dataset tools expose per-sensor intrinsics and extrinsics and handle ego-motion compensation. For detectors, MMDetection3D and OpenPCDet provide reference PointPillars, CenterPoint, and BEVFormer implementations. Keep the same projection convention (frame order, distortion model) across your pipeline.
Practical Recipe
- Establish a shared coordinate frame and verify extrinsics by projecting LiDAR onto camera images for known objects.
- Time-synchronize streams (hardware trigger or timestamp interpolation); record the residual time skew.
- Choose a fusion level deliberately: late fusion to start (modular, debuggable), early or sensor-level once calibration is trusted.
- Run a reference 3D detector and report mAP and velocity error against the benchmark.
- Save one artifact bundle per run: calibration, sync residuals, detector config, and overlaid projection images for visual audit.
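The synchronization step in the recipe above can be sketched as nearest-timestamp matching that records the residual skew for each pair. The function name `match_nearest`, the 50 ms tolerance, and the sample timestamps are illustrative assumptions; a hardware-triggered rig would replace the matching but should still log the residual.

```python
import bisect


def match_nearest(lidar_stamps, camera_stamps, max_skew_s=0.05):
    """For each LiDAR timestamp, find the nearest camera timestamp.

    Both lists are in seconds and sorted ascending. Returns a list of
    (lidar_index, camera_index, skew_s) tuples, where skew_s is camera
    time minus LiDAR time; pairs beyond max_skew_s are dropped.
    """
    if not camera_stamps:
        return []
    pairs = []
    for i, t_lidar in enumerate(lidar_stamps):
        k = bisect.bisect_left(camera_stamps, t_lidar)
        # Candidates: the camera frame just before and just after t_lidar.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(camera_stamps)]
        j = min(candidates, key=lambda c: abs(camera_stamps[c] - t_lidar))
        skew = camera_stamps[j] - t_lidar
        if abs(skew) <= max_skew_s:
            pairs.append((i, j, skew))
    return pairs


# Assumed rates: LiDAR at 10 Hz, camera at roughly 30 Hz with a small offset.
lidar_stamps = [0.000, 0.100, 0.200]
camera_stamps = [0.004, 0.037, 0.070, 0.104, 0.137, 0.170, 0.204]

for i, j, skew in match_nearest(lidar_stamps, camera_stamps):
    print(f"lidar[{i}] <-> camera[{j}]  skew = {skew * 1e3:+.1f} ms")
```

Storing the signed skew, rather than only its magnitude, makes a constant clock offset easy to tell apart from jitter when the log is reviewed later.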
A small clock skew between camera and LiDAR makes a fast crossing pedestrian appear smeared or doubled after fusion. The detector confidence drops, the tracker stutters, and the planner over-brakes. Always log the per-frame time skew; it is the first thing to check when fused detections degrade only for fast objects.
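To see why only fast objects are affected, note that the apparent displacement between the two sensors' views is simply speed times skew. A minimal sketch, with the speeds and the 50 ms skew chosen purely for illustration:

```python
def skew_displacement_m(speed_mps, skew_s):
    """Distance an object moves between two sensor captures skew_s apart."""
    return speed_mps * skew_s


skew_s = 0.05  # assumed 50 ms camera-LiDAR skew
for name, speed_mps in [("walking pedestrian", 1.5),
                        ("running pedestrian", 4.0),
                        ("crossing cyclist", 8.0)]:
    shift_cm = skew_displacement_m(speed_mps, skew_s) * 100.0
    print(f"{name}: {shift_cm:.1f} cm")
```

Under these assumed numbers a walking pedestrian shifts by well under 10 cm, while a crossing cyclist shifts by about 40 cm, which is enough for the camera and LiDAR evidence to stop overlapping.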
A team adding radar to a camera-LiDAR stack should fuse radar Doppler at the object level first: associate radar tracks to detected boxes and use radial velocity to disambiguate stopped versus creeping vehicles. This recovers velocity in heavy rain where LiDAR returns thin out, without rebuilding the detector.
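A minimal sketch of that object-level association, assuming boxes and radar returns are already expressed in a shared ego frame and that ego-motion compensation of the Doppler value has been applied upstream. The `Box` and `RadarDetection` types, the 2.5 m gate, and the 0.2 m/s stopped threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class Box:
    """Hypothetical detected box centre in the ego frame (metres)."""
    x: float
    y: float


@dataclass
class RadarDetection:
    """Hypothetical radar return: position (m) and ego-compensated
    radial velocity (m/s, positive = moving away from the sensor)."""
    x: float
    y: float
    radial_velocity: float


def attach_doppler(boxes, radar, gate_m=2.5, stopped_mps=0.2):
    """For each box, take the nearest radar return within gate_m and label
    the box 'stopped', 'moving', or 'unknown' (no radar support)."""
    results = []
    for box in boxes:
        nearest = min(radar,
                      key=lambda r: math.hypot(r.x - box.x, r.y - box.y),
                      default=None)
        if nearest is None or math.hypot(nearest.x - box.x,
                                         nearest.y - box.y) > gate_m:
            results.append((box, None, "unknown"))
            continue
        state = "stopped" if abs(nearest.radial_velocity) < stopped_mps else "moving"
        results.append((box, nearest.radial_velocity, state))
    return results


boxes = [Box(30.0, 0.0), Box(18.0, 3.5), Box(60.0, -4.0)]
radar = [RadarDetection(30.6, 0.3, 0.05),
         RadarDetection(17.5, 3.0, -0.8)]

for box, v_r, state in attach_doppler(boxes, radar):
    print(box, v_r, state)
```

Doppler measures only the radial component of velocity, so a "stopped" label is most trustworthy for vehicles roughly ahead of or behind the sensor; a vehicle crossing tangentially can read near zero while moving.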
Camera knows what, LiDAR knows where, radar knows how fast. Fusion's job is to keep all three answers about the same object.
Bird's-eye-view fusion (BEVFormer and successors) and occupancy networks are pushing toward a unified scene representation that detection, prediction, and planning can all read. The frontier question is robustness: do these dense representations degrade gracefully under sensor dropout and calibration drift, or do they fail silently?
Can you state, for a single LiDAR point, the two matrices and the one division needed to place it in the image, and the condition under which the projection is invalid? If not, revisit the projection mechanism described under Theory before fusing modalities.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| nuScenes devkit, Waymo Open Dataset tools | Calibrated multi-sensor data and transforms | Use their extrinsics rather than re-deriving frames by hand. |
| MMDetection3D, OpenPCDet | Reference 3D detectors (PointPillars, CenterPoint, BEVFormer) | Reproduce a published number before customizing. |
| Projection overlay script | Visual calibration audit | Overlay LiDAR on camera every release to catch extrinsic drift. |
Section 48.1 frames perception inside the full loop, 48.3 consumes these detections for tracking and prediction, and 48.5 explores learned representations that fuse and predict jointly.
Perturb the extrinsic rotation in the worked example by 1 degree and re-project a point at 50 m. Measure the pixel shift, convert it back to a ground-plane lateral error, and state whether it would move a pedestrian off the sidewalk.
Section References
- Lang et al., "PointPillars: Fast Encoders for Object Detection from Point Clouds," CVPR 2019.
- Yin et al., "Center-based 3D Object Detection and Tracking" (CenterPoint), CVPR 2021.
- Li et al., "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images," ECCV 2022.
These define the detector families and the bird's-eye-view fusion paradigm referenced above.
Three modalities disagree by design; fusion turns that disagreement into a more complete and robust scene, but only when calibration and time synchronization are correct. Master the LiDAR-to-camera projection first, because every fusion method depends on it.
Extend the projection function to handle a batch of LiDAR points and to color each surviving pixel by depth. Then design a side-by-side experiment comparing late fusion (merging two detector outputs) against the painted-point sensor-level approach on identical frames, measuring mAP and the false-negative rate for pedestrians.