Section 48.3: Detection, lane and behavior prediction | Building Embodied AI: From Perception to Autonomous Action

To predict a driver is to admit you do not know what they will do, and then to be ready for every reasonable thing they might.
On trajectory forecasting

**Figure 48.3A**: Prediction is fundamentally multi-modal: the same observed history is consistent with several plausible futures, and a good predictor keeps all of them alive until the evidence picks one.

Big Picture

Once perception delivers tracked actors, prediction forecasts where each one will go over the next few seconds. The defining property is multi-modality: a car approaching an intersection might go straight, turn, or stop, and the predictor must represent several distinct futures with calibrated probabilities. This section defines the standard error metrics (minADE and minFDE), shows how to compute them, and introduces lane-graph-conditioned models that ground predictions in the road structure.

This section develops behavior prediction as a measurable contract: ingest agent histories and a vectorized map, output $K$ candidate future trajectories per agent with probabilities, and score them against the realized future. Because the future is uncertain, the standard metrics evaluate the best of $K$ samples rather than a single guess, which rewards covering the true mode without penalizing diversity.
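The contract can be made concrete with a small container type. The following is a minimal sketch, not a benchmark API: the class name `AgentPrediction`, its fields, and the `score` helper are hypothetical names chosen for illustration, and the map input is omitted to keep the example short.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AgentPrediction:
    """Hypothetical container for one agent's multi-modal forecast."""
    trajectories: np.ndarray   # shape (K, T, 2): K candidate futures of T (x, y) points
    probabilities: np.ndarray  # shape (K,): one probability per candidate

    def validate(self) -> None:
        k, _, d = self.trajectories.shape
        if d != 2:
            raise ValueError("expected (x, y) positions")
        if self.probabilities.shape != (k,):
            raise ValueError("need exactly one probability per trajectory")
        if np.any(self.probabilities < 0) or not np.isclose(self.probabilities.sum(), 1.0):
            raise ValueError("probabilities must be non-negative and sum to 1")


def score(pred: AgentPrediction, future: np.ndarray) -> dict:
    """Score the K candidates against the realized future of shape (T, 2)."""
    pred.validate()
    disp = np.linalg.norm(pred.trajectories - future[None], axis=2)  # (K, T)
    return {
        "minADE": float(disp.mean(axis=1).min()),  # best average error over K
        "minFDE": float(disp[:, -1].min()),        # best final-point error over K
    }


# Usage: two candidates, one of which matches the realized future exactly.
future = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred = AgentPrediction(
    trajectories=np.stack([future, future + np.array([0.0, 1.0])]),
    probabilities=np.array([0.7, 0.3]),
)
print(score(pred, future))
```

Validating shapes and probabilities at the boundary keeps scoring errors from silently turning into good-looking numbers.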

Theory

The displacement metrics

Let an agent have ground-truth future positions $y_1, \dots, y_T$ and $K$ predicted trajectories, the $k$-th being $\hat y_1^{(k)}, \dots, \hat y_T^{(k)}$. The minimum average displacement error over $K$ samples is

$$\text{minADE}_K = \frac{1}{T}\,\min_{k=1}^{K}\sum_{t=1}^{T}\big\lVert \hat y_t^{(k)} - y_t \big\rVert_2.$$

The minimum final displacement error keeps only the last timestep,

$$\text{minFDE}_K = \min_{k=1}^{K}\big\lVert \hat y_T^{(k)} - y_T \big\rVert_2.$$

Both take the best of $K$ predictions: the model is credited if any one of its samples is close to the truth. A related metric, the miss rate, is the fraction of agents for which even the best sample's final-point error exceeds a threshold (commonly 2 m).
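Written out for $N$ agents, the miss rate can be stated in the same notation. The agent index $i$ and the threshold symbol $\tau$ are introduced here for illustration and do not appear in the definitions above:

$$\text{MR}_K = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\Big[\min_{k=1}^{K}\big\lVert \hat y_T^{(i,k)} - y_T^{(i)} \big\rVert_2 > \tau\Big].$$

The inner minimum is the per-agent $\text{minFDE}_K$, so the miss rate is simply the share of agents whose $\text{minFDE}_K$ is above $\tau$.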

Lane-graph-conditioned models

VectorNet represents both agent histories and map elements (lane centerlines, crosswalks, boundaries) as polylines of vectors, then reasons over them with a graph network so predictions respect road geometry. MTR (Motion Transformer) uses a set of learnable motion-mode queries with a transformer to produce diverse, map-consistent trajectories, and it ranked first on the Waymo Open Motion Dataset leaderboards at the time of its publication. The shared idea is conditioning on the lane graph so that a predicted path follows drivable lanes rather than cutting across medians.
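The polyline-of-vectors idea can be sketched in a few lines. This is an illustration in the spirit of VectorNet's input encoding (start point, end point, attributes, polyline ID), not the paper's exact feature layout; the function name, the type codes `LANE` and `AGENT`, and the column order are assumptions.

```python
import numpy as np

# Hypothetical type codes for the attribute column.
LANE, AGENT = 0, 1


def polyline_to_vectors(points: np.ndarray, type_id: int, polyline_id: int) -> np.ndarray:
    """Turn a polyline of P points, shape (P, 2), into P-1 vectors.

    Each row is [x_start, y_start, x_end, y_end, type_id, polyline_id].
    """
    starts, ends = points[:-1], points[1:]
    n = len(starts)
    attrs = np.column_stack([np.full(n, type_id), np.full(n, polyline_id)])
    return np.hstack([starts, ends, attrs]).astype(np.float32)


# One lane centerline and one agent history, both encoded the same way.
lane = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.5]])
history = np.array([[-2.0, 0.1], [-1.0, 0.1], [0.0, 0.0], [1.0, 0.0]])

vectors = np.vstack([
    polyline_to_vectors(lane, LANE, polyline_id=0),
    polyline_to_vectors(history, AGENT, polyline_id=1),
])
print(vectors.shape)  # (5, 6): 2 lane vectors + 3 history vectors, 6 features each
```

Because map and agents share one representation, a single graph or attention model can reason over both without rasterizing the scene.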

Best-Of-K Rewards Coverage, Not Confidence

Because minADE and minFDE take the minimum over samples, a predictor is scored on whether it covered the true future, not on whether its top sample was right. This is intentional: downstream planning must be safe against every plausible mode, so a predictor that keeps the correct mode alive (even at low probability) is more useful than one that commits confidently to the wrong single path.
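The gap between coverage and confidence is easy to see by comparing the best-of-K error with the error of the most probable sample. The toy scenario below (an agent that turns left while the predictor favors going straight) uses made-up numbers, and `final_errors` is a helper name introduced for this sketch.

```python
import numpy as np


def final_errors(preds: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Final-timestep L2 error of each of the K samples, shape (K,)."""
    return np.linalg.norm(preds[:, -1, :] - gt[-1], axis=1)


# Realized future: the agent drifts left.
gt = np.array([[1.0, 0.0], [2.0, 0.5], [3.0, 1.5]])

straight = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
left = gt.copy()
preds = np.stack([straight, left])   # (K=2, T=3, 2)
probs = np.array([0.9, 0.1])         # the predictor favors "straight"

errs = final_errors(preds, gt)
top1_fde = float(errs[np.argmax(probs)])  # error of the most probable sample
min_fde = float(errs.min())               # error of the best sample

print(top1_fde, min_fde)  # 1.5 0.0
```

The best-of-K score is perfect even though the top-ranked sample is 1.5 m off, which is exactly the behavior the metric is designed to reward.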

Mechanism

The danger that best-of-K metrics can hide is mode collapse: a model trained to minimize average error tends to predict $K$ near-identical trajectories along the most common mode (going straight). It then scores well on routine data, where the common mode is also the true one, but misses exactly the rare, safety-critical maneuvers (a sudden lane change). Diversity-promoting losses, anchor trajectories, and mode-specific queries (as in MTR) exist precisely to prevent this collapse.
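One rough way to spot collapse is to measure how spread out the $K$ endpoints are. The sketch below computes the mean pairwise distance between final points; it is a heuristic introduced here for illustration, not a standard benchmark metric, and any alert threshold on it would need tuning per dataset and horizon.

```python
import numpy as np


def endpoint_spread(preds: np.ndarray) -> float:
    """Mean pairwise L2 distance between the K final points of preds (K, T, 2)."""
    ends = preds[:, -1, :]                                   # (K, 2)
    k = len(ends)
    if k < 2:
        return 0.0                                           # no pairs to compare
    pair = np.linalg.norm(ends[:, None, :] - ends[None, :, :], axis=2)  # (K, K)
    upper = np.triu_indices(k, k=1)                          # each pair once
    return float(pair[upper].mean())


def straight_paths(lateral_offsets):
    """Build straight trajectories along x at the given lateral (y) offsets."""
    t = np.arange(1, 6, dtype=float)
    return np.stack([np.column_stack([t, np.full_like(t, o)]) for o in lateral_offsets])


collapsed = straight_paths([0.0, 0.05, -0.05, 0.1, -0.1, 0.0])  # six near-copies
diverse = straight_paths([0.0, 1.5, -1.5, 3.0, -3.0, 0.5])      # distinct lanes

print("collapsed spread:", round(endpoint_spread(collapsed), 3))
print("diverse spread:  ", round(endpoint_spread(diverse), 3))
```

A spread near zero means the six samples are effectively one guess, regardless of how good minADE looks on routine data.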

Worked Example

The example computes $\text{minADE}_6$ given six predicted trajectory samples and one ground-truth future.

import numpy as np

# Ground truth: T=5 future (x, y) positions of one agent.
gt = np.array([[1.0, 0.0], [2.0, 0.1], [3.0, 0.3], [4.0, 0.6], [5.0, 1.0]])

# Six predicted trajectories, shape (K=6, T=5, 2).
# Sample 0 hugs the truth; the rest are alternative modes (lane changes, stops).
rng = np.random.default_rng(0)
preds = np.stack([
    gt + rng.normal(0, 0.05, gt.shape),          # near-correct mode
    gt + np.array([0, 1.5]),                      # left lane change
    gt + np.array([0, -1.5]),                     # right lane change
    gt * np.array([0.6, 1.0]),                    # braking / slowing
    gt + rng.normal(0, 1.0, gt.shape),            # noisy
    gt + np.array([0.0, 3.0]),                    # far-off mode
])

def min_ade_k(preds, gt):
    """minADE_K: best-of-K mean per-timestep L2 displacement."""
    # Per-sample, per-timestep L2 distance: shape (K, T).
    disp = np.linalg.norm(preds - gt[None, :, :], axis=2)
    ade_per_sample = disp.mean(axis=1)            # average over T -> (K,)
    return float(ade_per_sample.min())            # best of K

def min_fde_k(preds, gt):
    """minFDE_K: best-of-K final-timestep L2 displacement."""
    final = np.linalg.norm(preds[:, -1, :] - gt[-1], axis=1)  # (K,)
    return float(final.min())

print("minADE_6:", round(min_ade_k(preds, gt), 3), "m")
print("minFDE_6:", round(min_fde_k(preds, gt), 3), "m")

Expected output: a small minADE_6 (roughly 0.04 m) and a similarly small minFDE_6, because sample 0 closely tracks the truth. Remove sample 0 from the stack and both metrics jump, demonstrating that the score depends entirely on whether the correct mode was among the $K$ candidates.

Library Shortcut

The Argoverse 2 and Waymo Open Motion Dataset evaluation kits compute minADE, minFDE, miss rate, and probabilistic metrics with the official protocol. Use nuScenes prediction challenge tooling for the lane-graph map API. Reference models: VectorNet, LaneGCN, and MTR are available in open implementations. Always score with the dataset's own kit so your numbers are comparable.

Practical Recipe

Define the horizon $T$, the number of modes $K$, and the miss threshold up front; they make numbers comparable.
Condition on the vectorized lane graph, not just agent history, so predictions stay on the road.
Report minADE_K, minFDE_K, miss rate, and a mode-coverage check together; one metric alone hides collapse.
Stress-test on rare maneuvers (cut-ins, sudden stops), not just the routine straight-ahead majority.
Save one artifact: per-agent predictions, probabilities, the realized future, and the four metrics.
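The last step of the recipe can be sketched as a single JSON-serializable record per agent. The function name `build_artifact`, the key names, and the metric labels are all hypothetical; adapt them to whatever schema your evaluation pipeline already uses.

```python
import json

import numpy as np


def build_artifact(agent_id, preds, probs, gt, metrics, horizon_s, miss_threshold_m):
    """Bundle predictions, probabilities, realized future, and metrics for one agent."""
    return {
        "agent_id": agent_id,
        "config": {
            "K": int(preds.shape[0]),          # number of modes
            "T": int(preds.shape[1]),          # number of future timesteps
            "horizon_s": horizon_s,
            "miss_threshold_m": miss_threshold_m,
        },
        "predictions": preds.tolist(),         # (K, T, 2) as nested lists
        "probabilities": probs.tolist(),
        "realized_future": gt.tolist(),
        "metrics": metrics,
    }


# Placeholder values, only to show the shape of the record.
preds = np.zeros((2, 3, 2))
probs = np.array([0.5, 0.5])
gt = np.zeros((3, 2))
metrics = {"minADE": 0.0, "minFDE": 0.0, "miss": False, "endpoint_spread": 0.0}

artifact = build_artifact("agent_0", preds, probs, gt, metrics,
                          horizon_s=3.0, miss_threshold_m=2.0)
serialized = json.dumps(artifact)
print(len(serialized) > 0)
```

Storing the configuration next to the numbers means a later reader can tell whether two results used the same $T$, $K$, and miss threshold.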

Common Failure Mode

Mode collapse on a high-speed lane change. A model trained to minimize average error predicts six near-parallel straight paths; when a fast vehicle suddenly changes lanes, none of the six samples covers the new lane, minFDE spikes, and the planner gets no warning. Good aggregate minADE on routine data can hide this, so always evaluate the rare-maneuver slice separately.
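Slicing the evaluation is a small amount of code. The sketch below assumes you already have a per-agent minFDE array and a boolean flag marking rare maneuvers; how that flag is produced (labels, heuristics on lateral motion) is left open, and the numbers are invented for illustration.

```python
import numpy as np


def sliced_mean(per_agent_metric: np.ndarray, is_rare: np.ndarray) -> dict:
    """Mean of a per-agent metric overall, on routine agents, and on rare maneuvers.

    Assumes both slices are non-empty.
    """
    return {
        "all": float(per_agent_metric.mean()),
        "routine": float(per_agent_metric[~is_rare].mean()),
        "rare": float(per_agent_metric[is_rare].mean()),
    }


# Illustrative per-agent minFDE values in meters: 8 routine agents, 2 lane changes.
min_fde = np.array([0.3, 0.4, 0.2, 0.5, 0.3, 0.4, 0.3, 0.2, 4.0, 5.0])
is_rare = np.array([False] * 8 + [True] * 2)

report = sliced_mean(min_fde, is_rare)
for name, value in report.items():
    print(f"{name:8s} {value:.3f} m")
```

With a more realistic routine-to-rare ratio, the aggregate moves even less, which is why the rare slice needs its own line in the report.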

Practical Example

A prediction team sees collisions concentrated at merges. Logs show the predictor assigned 95 percent probability to "stay in lane" and 5 percent to "merge," but the planner used only the top mode. The fix is twofold: planning must consume the full multi-modal set, and the predictor needs a diversity loss so the merge mode is geometrically distinct, not a near-copy of the lane-keep mode.
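The planning half of the fix can be sketched as a check that runs over every predicted mode above a small probability floor, instead of only the most likely one. The function names, the 1 percent floor, and the 2 m clearance are assumptions for illustration; a real planner would use vehicle footprints and time-to-collision rather than point distances.

```python
import numpy as np


def min_clearance(ego_plan: np.ndarray, mode: np.ndarray) -> float:
    """Smallest same-timestep distance between the ego plan and one mode, both (T, 2)."""
    return float(np.linalg.norm(ego_plan - mode, axis=1).min())


def unsafe_modes(ego_plan, preds, probs, min_prob=0.01, safe_dist=2.0):
    """Indices of modes with probability >= min_prob that come within safe_dist."""
    return [
        k for k in range(len(preds))
        if probs[k] >= min_prob and min_clearance(ego_plan, preds[k]) < safe_dist
    ]


# Ego drives straight along y = 0; the other car starts in the adjacent lane.
ego = np.array([[2.0, 0.0], [4.0, 0.0], [6.0, 0.0], [8.0, 0.0]])
stay = np.array([[3.0, 3.5], [5.0, 3.5], [7.0, 3.5], [9.0, 3.5]])
merge = np.array([[3.0, 3.5], [5.0, 2.0], [7.0, 0.5], [9.0, 0.0]])
preds = np.stack([stay, merge])
probs = np.array([0.95, 0.05])

top_mode = int(np.argmax(probs))
print("top-mode clearance:", round(min_clearance(ego, preds[top_mode]), 2), "m")
print("unsafe modes:", unsafe_modes(ego, preds, probs))  # [1]
```

Checking only the top mode reports ample clearance; checking all modes flags the 5 percent merge that would put the other car about a meter from the ego plan.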

Memory Hook

ADE is how wrong you were on average; FDE is how wrong you were at the end. The "min" in front of both means: out of your guesses, your best one counts.

Research Frontier

Joint multi-agent prediction (forecasting interacting agents together rather than independently) and goal-conditioned models that reason about intent are the active frontier. The hard part is calibration: producing probabilities a planner can trust, not just diverse trajectories that game best-of-K.
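One simple way to make the probabilities count is to penalize the best sample by how little probability it was given. The sketch below is modeled on the Brier-penalized minFDE reported by the Argoverse benchmarks, which adds $(1 - p)^2$ for the probability $p$ of the best sample; treat the exact form here as an assumption and defer to the official evaluation kit for reportable numbers.

```python
import numpy as np


def brier_min_fde(preds: np.ndarray, probs: np.ndarray, gt: np.ndarray) -> float:
    """minFDE plus (1 - p)^2, where p is the probability assigned to the best sample."""
    errs = np.linalg.norm(preds[:, -1, :] - gt[-1], axis=1)  # (K,)
    best = int(np.argmin(errs))
    return float(errs[best] + (1.0 - probs[best]) ** 2)


# Same two trajectories, scored under two different probability assignments.
gt = np.array([[1.0, 0.0], [2.0, 0.5], [3.0, 1.5]])            # agent turned left
straight = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
preds = np.stack([straight, gt.copy()])

overconfident = np.array([0.9, 0.1])   # nearly all mass on the wrong mode
hedged = np.array([0.4, 0.6])          # more mass on the mode that happened

score_overconfident = brier_min_fde(preds, overconfident, gt)
score_hedged = brier_min_fde(preds, hedged, gt)
print(round(score_overconfident, 2), round(score_hedged, 2))
```

Plain minFDE is zero in both cases; the penalized score separates the predictor that merely covered the true mode from the one that also believed in it.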

Self Check

Can you explain why minADE_6 can look excellent while the predictor is unsafe, and what additional measurement would expose the problem? If not, revisit the mode-collapse mechanism.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Argoverse 2, Waymo Open Motion Dataset	Benchmarks with official prediction metrics	Score with their kits for comparable minADE / minFDE.
VectorNet, LaneGCN, MTR	Lane-graph-conditioned predictor families	Reproduce a benchmark number before customizing.
Rare-maneuver evaluation slice	Mode-collapse detection	Always report it alongside aggregate metrics.

Cross-References

Section 48.2 provides the tracked actors this section forecasts, Sections 48.4 and 48.8 consume the multi-modal predictions for planning, and Section 48.5 explores world models that predict scene evolution directly.

Mini Lab

Start from the worked example, delete the near-correct sample, and replace it with a sixth near-straight path. Recompute minADE_6 and minFDE_6 and confirm that engineered mode collapse inflates error exactly when the truth is an off-mode maneuver.

Section References

Gao et al., "VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation," CVPR 2020.
Shi et al., "Motion Transformer with Global Intention Localization and Local Movement Refinement" (MTR), NeurIPS 2022.
Wilson et al., "Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting," NeurIPS 2021.

These define lane-graph prediction models and the benchmarks whose metrics this section computes.

Key Takeaway

Prediction is multi-modal forecasting scored by best-of-K displacement. Strong minADE_K means the right future was among your samples; guarding against mode collapse on rare maneuvers, and feeding all modes to the planner, is what turns that score into safety.

Exercise 48.3.1

Implement the miss rate (the fraction of agents whose best final-point error exceeds 2 m) and add it to the worked example. Then design a side-by-side comparison of two predictors on the same cut-in scenario set, reporting minADE_6, minFDE_6, and miss rate, and argue which predictor a planner should prefer.