Section 48.9: Closed-Loop Driving Evaluation And Safety Assurance

Open-loop scores measure how well you predict the world; closed-loop scores measure how the world changes once you act in it, and only the second kind keeps you safe.

On evaluating a driving stack
Big Picture

Driving evaluation must distinguish open-loop dataset skill from closed-loop driving behavior. A planner that predicts well can still brake too late, choose the wrong gap, or exploit a benchmark metric while becoming less safe on the road.

Figure 48.9.1: A field-facing mental model for autonomous driving assurance. The illustration connects sensing, state estimation, prediction, planning, control, safety, and the evidence artifacts needed to defend a real driving claim.

Open-Loop Metrics Are Necessary, Not Sufficient

Modern driving stacks are usually evaluated in at least three modes. Open-loop evaluation checks perception or prediction against recorded data. Closed-loop simulation checks how the full stack behaves when its own decisions perturb the future. Safety assurance asks a different question again: which hazards remain, what mitigations exist, and what evidence shows that the mitigations actually work inside the stated operational design domain?

The first two modes are often confused. A forecasting model can improve average displacement error while still making the planner too confident in the wrong interaction pattern. A smoother trajectory generator can reduce jerk while cutting safety margins. A benchmark score can rise even while the vehicle becomes easier to deadlock or easier to surprise in rare conditions.
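To make the first confusion concrete, here is a minimal sketch of average and final displacement error (ADE and FDE) for a single predicted trajectory. The function name and the toy waypoints are illustrative assumptions, not any benchmark's API. Forecast B wins on the average while missing the final, safety-relevant step by a wide margin.

```python
import math

Point = tuple[float, float]

def displacement_errors(pred: list[Point], truth: list[Point]) -> tuple[float, float]:
    """Return (ADE, FDE) in metres for one predicted trajectory."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    # ADE averages over all steps; FDE looks only at the last step.
    return sum(dists) / len(dists), dists[-1]

# Toy data (made up): the recorded agent drives straight along y = 0.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred_a = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]  # small constant offset
pred_b = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 1.2)]  # perfect, then a late miss

ade_a, fde_a = displacement_errors(pred_a, truth)
ade_b, fde_b = displacement_errors(pred_b, truth)
print(f"A: ADE={ade_a:.3f} FDE={fde_a:.3f}")
print(f"B: ADE={ade_b:.3f} FDE={fde_b:.3f}")
```

Neither number says anything about the distance to other agents or about what the planner does with the forecast, which is why these metrics stay in the supporting-evidence column.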

Closed Loop Changes The Distribution

Once the stack acts, it changes what happens next. That is why route completion, infractions, blocked-agent events, fallback activations, and controller saturation must be part of the evaluation, not mere debugging extras.
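A toy rollout shows the effect. The sketch below assumes a hypothetical one-dimensional lane-keeping policy with a small constant bias, a made-up lane half-width, and a logged expert that never leaves the lane centre. None of these values come from a real stack. Scored on the logged states the policy looks nearly perfect; rolled out on its own states it leaves the lane.

```python
LANE_HALF_WIDTH_M = 1.0          # assumed lane half-width for this toy
LOGGED_OFFSETS = [0.0] * 30      # recorded expert stays on the lane centre
EXPERT_ACTION = 0.0              # expert's lateral step at every logged state

def policy(offset_m: float) -> float:
    """Hypothetical learned policy: lateral step in metres per tick.

    It only ever saw centred states in the log, so it carries a small
    constant bias and applies no correction when off-centre.
    """
    return 0.06

# Open-loop: query the policy on the logged states and compare to the expert.
open_loop_errors = [abs(policy(x) - EXPERT_ACTION) for x in LOGGED_OFFSETS]
mean_open_loop_error = sum(open_loop_errors) / len(open_loop_errors)

# Closed-loop: the policy's own action determines the next state.
offset = 0.0
closed_loop_offsets = []
for _ in LOGGED_OFFSETS:
    offset += policy(offset)
    closed_loop_offsets.append(offset)

# Index of the first tick at which the vehicle is outside the lane, if any.
first_departure = next(
    (i for i, x in enumerate(closed_loop_offsets) if abs(x) > LANE_HALF_WIDTH_M),
    None,
)
print(f"mean open-loop action error: {mean_open_loop_error:.2f} m")
print(f"first lane departure at tick index: {first_departure}")
```

The open-loop number never changes because the logged states never change. Only the rollout exposes the lane departure, which is the kind of event an infraction metric is meant to catch.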

Driving Assurance Mathematics

Closed-loop evaluation often reduces the stack to a score, but the useful object is a tuple of behavior and risk:

$$E_{\mathrm{drive}} = (c_{\mathrm{route}}, p_{\mathrm{infractions}}, m_{\mathrm{comfort}}, m_{\mathrm{interaction}}, d_{\mathrm{ODD}}).$$

Here $c_{\mathrm{route}}$ is route completion, $p_{\mathrm{infractions}}$ is an infraction penalty that aggregates collisions and rule violations into a multiplier in $(0, 1]$ (1 means no infractions, and each infraction pushes it lower), $m_{\mathrm{comfort}}$ tracks jerk and acceleration, $m_{\mathrm{interaction}}$ tracks negotiation quality such as deadlock or unsafe gap choice, and $d_{\mathrm{ODD}}$ records how close the scenario lies to the edge of the validated operational design domain.

A useful benchmark formula is therefore not just "higher is better." One representative decomposition is:

$$J_{\mathrm{closed}} = c_{\mathrm{route}} \cdot p_{\mathrm{infractions}}, \qquad \text{subject to } d_{\mathrm{ODD}} \in \mathcal{D}_{\mathrm{validated}}.$$

This captures a core lesson from current leaderboard design: route progress matters, but it is only meaningful when infractions remain low and the scenario still sits inside the claimed ODD. Once the system leaves the validated domain, the benchmark score is no longer a deployment argument, only a research observation.
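The decomposition can be sketched in a few lines. This sketch assumes that $p_{\mathrm{infractions}}$ is a multiplicative penalty in $(0, 1]$ built as a product of per-infraction coefficients, and that route completion is expressed as a fraction. The function name and the coefficient values are made up for illustration and are not taken from any leaderboard.

```python
from math import prod
from typing import Optional

def closed_loop_score(
    route_completion: float,
    infraction_coeffs: list[float],
    inside_validated_odd: bool,
) -> Optional[float]:
    """Return J_closed = c_route * p_infractions, or None outside the validated ODD."""
    if not 0.0 <= route_completion <= 1.0:
        raise ValueError("route_completion must be a fraction in [0, 1]")
    if any(not 0.0 < c <= 1.0 for c in infraction_coeffs):
        raise ValueError("each infraction coefficient must lie in (0, 1]")
    if not inside_validated_odd:
        # Outside the claimed ODD the number is not a deployment argument,
        # so this sketch refuses to report one.
        return None
    p_infractions = prod(infraction_coeffs)  # empty list means no infractions: 1
    return route_completion * p_infractions

# Hypothetical coefficients: one red-light violation (0.7), one stop-sign violation (0.8).
print(closed_loop_score(0.96, [0.7, 0.8], inside_validated_odd=True))
print(closed_loop_score(0.96, [0.7, 0.8], inside_validated_odd=False))
```

Returning `None` rather than a number is a design choice: it makes it hard to paste an out-of-ODD score into a promotion table by accident.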

[Figure: block diagram of the driving assurance evidence flow, linking open-loop metrics, closed-loop simulation, safety-case arguments, and deployment decisions. Blocks, in order: Open-loop tests (detection, tracking, forecasting); Closed-loop driving (route completion, infractions, recovery); Scenario coverage (ODD boundaries, rare interactions); Safety case (hazards, mitigations, defeaters); Decision (deploy, limit, or deny). Evidence artifact: scenario manifest, metric script, closed-loop trace, intervention log, and written defeater analysis. A deployable claim requires every layer above, not just a leaderboard screenshot or a one-off demo route.]
Figure 48.9.2: Driving assurance is an evidence flow from open-loop signals to closed-loop behavior to a safety-case decision.

Algorithm: closed-loop assurance pass
  1. Freeze the ODD card: road type, weather, lighting, speed range, traffic density, sensor assumptions, and fallback policy.
  2. Evaluate perception and prediction open-loop, but keep those metrics tagged as supporting evidence only.
  3. Run the full stack closed-loop on one scenario panel with one metric script and one taxonomy of infractions.
  4. Write the strongest defeater for every apparent success: what realistic condition would break this claim?
  5. Promote only the claims that survive both the metric panel and the defeater review.
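Steps 4 and 5 of the pass above can be sketched as a small promotion gate. The `Claim` type, its field names, and the example claims are hypothetical; the point is only that a claim with no written defeater is treated as unreviewed, not as safe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    name: str
    passes_metric_panel: bool   # outcome of the closed-loop panel (step 3)
    strongest_defeater: str     # written in step 4; empty means the step was skipped
    defeater_resolved: bool     # the defeater review found a mitigation or a limit

def promote(claims: list[Claim]) -> list[str]:
    """Return the names of claims that survive both the panel and the defeater review."""
    return [
        c.name
        for c in claims
        if c.passes_metric_panel
        and c.strongest_defeater.strip()   # a missing defeater blocks promotion
        and c.defeater_resolved
    ]

claims = [
    Claim("daylight_urban_merge", True, "occluded cyclist at merge point", True),
    Claim("night_urban_merge", True, "", True),            # no defeater written
    Claim("glare_left_turn", True, "fast motorcycle in glare", False),
    Claim("rain_highway", False, "hydroplaning under braking", True),
]
print(promote(claims))  # ['daylight_urban_merge']
```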

Current Toolchain And What It Is For

The practical tool stack for this section is: CARLA Leaderboard, nuPlan, Waymo Open Dataset, Waymo Sim Agents or Waymax, CommonRoad, Autoware, ASAM OpenSCENARIO, OpenDRIVE. Each tool answers a different assurance question. CARLA Leaderboard tests closed-loop route completion with explicit infraction metrics. nuPlan and CommonRoad structure planning evaluation. Waymo's public challenges and simulator tools expose interaction realism. Autoware makes stack interfaces concrete. OpenSCENARIO and OpenDRIVE keep the scenario and map assumptions portable.

Driving assurance checklist

| Evidence layer | What to specify | What to save |
| --- | --- | --- |
| ODD | Road type, weather, lighting, speed range, traffic density, map quality, fallback policy. | ODD card with explicit exclusions. |
| Scenario panel | Functional hazards, logical parameter ranges, concrete seeds, and actor scripts. | Scenario manifest plus reproducible seeds. |
| Closed-loop metrics | Route completion, infractions, comfort, blocked-agent events, planner timeouts, controller interventions. | One construct-matched metric artifact. |
| Safety case | Hazards, mitigations, assumptions, and strongest defeaters. | Annotated argument and failure traces. |
| Decision | Deploy, limit, or deny by ODD slice. | Written promotion decision with supporting evidence. |

Failure Modes To Test

Stress the system with interaction deadlock, occlusion, map error, braking-distance miscalibration, planner-induced near misses, metric gaming, ODD creep, sensor degradation, and stale fallback assumptions. These are not decorative stressors. They are the ordinary ways an AV stack becomes dangerous while still looking competent on a dashboard.
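One way to keep such stressors reproducible is to expand logical parameter ranges into a concrete scenario manifest with explicit seeds. The sketch below is a minimal illustration: the parameter names, values, and manifest layout are assumptions, not the schema of OpenSCENARIO or any simulator.

```python
import itertools
import json

# Hypothetical logical ranges for one scenario family.
LOGICAL_RANGES = {
    "failure_mode": ["occlusion", "sensor_degradation"],
    "oncoming_speed_mps": [10.0, 15.0, 20.0],
    "seed": [0, 1],
}

def expand_manifest(ranges: dict[str, list]) -> list[dict]:
    """Expand logical ranges into one concrete scenario per parameter combination."""
    keys = list(ranges)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(ranges[k] for k in keys))
    ]

manifest = expand_manifest(LOGICAL_RANGES)
print(len(manifest))  # 12 concrete scenarios: 2 * 3 * 2
print(json.dumps(manifest[0], sort_keys=True))
```

Saving the expanded manifest next to the metric output means a failure can be replayed with the same stressor, parameters, and seed instead of being rediscovered.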

Practical Example

An AV stack may improve open-loop trajectory forecasting on a merge benchmark, then perform worse closed-loop because the planner trusts the forecast too much and enters gaps that humans would reject. The fix may live in uncertainty calibration, behavior planning, or fallback design rather than in the predictor's average error alone.
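A minimal sketch of the uncertainty-calibration route is shown below. It assumes the predictor reports a mean time gap and a standard deviation, and that the planner accepts a gap only if a pessimistic estimate still clears a required margin. The function name, the factor `k`, and all numbers are hypothetical.

```python
def accepts_gap(
    predicted_gap_s: float,
    sigma_s: float,
    required_gap_s: float,
    k: float = 2.0,
) -> bool:
    """Accept the gap only if (mean - k * sigma) still meets the required margin."""
    return predicted_gap_s - k * sigma_s >= required_gap_s

# Same mean forecast, two different uncertainty estimates (made-up values).
overconfident = accepts_gap(predicted_gap_s=4.5, sigma_s=0.1, required_gap_s=4.0)
calibrated = accepts_gap(predicted_gap_s=4.5, sigma_s=0.5, required_gap_s=4.0)
print(overconfident, calibrated)  # True False
```

The mean forecast is identical in both calls, so an average-error metric cannot tell them apart. Only the reported uncertainty, and the closed-loop behavior it produces, separates the planner that enters the gap from the one that waits.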

Code And Evidence

We can put this into practice with a tiny assurance record. The goal is not to compress the whole safety case into one object. The goal is to force the stack to admit which metric improved, which infraction still happened, and whether the claim is actually promotable.

# Store one closed-loop assurance summary for a driving stack.
# Keep the route metric, infraction metric, and deployment decision together.
from dataclasses import dataclass, asdict

@dataclass
class AssuranceRecord:
    scenario_family: str
    route_completion_pct: float
    infraction_penalty: float
    blocked_events: int
    strongest_defeater: str
    decision: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

record = AssuranceRecord(
    scenario_family="unprotected_left_turn",
    route_completion_pct=96.0,
    infraction_penalty=0.91,
    blocked_events=1,
    strongest_defeater="late response to fast oncoming motorcycle in glare",
    decision="limit_to_daylight_until_glare_panel_passes",
)
print(record.as_row())

Output:
{'scenario_family': 'unprotected_left_turn', 'route_completion_pct': 96.0, 'infraction_penalty': 0.91, 'blocked_events': 1, 'strongest_defeater': 'late response to fast oncoming motorcycle in glare', 'decision': 'limit_to_daylight_until_glare_panel_passes'}

Code Fragment 1: This assurance record keeps route success, infractions, interaction breakdowns, and the strongest remaining defeater in one place. The output matters because it prevents a high route-completion number from hiding the exact condition that still blocks deployment.

Expected output: the printed record should contain both performance information and the resulting deployment limit, not just a success metric. If your assurance artifact cannot name the strongest remaining defeater and the resulting deployment limit, it is a benchmark summary rather than a safety argument.

Library Shortcut

Use CARLA Leaderboard for closed-loop route and infraction metrics, nuPlan and CommonRoad for structured planning evaluation, Waymo Sim Agents or Waymax for interaction realism studies, and Autoware to make the stack boundaries concrete. The shortcut is valuable because these tools expose public interfaces and public metrics, not because they make the safety problem disappear.

Recipe For Builders

  1. Write the ODD card before comparing models. A model that works in daylight urban traffic is not automatically a model for glare, heavy rain, rural roads, or emergency-vehicle interactions.
  2. Use open-loop perception and prediction metrics as supporting evidence, then rerun the exact claim in closed-loop simulation.
  3. Keep route completion, infractions, comfort, blocked-agent events, and intervention logs in one artifact.
  4. Write the strongest defeater for every apparent success and convert it into a scenario or an explicit deployment limit.
  5. Only promote the stack where the metrics and the written safety argument agree.

Memory Hook

A strong driving stack is not the one with no scary examples in the slide deck. It is the one whose scary examples have already been turned into explicit defeaters, limits, and regression tests.

Self Check

Can you explain why a planner with better open-loop prediction metrics might still deserve a narrower ODD after closed-loop evaluation?

Research Frontier

The frontier in autonomous driving is shifting from isolated module wins toward world-model simulation, interaction-realistic agent generation, and explicit assurance arguments. The open problem is how to connect those richer models to safety cases without letting generative realism hide unvalidated assumptions.

Key Takeaway

Closed-loop evaluation and safety assurance teach the difference between a good score and a defensible driving claim. In embodied AI, that difference is the whole game.

Exercise 48.9.1

Take one scenario family, compute prediction metrics and closed-loop outcomes on the same panel, then write the safety argument and the strongest remaining defeater. End with a deployment decision that is narrower than the full ODD unless the evidence really supports the full claim.

Section References

CARLA Autonomous Driving Leaderboard. https://leaderboard.carla.org/

Current closed-loop benchmark with explicit route-completion and infraction metrics, including the 2.1 scoring update.

Waymo Open Dataset challenges. https://waymo.com/open/challenges/

Official 2026 note that leaderboards remain active even when formal challenge cycles pause, useful for current benchmark positioning.

Waymo Sim Agents challenge. https://waymo.com/research/the-waymo-open-sim-agents-challenge/

Public reference for interaction-realistic simulation metrics in autonomous driving.

Waymax. https://waymo.com/research/waymax/

Waymo's accelerated simulator for behavior research on motion-dataset scenarios.

Autoware documentation. https://autowarefoundation.github.io/autoware-documentation/main/home/

Open-source driving-stack documentation from sensing through control and simulation.

ISO 21448, Safety of the Intended Functionality. https://www.iso.org/standard/77490.html

Reference standard for reasoning about unsafe behavior without a component failure.

UL 4600 overview. https://users.ece.cmu.edu/~koopman/ul4600/index.html

Accessible overview of an assurance-oriented safety standard for autonomy.