Section 48.6: Scenario testing and safety cases | Building Embodied AI: From Perception to Autonomous Action

A passing benchmark says the car did well on the tests you wrote; a safety case argues that you wrote the right tests and states what risk is left when they pass.
On safety assurance for autonomous driving

Figure 48.6A: A safety case is an argument, not a score: it traces each hazard through its cause and mitigation down to a residual risk a regulator and an engineer can both inspect.

Big Picture

Safety assurance for AVs rests on three pillars. The Operational Design Domain (ODD) states the conditions under which the vehicle is certified to operate. ISO 21448 (SOTIF, Safety Of The Intended Functionality) addresses hazards from performance limitations and foreseeable misuse rather than component faults. The safety case is the structured argument linking each hazard to its cause, its mitigation, and the residual risk that remains. Scenario testing supplies the evidence that feeds this argument.

This section develops safety as a measurable contract: define the ODD, enumerate hazards within it, mitigate them, and quantify the residual risk with scenario evidence tied to closed-loop logs. The recurring discipline is that a benchmark number is evidence inside a safety case, never a substitute for one.

Theory

Operational Design Domain

The ODD is the explicit envelope of conditions the system is built and validated for: road types, speed limits, weather, lighting, geography, and traffic rules. As a concrete example, Waymo's robotaxi ODD has historically been structured roads in geofenced metro areas, below roughly 70 mph, in defined weather. A vehicle outside its ODD (an unmapped construction reroute, a snowstorm beyond the certified envelope) must detect the boundary and execute a minimal-risk maneuver rather than continue.
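As a rough illustration, an ODD can be written as a machine-checkable envelope so that leaving it is an explicit event rather than a silent one. The sketch below is only that: the `OddEnvelope` class, its field names, the condition keys, and the example limits are all assumptions made for illustration, not any operator's real ODD definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OddEnvelope:
    """Hypothetical ODD: every field and value here is an illustrative assumption."""
    road_types: frozenset
    max_speed_limit_mph: float
    weather: frozenset
    lighting: frozenset
    geofence: frozenset  # names of mapped, approved operating areas


def odd_violations(odd, conditions):
    """Return the names of ODD dimensions that the current conditions violate."""
    checks = {
        "road_type": conditions["road_type"] in odd.road_types,
        "speed_limit": conditions["speed_limit_mph"] <= odd.max_speed_limit_mph,
        "weather": conditions["weather"] in odd.weather,
        "lighting": conditions["lighting"] in odd.lighting,
        "geofence": conditions["area"] in odd.geofence,
    }
    return [name for name, ok in checks.items() if not ok]


def next_action(odd, conditions):
    """Continue inside the ODD; otherwise request a minimal-risk maneuver."""
    return "minimal_risk_maneuver" if odd_violations(odd, conditions) else "continue"


odd = OddEnvelope(
    road_types=frozenset({"urban_arterial", "residential"}),
    max_speed_limit_mph=45.0,
    weather=frozenset({"clear", "light_rain"}),
    lighting=frozenset({"day"}),
    geofence=frozenset({"metro_core"}),
)
inside = {"road_type": "residential", "speed_limit_mph": 25, "weather": "clear",
          "lighting": "day", "area": "metro_core"}
snowstorm = dict(inside, weather="snow")

print(next_action(odd, inside))        # continue
print(odd_violations(odd, snowstorm))  # ['weather']
print(next_action(odd, snowstorm))     # minimal_risk_maneuver
```

Returning the violated dimensions, rather than a bare boolean, keeps the reason for the ODD exit available for the closed-loop log.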

SOTIF (ISO 21448)

Classic functional safety (ISO 26262) handles faults: a sensor breaks, a wire shorts. SOTIF handles the harder AV problem of hazards with no fault at all, where every component works as designed but the function is insufficient. Its categories include performance limitations (the perception system genuinely cannot resolve a pedestrian in glare) and foreseeable misuse (a driver over-trusts a driver-assist feature). SOTIF drives you to shrink the space of unknown-unsafe scenarios through analysis and testing.

Safety case structure

A safety case is a traceable chain: hazard (unintended braking on the highway) leads to cause (false-positive detection of a phantom obstacle) leads to mitigation (multi-frame confirmation plus radar cross-check before braking) leads to residual risk (the quantified, accepted remainder after mitigation). Every link must point to evidence: a scenario suite, closed-loop logs, and a metric, not merely a claim.
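One way to make "every link must point to evidence" checkable is to store the chain as data and flag any link whose evidence list is empty. This is a minimal sketch under assumptions: the `Link` and `HazardChain` classes and the identifiers `hazard_analysis_H12` and `log_S1` are hypothetical names invented here, not part of any assurance-case tool.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One link in the chain: a claim plus the evidence artifacts behind it."""
    claim: str
    evidence: list = field(default_factory=list)  # e.g. log or report identifiers


@dataclass
class HazardChain:
    hazard: Link
    cause: Link
    mitigation: Link
    residual_risk: Link

    def unsupported_links(self):
        """Names of links that make a claim without pointing to any evidence."""
        links = {
            "hazard": self.hazard,
            "cause": self.cause,
            "mitigation": self.mitigation,
            "residual_risk": self.residual_risk,
        }
        return [name for name, link in links.items() if not link.evidence]


# Hypothetical evidence identifiers, used only to illustrate the structure.
chain = HazardChain(
    hazard=Link("unintended braking on the highway", ["hazard_analysis_H12"]),
    cause=Link("false-positive detection of a phantom obstacle", ["log_S1"]),
    mitigation=Link("multi-frame confirmation plus radar cross-check", ["log_S1"]),
    residual_risk=Link("accepted remainder after mitigation"),  # no evidence yet
)
print(chain.unsupported_links())  # ['residual_risk']
```

An empty result would mean only that every link cites something; whether the cited evidence is adequate remains a review question.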

Paper Spotlight: Waymo Open Dataset

"Scalability in Perception for Autonomous Driving: Waymo Open Dataset" (Sun et al., CVPR 2020). This dataset and benchmark show how a real AV company structures evaluation: large-scale, diverse, multi-sensor (LiDAR plus multiple cameras) data with high-quality 3D labels, split across geographies and conditions, and scored with metrics that weight difficult cases (distant, occluded, rare classes). The lesson for safety is methodological: scale and diversity of evaluation data, plus difficulty-aware metrics, are how companies turn "it works in the demo" into quantitative evidence about the long tail. Real safety measurement is dominated by the rare, hard cases the aggregate score can hide.

Coverage Is The Argument, Not The Leaderboard

A stack can top a perception leaderboard and still lack a defensible safety case if its test scenarios do not cover the real ODD. The decisive question is not "what is the score?" but "which hazards in the ODD are exercised by the evidence, and what residual risk remains for the rest?" Tie every test to a hazard, and every hazard to a mitigation with logged evidence.

Mechanism

Scenario testing operationalizes the safety case. A scenario binds an ODD slice, an actor configuration, a map, weather and lighting, an ego behavior, a safety metric (minimum gap, time-to-collision, collision flag), and an evidence artifact (the closed-loop log). Coverage is measured against the hazard list and the ODD parameter space. Residual risk is what remains in that space after mitigation: slices that are still untested, where the risk is unknown rather than zero, and slices where the mitigation works only probabilistically. A safety case should point at the logs, not just the scenario names.
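The safety metrics named above can be reduced from a closed-loop trace in a few lines. The sketch below assumes a simple car-following trace and a constant-velocity time-to-collision (gap divided by closing speed); the frame keys `gap_m`, `ego_mps`, and `lead_mps` are a hypothetical log format, and real stacks often use richer TTC definitions.

```python
import math


def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def safety_metrics(trace):
    """Reduce a closed-loop trace to the per-scenario metrics a safety case cites."""
    ttcs = [time_to_collision(f["gap_m"], f["ego_mps"], f["lead_mps"]) for f in trace]
    gaps = [f["gap_m"] for f in trace]
    return {
        "min_gap_m": min(gaps),
        "min_ttc_s": min(ttcs),
        "collision": min(gaps) <= 0.0,
    }


# Hypothetical three-frame trace of a lead vehicle braking ahead of the ego vehicle.
trace = [
    {"gap_m": 30.0, "ego_mps": 20.0, "lead_mps": 20.0},  # not closing
    {"gap_m": 24.0, "ego_mps": 20.0, "lead_mps": 12.0},  # closing at 8 m/s
    {"gap_m": 18.0, "ego_mps": 14.0, "lead_mps": 10.0},  # closing at 4 m/s
]
print(safety_metrics(trace))
# {'min_gap_m': 18.0, 'min_ttc_s': 3.0, 'collision': False}
```

Note that the minimum TTC (3.0 s, second frame) and the minimum gap (18 m, third frame) occur at different times, which is one reason to log both.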

Worked Example

The example builds a tiny safety-case checker: it scores closed-loop scenario logs against a hazard list, flags scenarios whose minimum time-to-collision violates the safety threshold, and reports ODD coverage so an untested slice cannot pass silently.

# Each scenario log records its ODD slice, the min time-to-collision (s),
# and which hazard it was designed to exercise.
logs = [
    {"id": "S1", "odd": "urban_day_rain",  "min_ttc": 2.4, "hazard": "phantom_brake"},
    {"id": "S2", "odd": "urban_day_clear", "min_ttc": 0.8, "hazard": "cut_in"},
    {"id": "S3", "odd": "highway_day",     "min_ttc": 3.1, "hazard": "lead_brake"},
]
required_odd = {"urban_day_clear", "urban_day_rain", "highway_day", "urban_night"}
TTC_FLOOR = 1.5  # seconds; a scenario whose min TTC falls below this fails

# Evaluate each scenario against the safety threshold.
violations = [s for s in logs if s["min_ttc"] < TTC_FLOOR]
for s in logs:
    status = "FAIL" if s["min_ttc"] < TTC_FLOOR else "ok"
    print(f"{s['id']} [{s['odd']:>16}] hazard={s['hazard']:<13} "
          f"min_ttc={s['min_ttc']:.1f}s -> {status}")

# ODD coverage: which certified slices have NO evidence at all.
covered = {s["odd"] for s in logs}
uncovered = required_odd - covered
print("\nThreshold violations:", [s["id"] for s in violations])
print("Uncovered ODD slices (no evidence):", sorted(uncovered))
print("Safety case PASSES:", not violations and not uncovered)

Expected output: S2 fails (a cut-in with 0.8 s time-to-collision, below the 1.5 s floor), and urban_night is reported as an uncovered ODD slice. The safety case does not pass, correctly, because there is both a quantified threshold violation and a slice of the required domain with no evidence at all. Either condition alone is enough to fail: even if S2 were fixed, the missing urban_night evidence would still keep the safety case from passing.

Library Shortcut

Use CARLA ScenarioRunner and the OpenSCENARIO standard to author reproducible scenarios, CommonRoad for benchmark scenarios with formal metrics, and the Waymo Open Dataset for difficulty-aware perception evaluation. Safety-case structure can be expressed with the Goal Structuring Notation (GSN) used in assurance-case tooling. Keep one artifact schema linking scenario, metric, and log.

Practical Recipe

1. Write the ODD explicitly: road types, speeds, weather, geography, traffic rules.
2. Enumerate hazards (HAZOP plus SOTIF analysis) and map each to a cause and a mitigation.
3. Author scenarios that exercise each hazard across the ODD parameter space, including the long tail.
4. Run closed-loop, log per-scenario safety metrics, and compute coverage against the hazard list and ODD.
5. Quantify residual risk for what remains untested or only probabilistically mitigated; record it in the safety case.
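The coverage computation in the recipe above can be sketched as a set difference over hazard and ODD-slice pairs. This sketch makes a simplifying assumption worth stating loudly: it treats every hazard as relevant in every ODD slice, which a real hazard analysis would prune. The log identifiers are hypothetical.

```python
from itertools import product

hazards = ["phantom_brake", "cut_in", "lead_brake"]
odd_slices = ["urban_day_clear", "urban_day_rain", "highway_day", "urban_night"]

# Hypothetical executed scenarios: (hazard, ODD slice, closed-loop log identifier).
executed = [
    ("phantom_brake", "urban_day_rain", "log_S1"),
    ("cut_in", "urban_day_clear", "log_S2"),
    ("lead_brake", "highway_day", "log_S3"),
]

# Assumption: every hazard must be exercised in every ODD slice.
required = set(product(hazards, odd_slices))
covered = {(h, o) for h, o, _ in executed}
missing = sorted(required - covered)

coverage = len(covered & required) / len(required)
print(f"coverage: {len(covered & required)}/{len(required)} = {coverage:.0%}")
for hazard, odd in missing:
    print(f"no evidence: hazard={hazard} odd={odd}")
```

Three scenarios that each touch a different slice look like broad coverage by slice name alone, yet they exercise only 3 of the 12 hazard and slice pairs.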

Common Failure Mode

The benchmark passes while the safety case lacks evidence for the real ODD. A team reports a 99 percent scenario pass rate, but every scenario is clear-day urban; night, rain, and construction zones are absent. The aggregate number hides an uncovered slice where residual risk is unknown. Always report coverage gaps as loudly as pass rates.
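A report that prints per-slice counts next to the aggregate makes this failure mode hard to miss. The following is a sketch with invented data: the slice names and the 99-of-100 result set are hypothetical, chosen to mirror the situation described above.

```python
from collections import defaultdict

# Hypothetical results: 99 passes and 1 failure, all in a single ODD slice.
results = [("urban_day_clear", True)] * 99 + [("urban_day_clear", False)]
required_odd = ["urban_day_clear", "urban_day_rain", "urban_night", "construction_zone"]


def coverage_report(results, required_odd):
    """Per-slice (passed, total) counts, with zero-run slices kept visible."""
    counts = defaultdict(lambda: [0, 0])
    for odd, passed in results:
        counts[odd][0] += int(passed)
        counts[odd][1] += 1
    return {odd: tuple(counts[odd]) for odd in required_odd}


report = coverage_report(results, required_odd)
total_passed = sum(p for p, _ in report.values())
total_run = sum(n for _, n in report.values())
print(f"aggregate pass rate: {total_passed / total_run:.0%}")
for odd, (passed, total) in report.items():
    rate = f"{passed / total:.0%}" if total else "NO EVIDENCE"
    print(f"{odd:>18}: {passed}/{total} {rate}")
```

The aggregate line reads 99%, while three of the four required slices print NO EVIDENCE. Iterating over the required slices, rather than over the slices that happen to appear in the results, is what keeps the gaps in the report.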

Practical Example

Construction-zone map staleness is a recurring SOTIF performance limitation: the HD map shows a lane that is now coned off. The safety case must list this hazard, mitigate it (online detection of cones and lane closures overriding the map), and quantify residual risk (the fraction of construction layouts the detector still mishandles), with scenario logs as evidence.
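Quantifying "the fraction of construction layouts the detector still mishandles" needs more than a point estimate, because a finite scenario suite can understate the true rate. One common choice, shown here as a sketch and not as the method any standard prescribes, is a one-sided exact binomial (Clopper-Pearson) upper bound. It assumes the tested layouts are independent and representative of deployment, which is a strong assumption for hand-authored scenarios. With zero failures in n runs and significance level alpha, the bound reduces to 1 - alpha^(1/n). The counts below are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def failure_rate_upper_bound(failures, trials, confidence=0.95):
    """One-sided exact (Clopper-Pearson) upper bound on the true failure rate.

    Assumes independent, representative trials; found by bisection.
    """
    if failures >= trials:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # binom_cdf decreases as p grows, so bisection applies
        mid = (lo + hi) / 2
        if binom_cdf(failures, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


layouts_tested, layouts_mishandled = 200, 4  # hypothetical scenario counts
point = layouts_mishandled / layouts_tested
upper = failure_rate_upper_bound(layouts_mishandled, layouts_tested)
print(f"point estimate: {point:.1%}, 95% upper bound: {upper:.1%}")

# Zero observed failures still leaves a nonzero bound on the failure rate.
print(f"0 failures in 300 runs: upper bound {failure_rate_upper_bound(0, 300):.2%}")
```

Recording the upper bound, along with the sample size and the independence assumption behind it, gives the safety case a residual-risk figure that a reviewer can challenge.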

Memory Hook

Hazard, cause, mitigation, residual risk: four links in one chain. A safety case is only as strong as the link with no evidence behind it.

Research Frontier

Scenario generation that automatically mines the long tail (adversarial and rare-event search), and formal methods that bound residual risk over a continuous ODD rather than a finite test list, are the frontier of AV assurance. The goal is coverage arguments that scale beyond hand-authored scenarios.

Self Check

Can you trace one concrete highway hazard through cause, mitigation, and residual risk, and name the evidence each link needs? If not, the safety case is still a slogan, not an argument.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
CARLA ScenarioRunner, OpenSCENARIO	Reproducible scenario authoring	Tie each scenario to a hazard in the safety case.
CommonRoad	Benchmark scenarios with formal metrics	Use for comparable closed-loop safety metrics.
Waymo Open Dataset	Difficulty-aware perception evaluation	Report long-tail and per-difficulty metrics, not only aggregates.

Cross-References

Section 48.1 frames the closed loop a safety case must cover, Section 48.5 supplies world models for generating rare scenarios, and Section 48.9 turns this assurance structure into closed-loop evaluation metrics.

Mini Lab

Extend the worked checker to weight violations by exposure (how often each ODD slice occurs in deployment) and produce a single residual-risk number. Then add an urban_night log and confirm the safety case flips to passing only when both coverage and the threshold are satisfied.

Section References

Sun et al., "Scalability in Perception for Autonomous Driving: Waymo Open Dataset," CVPR 2020.
ISO 21448:2022, "Road Vehicles: Safety of the Intended Functionality (SOTIF)."
Koopman and Wagner, "Challenges in Autonomous Vehicle Testing and Validation," SAE 2016.

These define the benchmark methodology, the SOTIF standard, and the validation-coverage challenges underlying AV safety cases.

Key Takeaway

Safety assurance is an argument that traces every ODD hazard through cause, mitigation, and residual risk, backed by closed-loop scenario evidence. A high benchmark score is one piece of that evidence, never a replacement for coverage of the real operational domain.

Exercise 48.6.1

Pick one ODD (geofenced urban, below 45 mph, day, light rain). Enumerate five hazards, map each to a cause and mitigation, design a scenario per hazard with a safety metric and threshold, and state the residual risk that remains if all five scenarios pass.