Section 53.3: Out-of-distribution detection | Building Embodied AI: From Perception to Autonomous Action

A robust robot is not the one that never sees surprise; it is the one that notices surprise early enough to act differently.
A Runtime Monitoring Engineer

Big Picture

OOD detection asks whether the current input or state belongs to the support on which the system was tuned and evaluated. In embodied settings, this matters because extrapolating silently can create physically irreversible mistakes.

**Figure 53.3.1**: An OOD detector draws a boundary around known operating support and routes suspicious states toward caution or human review.

Why This Matters

Out-of-distribution detection is useful only when it distinguishes disturbance sources and ties them to specific corrective actions. Robustness is not a single scalar; it is a map from perturbation class to degraded behavior, detection delay, and residual risk.

Given an OOD score $s(x)$, where larger values indicate greater novelty, a detector triggers when $$s(x) > \tau,$$ where $\tau$ is chosen by weighing the cost of missed OOD events against the cost of unnecessary interventions. The right threshold depends on what action becomes available when the alert fires.
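One way to make this tradeoff concrete is to sweep candidate thresholds over a labeled development panel and keep the one with the lowest expected cost. The following is a minimal sketch under stated assumptions: the scores, labels, and the two cost constants (`COST_MISS`, `COST_FALSE_ALARM`) are hypothetical values chosen for illustration, not measurements from a real robot.

```python
# Hypothetical development panel: OOD scores with ground-truth labels
# (True = out-of-distribution). All values are illustrative assumptions.
scores = [0.12, 0.18, 0.22, 0.35, 0.41, 0.58, 0.74, 0.81]
is_ood = [False, False, False, False, True, False, True, True]

# Assumed costs in action terms: a missed OOD event is taken to be far
# more expensive than an unnecessary intervention such as a slowdown.
COST_MISS = 20.0
COST_FALSE_ALARM = 1.0


def expected_cost(tau, scores, is_ood):
    """Average cost per state when the detector fires on s(x) > tau."""
    total = 0.0
    for s, ood in zip(scores, is_ood):
        fired = s > tau
        if ood and not fired:
            total += COST_MISS          # missed OOD event
        elif fired and not ood:
            total += COST_FALSE_ALARM   # unnecessary intervention
    return total / len(scores)


# Candidate thresholds: 0.0 (fire on everything) plus each observed score.
candidates = [0.0] + sorted(scores)
best_tau = min(candidates, key=lambda tau: expected_cost(tau, scores, is_ood))
print(best_tau, expected_cost(best_tau, scores, is_ood))  # 0.35 0.125
```

With these assumed costs the sweep accepts one false alarm (the in-support state at 0.58) in order to catch the OOD state at 0.41. Changing the cost ratio moves the threshold, which is the sense in which the threshold depends on the available action.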

Key Insight

OOD detection is not useful because it labels novelty abstractly. It is useful because it decides when to slow down, replan, switch sensors, or hand control to a safer subsystem.

Algorithmic View

Choose an OOD score, such as energy, reconstruction error, distance in feature space, or conformal nonconformity.
Build an in-support panel and at least one clearly out-of-support panel tied to deployment concerns.
Measure false positive and false negative costs in action terms, not only in ROC space.
Attach a behavior policy to the alert: degrade, stop, seek more information, or escalate to a human.
Review false alarms to see whether the support definition is wrong or the score is too noisy.
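The first three steps above can be sketched with a distance-based score. This is an illustrative sketch, not a reference implementation: the 2-D points stand in for embeddings from a feature extractor, the helper names (`knn_score`, `sample`) and the threshold `TAU` are assumptions, and a real system would likely use an approximate nearest-neighbor index rather than a brute-force search.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic panels are reproducible


def knn_score(x, bank, k=3):
    """OOD score: mean Euclidean distance from x to its k nearest bank features."""
    dists = sorted(math.dist(x, b) for b in bank)
    return sum(dists[:k]) / k


def sample(center, n, spread=0.3):
    """Draw n synthetic 2-D 'embeddings' around a center (stand-in for real features)."""
    return [(random.gauss(center[0], spread), random.gauss(center[1], spread))
            for _ in range(n)]


bank = sample((0.0, 0.0), 200)      # features saved from known operating support
in_panel = sample((0.0, 0.0), 50)   # held-out in-support panel
out_panel = sample((3.0, 3.0), 50)  # clearly out-of-support panel

TAU = 0.5  # hypothetical threshold; in practice chosen from a cost model

in_scores = [knn_score(x, bank) for x in in_panel]
out_scores = [knn_score(x, bank) for x in out_panel]

# Error rates at the operating point, which can then be priced in action terms.
false_positive_rate = sum(s > TAU for s in in_scores) / len(in_scores)
false_negative_rate = sum(s <= TAU for s in out_scores) / len(out_scores)
print({"fpr": false_positive_rate, "fnr": false_negative_rate})
```

The out-of-support panel here is deliberately easy. A panel tied to deployment concerns would sit much closer to the support boundary, and that is where the two error rates begin to trade against each other.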

Worked Example

A warehouse robot that sees a reflective floor patch unlike anything in training should not continue as if the scene were ordinary. Even a blunt OOD alert can be valuable if it triggers a lower-speed navigation mode.

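# Illustrative OOD scores for five states; higher means more novel.
scores = [0.12, 0.18, 0.22, 0.74, 0.81]
# Alert threshold: a state is flagged when its score exceeds this value.
threshold = 0.5
flags = [score > threshold for score in scores]
print({"threshold": threshold, "flags": flags, "flag_rate": sum(flags) / len(flags)})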

{'threshold': 0.5, 'flags': [False, False, False, True, True], 'flag_rate': 0.4}

Code Fragment 53.3.1 applies a simple OOD threshold, illustrating the interface between score and intervention policy.

Expected output: Two states are flagged as OOD. The real question is what the robot does next, which is why the detector should always be evaluated together with its downstream intervention logic.
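To show what "what the robot does next" could look like, the sketch below maps each score to a navigation mode. The mode names and the second threshold (`STOP_THRESHOLD`) are hypothetical; only the 0.5 alert threshold and the five scores come from the example above.

```python
from enum import Enum


class NavMode(Enum):
    NORMAL = "normal"
    SLOW = "slow"
    STOP_AND_ESCALATE = "stop_and_escalate"


SLOW_THRESHOLD = 0.5  # alert threshold used in the worked example
STOP_THRESHOLD = 0.8  # hypothetical second threshold for a stronger response


def select_mode(score):
    """Map an OOD score to a behavior; the most severe matching response wins."""
    if score > STOP_THRESHOLD:
        return NavMode.STOP_AND_ESCALATE
    if score > SLOW_THRESHOLD:
        return NavMode.SLOW
    return NavMode.NORMAL


scores = [0.12, 0.18, 0.22, 0.74, 0.81]
modes = [select_mode(s).value for s in scores]
print(modes)
# ['normal', 'normal', 'normal', 'slow', 'stop_and_escalate']
```

Separating the score from the mode selection keeps the intervention policy testable on its own, so a threshold change can be reviewed as a behavior change rather than as a number.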

Library Shortcut

Feature-space detectors, PyOD-style baseline suites, conformal wrappers, and replay dashboards reduce the friction of threshold sweeps and failure review. The detector still needs a task-specific cost model to be meaningful.

Concrete stack anchors for this chapter include PyTorch or JAX feature extractors for saving embeddings, OpenCV and Open3D replay views for checking whether novelty is visual or geometric, PyOD-style OOD baselines and FAISS-like indices for score comparison, Weights & Biases or TensorBoard for threshold sweeps, and ROS 2 diagnostics when the OOD signal triggers stop, slow, relocalize, or human-review behavior.

OOD Tool Anchors

Score Family	Typical Tooling	Why It Helps
Distance-based	Feature banks and nearest-neighbor search, often backed by FAISS-like indices.	Fast checks for whether the current embedding resembles known support.
Energy or margin-based	Simple model-output baselines, often compared inside PyOD-style evaluation suites.	Cheap deployment monitors for silent extrapolation.
Conformal	Coverage-oriented wrappers around an existing predictor.	Ties the threshold to a target false-alarm rate on in-support data, which makes the tradeoff between misses and conservative alerts explicit.
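
As an illustration of the conformal row, the sketch below converts any OOD score into a conformal p-value using a calibration set of in-support scores. The calibration values and the level `ALPHA` are hypothetical. The guarantee that in-support inputs are flagged with probability at most `ALPHA` holds only if calibration and deployment scores are exchangeable, which is an assumption that needs checking on a real platform.

```python
def conformal_p_value(score, calibration_scores):
    """Fraction of calibration scores at least as extreme as `score` (with +1 correction)."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(c >= score for c in calibration_scores)
    return (1 + at_least_as_extreme) / (n + 1)


# Hypothetical OOD scores computed on held-out in-support data.
calibration_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28,
                      0.30, 0.33, 0.35, 0.38, 0.40, 0.42, 0.45, 0.48, 0.55]

ALPHA = 0.1  # assumed tolerated false-alarm rate on in-support inputs


def is_ood(score):
    """Flag the input when its conformal p-value is at or below ALPHA."""
    return conformal_p_value(score, calibration_scores) <= ALPHA


for s in [0.20, 0.46, 0.74]:
    print(s, conformal_p_value(s, calibration_scores), is_ood(s))
```

The threshold is now expressed as a rate (`ALPHA`) instead of a raw score value, which is easier to negotiate with the people who own the intervention policy.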

OOD signals are especially useful when paired with task context. A scene can be novel but harmless, or common-looking but dangerous because the action consequences are unusual. The detector should be interpreted through the active task and control state. In practice, teams often compare simple distance-based, energy-based, and reconstruction-based scores before promoting one score into the runtime monitor.
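A score comparison of this kind can be sketched in a few lines. The example computes an energy score and a maximum-softmax-probability score from classifier logits, following the two score families in the section references, and compares them with a rank-based AUROC. The logits are made-up values, and the function names are illustrative rather than taken from any library.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_i exp(f_i(x) / T)); higher suggests more OOD."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


def msp_score(logits):
    """One minus the maximum softmax probability; higher suggests more OOD."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return 1.0 - max(exps) / sum(exps)


def auroc(in_scores, out_scores):
    """Probability that a random OOD sample outscores a random in-support sample."""
    wins = 0.0
    for o in out_scores:
        for i in in_scores:
            wins += 1.0 if o > i else (0.5 if o == i else 0.0)
    return wins / (len(in_scores) * len(out_scores))


# Hypothetical logits: confident predictions in support, flat ones out of support.
in_logits = [[6.0, 0.5, 0.1], [5.2, 1.0, 0.3], [0.2, 4.8, 0.9]]
out_logits = [[1.1, 0.9, 1.0], [0.4, 0.6, 0.2], [2.0, 1.8, 1.9]]

for name, fn in [("energy", energy_score), ("msp", msp_score)]:
    in_s = [fn(z) for z in in_logits]
    out_s = [fn(z) for z in out_logits]
    print(name, auroc(in_s, out_s))
```

On a toy panel this clean, both scores separate the two groups perfectly, so the comparison only becomes informative on panels drawn from real deployment logs. AUROC is used here as a screening step before the cost-based evaluation, not as a replacement for it.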

The deployed detector should preserve three linked records: the PyTorch or JAX feature vector that produced the score, the OpenCV or Open3D replay evidence that explains the scene, and the ROS 2 event that changed behavior. Without those links, an OOD threshold is hard to tune and nearly impossible to debug after a near miss.
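One possible shape for such a linked record is sketched below. Every field name, identifier, and path is hypothetical; the point is only that a single record carries pointers to the feature vector, the replay evidence, and the behavior event, so that all three can be retrieved together during review.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OODEventRecord:
    """Links one OOD alert to its evidence. All field names are illustrative."""
    event_id: str
    timestamp_s: float
    score: float
    threshold: float
    feature_uri: str     # where the embedding that produced the score is stored
    replay_uri: str      # where the replay evidence for the scene is stored
    behavior_event: str  # the behavior change the alert triggered


record = OODEventRecord(
    event_id="evt-0001",                # hypothetical identifier
    timestamp_s=1024.5,
    score=0.74,
    threshold=0.5,
    feature_uri="features/evt-0001",    # hypothetical storage path
    replay_uri="replay/evt-0001",       # hypothetical storage path
    behavior_event="slow_mode_entered",
)

# One JSON object per line keeps the log easy to append to and to filter.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Storing the threshold alongside the score matters because thresholds change over time, and a later reviewer needs to know which value was active when the alert fired.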

The classic mistake is to report AUROC without defining the operational meaning of false alarms and misses. In deployment, the question is whether the alert changes behavior appropriately, not whether a curve looks elegant.

Cross-References

This section pairs naturally with Section 53.4 on runtime monitoring and Section 54.1 on embodied safety, because OOD alerts often become one input to a broader safety supervisor.

Lab Recipe

Choose one OOD score for a robot perception or planning state, define a threshold on a development panel, and inspect whether the resulting alerts would have prevented any previously observed failures.
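A minimal way to run the inspection step is to replay logged score traces and check whether an alert would have fired before each recorded failure. The episode names, score traces, and failure steps below are invented for illustration. An alert that precedes a failure is necessary for prevention but not sufficient, since the intervention still has to be fast and appropriate.

```python
THRESHOLD = 0.5  # threshold chosen on the development panel (assumed value)

# Hypothetical logged episodes: per-step OOD scores and the step at which a
# failure was observed (None when the episode ended without failure).
episodes = [
    {"name": "reflective_floor", "scores": [0.1, 0.2, 0.6, 0.7, 0.9], "failure_step": 4},
    {"name": "worn_wheel", "scores": [0.1, 0.1, 0.2, 0.2, 0.3], "failure_step": 3},
    {"name": "nominal_run", "scores": [0.1, 0.6, 0.2, 0.1, 0.1], "failure_step": None},
]


def first_alert_step(scores, threshold):
    """Index of the first step whose score exceeds the threshold, or None."""
    for step, score in enumerate(scores):
        if score > threshold:
            return step
    return None


def review(episode, threshold):
    """Classify an episode as quiet, false alarm, missed, or alerted with lead time."""
    alert = first_alert_step(episode["scores"], threshold)
    failure = episode["failure_step"]
    if failure is None:
        return "false_alarm" if alert is not None else "quiet"
    if alert is not None and alert < failure:
        return f"alert_led_by_{failure - alert}_steps"
    return "missed"


report = {ep["name"]: review(ep, THRESHOLD) for ep in episodes}
print(report)
# {'reflective_floor': 'alert_led_by_2_steps', 'worn_wheel': 'missed', 'nominal_run': 'false_alarm'}
```

The missed `worn_wheel` episode is the interesting row: a perception-based score does not see a failure that comes from wear, which suggests the support definition is missing a dimension rather than that the threshold is wrong.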

Failure Mode

Do not evaluate OOD detectors only on synthetic novelty if your deployment failures come from timing, wear, or control mismatch. Novelty needs to be defined around the real support boundary that matters.

Practical Example

For drones, unfamiliar weather or lighting may be the relevant OOD family. For manipulation, unusual object compliance or unusual contact configuration may matter more than pixel novelty alone.

Research Frontier

The frontier includes sequential OOD detection, active information gathering after novelty alerts, and joint novelty scores that combine perception, dynamics, and map uncertainty rather than treating each in isolation.

Self Check

If your detector fires, what exact behavior changes? If the answer is vague, the detector is still an analytic curiosity rather than a deployment tool.

Key Takeaway

OOD detection matters when it defines a boundary of trust and hands control to a safer behavior before silent extrapolation becomes damage.

Exercise 53.3.1

Define an OOD notion for your platform and propose a threshold policy. Then describe one false alarm you would accept and one missed alarm you would consider unacceptable.

Section References

Hendrycks, D., and Gimpel, K. "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks." (2017). https://arxiv.org/abs/1610.02136

A foundational starting point for simple OOD scoring.

Liu, W. et al. "Energy-based Out-of-distribution Detection." (2020). https://arxiv.org/abs/2010.03759

A widely used modern OOD score family.

What's Next

Section 53.4 closes the chapter by integrating uncertainty and OOD signals into runtime monitors and fail-safe state transitions.