"A perception failure is understood only when its downstream action failure is named."
A Patient Embodied AI Agent
This section, When perception failures become action failures, maps visual errors to their downstream consequences. The goal is not merely to say that perception failed, but to identify whether calibration, recognition, uncertainty, latency, tracking, or the action interface caused the robot to choose badly.
Problem First: Why This Representation Exists
The section treats perception bugs as system bugs. A wrong label, stale mask, scale error, or delayed estimate matters because it changes a trajectory, grasp, stop decision, or recovery behavior.
The contract here maps perception evidence to failure attribution: raw input, intermediate representation, action consumer, chosen command, observed failure, and the earliest detectable warning.
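One way to make this contract concrete is a single record type with one field per item in the list above. The sketch below is illustrative only: the class name `FailureAttribution`, its field names, and every example value are assumptions, not part of any existing logging schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class FailureAttribution:
    """One replayable row linking perception evidence to an action failure."""
    raw_input: str          # recorded sensor data the replay starts from
    representation: str     # intermediate estimate handed downstream
    action_consumer: str    # module that acted on the estimate
    chosen_command: str     # command that was actually issued
    observed_failure: str   # what went wrong physically
    earliest_warning: str   # first signal that could have flagged the problem


# Hypothetical example values for a missed grasp.
record = FailureAttribution(
    raw_input="rgbd_frame_0412",
    representation="object pose estimate",
    action_consumer="grasp planner",
    chosen_command="close gripper at estimated pose",
    observed_failure="grasp missed the object",
    earliest_warning="pose uncertainty above allowed width",
)

# Serialize so the record can be stored next to the replay log.
record_json = json.dumps(asdict(record), sort_keys=True)
print(record_json)
```

Keeping all six fields mandatory means a record cannot be filed without naming the consumer and the earliest warning, which are the two items most often left out of a bug report.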
A failure taxonomy earns value when it tells the team whether to fix sensing, calibration, uncertainty propagation, planning assumptions, or controller safeguards.
Figure 27.7.1 shows this section's perception-to-action contract. Read each edge as a concrete interface that must name units, frame, timestamp, uncertainty, and the consumer that is allowed to act on it.
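The requirement that every edge names its units, frame, timestamp, uncertainty, and consumer can be checked mechanically. The following is a minimal sketch under assumed names: `PerceptionEdge` and `missing_contract_fields` are hypothetical, and the frame and consumer strings are placeholders rather than identifiers from a real robot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerceptionEdge:
    """Hypothetical envelope for one perception-to-action edge."""
    value: float
    units: Optional[str] = None
    frame: Optional[str] = None
    timestamp_s: Optional[float] = None
    uncertainty: Optional[float] = None
    consumer: Optional[str] = None


REQUIRED_FIELDS = ("units", "frame", "timestamp_s", "uncertainty", "consumer")


def missing_contract_fields(edge: PerceptionEdge) -> list[str]:
    """Return the contract fields that the edge fails to name."""
    return [name for name in REQUIRED_FIELDS if getattr(edge, name) is None]


complete = PerceptionEdge(0.42, "m", "base_link", 12.5, 0.01, "grasp_planner")
partial = PerceptionEdge(0.42, units="m", frame="camera_optical")

print(missing_contract_fields(complete))  # []
print(missing_contract_fields(partial))   # ['timestamp_s', 'uncertainty', 'consumer']
```

An edge that reports missing fields is an interface failure waiting to happen, even if the value it carries is correct.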
Mathematical Core
A perception error matters when it crosses an action boundary or consumes the available timing margin.
$\mathrm{fail}= \mathbf 1[d(\hat s,s)>\epsilon_{\mathrm{action}}]\lor \mathbf 1[\Delta t>\Delta t_{\max}]\lor \mathbf 1[\Sigma_{\hat s}\not\subseteq \Sigma_{\mathrm{allowed}}]$
This expression separates magnitude error, latency error, and uncertainty-interface error. A small estimate error can be harmless far from a decision boundary; the same error can be catastrophic near contact.
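The dependence on the decision boundary can be illustrated with a margin that shrinks as the robot approaches contact. This is a sketch under stated assumptions: the linear margin model, the constants `floor_m` and `slope`, and the function names are invented for illustration and do not come from a particular controller.

```python
def action_margin_m(distance_to_contact_m: float) -> float:
    """Hypothetical tolerance: tight near contact, looser at larger standoff."""
    floor_m = 0.005  # assumed tightest tolerance at contact
    slope = 0.10     # assumed growth of tolerance per metre of standoff
    return floor_m + slope * distance_to_contact_m


def crosses_action_boundary(error_m: float, distance_to_contact_m: float) -> bool:
    """First term of the failure expression: d(s_hat, s) > epsilon_action."""
    return error_m > action_margin_m(distance_to_contact_m)


error_m = 0.02  # the same 2 cm estimate error in both cases

far = crosses_action_boundary(error_m, distance_to_contact_m=0.50)   # margin 0.055 m
near = crosses_action_boundary(error_m, distance_to_contact_m=0.05)  # margin 0.010 m

print(far, near)  # False True
```

The estimate error is identical in both calls; only the action threshold changes, which is why the threshold belongs in the failure record alongside the error.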
- Replay the raw sensor stream and verify calibration, timestamps, and transforms.
- Compare model output with a task-level counterfactual action.
- Check whether uncertainty was published and consumed by the planner or controller.
- Assign the failure to sensing, representation, timing, action selection, control, or evaluation.
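The final step of the checklist can be sketched as a small function that walks the stages in the order the list names them. Attributing the failure to the earliest failing stage is an assumed policy, and the names `STAGES`, `attribute_failure`, and `replay_checks` are hypothetical.

```python
# Stage names come from the last checklist item, in the order listed there.
STAGES = ("sensing", "representation", "timing",
          "action selection", "control", "evaluation")


def attribute_failure(checks: dict[str, bool]) -> str:
    """Return the earliest stage whose replay check failed.

    `checks` maps a stage name to True when that stage passed its check.
    Stages absent from `checks` are treated as not yet examined and skipped.
    """
    for stage in STAGES:
        if checks.get(stage) is False:
            return stage
    return "unattributed"


# Hypothetical replay results for one failed rollout.
replay_checks = {
    "sensing": True,            # calibration, timestamps, transforms verified
    "representation": True,     # output matched the counterfactual action
    "timing": False,            # estimate arrived after the deadline
    "action selection": False,  # planner also ignored published uncertainty
}

print(attribute_failure(replay_checks))  # timing
```

Later failing stages are still worth recording; the earliest one only indicates where the next experiment should start.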
| Failure Class | Typical Symptoms | Control Risk |
|---|---|---|
| Sensing failure | Blur, glare, missing depth, dropped frames | Bad raw evidence enters every downstream module. |
| Representation failure | Wrong mask, pose, flow, or affordance | Planner receives a plausible but false state. |
| Interface failure | No uncertainty, wrong frame, stale timestamp | Correct perception is consumed incorrectly. |
Worked Miniature
Code Fragment 27.7.1 classifies failures by comparing state error, latency, and uncertainty width against action thresholds. This is the kind of small rule that should appear in replay dashboards.
# Label whether perception crossed an action-relevant failure boundary.
# Separate geometry error, latency error, and uncertainty-interface error.
state_error_m = 0.045
action_margin_m = 0.030
latency_ms = 115
max_latency_ms = 80
uncertainty_m = 0.055
allowed_uncertainty_m = 0.040
labels = []
if state_error_m > action_margin_m:
    labels.append("geometry_error")
if latency_ms > max_latency_ms:
    labels.append("stale_perception")
if uncertainty_m > allowed_uncertainty_m:
    labels.append("uncertainty_too_wide")
print(labels)
The expected output, `['geometry_error', 'stale_perception', 'uncertainty_too_wide']`, shows that the failure is multi-causal, so retraining one vision model would not close the loop by itself. Each label names a different intervention path: recalibrate or refit geometry, reduce latency, or widen the action margin under uncertainty.
A production pipeline should emit these labels from ROS 2 diagnostics, tracing tools, and model telemetry. Frameworks can collect timestamps and message metadata automatically, but the team must define the action boundary and failure taxonomy.
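As a rough illustration of what emitting such labels might look like, the sketch below builds a plain dictionary shaped like a ROS 2 `diagnostic_msgs/DiagnosticStatus` message (level, name, message, key-value pairs). It deliberately avoids importing ROS so it runs anywhere. The function name `labels_to_status`, the status name, and the rule that maps labels to levels are assumptions that a team would replace with its own policy.

```python
# Level constants mirror diagnostic_msgs/DiagnosticStatus: OK=0, WARN=1, ERROR=2, STALE=3.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


def labels_to_status(labels: list[str], measurements: dict[str, float]) -> dict:
    """Build a plain dict shaped like a DiagnosticStatus message."""
    if not labels:
        level, message = OK, "perception within action thresholds"
    elif "stale_perception" in labels:
        # Assumed policy: staleness takes priority over other labels.
        level, message = STALE, ", ".join(labels)
    else:
        level, message = ERROR, ", ".join(labels)
    return {
        "level": level,
        "name": "perception_action_boundary",  # hypothetical status name
        "message": message,
        "values": [{"key": k, "value": str(v)} for k, v in measurements.items()],
    }


status = labels_to_status(
    ["geometry_error", "stale_perception", "uncertainty_too_wide"],
    {"state_error_m": 0.045, "latency_ms": 115, "uncertainty_m": 0.055},
)
print(status["level"], status["message"])
```

The measurements travel with the labels so that a dashboard can show how far each threshold was exceeded, not only that it was.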
The worst failure label is `bad vision`. It hides the specific interface that broke and makes the next experiment less informative.
When an autonomous vehicle brakes late, the audit should separate missed detection, wrong object velocity, delayed perception, planner threshold, and actuator response. Only one of those is solved by retraining a detector.
For failure attribution to work, the perception result must answer what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.
Debugging And Evaluation
Evaluate failure cases with replayable causal records: record sensor stream, perception output, uncertainty, planner input, command, physical outcome, and the smallest counterfactual check.
Perturb one suspected cause at a time, such as calibration, latency, recognition, or tracking, then verify whether the same downstream action failure appears.
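A one-at-a-time perturbation loop can be sketched as follows. The `replay_fails` function is a stand-in for replaying a real rollout, and the repaired values in `perturbations` are assumed for illustration; the recorded numbers reuse the thresholds from the worked miniature.

```python
def replay_fails(sample: dict[str, float]) -> bool:
    """Stand-in for a replayed rollout: True when the action failure recurs."""
    return (sample["state_error_m"] > sample["action_margin_m"]
            or sample["latency_ms"] > sample["max_latency_ms"])


recorded = {"state_error_m": 0.045, "action_margin_m": 0.030,
            "latency_ms": 115, "max_latency_ms": 80}

# Each perturbation repairs exactly one suspected cause (repaired values are assumed).
perturbations = {
    "calibration": {"state_error_m": 0.010},
    "latency": {"latency_ms": 40},
}

results = {}
for cause, fix in perturbations.items():
    replayed = {**recorded, **fix}          # change one cause, keep the rest
    results[cause] = replay_fails(replayed)  # True: the failure still appears

print(results)  # {'calibration': True, 'latency': True}
```

Here the failure survives each single repair, which indicates that neither calibration nor latency alone explains it and that both fixes would be needed in this hypothetical case.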
As perception stacks absorb foundation models, multimodal prompts, and learned world models, failure attribution becomes more important. The frontier is building evaluation artifacts that reveal when a model was wrong, when it was late, and when downstream code ignored its uncertainty.
Chapter 28 lifts everything learned here into three dimensions: the same failure-attribution discipline now applies to point clouds, voxel maps, and neural scene representations, where geometry errors propagate to collision and contact decisions rather than to class labels.
Section References
NVIDIA. Isaac ROS Visual SLAM documentation. https://nvidia-isaac-ros.github.io/repositories_and_packages/isaac_ros_visual_slam/index.html
Illustrates real-time perception components whose odometry output must be monitored for latency and reliability.
OpenCV. Camera calibration and 3D reconstruction documentation. https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html
Calibration failures are a frequent root cause of action-level perception failures.
Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for a case where a perception failure became an action failure? If any one is missing, the analysis is not yet ready for a robot replay log.
A perception failure becomes useful engineering evidence only after it is mapped to the action boundary it crossed and the interface that allowed it through.
Take a failed robot rollout and assign three labels: first bad signal, first bad state estimate, and first bad action. Explain how the fix differs for each label.