Section 42.6: Failure detection and recovery

"Robustness is a recovery policy with receipts."

A Debugger of Real Robots
Figure 42.6A: Recovery begins with typed failure signals, not with vague claims that the policy will figure it out online.

Big Picture

Manipulation systems fail for ordinary reasons: missing the object, colliding, slipping, drifting, timing out, or entering an unrecoverable contact mode. Good systems detect those states early and route to bounded recovery.

This section turns recovery into a first-class subsystem. The robot needs residual tests, timeout tests, progress tests, and contact tests that trigger reobserve, retry, regrasp, or abort.

Recovery also links manipulation to safety and evaluation. Recovery quality is where impressive one-shot demos and trustworthy embodied systems finally part ways.

Action Is The Test

A manipulation stack without explicit recovery is not a robust system. It is a success-only hypothesis that will eventually meet a box, mug, cable, or drawer that refuses to cooperate.

Loop diagram for Section 42.6: Detect (residuals) -> Diagnose (label failure) -> Recover (retry or abort) -> Verify (resume or stop)
Figure 42.6.1: Recovery begins with typed failure signals, not with vague claims that the policy will figure it out online.

Theory

The cleanest abstraction is a failure-state machine layered on top of the manipulation policy. Residuals and progress metrics trigger state transitions, and each transition maps to a bounded recovery primitive.

This structure matters because many manipulation failures are easier to classify than to avoid. Missing the handle and slipping off the handle may both look like task failure, but they require different next actions and different future data collection.

$$ z_t = [e_{\text{pose}}, e_{\text{force}}, e_{\text{vision}}, \Delta q, \Delta x_o],\qquad y_t = \mathbf{1}[r(z_t) > \tau],\qquad b_{t+1} = \mathrm{recover}(b_t, y_t) $$
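The equation can be read as a three-step pipeline: stack the residuals into z_t, score them with r, and feed the thresholded flag y_t into a state update. The sketch below is one hedged reading, not the section's definitive method. It assumes r is a weighted sum of absolute values, that the two progress terms (\Delta q and \Delta x_o) carry negative weights so that motion lowers the score, and that the recovery state b_t is one of two strings. The weights, the threshold, and the state names are all hypothetical.

```python
from typing import Sequence

# Hypothetical weights for [e_pose, e_force, e_vision, dq, dx_o].
# Progress terms get negative weights: observed motion lowers the failure score.
WEIGHTS = (1.0, 0.5, 1.0, -0.2, -0.2)
TAU = 1.0  # hypothetical threshold

def residual_score(z: Sequence[float]) -> float:
    """r(z_t): a weighted sum of absolute values (one assumed choice of r)."""
    return sum(w * abs(v) for w, v in zip(WEIGHTS, z))

def detect(z: Sequence[float], tau: float = TAU) -> int:
    """y_t = 1[r(z_t) > tau]."""
    return int(residual_score(z) > tau)

def recover(b: str, y: int) -> str:
    """b_{t+1} = recover(b_t, y_t): a toy two-state transition."""
    if y == 1:
        return "recovering"
    # A clean step after a recovery returns the system to nominal execution.
    return "nominal"

z_ok = (0.01, 0.2, 0.05, 0.3, 0.1)   # small residuals, visible progress
z_bad = (0.9, 1.5, 0.4, 0.0, 0.0)    # large residuals, no progress

state = "nominal"
for z in (z_ok, z_bad, z_ok):
    state = recover(state, detect(z))
```

In a real stack the score function would be calibrated on logged data, and the state would carry more than two values, but the shape of the computation stays the same.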

Mechanism

The detector fuses pose, force, grasp width, visual residual, and progress features into a failure label. The recovery layer maps that label into a safe next action such as back out, reopen, reobserve, or skip. Crucially, the logged evidence record stores the first label that fired and the branch that followed, so the decision can be audited later.
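A minimal sketch of that mechanism follows, assuming four hypothetical labels and illustrative thresholds. The label names, the pairing of labels with the recovery actions named above, and every numeric limit are assumptions made for the example, not values from a real system.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional

class Failure(Enum):
    COLLISION = "collision"
    SLIP = "slip"
    OCCLUSION = "occlusion"
    NO_PROGRESS = "no_progress"

# Hypothetical label-to-primitive table using the actions named in the text.
RECOVERY = {
    Failure.COLLISION: "back_out",
    Failure.SLIP: "reopen",
    Failure.OCCLUSION: "reobserve",
    Failure.NO_PROGRESS: "skip",
}

@dataclass(frozen=True)
class Features:
    pose_error_m: float
    force_n: float
    grasp_width_m: float
    visual_residual: float
    progress: float

def first_failure(f: Features) -> Optional[Failure]:
    """Return the first label whose illustrative threshold fires, else None."""
    checks = [
        (Failure.COLLISION, f.force_n > 30.0),
        (Failure.SLIP, f.grasp_width_m < 0.002),  # fingers closed on nothing
        (Failure.OCCLUSION, f.visual_residual > 0.5),
        (Failure.NO_PROGRESS, f.progress < 0.05 or f.pose_error_m > 0.05),
    ]
    for label, fired in checks:
        if fired:
            return label
    return None

def evidence(f: Features) -> dict:
    """Record the first label that fired and the branch that followed."""
    label = first_failure(f)
    return {
        "first_label": label.value if label else None,
        "branch": RECOVERY[label] if label else "continue",
        "features": asdict(f),
    }

slipped = evidence(Features(0.01, 5.0, 0.001, 0.1, 0.5))
nominal = evidence(Features(0.01, 5.0, 0.040, 0.1, 0.5))
```

Keeping the checks in an ordered list makes the priority between labels explicit and easy to review.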

Algorithm: Recovery Router
  1. Define residual features and timeouts before hardware testing begins.
  2. Map each high-confidence failure label to a bounded recovery primitive.
  3. Require every recovery branch to produce a new observation or new configuration before retry.
  4. Abort after repeated identical failures and log the case for replay-driven debugging.
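The four steps above can be sketched as a small loop. This is a hedged illustration under stated assumptions: an integer observation id stands in for "a new observation or new configuration", each recovery primitive is a hypothetical callable that returns the next observation id, and the limits max_identical and max_attempts are placeholder values.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# A primitive takes the old observation id and returns a new one (assumed interface).
Primitive = Callable[[int], int]

def run_with_recovery(
    attempt: Callable[[int], Optional[str]],
    primitives: Dict[str, Tuple[str, Primitive]],
    max_identical: int = 2,
    max_attempts: int = 6,
) -> dict:
    counts: Counter = Counter()
    log: List[Tuple[str, str]] = []  # (failure label, branch taken)
    obs_id = 0
    for _ in range(max_attempts):
        label = attempt(obs_id)
        if label is None:
            return {"status": "success", "log": log}
        counts[label] += 1
        # Step 4: abort on unknown labels or repeated identical failures.
        if label not in primitives or counts[label] >= max_identical:
            log.append((label, "abort"))
            return {"status": "abort", "log": log}
        branch, primitive = primitives[label]
        new_obs_id = primitive(obs_id)
        # Step 3: a branch that yields nothing new may not retry.
        if new_obs_id == obs_id:
            log.append((label, "abort"))
            return {"status": "abort", "log": log}
        log.append((label, branch))
        obs_id = new_obs_id
    log.append(("budget_exhausted", "abort"))
    return {"status": "abort", "log": log}

primitives = {
    "slip": ("regrasp", lambda o: o + 1),
    "occlusion": ("reobserve", lambda o: o + 1),
}
flaky = run_with_recovery(lambda obs: "slip" if obs == 0 else None, primitives)
stuck = run_with_recovery(lambda obs: "slip", primitives)
```

Here flaky succeeds after one regrasp, while stuck aborts on the second identical slip and leaves a log that can be replayed offline.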

Worked Example

# Route a manipulation failure to a bounded recovery branch.
failure = {"slip_score": 0.82, "occlusion": 0.15, "progress": 0.02}

if failure["slip_score"] > 0.7:
    branch = "regrasp"
elif failure["occlusion"] > 0.5:
    branch = "reobserve"
elif failure["progress"] < 0.05:
    branch = "back_out_and_retry"
else:
    branch = "continue"

print({"recovery_branch": branch})
# Output: {'recovery_branch': 'regrasp'}
Code Fragment 42.6.1 turns failure features into a bounded recovery branch instead of leaving the system to thrash under the original command.

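Expected output: The router selects regrasp because the slip score (0.82) exceeds its 0.7 threshold and the slip test runs first. The progress value (0.02) would also trip the no-progress test, so the order of the branches acts as a priority ranking. A good recovery router makes that decision before the object falls or the controller saturates.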

Library Shortcut

BehaviorTree.CPP, ROS 2 actions, and task-execution frameworks are often the right level for recovery orchestration. Learned policies can suggest actions, but the branching and safety limits should stay inspectable.

Practical Recipe

  1. Log residual and progress features at the same rate as the control loop, or at a fixed decimated rate.
  2. Define a failure taxonomy that is small enough to use but rich enough to guide recovery.
  3. Associate every recovery branch with a cost budget in time, retries, and risk.
  4. Store repeated-failure signatures so the same case can be replayed offline.
  5. Measure recovery success separately from nominal task success.
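
Steps 3 and 5 of the recipe can be sketched as follows. The budget numbers, the unitless risk score, and the episode record layout are hypothetical; the branch names are reused from the worked example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Budget:
    max_seconds: float
    max_retries: int
    max_risk: float  # hypothetical unitless risk score in [0, 1]

# Illustrative budgets only; real limits come from the task and the hardware.
BUDGETS: Dict[str, Budget] = {
    "regrasp": Budget(5.0, 2, 0.3),
    "reobserve": Budget(3.0, 3, 0.1),
    "back_out_and_retry": Budget(8.0, 1, 0.5),
}

def within_budget(branch: str, seconds: float, retries: int, risk: float) -> bool:
    """True while a recovery branch stays inside its time, retry, and risk limits."""
    b = BUDGETS[branch]
    return seconds <= b.max_seconds and retries <= b.max_retries and risk <= b.max_risk

def _rate(episodes: List[dict]) -> Optional[float]:
    if not episodes:
        return None
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def success_rates(episodes: List[dict]) -> dict:
    """Report nominal success and recovery success as separate numbers."""
    nominal = [e for e in episodes if not e["recovery_used"]]
    recovered = [e for e in episodes if e["recovery_used"]]
    return {"nominal_success": _rate(nominal), "recovery_success": _rate(recovered)}

episodes = (
    [{"recovery_used": False, "success": True}] * 3
    + [{"recovery_used": False, "success": False}]
    + [{"recovery_used": True, "success": True}]
    + [{"recovery_used": True, "success": False}]
)
rates = success_rates(episodes)
```

Reporting the two rates separately keeps a high nominal success rate from hiding a recovery layer that rarely works.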
Common Failure Mode

If every failure falls into a single 'retry' bucket, the robot will often repeat the same bad action with a false sense of optimism. Recovery needs new information or a changed configuration.

Practical Example

Shelf picking systems often recover by changing the wrist viewpoint, not by grasping again immediately. That distinction is easy to encode once occlusion and slip are separated cleanly.

Memory Hook

Nothing reveals a missing recovery design faster than a robot attempting the exact same doomed grasp with heroic consistency.

Research Frontier

Current work explores learned failure predictors and language-annotated recovery. The enduring engineering requirement is still a bounded branch table that an operator can inspect and trust.

Self Check

Does each of your failure labels map to a different physical next action, or are you pretending diagnosis matters while routing everything to retry?

Recovery exposes one of the deepest embodied-AI differences from static inference. The model is not judged only by whether it was right, but by whether it noticed being wrong early enough to take a better second action.

For teaching, this section is a natural place to introduce failure ledgers. Students learn quickly when every failure trace must include label, branch, outcome, and whether the second attempt failed for the same or a different reason.
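A failure ledger of that kind might look like the sketch below. The field names and the two outcome strings are hypothetical; the four recorded facts (label, branch, outcome, same or different second failure) come from the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class LedgerEntry:
    episode: int
    label: str                          # first failure label that fired
    branch: str                         # recovery branch that followed
    outcome: str                        # "recovered" or "failed" (assumed vocabulary)
    second_label: Optional[str] = None  # label of the second failure, if any

    @property
    def same_reason(self) -> Optional[bool]:
        """True if the second attempt failed for the same reason, False if for a
        different one, None if the second attempt did not fail."""
        if self.second_label is None:
            return None
        return self.second_label == self.label

def repeated_failures(ledger: List[LedgerEntry]) -> List[int]:
    """Episodes where recovery repeated the original failure."""
    return [e.episode for e in ledger if e.same_reason is True]

ledger = [
    LedgerEntry(1, "slip", "regrasp", "recovered"),
    LedgerEntry(2, "slip", "regrasp", "failed", second_label="slip"),
    LedgerEntry(3, "occlusion", "reobserve", "failed", second_label="slip"),
]
```

Episodes returned by repeated_failures are the first candidates for replay, because the recovery branch did not change the failure.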

Practical Tool Choices For This Section
Tool or Library | Role in the Topic | Builder Advice
BehaviorTree.CPP | Recovery orchestration | Use it when you want human-readable branching and preemption semantics.
ROS 2 actions | Interruptible execution | Helpful for reporting progress, cancellation, and task-level retries.
Replay logs | Postmortem analysis | Treat replayability as a requirement, not a nice extra.

Mini Lab

Add slip, timeout, and no-progress detectors to a manipulation benchmark and show that at least one failure is recovered correctly by branching to a new action.

The first question is whether the detector fired early enough. If not, improve the signals it reads. If yes, check whether the branch changed information, geometry, or contact state before retrying.
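As a starting point for the lab, the sketch below runs slip, timeout, and no-progress detectors over a logged trace and returns the time of the first detection, which is the quantity needed to judge whether the detector fired early enough. The trace layout and all thresholds are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed sample layout: (time_s, progress in [0, 1], slip_score in [0, 1]).
Sample = Tuple[float, float, float]

def first_detection(
    trace: List[Sample],
    timeout_s: float = 10.0,
    window_s: float = 2.0,
    min_gain: float = 0.05,
    slip_thresh: float = 0.7,
) -> Optional[Tuple[float, str]]:
    """Return (time, label) for the first detector that fires, or None."""
    for i, (t, progress, slip) in enumerate(trace):
        if slip > slip_thresh:
            return (t, "slip")
        if t > timeout_s:
            return (t, "timeout")
        # Compare against the latest sample at least window_s seconds old.
        past = [p for (tp, p, _) in trace[: i + 1] if tp <= t - window_s]
        if past and progress - past[-1] < min_gain:
            return (t, "no_progress")
    return None

stalled = [(0.0, 0.0, 0.1), (1.0, 0.2, 0.1), (2.0, 0.4, 0.1),
           (3.0, 0.41, 0.1), (4.0, 0.42, 0.1)]
slipping = [(0.0, 0.0, 0.1), (0.5, 0.1, 0.9)]
slow = [(float(t), 0.08 * t, 0.1) for t in range(12)]

results = [first_detection(tr) for tr in (stalled, slipping, slow)]
```

On these synthetic traces the detectors fire at 4.0 s (no progress), 0.5 s (slip), and 11.0 s (timeout), which gives three distinct cases to route to different recovery actions.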

Section References

BehaviorTree.CPP ROS 2 integration

Practical framework for readable task branching and recovery.

ROS 2 actions tutorial

Official action semantics for interruptible and monitorable task execution.

MoveIt 2 Documentation

Useful reference for execution feedback and monitorable motion stages in manipulation pipelines.

Key Takeaway

Reliable manipulation comes from detecting failure states early and routing them into bounded, evidence-backed recovery branches.

Exercise 42.6.1

Write a four-label manipulation failure taxonomy and a matching recovery table. For each label, specify the next observation you need before retrying.