"Robustness is a recovery policy with receipts."
A Debugger of Real Robots
Manipulation systems fail for ordinary reasons: missing the object, colliding, slipping, drifting, timing out, or entering an unrecoverable contact mode. Good systems detect those states early and route to bounded recovery.
This section turns recovery into a first-class subsystem. The robot needs residual tests, timeout tests, progress tests, and contact tests that trigger reobserve, retry, regrasp, or abort.
It links manipulation to safety and evaluation. Recovery quality is where impressive one-shot demos and trustworthy embodied systems finally part ways.
A manipulation stack without explicit recovery is not a robust system. It is a success-only hypothesis that will eventually meet a box, mug, cable, or drawer that refuses to cooperate.
Theory
The cleanest abstraction is a failure-state machine layered on top of the manipulation policy. Residuals and progress metrics trigger state transitions, and each transition maps to a bounded recovery primitive.
This structure matters because many manipulation failures are easier to classify than to avoid. Missing the handle and slipping off the handle may both look like task failure, but they require different next actions and different future data collection.
$$ z_t = [e_{\text{pose}}, e_{\text{force}}, e_{\text{vision}}, \Delta q, \Delta x_o],\qquad y_t = \mathbf{1}[r(z_t) > \tau],\qquad b_{t+1} = \mathrm{recover}(b_t, y_t) $$
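The detector fuses pose, force, and visual residuals with progress features (read here as $\Delta q$ for change in robot configuration, which can include grasp width, and $\Delta x_o$ for object motion) into a failure label $y_t$, which fires when the score $r(z_t)$ exceeds the threshold $\tau$. The recovery layer maps that label into a safe next action such as back out, reopen, reobserve, or skip. Crucially, the evidence artifact stores the first label that fired and the branch that followed.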
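As a minimal sketch of the detector equation, assume that $r$ is the largest normalized residual. The scales, the threshold, and the state names `execute` and `back_out` are hypothetical placeholders, not values from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residuals:
    """Feature vector z_t; fields mirror the symbols in the equation."""
    e_pose: float    # pose residual
    e_force: float   # force residual
    e_vision: float  # visual residual
    dq: float        # change in robot configuration (progress feature)
    dx_o: float      # change in object pose (progress feature)


# Assumed normalization scales and threshold; real values need calibration.
# The progress features are carried in z_t but not scored in this sketch.
SCALES = {"e_pose": 0.02, "e_force": 5.0, "e_vision": 0.3}
TAU = 1.0


def r(z: Residuals) -> float:
    """Hypothetical score: the largest residual after normalization."""
    return max(abs(getattr(z, name)) / scale for name, scale in SCALES.items())


def detect(z: Residuals, tau: float = TAU) -> bool:
    """y_t = 1[r(z_t) > tau]."""
    return r(z) > tau


def recover(b: str, y: bool) -> str:
    """b_{t+1} = recover(b_t, y_t): keep the state unless the detector fired."""
    return "back_out" if y else b


nominal = Residuals(e_pose=0.01, e_force=2.0, e_vision=0.1, dq=0.05, dx_o=0.01)
drifted = Residuals(e_pose=0.05, e_force=2.0, e_vision=0.1, dq=0.05, dx_o=0.01)
print(recover("execute", detect(nominal)))  # execute
print(recover("execute", detect(drifted)))  # back_out
```

Taking the maximum rather than a weighted sum keeps the first offending residual easy to identify, which suits the requirement to log the first label that fired.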
- Define residual features and timeouts before hardware testing begins.
- Map each high-confidence failure label to a bounded recovery primitive.
- Require every recovery branch to produce a new observation or new configuration before retry.
- Abort after repeated identical failures and log the case for replay-driven debugging.
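The checklist above can be sketched as a small routing loop. The label names, the `RECOVERY_TABLE` contents, and the `MAX_REPEATS` limit are illustrative assumptions. Counting repeats per label (rather than only consecutive repeats) is one possible reading of "repeated identical failures". The sketch does not model the requirement for a new observation or configuration before each retry.

```python
from collections import Counter

# Hypothetical mapping from failure label to a bounded recovery primitive.
RECOVERY_TABLE = {
    "slip": "regrasp",
    "occlusion": "reobserve",
    "no_progress": "back_out_and_retry",
}
MAX_REPEATS = 2  # assumed limit on how often the same label may recur


def run_with_recovery(attempt_labels, max_repeats=MAX_REPEATS):
    """Replay a sequence of per-attempt failure labels (None means success).

    Returns (status, trace), where trace lists (label, branch) pairs so the
    first label that fired and the branch that followed are both logged.
    """
    counts = Counter()
    trace = []
    for label in attempt_labels:
        if label is None:
            return "success", trace
        counts[label] += 1
        # Unknown labels and over-repeated labels both route to abort.
        if counts[label] > max_repeats or label not in RECOVERY_TABLE:
            trace.append((label, "abort"))
            return "abort", trace
        trace.append((label, RECOVERY_TABLE[label]))
    return "exhausted", trace  # attempts ran out without success or abort


print(run_with_recovery(["slip", None]))
print(run_with_recovery(["slip", "slip", "slip"]))
```

The first call succeeds after one regrasp. The second aborts on the third slip, and the returned trace is the record to keep for replay-driven debugging.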
Worked Example
# Route a manipulation failure to a bounded recovery branch.
failure = {"slip_score": 0.82, "occlusion": 0.15, "progress": 0.02}
if failure["slip_score"] > 0.7:
    branch = "regrasp"
elif failure["occlusion"] > 0.5:
    branch = "reobserve"
elif failure["progress"] < 0.05:
    branch = "back_out_and_retry"
else:
    branch = "continue"
print({"recovery_branch": branch})
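Expected output: {'recovery_branch': 'regrasp'}. The router picks regrasp because the slip score (0.82) exceeds its 0.7 threshold and that check runs first; without it, the low progress value (0.02) would route to back_out_and_retry. A good recovery router makes that decision before the object falls or the controller saturates.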
BehaviorTree.CPP, ROS 2 actions, and task-execution frameworks are often the right level for recovery orchestration. Learned policies can suggest actions, but the branching and safety limits should stay inspectable.
Practical Recipe
- Log residual and progress features at the same rate as the control loop or a fixed decimated rate.
- Define a failure taxonomy that is small enough to use but rich enough to guide recovery.
- Associate every recovery branch with a cost budget in time, retries, and risk.
- Store repeated-failure signatures so the same case can be replayed offline.
- Measure recovery success separately from nominal task success.
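Two items in the recipe above, per-branch budgets and separate success metrics, can be sketched as follows. The budget numbers, the episode field names (`recovery_used`, `success`), and the exact metric definitions are assumptions made for illustration. Risk budgets are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_seconds: float
    max_retries: int


# Hypothetical budgets per branch; risk limits are not modeled here.
BUDGETS = {
    "regrasp": Budget(max_seconds=8.0, max_retries=2),
    "reobserve": Budget(max_seconds=4.0, max_retries=3),
}


def within_budget(branch: str, elapsed_s: float, retries: int) -> bool:
    """A branch with no declared budget is never allowed to run."""
    budget = BUDGETS.get(branch)
    if budget is None:
        return False
    return elapsed_s <= budget.max_seconds and retries <= budget.max_retries


def success_rates(episodes):
    """Report nominal and recovery success as separate numbers.

    nominal: share of all episodes that succeeded without any recovery.
    recovery: share of recovery-using episodes that ended in success.
    """
    recovered = [e for e in episodes if e["recovery_used"]]
    nominal_ok = sum(1 for e in episodes if e["success"] and not e["recovery_used"])
    recovery_ok = sum(1 for e in recovered if e["success"])
    return {
        "nominal": nominal_ok / len(episodes) if episodes else None,
        "recovery": recovery_ok / len(recovered) if recovered else None,
    }


episodes = [
    {"recovery_used": False, "success": True},
    {"recovery_used": False, "success": True},
    {"recovery_used": True, "success": True},
    {"recovery_used": True, "success": False},
    {"recovery_used": True, "success": False},
]
print(success_rates(episodes))
print(within_budget("regrasp", elapsed_s=5.0, retries=1))
```

Keeping the two rates apart prevents a strong nominal policy from hiding a weak recovery layer, or the reverse.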
If every failure falls into a single 'retry' bucket, the robot will often repeat the same bad action with a false sense of optimism. Recovery needs new information or a changed configuration.
Shelf picking systems often recover by changing the wrist viewpoint, not by grasping again immediately. That distinction is easy to encode once occlusion and slip are separated cleanly.
Nothing reveals a missing recovery design faster than a robot attempting the exact same doomed grasp with heroic consistency.
Current work explores learned failure predictors and language-annotated recovery. The enduring engineering requirement is still a bounded branch table that an operator can inspect and trust.
Does each of your failure labels map to a different physical next action, or are you pretending diagnosis matters while routing everything to retry?
Recovery exposes one of the deepest embodied-AI differences from static inference. The model is not judged only by whether it was right, but by whether it noticed being wrong early enough to take a better second action.
For teaching, this section is a natural place to introduce failure ledgers. Students learn quickly when every failure trace must include label, branch, outcome, and whether the second attempt failed for the same or a different reason.
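One possible shape for a failure ledger entry is sketched below. The class name, field names, and outcome strings are hypothetical; only the four pieces of information (label, branch, outcome, same-or-different second failure) come from the text.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    label: str                          # first failure label that fired
    branch: str                         # recovery branch that followed
    outcome: str                        # "recovered" or "failed" (assumed vocabulary)
    second_label: Optional[str] = None  # label of the second failure, if any

    def same_reason(self) -> Optional[bool]:
        """True if the second attempt failed for the same reason.

        Returns None when there was no second failure to compare against.
        """
        if self.outcome == "recovered" or self.second_label is None:
            return None
        return self.second_label == self.label


ledger = [
    LedgerEntry("slip", "regrasp", "recovered"),
    LedgerEntry("slip", "regrasp", "failed", second_label="slip"),
    LedgerEntry("occlusion", "reobserve", "failed", second_label="no_progress"),
]
for entry in ledger:
    print({**asdict(entry), "same_reason": entry.same_reason()})
```

Entries where `same_reason()` is true are the ones to look at first, since they suggest the branch did not change anything that mattered.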
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| BehaviorTree.CPP | Recovery orchestration | Use it when you want human-readable branching and preemption semantics. |
| ROS 2 actions | Interruptible execution | Helpful for reporting progress, cancellation, and task-level retries. |
| Replay logs | Postmortem analysis | Treat replayability as a requirement, not a nice extra. |
Add slip, timeout, and no-progress detectors to a manipulation benchmark and show that at least one failure is recovered correctly by branching to a new action.
The first question is whether the detector fired early enough. If not, improve signals. If yes, check whether the branch changed information, geometry, or contact state before retrying.
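The two debugging questions can be written as a small triage helper. This is only a sketch: the function name, the timestamp arguments, the keys of `changed`, and the returned strings are all illustrative assumptions.

```python
from typing import Dict, Optional

CHANGE_KEYS = ("information", "geometry", "contact_state")


def triage(detect_t: Optional[float], failure_t: float,
           changed: Dict[str, bool]) -> str:
    """Suggest where to look first after a failed recovery.

    detect_t:  time the detector fired, or None if it never fired.
    failure_t: time the failure became unrecoverable.
    changed:   what the recovery branch altered before retrying.
    """
    # Question 1: did the detector fire early enough?
    if detect_t is None or detect_t >= failure_t:
        return "improve_signals"
    # Question 2: did the branch change anything before the retry?
    if not any(changed.get(key, False) for key in CHANGE_KEYS):
        return "fix_branch"
    # Both checks passed; this sketch has no further diagnosis to offer.
    return "inspect_retry"


print(triage(None, 2.0, {}))
print(triage(1.2, 2.0, {"geometry": False}))
print(triage(1.2, 2.0, {"information": True}))
```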
Section References
BehaviorTree.CPP ROS 2 integration
Practical framework for readable task branching and recovery.
Official action semantics for interruptible and monitorable task execution.
Useful reference for execution feedback and monitorable motion stages in manipulation pipelines.
Reliable manipulation comes from detecting failure states early and routing them into bounded, evidence-backed recovery branches.
Write a four-label manipulation failure taxonomy and a matching recovery table. For each label, specify the next observation you need before retrying.