Section 3.8: Failure modes of each architecture | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Big Picture

Failure-mode analysis is one lens on embodied system architectures. We study it because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments, and each architecture breaks under those pressures in its own way.

Figure 3.8. Failure modes are easiest to reason about as a closed loop of evidence, decision, and consequence: each architecture fails in a characteristic place.

This section turns the failure modes of each architecture into a usable mental model. First we define the object of study, then we connect it to the agent loop, then we test it with a compact implementation.

The key question is practical: what must the agent know, what can it observe, what action is available, and what evidence shows that the action worked under the stated conditions?

Action Is The Test

A representation earns its place when it changes the measurable action interface. When studying failure modes, keep asking which decision becomes easier, safer, or more reliable.

Theory

The practical design rule for failure analysis is to make the interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.
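
The rule above can be made concrete with a small interface record that is saved next to the run. The sketch below is one possible shape, not a standard schema: `InterfaceSpec`, its field names, the `pose_estimator` example, and every numeric limit are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class InterfaceSpec:
    """Hypothetical record of one module boundary (illustrative field names)."""
    name: str
    inputs: dict           # input field -> unit or dtype description
    outputs: dict          # output field -> unit
    max_latency_ms: float  # timing budget for one call
    bounds: dict           # output field -> (low, high)
    failure_labels: tuple  # labels this boundary is allowed to emit


def violations(spec, output, latency_ms):
    """Return human-readable contract violations for one module output."""
    problems = []
    if latency_ms > spec.max_latency_ms:
        problems.append(f"latency {latency_ms} ms exceeds {spec.max_latency_ms} ms")
    for key, (low, high) in spec.bounds.items():
        if key not in output:
            problems.append(f"missing output '{key}'")
        elif not low <= output[key] <= high:
            problems.append(f"'{key}'={output[key]} outside [{low}, {high}]")
    return problems


# Assumed example values; replace them with the limits of a real module.
pose_spec = InterfaceSpec(
    name="pose_estimator",
    inputs={"image": "uint8 HxWx3", "stamp": "s"},
    outputs={"x": "m", "y": "m", "yaw": "rad"},
    max_latency_ms=50.0,
    bounds={"x": (-5.0, 5.0), "y": (-5.0, 5.0), "yaw": (-3.1416, 3.1416)},
    failure_labels=("stale", "out_of_bounds", "missing_field"),
)

# The serialized spec is what goes into the saved artifact.
print(json.dumps(asdict(pose_spec), indent=2))
print(violations(pose_spec, {"x": 0.4, "y": 7.2, "yaw": 0.1}, latency_ms=62.0))
```

Storing the spec as data rather than as comments means a later diagnostic script can check any logged output against the same limits the designer wrote down.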

Failure analysis is architecture-specific because each architecture hides uncertainty in a different place. A modular stack exposes many interfaces but can lose performance through handoff errors. An end-to-end policy removes handoffs but hides internal causes. A hierarchy improves long-horizon structure but creates precondition and termination failures. A dual-system design adds a router, which can become the most important component in the system.

Failure Signatures By Architecture

Architecture	Likely first suspect	Evidence to inspect	Best perturbation
Modular pipeline	Interface mismatch	frame, timestamp, covariance, message schema	Replay a corrected upstream message.
End-to-end policy	Data coverage or action convention	nearest training episodes, action scale, horizon	Hold the scene fixed and vary goal wording or initial pose.
Hierarchy	Skill precondition or termination	selected skill, precondition check, stop reason	Force the same skill with corrected preconditions.
Dual-system	Routing threshold	uncertainty, risk, selected path, deliberation time	Sweep uncertainty around the escalation threshold.
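
The last row of the table suggests sweeping uncertainty around the escalation threshold. A minimal sketch of that perturbation follows, assuming a router that escalates to the deliberative path (labeled `S2`) whenever a scalar uncertainty exceeds a fixed threshold. The `route` function, the threshold of 0.5, and the offsets are hypothetical stand-ins for whatever a real router computes.

```python
# Hypothetical offsets (in uncertainty units) applied around the threshold.
OFFSETS = (-0.10, -0.05, 0.0, 0.05, 0.10)


def route(uncertainty, threshold):
    """Assumed router: escalate to the slow path S2 only above the threshold."""
    return "S2" if uncertainty > threshold else "S1"


def sweep(threshold, offsets=OFFSETS):
    """Evaluate the router at fixed offsets around the escalation threshold."""
    return [(threshold + off, route(threshold + off, threshold)) for off in offsets]


results = sweep(threshold=0.5)
for u, path in results:
    print(f"uncertainty={u:.2f} -> {path}")

# The flip point is the first swept uncertainty that escalates to S2.
flip = next(u for u, path in results if path == "S2")
print(f"first escalation at uncertainty={flip:.2f}")
```

In a real system each swept point would be a replay of the same episode with only the uncertainty input overridden, so that a change in outcome can be attributed to routing alone.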

The practical goal is not to produce a dramatic failure label. It is to produce the smallest intervention that flips the outcome while leaving the rest of the run unchanged. That intervention identifies the architectural boundary where the fix belongs.

Mechanism

The mechanism behind each failure mode is the contract between representation and action. Name what enters the module, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.

Worked Example

The section argues that a good failure analysis finds the smallest intervention that flips the outcome. The example makes that operational: a rollout depends on four stage outputs (pose, plan, skill boundary, routing), exactly one is corrupted, and a search over single oracle substitutions recovers which one. It compresses the replay and panel passes of the three-pass diagnostic, described later in this section, into a few lines.

# A rollout succeeds only if every stage output is correct.
ORACLE = {"pose": "good", "plan": "good", "skill": "good", "route": "S2"}

def rollout(stages):
    return all(stages[k] == ORACLE[k] for k in ORACLE)

# The failing episode: one corrupted stage (the planner chose a bad plan).
broken = {"pose": "good", "plan": "BAD", "skill": "good", "route": "S2"}
print("baseline success:", rollout(broken))

# Pass 2: try each single oracle substitution; report the minimal fix.
for k in ORACLE:
    patched = dict(broken, **{k: ORACLE[k]})
    if rollout(patched):
        print(f"minimal fix: substitute oracle '{k}' -> success")

# Pass 3: a real cause should flip a PANEL, not one hand-picked case.
panel = [dict(broken), dict(broken, pose="BAD"), dict(broken, route="S1")]
flips = sum(rollout(dict(ep, plan="good")) for ep in panel)
print(f"'plan' oracle flips {flips}/{len(panel)} panel cases")

Code Fragment 3.8.1 searches single oracle substitutions to find the minimal intervention that flips a failed rollout, then checks whether that cause generalizes across a small panel rather than one case.

Expected output: the baseline fails, the search identifies plan as the single substitution that restores success, and the panel check shows the plan oracle flips 1 of the 3 panel cases, namely the one episode whose sole defect was the plan. The discipline is the takeaway: a cause that flips one cherry-picked case is a clue, while a cause that flips a panel is evidence. The table above tells you which oracle to try first for each architecture, and this search tells you whether the guess was right.

Library Shortcut

The hand-built fragment is a visibility tool. Production work should move to maintained stacks such as Hugging Face Transformers, open VLMs, OpenVLA, openpi, LeRobot, and tool-calling planners once the interface, logging contract, and failure recovery path are explicit.

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.
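
The fourth step of the recipe, recording failures as structured cases, can be sketched as follows. The five categories come from the recipe; the `FailureCase` fields and the three example episodes are hypothetical placeholders rather than real results.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """The five failure categories named in the recipe."""
    PERCEPTION = "perception error"
    STATE = "state error"
    PLANNING = "planning error"
    CONTROL = "control error"
    EVALUATION = "evaluation error"


@dataclass
class FailureCase:
    """Hypothetical structured record for one failed episode."""
    episode_id: str
    kind: FailureKind
    evidence: str       # which log or trace supports the label
    perturbation: str   # the single change that was tested


# Placeholder cases for illustration only.
cases = [
    FailureCase("ep-003", FailureKind.PLANNING, "plan trace", "oracle plan"),
    FailureCase("ep-007", FailureKind.PERCEPTION, "pose covariance", "oracle pose"),
    FailureCase("ep-011", FailureKind.PLANNING, "plan trace", "oracle plan"),
]

counts = Counter(case.kind for case in cases)
for kind, n in counts.most_common():
    print(f"{kind.value}: {n}")
```

A closed set of labels keeps the tally comparable across runs; free-text labels tend to drift until no two postmortems can be merged.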

Common Failure Mode

The common mistake is to celebrate the component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.
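
Several of these boundary failures can be flagged mechanically at the handoff. The check below is a minimal sketch, assuming each message carries a timestamp in seconds, a frame name, and a scalar command; the message layout, the 0.1 s age limit, and the command limit of 1.0 are hypothetical.

```python
def check_handoff(msg, now_s, expected_frame, max_age_s=0.1, cmd_limit=1.0):
    """Return boundary failure labels for one message crossing a module handoff."""
    labels = []
    if now_s - msg["stamp_s"] > max_age_s:
        labels.append("stale_state")         # data is older than the timing budget
    if msg["frame"] != expected_frame:
        labels.append("wrong_frame")         # consumer assumes a different frame
    if abs(msg["command"]) >= cmd_limit:
        labels.append("saturated_actuator")  # command sits at or beyond the limit
    return labels


# Assumed example messages, not taken from a real robot log.
good = {"stamp_s": 9.98, "frame": "base", "command": 0.3}
bad = {"stamp_s": 9.70, "frame": "camera", "command": 1.0}

print(check_handoff(good, now_s=10.0, expected_frame="base"))
print(check_handoff(bad, now_s=10.0, expected_frame="base"))
```

Running the check on every message, not only on failed episodes, shows whether a boundary violation is actually correlated with failure or is merely always present.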

Practical Example

A robotics team diagnosing architectural failures should log not only final success but also intermediate observations, chosen actions, controller status, and recovery events. The logs reveal whether the method is solving the task or merely passing the easiest episodes.
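
One lightweight way to keep such a log is one JSON object per control step. The sketch below writes to an in-memory buffer so it runs anywhere; the field names and the two example steps are hypothetical, and a real system would write to a file or a message bus instead.

```python
import io
import json


def log_step(stream, step, observation, action, controller_status, recovery_event=None):
    """Append one control step as a JSON line (illustrative schema)."""
    record = {
        "step": step,
        "observation": observation,
        "action": action,
        "controller_status": controller_status,
        "recovery_event": recovery_event,
    }
    stream.write(json.dumps(record) + "\n")


buf = io.StringIO()
log_step(buf, 0, {"x": 0.00}, {"v": 0.2}, "ok")
log_step(buf, 1, {"x": 0.02}, {"v": 0.2}, "saturated", recovery_event="retry_grasp")

# Reading the trace back: which steps needed a recovery?
steps = [json.loads(line) for line in buf.getvalue().splitlines()]
recoveries = [s["step"] for s in steps if s["recovery_event"] is not None]
print("steps with recovery events:", recoveries)
```

An episode that succeeds only after several recovery events is a different result from one that succeeds cleanly, and a final success flag alone cannot tell them apart.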

Fun Note

Architecture diagrams look tidy because they do not include the arrow labeled 'everyone assumed someone else checked that'.

Research Frontier

Treat frontier claims as hypotheses until they expose enough detail to reproduce the result: data boundary, embodiment, controller interface, evaluation panel, and failure cases.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode for the architecture you are building? If not, the system boundary is still too vague.

Failure analysis becomes useful when it is tied to a closed-loop contract for how perception, estimation, planning, learning, and control are arranged into a system. The contract names the observation stream, the action representation, the timing budget, the safety boundary, and the result artifact. That is the bridge between a readable concept and a system a skeptical builder can test.

When analyzing failures, separate the conceptual claim, the systems claim, and the evidence claim. A good explanation, a clean API, and one successful rollout are different kinds of evidence and should be kept distinct.

Tool or Library	Role in This Topic	Builder Advice
ROS 2	separates system modules while preserving message contracts and timing	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
MuJoCo	gives architecture choices a repeatable simulated world for stress tests	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
LeRobot	anchors modern policy architectures in reusable datasets and policy APIs	Use it when the hand-built contract is clear and the experiment needs repeatable runs.

A robust implementation starts with one inspectable baseline whose artifact records observations, actions, units, timestamps, seeds, termination reasons, and the perturbation applied. The maintained-tool version is useful only if it preserves that schema and lets the comparison remain construct-matched.

Write a one-paragraph task contract with observation, action, success, failure, and safety fields.
Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
Run one deterministic smoke test and one perturbation test before scaling.
Save one artifact containing configuration, seed, metrics, traces, and failure labels.
Compare methods only when the same script evaluates the same panel, split, seed set, and metric.

When an architecture fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Use a three-pass diagnostic. First, classify the architecture, because the likely hidden variable depends on the design. Second, replay the episode with one oracle substitution, such as corrected pose, corrected plan, corrected skill boundary, or forced System 2 routing. Third, rerun the same intervention across a small panel of failures. A cause that flips one hand-picked case is a clue; a cause that flips a panel is evidence.
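
The third pass can be sketched as a flip rate per oracle over a panel of failures. This reuses the toy `ORACLE` and `rollout` from the worked example, redefined here so the block runs on its own; the four panel episodes are hypothetical, and `flip_rates` is an illustrative helper rather than part of any library.

```python
# Toy setup from the worked example: a rollout succeeds only if every stage is correct.
ORACLE = {"pose": "good", "plan": "good", "skill": "good", "route": "S2"}


def rollout(stages):
    return all(stages[k] == ORACLE[k] for k in ORACLE)


def flip_rates(panel):
    """Fraction of failed episodes that each single oracle substitution repairs."""
    failures = [ep for ep in panel if not rollout(ep)]
    if not failures:
        return {}
    rates = {}
    for k in ORACLE:
        flips = sum(rollout(dict(ep, **{k: ORACLE[k]})) for ep in failures)
        rates[k] = flips / len(failures)
    return rates


# Hypothetical panel: two plan-only failures, one pose-only, one with two defects.
panel = [
    {"pose": "good", "plan": "BAD", "skill": "good", "route": "S2"},
    {"pose": "good", "plan": "BAD", "skill": "good", "route": "S2"},
    {"pose": "BAD", "plan": "good", "skill": "good", "route": "S2"},
    {"pose": "good", "plan": "BAD", "skill": "good", "route": "S1"},
]

rates = flip_rates(panel)
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"oracle '{name}' flips {rate:.0%} of failures")
```

The episode with two defects is repaired by no single substitution, which is itself useful information: a panel whose flip rates sum to well under 100% contains compound failures that need a second intervention.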

Hands-On Lab: Build a Section Evidence Trace

Duration: ~65 minutes
Difficulty: Intermediate

Objective

Turn the failure modes of each architecture into a small artifact that compares a hand-built baseline with a maintained-tool shortcut under one perturbation.

What You'll Practice

Define an observation, action, metric, and perturbation contract
Build a minimal baseline trace
Preserve the same schema for the library shortcut
Write a failure postmortem from the evidence record

Setup

pip install numpy pandas

Code Fragment 3.8.L1 installs NumPy and pandas for the section lab trace.

Steps

Step 1: Define the contract

Write the fields that make two runs comparable.

Step 2: Record the baseline

Save one deterministic result before adding noise or latency.

Step 3: Add the shortcut

Run or sketch the maintained-tool version while keeping the artifact schema fixed.

Step 4: Apply one perturbation

Change exactly one condition and preserve the same logging fields.
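
A small guard can enforce the "exactly one condition" rule before a perturbed run is accepted into the table. The sketch below compares two configuration dictionaries; the configuration keys and values are hypothetical examples.

```python
def changed_fields(base, perturbed):
    """Return the sorted names of fields whose values differ between two configs."""
    keys = base.keys() | perturbed.keys()
    return sorted(k for k in keys if base.get(k) != perturbed.get(k))


def assert_single_perturbation(base, perturbed):
    """Raise unless exactly one field changed; return the name of that field."""
    diff = changed_fields(base, perturbed)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed field, got {diff}")
    return diff[0]


# Assumed lab configuration; only latency is perturbed.
base = {"seed": 0, "latency_ms": 0, "sensor_noise_std": 0.0, "metric": "success"}
perturbed = dict(base, latency_ms=80)

print("perturbed field:", assert_single_perturbation(base, perturbed))
```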

Expected Output

The completed lab produces one table with baseline, shortcut, and perturbed rows, plus a short note explaining which comparison is valid because all metrics were co-computed under one schema.

Stretch Goals

Add a second seed and report mean and spread.
Write a one-paragraph postmortem that separates root cause from symptom.
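
For the first stretch goal, mean and spread per run can be computed with the standard library alone. In the sketch below the seed-0 values match the lab's placeholder table, while the seed-1 values are invented purely to make the example run; neither is a measured result.

```python
from collections import defaultdict
from statistics import mean, stdev

# Seed-0 values mirror the lab table; seed-1 values are made-up placeholders.
records = [
    {"run": "baseline", "seed": 0, "success": 0.72},
    {"run": "baseline", "seed": 1, "success": 0.68},
    {"run": "baseline_perturbed", "seed": 0, "success": 0.54},
    {"run": "baseline_perturbed", "seed": 1, "success": 0.50},
]

# Group success values by run name.
by_run = defaultdict(list)
for rec in records:
    by_run[rec["run"]].append(rec["success"])

# Sample standard deviation is used as the spread; it needs at least two seeds.
summary = {
    run: {"n": len(vals), "mean": mean(vals), "spread": stdev(vals)}
    for run, vals in by_run.items()
}

for run, stats in summary.items():
    print(f"{run}: n={stats['n']} mean={stats['mean']:.3f} spread={stats['spread']:.3f}")
```

With the lab's pandas table, the same summary is available through `df.groupby("run")["success"].agg(["mean", "std"])`. Two seeds give only a rough spread, so treat it as a sanity check rather than a confidence interval.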

Complete Solution

# Complete compact evidence trace for the section lab.
# Extend these records with values produced by your actual environment or simulator.
import pandas as pd

records = [
    {"run": "baseline", "seed": 0, "success": 0.72, "failure_label": "none"},
    {"run": "library_shortcut", "seed": 0, "success": 0.78, "failure_label": "none"},
    {"run": "baseline_perturbed", "seed": 0, "success": 0.54, "failure_label": "latency"},
]
print(pd.DataFrame(records))

Code Fragment 3.8.L2 creates a complete same-schema evidence table for the section lab.

Key Takeaway

Studying the failure modes of each architecture is useful when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 3.8.1

Design a method-matched experiment that probes the failure modes of one architecture. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

What's Next?

Chapter 4 begins Part II by giving these systems a geometric language for space and motion.

Bibliography & Further Reading

Foundational References For This Section

Quigley, M. et al. "ROS: an open-source Robot Operating System." (2009). https://www.ros.org/

The systems reference for modular robot software and message-passing architecture.

Todorov, E., Erez, T., and Tassa, Y. "MuJoCo: A physics engine for model-based control." (2012). https://mujoco.org/

A widely used simulator for architecture and control experiments.

Brohan, A. et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." (2023). https://arxiv.org/abs/2307.15818

A central reference for locating VLM and VLA models in embodied control stacks.