Section 37.5: Sample-efficiency advantages and failure modes | Building Embodied AI: From Perception to Autonomous Action

The most persuasive sample-efficiency plot is the one that survives contact with failure analysis.
A Budget-Conscious MPC Loop

Figure 37.5A: Sample efficiency is only one side of the story. A serious audit pairs early learning gains with a ledger of failure modes and hidden costs. (Image: a model-based learning curve reaching target performance early, alongside a failure ledger listing model bias, optimizer collapse, and timing overruns.)

Big Picture

Model-based RL often wins on early learning efficiency because the same real transition can be reused many times through planning or imagination. But those gains can disappear if the model is biased, the planner is too slow, or the method collapses under shifts the benchmark barely exposes.

Key Insight

Efficiency claims are incomplete until they are paired with a failure ledger. Saved episodes mean little if the resulting method breaks when latency, contact, or distribution shift actually matter.

Where The Efficiency Comes From

Model-based methods reuse real experience by either planning through a learned model or generating synthetic targets from it. In rough terms, one transition can contribute to multiple policy-improvement updates rather than being consumed only once. That is the efficiency story.
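The reuse argument can be made concrete with a toy count of updates per real transition. The following is a minimal sketch of Dyna-style bookkeeping, not a full learner: the "model" simply memorizes observed transitions, and `dyna_update_counts` and `planning_steps` are hypothetical names chosen for illustration.

```python
# Toy Dyna-style bookkeeping: count how many updates each real transition feeds.
# Assumption: the "model" is a list that memorizes observed transitions.
import random
from collections import Counter

def dyna_update_counts(real_transitions, planning_steps, seed=0):
    rng = random.Random(seed)
    model = []           # memorized (index, transition) pairs
    updates = Counter()  # updates attributed to each real transition index
    for idx, transition in enumerate(real_transitions):
        model.append((idx, transition))
        updates[idx] += 1                      # direct update from the real step
        for _ in range(planning_steps):
            replay_idx, _ = rng.choice(model)  # simulated update drawn from the model
            updates[replay_idx] += 1
    return updates

transitions = [("s0", "a", 0.0, "s1"), ("s1", "a", 0.0, "s2"), ("s2", "a", 1.0, "s3")]
model_free = dyna_update_counts(transitions, planning_steps=0)
model_based = dyna_update_counts(transitions, planning_steps=10)
print(sum(model_free.values()), sum(model_based.values()))  # prints: 3 33
```

Three real transitions yield 3 updates without planning and 33 with ten planning steps each. In this toy the replayed transitions are exact copies; with a learned model, every extra update also inherits whatever error the model carries.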

But the same mechanism creates new failure terms:

$$ \text{deployment risk} \approx \text{model bias} + \text{optimizer error} + \text{timing overrun} + \text{uncertainty misuse}. $$

A serious evaluation must report both sample efficiency and this risk ledger on one matched evaluation panel, meaning the same resets, seeds, and observation contract for every method compared.

This is where many benchmark stories become misleading. A method can hit target return quickly because the benchmark rewards an easy strategy that never exposes the hard parts of the dynamics, such as sparse collisions, rare slips, or recovery after contact surprises. If the evaluation panel does not include those cases, the sample-efficiency gain is real but incomplete. For embodied systems, incomplete often means unsafe.

Common Failure Modes

Failure mode	Typical symptom	Diagnostic artifact
Model bias	Planner prefers impossible trajectories	Held-out rollout traces and model-versus-real overlays
Optimizer collapse	Costs vary wildly across replans	Candidate-score histograms and latency logs
Timing overrun	Stale first action reaches the robot	Controller period versus planning time chart
Uncertainty misuse	Unsafe confidence in unseen states	Coverage audit and override log
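As one example of a diagnostic artifact from the table, the sketch below computes a held-out open-loop rollout error, the kind of trace that exposes model bias. It assumes a scalar state and a hypothetical `model_step(state, action)` function; the "real" system and the biased model are toy stand-ins, not results from any benchmark.

```python
# Sketch: multi-step open-loop rollout error on a held-out trajectory.
# Assumptions: scalar state, hypothetical model_step(state, action) interface.

def rollout_errors(model_step, states, actions):
    """Roll the model forward from states[0] using only the recorded actions,
    and return the absolute error against the real state at each step."""
    predicted = states[0]
    errors = []
    for action, real_next in zip(actions, states[1:]):
        predicted = model_step(predicted, action)  # model feeds on its own prediction
        errors.append(abs(predicted - real_next))
    return errors

def real_step(x, a):
    return 0.9 * x + a   # toy "real" dynamics with damping

def biased_model(x, a):
    return 1.0 * x + a   # toy learned model that misses the damping

actions = [0.0, 0.0, 0.0, 0.0]
states = [1.0]
for a in actions:
    states.append(real_step(states[-1], a))

errors = rollout_errors(biased_model, states, actions)
print([round(e, 3) for e in errors])
```

The error grows with horizon because each prediction is fed back into the model. A one-step error metric would show only the first and smallest value, which is why the table asks for rollout traces rather than single-step loss.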

Worked Probe

The next code fragment prints a compact evidence card for one benchmark comparison. This is the minimum artifact that should accompany a "sample efficient" claim.

# Build one evidence card for a sample-efficiency claim.
from dataclasses import asdict, dataclass

@dataclass
class EvidenceCard:
    target_return: float
    real_episodes_to_target: int
    planner_ms: int
    heldout_rollout_error: float
    dominant_failure: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

card = EvidenceCard(
    target_return=0.80,
    real_episodes_to_target=6,
    planner_ms=18,
    heldout_rollout_error=0.041,
    dominant_failure="model_bias_under_contact_shift",
)
print(card.as_row())

{'target_return': 0.8, 'real_episodes_to_target': 6, 'planner_ms': 18, 'heldout_rollout_error': 0.041, 'dominant_failure': 'model_bias_under_contact_shift'}

Read the deployment numbers as a runtime budget: model inference, optimization, safety filtering, and actuator command must fit inside the control period with margin for logging and fault handling.

Code Fragment 37.5.1: A good evidence card reports the efficiency gain together with the cost and the failure. The expected reading is that sample efficiency without context is not publication-grade evidence.
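The runtime-budget reading of `planner_ms` can be checked mechanically. The sketch below is one possible budget check under stated assumptions: the stage names follow the list above, but the individual timings, the split of the 18 ms planner time, and the 20% reserved margin are illustrative values, not measurements.

```python
# Sketch: does the per-step compute fit inside the control period, with margin?
# Stage timings, the 18 ms split, and the 20% margin are illustrative assumptions.

def budget_report(stage_ms, control_period_ms, margin_fraction=0.2):
    """Compare summed stage time against the period minus a reserved margin."""
    used_ms = sum(stage_ms.values())
    allowed_ms = control_period_ms * (1.0 - margin_fraction)
    return {
        "used_ms": used_ms,
        "allowed_ms": allowed_ms,
        "slack_ms": allowed_ms - used_ms,
        "fits": used_ms <= allowed_ms,
    }

stage_ms = {
    "model_inference": 6,   # hypothetical split of the card's 18 ms planner time
    "optimization": 12,
    "safety_filter": 2,
    "actuator_command": 1,
}

report_25hz = budget_report(stage_ms, control_period_ms=40)  # 25 Hz loop
report_50hz = budget_report(stage_ms, control_period_ms=20)  # 50 Hz loop
print(report_25hz["fits"], report_50hz["fits"])  # prints: True False
```

The same 18 ms planner is comfortable at 25 Hz and over budget at 50 Hz, so a planner latency is only meaningful when reported next to the control period it must fit.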

Library Shortcut

Use versioned JSON or dataclass exports for evidence cards, and store them next to replay videos or plotted traces. Pair that with Weights & Biases, MLflow, or a plain artifact directory keyed by seed and panel. This habit makes it much easier to compare planners, simulators, or datasets without losing the failure story.
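A minimal sketch of the plain-artifact-directory option follows, using only the Python standard library. The directory layout, the file name `evidence_card.json`, and `SCHEMA_VERSION` are assumptions chosen for illustration; the dataclass repeats the fields of the evidence card above so the block runs on its own.

```python
# Sketch: export an evidence card as versioned JSON, keyed by panel and seed.
# The directory layout and SCHEMA_VERSION are assumptions for illustration.
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

SCHEMA_VERSION = 1

@dataclass
class EvidenceCard:
    target_return: float
    real_episodes_to_target: int
    planner_ms: int
    heldout_rollout_error: float
    dominant_failure: str

def save_card(card, root, panel, seed):
    """Write <root>/<panel>/seed_<NNN>/evidence_card.json and return its path."""
    path = Path(root) / panel / f"seed_{seed:03d}" / "evidence_card.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"schema_version": SCHEMA_VERSION, "panel": panel, "seed": seed, **asdict(card)}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path

def load_card(path):
    """Read a card back and reject files written under another schema version."""
    payload = json.loads(Path(path).read_text())
    if payload.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {payload.get('schema_version')}")
    return payload

card = EvidenceCard(0.80, 6, 18, 0.041, "model_bias_under_contact_shift")
with tempfile.TemporaryDirectory() as root:
    saved_path = save_card(card, root, panel="contact_shift", seed=3)
    relative_path = saved_path.relative_to(root).as_posix()
    loaded = load_card(saved_path)
print(relative_path)
```

Keeping the panel and seed in both the path and the payload means a card stays interpretable even if the file is copied out of its directory, and replay videos or traces can sit in the same folder.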

How To Audit The Claim

A convincing audit compares model-based and model-free baselines on the same reset panel, with the same observation contract, same seed count, and same target-return threshold. Then it adds deployment-facing fields that most papers omit: control period, average planner milliseconds, percentage of aborted rollouts, and the first failure mode observed during shift. That single table often explains more than several reward curves.
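The matching requirement can be enforced before any table is printed. The sketch below is one way to do that under stated assumptions: the field names are hypothetical, and every numeric value and failure label is a placeholder for illustration, not a reported result.

```python
# Sketch: refuse to compare two runs unless their evaluation protocol matches.
# Field names and all values are illustrative placeholders, not results.
PROTOCOL_KEYS = ("reset_panel", "observation_contract", "seed_count", "target_return")
DEPLOYMENT_KEYS = ("control_period_ms", "mean_planner_ms",
                   "aborted_rollout_pct", "first_failure_under_shift")

def protocol_mismatches(run_a, run_b):
    """Return the protocol keys on which the two runs disagree."""
    return [key for key in PROTOCOL_KEYS if run_a[key] != run_b[key]]

def audit_table(runs):
    """Format one aligned text row per run with the deployment-facing fields."""
    header = ("method",) + DEPLOYMENT_KEYS
    rows = [header] + [(r["method"],) + tuple(str(r[k]) for k in DEPLOYMENT_KEYS)
                       for r in runs]
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join("  ".join(cell.ljust(w) for cell, w in zip(row, widths))
                     for row in rows)

shared = {"reset_panel": "panel_v1", "observation_contract": "proprio+rgb",
          "seed_count": 5, "target_return": 0.80}
model_based = {**shared, "method": "model_based", "control_period_ms": 40,
               "mean_planner_ms": 18, "aborted_rollout_pct": 2.0,
               "first_failure_under_shift": "model_bias"}
model_free = {**shared, "method": "model_free", "control_period_ms": 40,
              "mean_planner_ms": 0, "aborted_rollout_pct": 0.0,
              "first_failure_under_shift": "slow_recovery"}

mismatches = protocol_mismatches(model_based, model_free)
if mismatches:
    raise ValueError(f"runs are not comparable; mismatched fields: {mismatches}")
print(audit_table([model_based, model_free]))
```

Raising on a mismatch, instead of printing a warning, keeps a comparison with different seed counts or thresholds from ending up in a results table by accident.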

Readers should also separate efficiency from engineering burden. A method that uses fewer real episodes but requires days of model tuning, fragile horizon schedules, and constant calibration babysitting may still be the right choice for scarce-data robotics, but the trade should be stated explicitly.

Audit Rule

For every efficiency claim, save target return, real interaction count, planner latency, held-out model error, and at least one tagged failure episode. If any of those fields are missing, the comparison is incomplete.
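The rule is simple enough to automate. The sketch below checks a result record for the five fields named above; the key names and the example records are hypothetical and should be adapted to whatever schema a project already uses.

```python
# Sketch: flag efficiency claims whose evidence record is incomplete.
# Key names are hypothetical stand-ins for the five fields in the audit rule.
REQUIRED_FIELDS = (
    "target_return",
    "real_interaction_count",
    "planner_latency_ms",
    "heldout_model_error",
    "failure_episode_tags",
)

def is_missing(value):
    """Treat None, empty strings, and empty lists as missing; zero is a valid value."""
    return value is None or value == "" or value == []

def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if is_missing(record.get(name))]

complete = {
    "target_return": 0.80,
    "real_interaction_count": 6,
    "planner_latency_ms": 18,
    "heldout_model_error": 0.041,
    "failure_episode_tags": ["model_bias_under_contact_shift"],
}
incomplete = {
    "target_return": 0.80,
    "real_interaction_count": 6,
    "heldout_model_error": 0.041,
    "failure_episode_tags": [],
}

print(missing_fields(complete))    # prints: []
print(missing_fields(incomplete))  # prints: ['planner_latency_ms', 'failure_episode_tags']
```

An empty list of failure tags counts as missing here on purpose: the rule asks for at least one tagged failure episode, so a record with no observed failures has not yet been stressed enough to support the claim.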

Warning

Benchmark gains can hide deployment regressions. A model-based method that learns fast in simulation but overruns the control period or misranks rare contact states is not ready just because its reward curve rose sooner.

Practical Example

A drone policy that reaches competent flight with half the real data of a model-free baseline may still be unacceptable if its planner occasionally stalls under wind-gust outliers. A warehouse arm may learn faster but remain unusable if its uncertainty estimates stay narrow, and therefore overconfident, exactly when the box geometry changes.

Cross-References

This section connects to deployment and safety material in Chapter 54 and Chapter 55.

Research Frontier

The field is moving toward larger latent world models and stronger planners, but the evaluation bar must rise with it. Recent systems can look excellent on aggregate returns while still failing on calibration, latency, or real-robot shift. Those failure channels need first-class reporting.

Self Check

If a model-based method reaches target return with fewer episodes but twice the planner latency and worse shift robustness, would you still call it better? What additional evidence would you need?

Memory Hook

Sample efficiency is the opening argument. Failure analysis is the cross-examination.

Key Takeaway

Model-based RL often earns its place through data efficiency, but only a joint audit of efficiency, bias, uncertainty, and timing tells you whether the method is truly better.

Exercise

Design an evidence card for a model-based benchmark in your domain. Which fields are mandatory before you would believe the sample-efficiency claim?

Bibliography & Further Reading

Primary References And Tools

Chua, K. et al. "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." (2018). https://arxiv.org/abs/1805.12114

A standard reference for strong sample efficiency under uncertainty-aware planning.

Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253

A practical efficiency reference that also foregrounds model-trust limits.

Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828

A modern frontier baseline worth studying for both gains and remaining risks.