The most persuasive sample-efficiency plot is the one that survives contact with failure analysis.
A Budget-Conscious MPC Loop
Model-based RL often wins on early learning efficiency because the same real transition can be reused many times through planning or imagination. But those gains can disappear if the model is biased, the planner is too slow, or the method collapses under shifts the benchmark barely exposes.
Efficiency claims are incomplete until they are paired with a failure ledger. Saved episodes mean little if the method that saved them breaks when latency, contact, or shift actually matter.
Where The Efficiency Comes From
Model-based methods reuse real experience by either planning through a learned model or generating synthetic targets from it. In rough terms, one transition can contribute to multiple policy-improvement updates rather than being consumed only once. That is the efficiency story.
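The reuse mechanism can be sketched with a tabular, Dyna-style loop. This is an illustrative sketch, not the method of any paper cited here: the function name `run_dyna`, the memorized deterministic model, and the round-robin replay order (chosen over random sampling so the run is reproducible) are all assumptions made for the example.

```python
# Dyna-style reuse sketch: one real transition feeds several value updates.
# Assumptions (not from the text): tabular Q-values, a memorized deterministic
# model, and round-robin replay instead of random sampling.

def run_dyna(real_transitions, planning_steps, alpha=0.5, gamma=0.9):
    q = {}        # (state, action) -> value estimate
    model = {}    # (state, action) -> (reward, next_state)
    updates = 0   # total number of value backups performed

    def backup(s, a, r, s_next):
        nonlocal updates
        # Greedy bootstrap over the actions seen so far in the next state.
        best_next = max(
            (v for (state, _), v in q.items() if state == s_next), default=0.0
        )
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (r + gamma * best_next - old)
        updates += 1

    for s, a, r, s_next in real_transitions:
        backup(s, a, r, s_next)           # one update from the real sample
        model[(s, a)] = (r, s_next)       # remember it as the "model"
        keys = list(model)
        for i in range(planning_steps):   # extra updates replayed from the model
            ps, pa = keys[i % len(keys)]
            pr, ps_next = model[(ps, pa)]
            backup(ps, pa, pr, ps_next)
    return q, updates


# Two real transitions on a three-state chain; reward arrives on the second.
real = [(0, "right", 0.0, 1), (1, "right", 1.0, 2)]
q_direct, n_direct = run_dyna(real, planning_steps=0)
q_planned, n_planned = run_dyna(real, planning_steps=5)
print("updates:", n_direct, "vs", n_planned)
print("Q(0, right):", q_direct[(0, "right")], "vs", q_planned[(0, "right")])
```

With the same two real transitions, the planned run performs 12 backups instead of 2, and only the planned run propagates the reward back to state 0. The replayed updates are only as good as the stored model, which is exactly where the failure terms discussed next come from.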
But the same mechanism creates new failure terms:
$$ \text{deployment risk} \approx \text{model bias} + \text{optimizer error} + \text{timing overrun} + \text{uncertainty misuse}. $$
A serious evaluation must report both sample efficiency and this risk ledger on the same matched panel.
This is where many benchmark stories become misleading. A method can hit target return quickly because the benchmark rewards an easy strategy that never exposes the hard parts of the dynamics, such as sparse collisions, rare slips, or recovery after contact surprises. If the evaluation panel does not include those cases, the sample-efficiency gain is real but the evidence behind it is incomplete. For embodied systems, incomplete evidence often means unsafe deployment.
| Failure mode | Typical symptom | Diagnostic artifact |
|---|---|---|
| Model bias | Planner prefers impossible trajectories | Held-out rollout traces and model-versus-real overlays |
| Optimizer collapse | Costs vary wildly across replans | Candidate-score histograms and latency logs |
| Timing overrun | Stale first action reaches the robot | Controller period versus planning time chart |
| Uncertainty misuse | Unsafe confidence in unseen states | Coverage audit and override log |
Worked Probe
The next code fragment prints a compact evidence card for one benchmark comparison. This is the minimum artifact that should accompany a "sample efficient" claim.
# Build one evidence card for a sample-efficiency claim.
from dataclasses import asdict, dataclass


@dataclass
class EvidenceCard:
    target_return: float
    real_episodes_to_target: int
    planner_ms: int
    heldout_rollout_error: float
    dominant_failure: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)


card = EvidenceCard(
    target_return=0.80,
    real_episodes_to_target=6,
    planner_ms=18,
    heldout_rollout_error=0.041,
    dominant_failure="model_bias_under_contact_shift",
)
print(card.as_row())
{'target_return': 0.8, 'real_episodes_to_target': 6, 'planner_ms': 18, 'heldout_rollout_error': 0.041, 'dominant_failure': 'model_bias_under_contact_shift'}
Read the deployment numbers as a runtime budget: model inference, optimization, safety filtering, and actuator command must fit inside the control period with margin for logging and fault handling.
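One way to make that budget explicit is a small check that sums the per-stage times and compares them against the control period with a reserved margin. The sketch below is hypothetical: the `StageBudget` class, the per-stage split of the 18 ms total, the 25 ms control period, and the 20% margin are placeholder assumptions, not measurements from any system.

```python
# Runtime budget check: do all stages fit inside the control period with margin?
# All names and numbers here are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    model_inference_ms: float
    optimization_ms: float
    safety_filter_ms: float
    actuator_command_ms: float

    def total_ms(self) -> float:
        return (
            self.model_inference_ms
            + self.optimization_ms
            + self.safety_filter_ms
            + self.actuator_command_ms
        )


def fits_control_period(budget, control_period_ms, margin_fraction=0.2):
    """Return (ok, slack_ms); the margin is reserved for logging and fault handling."""
    usable_ms = control_period_ms * (1.0 - margin_fraction)
    slack_ms = usable_ms - budget.total_ms()
    return slack_ms >= 0.0, slack_ms


# Hypothetical split of an 18 ms pipeline across the four stages.
budget = StageBudget(
    model_inference_ms=6.0,
    optimization_ms=10.0,
    safety_filter_ms=1.5,
    actuator_command_ms=0.5,
)
ok_25ms, slack_25ms = fits_control_period(budget, control_period_ms=25.0)
ok_20ms, slack_20ms = fits_control_period(budget, control_period_ms=20.0)
print("25 ms period:", ok_25ms, round(slack_25ms, 3))
print("20 ms period:", ok_20ms, round(slack_20ms, 3))
```

The same pipeline passes at a 25 ms period and fails at 20 ms once the margin is reserved, which is why the control period belongs on the evidence card next to the planner time. In practice the check should use a high percentile of measured stage times, not the mean.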
Use versioned JSON or dataclass exports for evidence cards, and store them next to replay videos or plotted traces. Pair that with Weights & Biases, MLflow, or a plain artifact directory keyed by seed and panel. This habit makes it much easier to compare planners, simulators, or datasets without losing the failure story.
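A plain artifact directory needs nothing beyond the standard library. The sketch below repeats the `EvidenceCard` fields from the worked probe; the `SCHEMA_VERSION` constant, the `panel/seed_NNN/evidence_card.json` layout, the panel name, and the helper names `save_card` and `load_card` are assumptions made for illustration.

```python
# Versioned JSON export of an evidence card, keyed by panel and seed.
# Directory layout and schema version are illustrative assumptions.
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

SCHEMA_VERSION = 1  # bump whenever the card fields change


@dataclass
class EvidenceCard:
    target_return: float
    real_episodes_to_target: int
    planner_ms: int
    heldout_rollout_error: float
    dominant_failure: str


def save_card(card, root, panel, seed):
    """Write the card to <root>/<panel>/seed_<NNN>/evidence_card.json."""
    out_dir = Path(root) / panel / f"seed_{seed:03d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "schema_version": SCHEMA_VERSION,
        "panel": panel,
        "seed": seed,
        "card": asdict(card),
    }
    path = out_dir / "evidence_card.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_card(path):
    """Read a card back, refusing files written under a different schema."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    if payload["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {payload['schema_version']}")
    return EvidenceCard(**payload["card"])


card = EvidenceCard(0.80, 6, 18, 0.041, "model_bias_under_contact_shift")
with tempfile.TemporaryDirectory() as root:
    saved_path = save_card(card, root, panel="contact_shift_v1", seed=3)
    restored = load_card(saved_path)
    relative_path = saved_path.relative_to(root)
print(relative_path.as_posix())
print(restored == card)
```

Replay videos or plotted traces for the same run can be written into the same seed directory, so the failure evidence travels with the numbers.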
How To Audit The Claim
A convincing audit compares model-based and model-free baselines on the same reset panel, with the same observation contract, same seed count, and same target-return threshold. Then it adds deployment-facing fields that most papers omit: control period, average planner milliseconds, percentage of aborted rollouts, and the first failure mode observed during shift. That single table often explains more than several reward curves.
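The matching conditions and the deployment-facing fields can be enforced in code before any comparison is printed. The following is a sketch under stated assumptions: `RunSummary`, its field names, and every number in the example rows are hypothetical placeholders, not results from any benchmark.

```python
# Matched-panel audit table: refuse to compare runs that differ on the
# conditions that must be held fixed. All values below are made-up placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    method: str
    reset_panel: str
    observation_contract: str
    seed_count: int
    target_return: float
    real_episodes_to_target: int
    control_period_ms: float
    mean_planner_ms: float
    aborted_rollout_pct: float
    first_failure_under_shift: str


MATCHED_FIELDS = ("reset_panel", "observation_contract", "seed_count", "target_return")


def mismatched_fields(a, b):
    """List the matched-condition fields on which two runs disagree."""
    return [name for name in MATCHED_FIELDS if getattr(a, name) != getattr(b, name)]


def audit_table(runs):
    """Return a markdown table, or raise if any run is not matched to the first."""
    baseline = runs[0]
    for run in runs[1:]:
        bad = mismatched_fields(baseline, run)
        if bad:
            raise ValueError(
                f"{run.method} is not matched to {baseline.method} on: {', '.join(bad)}"
            )
    header = (
        "| Method | Episodes to target | Control period (ms) | "
        "Mean planner (ms) | Aborted rollouts (%) | First failure under shift |"
    )
    rule = "|---|---|---|---|---|---|"
    rows = [
        f"| {r.method} | {r.real_episodes_to_target} | {r.control_period_ms:g} | "
        f"{r.mean_planner_ms:g} | {r.aborted_rollout_pct:g} | {r.first_failure_under_shift} |"
        for r in runs
    ]
    return "\n".join([header, rule, *rows])


model_based = RunSummary("model_based", "panel_a", "obs_v1", 5, 0.80,
                         6, 25.0, 18.0, 4.0, "model_bias_under_contact_shift")
model_free = RunSummary("model_free", "panel_a", "obs_v1", 5, 0.80,
                        40, 25.0, 1.0, 0.0, "slow_recovery_after_slip")
table = audit_table([model_based, model_free])
print(table)
```

Raising on a mismatch is a deliberate choice: a table that silently mixes panels or seed counts is worse than no table.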
Readers should also separate efficiency from engineering burden. A method that uses fewer real episodes but requires days of model tuning, fragile horizon schedules, and constant calibration babysitting may still be the right choice for scarce-data robotics, but the trade should be stated explicitly.
For every efficiency claim, save target return, real interaction count, planner latency, held-out model error, and at least one tagged failure episode. If any of those fields are missing, the comparison is incomplete.
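That completeness rule is easy to automate. In the sketch below the key names in `REQUIRED_FIELDS` and the example records are hypothetical; they mirror the five fields listed above and are not a standard schema.

```python
# Completeness check for an efficiency claim record.
# Key names and example values are illustrative assumptions.

REQUIRED_FIELDS = (
    "target_return",
    "real_interaction_count",
    "planner_latency_ms",
    "heldout_model_error",
    "tagged_failure_episodes",
)


def missing_fields(record):
    """Return required fields that are absent, None, or (for episodes) empty."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None:
            missing.append(name)
        elif name == "tagged_failure_episodes" and len(value) == 0:
            # At least one tagged failure episode is required.
            missing.append(name)
    return missing


complete = {
    "target_return": 0.80,
    "real_interaction_count": 6,
    "planner_latency_ms": 18,
    "heldout_model_error": 0.041,
    "tagged_failure_episodes": ["episode_0042_contact_shift"],
}
incomplete = {
    "target_return": 0.80,
    "real_interaction_count": 6,
    "planner_latency_ms": 18,
    "tagged_failure_episodes": [],
}
print(missing_fields(complete))
print(missing_fields(incomplete))
```

A check like this can run before a comparison is published, so an incomplete record is flagged instead of being quietly plotted.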
Benchmark gains can hide deployment regressions. A model-based method that learns fast in simulation but overruns the control period or misranks rare contact states is not ready just because its reward curve rose sooner.
A drone policy that reaches competent flight with half the real data of a model-free baseline may still be unacceptable if its planner occasionally stalls under wind-gust outliers. A warehouse arm may learn faster but remain unusable if its uncertainty estimates stay narrow exactly when the box geometry changes.
This section connects to deployment and safety material in Chapter 54 and Chapter 55.
The field is moving toward larger latent world models and stronger planners, but the evaluation bar must rise with it. Recent systems can look excellent on aggregate returns while still failing on calibration, latency, or real-robot shift. Those failure channels need first-class reporting.
If a model-based method reaches target return with fewer episodes but twice the planner latency and worse shift robustness, would you still call it better? What additional evidence would you need?
Sample efficiency is the opening argument. Failure analysis is the cross-examination.
Model-based RL often earns its place through data efficiency, but only a joint audit of efficiency, bias, uncertainty, and timing tells you whether the method is truly better.
Design an evidence card for a model-based benchmark in your domain. Which fields are mandatory before you would believe the sample-efficiency claim?
Bibliography & Further Reading
Primary References And Tools
Chua, K. et al. "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." (2018). https://arxiv.org/abs/1805.12114
A standard reference for strong sample efficiency under uncertainty-aware planning.
Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253
A practical efficiency reference that also foregrounds model-trust limits.
Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828
A modern frontier baseline worth studying for both gains and remaining risks.