"Model-based control is what happens when learning and planning agree to share the same clock budget."
A Budget-Conscious MPC Loop
This chapter on model-based RL and MPC joins learned dynamics with online planning. It asks when learning a model beats direct policy fitting, how uncertainty should gate planner trust, and what robotics engineers must save to defend a sample-efficiency claim.
The strongest model-based systems do not plan farther by default. They plan only as far as the model is trustworthy, then hand the rest to feedback, value estimation, or replanning.
Chapter Overview
Chapter 37 moves from trade-offs to implementation. It compares model-free and model-based learning, explains ensembles and uncertainty, derives shooting-style MPC with CEM and MPPI, studies imagination rollouts, and closes on sample efficiency together with failure modes that matter in robotics.
The practical thread points to real libraries and papers: MuJoCo MPC, TD-MPC, TD-MPC2, PETS, MBPO, and standard simulation stacks. The theory thread keeps returning to deployment realities such as actuation delay, model bias, planner compute budgets, and the difference between online improvement and offline demos.
Prerequisites
Readers should be comfortable with RL objectives, value functions, control costs, and short-horizon optimization. Chapter 7 and Chapter 16 make this chapter much easier to digest.
Chapter Roadmap
- 37.1 Model-free vs. model-based trade-offs: Frames the regime question: when data, compute, and model bias make planning worth the trouble.
- 37.2 Learning dynamics models; ensembles and uncertainty: Builds the predictive core used by planners, with explicit attention to epistemic uncertainty and support mismatch.
- 37.3 Planning with learned models; MPC and CEM/MPPI: Derives receding-horizon planning over learned dynamics and compares major optimizer families.
- 37.4 Imagination rollouts: Shows how short model rollouts can improve value learning while avoiding the worst compounding-error traps.
- 37.5 Sample-efficiency advantages and failure modes: Audits what model-based methods gain in data efficiency and where they fail in practice.
For concrete builds, reach first for MuJoCo or MuJoCo MPC when real-time predictive control matters, Gymnasium for experiment contracts, and codebases such as tdmpc or tdmpc2 when you want a modern latent-MPC baseline rather than a from-scratch planner.
Hands-On Lab: Build A Learned-Dynamics MPC Benchmark
Objective
Train a small ensemble dynamics model, attach a shooting-based MPC loop, and compare it with a model-free baseline on one robot-control task under the same episode and seed budget.
Skills
- Fit predictive models and evaluate calibration.
- Implement CEM or MPPI planning with a real compute budget.
- Diagnose failures as model bias, optimizer failure, or interface mismatch.
Prerequisites
Python, NumPy or JAX, a simulator with state access, and basic familiarity with control costs and rollout buffers.
Steps
Step 1: Collect transitions
Generate a fixed exploration dataset and reserve a held-out panel for evaluating one-step and multi-step prediction.
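As a minimal sketch of Step 1, the snippet below collects random-action episodes, holds out whole episodes rather than individual transitions, and measures open-loop prediction error at each horizon. Everything here is an assumption for illustration: the 1-D double integrator `true_step` stands in for a simulator, and `biased_model_step` is a hypothetical imperfect model whose actuator gain is 10% too high.

```python
import numpy as np

DT = 0.05  # assumed control period for the toy system

def true_step(state, action):
    """Toy ground truth: 1-D double integrator with state (position, velocity)."""
    vel = state[..., 1] + DT * action
    pos = state[..., 0] + DT * vel
    return np.stack([pos, vel], axis=-1)

def biased_model_step(state, action):
    """Hypothetical imperfect model: overestimates actuator gain by 10%."""
    vel = state[..., 1] + DT * 1.1 * action
    pos = state[..., 0] + DT * vel
    return np.stack([pos, vel], axis=-1)

def collect(n_episodes, horizon, rng):
    """Roll out uniform random actions from random initial states."""
    actions = rng.uniform(-1.0, 1.0, size=(n_episodes, horizon))
    states = np.empty((n_episodes, horizon + 1, 2))
    states[:, 0] = rng.uniform(-1.0, 1.0, size=(n_episodes, 2))
    for t in range(horizon):
        states[:, t + 1] = true_step(states[:, t], actions[:, t])
    return states, actions

def multi_step_rmse(states, actions, model_step):
    """Open-loop RMSE at horizons 1..H, starting from each episode's first state."""
    pred = states[:, 0]
    errors = []
    for t in range(actions.shape[1]):
        pred = model_step(pred, actions[:, t])  # feed predictions back in
        errors.append(np.sqrt(np.mean((pred - states[:, t + 1]) ** 2)))
    return np.array(errors)

rng = np.random.default_rng(0)
states, actions = collect(n_episodes=100, horizon=20, rng=rng)
holdout = rng.permutation(100)[:20]  # hold out whole episodes, not single steps
errors = multi_step_rmse(states[holdout], actions[holdout], biased_model_step)
print("1-step RMSE:", errors[0], "20-step RMSE:", errors[-1])
```

Splitting by episode keeps neighboring, highly correlated transitions from leaking between the training set and the held-out panel. Reporting error per horizon, not just one-step error, is what exposes compounding model bias.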
Step 2: Fit an ensemble model
Train several bootstrap members that predict state deltas or latent transitions.
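One way to picture Step 2 is a bootstrap ensemble of very small models. The sketch below uses linear least-squares members that predict state deltas, which is an assumption chosen to keep the example short; a real lab would more likely use neural networks. The toy data generator, the function names, and the use of member disagreement as a rough proxy for epistemic uncertainty are all illustrative choices, not a prescribed design.

```python
import numpy as np

def make_transitions(n=500, dt=0.05, seed=0):
    """Toy dataset from a noisy 1-D double integrator (illustrative only)."""
    rng = np.random.default_rng(seed)
    states = rng.uniform(-1.0, 1.0, size=(n, 2))
    actions = rng.uniform(-1.0, 1.0, size=(n, 1))
    vel_next = states[:, 1] + dt * actions[:, 0]
    pos_next = states[:, 0] + dt * vel_next
    next_states = np.stack([pos_next, vel_next], axis=1)
    next_states += 0.01 * rng.normal(size=next_states.shape)  # observation noise
    return states, actions, next_states

def fit_ensemble(states, actions, next_states, n_members=5, seed=0):
    """Fit each member on its own bootstrap resample; targets are state deltas."""
    rng = np.random.default_rng(seed)
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])  # bias column
    deltas = next_states - states
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(inputs), size=len(inputs))  # sample with replacement
        weights, *_ = np.linalg.lstsq(inputs[idx], deltas[idx], rcond=None)
        members.append(weights)
    return np.stack(members)  # shape (n_members, 4, 2)

def predict(ensemble, state, action):
    """Return the ensemble mean next state and the per-dimension member spread."""
    x = np.concatenate([state, np.atleast_1d(action), [1.0]])
    deltas = np.einsum("i,mij->mj", x, ensemble)  # one delta per member
    next_states = state + deltas
    return next_states.mean(axis=0), next_states.std(axis=0)

states, actions, next_states = make_transitions()
ensemble = fit_ensemble(states, actions, next_states)
mean, spread = predict(ensemble, np.array([0.2, -0.1]), 0.5)
print("mean next state:", mean, "member spread:", spread)
```

Predicting deltas rather than absolute next states keeps the regression targets small and centered. The member spread is only informative where the members were trained on meaningfully different resamples, so it should be checked against held-out error before a planner relies on it.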
Step 3: Add a planner
Use CEM or MPPI to optimize short action sequences under the learned model and execute only the first action.
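The receding-horizon loop in Step 3 can be sketched with a basic CEM planner. In this example `model_step` is a stand-in for the learned model (here it is simply the toy double integrator), the quadratic cost is hypothetical, and the sample count, elite count, and horizon are illustrative values rather than recommendations.

```python
import numpy as np

def model_step(states, actions, dt=0.05):
    """Stand-in for a learned model, batched: states (N, 2), actions (N,)."""
    vel = states[:, 1] + dt * actions
    pos = states[:, 0] + dt * vel
    return np.stack([pos, vel], axis=1)

def rollout_cost(state, action_seqs):
    """Total hypothetical quadratic cost of each candidate action sequence."""
    n, horizon = action_seqs.shape
    states = np.tile(state, (n, 1))
    cost = np.zeros(n)
    for t in range(horizon):
        states = model_step(states, action_seqs[:, t])
        cost += states[:, 0] ** 2 + 0.1 * states[:, 1] ** 2
        cost += 0.01 * action_seqs[:, t] ** 2
    return cost

def cem_plan(state, mean, rng, n_samples=200, n_elites=20, n_iters=4, a_max=1.0):
    """Refine a Gaussian over action sequences by refitting it to the elites."""
    std = np.full_like(mean, 0.5 * a_max)
    for _ in range(n_iters):
        samples = mean + std * rng.normal(size=(n_samples, mean.size))
        samples = np.clip(samples, -a_max, a_max)  # respect actuator limits
        costs = rollout_cost(state, samples)
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

rng = np.random.default_rng(0)
state = np.array([1.0, 0.0])  # start 1 unit away from the target at the origin
plan = np.zeros(15)           # planning horizon of 15 steps
for _ in range(100):
    plan = cem_plan(state, plan, rng)
    state = model_step(state[None, :], plan[:1])[0]  # execute only the first action
    plan = np.append(plan[1:], 0.0)                  # warm start: shift the plan
print("final state:", state)
```

Shifting the previous plan forward by one step is a common warm start that lets each replanning call use fewer iterations. In the lab, the call to `model_step` inside the execution line would be replaced by the simulator, while the planner keeps using the learned model.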
Step 4: Compare with a baseline
Evaluate against a reactive controller or model-free agent using the same success metric and episode budget.
Step 5: Audit failure cases
For at least five bad episodes, decide whether failure came from the model, the optimizer, uncertainty gating, or control execution.
Expected Result
A reproducible folder containing dataset metadata, model checkpoints, held-out error tables, planner traces, planner timing, and a short diagnosis for each failed episode.
Stretch Goals
Swap CEM for MPPI or add a terminal value function, then compare whether the extra structure improves regret, latency, or action smoothness on the same matched panel.
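For the CEM-to-MPPI swap, the main change is how samples are combined: instead of refitting to a hard elite set, MPPI takes a softmax-weighted average of all perturbations. The sketch below shows only that weighting step in a simplified form; the function name and temperature value are assumptions, and the control-cost correction terms found in full MPPI derivations are omitted.

```python
import numpy as np

def mppi_update(mean, noise, costs, temperature=1.0):
    """Simplified MPPI step.

    mean:  (H,) current nominal action sequence
    noise: (N, H) perturbations that were added to `mean` to form the samples
    costs: (N,) rollout cost of each perturbed sequence
    """
    shifted = costs - costs.min()  # subtract the minimum for numerical stability
    weights = np.exp(-shifted / temperature)
    weights /= weights.sum()
    return mean + weights @ noise, weights

rng = np.random.default_rng(0)
mean = np.zeros(5)
noise = rng.normal(scale=0.5, size=(64, 5))
costs = ((mean + noise - 0.3) ** 2).sum(axis=1)  # toy cost: track the value 0.3
new_mean, weights = mppi_update(mean, noise, costs, temperature=0.5)
greedy_mean, _ = mppi_update(mean, noise, costs, temperature=1e-6)
print("updated mean:", new_mean)
```

A low temperature makes the update approach "pick the single best sample", while a high temperature approaches a plain average of the noise. That single parameter plays roughly the role that the elite fraction plays in CEM, which is worth holding fixed when comparing the two on a matched panel.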
This chapter is strong material for a capstone week because students can feel the trade-offs immediately: a longer horizon helps only while the model is trusted, bigger ensembles help only if the planner reads them correctly, and fancy optimization still fails if the control loop misses its timing budget.
Computational budget should be treated as part of the scientific argument here. Sample count, rollout horizon, warm-start logic, and controller period all constrain whether MPC is elegant theory or a deployable decision loop on a real robot, so the chapter index should say that plainly.
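A back-of-the-envelope check makes the budget argument concrete. The helper below is a hypothetical sketch that assumes model calls run one after another with a fixed amortized cost per call; batching on a GPU or parallel rollouts would change the numbers, so the per-call time should be measured rather than guessed.

```python
def planner_budget(n_samples, horizon, n_iters, n_members,
                   model_call_s, control_period_s):
    """Estimate planning time per control step and compare it to the period."""
    model_calls = n_samples * horizon * n_iters * n_members
    plan_time_s = model_calls * model_call_s
    return {
        "model_calls": model_calls,
        "plan_time_s": plan_time_s,
        "utilization": plan_time_s / control_period_s,
        "fits": plan_time_s <= control_period_s,
    }

# 200 samples, 15-step horizon, 4 optimizer iterations, 5 ensemble members,
# an assumed 1 microsecond per amortized model call, and a 50 Hz control loop.
budget = planner_budget(200, 15, 4, 5, 1e-6, 0.02)
print(budget)

# A trimmed configuration that halves samples and iterations and shortens the horizon.
trimmed = planner_budget(100, 10, 2, 5, 1e-6, 0.02)
print(trimmed)
```

Under these assumed numbers the first configuration needs 60,000 model calls and about 0.06 s per plan, three times the 0.02 s control period, while the trimmed one needs 10,000 calls and about 0.01 s. The point is that planner settings and controller period have to be reported together.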
Before leaving the chapter, the reader should be able to explain one situation where model-based RL is the right tool, one where it is not, one artifact needed to justify a sample-efficiency claim, and one concrete failure mode caused by model bias.
A chapter on model-based RL is successful when students stop treating the model as an oracle and start treating it as another fallible subsystem with interfaces, costs, and failure modes.
Bibliography & Further Reading
Foundational Papers, Tools, and References
Deisenroth, M., and Rasmussen, C. "PILCO: A Model-Based and Data-Efficient Approach to Policy Search." (2011). https://dl.acm.org/doi/10.5555/3104482.3104583
PILCO is the classical sample-efficiency anchor for uncertainty-aware model-based control.
Chua, K. et al. "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." (2018). https://arxiv.org/abs/1805.12114
PETS remains the clearest uncertainty-aware ensemble baseline for model-based RL.
Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253
MBPO is the key reference for short trusted imagination rollouts.
Hansen, N., Wang, X., and Su, H. "Temporal Difference Learning for Model Predictive Control." (2022). https://arxiv.org/abs/2203.04955
TD-MPC is the clean bridge between latent dynamics, online planning, and terminal value learning.
Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828
TD-MPC2 is the modern frontier baseline for scalable latent model-based control.
DeepMind. "MuJoCo MPC." (accessed 2026). https://github.com/google-deepmind/mujoco_mpc
MJPC is a practical framework for real-time predictive control with multiple planner families.