Chapter 37: Model-Based RL and MPC | Building Embodied AI: From Perception to Autonomous Action

"Model-based control is what happens when learning and planning agree to share the same clock budget."
A Budget-Conscious MPC Loop

Big Picture

This chapter joins learned dynamics with online planning. It asks when learning a model beats direct policy fitting, how uncertainty should gate planner trust, and which artifacts robotics engineers must save to defend a sample-efficiency claim.

Remember This Chapter

The strongest model-based systems do not plan farther by default. They plan only as far as the model is trustworthy, then hand the rest to feedback, value estimation, or replanning.
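
One way to make "plan only as far as the model is trustworthy" concrete is to cut the planning horizon at the first step where ensemble members disagree too much. The sketch below is a minimal illustration under assumptions of our own: the function name `trusted_horizon`, the disagreement measure (standard deviation across members, averaged over state dimensions), and the threshold are all hypothetical choices, not a rule from any specific library or paper.

```python
import numpy as np

def trusted_horizon(ensemble_rollouts, threshold):
    """Count the leading steps whose ensemble disagreement stays at or below threshold.

    ensemble_rollouts: array of shape (members, horizon, state_dim) holding the
    same action sequence rolled out through each ensemble member.
    """
    # Disagreement per step: std across members, averaged over state dims.
    disagreement = ensemble_rollouts.std(axis=0).mean(axis=-1)  # (horizon,)
    exceeded = disagreement > threshold
    # Index of the first step that exceeds the threshold; full horizon if none do.
    return int(np.argmax(exceeded)) if exceeded.any() else len(disagreement)

steps = np.arange(1, 11)  # 10-step horizon
# Three hypothetical members whose predictions drift apart linearly over time.
rollouts = np.stack([s * steps[:, None] * np.ones((10, 2)) for s in (0.9, 1.0, 1.1)])
print(trusted_horizon(rollouts, threshold=0.5))
```

In this toy case the disagreement grows linearly with the step index, so the usable horizon shrinks as the threshold tightens. Beyond the returned step, the planner would lean on a value estimate or simply replan sooner.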

Chapter Overview

Chapter 37 moves from trade-offs to implementation. It compares model-free and model-based learning, explains ensembles and uncertainty, derives shooting-style MPC with CEM and MPPI, studies imagination rollouts, and closes on sample efficiency together with failure modes that matter in robotics.

The practical thread points to real libraries and papers: MuJoCo MPC, TD-MPC, TD-MPC2, PETS, MBPO, and standard simulation stacks. The theory thread keeps returning to deployment realities such as actuation delay, model bias, planner compute budgets, and the difference between online improvement and offline demos.

Prerequisites

Readers should be comfortable with RL objectives, value functions, control costs, and short-horizon optimization. Chapter 7 and Chapter 16 make this chapter much easier to digest.

Chapter Roadmap

37.1 Model-free vs. model-based trade-offs
Frames the regime question: when data, compute, and model bias make planning worth the trouble.
37.2 Learning dynamics models; ensembles and uncertainty
Builds the predictive core used by planners, with explicit attention to epistemic uncertainty and support mismatch.
37.3 Planning with learned models; MPC and CEM/MPPI
Derives receding-horizon planning over learned dynamics and compares major optimizer families.
37.4 Imagination rollouts
Shows how short model rollouts can improve value learning while avoiding the worst compounding-error traps.
37.5 Sample-efficiency advantages and failure modes
Audits what model-based methods gain in data efficiency and where they fail in practice.

Tooling Note

For concrete builds, reach first for MuJoCo or MuJoCo MPC when real-time predictive control matters, Gymnasium for experiment contracts, and codebases such as tdmpc or tdmpc2 when you want a modern latent-MPC baseline rather than a from-scratch planner.

Hands-On Lab: Build A Learned-Dynamics MPC Benchmark

Duration: about 100 minutes
Difficulty: Advanced

Objective

Train a small ensemble dynamics model, attach a shooting-based MPC loop, and compare it with a model-free baseline on one robot-control task under the same episode and seed budget.

Skills

Fit predictive models and evaluate calibration.
Implement CEM or MPPI planning with a real compute budget.
Diagnose failures as model bias, optimizer failure, or interface mismatch.
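
For the calibration skill, a simple first check is empirical interval coverage: if the model outputs Gaussian predictive means and standard deviations, roughly 68% of held-out targets should fall within one predicted standard deviation. The sketch below uses synthetic data and a hypothetical helper name, `interval_coverage`; it illustrates the idea and is not a full calibration suite.

```python
import numpy as np

def interval_coverage(pred_mean, pred_std, target, z=1.0):
    """Fraction of targets within z predicted standard deviations of the mean."""
    inside = np.abs(target - pred_mean) <= z * pred_std
    return float(inside.mean())

rng = np.random.default_rng(0)
mean = np.zeros(20000)
target = rng.normal(mean, 1.0)  # synthetic held-out targets with true std 1.0

# A model that reports the true std versus one that reports half of it.
calibrated = interval_coverage(mean, np.full_like(mean, 1.0), target)
overconfident = interval_coverage(mean, np.full_like(mean, 0.5), target)
print(f"calibrated: {calibrated:.3f}, overconfident: {overconfident:.3f}")
```

Coverage well below the nominal level, as in the second case, signals overconfidence, which is the failure that most directly misleads an uncertainty-gated planner.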

Prerequisites

Python, NumPy or JAX, a simulator with state access, and basic familiarity with control costs and rollout buffers.

Steps

Step 1: Collect transitions
Generate a fixed exploration dataset and reserve a held-out panel for evaluating one-step and multi-step prediction.
Step 2: Fit an ensemble model
Train several bootstrap members that predict state deltas or latent transitions.
Step 3: Add a planner
Use CEM or MPPI to optimize short action sequences under the learned model and execute only the first action.
Step 4: Compare with a baseline
Evaluate against a reactive controller or model-free agent using the same success metric and episode budget.
Step 5: Audit failure cases
For at least five bad episodes, decide whether failure came from the model, the optimizer, uncertainty gating, or control execution.
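
Steps 1 to 3 above can be sketched end to end on a toy problem. The example below is a minimal illustration under stated assumptions: a one-dimensional point mass stands in for the simulator, each ensemble member is a linear least-squares model of state deltas fit on a bootstrap resample, and the planner is a plain CEM loop that averages cost across members. All names (`true_step`, `fit_member`, `rollout_cost`, `cem_plan`) and hyperparameters are hypothetical; a real lab would replace the linear members with neural networks and the toy dynamics with MuJoCo or a Gymnasium environment.

```python
import numpy as np

rng = np.random.default_rng(0)
DT = 0.1

def true_step(state, action):
    """Toy point mass standing in for the simulator: state = [position, velocity]."""
    pos, vel = state
    return np.array([pos + DT * vel, vel + DT * action])

# Step 1: collect random-exploration transitions (state, action, state delta).
states = rng.uniform(-2, 2, size=(500, 2))
actions = rng.uniform(-1, 1, size=(500, 1))
deltas = np.array([true_step(s, a[0]) - s for s, a in zip(states, actions)])
deltas += 0.01 * rng.normal(size=deltas.shape)  # mild observation noise
inputs = np.hstack([states, actions])

# Step 2: bootstrap ensemble of linear delta models (least squares per member).
def fit_member(x, y):
    idx = rng.integers(0, len(x), size=len(x))  # resample with replacement
    weights, *_ = np.linalg.lstsq(x[idx], y[idx], rcond=None)
    return weights  # shape (3, 2): [pos, vel, action] -> [d_pos, d_vel]

ensemble = np.stack([fit_member(inputs, deltas) for _ in range(5)])  # (5, 3, 2)

def rollout_cost(state, action_seqs):
    """Cost of each candidate sequence, averaged over ensemble members."""
    n_members, (n_samples, horizon) = len(ensemble), action_seqs.shape
    s = np.broadcast_to(state, (n_members, n_samples, 2)).copy()
    cost = np.zeros((n_members, n_samples))
    for t in range(horizon):
        a = np.broadcast_to(action_seqs[:, t, None], (n_members, n_samples, 1))
        x = np.concatenate([s, a], axis=-1)              # (members, samples, 3)
        s = s + np.einsum("msi,mio->mso", x, ensemble)   # predicted next state
        cost += s[..., 0] ** 2 + 0.1 * s[..., 1] ** 2    # drive position to zero
    return cost.mean(axis=0)

# Step 3: CEM planner over short action sequences.
def cem_plan(state, horizon=10, n_samples=200, n_elite=20, n_iters=4):
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        seqs = np.clip(mean + std * rng.normal(size=(n_samples, horizon)), -1, 1)
        elite = seqs[np.argsort(rollout_cost(state, seqs))[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean

# Receding-horizon loop: execute only the first planned action, then replan.
state = np.array([1.5, 0.0])
for _ in range(60):
    action = cem_plan(state)[0]
    state = true_step(state, action)
print("final |position|:", abs(state[0]))
```

Because the planner, the model, and the executed dynamics are separate functions, the audit in Step 5 can swap `true_step` into `rollout_cost` to test whether a bad episode is a model problem or an optimizer problem.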

Expected Result

A reproducible folder containing dataset metadata, model checkpoints, held-out error tables, planner traces, planner timing, and a short diagnosis for each failed episode.

Stretch Goals

Swap CEM for MPPI or add a terminal value function, then compare whether the extra structure improves regret, latency, or action smoothness on the same matched panel.
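
For the CEM-to-MPPI swap, the core change is replacing hard elite selection with an exponentially weighted average of all sampled perturbations. The sketch below shows one such update on a toy quadratic cost; the function name `mppi_update`, the temperature value, and the noise scale are illustrative assumptions, and practical MPPI implementations add details (control-cost terms, warm starts, action clipping) that are omitted here.

```python
import numpy as np

def mppi_update(mean, noise, costs, temperature=1.0):
    """One MPPI-style update: softmax-weighted average of sampled perturbations.

    mean:  (horizon,) current nominal action sequence
    noise: (n_samples, horizon) perturbations that were added to mean
    costs: (n_samples,) rollout cost of each perturbed sequence
    """
    # Subtract the minimum cost for numerical stability before exponentiating.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return mean + weights @ noise

rng = np.random.default_rng(0)
target = np.array([0.5, -0.5, 0.0])  # hypothetical optimal action sequence
mean = np.zeros(3)
for _ in range(30):
    noise = 0.3 * rng.normal(size=(256, 3))
    costs = ((mean + noise - target) ** 2).sum(axis=1)
    mean = mppi_update(mean, noise, costs, temperature=0.05)
print(np.round(mean, 2))
```

A lower temperature concentrates weight on the best samples and behaves more like CEM with a small elite set; a higher temperature averages more broadly and tends to give smoother action updates.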

This chapter is strong material for a capstone week because students can feel the trade-offs immediately: longer horizon helps only while the model is trusted, bigger ensembles help only if the planner reads them correctly, and fancy optimization still fails if the control loop misses its timing budget.

Treat the computational budget as part of the scientific argument. Sample count, rollout horizon, warm-start logic, and controller period together determine whether MPC remains elegant theory or becomes a deployable decision loop on a real robot.
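
The budget arithmetic is simple enough to check before any tuning. The sketch below multiplies out the model evaluations a sampling planner needs per control step and compares the implied planning time with the controller period. The helper name `planner_fits_budget` and every number in the call are hypothetical; in practice the throughput figure should be measured on the target hardware.

```python
def planner_fits_budget(n_samples, horizon, n_iters, n_members,
                        evals_per_second, control_hz):
    """Return (model evaluations per control step, planning time in s, fits budget?)."""
    evals = n_samples * horizon * n_iters * n_members
    plan_time = evals / evals_per_second   # seconds per replanning call
    period = 1.0 / control_hz              # seconds available per control step
    return evals, plan_time, plan_time <= period

evals, plan_time, ok = planner_fits_budget(
    n_samples=500, horizon=20, n_iters=4, n_members=5,
    evals_per_second=5e6,  # hypothetical measured throughput of the learned model
    control_hz=50,
)
print(evals, f"{plan_time * 1e3:.0f} ms", ok)
```

With these illustrative numbers the planner needs 200,000 model evaluations and about 40 ms per step against a 20 ms control period, so something has to give: fewer samples, a shorter horizon, fewer iterations with warm starts, or a slower control rate.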

Readiness Check

Before leaving the chapter, the reader should be able to explain one situation where model-based RL is the right tool, one where it is not, one artifact needed to justify a sample-efficiency claim, and one concrete failure mode caused by model bias.

Teaching Takeaway

A chapter on model-based RL is successful when students stop treating the model as an oracle and start treating it as another fallible subsystem with interfaces, costs, and failure modes.

Bibliography & Further Reading

Foundational Papers, Tools, and References

Deisenroth, M., and Rasmussen, C. "PILCO: A Model-Based and Data-Efficient Approach to Policy Search." (2011). https://dl.acm.org/doi/10.5555/3104482.3104583

PILCO is the classical sample-efficiency anchor for uncertainty-aware model-based control.

Chua, K. et al. "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." (2018). https://arxiv.org/abs/1805.12114

PETS remains the clearest uncertainty-aware ensemble baseline for model-based RL.

Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253

MBPO is the key reference for short trusted imagination rollouts.

Hansen, N., Wang, X., and Su, H. "Temporal Difference Learning for Model Predictive Control." (2022). https://arxiv.org/abs/2203.04955

TD-MPC is the clean bridge between latent dynamics, online planning, and terminal value learning.

Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828

TD-MPC2 is the modern frontier baseline for scalable latent model-based control.

DeepMind. "MuJoCo MPC." (accessed 2026). https://github.com/google-deepmind/mujoco_mpc

MJPC is a practical framework for real-time predictive control with multiple planner families.
