Section 37.3: Planning with learned models; MPC and CEM/MPPI

A planner earns trust by choosing better first actions before the clock runs out.

A Budget-Conscious MPC Loop
[Figure: candidate action sequences rolled through a learned model and rescored by MPC, with the CEM and MPPI variants highlighted.]
Figure 37.3A: MPC over learned dynamics is a loop of sampling, scoring, executing one action, and replanning. CEM and MPPI differ mainly in how they search the action-sequence space.

Big Picture

Planning with learned models is where model-based RL becomes online decision-making instead of offline curve fitting. The planner must optimize action sequences quickly enough to matter, while remaining robust to model error and sensor staleness.

Key Insight

The planner wins only if it returns a better first action before the control clock expires. Search quality and timing are inseparable parts of the method.

Shooting-Based MPC Over Learned Dynamics

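Given a learned latent or physical transition model, a shooting planner samples or optimizes an action sequence $a_{t:t+H-1}$ over a horizon $H$ and scores the predicted trajectory $\hat s_{t:t+H}$ under a stage cost $c$ plus a terminal term $V$ (written here as a cost-to-go, so that lower $J$ is better):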

$$ J(a_{t:t+H-1}) = \sum_{k=0}^{H-1} c(\hat s_{t+k}, a_{t+k}) + V(\hat s_{t+H}). $$
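The objective above can be evaluated with a short loop. The sketch below is only illustrative: `model_step`, `stage_cost`, and `terminal_value` are hypothetical stand-ins (a 1-D integrator, a quadratic goal cost, and a hand-written cost-to-go), not a learned model or value function from any particular system.

```python
# Minimal sketch: evaluate J for one action sequence under a stand-in model.
# All three functions below are assumptions made for illustration.

def model_step(s, a):
    """Stand-in for the learned transition model: a 1-D integrator."""
    return s + 0.2 * a

def stage_cost(s, a):
    """c(s, a): squared distance to a goal at 1.0 plus a small effort penalty."""
    return (1.0 - s) ** 2 + 0.01 * a ** 2

def terminal_value(s):
    """V(s): assumed cost-to-go estimate for everything beyond the horizon."""
    return 5.0 * (1.0 - s) ** 2

def sequence_cost(s0, actions):
    """J = sum_k c(s_hat_{t+k}, a_{t+k}) + V(s_hat_{t+H})."""
    s, total = s0, 0.0
    for a in actions:
        total += stage_cost(s, a)  # cost at the predicted state, before stepping
        s = model_step(s, a)       # advance the prediction by one step
    return total + terminal_value(s)

print(round(sequence_cost(0.0, [0.18, 0.18, 0.18]), 4))
```

With a three-step horizon the terminal term dominates the total here, which illustrates why the scaling of $V$ relative to $c$ matters as much as the stage cost itself.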

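CEM iteratively refits a search distribution around the lowest-cost (elite) sequences. MPPI keeps all sampled trajectories and weights each by its exponentiated negative cost, so cheaper trajectories contribute more to the update. Both are practical because they only need to query the model for rollouts: the model does not have to be differentiable, or perfect, to be useful.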

The planner interface is where many systems quietly fail. The action parameterization must reflect what the actuator can actually execute, the horizon must fit inside the control period, and the terminal value must be defined on the same state representation produced by the rollout model. If the optimizer proposes commands that the low-level controller clips away, the apparent planner quality can be mostly illusion.
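The clipping problem is easy to reproduce. The sketch below scores the same candidates twice, once as proposed and once after a hypothetical actuator limit `U_MAX` is applied inside the rollout. The toy integrator and cost are assumptions chosen for illustration.

```python
# Minimal sketch: planner scores with and without the actuator limit.
# U_MAX and the toy dynamics are illustrative assumptions.

U_MAX = 0.2  # hypothetical limit enforced by the low-level controller

def clip(u, limit=U_MAX):
    """Saturate a command to what the actuator can execute."""
    return max(-limit, min(limit, u))

def score(seq, respect_limits):
    """Roll a sequence through a toy integrator and sum the per-step cost."""
    x, cost = 0.0, 0.0
    for u in seq:
        applied = clip(u) if respect_limits else u
        x += 0.2 * applied
        cost += (1.0 - x) ** 2 + 0.01 * applied ** 2
    return cost

candidates = {
    "aggressive": [1.0, 1.0, 1.0],  # far outside the actuator range
    "feasible": [0.2, 0.2, 0.2],    # exactly at the limit
}

for name, seq in candidates.items():
    believed = score(seq, respect_limits=False)
    executed = score(seq, respect_limits=True)
    print(f"{name}: believed={believed:.4f} executed={executed:.4f}")
```

The aggressive plan looks far better to a planner that ignores the limit, yet after clipping it executes exactly like the feasible plan. Sampling inside the executable range, or clipping inside the rollout as done here, keeps the ranking honest.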

Planner Families

| Planner | Strength | Typical weakness |
| --- | --- | --- |
| CEM | Simple, robust, easy to parallelize | Can waste samples in high-dimensional action spaces |
| MPPI | Smooth control updates, strong with stochastic control costs | Sensitive to temperature and noise scale |
| Differentiable shooting or iLQG | Fast local refinement when gradients are good | Brittle under bad models or poor initialization |
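MPPI's temperature sensitivity can be made concrete with a few lines. The sketch below computes softmin weights over a handful of candidate costs and reports the effective sample size and the blended first action at three temperatures. The costs and first actions are illustrative numbers, and `mppi_weights` is a hypothetical helper, not the API of any library.

```python
import math

def mppi_weights(costs, temperature):
    """Softmin weights: w_i proportional to exp(-(J_i - min J) / temperature)."""
    best = min(costs)  # subtracting the minimum keeps exp() numerically safe
    unnorm = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def effective_sample_size(weights):
    """1 / sum(w^2): close to 1 when one sample dominates, N when uniform."""
    return 1.0 / sum(w * w for w in weights)

# Illustrative candidate costs and the first action of each candidate.
costs = [2.75, 2.59, 2.66, 2.71]
first_actions = [0.10, 0.18, 0.14, 0.20]

for temperature in (0.001, 0.05, 10.0):
    w = mppi_weights(costs, temperature)
    # Because the weights sum to 1, a weighted average of candidate actions
    # equals the nominal action plus the weighted noise.
    blended = sum(wi * a for wi, a in zip(w, first_actions))
    print(temperature, round(effective_sample_size(w), 2), round(blended, 3))
```

A very small temperature collapses the update onto the single best sample, which behaves like greedy selection. A very large one averages all samples almost uniformly and ignores the cost signal. Useful settings sit between those extremes and depend on the cost scale.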

Worked Probe

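The compact example below runs the scoring-and-selection half of one CEM iteration: it scores four candidate sequences and picks the best, without refitting the sampling distribution. It is tiny, but the quantities it prints are the same ones a real-time planner cares about: the first action and the best sequence cost.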

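# One CEM-style elite selection step for a short-horizon planner.
# Each candidate is a three-step action sequence.
candidates = {
    "u0": [0.10, 0.12, 0.10],
    "u1": [0.18, 0.18, 0.18],
    "u2": [0.14, 0.15, 0.16],
    "u3": [0.20, 0.05, 0.05],
}

def score(seq):
    """Roll a sequence through a toy integrator and sum the per-step cost."""
    x = 0.0      # predicted state, starting at the origin
    cost = 0.0
    for u in seq:
        x += u * 0.2                               # toy dynamics: x <- x + 0.2 * u
        cost += (1.0 - x) ** 2 + 0.01 * (u ** 2)   # squared goal distance + effort
    return round(cost, 4)

# Score every candidate and keep the lowest-cost plan.
scored = {name: score(seq) for name, seq in candidates.items()}
best_name = min(scored, key=scored.get)
print({"best_plan": best_name, "first_action": candidates[best_name][0], "score": scored[best_name]})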

{'best_plan': 'u1', 'first_action': 0.18, 'score': 2.5871}

Read the best-plan name and first action as the receding-horizon contract: the planner evaluated the full three-step cost for every candidate and selected u1 because it applies the largest total control (0.54 summed over three steps), which moves the predicted state closest to the goal at 1.0. At these magnitudes the 0.01·u² effort penalty is negligible next to the distance-to-goal term. Only the first action, 0.18, is actually sent to the actuator; the rest of the sequence is discarded and the planner will rescore from the next real observation.

Code Fragment 37.3.1: The planner cares about the full sequence score, but the controller executes only the first action before the next replan. That receding-horizon structure is why imperfect rollouts can still help.
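The worked probe stops at selection. A full CEM iteration would also refit the sampling distribution to the elites. The sketch below continues from the same four candidates and assumes an elite count of two and an independent Gaussian per time step. Both are illustrative choices, not values taken from the text.

```python
from statistics import mean, pstdev

candidates = {
    "u0": [0.10, 0.12, 0.10],
    "u1": [0.18, 0.18, 0.18],
    "u2": [0.14, 0.15, 0.16],
    "u3": [0.20, 0.05, 0.05],
}

def score(seq):
    """Same toy rollout cost as the worked probe."""
    x, cost = 0.0, 0.0
    for u in seq:
        x += u * 0.2
        cost += (1.0 - x) ** 2 + 0.01 * (u ** 2)
    return cost

N_ELITE = 2  # assumed elite count for this illustration

# Rank candidates by cost and keep the lowest-cost sequences as elites.
ranked = sorted(candidates, key=lambda name: score(candidates[name]))
elites = [candidates[name] for name in ranked[:N_ELITE]]

# Refit an independent Gaussian per time step to the elite set.
horizon = len(elites[0])
new_mean = [mean([seq[k] for seq in elites]) for k in range(horizon)]
new_std = [pstdev([seq[k] for seq in elites]) for k in range(horizon)]

print(ranked[:N_ELITE])
print([round(m, 3) for m in new_mean], [round(s, 3) for s in new_std])
```

The next iteration would draw fresh sequences from this refit distribution and repeat. In practice the standard deviation is often floored or smoothed between iterations so the search does not collapse prematurely; that detail is omitted here.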
Library Shortcut

Use mujoco_mpc when you need production-grade predictive sampling or derivative-based planners. Use tdmpc or tdmpc2 when you want a learned latent model plus an online optimizer that already handles the value tail. For vehicle-style domains, acados and CasADi remain strong anchors when you need explicit constraint handling beside the learned model.

Search Diagnostics

A practical planner trace should record the candidate-score distribution, the elite-set variance, and the first-action variance across replans. If CEM returns radically different first actions on nearly identical states, the problem may be search instability rather than model error. If MPPI keeps producing smooth but poor commands, the temperature, exploration noise, or terminal-value scaling may be wrong.
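One way to log these quantities is sketched below. The record fields and the two replan logs are hypothetical, invented only to show how first-action variance separates a stable search from a jumpy one.

```python
from statistics import mean, pvariance

def planner_trace(costs, elite_first_actions, chosen_first_action):
    """Summarize one replan: score distribution, elite spread, chosen action."""
    return {
        "cost_min": min(costs),
        "cost_mean": mean(costs),
        "cost_spread": max(costs) - min(costs),
        "elite_first_action_var": pvariance(elite_first_actions),
        "first_action": chosen_first_action,
    }

def first_action_variance(traces):
    """Variance of the executed first action across consecutive replans."""
    return pvariance([t["first_action"] for t in traces])

# Made-up logs from replans on nearly identical states (illustrative only).
stable = [planner_trace([2.6, 2.7, 2.8], [0.17, 0.18], a) for a in (0.18, 0.17, 0.18)]
jumpy = [planner_trace([2.6, 2.7, 2.8], [0.17, 0.18], a) for a in (0.18, -0.15, 0.20)]

print("stable:", round(first_action_variance(stable), 5))
print("jumpy: ", round(first_action_variance(jumpy), 5))
```

A large first-action variance on near-identical states, paired with a similar cost distribution, points at the search rather than the model.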

These diagnostics make planner choice much less mystical. They show whether a failure lives in the optimization, the representation, or the cost design itself, which is the mechanism-level question to answer before swapping planners.

Pseudo-Algorithm

Observe the current state, sample action sequences, roll them out through the learned model, score each predicted trajectory by task cost plus terminal value (and any risk penalty the design adds), execute the first action from the best sequence, then repeat from the next real observation.
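The loop above can be sketched end to end. The version below uses plain random shooting (best of N samples) rather than CEM or MPPI, and it deliberately gives the planner a wrong model: the assumed model gain is 0.2 while the hypothetical true plant gain is 0.15. All names and constants are illustrative assumptions.

```python
import random

def learned_step(x, u):
    """Model used by the planner (assumed gain 0.2)."""
    return x + 0.2 * u

def true_step(x, u):
    """Hypothetical real plant: weaker gain than the model believes."""
    return x + 0.15 * u

def plan_cost(x, seq):
    """Predicted cost of a sequence under the learned model."""
    cost = 0.0
    for u in seq:
        x = learned_step(x, u)
        cost += (1.0 - x) ** 2 + 0.01 * u ** 2
    return cost

def plan(x, rng, n_samples=64, horizon=3, u_max=1.0):
    """Random shooting: sample sequences and keep the lowest-cost one."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_samples):
        seq = [rng.uniform(-u_max, u_max) for _ in range(horizon)]
        cost = plan_cost(x, seq)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

rng = random.Random(0)  # fixed seed so runs are repeatable
x = 0.0                 # observed state
for t in range(40):
    seq = plan(x, rng)          # sample, roll out, score
    x = true_step(x, seq[0])    # execute only the first action on the real plant
    # The rest of seq is discarded; the next plan starts from the observed x.

print("final state:", round(x, 3))
```

Each replan starts from the state the plant actually reached, so the gain error never compounds beyond one step. That is the practical sense in which imperfect rollouts can still drive the state toward the goal.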

Warning

Planner timing is part of the method. A beautiful optimizer that misses the control period is worse than a simpler optimizer that returns stable actions on time.
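One simple way to respect the clock is an anytime search that always holds a valid answer and stops at the deadline. The sketch below uses random shooting on the toy integrator. The budget values, the zero-action fallback, and the function names are assumptions for illustration.

```python
import random
import time

def plan_cost(x, seq):
    """Predicted cost of a sequence under a toy integrator model."""
    cost = 0.0
    for u in seq:
        x += 0.2 * u
        cost += (1.0 - x) ** 2 + 0.01 * u ** 2
    return cost

def plan_with_deadline(x, budget_s, rng, horizon=3, u_max=1.0):
    """Anytime random shooting: return the best plan found before the deadline."""
    deadline = time.perf_counter() + budget_s
    best_seq = [0.0] * horizon           # fallback if nothing else is evaluated
    best_cost = plan_cost(x, best_seq)
    n_evaluated = 0
    while time.perf_counter() < deadline:
        seq = [rng.uniform(-u_max, u_max) for _ in range(horizon)]
        cost = plan_cost(x, seq)
        n_evaluated += 1
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost, n_evaluated

rng = random.Random(0)
for budget_s in (0.001, 0.01):
    seq, cost, n = plan_with_deadline(0.0, budget_s, rng)
    print(f"budget={budget_s * 1e3:.0f} ms samples={n} "
          f"best_cost={cost:.4f} first_action={seq[0]:.3f}")
```

The sample count, and usually the plan quality, scales with the budget, but the function returns on time either way. Logging `n_evaluated` per replan makes it visible when the planner is being starved by the control period.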

Practical Example

For a mobile manipulator pushing open a heavy door, CEM may be good enough if the door dynamics are smooth and rollouts are cheap. For a quadruped balancing on uncertain footholds, MPPI or predictive sampling can behave better because many noisy candidate controls are evaluated around a nominal command.

Cross-References

This section follows directly from Section 36.5 and sets up latent-MPC systems such as TD-MPC and TD-MPC2 discussed again in Chapter 38.

Research Frontier

Real systems increasingly mix planner families: sampling for global exploration, gradients for local refinement, and learned value functions for long tails beyond the explicit horizon. The engineering frontier is hybrid planning under fixed clock budgets.

Self Check

Why does executing only the first action make MPC more tolerant of model error than committing to the whole sequence?

Memory Hook

MPC is not prophecy. It is repeated short-horizon course correction with a model in the loop.

Key Takeaway

Planning with learned models succeeds when the rollout model, optimizer, and control period are matched tightly enough that better sequence ranking becomes better real behavior.

Exercise

Choose CEM, MPPI, or a differentiable planner for one robot task and defend the choice. What state is rolled out, what cost is scored, and what timing budget must the optimizer meet?

Bibliography & Further Reading

Primary References And Tools

DeepMind. "MuJoCo MPC." (accessed 2026). https://github.com/google-deepmind/mujoco_mpc

A practical toolkit with predictive sampling and derivative-based planning methods.

Howell, T. et al. "Predictive Sampling: Real-time Behaviour Synthesis with MuJoCo." (2022). https://arxiv.org/abs/2212.00541

A recent reference for shooting-based predictive control in real time.

Hansen, N., Wang, X., and Su, H. "Temporal Difference Learning for Model Predictive Control." (2022). https://arxiv.org/abs/2203.04955

The cleanest modern bridge between learned models and online MPC.