Section 37.1: Model-free vs. model-based trade-offs

Model-free methods buy less modeling pain. Model-based methods buy more structure. Neither purchase is free.

A Budget-Conscious MPC Loop
Figure 37.1A: A trade-off chart balancing data budget, planner compute, model bias, and asymptotic performance between model-free and model-based learning. The real question is not which family is better in the abstract, but which family best fits your data budget, compute budget, and deployment constraints.

Big Picture

Model-free and model-based RL occupy different parts of the engineering trade-off surface. Model-free methods sidestep model bias by never learning explicit dynamics, and they usually pay for that with more real interaction. Model-based methods can be much more sample efficient, provided the learned model is trustworthy enough for planning.

Key Insight

The real comparison is not policy family versus policy family. It is whether planning gain outweighs model bias and latency on the task you actually care about.

What Changes Across The Trade-Off

Model-free RL estimates policy or value objects directly from experience. Model-based RL learns a transition model $\hat p(s_{t+1}\mid s_t, a_t)$ and uses it for planning, data generation, or value improvement. The attraction is sample efficiency, because the same collected transition can support many imagined rollouts. The risk is model bias.
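The reuse argument can be made concrete with a minimal sketch. It assumes a deterministic one-dimensional linear system in place of the probabilistic model $\hat p$; the names `true_step`, `fit_linear_model`, and `imagine` are hypothetical and exist only for this illustration. Six real transitions are fitted once and then support as many imagined rollouts as compute allows.

```python
# Minimal sketch: fit a 1-D linear dynamics model from a few real transitions,
# then reuse it for many imagined rollouts. All names and numbers are hypothetical.

def true_step(s, a):
    """Ground-truth dynamics, unknown to the learner (assumed for this demo)."""
    return 0.9 * s + 0.5 * a

# Six real transitions (s, a, s_next) collected from the "real" system.
real_transitions = [(s, a, true_step(s, a))
                    for s in (-1.0, 0.0, 1.0)
                    for a in (-1.0, 1.0)]

def fit_linear_model(transitions):
    """Least-squares fit of s_next ~= w_s * s + w_a * a via the 2x2 normal equations."""
    ss = sum(s * s for s, a, _ in transitions)
    sa = sum(s * a for s, a, _ in transitions)
    aa = sum(a * a for s, a, _ in transitions)
    sy = sum(s * y for s, _, y in transitions)
    ay = sum(a * y for _, a, y in transitions)
    det = ss * aa - sa * sa
    w_s = (sy * aa - ay * sa) / det
    w_a = (ay * ss - sy * sa) / det
    return w_s, w_a

def imagine(model, s0, actions):
    """Roll the learned model forward from s0; no real interaction is used."""
    w_s, w_a = model
    states = [s0]
    for a in actions:
        states.append(w_s * states[-1] + w_a * a)
    return states

model = fit_linear_model(real_transitions)

# Fifteen imagined rollouts of five steps each, all from the same six real transitions.
imagined = [imagine(model, s0, [a] * 5)
            for s0 in (-1.0, -0.5, 0.0, 0.5, 1.0)
            for a in (-1.0, 0.0, 1.0)]

print(len(real_transitions), "real transitions ->", len(imagined), "imagined rollouts")
```

The fit is exact here only because the toy system is linear and noise-free. With a misspecified or noisy model, every imagined step inherits the model's error, which is the bias risk described above.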

A useful back-of-the-envelope comparison is

$$ J_{\text{effective}} \approx J_{\text{planner}} - \text{bias penalty}(\hat p) - \text{latency penalty}. $$

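Here $J_{\text{planner}}$ is read as the return the planner would reach with a perfect model and no latency. The planning gain is the margin of $J_{\text{planner}}$ over the best available model-free baseline. If that gain is smaller than the bias penalty plus the latency penalty, explicit modeling does not pay off.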
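As a rough illustration of this accounting, the sketch below plugs made-up numbers into the formula. The function names, the normalized-return scale, and the `model_free_return` baseline used for the comparison are all assumptions of the example, not measured quantities.

```python
# Back-of-the-envelope accounting for the effective-return formula.
# Every number below is hypothetical and sits on a normalized 0-1 return scale.

def effective_return(planner_return, bias_penalty, latency_penalty):
    """J_effective ~= J_planner - bias penalty - latency penalty."""
    return planner_return - bias_penalty - latency_penalty

def modeling_pays_off(planner_return, bias_penalty, latency_penalty, model_free_return):
    """True when the planner still beats the model-free baseline after both penalties."""
    return effective_return(planner_return, bias_penalty, latency_penalty) > model_free_return

# Case 1: a trustworthy local model and a fast planner.
case_good_model = modeling_pays_off(0.90, 0.06, 0.03, model_free_return=0.78)

# Case 2: the same planner, but a larger bias penalty eats the planning gain.
case_biased_model = modeling_pays_off(0.90, 0.15, 0.03, model_free_return=0.78)

print({"good_model": case_good_model, "biased_model": case_biased_model})
```

In practice the penalties are not observed directly. They have to be estimated, for example by comparing planner returns under the learned model against returns on the real system.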

This accounting should be done task by task, not by slogan. On a drone flying in gusts, planner latency and state-estimation delay may dominate, so a direct reactive policy can outperform a slower but more informed planner. On a dexterous manipulation task with costly resets, the extra structure from a learned model may be worth substantial engineering overhead because each real contact trial is expensive. The correct comparison is therefore a budget sheet over data, compute, reset cost, and safety margin.

When Each Family Tends To Win
| Condition | Model-free tends to win | Model-based tends to win |
| --- | --- | --- |
| Real data is expensive | Rarely | Often, if the model can be trusted locally |
| Planner compute is tiny | Often | Only with very short horizons or cached plans |
| Dynamics are structured and smooth | Sometimes | Often |
| Out-of-support states are frequent | Sometimes safer | Risky unless uncertainty is handled well |

Worked Probe

The probe below compares how many real episodes two hypothetical methods need before reaching a target return. It is deliberately simple, because the design lesson is about budget accounting.

```python
# Compare episode budgets for two toy learning curves.
target_return = 0.80
model_free_curve = [0.12, 0.21, 0.34, 0.46, 0.59, 0.68, 0.77, 0.82]
model_based_curve = [0.18, 0.35, 0.52, 0.67, 0.78, 0.83]

mf_steps = next(i + 1 for i, r in enumerate(model_free_curve) if r >= target_return)
mb_steps = next(i + 1 for i, r in enumerate(model_based_curve) if r >= target_return)

print({"model_free_episodes": mf_steps, "model_based_episodes": mb_steps})
```

Output:

```
{'model_free_episodes': 8, 'model_based_episodes': 6}
```

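Read the output as a count of real episodes only: the model-based curve crosses the target after 6 episodes, the model-free curve after 8. The toy curves cannot show whether a planner reached that return by exploiting states the real system never visits; checking that requires validation episodes on the real system. The decision consequence is therefore a balance between real rollouts, imagined rollouts, and model validation episodes.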

Code Fragment 37.1.1: The toy result illustrates the usual promise of model-based RL: fewer real episodes to reach a target level. It says nothing yet about asymptotic quality, compute, or robustness, which is why the rest of the chapter exists.
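To show how the episode counts interact with the other budget lines, the sketch below converts the 8 and 6 real episodes from the probe into total wall-clock minutes. The per-episode durations, reset times, and compute times are invented for illustration, and the cost model assumes, simplistically, that compute runs serially with interaction.

```python
# Convert real-episode counts into a wall-clock budget.
# Episode counts come from the probe above; every per-episode cost is hypothetical.

def total_minutes(real_episodes, episode_minutes, reset_minutes, compute_minutes_per_episode):
    """Total minutes for interaction, resets, and compute (assumed to run serially)."""
    return real_episodes * (episode_minutes + reset_minutes + compute_minutes_per_episode)

MODEL_FREE_EPISODES = 8
MODEL_BASED_EPISODES = 6

def compare(reset_minutes):
    """Return (model_free_minutes, model_based_minutes) for a given reset cost."""
    mf = total_minutes(MODEL_FREE_EPISODES, 1.0, reset_minutes, compute_minutes_per_episode=0.5)
    mb = total_minutes(MODEL_BASED_EPISODES, 1.0, reset_minutes, compute_minutes_per_episode=3.0)
    return mf, mb

cheap_resets = compare(reset_minutes=2.0)    # model-free is cheaper overall
costly_resets = compare(reset_minutes=10.0)  # model-based is cheaper overall

print({"cheap_resets": cheap_resets, "costly_resets": costly_resets})
```

With these made-up numbers the ranking flips as reset cost grows, even though the episode counts stay fixed. This is why an episode count alone is not a budget.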
Library Shortcut

Use tdmpc2 as a modern model-based baseline, and pair it with a strong model-free baseline such as SAC or PPO from a maintained library. CleanRL, skrl, and rl_games are useful when you want transparent baselines with stable training scripts. The important part is matched evaluation, not which benchmark script is trendiest.

Failure Patterns Readers Should Expect

Model-free systems usually fail by wasting data, overfitting rewards, or requiring enormous domain randomization before transfer. Model-based systems add three new failure channels: planner overrun, model exploitation of blind spots, and confidence mismatch between the predictive model and the control stack. Readers building real systems should learn to ask which failure channel is cheaper to manage in their domain.
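Model exploitation of blind spots can be illustrated with a tiny sketch. It assumes four candidate actions with invented true returns, and a learned model whose error is small where data exists and large on one unvisited action. A planner that maximizes the predicted return is drawn toward the largest positive error.

```python
# Toy illustration of a planner exploiting a model blind spot.
# All returns and errors are hypothetical numbers chosen for the example.

true_return = {"a0": 0.50, "a1": 0.55, "a2": 0.40, "a3": 0.30}

# Model error is small where the data covers the action, large in the blind spot (a3).
model_error = {"a0": 0.01, "a1": -0.02, "a2": 0.00, "a3": 0.35}

predicted_return = {a: true_return[a] + model_error[a] for a in true_return}

# The planner trusts the model and picks the action with the highest prediction.
chosen = max(predicted_return, key=predicted_return.get)

# The action that is actually best on the real system.
best = max(true_return, key=true_return.get)

# Regret: real return lost by trusting the model's optimistic prediction.
regret = round(true_return[best] - true_return[chosen], 2)

print({"chosen": chosen, "best": best, "regret": regret})
```

The planner does not fail at random. It selects the action precisely because the model is wrong about it, which is why uncertainty estimates or conservative rollouts matter for model-based systems.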

A useful experiment card therefore includes reset cost, control frequency, average planner milliseconds, and whether the method uses privileged simulator state during training or evaluation. Without that information, trade-off claims collapse into benchmark theater.
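One possible shape for such a card is sketched below. The `ExperimentCard` class, its field names, and the example values are hypothetical. The only logic is a check of whether average planner time fits inside one control period.

```python
# Hypothetical experiment card covering the fields listed above.
from dataclasses import dataclass

@dataclass
class ExperimentCard:
    method: str
    reset_cost_s: float            # seconds of human or hardware time per reset
    control_hz: float              # control loop frequency
    planner_ms_avg: float          # average planner time per control step
    privileged_state_train: bool   # simulator-only state used during training
    privileged_state_eval: bool    # simulator-only state used during evaluation

    def control_period_ms(self):
        """Time available per control step, in milliseconds."""
        return 1000.0 / self.control_hz

    def planner_fits_control_loop(self):
        """True when the average planner time fits inside one control period."""
        return self.planner_ms_avg <= self.control_period_ms()

# Example values are invented for illustration.
card = ExperimentCard(
    method="latent planner (hypothetical)",
    reset_cost_s=45.0,
    control_hz=50.0,
    planner_ms_avg=28.0,
    privileged_state_train=True,
    privileged_state_eval=False,
)

print(card.control_period_ms(), card.planner_fits_control_loop())
```

Here a 50 Hz loop leaves 20 ms per step, so a 28 ms average planner overruns it. An average is a weak check; a real card would likely also report worst-case or high-percentile planner time.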

Decision Rule

Choose model-based RL when real interaction is expensive, local model learning is plausible, and the control loop can afford online planning or short imagined rollouts. Choose model-free baselines when planning latency is unacceptable or model bias dominates.
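The rule can be written down as a small function, offered as a sketch rather than a definitive procedure. The function name and boolean inputs are hypothetical. Two choices are assumptions of this example, since the rule above does not specify them: the model-free conditions are checked first, and any case the rule leaves open returns "undecided".

```python
# Sketch of the decision rule as code. Inputs are judgment calls made per task.

def choose_family(real_interaction_expensive,
                  local_model_plausible,
                  planning_fits_control_loop,
                  model_bias_dominates=False):
    """Return 'model-based', 'model-free', or 'undecided' for a task description."""
    # Assumption: the model-free conditions act as vetoes and are checked first.
    if not planning_fits_control_loop or model_bias_dominates:
        return "model-free"
    if real_interaction_expensive and local_model_plausible:
        return "model-based"
    # Assumption: the rule in the text does not cover this case.
    return "undecided"

# Costly resets, smooth local dynamics, planner fits the loop.
manipulation = choose_family(True, True, True)

# Planner cannot keep up with the control loop.
fast_drone = choose_family(True, True, False)

# Cheap simulation, no latency problem, no strong reason either way.
cheap_sim = choose_family(False, True, True)

print({"manipulation": manipulation, "fast_drone": fast_drone, "cheap_sim": cheap_sim})
```

The "undecided" branch is where matched evaluation of both families is most informative.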

Warning

Do not call a method sample efficient because it trains faster in simulator wall-clock while silently consuming much more planner compute or using privileged state. Real interaction budget, compute budget, and information budget all need to be disclosed together.

Practical Example

For a dexterous real-hand manipulation task with expensive hardware resets, a trustworthy local model can dramatically reduce real trials. For a large simulated game benchmark with cheap simulation and huge parallel compute, direct policy learning may be simpler and more robust.

Cross-References

This section ties back to policy-gradient and off-policy methods in Chapter 15 and Chapter 16, then feeds into the learned-model details of Section 37.2.

Research Frontier

TD-MPC2 and related latent planners have narrowed the gap between classic model-based sample efficiency and strong final performance. The modern question is less whether planning can help, and more where the compute, data, and bias balance lands for a given robot stack.

Self Check

For a task with scarce real data but ample GPU inference, which trade-off axis makes model-based RL attractive? What extra risk arrives with that choice?

Memory Hook

Model-free spends data to avoid learning the world. Model-based spends modeling effort so data can be reused many times.

Key Takeaway

The trade-off is not ideology. It is an accounting problem over data, compute, model bias, and deployment latency.

Exercise

Choose a robotics task and argue whether you would start from a model-free or model-based baseline. List the data budget, compute budget, and dominant failure risk that drive your choice.

Bibliography & Further Reading

Primary References And Tools

Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." (2018). http://incompleteideas.net/book/the-book-2nd.html

The standard foundation for framing model-free objectives and baselines.

Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253

A practical model-based policy optimization reference focused on rollout trust.

Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828

A key modern model-based baseline for continuous control.