Imagination helps only while the imagined data still resembles a world the policy will actually visit.
A Budget-Conscious MPC Loop
Imagination rollouts reuse real data by letting the learner branch short synthetic trajectories from trusted states. The gain is sample efficiency. The danger is that long synthetic rollouts can poison value learning with model fantasy.
Synthetic data is useful only while it stays tethered to states the model understands. Horizon control is what keeps imagination from turning into dataset corruption.
Short Rollouts, Big Consequences
In MBPO-style learning, real states from the replay buffer seed short model-generated rollouts. Those imagined transitions augment policy learning while limiting compounding error. The core trade-off is simple: more imagined data can accelerate learning, but only if rollout length stays inside the model's trusted region.
One useful mental model is
$$ \mathcal{D}_{\text{train}} = \mathcal{D}_{\text{real}} \cup \mathcal{D}_{\text{model}}^{(h)}, $$
where the imagination horizon $h$ is deliberately small. This keeps model-generated states near the support of real experience.
That support argument is the mechanism, not a stylistic preference. The learner is allowed to recycle real states into nearby imagined futures because the model has seen enough neighboring transitions to stay locally coherent. Once synthetic states start seeding further synthetic states, the training set drifts toward parts of state space that were never grounded by real interaction, and value estimates can become systematically optimistic.
Branching from real buffer states is a bias-control trick. It keeps the synthetic rollout close to regions where the model has at least some evidence.
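The branching idea can be sketched in a few lines. This is a minimal illustration, not MBPO's actual implementation: `toy_model`, `toy_policy`, and `branch_rollouts` are hypothetical names, and the scalar dynamics stand in for a trained model. The point is only that every imagined rollout starts from a real state and stops after `horizon` steps.

```python
import random


def toy_model(state, action):
    """Hypothetical learned dynamics; a stand-in for a trained model."""
    next_state = 0.9 * state + action
    reward = -abs(next_state)
    return next_state, reward


def toy_policy(state):
    """Hypothetical policy that pushes the state toward zero."""
    return -0.5 * state


def branch_rollouts(real_states, horizon, model=toy_model, policy=toy_policy):
    """Branch one short imagined rollout from each real seed state."""
    imagined = []
    for seed in real_states:
        state = seed  # every rollout restarts from a real state
        for step in range(horizon):
            action = policy(state)
            next_state, reward = model(state, action)
            imagined.append({
                "seed": seed,  # provenance: which real state this came from
                "step": step,
                "state": state,
                "action": action,
                "reward": reward,
                "next_state": next_state,
            })
            state = next_state
    return imagined


rng = random.Random(0)
real_states = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in replay sample
d_model = branch_rollouts(real_states, horizon=3)
print(len(d_model))  # 4 seeds * 3 steps = 12 imagined transitions
```

Storing the seed with each imagined transition keeps provenance cheap to audit later: any synthetic sample can be traced back to the real state it branched from.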
Worked Probe
The probe below logs how many synthetic transitions are produced from a replay batch under different imagination horizons. It shows why horizon choice changes dataset composition so quickly.
# Count imagined transitions produced from one replay batch,
# assuming one rollout per seed state and no early termination.
replay_batch = 128
horizons = [1, 3, 5]
imagined = {h: replay_batch * h for h in horizons}
ratio_to_real = {h: round(imagined[h] / replay_batch, 1) for h in horizons}
print({"imagined_transitions": imagined, "ratio_to_real": ratio_to_real})
{'imagined_transitions': {1: 128, 3: 384, 5: 640}, 'ratio_to_real': {1: 1.0, 3: 3.0, 5: 5.0}}
Read the imagined-transition counts and ratios as a dataset-composition signal: at horizon 1 the synthetic set is the same size as the real batch, but at horizon 5 it is five times larger. That ratio tells you how much weight model-generated data already carries in training before any explicit mixing ratio is set. Horizon is therefore not a cosmetic hyperparameter but a direct control on how much model bias enters the learner.
When you implement imagination rollouts, log the real-to-model transition ratio, the rollout branching source, and the maximum horizon. These three numbers explain a large fraction of success or failure in practice. mbrl-lib and Dreamer-style codebases are useful references because they make the replay-to-imagination contract visible rather than hiding it inside one giant trainer.
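One way to make those three numbers hard to forget is to bundle them into a single record per training iteration. The sketch below is an assumption about how such a record could look; `ImaginationLog` and its field names are hypothetical and do not come from mbrl-lib or any Dreamer codebase. The ratio is reported as model over real to match the probe above.

```python
from dataclasses import dataclass, asdict


@dataclass
class ImaginationLog:
    """Hypothetical per-iteration record of imagination settings."""
    real_transitions: int
    model_transitions: int
    branch_source: str  # e.g. "replay_buffer" or "model_state"
    max_horizon: int

    @property
    def model_to_real_ratio(self):
        return self.model_transitions / self.real_transitions

    @property
    def synthetic_fraction(self):
        # Share of all training transitions that came from the model.
        total = self.real_transitions + self.model_transitions
        return self.model_transitions / total


log = ImaginationLog(
    real_transitions=128,
    model_transitions=640,
    branch_source="replay_buffer",
    max_horizon=5,
)
summary = {
    **asdict(log),
    "model_to_real_ratio": log.model_to_real_ratio,
    "synthetic_fraction": round(log.synthetic_fraction, 3),
}
print(summary)
```

With the horizon-5 numbers from the probe, the model already supplies roughly five sixths of the transitions, which is easier to notice as a fraction than as a raw count.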
Common Failure Modes
A short-horizon imagination pipeline can still go wrong in three ways. The model may be locally biased around precisely the states that matter for reward improvement. The policy may overfit to synthetic states that look easy under the model but are rarely visited in reality. Or the training loop may silently let synthetic transitions dominate the replay mixture. All three failures create the same surface symptom: a policy that looks data-efficient but degrades sharply under real rollouts.
The fix is not to abandon imagination, but to instrument it. Save the source state for each imagined rollout, the horizon used, the ratio of synthetic to real updates, and at least one replayed failure trajectory where the imagined branch misled the learner. That turns a vague trust problem into a concrete evidence trail.
Seed model rollouts from real states, keep the horizon short, monitor held-out model error, and reduce or stop imagination when calibration deteriorates or synthetic data overwhelms the real buffer.
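That rule can be written as a small controller. The sketch below is one possible reading of it, under stated assumptions: `next_horizon` is a hypothetical helper, and the thresholds (`error_threshold`, `max_synthetic_fraction`, and the factor of 2 used to pause imagination) are illustrative values, not recommendations from the source.

```python
def next_horizon(current_horizon, heldout_error, synthetic_fraction,
                 error_threshold=0.1, max_synthetic_fraction=0.9):
    """Return the horizon for the next iteration; 0 pauses imagination.

    All thresholds are illustrative assumptions, to be tuned per task.
    """
    # Held-out model error far above tolerance: stop imagining entirely.
    if heldout_error > 2 * error_threshold:
        return 0
    # Mild deterioration, or synthetic data swamping the real buffer:
    # shorten the rollout, but keep at least one imagined step.
    if heldout_error > error_threshold or synthetic_fraction > max_synthetic_fraction:
        return max(1, current_horizon - 1)
    # Otherwise keep the current (short) horizon.
    return current_horizon


print(next_horizon(3, heldout_error=0.05, synthetic_fraction=0.5))
print(next_horizon(3, heldout_error=0.15, synthetic_fraction=0.5))
print(next_horizon(3, heldout_error=0.25, synthetic_fraction=0.5))
```

The controller never lengthens the horizon on its own. Growing it is a separate decision that should be backed by a fresh held-out audit.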
Synthetic transitions can quietly dominate training and pull the learner toward impossible states. If your synthetic-to-real ratio climbs without a corresponding held-out model audit, you may be optimizing on fantasy data.
For a tabletop pushing task, two or three imagined steps branched from real states may be enough to accelerate value learning. For long-horizon autonomous driving, naive long synthetic rollouts can easily invent lane states or contact events the real car would never produce.
This section connects directly to the rollout-horizon caution in Section 36.3 and to MBPO in the bibliography below.
Modern imagination-based agents increasingly mix short synthetic rollouts with strong value models or latent planners. The open research problem is adaptive trust: deciding rollout length from confidence rather than from a fixed schedule.
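To make "rollout length from confidence" concrete, one simple heuristic is to stop a rollout as soon as an ensemble of models disagrees too much about the next state. The sketch below illustrates that heuristic only; it is not a method taken from the papers cited here. `ensemble_predict`, `rollout_until_uncertain`, the three scalar gains, and the disagreement limit are all hypothetical.

```python
def ensemble_predict(state, action):
    """Hypothetical 3-member ensemble; members diverge as |state| grows."""
    gains = (0.90, 0.95, 1.00)
    return [gain * state + action for gain in gains]


def rollout_until_uncertain(seed, policy, max_horizon, disagreement_limit):
    """Roll out from a real seed state, stopping early on high disagreement."""
    state, steps = seed, []
    for _ in range(max_horizon):
        action = policy(state)
        predictions = ensemble_predict(state, action)
        disagreement = max(predictions) - min(predictions)
        if disagreement > disagreement_limit:
            break  # the ensemble no longer agrees; stop imagining here
        next_state = sum(predictions) / len(predictions)
        steps.append((state, action, next_state))
        state = next_state
    return steps


def policy(state):
    """Hypothetical policy that pushes the state toward zero."""
    return -0.5 * state


trusted = rollout_until_uncertain(0.5, policy, max_horizon=4, disagreement_limit=0.08)
untrusted = rollout_until_uncertain(2.0, policy, max_horizon=4, disagreement_limit=0.08)
print(len(trusted), len(untrusted))
```

In this toy setup the seed near the origin gets the full horizon, while the seed far from it yields no imagined steps at all, so rollout length varies per state instead of following a fixed schedule.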
Why is branching from replay-buffer states safer than initializing long synthetic rollouts from synthetic states created by earlier imagination?
Imagination helps when it stays tethered to reality. Cut the tether, and the learner starts studying its own fiction.
Imagination rollouts are valuable because they multiply data use, but only when the rollout horizon is kept inside the model's trusted neighborhood.
Design an MBPO-style training loop for a robot task. What states seed imagination, what horizon would you start with, and what metric would trigger shortening the rollout?
Bibliography & Further Reading
Primary References And Tools
Janner, M. et al. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253
The essential reference for short trusted imagination rollouts.
Hafner, D. et al. "Mastering Diverse Domains through World Models." (2023). https://arxiv.org/abs/2301.04104
DreamerV3 is a broad latent imagination baseline worth contrasting with explicit MBPO-style branching.
Hansen, N. et al. "TD-MPC2: Scalable, Robust World Models for Continuous Control." (2023). https://arxiv.org/abs/2310.16828
Useful for comparing latent short-horizon planning with synthetic-data augmentation approaches.