Section 36.3: Error accumulation and horizon | Building Embodied AI: From Perception to Autonomous Action

Long rollouts are persuasive until their first small bias compounds into a fake future.
A Horizon-Aware Predictor

Predicted and true trajectories diverging over time, showing a small initial bias that compounds into large planning error across the horizon. — **Figure 36.3A**: A tiny one-step bias can become a disastrous long-horizon forecast when the planner repeatedly feeds predictions back into itself.

Big Picture

Compounding error is the main reason predictive control trusts short horizons first. Every rollout step consumes the model's own previous output, so small one-step mistakes can grow geometrically or drift into regions with no data support.

Key Insight

Rollout horizon is a trust budget. Each extra imagined step spends more of that budget, and eventually the model starts paying for action with fiction instead of evidence.

Why Rollout Error Grows

If the true dynamics are $f$ and the learned model is $\hat f$, then a one-step state error of size $\varepsilon$ can grow after $H$ rollout steps roughly like

$$ \lVert s_{t+H} - \hat s_{t+H} \rVert \lesssim \sum_{k=0}^{H-1} L^k \varepsilon, $$

where $L$ is an effective sensitivity constant of the dynamics and controller. If $L > 1$, the planner cannot assume that a small local fit implies a good long imagined future.

In practice, three mechanisms usually dominate this blow-up. First, state-estimation noise enters the first prediction and is then recycled as if it were clean state. Second, actuator delay or rate limits make the model optimistic about how quickly the robot can correct itself. Third, contact mode switches, wheel slip, or near-singular arm poses create local dynamics with much larger sensitivity than the average transition error suggests. That is why a model that looks excellent on one-step loss can still rank action sequences badly after only a few imagined steps.

Builder Consequence

Longer horizon is valuable only until model bias dominates. After that point, more imagination produces less trustworthy control.

Worked Probe

The next probe logs how a small underestimation of acceleration error drifts over five rollout steps. The pattern is simple enough to inspect by eye, which is exactly what a first debugging panel should allow.

# Roll out a biased model and compare cumulative error by horizon.
true_pos = 0.0
true_vel = 1.0
model_pos = 0.0
model_vel = 0.98
dt = 0.1

horizon_error = []
for step in range(1, 6):
    true_pos += true_vel * dt
    model_pos += model_vel * dt
    horizon_error.append(round(true_pos - model_pos, 4))
    true_vel *= 0.99
    model_vel *= 0.97

print({"horizon_error": horizon_error, "final_error": horizon_error[-1]})

{'horizon_error': [0.002, 0.0049, 0.0086, 0.013, 0.0181], 'final_error': 0.0181}

Read the horizon-error list as a compounding signal: each entry is larger than the last because the model's velocity underestimate accumulates step by step. The final error is roughly nine times the one-step error, which is exactly the drift pattern that makes long-horizon planning unreliable even when the one-step fit looks acceptable.

Code Fragment 36.3.1: The exact numbers matter less than the monotone drift. What the planner should notice is that trusting horizon five is harder than trusting horizon one, even in a nearly trivial system.

Library Shortcut

In production, save horizon-conditioned metrics explicitly. A single scalar MSE hides whether the model is excellent at one to three steps and unusable after ten, which is often the real control-relevant story. In PyTorch or JAX training loops, log per-horizon error tables and sequence-ranking accuracy alongside loss. In mujoco_mpc, Isaac Lab, or Drake rollouts, log whether the same horizon cap preserves closed-loop success under the same seed panel.

Failure Analysis That Actually Helps

When a long-horizon planner fails, the first useful artifact is not another scalar plot. It is an overlay of real and imagined trajectories aligned by time, with the first meaningful divergence marked explicitly. On a manipulator, that marker is often the first contact frame where normal force or slip onset differs. On a drone, it may be the first step where unmodeled drag or attitude lag changes the commanded recovery. A good debugging ledger therefore stores the horizon at which divergence becomes decision-relevant, not just numerically visible.

Tool support matters here. wandb or TensorBoard can store horizon-tagged traces, but the deeper point is methodological: keep the rollout panel fixed while comparing horizon caps, model versions, and controller variants. If the panel changes, the claim about horizon trust stops being defensible.

Practical Recipe

Fit one-step transitions first, evaluate multi-step rollouts on held-out episodes, then cap planner horizon where task performance still improves. If longer horizons help only on paper, shorten the rollout and rely more on feedback or terminal values.

Warning

Do not compare a short-rollout method and a long-rollout method using metrics produced by different seed panels or different reset logic. Horizon claims are extremely sensitive to data support and termination conditions.

Practical Example

A highway-driving planner may want a multi-second horizon in free space, but a manipulator inserting a connector often trusts only a few contact-relevant steps before replanning. The correct horizon is the one that keeps model bias below the action-threshold the robot cares about.

Cross-References

This section reinforces the rollout caution that appears in Section 37.4 on imagination rollouts and complements the stability discussion in Chapter 7.

Research Frontier

MBPO is a key modern lesson here: short branched model rollouts often beat naive long hallucinations. Current work on trust-aware model usage and value-guided rollouts continues pushing the idea that rollout horizon should be adaptive, not fixed by taste.

Self Check

Suppose your model is excellent for two steps and unreliable after eight. What planning strategy would still exploit it, and what evidence would you save to defend that choice?

Memory Hook

Rollout horizon is like credit on a shaky map: spend only as much as the map deserves.

Key Takeaway

The longest imagined future is rarely the best one. Strong model-based systems plan only across horizons the model has earned.

Exercise

Design a held-out evaluation that reports one-step, three-step, and ten-step rollout error for a robot task of your choice. Which horizon would you trust for planning, and why?

Bibliography & Further Reading

Primary References And Tools

Reference Janner, M. et al.. "When to Trust Your Model: Model-Based Policy Optimization." (2019). https://arxiv.org/abs/1906.08253

The practical reference for short trusted rollouts instead of blindly long model usage.

Reference Chua, K. et al.. "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models." (2018). https://arxiv.org/abs/1805.12114

PETS is a canonical baseline for short-horizon planning under learned dynamics.

Reference Hansen, N., Wang, X., and Su, H.. "Temporal Difference Learning for Model Predictive Control." (2022). https://arxiv.org/abs/2203.04955

TD-MPC is a strong illustration of combining short-horizon planning with a learned terminal value.