The best predictive target is the one the controller would pay to know one step earlier.
A Horizon-Aware Predictor
State-space prediction is often better aligned with control, because it evolves the quantities that costs and constraints actually read. Observation prediction is useful when the observation itself contains decision-critical structure that is hard to summarize by hand, such as occluders, deformables, or visual contact cues.
Prediction targets are engineering choices, not aesthetics. The right target is the one that gives the controller earlier access to the variable that changes its decision.
Prediction Targets And Control Interfaces
A forward model can predict physical state, latent state, observation, reward, contact flags, or some mixture of them. A common latent-state factorization is
$$ z_{t+1} = f_\theta(z_t, a_t), \qquad \hat o_{t+1} = g_\phi(z_{t+1}). $$
If the planner reasons directly in latent space, the decoder is optional at decision time. If the planner needs image-space occupancy, object masks, or human-interpretable diagnostics, the decoder becomes operationally important rather than decorative.
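The factorization above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy linear `f_theta`, the affine `g_phi`, the `rollout` helper, and every numeric constant are hypothetical stand-ins for learned networks. The point is only that the rollout loop never needs the decoder unless a diagnostic is requested.

```python
# Hypothetical stand-ins for learned networks; all constants are illustrative.
def f_theta(z, a):
    """Latent transition z_{t+1} = f_theta(z_t, a_t); here a toy linear map."""
    return [0.9 * z[0] + 0.5 * a, 0.8 * z[1] + 0.1 * a]


def g_phi(z):
    """Decoder o_hat = g_phi(z); here a toy affine readout to one pixel coordinate."""
    return 320.0 * z[0] + 2.0


def rollout(z0, actions, decode=False):
    """Roll the latent forward; call the decoder only when diagnostics are wanted."""
    z, latents, frames = list(z0), [], []
    for a in actions:
        z = f_theta(z, a)
        latents.append(z)
        if decode:
            frames.append(g_phi(z))
    return latents, frames


# Planning-time call: latents only, no decoding cost.
latents, frames = rollout([0.4, 0.0], [0.1, 0.1, 0.0])
# Debugging call: same rollout, plus decoded observations for inspection.
_, debug_frames = rollout([0.4, 0.0], [0.1, 0.1, 0.0], decode=True)
print(len(latents), len(frames), len(debug_frames))
```

Keeping the decoder behind a flag makes its cost explicit: a planner that scores latents directly pays for `f_theta` only, while a planner that needs occupancy or masks pays for `g_phi` at every step.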
| Target | Strength | Risk |
|---|---|---|
| Physical state or delta state | Cheap rollout, clean constraints, easy cost design | Misses hidden scene factors if the state is underspecified |
| Latent state | Compresses perception and control into one interface | Harder to debug when the latent drops task-relevant detail |
| Pixel or depth observation | Keeps scene detail for occlusion and contact reasoning | High compute cost, easy to optimize the wrong visual details |
Worked Probe
The code below contrasts a state-space predictor with an observation-space predictor on a toy pushing task. The state model predicts object position directly; the observation model predicts a rendered pixel coordinate and then recovers position from it.
```python
# Compare a direct state predictor with an observation-space predictor.
# The state model predicts object position; the observation model predicts
# a pixel coordinate that must be converted back into world space.
from math import fabs

x_t = 0.40      # current object position (world units)
action = 0.12   # commanded push displacement
true_next = x_t + action

# State route: predict position directly, under-predicting the push by 5%.
state_pred = x_t + 0.95 * action

# Observation route: predict a pixel coordinate (10% under-prediction plus a
# 2-pixel bias), then decode back into world units.
pixel_scale = 320.0  # pixels per world unit
predicted_pixel = pixel_scale * (x_t + 0.90 * action) + 2.0
obs_pred = predicted_pixel / pixel_scale

print(
    {
        "true_next": round(true_next, 3),
        "state_pred": round(state_pred, 3),
        "obs_pred": round(obs_pred, 3),
        "state_abs_error": round(fabs(true_next - state_pred), 5),
        "obs_abs_error": round(fabs(true_next - obs_pred), 5),
    }
)
```

```
{'true_next': 0.52, 'state_pred': 0.514, 'obs_pred': 0.514, 'state_abs_error': 0.006, 'obs_abs_error': 0.00575}
```

Read the two absolute errors as a comparison of prediction targets. The observation route comes out marginally ahead (0.00575 versus 0.006), but only because its 2-pixel bias happens to cancel part of its larger gain error; neither target is intrinsically more precise. The state predictor reaches nearly the same accuracy with no decode step, whereas the observation route must convert back through the pixel scale on every control step. The useful question is whether a difference of this size actually changes a control decision, not merely which number is smaller.
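One way to make that question concrete is to pass both predictions through the same decision rule and check whether the chosen action differs. The sketch below assumes a hypothetical controller that stops pushing once the predicted position is within a tolerance of the goal; `decide`, `goal`, and `tolerance` are illustrative names and values, not part of the probe. The predictions are recomputed from the probe's constants so the block runs on its own.

```python
def decide(predicted_position, goal=0.52, tolerance=0.01):
    """Hypothetical rule: stop once the prediction is within tolerance of the goal."""
    return "stop" if abs(goal - predicted_position) <= tolerance else "push"


# Predictions recomputed from the probe's constants.
x_t, action, pixel_scale = 0.40, 0.12, 320.0
state_pred = x_t + 0.95 * action
obs_pred = (pixel_scale * (x_t + 0.90 * action) + 2.0) / pixel_scale

# Try a loose and a tight tolerance (both values are assumptions).
for tol in (0.01, 0.0059):
    state_choice = decide(state_pred, tolerance=tol)
    obs_choice = decide(obs_pred, tolerance=tol)
    verdict = "same decision" if state_choice == obs_choice else "targets disagree"
    print(tol, state_choice, obs_choice, verdict)
```

With the loose tolerance both targets lead to the same action, so the extra decode buys nothing. Only when the tolerance is tightened to sit between the two errors do the targets disagree, which is the regime where the choice of prediction target starts to matter for control.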
Use PyTorch or JAX for the predictors, Gymnasium for the transition contract, and MuJoCo when the state variable should include contact, velocity, or actuator dynamics rather than only kinematic position.
Predict the smallest variable that preserves the control objective. Add a decoder only when humans, downstream modules, or the planner itself genuinely need observation-space detail.
A decoder with beautiful frames can hide a useless latent. If the planner acts on latent state, audit value error, cost error, and constraint violation, not only image quality.
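A control-facing audit of this kind can be sketched as follows. Everything here is an assumption made for illustration: the quadratic `cost`, the workspace `limit`, and both trajectories are invented numbers, and value error is omitted because it would require a learned value function. The sketch only shows the shape of the check, which scores predictions by what the planner reads rather than by image quality.

```python
def cost(x, goal=0.52):
    """Hypothetical quadratic tracking cost on object position."""
    return (x - goal) ** 2


def violates(x, limit=0.55):
    """Hypothetical workspace constraint: position must not exceed the limit."""
    return x > limit


# Illustrative data: ground-truth positions vs. positions read out of a latent rollout.
true_traj = [0.44, 0.48, 0.52, 0.56]
pred_traj = [0.44, 0.47, 0.50, 0.53]
pairs = list(zip(pred_traj, true_traj))

# Mean absolute error in the cost the planner would actually optimize.
cost_err = sum(abs(cost(p) - cost(t)) for p, t in pairs) / len(pairs)
# Violations the model failed to predict, and violations it predicted wrongly.
missed = sum(violates(t) and not violates(p) for p, t in pairs)
false_alarms = sum(violates(p) and not violates(t) for p, t in pairs)

print({"mean_cost_error": cost_err, "missed_violations": missed, "false_alarms": false_alarms})
```

In this made-up rollout the cost error is small, yet the model misses the one real constraint violation at the final step. That is the failure an image-quality metric would not surface.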
A drone dodging cables in clutter may need pixel-space or depth-space prediction because the obstacle geometry matters directly. A torque-limited arm tracking a known part usually benefits more from joint-state and contact prediction than from reconstructing the entire camera view.
This section connects prediction targets to the representation choices in Chapter 28, the camera-frame geometry in Chapter 4, and the latent world-model machinery in Chapter 38.
Recent world-model work increasingly drops reconstruction unless it buys something operational. Task-oriented latent models such as TD-MPC and MuDreamer ask whether the state retains enough information for value estimation and local planning, even if it cannot redraw the scene photorealistically.
Name one task where observation prediction is necessary and one where it is wasteful. What information does the controller need in each case, and how would you prove that your chosen target supplies it?
State prediction tells the robot where the world is going. Observation prediction tells it what the sensors will look like when the world gets there.
Prediction targets should be chosen by control relevance, not by visual appeal. The cleanest target is the one that best supports the next decision under the real system budget.
Pick one robot task and specify a state-space predictor and an observation-space predictor for it. Write the exact metric that would tell you which target is more useful for action.
Bibliography & Further Reading
Primary References And Tools
Hafner, D. et al. "Learning Latent Dynamics for Planning from Pixels." (2019). https://arxiv.org/abs/1811.04551
PlaNet is the classic argument for planning in latent state rather than pixel space.
Hafner, D. et al. "Mastering Diverse Domains through World Models." (2023). https://arxiv.org/abs/2301.04104
DreamerV3 is a strong modern example of latent predictive learning tied to behavior.
Burchi, M. and Timofte, R. "MuDreamer: Learning Predictive World Models without Reconstruction." (2024). https://arxiv.org/html/2405.15083v1
A useful recent example of reducing or removing full reconstruction when task relevance matters more.