Section 36.2: Forward/dynamics models; state vs. observation prediction | Building Embodied AI: From Perception to Autonomous Action

The best predictive target is the one the controller would pay to know one step earlier.
A Horizon-Aware Predictor

A robot perception stack splitting into a latent state branch and a pixel reconstruction branch, highlighting that different prediction targets serve different control purposes. — **Figure 36.2A**: A state predictor answers, "what variable does the controller need next?" An observation predictor answers, "what might the sensors see next?" The two are related, but not interchangeable.

Big Picture

State-space prediction is often better aligned with control, because it evolves the quantities that costs and constraints actually read. Observation prediction is useful when the observation itself contains decision-critical structure that is hard to summarize by hand, such as occluders, deformables, or visual contact cues.

Key Insight

Prediction targets are engineering choices, not aesthetics. The right target is the one that gives the controller earlier access to the variable that changes its decision.

Prediction Targets And Control Interfaces

A forward model can predict physical state, latent state, observation, reward, contact flags, or some mixture of them. A common latent-state factorization is

$$ z_{t+1} = f_\theta(z_t, a_t), \qquad \hat o_{t+1} = g_\phi(z_{t+1}). $$

If the planner reasons directly in latent space, the decoder is optional at decision time. If the planner needs image-space occupancy, object masks, or human-interpretable diagnostics, the decoder becomes operationally important rather than decorative.
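The factorization above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a learned model: the hand-written `f` and `g`, the two-element latent `(position, velocity)`, the pixel scale, and the quadratic `plan_cost` are all hypothetical stand-ins for $f_\theta$, $g_\phi$, and a planner's cost.

```python
# Minimal sketch of z_{t+1} = f(z_t, a_t) and o_hat_{t+1} = g(z_{t+1}).
# Every function and number here is an illustrative assumption.

def f(z, a):
    """Hypothetical latent dynamics: position integrates velocity, action changes velocity."""
    pos, vel = z
    return (pos + vel, vel + a)

def g(z, pixel_scale=320.0):
    """Hypothetical decoder: map latent position to a pixel column."""
    pos, _ = z
    return pixel_scale * pos

def rollout(z0, actions):
    """Roll the latent forward; the decoder is never called here."""
    trajectory = [z0]
    for a in actions:
        trajectory.append(f(trajectory[-1], a))
    return trajectory

def plan_cost(trajectory, goal_pos):
    """The cost reads the latent directly, so planning needs no decode."""
    return sum((z[0] - goal_pos) ** 2 for z in trajectory[1:])

trajectory = rollout((0.0, 0.0), [0.25, 0.0, -0.25])
cost = plan_cost(trajectory, goal_pos=0.5)

# Decode only when something downstream needs observation space,
# for example a diagnostic overlay.
final_pixel = g(trajectory[-1])

print(trajectory[-1], cost, final_pixel)
```

The planner's inner loop touches only `f` and `plan_cost`. The call to `g` sits outside that loop, which is the sense in which the decoder is optional at decision time.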

State Versus Observation Prediction

Target	Strength	Risk
Physical state or delta state	Cheap rollout, clean constraints, easy cost design	Misses hidden scene factors if the state is underspecified
Latent state	Compresses perception and control into one interface	Harder to debug when the latent drops task-relevant detail
Pixel or depth observation	Keeps scene detail for occlusion and contact reasoning	High compute cost, easy to optimize the wrong visual details

Worked Probe

The code below contrasts a direct state predictor with an observation predictor on a toy pushing task. The state model predicts object position directly; the observation model predicts a rendered pixel coordinate and then recovers position from it.

# Compare a direct state predictor with an observation-space predictor.
# The state model predicts object position; the observation model predicts
# a pixel coordinate that must be converted back into world space.
from math import fabs

x_t = 0.40
action = 0.12
true_next = x_t + action

state_pred = x_t + 0.95 * action

pixel_scale = 320.0
predicted_pixel = pixel_scale * (x_t + 0.90 * action) + 2.0
obs_pred = predicted_pixel / pixel_scale

print(
    {
        "true_next": round(true_next, 3),
        "state_pred": round(state_pred, 3),
        "obs_pred": round(obs_pred, 3),
        "state_abs_error": round(fabs(true_next - state_pred), 5),
        "obs_abs_error": round(fabs(true_next - obs_pred), 5),
    }
)

{'true_next': 0.52, 'state_pred': 0.514, 'obs_pred': 0.514, 'state_abs_error': 0.006, 'obs_abs_error': 0.00575}

Read the two absolute errors as a comparison of prediction targets. The state predictor is off by 0.006. The observation route is off by 0.00575, marginally better only because its +2-pixel offset happens to cancel part of its weaker action gain (0.90 versus 0.95). To get there it converts back through the pixel scale, a computation the controller pays on every step. The useful question is whether a 0.00025 difference actually changes a control decision, not merely which number is smaller.

Code Fragment 36.2.1: The observation route is marginally more accurate here, but it pays an extra decode step. In real systems, that trade-off is only worth it when the observation contains control-relevant structure that a smaller state cannot preserve.
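Whether a precision gap matters can be checked by feeding both predictions to the controller and comparing its decisions. The sketch below reuses the toy predictors from the fragment above. The `push_again` rule, the goal, and the tolerance are hypothetical and introduced only for illustration.

```python
# Does the gap between two predictions change what the controller does?
# The push rule, goal, and tolerance are hypothetical assumptions.

def push_again(predicted_x, goal_x=0.55, tolerance=0.01):
    """Hypothetical rule: keep pushing while the object is short of the goal band."""
    return predicted_x < goal_x - tolerance

x_t, action = 0.40, 0.12
pixel_scale = 320.0

true_next = x_t + action
state_pred = x_t + 0.95 * action
obs_pred = (pixel_scale * (x_t + 0.90 * action) + 2.0) / pixel_scale

decisions = {
    "true": push_again(true_next),
    "state": push_again(state_pred),
    "obs": push_again(obs_pred),
}
decision_changed = decisions["state"] != decisions["obs"]

print(decisions, "decision_changed:", decision_changed)
```

Under this rule all three inputs produce the same decision, so the extra decode step buys nothing for this controller. The two targets would only differ if the decision threshold happened to fall between the two predicted positions.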

Library Shortcut

Use PyTorch or JAX to implement the predictors, Gymnasium's reset and step interface as the transition contract, and MuJoCo when the state variable should include contact, velocity, or actuator dynamics rather than only kinematic position.

Design Rule

Predict the smallest variable that preserves the control objective. Add a decoder only when humans, downstream modules, or the planner itself genuinely need observation-space detail.

Warning

A decoder with beautiful frames can hide a useless latent. If the planner acts on latent state, audit value error, cost error, and constraint violation, not only image quality.
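A control-side audit of this kind can be sketched as follows. This is a toy under stated assumptions: the one-dimensional positions, the quadratic cost, the position limit, and the `audit` helper are all hypothetical. It covers cost error and missed constraint violations only. Value error would need a value function, and image quality would be measured separately on decoded frames.

```python
# Audit a predictor by what the controller reads, not by how frames look.
# Positions, goal, and limit are illustrative assumptions.

def audit(true_x, pred_x, goal, x_limit):
    """Return (mean absolute cost error, number of missed limit violations)."""
    def cost(x):
        return (x - goal) ** 2

    pairs = list(zip(true_x, pred_x))
    cost_err = sum(abs(cost(t) - cost(p)) for t, p in pairs) / len(pairs)
    # A miss is a step where the true state breaks the limit
    # but the prediction says it is safe.
    missed = sum(1 for t, p in pairs if t > x_limit and p <= x_limit)
    return cost_err, missed

true_x = [0.25, 0.5, 0.75, 1.0]   # true positions along a rollout
pred_x = [0.25, 0.5, 0.75, 0.75]  # predictor stalls on the last step

cost_err, missed = audit(true_x, pred_x, goal=0.5, x_limit=0.875)
print(cost_err, missed)
```

The predictor is exact on three of four steps, yet it hides the one step where the true state crosses the limit. An average image metric over the same rollout could look healthy while this count is already nonzero.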

Practical Example

A drone dodging cables in clutter may need pixel-space or depth-space prediction because the obstacle geometry matters directly. A torque-limited arm tracking a known part usually benefits more from joint-state and contact prediction than from reconstructing the entire camera view.

Cross-References

This section connects prediction targets to the representation choices in Chapter 28, the camera-frame geometry in Chapter 4, and the latent world-model machinery in Chapter 38.

Research Frontier

Recent world-model work increasingly drops reconstruction unless it buys something operational. Task-oriented latent models such as TD-MPC and MuDreamer ask whether the latent state retains enough information for value estimation and local planning, even if it cannot redraw the scene photorealistically.
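One way to make "dropping reconstruction" concrete is to write a training objective with no decoder term. The form below is a generic sketch in this section's notation, not the exact loss of TD-MPC or MuDreamer. The encoder $h_\theta$, reward head $\hat r_\theta$, value head $\hat Q_\theta$, value target $y_t$, stop-gradient operator $\mathrm{sg}$, and weights $c_1, c_2, c_3$ are all assumed symbols introduced here for illustration.

$$ \mathcal{L}(\theta) = c_1 \,\big\| f_\theta(z_t, a_t) - \mathrm{sg}\big(h_\theta(o_{t+1})\big) \big\|_2^2 + c_2 \,\big(\hat r_\theta(z_t, a_t) - r_t\big)^2 + c_3 \,\big(\hat Q_\theta(z_t, a_t) - y_t\big)^2, \qquad z_t = h_\theta(o_t). $$

No term compares $g_\phi(z_{t+1})$ with $o_{t+1}$, so the latent is pressured to keep only what predicts its own future, the reward, and the value.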

Self Check

Name one task where observation prediction is necessary and one where it is wasteful. What information does the controller need in each case, and how would you prove that your chosen target supplies it?

Memory Hook

State prediction tells the robot where the world is going. Observation prediction tells it what the sensors will look like when the world gets there.

Key Takeaway

Prediction targets should be chosen by control relevance, not by visual appeal. The cleanest target is the one that best supports the next decision under the real system budget.

Exercise

Pick one robot task and specify a state-space predictor and an observation-space predictor for it. Write the exact metric that would tell you which target is more useful for action.

Bibliography & Further Reading

Primary References And Tools

Reference Hafner, D. et al. "Learning Latent Dynamics for Planning from Pixels." (2019). https://arxiv.org/abs/1811.04551

PlaNet is the classic argument for planning in latent state rather than pixel space.

Reference Hafner, D. et al. "Mastering Diverse Domains through World Models." (2023). https://arxiv.org/abs/2301.04104

DreamerV3 is a strong modern example of latent predictive learning tied to behavior.

Reference MuDreamer authors. "MuDreamer: Learning Predictive World Models without Reconstruction." (2024). https://arxiv.org/html/2405.15083v1

A useful recent example of reducing or removing full reconstruction when task relevance matters more.