Section 39.7: Evaluating consistency, controllability, and horizon

"Agents fail at the bottleneck, not at the mean. Evaluate the minimum, not the average."

A World Model Needs More Than One Score
Figure 39.7A: The opener illustration frames evaluating consistency, controllability, and horizon as a closed-loop problem: a prediction is valuable only if it changes action selection and survives contact with reality.
Big Picture

World-model evaluation fails when it collapses many failure modes into one aesthetic score. Consistency, controllability, and usable horizon are different properties, and embodied systems often break on the weakest one rather than on the average.

Builder Route

Treat this section as the chapter's audit sheet. Each metric exists because a different kind of simulator failure misleads a planner or evaluator in a different way.

Key Insight

The minimum of the evaluation axes is usually the most operationally honest number. Agents fail at the bottleneck, not at the mean.

Problem First

Researchers and product teams love single numbers, but generative world models rarely fail in one dimension. A future can look consistent but ignore action, obey action for a few steps but drift later, or stay persistent while breaking task semantics. Evaluation must therefore expose the failure axis, not hide it.

Core Model

A useful evaluation panel separates at least three core properties:

$$\text{consistency}: o_t \rightarrow o_{t+1} \text{ stays semantically coherent},$$

$$\text{controllability}: a_t \text{ changes the future in the intended direction},$$

$$\text{usable horizon}: H^* = \max H \text{ such that the generated future remains decision-valid}.$$

For embodied use, the usable horizon is usually the most revealing number. It measures not how long the model can keep drawing plausible frames, but how long the generated future remains trustworthy enough for policy learning, evaluation, or planning.

The evaluation artifact should always include trajectories, not just aggregates. Horizon failure is often visible in one trace long before it meaningfully shifts an average score.
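
One way to make the usable horizon concrete is to treat "decision-valid" as "per-step error stays within a tolerance" and to report the number of leading steps that satisfy it. The sketch below is illustrative only: the error values, the `tolerance` parameter, and the function name `usable_horizon` are assumptions, not part of any established benchmark.

```python
from typing import Sequence


def usable_horizon(step_errors: Sequence[float], tolerance: float) -> int:
    """Return H*: the count of leading steps whose error stays within tolerance."""
    for step, err in enumerate(step_errors):
        if err > tolerance:
            return step  # steps 0..step-1 were valid, so H* equals step
    return len(step_errors)


# Hypothetical per-step errors between a generated future and a reference.
errors = [0.02, 0.03, 0.05, 0.21, 0.04, 0.40]
h_star = usable_horizon(errors, tolerance=0.10)
print(h_star)  # 3
```

Note that step 4 falls back under the tolerance, yet it does not extend the horizon. Once the future has broken, a planner cannot rely on later steps that happen to look valid again.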

Three-Axis Audit

Run the same initial state and action script through the generator, score semantic continuity, score whether actions had the intended effect, then determine the first time step at which the future stops being decision-valid. Save both the summary numbers and the trace that broke earliest.
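
The audit steps above can be sketched as a small function over a panel of traces. The sketch assumes per-step boolean checks already exist for semantic continuity and action effect (how those checks are computed is application-specific), and it reports the usable horizon conservatively as the shortest horizon across the panel. `TraceResult`, `first_failure`, `audit`, and the trace names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceResult:
    trace_id: str
    consistency: List[bool]  # per step: did the scene stay semantically coherent?
    action_ok: List[bool]    # per step: did the action have the intended effect?


def first_failure(trace: TraceResult) -> int:
    """First step where either check fails, or the trace length if none fails."""
    for t, (c, a) in enumerate(zip(trace.consistency, trace.action_ok)):
        if not (c and a):
            return t
    return len(trace.consistency)


def audit(traces: List[TraceResult]) -> dict:
    """Summarize the three axes and name the trace that broke earliest."""
    total_steps = sum(len(t.consistency) for t in traces)
    horizons = {t.trace_id: first_failure(t) for t in traces}
    worst = min(horizons, key=horizons.get)
    return {
        "consistency": sum(sum(t.consistency) for t in traces) / total_steps,
        "controllability": sum(sum(t.action_ok) for t in traces) / total_steps,
        "usable_horizon_steps": horizons[worst],
        "earliest_broken_trace": worst,  # save this trace with the summary
    }


# Illustrative panel: same action script length, two different initial states.
panel = [
    TraceResult("aisle_turn", [True] * 6, [True, True, True, True, False, False]),
    TraceResult("dock_approach", [True, True, False, False, False, False], [True] * 6),
]
report = audit(panel)
print(report)
```

The aggregate scores here look moderate, but the report also names `dock_approach` as the trace to archive, which is the artifact a reviewer can replay.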

Minimal Probe

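The following snippet takes the minimum across the three axes. It assumes each axis has already been normalized to a score in [0, 1], so usable horizon appears as a fraction (for example, of the horizon the planner needs) rather than as a step count. That minimum is a better deployment signal than the average because the planner will fail where the weakest property fails.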

# Combine consistency, controllability, and horizon into a conservative audit.
# The minimum axis is the bottleneck the deployment team must fix first.
metrics = {
    "consistency": 0.88,
    "controllability": 0.72,
    "usable_horizon": 0.54,
}
bottleneck = min(metrics, key=metrics.get)
print({"bottleneck": bottleneck, "audit_pass": min(metrics.values()) >= 0.7})

{'bottleneck': 'usable_horizon', 'audit_pass': False}

Expected behavior: The audit fails because the usable horizon is too short even though the short-term clip looks coherent and somewhat controllable. That is exactly the point of the panel: a planner or evaluator needs a long-enough trustworthy future, not merely an attractive first second.

Code Fragment 1: This conservative audit surfaces the weakest link in the generated world. Here the bottleneck is usable horizon, which means the team should spend effort on long-rollout stability before celebrating visual or short-step control quality.
Library Shortcut

The audit logic is tiny, but it becomes powerful when paired with maintained generation backends and reproducible evaluation scripts. The right shortcut is not an automatic simulator-score library. It is a stable harness that replays the same seed states and action scripts against every new model version and stores the traces next to the summary table.
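
A minimal sketch of such a harness follows. It fingerprints the panel so that two reports can show they were scored on the same seeds and action scripts, and it keeps the full traces alongside the model name. Everything here is hypothetical scaffolding: `PANEL`, `run_harness`, and `toy_rollout` are illustrative names, and `toy_rollout` merely stands in for a real generation backend.

```python
import hashlib
import json
from typing import Callable, Dict, List

# Hypothetical panel: fixed seed states and action scripts shared by all versions.
PANEL = [
    {"seed": 0, "actions": ["forward", "forward", "left"]},
    {"seed": 1, "actions": ["forward", "right", "forward"]},
]


def panel_fingerprint(panel: List[dict]) -> str:
    """Stable hash so two reports can prove they used the same panel."""
    blob = json.dumps(panel, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_harness(model_name: str, rollout: Callable[[int, List[str]], List[str]]) -> Dict:
    """Replay every panel entry through `rollout` and keep the full traces."""
    traces = [
        {"seed": e["seed"], "actions": e["actions"], "frames": rollout(e["seed"], e["actions"])}
        for e in PANEL
    ]
    return {"model": model_name, "panel": panel_fingerprint(PANEL), "traces": traces}


def toy_rollout(seed: int, actions: List[str]) -> List[str]:
    """Stand-in generator for illustration only; a real backend would go here."""
    return [f"s{seed}:{a}" for a in actions]


report_v1 = run_harness("model-v1", toy_rollout)
report_v2 = run_harness("model-v2", toy_rollout)
print(report_v1["panel"] == report_v2["panel"])  # True
```

Because each report is plain JSON-serializable data, it can be written next to the summary table with `json.dump` and replayed later.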

Practical Recipe

  1. Report each axis separately and report the bottleneck explicitly.
  2. Keep at least one broken trace in every evaluation packet.
  3. Test horizon under repeated actions and under branching counterfactual actions.
  4. Do not compare models unless they are scored on the same initial-state panel, action scripts, and acceptance thresholds.
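
For step 3 of the recipe, the two kinds of action script can be generated mechanically. The sketch below assumes a small discrete action vocabulary (`ACTIONS` is hypothetical) and builds one repeated-action script per action plus counterfactual branches that share a prefix with a base script and differ at a single step.

```python
from typing import List

ACTIONS = ["forward", "left", "right"]  # hypothetical action vocabulary


def repeated_scripts(length: int) -> List[List[str]]:
    """One script per action, repeating that action for the whole rollout."""
    return [[a] * length for a in ACTIONS]


def branching_scripts(base: List[str], branch_step: int) -> List[List[str]]:
    """Counterfactuals: same as `base` except for the action at `branch_step`."""
    return [
        base[:branch_step] + [alt] + base[branch_step + 1:]
        for alt in ACTIONS
        if alt != base[branch_step]
    ]


base = ["forward", "forward", "left", "forward"]
scripts = repeated_scripts(4) + branching_scripts(base, branch_step=2)
for script in scripts:
    print(script)
```

Scoring the branches against each other shows whether the generated futures actually diverge after the branch step, which is a direct controllability check at longer horizons.
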
Warning

Averaging over axes can hide the very failure that would sink deployment. If one property is below threshold, the world model should fail the gate even when the average looks healthy.
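
A small numeric illustration makes the warning concrete. The scores and the shared threshold below are made up for the example: two strong axes pull the mean above the gate while the third axis sits well below it.

```python
from statistics import mean

THRESHOLD = 0.7  # hypothetical acceptance threshold, shared by all axes
scores = {"consistency": 0.95, "controllability": 0.93, "usable_horizon": 0.55}

mean_pass = mean(scores.values()) >= THRESHOLD  # average is about 0.81
min_pass = min(scores.values()) >= THRESHOLD    # weakest axis is 0.55
print({"mean_gate": mean_pass, "min_gate": min_pass})
# {'mean_gate': True, 'min_gate': False}
```

The mean-based gate would ship this model; the minimum-based gate correctly blocks it on usable horizon.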

Practical Example

An evaluation team for a warehouse robot may find that a generated world model preserves object identities and follows turns for three seconds, then silently shortens hallways and changes shelf geometry. A short clip looks fine. A usable-horizon audit reveals that the planner's future became untrustworthy exactly where navigation decisions become harder.

Research Frontier

The frontier is automated world-model evals that stay task-grounded. Researchers are pushing toward metrics that capture causal consistency, not just perceptual quality, and toward evaluation loops that connect world-model scores directly to downstream policy improvement or failure.

Cross-Reference Thread

For broader embodied-system evaluation design, revisit Chapter 52. For uncertainty and safety, connect to Chapter 53. For compact latent alternatives with different audit needs, compare against Chapter 38.

Good evaluation is not an afterthought to world-model research. It shapes what progress means. If the field rewards only photorealism, models optimize photorealism. If the field rewards task-grounded controllability and usable horizon, model design and data curation follow those incentives.

This is also why reproducibility matters so much here. A world model can fail on one initial state and look excellent on another. Without saved seed panels and trace artifacts, evaluation quickly collapses back into storytelling. The right artifact makes the failure replayable.

Self Check

If you had to reject a generative world model version today, which axis would you inspect first for your application, and what single broken trace would convince a teammate that the rejection was justified?

Key Takeaway

Evaluate generative world models by their weakest control-relevant property, because the agent will break at the bottleneck, not at the average.

Exercise 39.7.1

Create a three-axis evaluation card for one world-model application. Define the acceptance threshold for each axis and describe the exact trace you would save when the model fails that threshold.

Bibliography & Further Reading

Primary References And Tools

Reference OpenAI. "Video Generation Models as World Simulators." (2024). https://openai.com/index/video-generation-models-as-world-simulators/

The report motivates the simulator framing that this audit section then tightens.

Reference Google DeepMind. "Genie 3: A New Frontier for World Models." (2025). https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/

A current interactive-world-model reference that makes controllability and horizon questions unavoidable.

Reference NVIDIA. "Physical AI with World Foundation Models." (2026). https://www.nvidia.com/en-us/ai/cosmos/

A platform reference for why evaluation must connect generated worlds to downstream policy development.