Part Overview
This part turns world models from a slogan into an engineering contract. A useful world model predicts something that changes action: a future state, a latent consequence, a controllable video rollout, a representation-space target, or a generated trajectory that can be scored before the robot moves.
The six chapters form a progression: prediction first, then model-based reinforcement learning and MPC, then latent dynamics, then generative video worlds, then JEPA-style predictive representations, then diffusion planning. Each chapter pairs its mechanism with a same-panel evaluation habit: baseline and candidate share the same configuration, seed panel, split, horizon, metric definition, and saved artifact.
World Models and Model-Based Embodied AI gives the reader the layer that lets an embodied agent think before acting. Later chapters assume this layer when manipulation, locomotion, driving, safety, and deployment systems must predict consequences, plan under uncertainty, and recover from mistakes.
Do not compare a prediction score from one run with a control score from another run. In this part, construct-matched metrics are co-computed in one pass on one configuration so the reader can audit each number against the same model, split, seed panel, and artifact.
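The same-panel rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation harness used in the chapters: `EvalConfig`, `evaluate`, `compare`, the scalar toy dynamics, and the cost weights are all hypothetical names and values chosen for the example. The point it shows is that the prediction metric and the control metric are accumulated in the same loop, on the same seeds, and that a comparison refuses rows that do not come from the same panel.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    # Hypothetical shared panel: both models must be scored with this object.
    seeds: tuple = (0, 1, 2, 3)
    split: str = "val"
    horizon: int = 5


TRUE_GAIN = 0.9  # toy ground truth (assumption): x' = TRUE_GAIN * x + a


def evaluate(model_gain, cfg):
    """Co-compute a prediction metric and a control metric in one pass."""
    rows = []
    for seed in cfg.seeds:
        x = random.Random(seed).uniform(0.5, 1.0)  # seed fixes the start state
        pred_err = 0.0
        cost = 0.0
        for _ in range(cfg.horizon):
            a = -model_gain * x             # action the model believes drives x' to 0
            x_next = TRUE_GAIN * x + a      # what the toy environment actually does
            pred_err += abs(x_next - 0.0)   # model predicted 0, so this is its error
            cost += x_next ** 2 + 0.1 * a ** 2
            x = x_next
        rows.append({
            "seed": seed, "split": cfg.split, "horizon": cfg.horizon,
            "pred_mae": pred_err / cfg.horizon, "control_cost": cost,
        })
    return rows


def compare(baseline, candidate):
    """Per-seed deltas; refuses rows that do not share seed, split, and horizon."""
    if len(baseline) != len(candidate):
        raise ValueError("rows are not from the same panel")
    deltas = []
    for b, c in zip(baseline, candidate):
        if any(b[k] != c[k] for k in ("seed", "split", "horizon")):
            raise ValueError("rows are not from the same panel")
        deltas.append({
            "seed": b["seed"],
            "d_pred_mae": c["pred_mae"] - b["pred_mae"],
            "d_control_cost": c["control_cost"] - b["control_cost"],
        })
    return deltas


if __name__ == "__main__":
    cfg = EvalConfig()
    for row in compare(evaluate(0.5, cfg), evaluate(0.85, cfg)):
        print(row)
```

Because every row carries its seed, split, and horizon, a reader can audit each delta against the panel that produced it, and a mismatched comparison fails loudly instead of producing a number.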
Chapter 36 makes prediction operational: what is predicted, over which horizon, with which uncertainty, and for which downstream action.
- 36.1 Why agents need to predict
- 36.2 Forward/dynamics models; state vs. observation prediction
- 36.3 Error accumulation and horizon
- 36.4 Uncertainty in prediction
- 36.5 Planning with predicted futures
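The link between error accumulation and horizon listed above can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not a result from the chapter: the scalar dynamics, the gains `true_gain` and `model_gain`, and the function name `rollout_errors` are all hypothetical. It contrasts one-step error (the model is fed the true state each step) with open-loop error (the model is fed its own previous prediction).

```python
def rollout_errors(true_gain=1.0, model_gain=1.02, x0=1.0, horizon=20):
    """Return (one_step, open_loop) absolute errors at each step of a toy rollout."""
    x_true = x0
    x_model = x0
    one_step = []
    open_loop = []
    for _ in range(horizon):
        # One-step prediction: the model starts from the true current state.
        one_step_pred = model_gain * x_true
        # Open-loop prediction: the model starts from its own last prediction.
        x_model = model_gain * x_model
        # Advance the true system.
        x_true = true_gain * x_true
        one_step.append(abs(one_step_pred - x_true))
        open_loop.append(abs(x_model - x_true))
    return one_step, open_loop


if __name__ == "__main__":
    one_step, open_loop = rollout_errors()
    for k in (1, 5, 10, 20):
        print(k, round(one_step[k - 1], 4), round(open_loop[k - 1], 4))
```

In this toy setting the one-step error stays flat while the open-loop error grows with every step, which is why a model's accuracy should be reported together with the horizon it was measured at.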
Chapter 37 connects learned dynamics to receding-horizon control, including ensembles, CEM, MPPI, imagination rollouts, sample efficiency, and failure modes.
- 37.1 Model-free vs. model-based trade-offs
- 37.2 Learning dynamics models; ensembles and uncertainty
- 37.3 Planning with learned models; MPC and CEM/MPPI
- 37.4 Imagination rollouts
- 37.5 Sample-efficiency advantages and failure modes
Chapter 38 shows why action-relevant latent states can be more useful than pixel-perfect prediction, with DreamerV3, IRIS, and TD-MPC2 as anchors.
- 38.1 Why predict in latent space
- 38.2 Autoencoders and recurrent state-space models (RSSM)
- 38.3 Dreamer to DreamerV3
- 38.4 Transformer world models (IRIS)
- 38.5 TD-MPC2: latent MPC at scale
- 38.6 World models for visual control
Chapter 39 treats generative video systems as simulator candidates only when controllability, consistency, horizon, physical plausibility, and reset behavior are measured.
- 39.1 Generative models as learned simulators
- 39.2 Genie 1-3: interactive, playable world models
- 39.3 Video generation as world simulation: Sora and successors
- 39.4 NVIDIA Cosmos: world foundation models for physical AI
- 39.5 GameNGen and Oasis: neural game engines
- 39.6 Using generative world models for data and evaluation
- 39.7 Evaluating consistency, controllability, and horizon
Chapter 40 develops JEPA-style prediction, from image and video representations to action-conditioned latent planning and robot-control probes.
- 40.1 Predict in representation space, not pixels: the JEPA idea
- 40.2 I-JEPA and V-JEPA
- 40.3 V-JEPA 2 and action-conditioned latent planning
- 40.4 Self-supervised pretraining for control
Chapter 41 explains diffusion planning as trajectory denoising, then tests it against latency, scoring, generated-experience risk, and closed-loop safety.
- 41.1 Diffusion models as planners
- 41.2 Diffuser and Decision Diffuser
- 41.3 Generative trajectory planning and scoring
- 41.4 Generating scenes and synthetic experience
- 41.5 Risks of generated experience
What's Next?
After this part, Part IX: Manipulation, Locomotion, and Embodied Skills extends the stack.