Part VIII: World Models and Model-Based Embodied AI | Building Embodied AI: From Perception to Autonomous Action

Part Overview

This part turns world models from a slogan into an engineering contract. A useful world model predicts something that changes action: a future state, a latent consequence, a controllable video rollout, a representation-space target, or a generated trajectory that can be scored before the robot moves.

The six chapters form a progression: prediction first, then model-based reinforcement learning and MPC, then latent dynamics, then generative video worlds, then JEPA-style predictive representations, then diffusion planning. Each chapter pairs the mechanism with a same-panel evaluation habit: baseline and candidate share the same configuration, seed panel, split, horizon, metric definition, and saved artifact.

Why This Part Matters

This part gives the reader the layer that lets an embodied agent think before acting. Later chapters assume this layer when manipulation, locomotion, driving, safety, and deployment systems must predict consequences, plan under uncertainty, and recover from mistakes.

Part VIII Evidence Standard

Do not compare a prediction score from one run with a control score from another run. In this part, construct-matched metrics are computed together, in one pass on one configuration, so the reader can audit each number against the same model, split, seed panel, and artifact.
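
The evidence standard can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the book's evaluation code: PanelConfig, evaluate, the scalar toy dynamics, and the gain values are all hypothetical names and numbers chosen for the example. The point it shows is that the prediction metric and the control metric come from the same rollout, and that baseline and candidate receive the identical configuration object.

```python
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PanelConfig:
    """Hypothetical shared panel: every model is scored on exactly this."""
    seeds: tuple = (0, 1, 2)
    split: str = "val"
    horizon: int = 5


def evaluate(name, gain, cfg):
    """Score one model; both metrics are accumulated in a single pass per seed.

    Toy assumption: the true dynamics are x' = 0.9 * x + a, and the model
    believes x' = gain * x + a. The controller trusts the model and picks
    the action that drives the *predicted* next state to zero.
    """
    rows = []
    for seed in cfg.seeds:
        rng = random.Random(seed)          # same seed -> same start state
        x = rng.uniform(-1.0, 1.0)
        pred_err = 0.0
        cost = 0.0
        for _ in range(cfg.horizon):
            a = -gain * x                  # action planned with the model
            x_pred = gain * x + a          # what the model expects
            x_next = 0.9 * x + a           # what actually happens
            pred_err += (x_pred - x_next) ** 2
            cost += x_next ** 2 + 0.1 * a ** 2
            x = x_next
        rows.append({
            "seed": seed,
            "pred_err": pred_err / cfg.horizon,
            "cost": cost / cfg.horizon,
        })
    # The returned dict is the auditable artifact: config plus per-seed rows.
    return {"model": name, "config": asdict(cfg), "rows": rows}


def mean_metric(report, key):
    """Average one metric over the seed panel."""
    return sum(row[key] for row in report["rows"]) / len(report["rows"])


cfg = PanelConfig()
baseline = evaluate("baseline", gain=0.5, cfg=cfg)
candidate = evaluate("candidate", gain=0.9, cfg=cfg)

for report in (baseline, candidate):
    print(report["model"], report["config"],
          "pred_err=%.4f" % mean_metric(report, "pred_err"),
          "cost=%.4f" % mean_metric(report, "cost"))
```

Because each report carries its own copy of the configuration and per-seed rows, a reader can check that two numbers being compared really came from the same split, seed panel, and horizon before trusting the comparison.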

Chapter 36 Predicting the Future

This chapter makes prediction operational: what is predicted, over which horizon, with which uncertainty, and for which downstream action.

36.1 Why agents need to predict
36.2 Forward/dynamics models; state vs. observation prediction
36.3 Error accumulation and horizon
36.4 Uncertainty in prediction
36.5 Planning with predicted futures

Chapter 37 Model-Based RL and MPC

This chapter connects learned dynamics to receding-horizon control, including ensembles, CEM, MPPI, imagination rollouts, sample efficiency, and failure modes.

37.1 Model-free vs. model-based trade-offs
37.2 Learning dynamics models; ensembles and uncertainty
37.3 Planning with learned models; MPC and CEM/MPPI
37.4 Imagination rollouts
37.5 Sample-efficiency advantages and failure modes

Chapter 38 Latent World Models

This chapter shows why action-relevant latent states can be more useful than pixel-perfect prediction, with DreamerV3, IRIS, and TD-MPC2 as anchors.

38.1 Why predict in latent space
38.2 Autoencoders and recurrent state-space models (RSSM)
38.3 Dreamer to DreamerV3
38.4 Transformer world models (IRIS)
38.5 TD-MPC2: latent MPC at scale
38.6 World models for visual control

Chapter 39 Generative and Video World Models

This chapter treats generative video systems as simulator candidates only when controllability, consistency, horizon, physical plausibility, and reset behavior are measured.

39.1 Generative models as learned simulators
39.2 Genie 1-3: interactive, playable world models
39.3 Video generation as world simulation: Sora and successors
39.4 NVIDIA Cosmos: world foundation models for physical AI
39.5 GameNGen and Oasis: neural game engines
39.6 Using generative world models for data and evaluation
39.7 Evaluating consistency, controllability, and horizon

Chapter 40 Predictive Representations and Self-Supervised World Models

This chapter develops JEPA-style prediction, from image and video representations to action-conditioned latent planning and robot-control probes.

40.1 Predict in representation space, not pixels: the JEPA idea
40.2 I-JEPA and V-JEPA
40.3 V-JEPA 2 and action-conditioned latent planning
40.4 Self-supervised pretraining for control

Chapter 41 Diffusion and Generative Planning

This chapter explains diffusion planning as trajectory denoising, then tests it against latency, scoring, generated-experience risk, and closed-loop safety.

41.1 Diffusion models as planners
41.2 Diffuser and Decision Diffuser
41.3 Generative trajectory planning and scoring
41.4 Generating scenes and synthetic experience
41.5 Risks of generated experience

What's Next?

Part IX: Manipulation, Locomotion, and Embodied Skills builds on the predictive layer developed in this part.