Section 41.1: Diffusion models as planners | Building Embodied AI: From Perception to Autonomous Action

A policy that cannot represent uncertainty about where to put the gripper will put it in the average place, which is often wrong in a bimodal world.
A Multimodal Planner

Figure 41.1A: A diffusion planner applied to bimodal grasping. The score network denoises a trajectory from Gaussian noise, conditioned on the current scene embedding, and the final denoised plan commits to one of the two valid grasp modes rather than averaging them.

Big Picture

Diffusion planning matters because embodied intelligence is a closed loop. The agent must sense, represent, predict, decide, act, observe the consequence, and revise its belief before the next action.

Action trajectories are multimodal. Faced with a mug on a table, a competent agent has several valid grasps: top handle, side rim, two-finger pinch on the body. A regression policy trained to minimize mean squared error against demonstrations averages those modes, and the average of a left grasp and a right grasp is a collision with the mug. This is the core reason diffusion models entered embodied AI: their iterative denoising naturally represents a multimodal distribution over trajectories rather than a single conditional mean.
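The averaging failure can be checked numerically. The sketch below is a minimal illustration, not a robot experiment: the demonstrations are synthetic, and the lateral offsets of -4 cm and +4 cm for the left and right grasps are assumed values chosen only to make the two modes obvious. For a fixed observation, the constant prediction that minimizes mean squared error is the sample mean, so the regression answer lands between the modes, where no demonstrator ever grasped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demonstrations (assumed values): lateral gripper offset in cm
# relative to the mug centre. Half grasp from the left, half from the right.
left = rng.normal(loc=-4.0, scale=0.3, size=500)
right = rng.normal(loc=+4.0, scale=0.3, size=500)
demos = np.concatenate([left, right])

# For one fixed observation, the MSE-optimal constant prediction is the mean.
mse_optimal = demos.mean()

# Distance from that prediction to the nearest action anyone demonstrated.
gap = np.abs(demos - mse_optimal).min()

print(f"MSE-optimal offset: {mse_optimal:+.2f} cm")
print(f"nearest demonstrated grasp is {gap:.2f} cm away")
```

The regression output sits near zero offset, directly over the mug, while every demonstration is several centimetres to one side.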

The practical question is not whether the model looks impressive. The question is which action becomes easier, safer, more data-efficient, or more recoverable when the method is inserted into the loop. For multimodal action spaces, the answer is that the policy stops committing to the average place and starts committing to one coherent mode.

Action Is The Test

A model earns its place only when it improves action. While reading this section, keep asking which decision changes, which uncertainty is exposed, and which failure mode becomes easier to diagnose.

Theory

A diffusion planner treats a trajectory $\tau_0$ (a sequence of actions, states, or both) as a sample from a data distribution it must learn to generate. Training defines a fixed forward process that gradually corrupts a clean trajectory into Gaussian noise, and learning fits a reverse process that walks noise back to a plausible trajectory conditioned on context $c$ (the observation, the goal, or a return target).

The forward process adds noise according to a variance schedule $\beta_1,\dots,\beta_T$. Writing $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s\le t}\alpha_s$, the closed form for the noised trajectory at step $t$ is

$$ q(\tau_t \mid \tau_0) = \mathcal{N}\!\left(\tau_t;\ \sqrt{\bar\alpha_t}\,\tau_0,\ (1-\bar\alpha_t) I\right). $$
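To see how the schedule sets the balance between signal and noise in this closed form, here is a minimal sketch. The linear schedule, its endpoints, and $T = 100$ are illustrative assumptions, not values prescribed by this section.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.1, T)   # assumed linear variance schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)           # abar[t-1] = prod_{s<=t} alpha_s

for t in (1, 25, 50, 100):
    signal = np.sqrt(abar[t - 1])           # coefficient multiplying tau_0
    noise_std = np.sqrt(1.0 - abar[t - 1])  # standard deviation of the added noise
    print(f"t={t:3d}  sqrt(abar)={signal:.3f}  noise_std={noise_std:.3f}")
```

The coefficient on $\tau_0$ starts close to one and decays, while the noise standard deviation rises toward one.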

As $t$ approaches $T$, $\bar\alpha_t \to 0$ and the trajectory dissolves into standard normal noise. The reverse process is modeled as Gaussian as well, and it is the object we train:

$$ p_\theta(\tau_{t-1} \mid \tau_t, c) = \mathcal{N}\!\left(\tau_{t-1};\ \mu_\theta(\tau_t, t, c),\ \sigma_t^2 I\right). $$

The network learns the mean $\mu_\theta$ (equivalently, it predicts the noise that was added), and sampling chains these reverse steps from $t=T$ down to $t=1$, ending at a clean trajectory $\tau_0$. Because each reverse step is stochastic and conditioned on $c$, repeated sampling from the same observation yields different valid trajectories. This is exactly the multimodal behavior that a regression policy cannot produce.
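The multimodal claim can be tested on a problem small enough to solve exactly. In the sketch below, a scalar stands in for a whole trajectory and the data distribution has two equally likely modes at $-1$ and $+1$ (think of them as the two grasps). For this distribution the optimal noise prediction has a closed form, so the function `eps_oracle` replaces the trained network. The schedule, the choice $\sigma_t^2 = \beta_t$, and all names are assumptions made for illustration.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.05, T)   # assumed linear schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)

def eps_oracle(x, t):
    """Exact E[eps | x_t] when x_0 is -1 or +1 with equal probability."""
    s = np.sqrt(abar[t])
    x0_hat = np.tanh(s * x / (1.0 - abar[t]))   # posterior mean of x_0
    return (x - s * x0_hat) / np.sqrt(1.0 - abar[t])

def sample(n, rng):
    """Ancestral sampling: chain the Gaussian reverse steps from noise to data."""
    x = rng.normal(size=n)                       # start from pure noise
    for t in reversed(range(T)):
        eps_hat = eps_oracle(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
        z = rng.normal(size=n) if t > 0 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[t]) * z
    return x

samples = sample(2000, np.random.default_rng(0))
print(f"fraction near -1: {np.mean(np.abs(samples + 1.0) < 0.1):.2f}")
print(f"fraction near +1: {np.mean(np.abs(samples - 1.0) < 0.1):.2f}")
print(f"fraction near  0: {np.mean(np.abs(samples) < 0.5):.2f}")
```

Each sample commits to one mode, and both modes appear across samples. The mean of the samples is still close to zero, but no individual sample is, which is the difference between representing a distribution and regressing to its mean.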

Mechanism

The forward schedule is fixed and parameter-free; all learning lives in the reverse denoiser. Given $\tau_0$ and a sampled timestep $t$, you can jump directly to $\tau_t$ in one step using the closed form for $q(\tau_t \mid \tau_0)$, with no need to simulate the chain. That is what makes training cheap: sample a timestep, noise the trajectory, and ask the network to predict the noise.
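The three training steps just described can be sketched as follows. This is a minimal outline under stated assumptions: the helper names are hypothetical, the loss is taken to be mean squared error on the noise (a common choice that this section does not spell out), and a zero predictor stands in for the network so the block runs without a deep learning framework.

```python
import numpy as np

def make_training_example(tau0, abar, rng):
    """Sample a timestep, noise tau0 in one shot, and return the regression target."""
    t = int(rng.integers(len(abar)))      # uniform timestep index
    eps = rng.normal(size=tau0.shape)     # the noise the network must predict
    tau_t = np.sqrt(abar[t]) * tau0 + np.sqrt(1.0 - abar[t]) * eps
    return tau_t, t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (assumed loss)."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.1, 100))   # assumed schedule
tau0 = np.stack([np.linspace(0.0, 1.0, 6), np.zeros(6)], axis=1)

tau_t, t, eps = make_training_example(tau0, abar, rng)

# Placeholder predictor: always outputs zero noise.
# A trained eps_theta(tau_t, t, c) would be called here instead.
eps_pred = np.zeros_like(tau_t)
print("timestep index:", t, " loss:", noise_prediction_loss(eps_pred, eps))
```

Note that no reverse chain is run during training. Each example costs one noise draw and one forward pass.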

Worked Example

The probe below makes the forward and reverse processes concrete on a tiny 2D trajectory that should reach the goal at $(1, 0)$. We noise the clean plan at five levels of a schedule $\bar\alpha_t$ using the closed form and one shared noise draw, then run five reverse steps with a deliberately simple hand-written denoiser that estimates $\tau_0$ and takes a DDIM-style update. The diagnostic to watch is the endpoint distance to the goal: it should grow during noising and shrink back toward zero during denoising.

# Forward noising then reverse denoising of a 2D action trajectory.
# Forward:  q(tau_t | tau_0) = N(sqrt(abar_t) tau_0, (1 - abar_t) I)
# Reverse:  estimate tau_0 from tau_t, then take a deterministic step toward tau_{t-1}.
import numpy as np

H = 6                                   # waypoints in the trajectory
goal = np.array([1.0, 0.0])
tau0 = np.stack([np.linspace(0.0, 1.0, H), np.zeros(H)], axis=1)   # clean plan

T = 5
abar = np.array([0.85, 0.65, 0.45, 0.25, 0.08])   # signal retained at each step
rng = np.random.default_rng(7)
noise = rng.normal(size=tau0.shape)               # one fixed noise draw

def endpoint_dist(tau):
    return float(np.linalg.norm(tau[-1] - goal))

print("Forward noising q(tau_t | tau_0):")
forward = []
for t in range(T):
    tau_t = np.sqrt(abar[t]) * tau0 + np.sqrt(1.0 - abar[t]) * noise
    forward.append(tau_t)
    print(f"  t={t+1}  abar={abar[t]:.2f}  dist_to_goal={endpoint_dist(tau_t):.3f}")

def denoise_to_tau0(tau_t):
    # Stand-in for the trained denoiser: pull waypoints back onto the y=0 line to (1,0).
    x = np.clip(tau_t[:, 0], 0.0, 1.0)
    return np.stack([np.linspace(x.min(), 1.0, H), tau_t[:, 1] * 0.1], axis=1)

print("\nReverse denoising p_theta(tau_{t-1} | tau_t, goal):")
tau = forward[-1].copy()
for t in reversed(range(T)):
    tau0_hat = denoise_to_tau0(tau)
    if t > 0:                                       # DDIM-style deterministic step
        eps_hat = (tau - np.sqrt(abar[t]) * tau0_hat) / np.sqrt(1.0 - abar[t])
        tau = np.sqrt(abar[t-1]) * tau0_hat + np.sqrt(1.0 - abar[t-1]) * eps_hat
    else:
        tau = tau0_hat
    print(f"  t={t}  dist_to_goal={endpoint_dist(tau):.3f}")

Forward noising q(tau_t | tau_0):
  t=1  abar=0.85  dist_to_goal=0.178
  t=2  abar=0.65  dist_to_goal=0.232
  t=3  abar=0.45  dist_to_goal=0.267
  t=4  abar=0.25  dist_to_goal=0.318
  t=5  abar=0.08  dist_to_goal=0.422

Reverse denoising p_theta(tau_{t-1} | tau_t, goal):
  t=4  dist_to_goal=0.326
  t=3  dist_to_goal=0.282
  t=2  dist_to_goal=0.250
  t=1  dist_to_goal=0.195
  t=0  dist_to_goal=0.016

Code Fragment 41.1.1 runs the forward and reverse diffusion processes on a 2D trajectory and prints the endpoint distance to the goal at every step.

Read the two columns together. Forward noising monotonically drives the endpoint away from the goal as $\bar\alpha_t$ shrinks; reverse denoising walks it back to $0.016$, essentially the goal. A real planner replaces denoise_to_tau0 with a trained network $\epsilon_\theta(\tau_t, t, c)$, and replaces the single noise draw with fresh Gaussian samples so that repeated runs produce different valid plans. The point is not that this toy "solved planning"; it is that the same forward and reverse machinery scales from this 6-point line to a 16-step manipulation action chunk conditioned on a camera observation.

Library Shortcut

The hand-built probe exposes the planning assumption. Diffuser-style or Decision-Diffuser-style tooling should preserve the same logging and evaluation fields, so that results from the library and from the probe remain comparable.

Practical Recipe

1. Write the observation, action, horizon, and success metric before choosing a model.
2. Build a baseline that is simple enough to debug by inspection.
3. Add the maintained implementation only after the baseline behavior is understood.
4. Save one artifact containing configuration, seed panel, traces, metrics, and failure labels.
5. Run at least one perturbation test before trusting the result.
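The single artifact named in the recipe can be as plain as one JSON document. The sketch below is one possible layout, not a schema from any of the libraries discussed here: the class name `RunArtifact`, its field names, and all values are hypothetical placeholders.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunArtifact:
    """Hypothetical container for everything needed to replay a comparison."""
    config: dict
    seeds: list
    traces: list = field(default_factory=list)           # one entry per rollout
    metrics: dict = field(default_factory=dict)
    failure_labels: dict = field(default_factory=dict)   # seed (as string) -> label

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Placeholder values, for illustration only.
artifact = RunArtifact(config={"planner": "baseline", "horizon": 16}, seeds=[0, 1, 2])
artifact.traces.append({"seed": 0, "actions": [[0.0, 0.0], [0.1, 0.0]], "success": True})
artifact.metrics["success_rate"] = 1.0
artifact.failure_labels["1"] = "timing"

# Round-trip through JSON to confirm the artifact is self-contained.
restored = json.loads(artifact.to_json())
print(sorted(restored))
```

Keeping the baseline and the candidate in the same layout means a comparison script can read both without special cases.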

Common Failure Mode

Evaluate the generated or predicted plan through the closed loop that consumes it, because interface failures often dominate component scores.

Practical Example: Diffusion Models As Planners

Who: Priya, planning engineer for a bimanual assembly robot.
Situation: A 12-robot evaluation shift must decide whether a diffusion planner improves closed-loop behavior.
Problem: Offline scores look promising, but the robot still has to recover from sensor noise, delayed actuation, and rare contact changes.
Dilemma: Ship the new model on the strength of a visual or reward score, or require a matched baseline, one perturbation panel, and manual review of the 20 hardest rollouts.
Decision: The lead keeps the baseline and candidate on the same seed panel and logs every observation, action, intervention, and terminal state.
How: The run saves one artifact with configuration, metrics, latency, videos, and failure tags.
Result: The candidate is accepted only when it improves the chapter metric by 10 percent and does not increase unsafe recoveries.
Lesson: A diffusion planner matters when it changes decisions in the loop, not when it only improves a standalone proxy.
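The acceptance rule in this example is simple enough to write down as a function, which removes any room for after-the-fact interpretation. The sketch below makes two assumptions that the example leaves open: the 10 percent is read as a relative gain over the baseline, and a higher metric is better. The function name and arguments are hypothetical.

```python
def accept_candidate(baseline_metric, candidate_metric,
                     baseline_unsafe, candidate_unsafe,
                     min_relative_gain=0.10):
    """Accept only if the metric improves by the required relative margin
    and the count of unsafe recoveries does not increase.
    Assumes a higher metric is better."""
    if baseline_metric <= 0:
        raise ValueError("baseline_metric must be positive to compute a relative gain")
    gain = (candidate_metric - baseline_metric) / baseline_metric
    return gain >= min_relative_gain and candidate_unsafe <= baseline_unsafe

# Illustrative inputs only, not measured results.
print(accept_candidate(0.60, 0.70, baseline_unsafe=3, candidate_unsafe=3))
print(accept_candidate(0.60, 0.70, baseline_unsafe=3, candidate_unsafe=4))
print(accept_candidate(0.60, 0.62, baseline_unsafe=3, candidate_unsafe=2))
```

Only the first call passes. The second improves the metric but adds an unsafe recovery, and the third is safer but falls short of the margin.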

Research Frontier

Diffusion planning is useful when multimodal trajectories matter, but sampling latency, score hacking, and dataset support limits must be made explicit. A generated plan earns trust by surviving the same closed-loop evaluation as any other planner.

Cross-Reference Thread

Connect diffusion-policy tooling, MPC baselines, and safety constraints by recording the same four fields for every planning call: the planner input, the sampled plan, the feasibility check, and the executed action.
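A minimal sketch of that record follows. The feasibility rule used here (waypoints inside a box, bounded distance between consecutive waypoints) is an assumed stand-in for whatever safety constraints a real system enforces, and the function names and limits are hypothetical.

```python
import numpy as np

def is_feasible(plan, max_step, lower, upper):
    """Assumed check: every waypoint lies in [lower, upper] and no two
    consecutive waypoints are further apart than max_step."""
    plan = np.asarray(plan, dtype=float)
    in_bounds = np.all((plan >= lower) & (plan <= upper))
    steps = np.linalg.norm(np.diff(plan, axis=0), axis=1)
    return bool(in_bounds and np.all(steps <= max_step))

def plan_record(planner_input, sampled_plan, executed_action, feasible):
    """Bundle the four fields into one JSON-friendly dictionary."""
    return {
        "planner_input": np.asarray(planner_input, dtype=float).tolist(),
        "sampled_plan": np.asarray(sampled_plan, dtype=float).tolist(),
        "feasible": bool(feasible),
        "executed_action": np.asarray(executed_action, dtype=float).tolist(),
    }

obs = np.array([0.0, 0.0])
plan = np.stack([np.linspace(0.0, 1.0, 6), np.zeros(6)], axis=1)
ok = is_feasible(plan, max_step=0.25, lower=-0.1, upper=1.1)

# Execute the next waypoint only if the plan passed; otherwise hold position.
action = plan[1] if ok else obs
record = plan_record(obs, plan, action, ok)
print(record["feasible"], record["executed_action"])
```

Because the same record works for an MPC baseline, a rejected plan and the fallback action that replaced it stay visible in the trace for either method.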

Self Check

Can you state the observation, state estimate, action, prediction horizon, success metric, and most likely failure mode for your diffusion planner? If not, the system boundary is still too vague.

Diffusion planning becomes useful when it is tied to a concrete contract covering trajectory denoising, conditional planning, synthetic experience, and generated-action risk control. The contract names the observation stream, latent or physical state, action representation, timing budget, and evaluation artifact before any model comparison is made. This answers the core agent checklist questions: what changes in the loop, why it should help, how it is measured, and when the method should be rejected.

Diffuser, Decision Diffuser, and Diffusion Policy give three different views: planning trajectories, conditioning decisions on return, and generating robot actions from observations. The skeptical-reader test is simple: a claim about a diffusion planner must identify the baseline, the shared seed panel, the horizon or task split, and the failure labels saved in one artifact.

Tool or Library	Role in This Topic	Builder Advice
Diffuser	Plans by denoising whole trajectories, with sampling and conditioning in place of a hand-designed optimizer.	Use it after the from-scratch probe states the same observation, action, metric, and failure tag.
Decision Diffuser	Frames decision making as conditional generation, conditioning trajectories on a return or a goal.	Use it after the from-scratch probe states the same observation, action, metric, and failure tag.
Diffusion Policy	Generates robot actions from visual observations by action diffusion.	Use it after the from-scratch probe states the same observation, action, metric, and failure tag.
PyTorch	Deep learning framework for implementing and training the denoising network.	Use it after the from-scratch probe states the same observation, action, metric, and failure tag.
Gymnasium	Environment interface for running closed-loop rollouts and evaluation.	Use it after the from-scratch probe states the same observation, action, metric, and failure tag.

Keep one inspectable probe for the model assumption, then use maintained libraries without changing the artifact schema used for baseline comparison.

1. Write the observation, action, state estimate, success metric, and rejection criterion.
2. Run a deterministic smoke test on one seed and save the complete configuration.
3. Add one perturbation tied to the section topic: delay, noise, horizon length, contact change, distractor object, or generated-scene shift.
4. Compare only methods evaluated by the same script, split, seed panel, and metric definition.
5. Record a postmortem that assigns failures to perception, representation, dynamics, planning, control, data coverage, timing, or evaluation.
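Two of the perturbations listed above, actuation delay and observation noise, need only a few lines each. The helpers below are a minimal sketch with hypothetical names. One modeling choice is an assumption worth stating: during the delay, the first commanded action is held, whereas some systems would send a zero action instead.

```python
import numpy as np

def delay_actions(actions, delay_steps):
    """Shift a planned action sequence later by delay_steps, holding the
    first action during the delay and keeping the original length."""
    actions = np.asarray(actions, dtype=float)
    if delay_steps <= 0:
        return actions.copy()
    pad = np.repeat(actions[:1], delay_steps, axis=0)
    return np.concatenate([pad, actions], axis=0)[: len(actions)]

def add_observation_noise(obs, std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    obs = np.asarray(obs, dtype=float)
    return obs + rng.normal(scale=std, size=obs.shape)

plan = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
delayed = delay_actions(plan, delay_steps=2)
noisy_obs = add_observation_noise([0.5, 0.5], std=0.01, rng=np.random.default_rng(0))

print(delayed)
print(noisy_obs)
```

Applying the same helper, with the same seed, to both the baseline and the candidate keeps the perturbation panel matched.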

When a diffusion planner fails, do not collapse the result into a single method verdict. Assign the failure to the interface that broke, rerun one controlled perturbation, and keep the trace next to the metric. That habit turns a disappointing rollout into a reusable diagnostic asset.

Memory Hook

A diffusion planner is a sketch artist for futures: it keeps redrawing the route until the next move looks executable.

Key Takeaway

A diffusion planner is useful when it improves a measured closed-loop decision, exposes its uncertainty, and leaves behind an artifact that another reader can replay.

Exercise 41.1.1

Design a minimal experiment for a diffusion planner. Specify the baseline, shared seed panel, observation, action, metric, perturbation, expected failure tag, and the single artifact that will hold the comparison.

Bibliography & Further Reading

Primary References And Tools

Janner, M. et al. "Planning with Diffusion for Flexible Behavior Synthesis." (2022). https://arxiv.org/abs/2205.09991

Diffuser is the core trajectory-denoising reference for planning. It shows how sampling and conditioning can replace a hand-designed optimizer in some offline decision problems.

Ajay, A. et al. "Is Conditional Generative Modeling All You Need for Decision-Making?" (2022). https://arxiv.org/abs/2211.15657

Decision Diffuser frames decision making as conditional generation. It is useful for comparing return conditioning, goal conditioning, and trajectory feasibility.

Chi, C. et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." (2023). https://arxiv.org/abs/2303.04137

Diffusion Policy is the practical robotics anchor for action diffusion. It helps readers connect planning-style denoising with continuous robot control from visual observations.

Dong, Z. et al. "DiffuserLite: Towards Real-Time Diffusion Planning." (2024). https://arxiv.org/abs/2401.15443

DiffuserLite focuses on planning frequency and sample efficiency. It is relevant whenever a diffusion planner must fit into a real control loop rather than an offline demonstration.

Lu, H. et al. "What Makes a Good Diffusion Planner for Decision Making?" (2025). https://arxiv.org/abs/2503.00535

This large empirical study examines design choices in diffusion planning. It is a useful guardrail against treating denoising as a universal planner without checking architecture, guidance, and evaluation details.