Section 2.6: Markov decision processes; Bellman equations

"The Bellman equation is what happens when a robot asks, 'and then what?' with mathematical persistence."

A Recursive Planning Agent
Big Picture

Markov decision processes; Bellman equations give closed-loop decision making its cleanest mathematical model. An MDP assumes the current state contains all information needed to predict the next state distribution and reward after an action.

Concept map for Section 2.6 A local diagram showing how Bellman backups connect immediate reward to expected future value. Evidence what the agent receives Decision what the system changes Consequence what the next step inherits Closed-loop feedback makes the next input depend on the last action.
Figure 2.6. MDPs and Bellman equations are easiest to reason about as a closed-loop evidence, decision, consequence pattern: each state inherits value from the next state distribution.

This section develops the formal model behind much of reinforcement learning and control. An MDP consists of states, actions, transition dynamics, rewards, and a discount or horizon. Bellman equations then express a powerful idea: the value of a state is immediate reward plus the value of what comes next.

MDPs are not a claim that the real world is simple. They are an approximation tool. The engineering question is whether the state representation is good enough that past history no longer adds important predictive information.

The Markov Assumption Is A Promise

Calling a problem an MDP promises that the chosen state carries the information needed for future prediction. If hidden history still matters, the model is incomplete.

Theory

An MDP is often written as $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is the transition distribution, $R(s,a)$ is reward, and $\gamma$ discounts future reward. The Markov property says the distribution over the next state depends on the current state and action, not on the full past.

A Bellman backup for a fixed policy can be written as $V(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s))V(s')$. In words: value equals immediate reward plus discounted expected future value. The summation matters because an action may lead to several possible next states, and the transition model weights each next state by its probability.

The assumption is the load-bearing part. If $s$ contains all prediction-relevant information, the backup can ignore the older history. If unobserved contact, actuator wear, or a previous collision still changes what happens next, the state is not Markov enough and the Bellman target is mixing incompatible situations.

Mechanism

The mechanism is recursive decomposition. Bellman equations let a long-horizon decision be updated from one-step transitions. This is why clean transition records, consistent rewards, and stable state definitions matter so much.

Worked Example

Code Fragment 2.6.1 performs Bellman backups in a tiny three-state MDP. The example is tabular so the recursive target stays visible.

# Section 2.6: runnable checkpoint for MDPs and Bellman equations.
# Keep the output small so the evidence record can be inspected directly.
states = ["far", "near", "done"]
value = {state: 0.0 for state in states}
gamma = 0.9
transition = {
    "far": ("near", -0.1),
    "near": ("done", 1.0),
    "done": ("done", 0.0),
}

for _ in range(3):
    for state in ["far", "near"]:
        next_state, reward = transition[state]
        value[state] = reward + gamma * value[next_state]

print({state: round(score, 3) for state, score in value.items()})
{'far': 0.8, 'near': 1.0, 'done': 0.0}
Code Fragment 2.6.1 performs Bellman backups for far, near, and done states in a tiny reach-the-goal MDP.

Expected output: the near state reaches value $1.0$ because it receives the goal reward on the next transition. The far state reaches $0.8$ because it pays an immediate penalty of $-0.1$ and then inherits $0.9 \times 1.0$ from the near state.

Library Shortcut

The 14-line Bellman backup becomes a value-function update inside RL libraries, CleanRL scripts, and Isaac Lab training workflows. The tools handle sampling, batching, replay, and function approximation. The hand-built backup is still useful because it reveals the target every value learner is approximating.

Practical Recipe

  1. Define state so it is as close to Markov as the task allows.
  2. Specify action set, transition behavior, reward, discount, and horizon.
  3. Write one Bellman backup by hand before using a trainer.
  4. Check whether omitted history changes next-state prediction.
  5. Use simulator diagnostics to test whether the MDP approximation is adequate.
Failure Mode

Calling a problem an MDP does not make it Markov. If actuator wear, object mass, unobserved contact, or past collisions affect the next transition, the state is missing information.

Practical Example

In a simulated reaching task, value estimates were noisy until the team added gripper contact state and last commanded velocity to the state vector. The policy was not suddenly smarter. The state became closer to Markov.

Memorable Shortcut

The Markov property is a strict roommate: it does not want the full past on the couch, but it does expect the current state to bring all relevant luggage.

Research Frontier

Model-based RL and world models relax the need for hand-specified transition models by learning predictive representations. Their success still depends on whether the learned state supports Bellman-style planning, control, or value estimation.

Mini Lab

Add a second action to Code Fragment 2.6.1 that keeps the robot in the same state with a smaller penalty. Run backups and identify the preferred action in each state.

Self Check

Can you name one variable that would break the Markov assumption if it were omitted from your robot state?

MDPs and Bellman equations become useful when they are tied to a closed-loop contract between policy, world, evaluator, and safety constraints. The contract names the state representation, action space, transition model or sampler, reward definition, discount convention, and result artifact. That is the bridge between a readable concept and a system a skeptical builder can test.

For Markov decision processes; Bellman equations, separate the conceptual claim, the systems claim, and the evidence claim. A good explanation, a clean API, and one successful rollout are different kinds of evidence, and the section should keep them distinct.

Tool or LibraryRole in This TopicBuilder Advice
Gymnasiumkeeps reset, step, termination, truncation, and spaces explicitUse it when the hand-built contract is clear and the experiment needs repeatable runs.
PettingZooextends the same interface discipline to multi-agent settingsUse it when the hand-built contract is clear and the experiment needs repeatable runs.
ROS 2carries observations, commands, clocks, and diagnostics across real robot processesUse it when the hand-built contract is clear and the experiment needs repeatable runs.

For Markov decision processes; Bellman equations, a robust implementation starts with one inspectable baseline whose artifact records observations, actions, units, timestamps, seeds, termination reasons, and the perturbation applied. The maintained-tool version is useful only if it preserves that schema and lets the comparison remain construct-matched.

  1. Write a one-paragraph task contract with observation, action, success, failure, and safety fields.
  2. Choose the smallest simulator, dataset, or wrapper that exposes the contract faithfully.
  3. Run one deterministic smoke test and one perturbation test before scaling.
  4. Save one artifact containing configuration, seed, metrics, traces, and failure labels.
  5. Compare methods only when the same script evaluates the same panel, split, seed set, and metric.

When an MDP model fails, avoid labeling the whole method as weak. First assign the failure to state definition, transition modeling, reward specification, discount choice, sampling coverage, or value-function approximation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Key Takeaway

MDPs and Bellman equations give a clean model for closed-loop value, but only when the state representation carries the information future prediction needs.

Exercise 2.6.1

Define a three-state MDP for a robot approaching a charging dock. Write the transition and reward for each action.

What's Next?

Section 2.7 introduces partially observable MDPs and belief states for hidden state reasoning.

Bibliography & Further Reading

Foundational References For This Section

Bellman, R.. "A Markovian Decision Process." (1957). https://doi.org/10.1515/9781400835386-007

The mathematical origin of the state, action, transition, and reward framing.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R.. "Planning and acting in partially observable stochastic domains." (1998). https://www.sciencedirect.com/science/article/pii/S000437029800023X

A foundational POMDP reference for belief-state reasoning under partial observability.

Farama Foundation. "Gymnasium Documentation." (2024). https://gymnasium.farama.org/

The maintained reference for reset, step, spaces, termination, truncation, wrappers, and reproducible environments.