A Careful Control Loop
Partially observable MDPs (POMDPs) extend MDPs to the case where the agent cannot see the full state. The agent receives observations, maintains a probability distribution over possible states (its belief state), and chooses actions using that uncertainty rather than pretending the latest sensor packet is complete.
This section develops the contract for decision making when the current observation is not enough. A POMDP keeps the MDP pieces from Section 2.6, then adds an observation model and a belief state. The belief state is not a guess pasted onto the policy. It is the input the policy actually receives.
The practical question is: which hidden variables matter for action, what observations give evidence about them, and when should the agent spend an action to gather information rather than rush toward the goal?
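One way to make the "gather information or act now" question concrete is to compare the expected value of acting on the current belief with the expected value of probing first. The sketch below assumes a static hidden state during the probe; the rewards, probe cost, and probe likelihoods are hypothetical numbers chosen only for illustration, and the function names are not from any library.

```python
# Hypothetical numbers throughout: a sketch of a one-step value-of-information check.
belief = {"dry": 0.6, "slippery": 0.4}

GRASP_REWARD = {"dry": 1.0, "slippery": -2.0}  # assumed payoff of grasping in each hidden state
ABORT_REWARD = 0.0                             # assumed payoff of the safe fallback
PROBE_COST = 0.1                               # assumed cost of spending one action on sensing

# Assumed probe observation model: P(observation | hidden state).
PROBE_MODEL = {
    "force_spike": {"dry": 0.15, "slippery": 0.80},
    "no_spike": {"dry": 0.85, "slippery": 0.20},
}


def best_immediate_value(b):
    """Expected value of the better of 'grasp' and 'abort' under belief b."""
    grasp_value = sum(b[s] * GRASP_REWARD[s] for s in b)
    return max(grasp_value, ABORT_REWARD)


def value_of_probing(b):
    """Expected value of probing once, then choosing the best action on the posterior."""
    total = -PROBE_COST
    for likelihood in PROBE_MODEL.values():
        unnormalized = {s: likelihood[s] * b[s] for s in b}
        prob_obs = sum(unnormalized.values())  # P(observation) under belief b
        posterior = {s: v / prob_obs for s, v in unnormalized.items()}
        total += prob_obs * best_immediate_value(posterior)
    return total


act_now = best_immediate_value(belief)
probe_first = value_of_probing(belief)
print("act now:", round(act_now, 3), "probe first:", round(probe_first, 3))
```

With these assumed numbers, probing comes out ahead (about 0.25 versus 0.0) because the observation separates the case where grasping pays off from the case where it does not. A cheaper grasp penalty or a noisier probe would shift the balance back toward acting immediately.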
A belief state earns its place when it changes what the robot does under uncertainty. If two histories produce the same camera frame but different slip risk, the belief should help the policy choose whether to grasp, slow down, or reobserve.
Theory
A POMDP is often written as $(\mathcal{S}, \mathcal{A}, P, R, \Omega, O, \gamma)$. The new pieces are $\Omega$, the observation space, and $O(o|s',a)$, the probability of receiving observation $o$ after action $a$ has brought the world to state $s'$. Because the state itself is not observed, the agent acts on a belief $b_t(s)$, a probability distribution over states.
A one-step belief update has the form $$b_{t+1}(s') = \eta O(o_{t+1}|s',a_t)\sum_s P(s'|s,a_t)b_t(s),$$ where $\eta$ normalizes the probabilities so they sum to one. The summation predicts where the hidden state could move after the action. The observation likelihood then favors states that make the new sensor reading plausible.
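It can help to write the same update as two separate steps, so that the intermediate quantity can be logged and inspected. The symbol $\bar{b}_{t+1}$ for the predicted belief is notation introduced here for illustration; it is not part of the POMDP tuple.

$$\bar{b}_{t+1}(s') = \sum_s P(s'|s,a_t)\,b_t(s) \quad \text{(predict)}$$

$$b_{t+1}(s') = \eta\,O(o_{t+1}|s',a_t)\,\bar{b}_{t+1}(s'), \qquad \eta = \left[\sum_{s'} O(o_{t+1}|s',a_t)\,\bar{b}_{t+1}(s')\right]^{-1} \quad \text{(correct)}$$

Substituting the first line into the second recovers the one-step form above.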
The mechanism is predict, observe, correct, then act. Prediction uses the transition model to move the old belief forward. Correction uses the observation model to reweight possible states. The resulting belief should be logged because it is the only way to know whether the agent acted from calibrated uncertainty or from a stale guess.
Worked Example
Code Fragment 2.7.1 implements the belief update for a two-state contact problem. A force spike is weakly compatible with a dry surface and strongly compatible with a slippery surface, so the posterior belief should shift toward slip risk.
# Section 2.7: update a belief state from a force observation.
# Predict hidden contact state, then reweight it by observation likelihood.
states = ["dry_surface", "slippery_surface"]
belief = {"dry_surface": 0.70, "slippery_surface": 0.30}
transition = {
    "dry_surface": {"dry_surface": 0.85, "slippery_surface": 0.15},
    "slippery_surface": {"dry_surface": 0.20, "slippery_surface": 0.80},
}
likelihood_force_spike = {"dry_surface": 0.15, "slippery_surface": 0.80}
predicted = {
    next_state: sum(transition[state][next_state] * belief[state] for state in states)
    for next_state in states
}
unnormalized = {
    state: likelihood_force_spike[state] * predicted[state]
    for state in states
}
normalizer = sum(unnormalized.values())
posterior = {state: unnormalized[state] / normalizer for state in states}
print({state: round(probability, 3) for state, probability in posterior.items()})
The update shifts belief toward slippery_surface, which is the action-relevant hidden variable.
Expected output: {'dry_surface': 0.263, 'slippery_surface': 0.737}. The posterior puts most of its probability on slippery_surface. The important teaching point is not the exact number alone, but the trace from prior belief to predicted belief to observation-corrected belief.
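The trace can be checked by hand from the numbers in the code fragment. In the arithmetic below, $\bar{b}$ denotes the predicted belief before the observation is applied; that symbol is introduced here only for the worked calculation.

$$\begin{aligned}
\bar{b}(\text{dry}) &= 0.85 \cdot 0.70 + 0.20 \cdot 0.30 = 0.655, \\
\bar{b}(\text{slippery}) &= 0.15 \cdot 0.70 + 0.80 \cdot 0.30 = 0.345, \\
O \cdot \bar{b}(\text{dry}) &= 0.15 \cdot 0.655 = 0.09825, \\
O \cdot \bar{b}(\text{slippery}) &= 0.80 \cdot 0.345 = 0.276, \\
\eta^{-1} &= 0.09825 + 0.276 = 0.37425, \\
b'(\text{dry}) &= 0.09825 / 0.37425 \approx 0.263, \qquad b'(\text{slippery}) = 0.276 / 0.37425 \approx 0.737.
\end{aligned}$$

The prediction step alone moves the slip probability only from 0.30 to 0.345; most of the shift comes from the force-spike likelihood.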
The from-scratch fragment is for understanding. In a practical system, a discrete POMDP solver, a particle filter, or a ROS 2 state-estimation node can handle likelihood bookkeeping, normalization, resampling, and diagnostic publishing. The shortcut removes boilerplate, but the engineer still must define the hidden state, observation model, and action that uses the belief.
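To show what the particle-filter route looks like on the same two-state problem, here is a minimal sketch using only the Python standard library. It is not the API of any particular solver or ROS 2 package; the function names, the particle count, and the fixed seed are assumptions made for illustration. The transition and likelihood tables are copied from the worked example.

```python
import random

STATES = ["dry_surface", "slippery_surface"]
TRANSITION = {
    "dry_surface": {"dry_surface": 0.85, "slippery_surface": 0.15},
    "slippery_surface": {"dry_surface": 0.20, "slippery_surface": 0.80},
}
LIKELIHOOD_FORCE_SPIKE = {"dry_surface": 0.15, "slippery_surface": 0.80}


def sample_next_state(state, rng):
    """Predict: draw one successor state from the transition model."""
    weights = [TRANSITION[state][next_state] for next_state in STATES]
    return rng.choices(STATES, weights=weights)[0]


def particle_filter_step(particles, rng):
    """One predict, weight, resample cycle for a force-spike observation."""
    predicted = [sample_next_state(p, rng) for p in particles]
    weights = [LIKELIHOOD_FORCE_SPIKE[p] for p in predicted]
    # Resample with replacement in proportion to the observation likelihood.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate_belief(particles):
    """Approximate the belief by the fraction of particles in each state."""
    return {s: particles.count(s) / len(particles) for s in STATES}


rng = random.Random(0)  # fixed seed so the run is repeatable
num_particles = 20000   # assumed; more particles reduce sampling noise
particles = rng.choices(STATES, weights=[0.70, 0.30], k=num_particles)
particles = particle_filter_step(particles, rng)
pf_posterior = estimate_belief(particles)
print({s: round(p, 3) for s, p in pf_posterior.items()})
```

The estimate should land close to the exact posterior from the table-based update, with small sampling error. In a two-state problem the exact update is cheaper and better; particles become worthwhile when the hidden state is continuous or too large to enumerate.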
Practical Recipe
- List the hidden variables that change action choice.
- Specify the observation likelihood for each hidden variable, even if it starts as an approximate table.
- Log prior belief, predicted belief, posterior belief, chosen action, and observation timestamp.
- Test information-gathering actions separately from task-completion actions.
- Evaluate calibration under ambiguous observations, occlusion, delayed sensors, and contact changes.
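The logging item in the recipe above can be sketched as a small record type. The class name, field names, timestamp, and chosen action are hypothetical; the belief values are rounded from the two-state contact example. The only check performed is that each logged belief sums to one.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BeliefLogRecord:
    """One row of a belief trace (hypothetical schema for illustration)."""
    observation_timestamp: float
    prior: dict
    predicted: dict
    posterior: dict
    action: str

    def validate(self, tolerance=1e-6):
        """Raise if any logged belief is not a normalized distribution."""
        for name in ("prior", "predicted", "posterior"):
            total = sum(getattr(self, name).values())
            if abs(total - 1.0) > tolerance:
                raise ValueError(f"{name} sums to {total}, expected 1.0")


record = BeliefLogRecord(
    observation_timestamp=12.50,  # hypothetical time in seconds
    prior={"dry_surface": 0.70, "slippery_surface": 0.30},
    predicted={"dry_surface": 0.655, "slippery_surface": 0.345},
    posterior={"dry_surface": 0.263, "slippery_surface": 0.737},
    action="slow_down",           # hypothetical action label
)
record.validate()
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```

Writing one JSON line per control step keeps the trace easy to append to and easy to load later for calibration checks.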
The common mistake is to pass a point estimate to the policy and call it a belief. If the logger cannot show uncertainty, the policy cannot distinguish "the block is safe to grasp" from "the block might be safe, but the evidence is weak."
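The difference between a point estimate and a belief shows up as soon as two beliefs share the same most-likely state. The sketch below is illustrative only: the state names, the thresholds, and the three action labels are assumptions, not values from a real system.

```python
def point_estimate(belief):
    """Collapse a belief to its single most likely state."""
    return max(belief, key=belief.get)


def choose_action(belief, grasp_threshold=0.90, slow_threshold=0.70):
    """Pick an action from the full belief (thresholds are hypothetical)."""
    p_safe = belief["safe"]
    if p_safe >= grasp_threshold:
        return "grasp"
    if p_safe >= slow_threshold:
        return "slow_down"
    return "reobserve"


strong_evidence = {"safe": 0.97, "unsafe": 0.03}
weak_evidence = {"safe": 0.55, "unsafe": 0.45}

for name, belief in [("strong", strong_evidence), ("weak", weak_evidence)]:
    print(name, point_estimate(belief), choose_action(belief))
```

Both beliefs collapse to the point estimate "safe", yet the belief-aware policy grasps in the first case and reobserves in the second. A policy that only receives the point estimate has no way to make that distinction.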
A mobile manipulation team encountered the same camera frame in two histories: one where a person had just walked behind the robot and one where the corridor had been empty for several seconds. A belief state over nearby human motion let the planner wait and reobserve in the first case, while continuing in the second.
A belief state is not what happened. It is the agent's best spreadsheet about what might have happened.
Learned world models and recurrent robot policies can act like implicit belief trackers, but their uncertainty is often hard to inspect. A useful research direction is making those latent beliefs legible enough for safety monitors, recovery policies, and post-failure audits.
Can you name the hidden state, the observation likelihood, the belief update, and the action that should change when uncertainty is high?
A POMDP with belief states becomes useful when it is tied to a closed-loop contract between policy, world, evaluator, and safety constraints. The contract names the observation stream, the action representation, the timing budget, the safety boundary, and the result artifact. That is the bridge between a readable concept and a system a skeptical builder can test.
For POMDPs and belief states, separate the conceptual claim, the systems claim, and the evidence claim. A good explanation, a clean API, and one successful rollout are different kinds of evidence, and the section should keep them distinct.
| Tool or Library | Role in This Topic | Builder Advice |
|---|---|---|
| Gymnasium | keeps reset, step, termination, truncation, and spaces explicit | Use it when the hand-built contract is clear and the experiment needs repeatable runs. |
| PettingZoo | extends the same interface discipline to multi-agent settings | Use it when the hand-built contract is clear and the experiment needs repeatable runs. |
| ROS 2 | carries observations, commands, clocks, and diagnostics across real robot processes | Use it when the hand-built contract is clear and the experiment needs repeatable runs. |
For a belief-state agent, a robust implementation starts with one inspectable baseline whose artifact records observations, actions, units, timestamps, seeds, termination reasons, and the perturbation applied. The maintained-tool version is useful only if it preserves that schema and lets the comparison remain construct-matched.
- Write a one-paragraph task contract with observation, action, success, failure, and safety fields.
- Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
- Run one deterministic smoke test and one perturbation test before scaling.
- Save one artifact containing configuration, seed, metrics, traces, and failure labels.
- Compare methods only when the same script evaluates the same panel, split, seed set, and metric.
When a belief-state agent fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.
Hands-On Lab: Build a Section Evidence Trace
Objective
Turn the belief-state ideas from this section into a small artifact that compares a hand-built baseline with a maintained-tool shortcut under one perturbation.
What You'll Practice
- Define an observation, action, metric, and perturbation contract
- Build a minimal baseline trace
- Preserve the same schema for the library shortcut
- Write a failure postmortem from the evidence record
Setup
pip install numpy pandas
Steps
Step 1: Define the contract
Write the fields that make two runs comparable.
Step 2: Record the baseline
Save one deterministic result before adding noise or latency.
Step 3: Add the shortcut
Run or sketch the maintained-tool version while keeping the artifact schema fixed.
Step 4: Apply one perturbation
Change exactly one condition and preserve the same logging fields.
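A small check can enforce the "exactly one condition" rule before a perturbed run is accepted into the comparison. This is a sketch under assumptions: the field names (`sensor_latency_ms`, `observation_noise_std`) and the helper `changed_conditions` are hypothetical, and `success` is left as `None` because no result has been measured here.

```python
def changed_conditions(baseline, perturbed, condition_fields):
    """Return the condition fields whose values differ between two run records."""
    if baseline.keys() != perturbed.keys():
        raise ValueError("schema mismatch: runs do not share the same logging fields")
    return [field for field in condition_fields if baseline[field] != perturbed[field]]


# Hypothetical run records; 'success' is a placeholder until the run is executed.
baseline_run = {
    "run": "baseline",
    "seed": 0,
    "sensor_latency_ms": 0,
    "observation_noise_std": 0.0,
    "success": None,
}
perturbed_run = {
    "run": "baseline_perturbed",
    "seed": 0,
    "sensor_latency_ms": 50,
    "observation_noise_std": 0.0,
    "success": None,
}

CONDITION_FIELDS = ["seed", "sensor_latency_ms", "observation_noise_std"]
changed = changed_conditions(baseline_run, perturbed_run, CONDITION_FIELDS)
if len(changed) != 1:
    raise ValueError(f"expected exactly one changed condition, got {changed}")
print("perturbed condition:", changed[0])
```

If a second condition drifts (for example, a different seed), the check fails and the comparison is flagged before any metric is reported.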
Expected Output
The completed lab produces one table with baseline, shortcut, and perturbed rows, plus a short note explaining which comparison is valid because all metrics were co-computed under one schema.
Stretch Goals
- Add a second seed and report mean and spread.
- Write a one-paragraph postmortem that separates root cause from symptom.
Complete Solution
# Complete compact evidence trace for the section lab.
# Extend these records with values produced by your actual environment or simulator.
import pandas as pd

records = [
    {"run": "baseline", "seed": 0, "success": 0.72, "failure_label": "none"},
    {"run": "library_shortcut", "seed": 0, "success": 0.78, "failure_label": "none"},
    {"run": "baseline_perturbed", "seed": 0, "success": 0.54, "failure_label": "latency"},
]
print(pd.DataFrame(records))

A POMDP turns hidden state into a maintained belief. That belief is useful when it changes action under uncertainty and leaves a trace that an evaluator can inspect.
Design a method-matched experiment for a POMDP with belief states. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.
What's Next?
Section 2.8 closes the chapter by explaining why embodiment is usually partially observable.
Bibliography & Further Reading
Foundational References For This Section
Bellman, R. "A Markovian Decision Process." (1957). https://doi.org/10.1515/9781400835386-007
The mathematical origin of the state, action, transition, and reward framing.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. "Planning and acting in partially observable stochastic domains." (1998). https://www.sciencedirect.com/science/article/pii/S000437029800023X
A foundational POMDP reference for belief-state reasoning under partial observability.
Farama Foundation. "Gymnasium Documentation." (2024). https://gymnasium.farama.org/
The maintained reference for reset, step, spaces, termination, truncation, wrappers, and reproducible environments.