A Careful Control Loop
Policies and value functions split behavior from forecasting. This matters in embodied systems because a robot can choose a locally attractive action while still entering a low-value physical situation.
This section links back to Chapter 7: Control for AI Practitioners and Chapter 10: Environments with Gymnasium and PettingZoo, then prepares the policy-gradient work in Chapter 15: Policy Gradient Methods and PPO. The central distinction is simple: a policy chooses actions, while a value function estimates the future those actions create.
This section develops the technical contract for policies and value functions. First we define stochastic and deterministic policies, then we connect state values and action values through Bellman equations, then we run a small policy-evaluation example.
The key question is practical: when an embodied agent stands in a state with several possible actions, how does it separate "what I will do" from "how good this situation is if I keep doing it"?
The policy is the behavior contract; the value function is the forecast attached to that behavior. Confusing the two leads to brittle robot systems that know which action looks best locally but cannot estimate the future cost of that choice.
Theory
A policy maps information to actions. In a fully observed MDP, a stochastic policy is written $\pi(a\mid s)$, the probability of choosing action $a$ in state $s$. A deterministic policy is the special case that puts all probability on one action, usually written $a=\pi(s)$. In a partially observed robot task, the implemented policy often uses $\pi(a\mid o)$ or $\pi(a\mid \hat s)$, where $o$ is the raw observation and $\hat s$ is the state estimate produced by perception and filtering.
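The difference between sampling from a stochastic policy and acting deterministically can be sketched in a few lines. The state names, action names, and probabilities below are hypothetical and chosen only for illustration; `most_likely_action` is one simple way to derive a deterministic policy from a stochastic table, not the only one.

```python
import random

# Hypothetical tabular policy: each state maps to a distribution over actions.
stochastic_policy = {
    "clear": {"forward": 0.9, "turn": 0.1},
    "blocked": {"forward": 0.05, "turn": 0.95},
}


def sample_action(policy, state, rng):
    """Draw one action a ~ pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def most_likely_action(policy, state):
    """Deterministic counterpart: always return the highest-probability action."""
    return max(policy[state], key=policy[state].get)


rng = random.Random(0)  # seeded so the draws are repeatable
draws = [sample_action(stochastic_policy, "clear", rng) for _ in range(1000)]
print("fraction forward:", draws.count("forward") / len(draws))
print("deterministic choice:", most_likely_action(stochastic_policy, "clear"))
```

On hardware, the sampled version can occasionally pick the low-probability action, which is why stochasticity deserves its own safety check.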
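The state-value function for a fixed policy is the expected return from state $s$, where the return $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ is the discounted sum of future rewards and $\gamma$ is the discount factor: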
$$V^\pi(s)=\mathbb E_\pi[G_t\mid S_t=s].$$
The action-value function asks the same question after forcing the first action:
$$Q^\pi(s,a)=\mathbb E_\pi[G_t\mid S_t=s,A_t=a].$$
These functions obey Bellman expectation equations because the return can be split into immediate reward plus discounted future value:
$$V^\pi(s)=\sum_a \pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left[R(s,a,s')+\gamma V^\pi(s')\right].$$
The equation is recursive, not circular. It says the value of a state under policy $\pi$ equals the average, weighted by the policy's action probabilities and the transition probabilities, of the immediate reward plus the discounted value of the next state. For embodied systems, the hidden assumption is strong: the state or belief must contain enough physical information for the transition model to be meaningful.
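The link between state values and action values can be written as two one-step identities. They follow from the definitions of $V^\pi$ and $Q^\pi$ above and use the same symbols; they are stated here as a sketch rather than derived in full:

$$V^\pi(s)=\sum_a \pi(a\mid s)\,Q^\pi(s,a),\qquad Q^\pi(s,a)=\sum_{s'}P(s'\mid s,a)\left[R(s,a,s')+\gamma V^\pi(s')\right].$$

Substituting the second identity into the first recovers the Bellman expectation equation for $V^\pi$.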
Policy evaluation estimates $V^\pi$ while keeping the policy fixed. Policy improvement changes $\pi$ using those estimates. Separating the two lets a builder debug whether the failure is poor forecasting, poor action selection, or poor state estimation.
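The evaluation/improvement split can be sketched as two separate functions. The example reuses the charging-robot table from the Worked Example below; the function names `evaluate` and `improve` are hypothetical, transitions are assumed deterministic, and the improvement step shown is plain greedy selection on the one-step lookahead, which is one common choice among several.

```python
# Two-state MDP with deterministic transitions:
# (state, action) -> (next_state, reward).
gamma = 0.8
transitions = {
    ("low", "recharge"): ("ready", 1.0),
    ("low", "inspect"): ("low", -2.0),
    ("ready", "recharge"): ("ready", 0.2),
    ("ready", "inspect"): ("low", 3.0),
}
policy = {
    "low": {"recharge": 0.9, "inspect": 0.1},
    "ready": {"recharge": 0.2, "inspect": 0.8},
}


def evaluate(policy, sweeps=200):
    """Policy evaluation: repeated Bellman expectation backups, policy held fixed."""
    values = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        new_values = {}
        for s in policy:
            total = 0.0
            for a, prob in policy[s].items():
                next_state, reward = transitions[(s, a)]
                total += prob * (reward + gamma * values[next_state])
            new_values[s] = total
        values = new_values
    return values


def improve(policy, values):
    """Policy improvement: act greedily on the one-step lookahead Q(s, a)."""
    greedy = {}
    for s in policy:
        q = {}
        for a in policy[s]:
            next_state, reward = transitions[(s, a)]
            q[a] = reward + gamma * values[next_state]
        best = max(q, key=q.get)
        greedy[s] = {a: (1.0 if a == best else 0.0) for a in policy[s]}
    return greedy


old_values = evaluate(policy)
greedy = improve(policy, old_values)
new_values = evaluate(greedy)
for s in policy:
    print(f"{s}: V_old = {old_values[s]:.3f}, V_greedy = {new_values[s]:.3f}")
```

Keeping the two functions separate makes it possible to test the forecast on its own before letting it change any behavior.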
Worked Example
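Code Fragment 1 evaluates a fixed policy on a two-state MDP for a charging robot. The state `low` means the battery is low, and the state `ready` means the robot can inspect objects; the policy recharges aggressively when low and mostly inspects when ready. Each state-action pair leads to exactly one next state, so the sum over $s'$ in the Bellman equation collapses to a single term.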
```python
# Evaluate a fixed policy with Bellman expectation backups.
# The values forecast long-run return without changing the policy.
states = ["low", "ready"]
gamma = 0.8
policy = {
    "low": {"recharge": 0.9, "inspect": 0.1},
    "ready": {"recharge": 0.2, "inspect": 0.8},
}
transitions = {
    ("low", "recharge"): ("ready", 1.0),
    ("low", "inspect"): ("low", -2.0),
    ("ready", "recharge"): ("ready", 0.2),
    ("ready", "inspect"): ("low", 3.0),
}
values = {state: 0.0 for state in states}
for _ in range(8):
    new_values = {}
    for state in states:
        total = 0.0
        for action, prob in policy[state].items():
            next_state, reward = transitions[(state, action)]
            total += prob * (reward + gamma * values[next_state])
        new_values[state] = total
    values = new_values
for state in states:
    print(f"V({state}) = {values[state]:.3f}")
```
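The numbers are not labels supplied by a dataset. They are forecasts computed from the policy, the transition table, and the rewards, and they become self-consistent as the sweeps converge. Eight sweeps is only a start: each sweep shrinks the worst-case error by a factor of $\gamma=0.8$, so about $0.8^8\approx 0.17$ of the initial error can remain. If the policy changed, the Bellman equations would describe a different behavior and the values would need to be recomputed.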
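Because the policy is fixed, the Bellman equations are linear in the unknown values, so the two-state case can also be solved exactly and compared against the iterative sweeps. The sketch below assumes the same table as Code Fragment 1 and solves $(I-\gamma P_\pi)V=r_\pi$ with Cramer's rule; the names `r_pi`, `p_pi`, and `sweep_values` are hypothetical helpers introduced here.

```python
gamma = 0.8
states = ["low", "ready"]
policy = {
    "low": {"recharge": 0.9, "inspect": 0.1},
    "ready": {"recharge": 0.2, "inspect": 0.8},
}
transitions = {
    ("low", "recharge"): ("ready", 1.0),
    ("low", "inspect"): ("low", -2.0),
    ("ready", "recharge"): ("ready", 0.2),
    ("ready", "inspect"): ("low", 3.0),
}

# Policy-averaged reward r_pi(s) and state-to-state matrix P_pi(s, s').
r_pi = {s: 0.0 for s in states}
p_pi = {s: {t: 0.0 for t in states} for s in states}
for s in states:
    for action, prob in policy[s].items():
        next_state, reward = transitions[(s, action)]
        r_pi[s] += prob * reward
        p_pi[s][next_state] += prob

# Solve (I - gamma * P_pi) V = r_pi for two states with Cramer's rule.
a11 = 1.0 - gamma * p_pi["low"]["low"]
a12 = -gamma * p_pi["low"]["ready"]
a21 = -gamma * p_pi["ready"]["low"]
a22 = 1.0 - gamma * p_pi["ready"]["ready"]
det = a11 * a22 - a12 * a21
exact = {
    "low": (r_pi["low"] * a22 - a12 * r_pi["ready"]) / det,
    "ready": (a11 * r_pi["ready"] - a21 * r_pi["low"]) / det,
}


def sweep_values(num_sweeps):
    """Iterative policy evaluation starting from all-zero values."""
    values = {s: 0.0 for s in states}
    for _ in range(num_sweeps):
        values = {
            s: r_pi[s] + gamma * sum(p_pi[s][t] * values[t] for t in states)
            for s in states
        }
    return values


def max_gap(values):
    """Largest absolute difference from the exact solution."""
    return max(abs(values[s] - exact[s]) for s in states)


gap_8 = max_gap(sweep_values(8))
gap_100 = max_gap(sweep_values(100))
print(f"exact: V(low) = {exact['low']:.3f}, V(ready) = {exact['ready']:.3f}")
print(f"max gap after 8 sweeps: {gap_8:.3f}, after 100 sweeps: {gap_100:.2e}")
```

The exact values are about 7.515 for `low` and 8.631 for `ready`. The gap after a handful of sweeps is a useful reminder that a short evaluation loop reports a partially converged forecast.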
In practical experiments, CleanRL and Stable-Baselines3 keep separate objects for policy networks, value networks, rollout storage, and evaluation logs. That separation mirrors the formal distinction here: action selection and value forecasting are related, but they are not the same object.
Practical Recipe
- Write the observation, action, and success metric before choosing a model.
- Build a baseline that is simple enough to debug by inspection.
- Add the library implementation only after the baseline behavior is understood.
- Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
- Run at least one perturbation test before trusting the result.
A common mistake with policies and value functions is to celebrate a component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.
A mobile robot that scores grasps with $Q(s,a)$ should log the chosen action and the next-state value separately. If the gripper repeatedly chooses risky reaches with high immediate reward but poor recovery value, the issue is visible in $Q$ and $V$ before it becomes a hardware incident.
A policy is a steering wheel. A value function is the road sign that says what the road ahead is likely to cost if you keep steering this way.
A major frontier is learning value functions that remain calibrated under distribution shift, especially when vision-language-action policies operate far outside their demonstrations. In robot learning, a value estimate that is numerically precise but physically miscalibrated can make unsafe actions look deceptively attractive.
Can you say whether a quantity answers "what action will I take" or "what return should I expect"? If not, you are mixing policy notation with value notation.
Bellman equations are useful because they turn a long-horizon prediction into repeated one-step backups. The backup is only as good as the state representation it conditions on. If a value function sees a camera frame but not the gripper force that predicts slip, it may assign the same value to physically different situations.
For policy-gradient methods in the next chapter, the value function often becomes a baseline or critic. That critic is not an optional decoration: it controls variance, estimates advantage, and can quietly mislead the update when trained on the wrong rollout distribution.
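As a preview of the critic's role, a one-step advantage estimate can be formed from a reward and two value predictions. This is a minimal sketch of the temporal-difference form $r+\gamma V(s')-V(s)$; the function name and the numbers are hypothetical, and practical implementations often use multi-step variants instead.

```python
def one_step_advantage(reward, value, next_value, gamma, terminated):
    """TD-error advantage estimate: r + gamma * V(s') - V(s).

    No bootstrap is added when the transition ended in a true terminal state.
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap - value


# Hypothetical critic outputs for one transition.
advantage = one_step_advantage(
    reward=1.0, value=2.0, next_value=5.0, gamma=0.8, terminated=False
)
terminal_advantage = one_step_advantage(
    reward=1.0, value=2.0, next_value=5.0, gamma=0.8, terminated=True
)
print(f"advantage = {advantage:.2f}, at terminal step = {terminal_advantage:.2f}")
```

If the critic's `value` or `next_value` is wrong for the states the current policy visits, the sign of the advantage can flip, which is how a stale critic misleads the update.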
| Object | Question It Answers | Embodied Check |
|---|---|---|
| $\pi(a\mid s)$ | Which action distribution will the agent use? | Check action limits, latency, and whether stochasticity is safe on hardware. |
| $V^\pi(s)$ | How much return follows from this state under this policy? | Check whether the state contains contact, pose, and task progress signals. |
| $Q^\pi(s,a)$ | How much return follows if this action is taken first? | Check whether risky actions are represented separately from safe actions. |
A robust implementation keeps policy outputs, value targets, and evaluation returns separate in logs. This prevents a common confusion where the action selected by the policy is mistaken for evidence that the value estimate was correct.
- Define whether the policy receives $s$, $o$, or $\hat s$.
- Log action probabilities or deterministic actions for each decision.
- Compute Monte Carlo returns from traces before fitting a value network.
- Compare Bellman targets and observed returns on a small batch.
- Audit high-value states for physical plausibility with video or state logs.
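The Monte Carlo step in the checklist above can be sketched as a backward pass over one episode's reward trace. The function name and example rewards are hypothetical, and the sketch assumes the trace ends at a true terminal state; a trace cut off by a time limit would understate the return.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t of a single finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step trace.
trace_rewards = [1.0, 3.0, 1.0]
trace_returns = discounted_returns(trace_rewards, gamma=0.5)
print(trace_returns)  # [2.75, 3.5, 1.0]
```

These per-step returns are the regression targets a value network would be compared against before any bootstrapped targets are trusted.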
When a value function fails, compare three quantities for the same episodes: predicted value at the start, realized discounted return, and the first few Bellman targets. Large disagreement can indicate reward bugs, truncation bugs, distribution shift, or missing state information.
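The three-way comparison can be sketched as a small report over one episode. All names and numbers here are hypothetical. The sketch assumes a `terminated` flag per step that is true only for real terminal states, so a step that was merely truncated by a time limit still bootstraps from the next-state value.

```python
def discounted_return(rewards, gamma):
    """Realized discounted return from the first step of a reward trace."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bellman_targets(rewards, next_values, terminated, gamma):
    """One-step targets r_t + gamma * V(s_{t+1}), with no bootstrap after termination."""
    return [
        reward + (0.0 if done else gamma * next_value)
        for reward, next_value, done in zip(rewards, next_values, terminated)
    ]


def value_report(predicted, rewards, next_values, terminated, gamma):
    """Compare predicted start value, realized return, and per-step Bellman targets."""
    targets = bellman_targets(rewards, next_values, terminated, gamma)
    realized = discounted_return(rewards, gamma)
    return {
        "predicted_start": predicted[0],
        "realized_return": realized,
        "start_gap": predicted[0] - realized,
        "targets": targets,
        "td_errors": [target - value for target, value in zip(targets, predicted)],
    }


# Hypothetical three-step episode that ends in a true terminal state.
rewards = [1.0, 3.0, 1.0]
predicted = [3.0, 3.0, 1.0]    # value model outputs for s_0, s_1, s_2
next_values = [3.0, 1.0, 0.0]  # outputs for s_1, s_2, and a placeholder after the end
terminated = [False, False, True]
report = value_report(predicted, rewards, next_values, terminated, gamma=0.5)
for key, item in report.items():
    print(key, item)
```

A large `start_gap` with small `td_errors` suggests the value model is consistent with itself but wrong about the long-run outcome, which points toward reward, truncation, or distribution problems instead of a simple fitting error.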
Evaluate action selection and value calibration from the same rollout artifact. That artifact should contain the policy checkpoint, value checkpoint, states or observations, action probabilities, rewards, returns, and termination and truncation labels.
A policy says what the agent will do. A value function says what that behavior is expected to achieve.
Define a two-state MDP for a mobile robot, choose a fixed stochastic policy, and write the two Bellman equations for $V^\pi$. Solve them by iteration or linear algebra.
What's Next?
This section separated behavior from forecasting. Next, Section 14.3 asks how the policy should gather enough evidence to make those forecasts reliable.
Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
The standard textbook for RL foundations. Read Part I for MDPs, value functions, and the Bellman equations; Part II for TD learning and eligibility traces; Part III for function approximation and policy gradient theory. It is the primary notation reference for this module.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
Provides the formal mathematical treatment of MDPs, Bellman equations, and the theory of optimal policies. Read Chapter 4 for policy evaluation and Chapter 6 for policy iteration; this is the reference to check when the intuitions from Sutton and Barto need formal grounding in existence and convergence proofs.
Brockman, G. et al. (2016). OpenAI Gym. arXiv.
Introduced the step/reset/render environment interface that became the standard for RL research. Read for the API contract; nearly every RL library and tutorial assumes this interface, and Gymnasium maintains it with minor extensions. Understanding it is prerequisite to using PettingZoo, Isaac Lab, or MuJoCo.
Towers, M. et al. Gymnasium documentation. Farama Foundation.
The actively maintained successor to OpenAI Gym with bug fixes, consistent seeding, and terminated/truncated distinction. Use this as the environment API reference throughout the chapter; the terminated/truncated split matters for bootstrap targets at episode boundaries.
Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. IROS.
Describes the contact physics model, generalized coordinates, and constraint solver that make MuJoCo accurate and fast for robot learning. Read the original paper to understand why smooth contact gradients benefit model-based methods; in practice use the official docs for API, but this paper explains why MuJoCo physics behaves differently from game-engine simulators.