A Careful Control Loop
The policy gradient theorem explains why we can improve a policy by multiplying the gradient of its log probability by a return signal. REINFORCE is the direct Monte Carlo version: sample trajectories, compute returns, and push up the probability of actions that appeared in successful trajectories.
The policy objective is still expected return, $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G(\tau)]$. The obstacle is that the trajectory distribution contains environment dynamics, contact physics, resets, sensor noise, and reward delays. REINFORCE works because it differentiates the probability of the sampled trajectory with respect to the policy, while treating the environment as a source of samples.
The key identity is the likelihood-ratio trick:
$$\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[G(\tau)\nabla_\theta \log p_\theta(\tau)\right].$$
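Because the initial-state and environment transition probabilities do not depend on $\theta$, their gradients vanish and only the policy terms remain. An action also cannot influence rewards collected before it was taken, so the full return $G(\tau)$ can be replaced by the return from step $t$ onward, $G_t$. The policy-dependent part becomes a sum of action log probabilities: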
$$\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)G_t\right].$$
The theorem does not require a differentiable robot, a differentiable simulator, or a differentiable reward sensor. It requires the policy to report the log probability of the action it sampled.
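The claim above can be checked numerically. The following is a minimal sketch, assuming a two-action softmax policy over logits and a hypothetical `black_box_reward` lookup standing in for a non-differentiable environment; neither comes from the text. It estimates the gradient from samples and log-probability scores only, then compares it with the analytic gradient for this toy case.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a short list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_reward(action):
    # Hypothetical non-differentiable reward: a lookup, not a formula in theta.
    return 1.0 if action == 1 else 0.0


def score_function_gradient(logits, num_samples, rng):
    probs = softmax(logits)
    grad = [0.0 for _ in logits]
    for _ in range(num_samples):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = black_box_reward(action)
        for k in range(len(logits)):
            # Softmax score: d log pi(action) / d logit_k = 1[k == action] - pi(k).
            score = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += reward * score / num_samples
    return grad


rng = random.Random(0)
logits = [0.0, 0.0]
estimate = score_function_gradient(logits, 20000, rng)

# Here J = pi(1), so the analytic gradient is [-p0 * p1, p1 * (1 - p1)].
exact = [-0.25, 0.25]
print("estimate:", [round(g, 3) for g in estimate], "exact:", exact)
```

The reward function is never differentiated; only the policy's own score appears in the estimate.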
Theory
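For a sampled step, $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ points in the direction that would make the sampled action more likely. Multiplying by $G_t$ says how strongly to move in that direction. A large positive return increases the action probability strongly; a negative return pushes it down. A small positive return still pushes the probability up, only weakly, which is one reason to subtract a baseline so that the sign reflects better or worse than expected.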
A baseline $b(s_t)$ can be subtracted from the return without biasing the expected gradient:
$$\mathbb{E}_{a\sim\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a\mid s)b(s)\right]=b(s)\nabla_\theta\sum_a \pi_\theta(a\mid s)=0.$$
This result is the reason value functions can serve as variance-reduction baselines. They change the noise of the update, not its target direction in expectation.
REINFORCE is an unbiased but noisy estimator. It waits until returns are known, assigns the same delayed evidence to many earlier sampled actions, and therefore needs many trajectories or a good baseline to become stable.
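To illustrate the noise claim, here is a small sketch under invented assumptions: a one-state, two-action policy with `prob_left = 0.25`, hypothetical mean returns of 11 and 10 with unit Gaussian noise, and a baseline equal to the expected return under the policy (10.25). It compares per-sample gradient terms with and without the baseline, using the same random draws for both.

```python
import random
import statistics


def sample_gradient_terms(baseline, num_samples, seed):
    rng = random.Random(seed)
    prob_left = 0.25
    terms = []
    for _ in range(num_samples):
        went_left = rng.random() < prob_left
        # Hypothetical returns: left is slightly better, both share a large offset.
        mean_return = 11.0 if went_left else 10.0
        return_value = mean_return + rng.gauss(0.0, 1.0)
        # Softmax score with respect to the "left" logit.
        score = (1.0 - prob_left) if went_left else (0.0 - prob_left)
        terms.append(score * (return_value - baseline))
    return terms


raw = sample_gradient_terms(baseline=0.0, num_samples=50000, seed=1)
centered = sample_gradient_terms(baseline=10.25, num_samples=50000, seed=1)

raw_mean, raw_var = statistics.mean(raw), statistics.pvariance(raw)
centered_mean, centered_var = statistics.mean(centered), statistics.pvariance(centered)
print("no baseline:   mean", round(raw_mean, 3), "variance", round(raw_var, 3))
print("with baseline: mean", round(centered_mean, 3), "variance", round(centered_var, 3))
```

Both means should sit near the same expected gradient (0.1875 for these numbers), while the variance of the baselined terms is far smaller. The exact ratio depends on the assumed return offset and noise level.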
Worked Example
Code Fragment 1 computes a tiny REINFORCE-style update for a two-action policy. The numbers show why a good return attached to a low-probability action produces a large learning signal.
```python
# Compute REINFORCE loss terms for sampled actions.
# Action probabilities and advantages are fixed numbers here; in a training
# loop the log probability is the term that stays differentiable.
import math

actions = ["left", "right", "right"]
action_probs = {"left": 0.25, "right": 0.75}
returns = [1.2, 0.4, -0.3]
baseline = 0.2

for action, return_value in zip(actions, returns):
    advantage = return_value - baseline
    log_prob = math.log(action_probs[action])
    loss_term = -log_prob * advantage
    print(action, "advantage:", round(advantage, 2), "loss term:", round(loss_term, 3))
```
`advantage` is the return after subtracting the baseline. The rare successful left action has the largest `loss_term`, which gives the optimizer a stronger reason to increase its future probability.

The exact gradient depends on the policy parameterization, but the sign and scale are already visible. The first action was unlikely and better than baseline, so it receives a strong positive update. The final action was likely but worse than baseline, so its probability should decrease.
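To make "depends on the policy parameterization" concrete, the sketch below assumes the two action probabilities come from a softmax over two logits; that parameterization is an assumption for illustration, not something the worked example fixes. It computes the gradient of each loss term with respect to the left logit, reusing the same three samples.

```python
# Assumed parameterization: softmax over two logits (z_left, z_right).
action_probs = {"left": 0.25, "right": 0.75}
samples = [("left", 1.2), ("right", 0.4), ("right", -0.3)]
baseline = 0.2


def loss_grad_wrt_left_logit(action, return_value):
    advantage = return_value - baseline
    # Softmax score: d log pi(action) / d z_left = 1[action == left] - pi(left).
    score = (1.0 if action == "left" else 0.0) - action_probs["left"]
    # The loss term is -log_prob * advantage, so its gradient is -score * advantage.
    return -score * advantage


grads = [loss_grad_wrt_left_logit(action, g) for action, g in samples]
mean_grad = sum(grads) / len(grads)
print([round(g, 3) for g in grads], "mean:", round(mean_grad, 3))
```

The mean loss gradient is negative, so a gradient-descent step raises the left logit. The successful left sample contributes most of that, and the below-baseline right sample pushes in the same direction by making right less likely.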
CleanRL is especially useful for studying this section because its single-file implementations expose log probabilities, returns, baselines, and losses without hiding the estimator. Stable-Baselines3 and RSL-RL package the same bookkeeping for larger experiments once the estimator is understood.
Practical Recipe
- Save the old log probability for every sampled action during rollout collection.
- Compute returns from rewards collected after the action, not from a reward prediction made before the action.
- Subtract a baseline or value estimate to reduce variance before multiplying by the log-probability gradient.
- Normalize advantages within a batch when reward scale changes across tasks or resets.
- Debug with a one-state bandit before applying the estimator to a contact-rich simulator.
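The return and normalization steps in the recipe above can be sketched as follows. The helper names `discounted_returns` and `normalize`, the discount of 0.9, and the example baseline values are illustrative assumptions.

```python
import statistics


def discounted_returns(rewards, gamma):
    # Reward-to-go: G_t = r_t + gamma * G_{t+1}, accumulated from the end of the episode.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values, eps=1e-8):
    # Shift to zero mean and scale to unit standard deviation within the batch.
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


rewards = [0.0, 0.0, 1.0]              # reward arrives only at the final step
returns = discounted_returns(rewards, gamma=0.9)
baseline_values = [0.5, 0.6, 0.7]      # hypothetical per-state baseline estimates
advantages = [g - b for g, b in zip(returns, baseline_values)]
normalized = normalize(advantages)

print("returns:", [round(g, 2) for g in returns])
print("normalized advantages:", [round(a, 3) for a in normalized])
```

Each return uses only rewards from its own step onward, which matches the second bullet. Normalization changes the scale of the update but not which samples rank above or below the batch mean.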
The estimator becomes misleading if the stored log probability does not match the action that actually reached the environment. This can happen after action clipping, invalid-action masking, controller overrides, or unit conversions between policy output and robot command.
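A simple logging check can surface this mismatch for the clipping case. The sketch assumes a one-dimensional Gaussian policy and a hypothetical actuator limit of 1.0; the numbers are invented for illustration.

```python
import math


def gaussian_log_prob(x, mean, std):
    # Log density of a univariate normal distribution.
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


policy_mean, policy_std = 0.8, 0.5
sampled_action = 1.6   # value drawn from the policy
action_limit = 1.0     # hypothetical actuator limit applied downstream
executed_action = max(-action_limit, min(action_limit, sampled_action))

stored_log_prob = gaussian_log_prob(sampled_action, policy_mean, policy_std)
density_at_executed = gaussian_log_prob(executed_action, policy_mean, policy_std)
was_modified = executed_action != sampled_action

print("stored log prob of sample:", round(stored_log_prob, 3))
print("Gaussian log density at executed command:", round(density_at_executed, 3))
print("action modified after sampling:", was_modified)
```

The two numbers differ substantially, so the rollout record has to say which action the stored log probability refers to. One common convention, offered here as an assumption rather than the text's recommendation, is to keep the log probability of the raw sample and treat the clip as part of the environment. Whatever convention is chosen, logging `was_modified` per step makes silent mismatches visible.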
For a drone landing task, REINFORCE can increase the probability of descent-rate commands that led to smooth touchdowns. A baseline is essential because wind gusts and sensor noise can make two identical commands receive different returns.
REINFORCE is the policy's accountability system: it asks what action the policy chose, how likely that action was, and whether the episode made that choice look wise.
The policy-gradient theorem still underlies many modern robot-learning systems, even when the policy is initialized from demonstrations, pretrained encoders, or large action datasets. Active research asks how to combine low-variance on-policy gradients with data reuse, safety constraints, and real-world sample efficiency.
Can you explain why subtracting a state-dependent baseline leaves the expected policy gradient unchanged? Can you also name one embodied system layer that could break the match between stored log probability and executed action?
The theorem is often written compactly, but the cancellation is the teaching point. A trajectory probability factors into an initial-state term, transition terms, and policy terms. The initial-state and transition terms may decide which data you see, but they do not contain $\theta$ if the policy parameters do not change the simulator or world dynamics directly.
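Written out, the cancellation looks like this. The symbols $\rho_0$ for the initial-state distribution and $P$ for the transition model are notation introduced here for illustration:

$$p_\theta(\tau)=\rho_0(s_0)\prod_t \pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)$$

$$\log p_\theta(\tau)=\log\rho_0(s_0)+\sum_t \log \pi_\theta(a_t\mid s_t)+\sum_t \log P(s_{t+1}\mid s_t,a_t)$$

$$\nabla_\theta \log p_\theta(\tau)=\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)$$

The first and third terms of the log probability carry no $\theta$, so their gradients are zero and only the policy terms survive.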
That is why the gradient can be estimated from sampled rollouts. The price is variance: one unlucky slip can assign a poor return to several reasonable earlier actions. Actor-critic methods and GAE in the next section keep the same log-probability mechanism while replacing raw Monte Carlo returns with more local advantage estimates.
| Term | What It Means | Training Role |
|---|---|---|
| $\log \pi_\theta(a_t\mid s_t)$ | How likely the policy made the sampled action. | The differentiable part used for the update. |
| $G_t$ | Return observed after the action. | Scales whether the sampled action should become more likely. |
| $b(s_t)$ | State-dependent baseline. | Reduces variance because its expected score contribution is zero. |
| $G_t-b(s_t)$ | Advantage-like learning signal. | Rewards actions that performed better than expected from that state. |
- Verify that action probabilities sum to one before computing log probabilities.
- Compute returns and baselines from the same reward convention and discount factor.
- Check that the mean advantage is close to zero after baseline subtraction on a stable batch.
- Track gradient norm because Monte Carlo returns can produce rare but very large updates.
- Keep rollout data on-policy for REINFORCE; old trajectories require importance correction or a different algorithm.

Code Fragment 2 makes the baseline identity concrete with a two-action policy. The expected score contribution of the baseline sums to zero because the probability derivatives across all actions cancel.
```python
# Verify that a state baseline has zero expected score contribution.
# This is why baselines reduce variance without changing the policy-gradient target.
prob_left = 0.25
prob_right = 0.75
baseline = 2.0

# Softmax score with respect to the "left" logit: 1[a == left] - prob_left.
score_left = 1.0 - prob_left
score_right = 0.0 - prob_left

expected_baseline_term = (
    prob_left * score_left * baseline
    + prob_right * score_right * baseline
)
print(round(expected_baseline_term, 6))
```
`expected_baseline_term` is zero for this two-action softmax score. The baseline can shrink noisy returns, but it cannot systematically push the policy left or right when averaged under the policy.

When REINFORCE fails, classify the failure by estimator pathology before changing the policy network. High variance suggests better baselines, shorter-horizon shaping, or advantage normalization. Biased updates suggest stale log probabilities, off-policy data, action post-processing, or a reward convention mismatch.
For REINFORCE, compare metrics only when they measure the same quantity and were computed in a single pass on a single configuration: same policy checkpoint, same rollout horizon, same baseline definition, same reward scale, and same seed set. Save returns, advantages, gradient norms, entropy, and failure labels in one artifact so estimator noise is not confused with embodied progress.
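One way to keep these quantities together is a single JSON record per evaluation pass. This is a minimal sketch; the function `build_run_record`, the field names, and the example values are hypothetical, and it covers only a subset of the quantities listed above.

```python
import json
import statistics


def build_run_record(checkpoint, horizon, seeds, returns, advantages, grad_norms):
    """Bundle configuration and estimator diagnostics into one JSON-ready dict."""
    return {
        "config": {
            "checkpoint": checkpoint,
            "rollout_horizon": horizon,
            "seeds": list(seeds),
        },
        "metrics": {
            "mean_return": statistics.mean(returns),
            "mean_advantage": statistics.mean(advantages),
            "max_grad_norm": max(grad_norms),
        },
    }


record = build_run_record(
    checkpoint="policy_step_1000",  # hypothetical checkpoint label
    horizon=200,
    seeds=[0, 1, 2],
    returns=[1.2, 0.4, -0.3],
    advantages=[1.0, 0.2, -0.5],
    grad_norms=[0.8, 3.1, 0.6],
)
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

Because configuration and metrics live in the same record, two runs can be compared only after confirming that their `config` blocks match.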
The policy-gradient theorem turns sampled actions into differentiable evidence. REINFORCE is conceptually clean because it only needs log probabilities and returns, but embodied agents need baselines and careful logging to keep that clean estimator usable.
Take a three-action policy and show algebraically that a state-only baseline has zero expected score contribution. Then identify one robot-control preprocessing step that would invalidate the log probability saved during rollout.
What's Next?
This section derived the REINFORCE estimator and showed why baselines reduce variance without biasing the expected update. Next, Section 15.3 replaces raw returns with actor-critic advantage estimates and GAE.
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.
The original REINFORCE paper deriving the likelihood-ratio policy gradient. Read Section 2 for the REINFORCE update rule and Section 5 for baseline subtraction. This is the direct predecessor to actor-critic and PPO; understanding it makes the clipped surrogate objective in Schulman et al. 2017 concrete.
Sutton, R. S. et al. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.
Formalizes the policy gradient theorem, showing that the gradient of expected return can be expressed as an expectation over state-action pairs. Read to understand why on-policy sampling is sufficient for an unbiased gradient estimate and how the baseline reduces variance without introducing bias.
Schulman, J. et al. (2015). Trust Region Policy Optimization. ICML.
Introduces the trust-region constraint that bounds policy update size using KL divergence, providing a monotonic improvement guarantee. Read Section 3 for the surrogate objective and Theorem 1 for the lower bound; PPO simplifies this into a clipped ratio that achieves similar stability with far less implementation complexity.
Schulman, J. et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
Derives the generalized advantage estimator (GAE) as an exponentially weighted average of n-step advantage estimates, controlled by the lambda parameter. Read Section 3 for the bias-variance trade-off analysis; in practice lambda around 0.95 is the default in most PPO implementations and understanding why requires this paper.
Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
Introduces the clipped surrogate objective that prevents large policy updates without the second-order KL constraint of TRPO. Read Section 3 for the clipping mechanism and Section 5 for the implementation details including value-function loss coefficient and entropy bonus that appear in nearly every modern PPO codebase.
CleanRL documentation and source code.
Provides single-file, dependency-minimal RL implementations that make every algorithmic choice visible on one screen. Read the PPO and SAC files side by side with the corresponding papers; CleanRL is the fastest way to verify that you understand which implementation details matter versus which are optional.