A Careful Control Loop
Actor-critic methods pair a policy that acts with a value function that judges whether the action was better or worse than expected. Generalized Advantage Estimation (GAE) turns noisy reward sequences into smoother advantage targets for PPO-style updates.
REINFORCE can assign a long delayed return to every action in a trajectory. That is unbiased, but it is noisy for embodied agents where a late slip, contact bounce, or tracking error can dominate the episode return. Actor-critic methods reduce this noise by learning $V_\phi(s)$, an estimate of how much return is expected from a state before the next action is chosen.
The actor update uses an advantage estimate:
$$A_t = Q(s_t,a_t)-V(s_t).$$
In practice, PPO often starts from the temporal-difference residual:
$$\delta_t=r_t+\gamma V(s_{t+1})-V(s_t),$$
then combines residuals with GAE:
$$\hat A_t^{\mathrm{GAE}(\gamma,\lambda)}=\sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}.$$
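The infinite sum is rarely evaluated term by term. Writing $\hat A_t$ for $\hat A_t^{\mathrm{GAE}(\gamma,\lambda)}$ and splitting off the first residual gives a one-step recursion, which is the form rollout code typically evaluates from the last step backward. For a finite rollout, this sketch assumes residuals beyond the final step contribute nothing, so the recursion starts from zero:
$$\hat A_t=\delta_t+\gamma\lambda\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+1+l}=\delta_t+\gamma\lambda\,\hat A_{t+1}.$$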
The critic does not tell the actor whether the state is good in absolute terms. It tells the actor whether the sampled action led to a result above or below what the state already promised.
Theory
The parameter $\lambda$ controls a bias-variance tradeoff. When $\lambda=0$, the advantage reduces to the one-step TD residual $\delta_t$, which is low variance but inherits any error in the critic as bias. As $\lambda$ approaches 1, the estimate approaches the discounted Monte Carlo return minus the baseline $V(s_t)$, which relies less on the critic but becomes noisier. PPO implementations commonly use values near 0.95 because that often balances delayed reward evidence with manageable variance.
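The two limits can be checked numerically. The sketch below is illustrative only: the helper name `gae` and the four-step rollout numbers are assumptions made for this example, and the final entry of `values` is treated as the bootstrap value for the state after the last step.

```python
def gae(rewards, values, gamma, lam):
    """Backward GAE recursion; values has one more entry than rewards."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative rollout (assumed numbers, not measured data).
rewards = [0.0, 0.2, 1.0, -0.1]
values = [0.3, 0.4, 0.6, 0.2, 0.0]
gamma = 0.99
T = len(rewards)

# lam = 0: each advantage should equal the one-step TD residual.
td_residuals = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
adv_lam0 = gae(rewards, values, gamma, lam=0.0)

# lam = 1: each advantage should equal the discounted return
# (including the discounted bootstrap value) minus V(s_t).
mc_minus_v = []
for t in range(T):
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    ret += gamma ** (T - t) * values[T]
    mc_minus_v.append(ret - values[t])
adv_lam1 = gae(rewards, values, gamma, lam=1.0)

print("lam=0:", [round(a, 3) for a in adv_lam0])
print("lam=1:", [round(a, 3) for a in adv_lam1])
```

On this rollout both comparisons agree to floating-point precision, which is the sense in which $\lambda$ interpolates between one-step TD evidence and Monte Carlo evidence.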
For embodied control, this tradeoff is visible in time. A one-step residual may miss that a foot placement caused a stumble three frames later. A long-horizon return may blame that foot placement for an unrelated collision after a perception glitch. GAE gives the builder a knob for how far credit should travel backward through the rollout.
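One rough way to read that knob, offered as a rule of thumb rather than a guarantee: the residual $l$ steps ahead enters the advantage with weight $(\gamma\lambda)^l$, so the total weight spread over future residuals is a geometric series:
$$\sum_{l=0}^{\infty}(\gamma\lambda)^l=\frac{1}{1-\gamma\lambda}.$$
With $\gamma=0.99$ and $\lambda=0.95$, $\gamma\lambda=0.9405$ and the sum is about 16.8; with $\lambda=0.8$ it drops to about 4.8. In the first setting, a stumble three frames after a foot placement still enters that action's advantage with weight $0.9405^3\approx 0.83$, whereas with $\lambda=0$ it enters only through the critic's value estimates.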
The critic supplies a baseline, the TD residual measures local surprise, and GAE smooths those surprises backward through time. The actor then uses the resulting advantage to weight the same log-probability update introduced in REINFORCE.
Worked Example
Code Fragment 1 computes GAE for a short rollout. Notice that the advantages are computed backward because later residuals influence earlier credit.
```python
# Compute Generalized Advantage Estimation from rewards and value predictions.
# The backward recursion sends delayed evidence to earlier sampled actions.
rewards = [0.0, 0.2, 1.0, -0.1]
values = [0.3, 0.4, 0.6, 0.2, 0.0]  # one extra entry: value after the last step
gamma = 0.99
lam = 0.95
advantages = []
gae = 0.0
for t in reversed(range(len(rewards))):
    # One-step TD residual: surprise relative to the critic's prediction.
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    # Accumulate discounted residuals from later steps.
    gae = delta + gamma * lam * gae
    advantages.insert(0, round(gae, 3))
print("advantages:", advantages)
```
The delta term measures one-step surprise against the critic's values. The backward gae recursion makes the positive reward at step 2 influence earlier actions without assigning the full episode return to every step.
This trace gives the mental model. Step 3 has a negative advantage because the outcome was worse than the critic expected. Earlier steps remain positive because the later reward still supplies evidence that those actions helped set up a useful state.
Stable-Baselines3, CleanRL, RSL-RL, and rl_games all implement actor-critic PPO with GAE. The implementation details to inspect are gamma, gae_lambda, value loss scaling, advantage normalization, and whether time-limit truncations are bootstrapped correctly.
Practical Recipe
- Train the actor and critic on the same rollout batch so advantages match the behavior policy.
- Bootstrap from $V(s_{t+1})$ when a rollout segment ends by truncation, but not when the episode truly terminates.
- Normalize advantages per batch to prevent reward-scale changes from dominating the policy loss.
- Track explained variance or value error so a broken critic does not silently corrupt the actor update.
- Audit GAE under perturbations, because delayed physical failures change how far credit should travel backward.
Time-limit truncation is often mistaken for termination. If a robot episode is cut off because a time limit expired or the rollout buffer filled, the critic should usually bootstrap from the value of the next state; if it ended because the robot fell, it should not.
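The distinction can be sketched in a few lines. This is one possible convention, not the behavior of any particular library: the function name `gae_with_flags`, the `next_values` argument (the critic's value for the state reached after each step, including the final observation before a reset), and all numbers are assumptions made for illustration. The flag names `terminated` and `truncated` echo Gymnasium's step return.

```python
def gae_with_flags(rewards, values, next_values, terminated, truncated,
                   gamma=0.99, lam=0.95):
    """GAE that bootstraps on truncation but not on true termination."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # True termination: no future return exists, so drop the bootstrap.
        bootstrap = 0.0 if terminated[t] else next_values[t]
        delta = rewards[t] + gamma * bootstrap - values[t]
        # Any episode boundary cuts the trace: residuals after it
        # belong to a different episode.
        if terminated[t] or truncated[t]:
            running = 0.0
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Same four steps, two different endings (assumed numbers).
rewards = [0.0, 0.2, 1.0, -0.1]
values = [0.3, 0.4, 0.6, 0.2]
next_values = [0.4, 0.6, 0.2, 0.5]
no_flag = [False, False, False, False]
last_step = [False, False, False, True]

# Robot fell on the last step: the final bootstrap value is ignored.
fell = gae_with_flags(rewards, values, next_values, last_step, no_flag)
# Time limit hit on the last step: the final bootstrap value is used.
timed_out = gae_with_flags(rewards, values, next_values, no_flag, last_step)

print("fell:     ", [round(a, 3) for a in fell])
print("timed out:", [round(a, 3) for a in timed_out])
```

Treating the timeout as a fall would make every step in this rollout look worse than it was, because the missing bootstrap value propagates backward through the recursion.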
In dexterous manipulation, a grasp adjustment may pay off several control steps later when the object stops slipping. GAE lets that delayed evidence reach the adjustment action without letting a much later unrelated collision dominate the whole rollout.
The critic is the agent's skeptical lab partner. It does not celebrate reward by itself; it asks whether the reward was better than the state already predicted.
Recent embodied-policy systems often train critics on privileged simulator state while actors receive deployable observations. That asymmetry can improve training, but it raises a deployment question: does the learned actor still behave well when the critic's privileged information disappears?
Can you identify which terms in a rollout affect the actor loss, which affect the critic loss, and which termination flags decide whether GAE should bootstrap?
The actor and critic optimize different targets from the same experience. The actor asks which sampled actions should become more likely. The critic asks which states predict future return. If these targets are mixed casually, a bug in value learning can masquerade as a policy improvement.
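A minimal sketch of that separation, reusing the four-step rollout's advantages. The log-probabilities are hypothetical numbers, the actor loss shown is the plain advantage-weighted form rather than PPO's clipped surrogate, and the critic target assumes the common choice of advantage plus value prediction.

```python
# GAE output for the rollout, treated as constants by both losses.
advantages = [0.746, 0.691, 0.316, -0.300]
# Critic predictions V(s_t) for the same four steps.
values = [0.3, 0.4, 0.6, 0.2]
# Hypothetical log pi(a_t | s_t) for the sampled actions.
log_probs = [-1.2, -0.9, -1.5, -0.7]

# Critic target: the return estimate implied by GAE, A_t + V(s_t).
value_targets = [a + v for a, v in zip(advantages, values)]

# Actor loss: advantage-weighted negative log-likelihood.
# Only log_probs would carry gradients here.
actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(advantages)

# Critic loss: mean squared error against the targets.
# Only the value predictions would carry gradients here.
critic_loss = sum((v - g) ** 2 for v, g in zip(values, value_targets)) / len(values)

print("value targets:", [round(g, 3) for g in value_targets])
print("actor loss:", round(actor_loss, 4))
print("critic loss:", round(critic_loss, 4))
```

Logging the two losses separately, as here, makes it easier to see which one moved when behavior changes.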
Embodied systems add one more complication: the critic often sees partial observations, not the full physical state. If the value function cannot infer hidden contact state, object slip, battery sag, or actuator temperature, its baseline will be noisy. That does not make actor-critic invalid, but it does mean the value diagnostics must be read alongside physical failure labels.
| Estimator | Strength | Embodied Risk |
|---|---|---|
| Monte Carlo return | Uses complete future reward evidence. | High variance when late physical events dominate the episode. |
| One-step TD | Low variance and fast feedback. | Biased when the critic misses delayed contact consequences. |
| GAE with $\lambda$ near 0.95 | Balances delayed credit and variance. | Sensitive to truncation handling and value-function quality. |
| Normalized advantages | Stabilizes batch scale for PPO. | Can hide reward-scale bugs if raw advantages are never inspected. |
Code Fragment 2 shows the second piece most PPO implementations apply after GAE: advantage normalization. This is not part of the theorem; it is a practical stabilizer that keeps the policy loss scale consistent across batches.
- Compute advantages before shuffling minibatches, using the rollout time order.
- Normalize advantages after GAE, using the full batch mean and standard deviation.
- Keep raw advantages in logs so reward-scale and critic failures remain visible.
- Train the value function on returns compatible with the same bootstrap convention.
- Plot value prediction, return target, and failure labels for several complete episodes.
```python
# Normalize advantages after GAE so PPO sees a stable loss scale.
# Keep raw values for diagnostics because normalization can hide reward bugs.
advantages = [0.746, 0.691, 0.316, -0.300]
mean_advantage = sum(advantages) / len(advantages)
variance = sum((x - mean_advantage) ** 2 for x in advantages) / len(advantages)
std_advantage = variance ** 0.5
# The small epsilon guards against division by zero for a constant batch.
normalized = [(x - mean_advantage) / (std_advantage + 1e-8) for x in advantages]
print([round(x, 3) for x in normalized])
```
Normalized advantages keep the positive and negative ordering from GAE while centering the batch around zero. PPO uses this stabilized scale for the actor loss, but raw advantages should still be logged for debugging.
When actor-critic training fails, check critic health first. A value loss that falls while episode behavior worsens can mean the critic learned the wrong shortcut, such as predicting timeout length rather than task progress. A value loss that never falls can turn every advantage estimate into high-variance noise.
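One common critic-health number is explained variance: one minus the variance of the prediction error divided by the variance of the returns. The sketch below is a plain-Python illustration with assumed numbers; the function name is chosen for this example. Values near 1 mean predictions track returns, values near 0 mean the critic is no better than a constant, and negative values mean it is worse than a constant.

```python
def explained_variance(value_preds, returns):
    """1 - Var(returns - value_preds) / Var(returns), population variance."""
    n = len(returns)
    mean_ret = sum(returns) / n
    var_ret = sum((g - mean_ret) ** 2 for g in returns) / n
    if var_ret == 0.0:
        # Undefined when returns are constant.
        return float("nan")
    residuals = [g - v for g, v in zip(returns, value_preds)]
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return 1.0 - var_res / var_ret


returns = [1.0, 2.0, 3.0, 4.0]  # assumed return targets

ev_good = explained_variance([1.0, 2.0, 3.0, 4.0], returns)      # tracks returns
ev_constant = explained_variance([2.5, 2.5, 2.5, 2.5], returns)  # constant guess
ev_inverted = explained_variance([4.0, 3.0, 2.0, 1.0], returns)  # wrong direction

print(ev_good, ev_constant, ev_inverted)
```

A single number like this does not replace per-episode plots, but a sudden drop is a cheap early warning that the advantages feeding the actor have degraded.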
For actor-critic and GAE, compare metrics only when they measure the same quantity and were computed in one pass on one configuration: same rollout horizon, same bootstrap convention, same $\gamma$, same $\lambda$, same value target, and same perturbation suite. Save raw advantages, normalized advantages, value predictions, returns, truncation flags, and failure labels in one artifact.
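One way to keep those quantities together is a single JSON record per rollout. The layout below is a hypothetical sketch: every field name, the configuration keys, and the failure label are assumptions for illustration, not a format used by any specific library.

```python
import json
import statistics

raw_advantages = [0.746, 0.691, 0.316, -0.300]
value_predictions = [0.3, 0.4, 0.6, 0.2]

# Population statistics, matching a divide-by-n normalization.
mean_adv = statistics.fmean(raw_advantages)
std_adv = statistics.pstdev(raw_advantages)

artifact = {
    # Configuration stored next to the data it produced (hypothetical keys).
    "config": {
        "gamma": 0.99,
        "gae_lambda": 0.95,
        "rollout_horizon": 4,
        "bootstrap_on_truncation": True,
    },
    "raw_advantages": raw_advantages,
    "normalized_advantages": [
        (a - mean_adv) / (std_adv + 1e-8) for a in raw_advantages
    ],
    "value_predictions": value_predictions,
    "returns": [a + v for a, v in zip(raw_advantages, value_predictions)],
    "truncated": [False, False, False, True],
    "failure_labels": [None, None, "slip", None],  # hypothetical label
}

serialized = json.dumps(artifact, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(sorted(restored.keys()))
```

Because the configuration travels with the data, two artifacts can be checked for matching settings before their metrics are compared.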
Actor-critic methods make policy gradients usable by asking a sharper question: was this action better than expected from this state? GAE controls how much delayed embodied evidence flows backward into that answer.
Given a five-step rollout with rewards, values, and termination flags, compute TD residuals, GAE advantages, normalized advantages, and value targets. Mark which steps should bootstrap if the rollout ended by timeout instead of task failure.
What's Next?
This section showed how actor-critic methods and GAE reduce policy-gradient variance while preserving delayed reward evidence. Next, Section 15.4 adds trust-region control so those actor updates do not move too far from the rollout policy.
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.
The original REINFORCE paper deriving the likelihood-ratio policy gradient. Read Section 2 for the REINFORCE update rule and Section 5 for baseline subtraction. This is the direct predecessor to actor-critic and PPO; understanding it makes the clipped surrogate objective in Schulman et al. 2017 concrete.
Sutton, R. S. et al. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.
Formalizes the policy gradient theorem, showing that the gradient of expected return can be expressed as an expectation over state-action pairs. Read to understand why on-policy sampling is sufficient for an unbiased gradient estimate and how the baseline reduces variance without introducing bias.
Schulman, J. et al. (2015). Trust Region Policy Optimization. ICML.
Introduces the trust-region constraint that bounds policy update size using KL divergence, providing a monotonic improvement guarantee. Read Section 3 for the surrogate objective and Theorem 1 for the lower bound; PPO simplifies this into a clipped ratio that achieves similar stability with far less implementation complexity.
Schulman, J. et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
Derives the generalized advantage estimator (GAE) as an exponentially weighted average of n-step advantage estimates, controlled by the lambda parameter. Read Section 3 for the bias-variance trade-off analysis; in practice lambda around 0.95 is the default in most PPO implementations and understanding why requires this paper.
Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
Introduces the clipped surrogate objective that prevents large policy updates without the second-order KL constraint of TRPO. Read Section 3 for the clipping mechanism and Section 5 for the implementation details including value-function loss coefficient and entropy bonus that appear in nearly every modern PPO codebase.
CleanRL documentation and source code.
Provides single-file, dependency-minimal RL implementations that make every algorithmic choice visible on one screen. Read the PPO and SAC files side by side with the corresponding papers; CleanRL is the fastest way to verify that you understand which implementation details matter versus which are optional.