A Careful Control Loop
Trust regions constrain how far a policy update may move from the behavior that produced the rollout data. TRPO enforces this idea with an explicit KL constraint; PPO keeps the same instinct but uses a simpler clipped objective and KL diagnostics.
A policy-gradient update can be too successful at following its own noisy estimate. One batch says a rare action looked good, the optimizer makes that action much more likely everywhere, and the next rollout discovers that the policy has stepped outside the region where the data were informative. In embodied agents, that can mean falls, collisions, or controllers driven into saturation.
TRPO frames the solution as a constrained optimization problem:
$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\hat A_t\right]\quad\text{subject to}\quad \mathbb{E}\left[D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}\Vert \pi_\theta)\right]\le \delta.$$
The probability ratio compares the new policy with the old behavior policy on the same sampled actions. The KL constraint says the new policy should not move too far from the old one in distribution space.
A rollout collected by the old policy is evidence about nearby policies, not a blank check for arbitrary policy changes. Trust-region methods make that validity radius part of the update.
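The two quantities in the constrained problem can be computed by hand for a tiny case. The following is a minimal sketch, assuming a single state with three discrete actions; the probabilities, advantages, and the budget `delta` are made-up illustration values, and the expectation is taken exactly over the old policy rather than estimated from samples.

```python
import math

# Hypothetical single-state example with three discrete actions.
pi_old = [0.5, 0.3, 0.2]         # behavior policy that produced the rollout
pi_new = [0.6, 0.25, 0.15]       # candidate updated policy
advantages = [0.2, -0.2, -0.2]   # made-up advantages (zero mean under pi_old)
delta = 0.01                     # made-up KL budget

# Surrogate objective: E_{a ~ pi_old}[(pi_new / pi_old) * A].
surrogate = sum(p_old * (p_new / p_old) * adv
                for p_old, p_new, adv in zip(pi_old, pi_new, advantages))

# KL(pi_old || pi_new), the direction used in the constraint above.
kl_old_new = sum(p_old * math.log(p_old / p_new)
                 for p_old, p_new in zip(pi_old, pi_new))

within_trust_region = kl_old_new <= delta
print("surrogate:", round(surrogate, 4))
print("KL(old || new):", round(kl_old_new, 4))
print("inside trust region:", within_trust_region)
```

With these numbers the candidate raises the surrogate from 0 to 0.04, but its KL of roughly 0.021 exceeds the budget, so a constrained update would have to take a smaller step in the same direction.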
Theory
PPO replaces TRPO's constrained solver with clipping. Define the ratio
$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}.$$
The clipped surrogate objective is
$$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}_t\left[\min\left(r_t(\theta)\hat A_t,\operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\right)\right].$$
If the advantage is positive, PPO stops rewarding the update once the new policy makes the action much more likely than the old policy did. If the advantage is negative, PPO stops rewarding the update once the new policy makes the action much less likely. The clip does not enforce a hard KL limit, but it removes incentive for many destructive updates.
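The flat regions described above can be checked numerically. The sketch below is an illustration only, not taken from any PPO library: `ppo_term` and `slope` are hypothetical helper names, and the slope is a finite-difference estimate of how the per-sample term changes with the ratio.

```python
def ppo_term(ratio, advantage, epsilon=0.2):
    """Single-sample PPO clipped surrogate term."""
    clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, h=1e-6):
    """Central finite-difference slope of the term with respect to the ratio."""
    return (ppo_term(ratio + h, advantage) - ppo_term(ratio - h, advantage)) / (2 * h)

cases = [
    (1.10, +1.0),  # inside the band, positive advantage
    (1.30, +1.0),  # above the band, positive advantage
    (0.70, +1.0),  # below the band, positive advantage
    (0.70, -1.0),  # below the band, negative advantage
    (1.30, -1.0),  # above the band, negative advantage
]
slopes = [round(slope(r, a), 3) for r, a in cases]
for (r, a), s in zip(cases, slopes):
    print("ratio", r, "advantage", a, "slope", s)
```

The slope is zero only in the two cases where the policy has already moved in the direction the advantage favors. When the ratio has moved the wrong way (rows three and five), the unclipped term stays active, so the gradient can still pull the policy back.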
The TRPO-to-PPO move is short to state. TRPO maximizes $\mathbb{E}\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}}\hat A\right]$ subject to $D_{\mathrm{KL}} \le \delta$, solved with a conjugate-gradient step and a line search on a second-order trust-region model. PPO keeps the same surrogate ratio but replaces the hard constraint with the clipped objective above, so the same trust-region instinct now fits inside ordinary minibatch stochastic gradient descent.
Proximal Policy Optimization Algorithms (Schulman et al., arXiv 2017): the clip-ratio objective $\min(r_t \hat A_t, \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\hat A_t)$ approximates a trust region without second-order optimization. PPO is a widely used on-policy algorithm for robot locomotion in Isaac Lab, MJX, and Brax.
PPO has two brakes. Clipping limits the per-sample incentive in the surrogate loss, and KL monitoring detects when the whole action distribution has still moved too far. Production PPO implementations often use both.
Worked Example
Code Fragment 1 calculates the unclipped and clipped PPO terms for a few rollout samples. The clipped objective blocks extra credit when the probability ratio moves outside the trust band.
# Compute PPO clipped surrogate terms for sampled ratios and advantages.
# The min operation removes incentive for updates beyond the trust band.
ratios = [0.72, 0.93, 1.08, 1.31]
advantages = [0.6, -0.4, 0.8, 0.5]
epsilon = 0.2
for ratio, advantage in zip(ratios, advantages):
    clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
    unclipped = ratio * advantage
    clipped = clipped_ratio * advantage
    objective_term = min(unclipped, clipped)
    print(ratio, advantage, "clip:", clipped_ratio, "term:", round(objective_term, 3))
The expected output shows two different PPO behaviors. Ratios inside the trust band, such as 0.93 and 1.08, keep their natural objective term, while an oversized ratio like 1.31 is cut back to the clip limit, which keeps one minibatch sample from driving an excessively large policy update.
The sample with ratio 1.31 and positive advantage is capped at a clipped ratio of 1.2, so its term is 0.6 rather than 0.655. PPO still permits improvement, but it stops giving extra objective reward for moving that action probability too far in one update. The first row shows a different subtlety. A positive-advantage action became less likely, so clipping does not rescue it: the min keeps the unclipped term of 0.432 instead of the clipped 0.48. The objective term remains low because the new policy moved against the learning signal.
CleanRL exposes PPO's clipped loss and approximate KL in a compact implementation. Stable-Baselines3, RSL-RL, and rl_games add mature rollout storage, vectorized environments, and hardware-oriented training loops while keeping the same ratio, clipping, and KL-control ideas.
Practical Recipe
- Save old log probabilities during rollout and compute ratios from new log probabilities during training.
- Use clip ranges such as 0.1 to 0.3 as starting points, then tune with KL, entropy, and return stability.
- Stop or shrink updates when approximate KL exceeds the target by a large margin.
- Track the fraction of samples clipped, because a high clip fraction means many gradients are pushing against the trust band.
- Watch embodied safety metrics during updates, not only episodic return.
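The first three items of the recipe can be sketched together on a toy problem. This is a minimal illustration under stated assumptions, not a production loop: the policy is a hypothetical one-parameter Bernoulli policy with a hand-written gradient, the rollout values are made up, and the `1.5 * target_kl` stop margin is one choice of threshold rather than a fixed rule.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_prob(theta, action):
    """Log probability of a binary action under a one-parameter Bernoulli policy."""
    p = sigmoid(theta)
    return math.log(p if action == 1 else 1.0 - p)

# Hypothetical rollout; old log probs are saved at collection time.
theta_old = 0.0
actions = [1, 0, 1, 1, 0, 1]
advantages = [1.0, -0.5, 0.8, 1.2, -0.7, 0.9]
old_log_probs = [log_prob(theta_old, a) for a in actions]

epsilon, target_kl, learning_rate, max_epochs = 0.2, 0.01, 0.5, 10
theta = theta_old
history = []
for epoch in range(max_epochs):
    grad, kl_sum = 0.0, 0.0
    for a, adv, old_lp in zip(actions, advantages, old_log_probs):
        log_ratio = log_prob(theta, a) - old_lp   # ratio from log probs
        ratio = math.exp(log_ratio)
        kl_sum += (ratio - 1.0) - log_ratio
        clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
        # Gradient flows only when the unclipped term is the active branch of the min.
        if ratio * adv <= clipped * adv:
            grad += adv * ratio * (a - sigmoid(theta))
    approx_kl = kl_sum / len(actions)
    history.append((epoch, round(theta, 3), round(approx_kl, 4)))
    if approx_kl > 1.5 * target_kl:
        break  # stop reusing this rollout once the policy has drifted too far
    theta += learning_rate * grad / len(actions)

for row in history:
    print("epoch, theta, approx_kl:", row)
```

With these made-up numbers the KL check trips at the start of the third epoch, so the loop takes only two gradient steps on the rollout instead of ten. A smaller learning rate or fewer epochs would be the usual way to shrink the update rather than stop it.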
Clipping is not a safety guarantee. A policy can keep each sampled ratio inside the clip band while the unsampled parts of the action distribution move enough to harm the next rollout.
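A small constructed case makes this concrete. The sketch below assumes a single state with three discrete actions and a rollout that happened to sample only one of them; all probabilities are hypothetical and chosen to exaggerate the effect.

```python
import math

# Hypothetical single-state policy over three discrete actions.
pi_old = [0.50, 0.25, 0.25]
pi_new = [0.52, 0.47, 0.01]
sampled_actions = [0, 0, 0, 0]   # the rollout only visited action 0 in this state
epsilon = 0.2

# Ratios on the sampled actions, which is all the clipped loss can see.
sampled_ratios = [pi_new[a] / pi_old[a] for a in sampled_actions]
all_inside_band = all(abs(r - 1.0) <= epsilon for r in sampled_ratios)

# Exact KL over the whole action distribution, including unsampled actions.
full_kl = sum(p * math.log(p / q) for p, q in zip(pi_old, pi_new))

print("sampled ratios:", [round(r, 3) for r in sampled_ratios])
print("all inside clip band:", all_inside_band)
print("full KL(old || new):", round(full_kl, 3))
```

Every sampled ratio is 1.04, well inside the band, yet the full-distribution KL is above 0.5 because probability mass moved between the two actions the rollout never sampled. A sample-based KL estimate would miss this too, which is why the recipe also watches entropy and behavior-level metrics.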
For a legged robot, a large update can move gait timing just enough to turn stable walking into toe stubbing. PPO's ratio clip and KL stop condition keep the policy near the data that showed stable contact timing.
A trust region is the optimizer's reminder that one lucky rollout is not permission to reinvent the robot's gait in a single update.
Current policy-optimization work often mixes PPO-style updates with safety critics, offline data, and constrained RL. The open problem is preserving the simplicity and scalability of PPO while giving stronger guarantees about distribution shift and unsafe action regions.
Can you explain what a probability ratio above $1+\epsilon$ means for a positive-advantage action, and why a KL spike can matter even when the clipped loss looks stable?
TRPO and PPO are best understood as responses to the same failure: on-policy data become stale quickly. TRPO solves the problem more formally with a constrained update based on KL divergence. PPO accepts a less exact constraint because the clipped loss is simple, scalable, and easy to combine with minibatch stochastic gradient descent.
For embodied systems, the important diagnostic is not only the scalar return. A KL spike can precede visible behavior collapse by one update: the rollout that caused the update still looked good, while the next policy executes a new distribution of motions. Logging approximate KL, clip fraction, entropy, and action standard deviation gives the team early warnings before a robot-level failure becomes expensive.
| Lever | What It Controls | What To Watch |
|---|---|---|
| KL constraint or target KL | How far the new policy may move from the old policy. | Sudden KL jumps after high-advantage minibatches. |
| Clip range $\epsilon$ | How much probability ratios can improve the objective. | Clip fraction near zero or near one for many updates. |
| Entropy bonus | How much exploration pressure remains. | Action standard deviation collapse in continuous control. |
| Number of epochs | How many times the same rollout is reused. | Old data overfitting and rising KL within an update. |
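The entropy row of the table can be made concrete for continuous control. The sketch below assumes a diagonal Gaussian policy parameterized by per-dimension log standard deviations; `gaussian_entropy` is a hypothetical helper name and the log-std values are made up to show a collapse.

```python
import math

def gaussian_entropy(log_stds):
    """Differential entropy of a diagonal Gaussian policy, in nats."""
    return sum(ls + 0.5 * math.log(2.0 * math.pi * math.e) for ls in log_stds)

# Hypothetical log standard deviations for a 3-D action, early and late in training.
early_log_stds = [0.0, 0.0, 0.0]
late_log_stds = [-2.0, -2.3, -1.9]

early_entropy = gaussian_entropy(early_log_stds)
late_entropy = gaussian_entropy(late_log_stds)
late_stds = [math.exp(ls) for ls in late_log_stds]

print("early entropy:", round(early_entropy, 3))
print("late entropy:", round(late_entropy, 3))
print("late action stds:", [round(s, 3) for s in late_stds])
```

Because the entropy depends only on the log standard deviations, logging the per-dimension standard deviation alongside the scalar entropy shows which action dimension stopped exploring.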
Code Fragment 2 shows a compact KL and clip-fraction diagnostic. These numbers belong next to return curves because they explain whether PPO improved behavior by a controlled update or by a risky jump.
- Compute ratios as exp(new_log_prob - old_log_prob), not by dividing rounded probabilities.
- Report approximate KL for each update epoch, not only at the end of training.
- Stop the epoch loop early when KL exceeds the target threshold.
- Pair every return plot with entropy, KL, and clip fraction.
- Inspect videos or state traces from the update after the largest KL movement.
# Compute PPO diagnostics from old and new log probabilities.
# KL and clip fraction tell you whether the update stayed near the rollout policy.
import math
old_log_probs = [-0.9, -0.4, -1.2, -0.7]
new_log_probs = [-0.7, -0.5, -0.8, -0.2]
epsilon = 0.2
log_ratios = [new - old for old, new in zip(old_log_probs, new_log_probs)]
ratios = [math.exp(log_ratio) for log_ratio in log_ratios]
approx_kl = sum((ratio - 1.0) - log_ratio for ratio, log_ratio in zip(ratios, log_ratios)) / len(ratios)
clip_fraction = sum(abs(r - 1.0) > epsilon for r in ratios) / len(ratios)
print("ratios:", [round(r, 3) for r in ratios])
print("approx_kl:", round(approx_kl, 3))
print("clip_fraction:", round(clip_fraction, 2))
The expected output means three of the four sampled ratios are already outside the nominal PPO trust region, which is why the clip fraction rises to 0.75. A KL of 0.067 in such a tiny example should be read as a warning that additional epochs would likely over-update the policy.
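The estimator used in the diagnostic, the mean of $(r-1)-\log r$, follows from a short identity. As a sketch, assume discrete actions sampled from the old policy and write $r=\pi_\theta(a\mid s)/\pi_{\theta_{\mathrm{old}}}(a\mid s)$; the continuous case replaces the sum with an integral.

$$\mathbb{E}_{a\sim\pi_{\theta_{\mathrm{old}}}}[r]=\sum_a \pi_\theta(a\mid s)=1,\qquad \mathbb{E}_{a\sim\pi_{\theta_{\mathrm{old}}}}[-\log r]=D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}\Vert \pi_\theta),$$

$$\text{so}\quad \mathbb{E}_{a\sim\pi_{\theta_{\mathrm{old}}}}\left[(r-1)-\log r\right]=D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}\Vert \pi_\theta).$$

Because $(r-1)-\log r\ge 0$ for every $r>0$, each sample contributes a nonnegative amount, so the estimate cannot go negative on a small batch the way the plain mean of $-\log r$ can.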
The ratios show how much the new policy changed the probability of sampled actions. A clip_fraction of 0.75 warns that most samples are pushing outside the PPO trust band, even before evaluating the next rollout.
When PPO collapses after a promising update, inspect the trust-region diagnostics before changing the reward. A high clip fraction suggests the learning rate, epoch count, or advantage scale is too aggressive. A low entropy trace suggests the policy lost exploration. A KL spike suggests old rollout data were overused.
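The triage order above can be written down as a small checklist function. This is a sketch only: `triage` is a hypothetical helper, and every threshold in it (`high_clip`, `low_entropy`, the `1.5 * target_kl` margin) is an illustrative assumption that would need tuning per task.

```python
def triage(clip_fraction, approx_kl, entropy,
           target_kl=0.01, high_clip=0.5, low_entropy=0.0):
    """Map PPO diagnostics to likely causes. All thresholds are illustrative."""
    suspects = []
    if clip_fraction > high_clip:
        suspects.append("learning rate, epoch count, or advantage scale too aggressive")
    if entropy < low_entropy:
        suspects.append("policy lost exploration")
    if approx_kl > 1.5 * target_kl:
        suspects.append("old rollout data overused")
    return suspects

# Diagnostics from the toy example above, with a made-up entropy value.
report = triage(clip_fraction=0.75, approx_kl=0.067, entropy=1.2)
for suspect in report:
    print("check:", suspect)
```

For the toy diagnostics this flags the update size and the data reuse but not exploration, which points at learning rate and epoch count before any change to the reward.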
For trust-region and PPO comparisons, compute return, KL, entropy, clip fraction, safety violations, and failure labels from the same runs on the same panel of seeds. A return improvement without the KL and clip context is not enough evidence that the update is stable for embodied deployment.
PPO clipping is a practical approximation to a trust-region idea: learn from the old rollout, but do not let one noisy batch push the new policy far outside the behavior that generated the evidence.
Given old and new log probabilities plus advantages for eight rollout steps, compute ratios, clipped objective terms, clip fraction, and approximate KL. Identify which samples are no longer useful for increasing the PPO objective.
What's Next?
This section connected TRPO's trust-region constraint to PPO's clipped surrogate and KL diagnostics. Next, Section 15.5 turns those equations into the concrete PPO rollout and training loop.
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.
The original REINFORCE paper deriving the likelihood-ratio policy gradient. Read Section 2 for the REINFORCE update rule and Section 5 for baseline subtraction. This is the direct predecessor to actor-critic and PPO; understanding it makes the clipped surrogate objective in Schulman et al. 2017 concrete.
Sutton, R. S. et al. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.
Formalizes the policy gradient theorem, showing that the gradient of expected return can be expressed as an expectation over state-action pairs. Read to understand why on-policy sampling is sufficient for an unbiased gradient estimate and how the baseline reduces variance without introducing bias.
Schulman, J. et al. (2015). Trust Region Policy Optimization. ICML.
Introduces the trust-region constraint that bounds policy update size using KL divergence, providing a monotonic improvement guarantee. Read Section 3 for the surrogate objective and Theorem 1 for the lower bound; PPO simplifies this into a clipped ratio that achieves similar stability with far less implementation complexity.
Schulman, J. et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
Derives the generalized advantage estimator (GAE) as an exponentially weighted average of n-step advantage estimates, controlled by the lambda parameter. Read Section 3 for the bias-variance trade-off analysis; in practice lambda around 0.95 is the default in most PPO implementations and understanding why requires this paper.
Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
Introduces the clipped surrogate objective that prevents large policy updates without the second-order KL constraint of TRPO. Read Section 3 for the clipping mechanism and Section 5 for the implementation details including value-function loss coefficient and entropy bonus that appear in nearly every modern PPO codebase.
CleanRL documentation and source code.
Provides single-file, dependency-minimal RL implementations that make every algorithmic choice visible on one screen. Read the PPO and SAC files side by side with the corresponding papers; CleanRL is the fastest way to verify that you understand which implementation details matter versus which are optional.