Section 15.6: Reward shaping and its hazards

A Careful Control Loop
Big Picture

Reward shaping adds intermediate learning signals so a policy can discover useful behavior before sparse task success appears. Its hazard is that the agent may optimize the shaped signal instead of the intended embodied task.

Sparse rewards are hard for policy gradients because a rollout may contain thousands of actions before the first success signal, so most sampled trajectories carry almost no information about which actions helped. A robot might need to search, reach, align, grasp, lift, and place before seeing a final reward. Shaping adds smaller signals along the way, such as distance to a goal or uprightness.

The danger is objective mismatch. If a mobile robot receives reward for being close to a doorway, it may learn to hover at the doorway instead of passing through. If a manipulator receives reward for high gripper force, it may crush objects. The shaped reward must be treated as a training instrument, not as the final task definition.

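Potential-based shaping is the safest standard form. For a real-valued potential function $\Phi$ over states and the discount factor $\gamma$, the shaping term is: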

$$F(s,a,s')=\gamma\Phi(s')-\Phi(s).$$

Adding $F$ to the reward preserves the optimal policy under the usual discounted MDP assumptions because it changes trajectory returns by a telescoping potential term rather than changing which complete behavior is best.
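
The telescoping step can be written out explicitly. As a sketch, assume a trajectory $s_0, a_0, s_1, \dots, s_T$ with $T$ transitions (this indexing is introduced here for illustration) and sum the discounted shaping terms:

$$\sum_{t=0}^{T-1}\gamma^{t}F(s_t,a_t,s_{t+1})=\sum_{t=0}^{T-1}\left(\gamma^{t+1}\Phi(s_{t+1})-\gamma^{t}\Phi(s_t)\right)=\gamma^{T}\Phi(s_T)-\Phi(s_0).$$

Every intermediate potential cancels, so the shaped return differs from the original return only by the start-state potential and a discounted final potential. In the infinite-horizon discounted case with bounded $\Phi$, the final term vanishes and the offset is $-\Phi(s_0)$, which is the same for every policy started from $s_0$.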

The Reward Is A Proxy

The agent optimizes the scalar it receives, not the intent in the designer's head. Shaping is useful only when the proxy remains aligned with the task under the policy's future behavior.

Theory

A shaped reward should be decomposed in logs. Instead of storing only total reward, record task reward, progress shaping, energy penalties, contact penalties, safety penalties, and terminal success separately. This lets the team see whether PPO is improving the task or merely harvesting one component.

Embodied reward shaping also needs sensor humility. Distance-to-goal reward depends on pose estimation. Contact penalties depend on tactile or simulator contact models. Energy penalties depend on actuator measurements. If those signals are biased, the policy can learn the measurement artifact.
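
A small numerical sketch can make the measurement problem concrete. It assumes a stationary robot whose pose estimator drifts, so the measured distance shrinks while the true distance does not. The drift values and variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: a drifting distance estimate earns shaping reward
# even though the robot never moves.
gamma = 0.99

def potential(distance: float) -> float:
    # Potential defined as negative distance-to-goal.
    return -distance

def shaping(before: float, after: float) -> float:
    # Potential-based shaping term: gamma * Phi(s') - Phi(s).
    return gamma * potential(after) - potential(before)

# The robot stays put: the true distance remains 2.0.
true_distances = [2.0, 2.0, 2.0]
# Assumed artifact: the estimate drifts downward over the same steps.
measured_distances = [2.0, 1.6, 1.2]

true_bonus = sum(shaping(b, a) for b, a in zip(true_distances, true_distances[1:]))
measured_bonus = sum(
    shaping(b, a) for b, a in zip(measured_distances, measured_distances[1:])
)

print("bonus from true state:    ", round(true_bonus, 3))
print("bonus from measured state:", round(measured_bonus, 3))
```

The shaping computed from the measured signal is far larger than the shaping computed from the true state, so a policy that can influence the estimator error has an incentive to do so.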

Mechanism

Shaping changes the advantage estimates that PPO sees. A dense shaping term can dominate sparse success, so the policy gradient follows the proxy unless component weights and evaluation metrics keep task success in charge.
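
The domination effect can be checked with plain discounted returns before any PPO run. The sketch below compares a hypothetical finisher, which reaches the goal and terminates, with a hypothetical loiterer, which hovers at distance 2.0 until the horizon. The horizon, step size, and success reward are illustrative assumptions, not values from a real task.

```python
# Sketch: discounted returns for a finisher and a loiterer under two shaping schemes.
# All numbers (horizon, step size, success reward) are illustrative assumptions.
gamma = 0.99
horizon = 200
success_reward = 1.0
start = 5.0

def discounted(rewards: list[float]) -> float:
    return sum(gamma**t * r for t, r in enumerate(rewards))

def proximity_rewards(distances: list[float]) -> list[float]:
    # Naive bonus: paid every step for being near the goal.
    return [1.0 / (d + 1.0) for d in distances]

def potential_rewards(distances: list[float]) -> list[float]:
    # Potential-based bonus with Phi(s) = -distance.
    path = [start] + distances
    return [gamma * (-after) - (-before) for before, after in zip(path, path[1:])]

# Finisher: closes 0.5 per step and terminates at distance 0.0 after 10 steps.
finisher = [start - 0.5 * (t + 1) for t in range(10)]
# Loiterer: approaches to distance 2.0, then hovers until the horizon.
loiterer = [max(2.0, start - 0.5 * (t + 1)) for t in range(horizon)]

sparse_finisher = [0.0] * 9 + [success_reward]  # sparse reward on the last step only

naive_finish = discounted(
    [s + b for s, b in zip(sparse_finisher, proximity_rewards(finisher))]
)
naive_loiter = discounted(proximity_rewards(loiterer))
shaped_finish = discounted(
    [s + b for s, b in zip(sparse_finisher, potential_rewards(finisher))]
)
shaped_loiter = discounted(potential_rewards(loiterer))

print("naive proximity: finisher", round(naive_finish, 2), "loiterer", round(naive_loiter, 2))
print("potential-based: finisher", round(shaped_finish, 2), "loiterer", round(shaped_loiter, 2))
```

Under the naive bonus the loiterer collects a larger return than the finisher, so the advantages point away from task completion. Under the potential-based term the ordering flips, because the shaping sum telescopes and hovering adds almost nothing.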

Worked Example

Code Fragment 1 compares a potential-based shaping bonus with a naive distance bonus. The potential term rewards progress between states; the naive term can keep paying the agent for merely being near the goal.

# Compare potential-based shaping with a naive proximity bonus.
# Potential shaping pays for progress, while proximity can reward loitering.
gamma = 0.99
distances = [5.0, 3.0, 2.0, 2.0]

def potential(distance: float) -> float:
    return -distance

for before, after in zip(distances, distances[1:]):
    potential_bonus = gamma * potential(after) - potential(before)
    proximity_bonus = 1.0 / (after + 1.0)
    print(before, "to", after, "potential:", round(potential_bonus, 3), "proximity:", round(proximity_bonus, 3))
5.0 to 3.0 potential: 2.03 proximity: 0.25
3.0 to 2.0 potential: 1.02 proximity: 0.333
2.0 to 2.0 potential: 0.02 proximity: 0.333

The expected output makes the exploit visible in the last row. When the robot stops moving closer, the potential-based term collapses to 0.02, the small residual $(1-\gamma)\cdot 2.0$ left by discounting, while the proximity bonus keeps paying 0.333 per step, which is exactly how a loitering behavior can become locally optimal.

Code Fragment 1: The potential_bonus becomes small when the agent stops making progress from distance 2.0 to 2.0. The proximity_bonus keeps paying at the same location, which can train a policy to loiter near the goal rather than finish the task.

The final row is the warning. A shaped reward that keeps paying without task progress creates a local strategy the optimizer can exploit. PPO will faithfully improve that proxy unless the evaluation artifact separates shaped reward from true success.

Library Shortcut

Gymnasium-style wrappers are a clean place to implement reward decomposition because they can return shaped reward while also adding component diagnostics to info. CleanRL and Stable-Baselines3 can then log those info fields during PPO training.
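
The wrapper pattern might look like the following minimal sketch. To stay runnable without extra dependencies it does not import Gymnasium. It only assumes the Gymnasium calling convention, in which `reset` returns `(obs, info)` and `step` returns `(obs, reward, terminated, truncated, info)`. `ToyReachEnv`, `PotentialShapingWrapper`, and the `"distance"` info key are hypothetical names. In a real project the wrapper would more likely subclass `gymnasium.Wrapper`, and how the extra info fields reach the training logs depends on the logging setup.

```python
# Sketch of a Gymnasium-style wrapper that reports reward components in `info`.
# ToyReachEnv and the "distance" info key are hypothetical stand-ins.

class ToyReachEnv:
    """Hypothetical 1-D env: each step moves `action` closer to a goal at distance 0."""

    def reset(self, seed=None, options=None):
        self.distance = 5.0
        return self.distance, {"distance": self.distance}

    def step(self, action):
        self.distance = max(0.0, self.distance - action)
        terminated = self.distance == 0.0
        task_reward = 1.0 if terminated else 0.0
        return self.distance, task_reward, terminated, False, {"distance": self.distance}


class PotentialShapingWrapper:
    """Adds gamma * Phi(s') - Phi(s) to the reward and logs each component."""

    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma
        self._last_potential = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_potential = -info["distance"]  # Phi(s) = -distance
        return obs, info

    def step(self, action):
        obs, task_reward, terminated, truncated, info = self.env.step(action)
        new_potential = -info["distance"]
        shaping = self.gamma * new_potential - self._last_potential
        self._last_potential = new_potential
        # Keep the components separate so they can be logged and plotted individually.
        info = dict(info, task_reward=task_reward, shaping_reward=shaping)
        return obs, task_reward + shaping, terminated, truncated, info


env = PotentialShapingWrapper(ToyReachEnv())
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(2.0)
print(round(reward, 3), info["task_reward"], round(info["shaping_reward"], 3))
```

The policy still trains on the summed scalar, while `info` carries the decomposition needed to audit what that scalar was made of.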

Practical Recipe

  1. Define true task success before adding shaped reward components.
  2. Prefer potential-based progress terms when a meaningful potential $\Phi(s)$ exists.
  3. Log every reward component separately and plot component totals beside task success.
  4. Run adversarial evaluations where the agent can exploit the shaping term without completing the task.
  5. Keep shaped reward out of final claims unless task success, safety, and robustness metrics improve in the same run.

Common Failure Mode

A reward component can become a hidden controller. If the energy penalty is too large, the robot may learn to do nothing. If the speed bonus is too large, it may learn unsafe impacts. If the distance bonus is too large, it may stop at the edge of success.
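
The energy-penalty case can be checked with a back-of-the-envelope calculation before training. The sketch below assumes a sparse success reward of 1.0, 50 actuated steps to finish, and one unit of energy per step. These numbers are illustrative assumptions, and a real task would need measured values.

```python
# Sketch: find energy-penalty weights at which doing nothing beats finishing.
gamma = 0.99
success_reward = 1.0   # assumed sparse terminal reward
steps_to_finish = 50   # assumed number of actuated steps needed
energy_per_step = 1.0  # assumed normalized energy use per actuated step

def finish_return(energy_weight: float) -> float:
    # Pay the energy penalty on every step, then collect success on the last one.
    penalty = sum(
        gamma**t * energy_weight * energy_per_step for t in range(steps_to_finish)
    )
    return gamma ** (steps_to_finish - 1) * success_reward - penalty

idle_return = 0.0  # no motion: no energy penalty and no success reward

for energy_weight in (0.001, 0.01, 0.05):
    value = finish_return(energy_weight)
    best = "finish" if value > idle_return else "idle"
    print(energy_weight, round(value, 3), best)
```

With these assumed numbers, the two smaller weights still favor finishing, while the largest weight makes inaction the higher-return behavior. The same comparison can be repeated for a speed or distance bonus to find the weight at which the shortcut starts to win.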

Practical Example

For a pick-and-place policy, shaped components might include reaching progress, grasp stability, lift height, placement distance, action smoothness, and collision penalties. The evaluation should still report binary task completion and object damage separately from the shaped training reward.

Memory Hook

A shaped reward is a note to the optimizer. Write it assuming the optimizer will read it literally and ignore every unstated intention.

Research Frontier

Reward design remains a live research problem in robot learning because task success, safety, human preference, and hardware wear are difficult to compress into one scalar. Active systems increasingly combine learned rewards, demonstrations, constraints, and post-training evaluation suites to reduce reward hacking.

Self Check

Can you separate the shaped training reward from the sparse success metric, and can you describe one episode where the shaped reward would be high but the task should fail?

Reward shaping is a curriculum encoded as numbers. It can make hard exploration possible, but it can also teach a policy to satisfy the curriculum without graduating to the task. The safest workflow is to design shaping terms as hypotheses, then try to break them with targeted evaluations.

Potential-based shaping is valuable because it gives a formal condition for policy preservation, but the condition rests on assumptions. The potential must be a function of the Markov state used by the MDP, the discount convention must match the task, and the final evaluation must still use the real task objective. In partially observed embodied systems, the measured potential may be a noisy proxy for the true state potential.

Reward Shaping Failure Modes
| Shaping Term | Intended Help | Failure To Test |
| --- | --- | --- |
| Distance-to-goal bonus | Guide exploration toward the target. | Loitering near the goal without completing the task. |
| Velocity or speed bonus | Encourage progress. | Unsafe impacts, overshoot, or unstable gait. |
| Energy penalty | Encourage efficient motion. | Inaction when task reward is delayed. |
| Contact penalty | Prevent collisions or damage. | Avoiding necessary contact in manipulation. |
| Pose or style reward | Make motion look natural. | Style imitation at the expense of robustness. |

The failure modes above suggest an incremental design workflow:

  1. Start with sparse task success and add one shaping term at a time.
  2. For each term, write the exploit you expect the policy might discover.
  3. Log shaped components, terminal success, safety flags, and videos in the same artifact.
  4. Evaluate on scenarios where the shaping proxy and the task goal disagree.
  5. Keep the smallest reward that trains reliably under the target perturbation panel.

Code Fragment 2 sketches the logging pattern that keeps shaping auditable. The total reward is useful for PPO, but the component record is what lets the team discover reward hacking.

# Record reward components separately from the scalar reward.
# Component logs reveal whether PPO is optimizing the task or exploiting a proxy.
from dataclasses import dataclass, asdict

@dataclass
class RewardComponents:
    task_success: float
    progress: float
    energy_penalty: float
    collision_penalty: float
    true_success: bool

    def as_row(self) -> dict[str, object]:
        return asdict(self)

components = RewardComponents(
    task_success=0.0,
    progress=0.8,
    energy_penalty=-0.1,
    collision_penalty=0.0,
    true_success=False,
)
total_reward = (
    components.task_success
    + components.progress
    + components.energy_penalty
    + components.collision_penalty
)
print(components.as_row())
print("total_reward:", total_reward)
{'task_success': 0.0, 'progress': 0.8, 'energy_penalty': -0.1, 'collision_penalty': 0.0, 'true_success': False}
total_reward: 0.7000000000000001

The expected output shows why shaped reward alone is unsafe as a headline metric. The policy earns a positive total reward from progress despite true_success=False, so the correct interpretation is partial progress without task completion, not a solved episode.

Code Fragment 2: The RewardComponents record shows a high shaped total_reward even though true_success is false. This is the exact pattern to flag when a policy learns progress-shaped behavior without completing the embodied task.

When a shaped-reward policy fails, do not only lower the learning rate or change PPO parameters. First inspect which reward component dominated the advantages before failure. Then run a disagreement test: construct episodes where high shaped reward is possible without true success, and verify that the policy does not prefer that shortcut.
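
Both checks can be scripted against the logged component rows. The following is a minimal sketch in which the component names, the example episode, and the `shaped_threshold` value are hypothetical. A real pipeline would read these rows from its own logs and might weight components by advantage rather than raw totals.

```python
# Sketch: find the dominant reward component in an episode and flag episodes
# where the shaped return is high but the task failed.
from collections import defaultdict

def component_totals(rows: list[dict[str, float]]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        for name, value in row.items():
            totals[name] += value
    return dict(totals)

def dominant_component(totals: dict[str, float]) -> str:
    # The component with the largest absolute contribution to the return.
    return max(totals, key=lambda name: abs(totals[name]))

def is_disagreement(
    totals: dict[str, float], true_success: bool, shaped_threshold: float
) -> bool:
    # High shaped return without task success is the pattern to flag.
    shaped_return = sum(totals.values())
    return shaped_return >= shaped_threshold and not true_success

# Hypothetical loitering episode: steady progress bonus, no success.
episode = [{"task_success": 0.0, "progress": 0.3, "energy_penalty": -0.05}] * 20
totals = component_totals(episode)

print("dominant component:", dominant_component(totals))
print("disagreement:", is_disagreement(totals, true_success=False, shaped_threshold=2.0))
```

An episode flagged this way is a candidate for the disagreement test set: if the trained policy reproduces it when a completing behavior is available, the shaping term is steering the policy rather than the task.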

Evaluation Recipe

For reward-shaping claims, compute shaped return, sparse task success, safety violations, component totals, exploit-test outcomes, videos, and failure labels together, in one run on one configuration. A shaped-return improvement is a paper-worthy result only when true success and safety improve in that same artifact.
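
The reporting rule can be encoded as a simple gate. This is a sketch only: `RunSummary`, `supports_claim`, and the example numbers are hypothetical, and treating "safety improves" as "no more violations than the baseline" is an assumption made so the gate still works when the baseline already has zero violations.

```python
# Sketch: only accept a shaping result when success improves and safety does not regress.
from dataclasses import dataclass

@dataclass
class RunSummary:
    shaped_return: float     # mean shaped training return
    success_rate: float      # fraction of evaluation episodes with true task success
    safety_violations: int   # count of safety violations in the same evaluation

def supports_claim(baseline: RunSummary, candidate: RunSummary) -> bool:
    # Shaped return is deliberately ignored: it is a training signal, not evidence.
    success_improved = candidate.success_rate > baseline.success_rate
    safety_not_worse = candidate.safety_violations <= baseline.safety_violations
    return success_improved and safety_not_worse

# Hypothetical numbers for illustration only.
baseline = RunSummary(shaped_return=12.0, success_rate=0.40, safety_violations=3)
reward_hacker = RunSummary(shaped_return=30.0, success_rate=0.35, safety_violations=3)
real_gain = RunSummary(shaped_return=18.0, success_rate=0.55, safety_violations=1)

print(supports_claim(baseline, reward_hacker))
print(supports_claim(baseline, real_gain))
```

The first candidate has the largest shaped return but lower task success, so it does not pass. The second has a smaller shaped-return gain but passes because success and safety both move in the right direction.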

Key Takeaway

Reward shaping can make PPO learn faster, but it also expands the space of shortcuts. The final judge is task success under perturbations, not the shaped reward curve alone.

Exercise 15.6.1

Design a shaped reward for a pick-and-place task. List each component, the exploit it might create, the diagnostic that would catch the exploit, and the sparse success metric that remains the final evaluation target.

What's Next?

This section closed the chapter by showing that PPO optimizes the reward it receives, not the task intention. Return to the Chapter 15 overview to connect stochastic policies, policy gradients, GAE, PPO clipping, implementation details, and reward design into one training workflow.

References & Further Reading
Foundational Papers, Tools, and Practice References

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

The original REINFORCE paper deriving the likelihood-ratio policy gradient. Read Section 2 for the REINFORCE update rule and Section 5 for baseline subtraction. This is the direct predecessor to actor-critic and PPO; understanding it makes the clipped surrogate objective in Schulman et al. 2017 concrete.

Paper

Sutton, R. S. et al. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.

Formalizes the policy gradient theorem showing that the gradient of expected return can be expressed as an expectation over state-action pairs. Read to understand why on-policy sampling is sufficient for an unbiased gradient estimate and how the baseline reduces variance without introducing bias.

Paper

Schulman, J. et al. (2015). Trust Region Policy Optimization. ICML.

Introduces the trust-region constraint that bounds policy update size using KL divergence, providing a monotonic improvement guarantee. Read Section 3 for the surrogate objective and Theorem 1 for the lower bound; PPO simplifies this into a clipped ratio that achieves similar stability with far less implementation complexity.

Paper

Schulman, J. et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.

Derives the generalized advantage estimator (GAE) as an exponentially weighted average of n-step returns, controlled by the lambda parameter. Read Section 3 for the bias-variance trade-off analysis; in practice lambda around 0.95 is the default in most PPO implementations and understanding why requires this paper.

Paper

Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.

Introduces the clipped surrogate objective that prevents large policy updates without the second-order KL constraint of TRPO. Read Section 3 for the clipping mechanism and Section 5 for the implementation details including value-function loss coefficient and entropy bonus that appear in nearly every modern PPO codebase.

Paper

CleanRL documentation and source code.

Provides single-file, dependency-minimal RL implementations that make every algorithmic choice visible on one screen. Read the PPO and SAC files side by side with the corresponding papers; CleanRL is the fastest way to verify that you understand which implementation details matter versus which are optional.

Tool