A Careful Control Loop
Sparse rewards tell the truth late; dense rewards speak early but may lie. A sparse success bonus preserves the task definition, yet it can leave a robot with almost no learning signal. A dense shaping term can guide exploration, yet an arbitrary dense term can teach the agent to optimize the hint instead of the goal.
When choosing between sparse and dense rewards, the design must expose each objective term, its safety interaction, its exploration effect, and its deployment risk instead of hiding them inside one scalar return.
This section develops the contract for shaping rewards without rewriting the task. Sparse rewards give credit only at success or failure. Dense rewards add intermediate feedback, such as distance-to-goal progress, uprightness, clearance, or energy use.
The key question is practical: does the extra feedback make learning easier while preserving the policy ranking implied by the real success condition?
Dense shaping should behave like a teacher pointing toward the solution, not like a new exam. If the shaped reward makes a different final behavior optimal, it is no longer a hint; it is a new task.
Theory
Potential-based shaping is the standard way to add dense guidance while preserving optimal policies under the usual discounted Markov decision process assumptions. Choose a scalar potential $\Phi(s)$ that measures progress in a state. Then add the shaping term
$$F(s,a,s') = \gamma \Phi(s') - \Phi(s).$$
The shaped reward is $r'(s,a,s') = r(s,a,s') + F(s,a,s')$. The intuition is telescoping: in the discounted sum along a trajectory, each intermediate potential is added once and subtracted once, leaving only boundary terms tied to the start and final states rather than a new preference for a particular path. This is why a distance-like potential can speed learning without paying the agent forever for pacing near the goal.
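The telescoping claim can be written out in one line. As a short derivation, assuming a finite trajectory $s_0, s_1, \dots, s_T$ (the time indexing is introduced here for illustration), the discounted sum of shaping terms is

$$\sum_{t=0}^{T-1} \gamma^{t} F(s_t,a_t,s_{t+1}) = \sum_{t=0}^{T-1} \left( \gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t) \right) = \gamma^{T}\Phi(s_T) - \Phi(s_0).$$

Every intermediate $\Phi(s_t)$ cancels, so the shaping return depends only on where the trajectory starts and ends. If $\Phi$ is zero at terminal states, or if $\Phi$ is bounded and $T \to \infty$, the remainder is $-\Phi(s_0)$, which is the same for every policy started from $s_0$.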
Policy Invariance Under Reward Transformations (Ng, Harada, and Russell, ICML 1999): potential-based shaping $F(s,a,s') = \gamma\Phi(s') - \Phi(s)$ is sufficient to preserve the optimal policy set, and the paper shows it is also necessary when the guarantee must hold without further knowledge of the dynamics and reward. For embodied agents it is the safe way to add dense guidance: a progress potential speeds learning without quietly redefining the task the robot is being trained to solve.
Curiosity-driven Exploration by Self-Supervised Prediction (Pathak et al., ICML 2017): an intrinsic reward equal to forward-model prediction error, computed in a learned feature space, drives exploration even when the extrinsic reward is absent. For embodied agents in sparse-reward tasks, this lets a policy keep seeking novel states instead of stalling when shaping alone provides no gradient.
The potential $\Phi$ is not the reward. It is a progress gauge used to create a local training signal. The final evaluation should still report the unshaped task reward and embodied metrics such as success rate, collisions, time, energy, and interventions.
Worked Example
Suppose a gripper starts three grid cells from a target. The sparse reward is zero until the final success, but the potential $\Phi(s)=-\text{distance}(s,\text{goal})$ produces a small positive shaping reward whenever distance shrinks. Code Fragment 1 shows the actual numbers.
# Compute potential-based shaping for a short reaching path.
# Progress creates dense feedback while final success remains separate.
gamma = 0.9
distances = [3, 2, 1, 0]
sparse_rewards = [0, 0, 1]
for step, reward in enumerate(sparse_rewards):
    phi_now = -distances[step]
    phi_next = -distances[step + 1]
    shaping = gamma * phi_next - phi_now
    shaped_reward = reward + shaping
    print(step, "sparse=", reward, "shaping=", round(shaping, 2), "shaped=", round(shaped_reward, 2))
Code Fragment 1: The `distances` sequence turns an otherwise sparse success signal into shaped rewards at each step. The `gamma * phi_next - phi_now` term rewards progress without replacing the final task reward stored in `sparse_rewards`.
Expected output: the shaping terms are 1.2, 1.1, and 1.0, so shaped rewards of 1.2 and 1.1 appear before success and the final step pays 2.0, while the unshaped success reward of 1 remains visible in its own column. That separation is what lets the evaluation report learning speed and task success without mixing them.
In production, implement shaping as a Gymnasium wrapper around the environment rather than burying it inside the policy code. The wrapper can log raw reward, shaping term, shaped reward, potential value, and termination cause from the same step call.
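The wrapper idea can be sketched in a few lines. To stay dependency-free, the sketch below wraps any object that exposes the Gymnasium-style `reset` and `step` signature instead of subclassing `gymnasium.Wrapper`; in a real project the same logic would sit inside a Gymnasium wrapper. `ToyReachEnv`, `PotentialShapingWrapper`, and the `info` keys are hypothetical names chosen for illustration, and the sketch assumes the potential is defined to be zero at terminal states.

```python
from typing import Any, Callable


class ToyReachEnv:
    """Hypothetical 1-D reaching task using the Gymnasium reset/step signature."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.pos = start

    def reset(self, seed=None, options=None):
        self.pos = self.start
        return self.pos, {}

    def step(self, action: int):
        # action is -1 (toward the goal at cell 0) or +1 (away from it)
        self.pos = max(0, self.pos + action)
        terminated = self.pos == 0
        raw_reward = 1.0 if terminated else 0.0
        return self.pos, raw_reward, terminated, False, {}


class PotentialShapingWrapper:
    """Adds gamma * Phi(s') - Phi(s) and logs the reward decomposition in info."""

    def __init__(self, env: Any, potential: Callable[[Any], float], gamma: float) -> None:
        self.env = env
        self.potential = potential
        self.gamma = gamma
        self._phi_prev = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._phi_prev = self.potential(obs)
        return obs, info

    def step(self, action):
        obs, raw_reward, terminated, truncated, info = self.env.step(action)
        # Assumption: the potential is treated as zero at terminal states.
        phi_next = 0.0 if terminated else self.potential(obs)
        shaping = self.gamma * phi_next - self._phi_prev
        cause = "terminated" if terminated else ("truncated" if truncated else None)
        info = dict(
            info,
            raw_reward=raw_reward,
            shaping=shaping,
            shaped_reward=raw_reward + shaping,
            potential=phi_next,
            termination_cause=cause,
        )
        self._phi_prev = phi_next
        return obs, raw_reward + shaping, terminated, truncated, info


# Usage: always step toward the goal and keep every decomposition row.
env = PotentialShapingWrapper(ToyReachEnv(start=3), potential=lambda s: -float(s), gamma=0.9)
obs, info = env.reset()
log = []
done = False
while not done:
    obs, shaped, terminated, truncated, info = env.step(-1)
    log.append(info)
    done = terminated or truncated
for row in log:
    print(row)
```

Because the policy only ever receives the returned shaped reward while the raw reward travels in `info`, the evaluator can read the unshaped signal from the same step call. On this toy path the logged shaping terms match Code Fragment 1 up to floating-point rounding.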
Practical Recipe
- Start with the sparse success condition that matches the task.
- Identify the learning bottleneck: exploration, credit assignment, or delayed feedback.
- If adding a dense term, write whether it is potential-based or a deliberate task change.
- Log raw reward, shaping term, and shaped reward separately.
- Report final success using the unshaped metric, then report shaping as a training aid.
A dense distance reward can teach a manipulator to hover near the object because hovering keeps earning progress-like feedback while grasping risks failure. The fix is not to avoid dense rewards entirely. The fix is to make shaping policy-invariant when possible and to audit the final behavior with the sparse success metric.
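The hovering failure can be made concrete with two hand-written rollouts. The sketch below is illustrative only: `progress_bonus` is a hypothetical naive dense term that pays for each cell of progress and never charges for retreat, and the two distance sequences are invented paths, one that finishes and one that oscillates near the goal.

```python
GAMMA = 0.9


def discounted_return(distances, dense_term):
    """Discounted sum of sparse success reward plus a dense term along one path."""
    total = 0.0
    for t in range(len(distances) - 1):
        d_now, d_next = distances[t], distances[t + 1]
        sparse = 1.0 if d_next == 0 else 0.0
        total += GAMMA**t * (sparse + dense_term(d_now, d_next))
    return total


def progress_bonus(d_now, d_next):
    """Naive dense term (hypothetical): +1 per cell of progress, no charge for retreat."""
    return float(max(0, d_now - d_next))


def potential_shaping(d_now, d_next):
    """Potential-based term with Phi(s) = -distance."""
    return GAMMA * (-d_next) - (-d_now)


finish = [2, 1, 0]            # reach the goal in two steps
hover = [2, 1] * 10 + [2]     # oscillate for twenty steps, never finish

naive_finish = discounted_return(finish, progress_bonus)
naive_hover = discounted_return(hover, progress_bonus)
shaped_finish = discounted_return(finish, potential_shaping)
shaped_hover = discounted_return(hover, potential_shaping)

print("naive bonus:      finish =", round(naive_finish, 2), " hover =", round(naive_hover, 2))
print("potential-based:  finish =", round(shaped_finish, 2), " hover =", round(shaped_hover, 2))
```

Under the naive bonus the oscillating path earns roughly 4.6 against 2.8 for finishing, so the dense term has become the task. Under the potential-based term the oscillating path collapses to its boundary value of roughly 1.8, and finishing wins with 2.9.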
A legged robot may receive sparse reward for crossing a finish line and potential-based shaping for reducing distance to the line. The evaluation should still include falls, torque, foot slip, and timeout rate, because a shaped distance signal alone cannot say whether the gait is deployable.
Dense reward is the coach shouting from the sideline. Sparse reward is the scoreboard. Do not let the coach secretly move the goalposts.
Current robot learning systems often combine sparse task success with learned shaping terms, curriculum signals, demonstrations, and constraint costs. The open research problem is deciding which auxiliary signals accelerate learning without creating brittle policies that fail when the hint distribution changes.
For a shaped reward you design, can you say whether the shaping is potential-based, what potential it uses, and which unshaped metric will be reported at the end?
The shaped reward should be treated as a training interface, not as the headline metric. During optimization, the learner sees $r'$. During evaluation, the report should expose raw task reward, shaping contribution, safety cost, and embodied success. This prevents a shaped run from looking better simply because it received more arithmetic along the way.
The graduate-level habit is to state the assumption behind the invariant. Potential-based shaping preserves optimal policies for the same discounted Markov decision process when the shaping term has the form $\gamma\Phi(s')-\Phi(s)$. If the potential depends on hidden evaluator state, future information, changing curricula, or nonstationary human hints, the invariance argument no longer applies cleanly.
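The invariance claim can be checked numerically on a small problem. The following is a minimal sketch, assuming a hypothetical deterministic chain in which the state is the distance to a terminal goal; `proximity_bonus` is an invented non-potential dense term included only for contrast. Value iteration is run once per reward definition and the greedy policies are compared.

```python
GAMMA = 0.9
STATES = [1, 2, 3, 4]                    # distance to goal; distance 0 is terminal
ACTIONS = {"toward": -1, "away": +1}


def step(s, a):
    """Deterministic transition, clipped to the range 0..4."""
    return min(4, max(0, s + ACTIONS[a]))


def raw_reward(s, s_next):
    return 1.0 if s_next == 0 else 0.0


def phi(s):
    return -float(s)                     # Phi is zero at the terminal state


def no_dense(s, s_next):
    return 0.0


def potential_term(s, s_next):
    return GAMMA * phi(s_next) - phi(s)


def proximity_bonus(s, s_next):
    """Hypothetical non-potential term: pays for arriving next to the goal."""
    return 0.3 if s_next == 1 else 0.0


def solve(dense_term, sweeps=500):
    """Tabular value iteration on Q; returns Q and the greedy policy."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next = step(s, a)
                future = 0.0 if s_next == 0 else max(q[(s_next, b)] for b in ACTIONS)
                q[(s, a)] = raw_reward(s, s_next) + dense_term(s, s_next) + GAMMA * future
    policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
    return q, policy


q_raw, policy_raw = solve(no_dense)
q_pot, policy_pot = solve(potential_term)
q_prox, policy_prox = solve(proximity_bonus)

print("raw policy:        ", policy_raw)
print("potential shaping: ", policy_pot)
print("proximity bonus:   ", policy_prox)
```

In this toy chain the potential-based run shifts every Q-value by $-\Phi(s)$ and leaves the greedy policy unchanged, while the proximity bonus makes stepping away from the goal optimal at distance 1, because re-entering the rewarded cell pays more than finishing.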
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Gymnasium wrappers | Reward decomposition | Wrap step so raw reward, shaping, and shaped reward are logged together. |
| Safety Gymnasium | Sparse success plus cost | Use separate reward and cost channels when the shaping signal must not hide constraint violations. |
| MuJoCo | Potential features | Compute potentials from object pose, center of mass, velocity, and contact state with clear units. |
| Stable-Baselines3 | Training loop | Train on shaped rewards, then use evaluation callbacks to report raw success and safety metrics. |
| CleanRL | Inspectable implementation | Use a short single-file run when you need to verify exactly where shaping enters the return. |
A robust implementation keeps shaping mechanically separate from task success. The environment can compute both, the policy can train on the shaped reward, and the evaluator can report success without the shaping bonus.
- Define the sparse success event and termination condition first.
- Choose a potential with physical units, such as negative distance or negative height error.
- Compute the shaping term from consecutive states, not from the action label alone.
- Log a per-step reward decomposition table.
- Compare shaped and unshaped runs on one seed panel and one success definition.
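The per-step decomposition table from the checklist above can be reduced to an episode summary in one pass. The sketch below is an assumption about how such a log might be laid out: `StepRow`, `summarize`, and the cost value are hypothetical, and the shaping numbers reuse the three-step reaching path with $\gamma = 0.9$ and $\Phi(s) = -\text{distance}$.

```python
from dataclasses import dataclass


@dataclass
class StepRow:
    raw_reward: float   # unshaped task reward for this step
    shaping: float      # gamma * Phi(s') - Phi(s)
    cost: float         # safety cost, kept in its own channel


def summarize(rows: list[StepRow], gamma: float) -> dict[str, float]:
    """Aggregate one episode so raw, shaping, and cost are reported side by side."""
    raw_return = sum(gamma**t * row.raw_reward for t, row in enumerate(rows))
    shaping_return = sum(gamma**t * row.shaping for t, row in enumerate(rows))
    return {
        "raw_return": raw_return,
        "shaping_return": shaping_return,
        "shaped_return": raw_return + shaping_return,
        "total_cost": sum(row.cost for row in rows),
        "steps": len(rows),
    }


# Hypothetical log: the middle step carries an invented collision cost.
rows = [
    StepRow(raw_reward=0.0, shaping=1.2, cost=0.0),
    StepRow(raw_reward=0.0, shaping=1.1, cost=0.5),
    StepRow(raw_reward=1.0, shaping=1.0, cost=0.0),
]
summary = summarize(rows, gamma=0.9)
print(summary)
```

Keeping the three channels in one record makes the audit mechanical: the shaping return here equals $-\Phi(s_0) = 3$, as telescoping predicts, so any run whose logged shaping return drifts away from its boundary value is a sign that the dense term is not potential-based.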
Code Fragment 2 records the fields that should appear beside any shaped-reward experiment.
# Build one shaping audit record for a robot reaching task.
# The shaping_formula field states why the dense term should not change the goal.
from dataclasses import dataclass, asdict


@dataclass
class ShapingAudit:
    section: str
    sparse_success: str
    potential: str
    shaping_formula: str
    report_metrics: list[str]

    def as_row(self) -> dict[str, object]:
        # Convert the record to a plain dict for logging or tabulation.
        return asdict(self)


record = ShapingAudit(
    section="18.2",
    sparse_success="object within 2 cm of target for 10 consecutive steps",
    potential="negative gripper-to-target distance in meters",
    shaping_formula="gamma * Phi(next_state) - Phi(current_state)",
    report_metrics=["raw_success_rate", "mean_shaping_return", "collision_cost"],
)
print(record.as_row())
Code Fragment 2: The `ShapingAudit` record keeps the sparse success event, potential, formula, and reported metrics in one place. The `report_metrics` list makes clear that collision cost and raw success must remain visible after shaping is added.
When shaping fails, inspect whether the policy optimized a dense term that was easier than success. Then compare the best shaped rollouts against unshaped success, safety cost, and final state. A useful failure label says whether the problem was a bad potential, a nonstationary hint, a missing cost, or a metric report that hid the raw reward.
For shaped rewards, compare raw success, shaped return, safety cost, and time-to-success only when they are co-computed in one pass on one configuration: same environment panel, same policy checkpoint, same seed set, same perturbation suite, and the same success definition. Report shaping as a training aid, not as a replacement for the task metric.
Dense shaping is useful when it improves learning while raw task success and embodied safety metrics remain the final judge.
For a reaching, navigation, or locomotion task, write a sparse success reward and one potential-based shaping term. Then name one dense reward you would reject because it changes the task rather than guiding it.
What's Next?
This section showed how to add dense guidance without losing the sparse task definition. Next, Section 18.3 uses goal conditioning and hindsight replay to make failed trials useful without pretending they reached the original goal.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations. ICML.
This is the anchor reference for potential-based shaping. It explains why the form $\gamma\Phi(s')-\Phi(s)$ can add dense feedback while preserving optimal policies under the stated discounted-MDP assumptions.
Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.
HER is not the same as shaping, but it addresses the same sparse-feedback pain point. Reading it alongside potential-based shaping helps distinguish relabeling experience from changing per-step reward.
Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.
This paper explains why dense terms need safety review. A shaping term that improves learning can still create side effects if it is not tied to a policy-invariant potential or an explicit task change.
Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
Preference models can provide dense-looking feedback when hand-written shaping is hard. The section's caution still applies: learned dense signals must be evaluated against raw task success and safety metrics.
Safety Gym is useful when shaped progress must be checked against constraint costs. It keeps the reader from treating a smoother reward curve as proof of safer behavior.
Farama Foundation Safety Gymnasium documentation.
Safety Gymnasium supports experiments where sparse success, dense shaping, and safety costs can be logged from the same rollout. That is the artifact this section asks readers to preserve.