A Careful Control Loop
Rewards are dangerous because they are proxies. A robot does not optimize the human intent directly. It optimizes the number the designer exposed, and in an embodied world that number is entangled with sensors, contacts, timeouts, resets, and hidden safety costs.
Reward design must therefore expose each objective term, its safety interaction, its exploration effect, and its deployment risk instead of hiding them inside one scalar return.
This section develops the technical contract for reward safety. The object of study is the gap between the specified reward $r_{\text{spec}}(s,a,s')$ and the task utility the builder actually cares about, such as completed grasp, no collision, no human intervention, low wear, and recovery after a slip.
The key question is practical: if a policy earns a high return, what independent evidence shows it solved the task rather than exploiting the reward channel?
A reward is a training interface, not a moral contract or a complete task description. Treat every reward term as a claim that must be checked against an embodied trace: video, contact log, constraint log, reset reason, intervention count, and final state.
Theory
In a Markov decision process the learner maximizes expected discounted return, $\mathbb{E}[\sum_t \gamma^t r_{\text{spec}}(s_t,a_t,s_{t+1})]$. The danger is that the true task utility $U$ is usually broader than what the reward measures. For a mobile manipulator, $U$ may include task completion, clearance from people, gentle contact, battery use, time, and whether the final object pose is usable by the next process.
This mismatch creates two failure families. Omission failures occur when the reward forgets a real requirement, such as a penalty for collisions. Channel failures occur when the agent changes the measurement process itself, such as hiding an object from a camera, triggering a reset, or holding a sensor in a state that produces credit without progress.
Reward design is a measurement problem inside a feedback loop. The agent sees which actions increase the number, so any unmeasured cost becomes a free variable and any fragile measurement becomes a lever.
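To make the omission failure concrete, the sketch below scores two hypothetical trajectories twice: once under a specified reward that pays only for progress, and once under a broader utility that also charges for collisions. The trajectory values, the discount factor, and the collision weight are all illustrative assumptions, not values from any real task.

```python
# Minimal sketch: discounted return under r_spec versus a broader utility U.
# All numbers below are made up for illustration.

GAMMA = 0.9             # assumed discount factor
COLLISION_WEIGHT = 2.0  # assumed cost per collision in the true utility


def discounted_sum(values, gamma=GAMMA):
    """Return sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))


# Each step is (progress_reward, collided); r_spec sees only progress.
trajectories = {
    "shove_through": [(1.0, True), (1.0, True), (1.0, False)],
    "go_around": [(0.5, False), (0.5, False), (1.0, False), (1.0, False)],
}

scores = {}
for name, steps in trajectories.items():
    spec = discounted_sum([p for p, _ in steps])
    utility = discounted_sum([p - COLLISION_WEIGHT * c for p, c in steps])
    scores[name] = (round(spec, 3), round(utility, 3))
    print(name, "spec_return=", scores[name][0], "utility=", scores[name][1])
```

Under these assumed numbers the colliding trajectory earns the higher specified return and the lower utility. The collision cost is invisible to the learner because it never enters $r_{\text{spec}}$.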
Worked Example
Consider a tabletop reaching task. The specified reward pays for getting the gripper close to the target, but the true task also cares about whether the object remains on the table and whether a safety monitor had to stop the arm. Code Fragment 1 below shows how a high proxy return can hide a weak embodied score.
# Compare proxy reward with embodied evidence fields.
# A high reward is suspicious when costs and interventions rise.
rollouts = [
    {"policy": "shortcut", "proxy_return": 9.8, "task_success": 0.7, "safety_cost": 0.6, "interventions": 1},
    {"policy": "careful", "proxy_return": 7.2, "task_success": 0.9, "safety_cost": 0.0, "interventions": 0},
]
for run in rollouts:
    # Embodied score: task success minus weighted safety cost and interventions.
    embodied_score = run["task_success"] - 0.5 * run["safety_cost"] - 0.2 * run["interventions"]
    print(run["policy"], "proxy=", run["proxy_return"], "embodied=", round(embodied_score, 2))
The shortcut policy wins on proxy_return while losing once safety_cost and interventions are counted. The example makes the audit rule concrete: reward curves and embodied metrics must be computed from the same rollout before they are compared.
Expected output: the shortcut policy prints proxy= 9.8 and embodied= 0.2, while the careful policy prints proxy= 7.2 and embodied= 0.9. If the evaluation reports only proxy return, the unsafe shortcut would appear to be the best policy.
Use Gymnasium or Safety Gymnasium to standardize the environment API, but do not outsource the reward audit. The maintained library handles reset, stepping, wrappers, logging hooks, and reproducible seeding; the builder still owns the success metric, cost metric, intervention log, and failure labels.
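One way to keep the audit in the builder's hands is a thin logging layer around the environment. The sketch below imitates the five-value `step()` return used by the Gymnasium API (observation, reward, terminated, truncated, info) with a hypothetical toy environment, so it runs without the library installed. `ToyReachEnv`, `AuditLogger`, the info keys, and the speed limit are all assumptions made for illustration.

```python
# Sketch of rollout-level logging around a Gymnasium-style step() call.
# ToyReachEnv and AuditLogger are hypothetical; neither comes from a library.


class ToyReachEnv:
    """1-D reach task: move the gripper position toward target 0.0."""

    def reset(self, seed=None):
        self.pos = 1.0
        self.t = 0
        return self.pos, {}

    def step(self, action):
        self.pos += action
        self.t += 1
        reached = abs(self.pos) < 0.05
        unsafe = abs(action) > 0.4        # assumed speed limit
        reward = -abs(self.pos)           # proxy: distance to target
        info = {"task_success": reached, "safety_cost": float(unsafe)}
        terminated = reached
        truncated = self.t >= 10
        return self.pos, reward, terminated, truncated, info


class AuditLogger:
    """Wraps an env and accumulates return plus independent metrics."""

    def __init__(self, env):
        self.env = env
        self.record = {}

    def reset(self, seed=None):
        self.record = {"proxy_return": 0.0, "safety_cost": 0.0,
                       "task_success": False, "end_reason": None}
        return self.env.reset(seed=seed)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.record["proxy_return"] += reward
        self.record["safety_cost"] += info["safety_cost"]
        self.record["task_success"] = info["task_success"]
        if terminated or truncated:
            self.record["end_reason"] = "terminated" if terminated else "timeout"
        return obs, reward, terminated, truncated, info


env = AuditLogger(ToyReachEnv())
env.reset(seed=0)
done = False
while not done:
    _, _, terminated, truncated, _ = env.step(-0.5)  # fast, unsafe policy
    done = terminated or truncated
print(env.record)
```

The scripted fast policy reaches the target in two steps, yet the record also shows a safety cost on both steps. Because return, success, cost, and end reason are accumulated from the same `step()` calls, they describe the same rollout by construction.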
Practical Recipe
- Write the human intent in ordinary language before writing the reward equation.
- Split the reward into named terms: progress, success, time, energy, contact, reset, and constraint cost.
- For each term, ask what action could raise it while making the real task worse.
- Log at least one independent embodied metric that is not part of the reward.
- Review the best, median, and worst rollouts by trace, not only by return.
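The second step of the recipe can be sketched as a reward function that returns named terms and collapses them into a scalar only at the end. The term names below are a subset of the list above. The weights and transition fields are illustrative assumptions rather than recommended values.

```python
# Sketch: keep reward terms named until the last possible moment.
# Term names, weights, and the transition fields are illustrative assumptions.

WEIGHTS = {"progress": 1.0, "success": 5.0, "time": -0.01,
           "energy": -0.1, "contact": -1.0}


def reward_terms(transition):
    """Return each unweighted term for one (s, a, s') transition."""
    return {
        "progress": transition["dist_before"] - transition["dist_after"],
        "success": 1.0 if transition["grasped"] else 0.0,
        "time": 1.0,
        "energy": transition["torque_sq"],
        "contact": 1.0 if transition["unexpected_contact"] else 0.0,
    }


def scalar_reward(terms, weights=WEIGHTS):
    """Collapse named terms into the scalar the learner sees."""
    return sum(weights[name] * value for name, value in terms.items())


step = {"dist_before": 0.30, "dist_after": 0.25, "grasped": False,
        "torque_sq": 0.2, "unexpected_contact": True}
terms = reward_terms(step)
print(terms)
print("scalar:", round(scalar_reward(terms), 3))
```

Logging `terms` alongside the scalar makes it possible to ask, term by term, which action would raise it while making the real task worse.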
A reward that pays for distance reduction can teach a robot to shove, trap, or hover near the object instead of completing the intended manipulation. The failure is not that reinforcement learning is malicious. The failure is that the reward made the wrong behavior measurable and cheap.
A warehouse robot trained to minimize travel time should also log near misses, emergency stops, blocked aisles, and human interventions. If the travel-time reward improves while near misses rise, the policy is optimizing the proxy at the expense of the deployment objective.
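The warehouse check can be automated with a small divergence detector. The following sketch assumes a per-checkpoint log with one proxy metric and one risk metric, where lower is better for both. The field names and all values are made up for illustration.

```python
# Sketch: flag checkpoints where the proxy improves while a deployment
# metric gets worse. The checkpoint log below is made up for illustration.

checkpoints = [
    {"step": 1000, "mean_travel_time_s": 42.0, "near_misses_per_100": 1.0},
    {"step": 2000, "mean_travel_time_s": 38.5, "near_misses_per_100": 1.0},
    {"step": 3000, "mean_travel_time_s": 35.0, "near_misses_per_100": 4.0},
]


def divergences(log, proxy_key, risk_key):
    """Return steps where the proxy got better and the risk metric got worse.

    Lower is better for both keys in this example.
    """
    flagged = []
    for prev, curr in zip(log, log[1:]):
        proxy_improved = curr[proxy_key] < prev[proxy_key]
        risk_worsened = curr[risk_key] > prev[risk_key]
        if proxy_improved and risk_worsened:
            flagged.append(curr["step"])
    return flagged


flagged = divergences(checkpoints, "mean_travel_time_s", "near_misses_per_100")
print("review these checkpoints by trace:", flagged)
```

A flagged checkpoint is a prompt to inspect traces, not a verdict. The detector only says that the two curves moved in opposite directions between consecutive checkpoints.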
The simulator may applaud every rollout, but the hardware still asks for the receipt: contacts, resets, interventions, and one failure case that explains what happened.
Reward misspecification remains an active safety problem because larger policies can find more subtle shortcuts. Current work combines constraint learning, preference data, adversarial evaluation, and richer embodied metrics so the reward channel is tested against the behavior people actually wanted.
For any reward term you write, can you name one behavior that would increase the term while making the real task worse? If not, the reward has not been stress tested.
The reward audit becomes useful when it separates three quantities. The specified reward is the number used for learning. The task metric is the number reported to decide whether the system is useful. The safety cost is the number that records damage, risk, constraint violation, or human intervention. A policy can improve one while degrading the others, so the evaluation artifact must carry all three.
The graduate-level habit is to write the reward as a falsifiable hypothesis. A term such as $+1$ for reaching the target says, "This event is a reliable proxy for useful task completion." The audit then tries to falsify that hypothesis with perturbations, held-out layouts, sensor glitches, delayed actuation, and videos of the highest-return rollouts.
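A falsification attempt of this kind can be sketched as a search for counterexamples: episodes in which the rewarded event fired but the outcome was not useful. The episode log below is a made-up placeholder, and the field names `reach_event` and `object_usable` are assumptions introduced for the example.

```python
# Sketch: try to falsify "reach event implies useful completion".
# Episode outcomes are made-up placeholders, not measured results.

episodes = [
    {"perturbation": "nominal", "reach_event": True, "object_usable": True},
    {"perturbation": "nominal", "reach_event": True, "object_usable": True},
    {"perturbation": "sensor_glitch", "reach_event": True, "object_usable": False},
    {"perturbation": "delayed_actuation", "reach_event": True, "object_usable": False},
    {"perturbation": "held_out_layout", "reach_event": False, "object_usable": False},
]


def counterexamples(log):
    """Episodes where the proxy fired but the task outcome was not usable."""
    return [ep for ep in log if ep["reach_event"] and not ep["object_usable"]]


def proxy_precision(log):
    """Fraction of proxy-positive episodes that were truly useful."""
    fired = [ep for ep in log if ep["reach_event"]]
    if not fired:
        return None  # the hypothesis was never exercised
    return sum(ep["object_usable"] for ep in fired) / len(fired)


bad = counterexamples(episodes)
print("precision of +1 event:", proxy_precision(episodes))
print("falsified under:", sorted({ep["perturbation"] for ep in bad}))
```

Grouping counterexamples by perturbation points at which measurement assumption broke, which is more actionable than a single aggregate success rate.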
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Gymnasium | Reward wrapper tests | Use wrappers to log reward terms, termination causes, and independent metrics from the same rollout. |
| Safety Gymnasium | Safety cost tracking | Use cost channels when the reward must be evaluated against hazards, not only task success. |
| ROS 2 | Hardware evidence | Log controller status, emergency stops, and sensor faults alongside reward traces. |
| MuJoCo | Contact-heavy audits | Inspect contacts, object poses, and actuator limits when a reward can be gamed through physics. |
| LeRobot | Dataset review | Compare reward labels to demonstrations and replay videos before training a reward-driven policy. |
A robust implementation starts with an inspectable reward card. The card names each term, its unit, the sensor or simulator field that produces it, the behavior it is meant to encourage, and the exploit it might invite. An implementation built on a maintained library should still write this card into the run artifact so reward curves never travel without their measurement assumptions.
- List every reward term with its source field and unit.
- Add an independent task metric and an independent safety cost before training.
- Run a deterministic smoke test where the intended behavior earns the highest score.
- Run a shortcut probe where a known bad behavior tries to exploit the reward.
- Save reward terms, embodied metrics, seeds, traces, and failure labels in one artifact.
Code Fragment 2 turns that recipe into a small reward-audit record that can travel with a training run.
# Build one reward audit record for a reaching task.
# The card records both the proxy and the missing embodied checks.
from dataclasses import dataclass, asdict


@dataclass
class RewardAudit:
    section: str
    specified_reward: str
    intended_utility: str
    exploit_probe: str
    independent_metrics: list[str]

    def as_row(self) -> dict[str, object]:
        # Convert the record to a plain dict for logging or tabulation.
        return asdict(self)


record = RewardAudit(
    section="18.1",
    specified_reward="+distance_progress + success_bonus - time_penalty",
    intended_utility="object placed, no unsafe contact, no human intervention",
    exploit_probe="hover near target without stable grasp",
    independent_metrics=["task_success", "safety_cost", "intervention_rate"],
)
print(record.as_row())
The RewardAudit record stores the reward equation, the intended utility, a concrete exploit probe, and the independent metrics that will catch the shortcut. Keeping these fields together prevents a later table from reporting return without the evidence needed to interpret it.
When a reward-driven policy fails, first decide whether the failure came from omission, channel exploitation, distribution shift, or evaluation leakage. Then rerun one controlled perturbation that isolates the suspected cause. A useful postmortem states which reward term invited the behavior and which independent metric exposed it.
For reward safety, compare return, task success, safety cost, and intervention rate only when they are co-computed in one pass on one configuration: same environment panel, same policy checkpoint, same seed set, same perturbation suite, and the same success definition. Save the result as one artifact with traces, videos or state logs, and failure labels so every number in a later table is backed by the same run.
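One lightweight way to enforce the single-configuration rule is to stamp every metric row with a fingerprint of the evaluation configuration and refuse to tabulate rows whose fingerprints differ. The sketch below uses only the Python standard library. The configuration fields, the environment and checkpoint names, and the row layout are hypothetical.

```python
# Sketch: refuse to tabulate metrics that were not computed on one configuration.
# The config fields and result rows are hypothetical.
import hashlib
import json


def config_fingerprint(config):
    """Stable short hash of an evaluation configuration."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def comparable(rows):
    """True only if every metric row carries the same fingerprint."""
    return len({row["fingerprint"] for row in rows}) == 1


config_a = {"env_panel": "tabletop-v0", "checkpoint": "ckpt_0042",
            "seeds": [0, 1, 2], "perturbations": ["nominal", "sensor_glitch"],
            "success_definition": "object_in_bin_and_released"}
config_b = dict(config_a, seeds=[3, 4, 5])  # same policy, different seed set

rows = [
    {"metric": "return", "fingerprint": config_fingerprint(config_a)},
    {"metric": "task_success", "fingerprint": config_fingerprint(config_a)},
    {"metric": "safety_cost", "fingerprint": config_fingerprint(config_b)},
]
print("safe to compare:", comparable(rows))
print("safe to compare without the mismatched row:", comparable(rows[:2]))
```

Serializing with `sort_keys=True` keeps the hash independent of dictionary key order, so two runs with the same settings produce the same fingerprint.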
A reward is safe enough to train against only after it has survived an exploit probe and an independent embodied-metric check.
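The exploit probe can be run before any training as a deterministic comparison between a scripted intended behavior and a scripted shortcut. The sketch below assumes a toy reaching reward of the form progress plus success bonus minus time penalty. The reward function, its constants, and both scripted trajectories are illustrative assumptions.

```python
# Sketch: a smoke test and a shortcut probe for a toy reaching reward.
# The reward function and both scripted behaviors are illustrative assumptions.


def episode_return(states, success_bonus=5.0, time_penalty=0.1):
    """Sum distance progress, minus time, plus a bonus for a stable grasp."""
    total = 0.0
    for prev, curr in zip(states, states[1:]):
        total += prev["dist"] - curr["dist"] - time_penalty
    if states[-1]["grasped"]:
        total += success_bonus
    return total


def scripted(dists, grasped):
    """Build a state sequence; only the final state may be grasped."""
    return [{"dist": d, "grasped": False} for d in dists[:-1]] + [
        {"dist": dists[-1], "grasped": grasped}]


intended = scripted([1.0, 0.5, 0.0], grasped=True)             # reach, then grasp
hover = scripted([1.0, 0.5, 0.05, 0.05, 0.05], grasped=False)  # stay near target

r_intended = episode_return(intended)
r_hover = episode_return(hover)
print("intended:", round(r_intended, 2), "hover:", round(r_hover, 2))

# Smoke test: the intended behavior must beat the known shortcut.
assert r_intended > r_hover, "reward prefers the shortcut; fix before training"
```

If the assertion fails, the reward already ranks the shortcut above the intended behavior, and no amount of training will correct that ordering.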
Choose a robot task and write three fields: the specified reward, the intended utility, and one shortcut behavior that could raise the reward while harming the utility. Add two independent metrics that would expose the shortcut.
What's Next?
This section turned reward danger into a testable audit: separate proxy return from intended utility, run an exploit probe, and keep embodied metrics in the same artifact. Next, Section 18.2 shows how to add dense guidance without changing which policy is optimal.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations. ICML.
This paper is useful here because it separates safe reward transformations from arbitrary proxy changes. It gives a precise example of when changing rewards preserves the intended policy, which sharpens the warning that most reward edits do not come with that guarantee.
Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.
HER is a reminder that training signals can be relabeled without changing the original evaluation goal. That distinction is central to reward safety: a useful learning trick should not silently redefine success.
Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.
This is the most direct safety framing for the section. Its categories of reward hacking, negative side effects, and safe exploration explain why a high scalar return is not enough evidence for embodied deployment.
Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
Preference learning appears later in the chapter as one response to brittle hand-written rewards. The paper also shows why learned rewards still require audits, because the learned model becomes the new proxy.
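Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning. OpenAI technical report.
Safety Gym made the reward-versus-cost split concrete for safe exploration. It is relevant here because it operationalizes the idea that task reward and safety evidence should be logged separately.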
Farama Foundation. Safety Gymnasium documentation.
Safety Gymnasium provides maintained environments where reward and cost channels can be audited together. Use it to test whether a reward improvement survives independent safety-cost measurement.