Section 18.6: Safety-aware and constrained rewards

A Careful Control Loop
Big Picture

Safety-aware reward design separates what the robot wants from what it must not violate. A single scalar reward can hide risk by letting high task return compensate for collisions, near misses, force spikes, or human interventions. Constrained reinforcement learning keeps safety costs visible as their own contract.

Reward design for safety-aware and constrained settings should expose four things explicitly: the objective term, its interaction with safety, its effect on exploration, and the deployment risk. None of them should be hidden inside one scalar return.

This section develops the contract for constrained Markov decision processes in embodied agents. The policy still seeks task return, but it must satisfy one or more cost budgets, such as collision count, force threshold violations, off-road time, unsafe proximity, or intervention rate.

The key question is practical: which requirements are negotiable performance objectives, and which are constraints that should not be traded away for more reward?

Costs Are Not Negative Rewards

A safety cost is a separate measurement with its own budget. If it is folded into reward too early, a policy can buy unsafe behavior with enough task success.

Theory

A constrained objective is usually written as

$$\max_{\pi} \; J_R(\pi) = \mathbb{E}_{\pi}\left[\sum_t \gamma^t r_t\right] \quad \text{subject to} \quad J_C(\pi)=\mathbb{E}_{\pi}\left[\sum_t \gamma^t c_t\right] \le d.$$

Here $r_t$ is task reward, $c_t$ is safety cost, and $d$ is the allowed cost budget. The important modeling move is that the constraint remains visible after training. A policy with high reward and cost above budget is not a successful safe policy; it is infeasible.
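
To make the two sums concrete, the sketch below evaluates a single episode. The per-step rewards, costs, discount factor, and budget are illustrative assumptions, and one episode is only a sample of the expectations $J_R$ and $J_C$; in practice both would be averaged over many rollouts.

```python
# Minimal sketch: discounted task return and safety cost for one episode.
# The trajectory values, gamma, and budget are illustrative assumptions.

def discounted_sum(values, gamma):
    """Return sum over t of gamma**t * values[t]."""
    total = 0.0
    for t, value in enumerate(values):
        total += (gamma ** t) * value
    return total

rewards = [1.0, 1.0, 1.0, 10.0]  # r_t: hypothetical per-step task reward
costs = [0.0, 1.0, 1.0, 0.0]     # c_t: hypothetical per-step safety cost
gamma = 0.9
budget_d = 1.5                   # d: allowed discounted cost

episode_return = discounted_sum(rewards, gamma)
episode_cost = discounted_sum(costs, gamma)
is_feasible = episode_cost <= budget_d

# Reward and cost are reported side by side, never merged into one number.
print(f"return={episode_return:.3f} cost={episode_cost:.3f} feasible={is_feasible}")
```

With these numbers the discounted cost exceeds the budget, so the episode counts against feasibility no matter how large the return is.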

Mechanism

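Many algorithms optimize a Lagrangian such as $J_R(\pi)-\lambda(J_C(\pi)-d)$ with a multiplier $\lambda \ge 0$, increasing $\lambda$ when the measured cost exceeds the budget and relaxing it toward zero when the constraint is satisfied. The Lagrange multiplier is a training mechanism, not a reason to stop reporting the raw cost.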
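
One common way to adapt the multiplier is projected dual ascent. The following is a minimal sketch rather than any particular library's implementation: the step size and the sequence of cost estimates are assumptions, and a real training loop would re-estimate $J_C$ from fresh rollouts and interleave policy updates.

```python
# Minimal sketch of a multiplier update for a Lagrangian method.
# The learning rate and cost estimates are illustrative assumptions.

def update_multiplier(lam, measured_cost, budget, lr):
    """Projected dual ascent: raise lambda when cost exceeds budget, clip at zero."""
    return max(0.0, lam + lr * (measured_cost - budget))

def lagrangian(task_return, measured_cost, budget, lam):
    """J_R - lambda * (J_C - d): the scalar a policy update would maximize."""
    return task_return - lam * (measured_cost - budget)

budget = 2.0
lam = 0.0
cost_estimates = [7.5, 5.0, 3.0, 1.5, 1.0]  # hypothetical J_C estimates during training

history = []
for cost in cost_estimates:
    lam = update_multiplier(lam, cost, budget, lr=0.1)
    history.append(round(lam, 3))

# Lambda grows while cost is above budget and shrinks once it falls below.
print("lambda history:", history)
print("lagrangian at final lambda:", lagrangian(92.0, 7.5, budget, lam))
```

The raw cost estimates stay in the log alongside the multiplier, so a reviewer can see whether the budget was actually met rather than inferring it from $\lambda$.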

Worked Example

Suppose a mobile robot can choose between a fast route through a crowded aisle and a slower route around it. The fast route has higher task reward but more unsafe-proximity cost. Code Fragment 1 checks each policy's feasibility against the cost budget before any comparison of return.

# Compare task return and safety cost under a fixed budget.
# A high-reward policy is rejected when its cost is infeasible.
policies = [
    {"name": "fast_aisle", "return": 92, "cost": 7.5},
    {"name": "wide_detour", "return": 78, "cost": 1.2},
]
cost_budget = 2.0

for policy in policies:
    feasible = policy["cost"] <= cost_budget
    print(policy["name"], "return=", policy["return"], "cost=", policy["cost"], "feasible=", feasible)
fast_aisle return= 92 cost= 7.5 feasible= False
wide_detour return= 78 cost= 1.2 feasible= True
Code Fragment 1: The fast_aisle policy has higher return but violates the cost_budget. The feasibility check makes the safety constraint visible instead of letting task reward compensate for unsafe proximity.

Expected output: a policy should be reported with both return and feasibility. Ranking by reward alone would choose the wrong policy for a constrained deployment.

Library Shortcut

Safety Gymnasium exposes tasks with separate reward and cost channels, which is the right interface for constrained experiments. Use it or a similar wrapper when a project needs budgeted safety metrics rather than reward penalties hidden inside one scalar.

Practical Recipe

  1. Write task reward and safety cost as separate functions.
  2. Set a cost budget before policy selection.
  3. Log reward return, cost return, budget violation rate, and intervention rate.
  4. Reject infeasible policies before comparing reward among feasible policies.
  5. Stress test constraints under sensor noise, actuation delay, and domain shift.
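
The recipe above can be sketched as a small aggregation step over per-episode logs. The episode numbers and field names here are hypothetical, and using mean cost as the feasibility test is an assumption; a project could equally require a bound on the violation rate.

```python
# Minimal sketch of the recipe: aggregate episode logs, filter, then compare.
# All episode values and field names are hypothetical.
from statistics import mean

cost_budget = 2.0

episode_logs = {
    "fast_aisle": [
        {"ret": 95.0, "cost": 8.0, "interventions": 1},
        {"ret": 89.0, "cost": 7.0, "interventions": 0},
    ],
    "wide_detour": [
        {"ret": 80.0, "cost": 1.0, "interventions": 0},
        {"ret": 76.0, "cost": 2.5, "interventions": 0},
    ],
}

def summarize(episodes, budget):
    """Compute the four logged quantities from one policy's episodes."""
    return {
        "mean_return": mean(e["ret"] for e in episodes),
        "mean_cost": mean(e["cost"] for e in episodes),
        "violation_rate": mean(1.0 if e["cost"] > budget else 0.0 for e in episodes),
        "intervention_rate": mean(1.0 if e["interventions"] > 0 else 0.0 for e in episodes),
    }

summaries = {name: summarize(eps, cost_budget) for name, eps in episode_logs.items()}

# Reject infeasible policies first, then compare return among the rest.
feasible = {n: s for n, s in summaries.items() if s["mean_cost"] <= cost_budget}
best = max(feasible, key=lambda n: feasible[n]["mean_return"]) if feasible else None

for name, summary in summaries.items():
    print(name, summary)
print("selected:", best)
```

In this made-up data the selected policy meets the budget on average but still violates it in half of its episodes, which is why the violation rate is logged next to the mean cost.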

Common Failure Mode

A penalty coefficient is not a safety requirement. If a collision penalty is too small, the policy collides. If it is too large, the policy may freeze. A constraint budget makes the requirement auditable and separates feasibility from reward tuning.
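
A short sketch shows how fragile the penalty approach is. It reuses the returns and costs from Code Fragment 1; the three penalty coefficients are arbitrary assumptions chosen to straddle the point where the ranking flips.

```python
# Minimal sketch: a penalty coefficient decides the winner, a budget does not.
# Returns and costs come from Code Fragment 1; the coefficients are assumptions.
policies = [
    {"name": "fast_aisle", "return": 92, "cost": 7.5},
    {"name": "wide_detour", "return": 78, "cost": 1.2},
]
cost_budget = 2.0

def penalized_winner(coef):
    """Pick the policy with the highest return minus coef * cost."""
    return max(policies, key=lambda p: p["return"] - coef * p["cost"])["name"]

def constrained_winner(budget):
    """Pick the highest-return policy among those within the cost budget."""
    feasible = [p for p in policies if p["cost"] <= budget]
    return max(feasible, key=lambda p: p["return"])["name"] if feasible else None

for coef in (1.0, 2.0, 3.0):
    print("penalty coef", coef, "->", penalized_winner(coef))
print("budget", cost_budget, "->", constrained_winner(cost_budget))
```

The penalized ranking changes somewhere between coefficients 2 and 3, a value with no physical meaning, while the budgeted rule gives the same answer for any coefficient because it never trades cost against return.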

Practical Example

For a delivery robot, reward can measure progress to the destination while cost measures entering restricted zones, near-human proximity, and hard braking events. The deployment decision should first filter policies by the cost budgets, then compare delivery time among the feasible policies.
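
The filter-then-compare decision might look like the sketch below. The policy names, measured costs, and budgets are all hypothetical; the point is only the order of operations, with every budget checked before delivery time is considered.

```python
# Minimal sketch: check every cost budget, then rank feasible policies by time.
# Policy names, metrics, and budgets are hypothetical.
budgets = {
    "restricted_zone_entries": 0.0,
    "near_human_events": 1.0,
    "hard_brakes": 2.0,
}

candidates = [
    {"name": "policy_a", "delivery_time_s": 310.0,
     "costs": {"restricted_zone_entries": 0.0, "near_human_events": 3.0, "hard_brakes": 1.0}},
    {"name": "policy_b", "delivery_time_s": 355.0,
     "costs": {"restricted_zone_entries": 0.0, "near_human_events": 0.0, "hard_brakes": 1.0}},
    {"name": "policy_c", "delivery_time_s": 340.0,
     "costs": {"restricted_zone_entries": 0.0, "near_human_events": 1.0, "hard_brakes": 2.0}},
]

def violated_budgets(costs, limits):
    """List the cost names whose measured value exceeds its budget."""
    return [name for name, limit in limits.items() if costs[name] > limit]

feasible, rejected = [], {}
for policy in candidates:
    violations = violated_budgets(policy["costs"], budgets)
    if violations:
        rejected[policy["name"]] = violations  # keep the reason for the audit trail
    else:
        feasible.append(policy)

# Only feasible policies compete on delivery time (lower is better).
best = min(feasible, key=lambda p: p["delivery_time_s"]) if feasible else None
print("rejected:", rejected)
print("selected:", best["name"] if best else None)
```

Here the fastest candidate is rejected for one exceeded budget, and the rejection record names that budget so the decision can be reviewed later.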

Memory Hook

Reward says, "get there." Constraint cost says, "and do not knock over the furniture on the way."

Research Frontier

Safety-aware RL is moving toward richer cost models, shielded policies, runtime monitors, verification-inspired constraints, and benchmarks that separate reward from safety costs. The open challenge is transfer: a constraint that is measurable in simulation may need different sensors, margins, and enforcement on hardware.

Self Check

Can you state the reward, the cost, the budget, and the rejection rule for an infeasible policy? If not, the safety requirement is still only a preference.

The hardest part is not writing a cost function. It is choosing a budget that corresponds to a real deployment requirement. A warehouse robot may allow zero emergency stops during evaluation, a bounded number of low-speed proximity warnings, and a maximum force threshold during contact-rich manipulation. Each budget needs a sensor source and a trace field.

The graduate-level habit is to distinguish penalties, constraints, and shields. Penalties shape optimization. Constraints define feasibility. Shields or runtime monitors block actions online. A safety-aware system may use all three, but the evaluation must reveal which layer prevented unsafe behavior.
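
As an illustration of the third layer, the sketch below wraps a proposed action in a simple shield and records when the shield overrides it. The one-dimensional state, the distance threshold, and the stop-as-fallback rule are assumptions made for the example, not a recommended safety design.

```python
# Minimal sketch of a runtime shield that logs its own interventions.
# The distance threshold and fallback speed are illustrative assumptions.

def shield(distance_to_human_m, proposed_speed, min_distance_m=1.0, safe_speed=0.0):
    """Return (executed_speed, intervened) for one control step."""
    if distance_to_human_m < min_distance_m and proposed_speed > safe_speed:
        return safe_speed, True   # override: the policy asked for an unsafe action
    return proposed_speed, False  # pass through: the policy's action was acceptable

# Hypothetical (distance in meters, policy-proposed speed) pairs.
steps = [(3.0, 1.2), (0.8, 1.0), (0.5, 0.0), (2.0, 0.9)]

trace = []
for distance, proposed in steps:
    executed, intervened = shield(distance, proposed)
    trace.append({
        "distance_m": distance,
        "proposed": proposed,
        "executed": executed,
        "shield_intervened": intervened,
    })

interventions = sum(row["shield_intervened"] for row in trace)
print("shield interventions:", interventions, "of", len(trace), "steps")
```

Keeping the proposed and executed actions in the same trace row is what lets an evaluation attribute safe behavior to the policy or to the shield.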

Practical Tool Choices For This Section

| Tool or Library | Role in the Topic | Builder Advice |
| --- | --- | --- |
| Safety Gymnasium | Reward and cost channels | Use it when the experiment needs explicit cost budgets and hazard metrics. |
| Gymnasium wrappers | Custom constraints | Add cost fields for collisions, unsafe proximity, action saturation, and intervention. |
| MuJoCo | Physical cost signals | Compute contact, force, velocity, and pose-limit violations from simulator state. |
| ROS 2 | Runtime safety logs | Record emergency stops, monitor interventions, and controller limit events on hardware. |
| CleanRL | Inspectable constrained loop | Use a short implementation to verify exactly where costs and multipliers enter training. |

A robust implementation treats the safety budget as part of the task contract. The policy artifact should be impossible to read without seeing its cost return and violation rate.

  1. Define every safety cost with units and measurement source.
  2. Choose budgets before comparing candidate policies.
  3. Train with a constrained method or a clear penalty baseline.
  4. Save cost traces, violation events, monitor actions, and final task state.
  5. Compare rewards only among policies that satisfy the cost budget.

Code Fragment 2 creates a compact constrained-RL audit record.

# Build one constrained-RL audit record for deployment review.
# Reward, cost, budget, and rejection rule remain separate fields.
from dataclasses import dataclass, asdict

@dataclass
class ConstraintAudit:
    section: str
    reward_metric: str
    cost_metric: str
    budget: float
    rejection_rule: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

record = ConstraintAudit(
    section="18.6",
    reward_metric="delivery success within time limit",
    cost_metric="discounted unsafe-proximity events per episode",
    budget=2.0,
    rejection_rule="reject any policy with mean cost above budget on the shared seed panel",
)
print(record.as_row())
{'section': '18.6', 'reward_metric': 'delivery success within time limit', 'cost_metric': 'discounted unsafe-proximity events per episode', 'budget': 2.0, 'rejection_rule': 'reject any policy with mean cost above budget on the shared seed panel'}
Code Fragment 2: The ConstraintAudit record stores reward, cost, budget, and rejection rule as separate fields. That separation prevents a high task score from obscuring an infeasible safety result.

When a constrained policy fails, decide whether the issue is cost observability, budget choice, optimization instability, monitor mismatch, or deployment transfer. Then rerun the same policy with cost traces and monitor events side by side.

Evaluation Recipe

For constrained rewards, compute task return, cost return, violation rate, intervention rate, and feasibility in a single evaluation pass on one configuration, so that the numbers being compared come from the same episodes. Report infeasible policies separately from feasible ones, even when their rewards are higher.

Key Takeaway

Safety constraints are meaningful only when their costs, budgets, and rejection rules remain visible in the final evaluation artifact.

Exercise 18.6.1

For a delivery, drone, or arm-control task, write one reward metric, two safety costs, a budget for each cost, and a rejection rule for policies that exceed either budget.

What's Next?

This section closed the reward-design chapter by separating objectives from constraints. Next, Chapter 19 studies exploration, where the same safety costs matter before the agent has learned what actions are useful.

References & Further Reading
Foundational Papers, Tools, and Practice References

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. ICML.

Potential-based shaping is a useful contrast to constraints. Shaping changes training feedback, while constrained RL keeps safety cost and feasibility as separate quantities.

Paper

Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.

HER shows how replay can become more sample-efficient, but constrained tasks still need cost budgets. Relabeled success should never hide unsafe exploration in the original task.

Paper

Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.

This paper motivates treating side effects and unsafe exploration as first-class design problems. It supports the section's separation of reward from safety cost.

Paper

Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.

Preference rewards can express soft human judgments, but hard safety budgets may still be needed. This reference helps distinguish learned preference from enforceable constraint.

Paper

Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning. OpenAI.

Safety Gym is the main benchmark reference for reward-plus-cost evaluation. It is directly aligned with the constrained objective and feasibility reporting in this section.

Paper

Farama Foundation Safety Gymnasium documentation.

Safety Gymnasium is the maintained tool reference for experiments with explicit cost channels. It supports the section's recommendation to compare reward only among feasible policies.

Tool