Section 2.4: Rewards, goals, costs, constraints

"The robot maximized the reward exactly as written. That was the first problem."

A Reward Designer With New Gray Hair
Figure 2.4A: Reward, goal, cost, and constraint signals for a navigation task: the agent receives a sparse +1 on reaching the goal, a per-step cost for energy use, and a hard constraint that keeps it away from humans.
Big Picture

Rewards, goals, costs, and constraints turn desired behavior into measurable signals. A reward is a scalar training or evaluation signal, a goal is a desired condition, a cost measures undesirable behavior, and a constraint marks behavior that must not be traded away.

Concept map for Section 2.4: a local diagram showing how reward encourages progress while constraints define unacceptable paths. Nodes: Evidence (what the agent receives), Decision (what the system changes), Consequence (what the next step inherits). Closed-loop feedback makes the next input depend on the last action.
Figure 2.4. Reward functions, task specifications, and constraints are easiest to reason about as a closed loop of evidence, decision, and consequence: reward encourages progress while constraints define unacceptable paths.

This section develops the difference between optimizing a number and satisfying a task. Embodied systems operate around people, hardware, and physical limits. A single average reward can hide collisions, near misses, excessive force, privacy-zone violations, or behavior that works only because a simulator is forgiving.

The practical goal is to keep success, reward, costs, and constraints separate in the experiment record. This lets a team say method X achieves Y under the same panel, model, split, and seed while also reporting whether constraints held.

Do Not Hide Safety In A Scalar

Safety constraints should remain visible as constraints. If they are folded into reward and averaged away, the learning curve can improve while deployment risk rises.

Theory

In reinforcement learning notation, the reward $r_t$ is often a scalar emitted after a transition. In robotics, the actual design space is broader. A goal might be "the cup is upright on the tray." A cost might be time, energy, jerk, or distance to humans. A constraint might be "never exceed force limit" or "never enter the keepout zone."

The important distinction is tradeability. Rewards and costs can be balanced when a tradeoff is acceptable. Constraints express requirements that should gate action or invalidate an episode. In deployment, a policy with slightly lower reward and zero constraint violations may be preferable to a faster policy with rare unsafe actions.
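
One common way to write this distinction down, offered here as a sketch rather than as this section's own notation, is a constrained objective. Only $r_t$ comes from the text above; the policy $\pi$, discount $\gamma$, horizon $T$, cost signals $c^{(i)}_t$, budgets $d_i$, and unsafe set $\mathcal{S}_{\text{unsafe}}$ are symbols assumed for illustration.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} c^{(i)}_t \right] \le d_i,
\quad i = 1, \dots, k,
\qquad
s_t \notin \mathcal{S}_{\text{unsafe}} \;\; \text{for all } t .
```

Read this way, the budgeted costs are tradeable in expectation, while the state condition must hold on every step and cannot be bought back with extra reward.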

Mechanism

The mechanism is metric factorization. Keep at least four fields in the record: task success, scalar reward or return, cost vector, and constraint status. Dashboards can aggregate them, but the raw logs should preserve them separately.
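
A minimal sketch of such a record, assuming a Python evaluation script. The class name `EpisodeRecord` and its field names are hypothetical, chosen only to mirror the four fields listed above.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EpisodeRecord:
    """One evaluation episode, with the four fields kept separate (hypothetical schema)."""

    success: bool                                          # did the task goal hold at the end?
    episode_return: float                                  # scalar reward summed over the episode
    costs: dict[str, float] = field(default_factory=dict)  # tradeable quantities
    constraints_ok: bool = True                            # False if any hard constraint was violated
    violated: tuple[str, ...] = ()                         # names of the violated constraints


record = EpisodeRecord(
    success=True,
    episode_return=7.0,
    costs={"time_s": 20.0, "collisions": 1.0},
    constraints_ok=False,
    violated=("collision",),
)

# asdict() yields a plain dict that can be written to a raw log unchanged.
row = asdict(record)
print(row)
```

A dashboard can still compute averages from these rows, but the raw log never has to be reverse-engineered to find out which episodes broke a constraint.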

Worked Example

Code Fragment 2.4.1 scores two episodes. Both complete the task, but only one satisfies the safety constraints.

# Section 2.4: runnable checkpoint for Reward functions, task specifications, and constraints.
# Keep the output small so the evidence record can be inspected directly.
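def score_episode(success, seconds, collisions, entered_keepout):
    # Scalar reward: task bonus minus tradeable penalties for time and collisions.
    reward = 10.0 * float(success) - 0.05 * seconds - 2.0 * collisions
    # Cost vector: raw quantities kept separate from the scalar.
    costs = {"time_s": seconds, "collisions": collisions}
    # Hard constraints: any collision or keepout entry fails the safety check.
    constraints_ok = collisions == 0 and not entered_keepout
    return {"success": success, "reward": reward, "costs": costs, "constraints_ok": constraints_ok}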

safe = score_episode(success=True, seconds=42, collisions=0, entered_keepout=False)
fast_unsafe = score_episode(success=True, seconds=20, collisions=1, entered_keepout=False)
print(safe)
print(fast_unsafe)
Code Fragment 2.4.1 separates success, reward, cost fields, and constraint status for two completed episodes.

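Expected output: two completed episodes with different safety status. The safe episode reports constraints_ok as True and the colliding episode reports it as False. With these weights the collision penalty happens to leave the unsafe episode with the lower reward (about 7.0 versus about 7.9), but a collision weight below 1.1 would reverse that ordering. The useful comparison is therefore not only reward; it is reward plus the cost fields and the constraint flag.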

Library Shortcut

The 10-line scorer becomes a callback or metric logger in Gymnasium, Isaac Lab, LeRobot evaluation scripts, or a Weights & Biases table. The tool handles batching, charts, and comparisons. The designer must still decide which events are rewards, which are costs, and which are non-negotiable constraints.
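
A minimal sketch of the wrapper pattern, assuming an environment whose `step` returns the five-tuple (observation, reward, terminated, truncated, info) that Gymnasium documents. To stay dependency-free, the sketch uses a hypothetical stand-in environment (`ToyHallwayEnv`) and a plain wrapper class (`CostLoggingWrapper`) rather than subclassing a library class. The info keys `min_human_distance_m`, `cost`, and `constraint_violated` are illustrative names, not part of any library.

```python
class ToyHallwayEnv:
    """Hypothetical stand-in: each forward step closes the gap to a person."""

    def __init__(self):
        self.distance_m = 2.0

    def step(self, action):
        self.distance_m -= 0.5 * action       # moving forward closes the gap
        reward = 1.0 * action                 # progress reward only
        terminated = self.distance_m <= 0.0
        info = {"min_human_distance_m": self.distance_m}
        return self.distance_m, reward, terminated, False, info


class CostLoggingWrapper:
    """Adds cost and constraint fields to info and leaves the reward untouched."""

    def __init__(self, env, keepout_m=1.0):
        self.env = env
        self.keepout_m = keepout_m

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info = dict(info)                     # copy so the wrapped env's dict is not mutated
        distance = info["min_human_distance_m"]
        info["cost"] = {"closeness_m": max(0.0, self.keepout_m - distance)}
        info["constraint_violated"] = distance < self.keepout_m
        return obs, reward, terminated, truncated, info


env = CostLoggingWrapper(ToyHallwayEnv(), keepout_m=1.0)
infos = [env.step(1.0)[4] for _ in range(3)]
for info in infos:
    print(info)
```

The design choice is that the wrapper never edits the reward: the learner sees the same scalar as before, while the evaluation log gains the fields a reviewer needs.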

Practical Recipe

  1. Write the goal in task language before writing a scalar reward.
  2. Separate success, reward, cost vector, and constraint status in logs.
  3. Use shaping rewards only when they preserve the intended ordering of behavior.
  4. Add counterexample episodes that target reward hacking.
  5. Report success with constraint violations, not success alone.
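
Step 3 can be checked numerically. The sketch below uses potential-based shaping, a known form of shaping (Ng, Harada, and Russell, 1999) that preserves the ranking of policies, on a hypothetical one-dimensional corridor. The goal position, the reward values, and the potential function are all assumptions made for illustration, and the discount is fixed at 1 to keep the arithmetic exact.

```python
GAMMA = 1.0   # undiscounted, assumed for simplicity
GOAL = 5      # hypothetical goal cell in a 1-D corridor


def potential(state: int) -> float:
    """Hypothetical potential: negative distance to the goal."""
    return -float(abs(GOAL - state))


def base_rewards(states: list[int]) -> list[float]:
    """-1 per step, plus +10 on the step that reaches the goal."""
    return [(10.0 if nxt == GOAL else 0.0) - 1.0 for nxt in states[1:]]


def total_return(states: list[int], shaped: bool) -> float:
    total = 0.0
    for t, reward in enumerate(base_rewards(states)):
        bonus = 0.0
        if shaped:
            # Potential-based shaping term: gamma * phi(s') - phi(s).
            bonus = GAMMA * potential(states[t + 1]) - potential(states[t])
        total += (GAMMA ** t) * (reward + bonus)
    return total


direct = [0, 1, 2, 3, 4, 5]
detour = [0, 1, 0, 1, 2, 3, 4, 5]   # wanders back before reaching the goal

base_gap = total_return(direct, shaped=False) - total_return(detour, shaped=False)
shaped_gap = total_return(direct, shaped=True) - total_return(detour, shaped=True)
print({"base_gap": base_gap, "shaped_gap": shaped_gap})
```

Both trajectories start and end in the same states, so the shaping terms telescope to the same constant and the gap between them is unchanged. A shaping bonus that is not a difference of potentials carries no such guarantee and should be tested against counterexample episodes.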

Failure Mode

Average reward can improve while rare unsafe events increase. This is especially dangerous when collisions, force spikes, keepout-zone entries, or operator interventions are small terms inside a single scalar.
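
A small synthetic illustration of this failure mode. It reuses the reward weights from Code Fragment 2.4.1; the batch sizes, episode durations, and the names `cautious` and `hasty` are invented for the example and do not come from any experiment.

```python
def score(seconds: float, collisions: int) -> dict:
    """Score a successful episode with the reward weights from Code Fragment 2.4.1."""
    reward = 10.0 - 0.05 * seconds - 2.0 * collisions
    return {"reward": reward, "constraints_ok": collisions == 0}


def summarize(episodes: list[dict]) -> dict:
    """Report mean reward and constraint-violation rate as separate numbers."""
    n = len(episodes)
    return {
        "mean_reward": sum(e["reward"] for e in episodes) / n,
        "violation_rate": sum(not e["constraints_ok"] for e in episodes) / n,
    }


# Synthetic batches, invented for illustration only.
cautious = [score(42, 0) for _ in range(100)]                          # slow, never collides
hasty = [score(20, 0) for _ in range(95)] + [score(20, 1) for _ in range(5)]

cautious_summary = summarize(cautious)
hasty_summary = summarize(hasty)
print("cautious", cautious_summary)
print("hasty", hasty_summary)
```

The hasty batch has the higher mean reward (about 8.9 versus about 7.9) even though one episode in twenty ends in a collision. Only the separate violation rate makes that visible.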

Practical Example

An assistive robot project reported delivery success, time, near-human distance, stop events, and operator interventions as separate fields. This made deployment review possible: the team accepted a slower policy because it achieved the task with fewer close passes and no intervention spikes.

Memorable Shortcut

Reward is a suggestion written in math. Constraints are the part where the hardware, the operator, and the insurance policy clear their throats.

Research Frontier

Safe reinforcement learning, preference learning, control barrier functions, and runtime assurance all address failures of simple reward design. Independent closed-loop evaluation remains essential because learned reward models can inherit the blind spots of their data.

Mini Lab

Extend Code Fragment 2.4.1 with an energy cost and a force-limit constraint. Then create three episodes where the highest reward episode is not the deployable one.

Self Check

Can you explain which safety condition in your task is a constraint rather than a reward penalty?

Reward design is safest when it is factorized before it is optimized. A goal describes the intended world condition. A reward provides learning or ranking pressure. A cost measures tradeoffs such as time, energy, jerk, distance to people, or interventions. A constraint names a boundary that should remain visible even when the reward improves.

The evaluation artifact should therefore include at least success, return, cost vector, constraint status, and failure label. If constraints appear only as a small penalty inside reward, the dashboard can congratulate a policy for becoming faster while hiding the behavior that makes it undeployable.

Tool or Library | Role in This Topic | Builder Advice
Gymnasium wrappers and callbacks | Separate reward, termination, truncation, and info fields for custom metrics | Use info to preserve costs and constraint events instead of hiding them inside reward.
Safety Gymnasium and safe RL tooling | Treat costs and constraints as first-class evaluation signals | Use them when constraint satisfaction is part of the claim, not a footnote.
Control barrier functions and runtime assurance | Gate actions that would violate state or control constraints | Use them when constraints must prevent behavior at runtime rather than merely penalize it later.
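
To make the last row concrete, here is a deliberately simplified action gate in one dimension. It is not a control barrier function in the formal sense, only a sketch in the same spirit: the commanded velocity is reduced whenever the next position would cross into a keepout region. The function name `gate_velocity`, the one-step position model, and all numeric values are assumptions for illustration.

```python
def gate_velocity(position_m, commanded_v, keepout_start_m, dt_s=0.5, v_max=2.0):
    """Return (safe_velocity, intervened) so the next position stays at or below keepout_start_m.

    Assumes the simple model next_position = position + velocity * dt.
    """
    # Respect actuator limits first.
    v = max(-v_max, min(v_max, commanded_v))
    # Largest velocity that still stops at the keepout boundary after one step.
    max_safe_v = (keepout_start_m - position_m) / dt_s
    safe_v = max(-v_max, min(v, max_safe_v))
    return safe_v, safe_v != v


# Far from the boundary: the command passes through unchanged.
far = gate_velocity(position_m=0.0, commanded_v=1.0, keepout_start_m=2.0)
# Close to the boundary: the command is reduced and the intervention is flagged.
near = gate_velocity(position_m=1.5, commanded_v=2.0, keepout_start_m=2.0)
print({"far": far, "near": near})
```

The `intervened` flag is worth logging as its own field: a policy that relies on the gate every few steps is completing the task only because the gate is there.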

Build a reward audit that can catch reward hacking. The audit should report whether the top-reward episode is also deployable under constraints.

  1. Write the task goal in ordinary language.
  2. Define reward, each cost field, and each hard constraint separately.
  3. Create counterexample episodes where a high reward can coincide with a violation.
  4. Sort by reward and by deployability to see whether the rankings disagree.
  5. Report success rate together with constraint-violation rate and intervention rate.

# Check whether the top reward episode is actually deployable.
episodes = [
    {"name": "safe_slow", "reward": 7.9, "success": True, "collisions": 0, "keepout": False},
    {"name": "fast_close_pass", "reward": 8.7, "success": True, "collisions": 0, "keepout": True},
    {"name": "fast_collision", "reward": 8.2, "success": True, "collisions": 1, "keepout": False},
]

def deployability(row: dict[str, object]) -> bool:
    return row["success"] and row["collisions"] == 0 and not row["keepout"]

top_reward = max(episodes, key=lambda row: row["reward"])
deployable = [row for row in episodes if deployability(row)]
print({"top_reward": top_reward["name"], "top_reward_deployable": deployability(top_reward)})
print({"best_deployable": max(deployable, key=lambda row: row["reward"])["name"]})
Code Fragment 2.4.2 checks whether the highest-reward episode satisfies the hard constraints required for deployment.
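
Steps 4 and 5 of the audit can be sketched in the same style. The episodes repeat those in Code Fragment 2.4.2; the `interventions` counts are hypothetical values added so that an intervention rate can be reported, and the helper name `is_deployable` is likewise an assumption.

```python
episodes = [
    {"name": "safe_slow", "reward": 7.9, "success": True, "collisions": 0, "keepout": False, "interventions": 0},
    {"name": "fast_close_pass", "reward": 8.7, "success": True, "collisions": 0, "keepout": True, "interventions": 0},
    {"name": "fast_collision", "reward": 8.2, "success": True, "collisions": 1, "keepout": False, "interventions": 1},
]


def is_deployable(row: dict) -> bool:
    """Success with no collision and no keepout entry."""
    return bool(row["success"]) and row["collisions"] == 0 and not row["keepout"]


# Step 4: rank by reward alone, then by deployability first and reward second.
by_reward = [r["name"] for r in sorted(episodes, key=lambda r: r["reward"], reverse=True)]
by_deployability = [
    r["name"]
    for r in sorted(episodes, key=lambda r: (is_deployable(r), r["reward"]), reverse=True)
]

# Step 5: report success together with violation and intervention rates.
n = len(episodes)
report = {
    "success_rate": sum(bool(r["success"]) for r in episodes) / n,
    "constraint_violation_rate": sum(r["collisions"] > 0 or r["keepout"] for r in episodes) / n,
    "intervention_rate": sum(r["interventions"] > 0 for r in episodes) / n,
    "rankings_disagree": by_reward != by_deployability,
}
print(by_reward)
print(by_deployability)
print(report)
```

When `rankings_disagree` is true, the reward function is ordering episodes differently from the deployment criteria, which is the signal the audit exists to catch.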

When a reward design fails, ask whether the goal was underspecified, a shaping term changed the intended ordering, a cost was hidden in the scalar, or a hard constraint was treated as a negotiable penalty. Fix the specification before tuning the learner.

Key Takeaway

Goals say what should happen. Rewards help learning. Costs expose tradeoffs. Constraints protect the boundaries that should not be optimized away.

Exercise 2.4.1

Write a reward, one cost, and one hard constraint for a mobile robot navigating a hallway with people. Explain which dashboard plot should show each field.

What's Next?

Section 2.5 explains how time, latency, and actuation make the interface a real-time contract.

Bibliography & Further Reading

Foundational References For This Section

Bellman, R. "A Markovian Decision Process." (1957). https://doi.org/10.1515/9781400835386-007

The mathematical origin of the state, action, transition, and reward framing.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. "Planning and acting in partially observable stochastic domains." (1998). https://www.sciencedirect.com/science/article/pii/S000437029800023X

A foundational POMDP reference for belief-state reasoning under partial observability.

Farama Foundation. "Gymnasium Documentation." (2024). https://gymnasium.farama.org/

The maintained reference for reset, step, spaces, termination, truncation, wrappers, and reproducible environments.