Section 10.3: Reward design and termination

A Careful Control Loop
Figure 10.3A: Reward shaping design for a door-opening task: the sparse terminal reward fires only on success, a dense potential-based shaping term provides gradient throughout, and an early-termination condition cuts failed episodes short.
Big Picture

Reward design and termination belong to the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.

This section puts the agent-environment interface into practice: explicit reward terms, termination flags, truncation flags, and checks for hidden failure incentives. The result is one auditable environment contract that supports RL training, multi-agent experiments, and benchmark evaluation.

What This Section Builds

Reward design and episode endings become operational when the learning signal and the stopping reason are both explicit. A reward tells the learner what behavior is being reinforced, while terminated and truncated tell the learner why the episode stopped.

The goal is to stop treating all endings as equal. Reaching the goal, dropping the object, violating a safety boundary, and hitting a time limit require different labels in both the learning loop and the experiment report.

The Interface Is The Test

This environment is ready when another reader can reset it with the same seed, inspect the reward terms and both ending flags, check for hidden failure incentives, reproduce the same rollout, and recover the same logged evidence.

Theory

A reward should be tied to the task construct, not to whatever sensor is easiest to measure. In a reach task, distance-to-target may help shape learning, but the success condition should still state the task boundary: target reached within tolerance, object stable, safety constraints respected.
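
One way to keep distance shaping from redefining the task is the potential-based form named in Figure 10.3A. The sketch below is illustrative only: it assumes a potential equal to the negative distance to the target, a discount of 0.99, and a 0.05 tolerance. The names potential, shaping_term, and reach_success are hypothetical and do not come from Gymnasium.

```python
# Minimal sketch of potential-based shaping for a reach task.
# Assumptions (hypothetical): potential = negative distance, GAMMA = 0.99,
# TOLERANCE = 0.05. None of these names belong to any library.
GAMMA = 0.99
TOLERANCE = 0.05


def potential(distance_to_target: float) -> float:
    """Higher potential means the state is closer to the target."""
    return -distance_to_target


def shaping_term(prev_distance: float, next_distance: float) -> float:
    """Shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return GAMMA * potential(next_distance) - potential(prev_distance)


def reach_success(distance: float, object_stable: bool, safety_ok: bool) -> bool:
    """The task boundary is stated independently of the shaping signal."""
    return distance <= TOLERANCE and object_stable and safety_ok


# Moving closer gives a positive shaping term; moving away gives a negative one.
closer = shaping_term(prev_distance=0.50, next_distance=0.40)
farther = shaping_term(prev_distance=0.40, next_distance=0.50)
print(closer > 0, farther < 0)

# Being within tolerance is not enough if the object is unstable.
print(reach_success(0.04, object_stable=True, safety_ok=True))
print(reach_success(0.04, object_stable=False, safety_ok=True))
```

The shaping term only guides learning. The success predicate stays a separate function, so changing the shaping weight or potential cannot silently move the task boundary.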

Gymnasium separates two ending flags because they answer different questions. terminated means the task's own terminal state was reached. truncated means an outside condition, usually a time limit, stopped the episode before the task itself ended.
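
The two flags matter to learning code because they usually change the value target. A common convention in value-based methods, sketched here for a one-step TD target, is to stop bootstrapping when terminated is true and to keep bootstrapping when only truncated is true. The function td_target and its arguments are hypothetical names used for illustration, not part of Gymnasium.

```python
def td_target(reward: float, next_value: float, terminated: bool,
              truncated: bool, gamma: float = 0.99) -> float:
    """One-step TD target that bootstraps unless the task itself ended.

    `truncated` is accepted only to make the point explicit: it does not
    zero out the bootstrap term, because the task could have continued.
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


# Same reward and next-state value, different ending cause.
target_after_termination = td_target(1.0, 10.0, terminated=True, truncated=False)
target_after_truncation = td_target(1.0, 10.0, terminated=False, truncated=True)
print(target_after_termination, target_after_truncation)
```

If the two flags were merged into one done signal, the truncated transition would be treated as if no future reward existed, which biases value estimates near the time limit.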

Mechanism

A good environment writes reward terms into info during development: success bonus, distance shaping, collision penalty, control cost, and safety penalty. The scalar reward trains the policy, but the decomposed terms explain why the policy behaves as it does.

Worked Example

Code Fragment 10.3.1 forces a time-limit truncation in CartPole-v1. The pole has not necessarily reached a terminal failure state, but the wrapper stops the episode because the external step budget is exhausted.

# Show that a time limit produces truncation, not task termination.
# Learning code should log which flag caused the episode boundary.
import gymnasium as gym

env = gym.make("CartPole-v1", max_episode_steps=3)
observation, info = env.reset(seed=5)
env.action_space.seed(5)

for step_index in range(5):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    print(step_index + 1, terminated, truncated)
    if terminated or truncated:
        break

env.close()
1 False False
2 False False
3 False True

The expected output keeps both ending flags false for the first two steps, then flips only truncated to true at step 3. That pattern means the episode stopped because the time budget expired, not because the task dynamics reached a terminal success or failure state.

Code Fragment 10.3.1 creates a short episode budget and shows truncated=True at step 3. The example makes the Gymnasium distinction concrete: the episode ended, but not because the task dynamics declared a terminal state.
Library Shortcut

Gymnasium's TimeLimit behavior and five-value step API remove the need for custom done conventions. The shortcut only works if the experiment artifact preserves the two flags instead of recombining them into one column.
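
A minimal sketch of an episode log that preserves both flags as separate columns follows. The records and column names are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical episode records; in practice these come from the rollout loop.
episodes = [
    {"episode": 0, "return": 9.96, "length": 42, "terminated": True, "truncated": False},
    {"episode": 1, "return": -3.10, "length": 200, "terminated": False, "truncated": True},
]

buffer = io.StringIO()  # stands in for an open file handle
writer = csv.DictWriter(
    buffer, fieldnames=["episode", "return", "length", "terminated", "truncated"]
)
writer.writeheader()
writer.writerows(episodes)

log_text = buffer.getvalue()
print(log_text)
```

Keeping two boolean columns costs almost nothing, and it lets a later analysis separate task endings from protocol cutoffs without rerunning the experiment.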

Practical Recipe

  1. Write the terminal success and terminal failure conditions before writing reward shaping.
  2. Use terminated for task-defined endings such as success, irreversible failure, or safety violation.
  3. Use truncated for external cutoffs such as time limits, evaluation budgets, or watchdog stops.
  4. Log reward components in info during debugging, even if the trainer only consumes the scalar reward.
  5. Report success rate, truncation rate, and failure category counts together.
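
Step 5 of the recipe can be sketched as a small aggregation over per-episode records. The record layout and the failure labels below are hypothetical. They show one way to report the three quantities side by side.

```python
from collections import Counter

# Hypothetical per-episode records; "failure" is None unless the task failed.
episodes = [
    {"terminated": True, "truncated": False, "success": True, "failure": None},
    {"terminated": True, "truncated": False, "success": False, "failure": "dropped_object"},
    {"terminated": False, "truncated": True, "success": False, "failure": None},
    {"terminated": True, "truncated": False, "success": False, "failure": "safety_violation"},
]

n = len(episodes)
success_rate = sum(e["success"] for e in episodes) / n
truncation_rate = sum(e["truncated"] for e in episodes) / n
failure_counts = Counter(e["failure"] for e in episodes if e["failure"])

print(f"success_rate={success_rate:.2f}")
print(f"truncation_rate={truncation_rate:.2f}")
print(dict(failure_counts))
```
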
Gymnasium And PettingZoo Practice

A usable environment wrapper for this section records reward terms, termination flags, and truncation flags, along with observation and action spaces, the reset seed, info dictionary fields, and reproducible evidence artifacts, so that hidden failure incentives can be found during review.

Common Failure Mode

The common mistake is reward hacking by proxy. If the reward pays for high velocity toward the target but ignores stable contact, the robot can learn a dramatic collision that scores well while failing the real task.
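
The toy comparison below illustrates the failure. It assumes two hypothetical final transitions and made-up weights. proxy_reward pays only for approach velocity, while task_reward adds a stable-contact term.

```python
def proxy_reward(approach_velocity: float) -> float:
    """Pays only for speed toward the target (the flawed proxy)."""
    return approach_velocity


def task_reward(approach_velocity: float, stable_contact: bool) -> float:
    """Adds a contact term; the weights are illustrative, not tuned."""
    contact_term = 5.0 if stable_contact else -5.0
    return 0.1 * approach_velocity + contact_term


# Two hypothetical final transitions.
crash = {"approach_velocity": 2.0, "stable_contact": False}
gentle = {"approach_velocity": 0.2, "stable_contact": True}

# Under the proxy, the crash scores higher than the gentle contact.
print(proxy_reward(crash["approach_velocity"]) > proxy_reward(gentle["approach_velocity"]))
# With the contact term, the ordering matches the real task.
print(task_reward(**crash) < task_reward(**gentle))
```
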

Practical Example

In a drawer-opening task, use terminated=True when the handle passes the target displacement and the drawer remains stable. Use truncated=True when the time budget expires while the drawer is still moving. Report those counts separately because they imply different fixes.
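
The drawer example can be written as one small predicate function. The thresholds, the step budget, and the name drawer_step_flags are assumptions chosen for illustration. The sketch also makes one design choice explicit: it reports truncation only when the task has not already ended on the same step.

```python
from typing import Tuple


def drawer_step_flags(displacement: float, drawer_speed: float, step_count: int,
                      target_displacement: float = 0.20,
                      stable_speed: float = 0.01,
                      max_steps: int = 200) -> Tuple[bool, bool]:
    """Return (terminated, truncated) for one step of a drawer-opening task."""
    opened = displacement >= target_displacement
    stable = abs(drawer_speed) <= stable_speed
    terminated = opened and stable
    # Truncate only if the task has not already ended on this step.
    truncated = (not terminated) and step_count >= max_steps
    return terminated, truncated


print(drawer_step_flags(0.21, 0.00, step_count=50))   # opened and stable
print(drawer_step_flags(0.15, 0.05, step_count=200))  # budget spent, still moving
print(drawer_step_flags(0.10, 0.02, step_count=50))   # episode continues
```

A high truncation count suggests the policy is too slow or the budget too short, while a low success count with few truncations points to the task predicate or the policy's contact behavior.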

Memory Hook

A good embodied system makes reward design and termination visible twice: once in the design sketch and once in the replay artifact. The second view keeps the first one honest.

Research Frontier

Reward design remains an active research problem because embodied tasks often combine sparse success, dense shaping, safety constraints, and human preference signals. Gymnasium's termination and truncation split is a small API detail that supports a larger scientific need: knowing whether a policy solved the task or merely survived the protocol.

Self Check

Can you write one sentence each for success termination, failure termination, and external truncation in your environment? If those sentences blur together, the reward specification is not ready.

Reward design is where a simulator becomes a teacher. If the scalar reward rewards the wrong proxy, the policy will optimize that proxy with more patience than the author expects. The environment should therefore store the reward decomposition and ending cause with each episode.

The graduate-level habit is to treat reward as a measurement model. The scalar is only a compressed signal. The full evidence artifact should retain the terms that explain the compression, especially when comparing methods across seeds or perturbations.

Practical Tool Choices For This Section
Tool or Library   | Role in the Topic              | Builder Advice
reward            | Scalar learning signal         | Keep it aligned with the task construct, not only easy-to-measure proxies.
terminated        | Task-defined episode ending    | Use for success, failure, or safety states that belong to the environment dynamics.
truncated         | External protocol cutoff       | Use for time limits, evaluation budgets, and watchdog stops.
info              | Reward and ending diagnostics  | Store reward terms, failure labels, and distance-to-goal traces for debugging.
Evaluation report | Aggregate evidence             | Publish success, failure, and truncation rates together.

A robust implementation keeps reward computation auditable. Even when the trainer sees only a scalar, the environment should emit enough diagnostic information to reconstruct why that scalar was produced.

  1. Write success and failure predicates before tuning reward weights.
  2. Put every reward term into a named variable before summing.
  3. Return the scalar reward to the trainer and the term breakdown in info.
  4. Write unit tests for one success case, one task failure, and one time-limit truncation.
  5. Audit evaluation tables for construct-matched success metrics, not only mean reward.
# Keep reward terms named before summing them into one scalar.
# The info dict preserves why the reward was assigned.
distance_to_goal = 0.04
collision = False
success = distance_to_goal <= 0.05 and not collision

reward_terms = {
    "success_bonus": 10.0 if success else 0.0,
    "distance_penalty": -distance_to_goal,
    "collision_penalty": -5.0 if collision else 0.0,
}
reward = sum(reward_terms.values())
terminated = success or collision
truncated = False
info = {"reward_terms": reward_terms, "success": success}

print(round(reward, 2), terminated, truncated)
print(info)
9.96 True False
{'reward_terms': {'success_bonus': 10.0, 'distance_penalty': -0.04, 'collision_penalty': 0.0}, 'success': True}

The expected output pairs a near-10 reward with terminated=True and a reward-term dictionary whose entries sum to the scalar reward. Readers should interpret that as a successful terminal transition whose score is explainable from named components rather than an opaque single number.

Code Fragment 10.3.2 separates reward terms before returning their sum. The named reward_terms inside info make the scalar reward explainable during policy debugging and result audits.
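
The three unit tests recommended in the checklist can be sketched against the same reward logic. The helper score_transition, its max_steps budget of 100, and the test names are hypothetical. The helper wraps the fragment's logic in a function and adds a step counter so that a time-limit case can be exercised.

```python
def score_transition(distance_to_goal, collision, step_count, max_steps=100):
    """Hypothetical helper that makes the reward logic testable."""
    success = distance_to_goal <= 0.05 and not collision
    reward_terms = {
        "success_bonus": 10.0 if success else 0.0,
        "distance_penalty": -distance_to_goal,
        "collision_penalty": -5.0 if collision else 0.0,
    }
    terminated = success or collision
    # Time-limit cutoff applies only when the task has not ended by itself.
    truncated = (not terminated) and step_count >= max_steps
    info = {"reward_terms": reward_terms, "success": success}
    return sum(reward_terms.values()), terminated, truncated, info


def test_success():
    reward, terminated, truncated, info = score_transition(0.04, False, 10)
    assert terminated and not truncated and info["success"]
    assert abs(reward - 9.96) < 1e-9


def test_task_failure():
    reward, terminated, truncated, info = score_transition(0.30, True, 10)
    assert terminated and not truncated and not info["success"]
    assert info["reward_terms"]["collision_penalty"] == -5.0


def test_time_limit():
    reward, terminated, truncated, info = score_transition(0.30, False, 100)
    assert truncated and not terminated and not info["success"]


for test in (test_success, test_task_failure, test_time_limit):
    test()
print("all ending-cause tests passed")
```
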

When an experiment about reward design and termination fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Key Takeaway

Reward trains behavior, but termination semantics explain episode boundaries. Keep reward terms, task endings, and external truncations separate all the way into the result artifact.

Exercise 10.3.1

For a simulated grasp task, write three predicates: successful grasp, dropped object, and time-limit cutoff. Then write the reward terms you would return in info to explain the scalar reward.

What's Next?

The next section inherits the reward design and termination contract built here and varies only the next environment-design variable under study.

Bibliography and Further Reading
Tools And Libraries

Farama Foundation. "Gymnasium Documentation."

The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to reward design and termination when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Farama Foundation. "PettingZoo Documentation."

PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.

Foundational Papers

Terry, J. K. et al. (2021). "PettingZoo: Gym for Multi-Agent Reinforcement Learning." NeurIPS Datasets and Benchmarks.

This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.

Brockman, G. et al. (2016). "OpenAI Gym." arXiv.

The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.

Tools And Libraries

Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."

Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.
