A Careful Control Loop
Rendering, logging, and debugging all build on the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.
This section puts the agent-environment interface into practice through render modes, episode logs, videos, info dictionaries, and debugging artifacts, so that RL training, multi-agent experiments, and benchmark evaluation share one auditable environment contract.
What This Section Builds
Rendering, logging, and debugging are the evidence path for an environment. Rendering shows what the environment believes is happening, logging preserves what happened, and debugging connects those records to a concrete failure cause.
The goal is to stop treating a reward curve as the whole story. An embodied environment should produce enough trace evidence to answer which observation arrived, which action was sent, what the environment returned, and why the episode ended.
This environment is ready when another reader can reset it with the same seed, reproduce the same rollout, and recover the same logged evidence from its render output, episode logs, videos, info dictionaries, and debugging artifacts.
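The readiness criterion above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical ToyLineEnv that only mimics the Gymnasium reset and step return shapes; it is not a Gymnasium class, and the slip probability and step limit are invented for illustration. The sketch runs the same seeded rollout twice and compares the two traces.

```python
import random


class ToyLineEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step return shapes."""

    def __init__(self, length=5, max_steps=8):
        self.length = length
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.position = 0
        self.steps = 0
        return self.position, {"seed": seed}

    def step(self, action):
        # Action 1 moves right, action 0 moves left; 20% of moves slip and do nothing.
        slipped = self.rng.random() < 0.2
        if not slipped:
            self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.position >= self.length
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {"slipped": slipped}


def rollout(seed):
    env = ToyLineEnv()
    # Seed the action source as well as the environment, or the rollout will not repeat.
    policy_rng = random.Random(seed + 1000)
    observation, info = env.reset(seed=seed)
    trace = []
    done = False
    while not done:
        action = policy_rng.choice([0, 1])
        observation, reward, terminated, truncated, info = env.step(action)
        trace.append((action, observation, reward, terminated, truncated, info["slipped"]))
        done = terminated or truncated
    return trace


first = rollout(seed=3)
second = rollout(seed=3)
print(first == second)  # prints True
```

If the comparison ever prints False, some source of randomness is not covered by the seed, and that is worth fixing before any long training run.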
Theory
Gymnasium environments declare render modes such as human, rgb_array, or ansi. The right mode depends on the artifact: a live window is useful for local debugging, an RGB array can be saved as video, and text rendering can be checked in automated tests.
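As a small illustration of matching the mode to the artifact, the sketch below validates a requested mode against the modes an environment declares. Gymnasium environments list their modes under the "render_modes" key of their metadata dictionary; the ARTIFACT_TO_MODE table, the choose_render_mode helper, and the declared dictionary are hypothetical names introduced here, with the dictionary standing in for a real env.metadata.

```python
# Hypothetical mapping from the artifact you want to the render mode that produces it.
ARTIFACT_TO_MODE = {
    "live_window": "human",
    "video": "rgb_array",
    "text_test": "ansi",
}


def choose_render_mode(metadata, artifact):
    """Return the render mode for an artifact, or fail loudly if it is not declared."""
    mode = ARTIFACT_TO_MODE[artifact]
    supported = metadata.get("render_modes", [])
    if mode not in supported:
        raise ValueError(f"{mode!r} is not declared; supported modes: {supported}")
    return mode


# Illustrative stand-in for env.metadata on an environment that declares three modes.
declared = {"render_modes": ["human", "ansi", "rgb_array"], "render_fps": 4}
print(choose_render_mode(declared, "text_test"))  # prints ansi
```

Failing at construction time is usually cheaper than discovering after a long run that no frames were saved.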
Logging should sit next to the environment loop rather than after training. Each step record should include step index, seed, action, reward, termination flag, truncation flag, and selected info fields. For robotics, add controller status, contact events, safety margins, and timing.
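One way to make the step record concrete is a small typed row written as JSON Lines. This is a sketch under stated assumptions: StepRecord and write_jsonl are hypothetical names, the field set follows the list above, and the two sample rows are invented rather than taken from a real rollout.

```python
import io
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """One row per environment step (hypothetical schema for illustration)."""
    seed: int
    step: int
    action: int
    reward: float
    terminated: bool
    truncated: bool
    info: dict = field(default_factory=dict)  # selected info fields only


def write_jsonl(records, stream):
    """Write one JSON object per line so rows can be appended during the loop."""
    for record in records:
        stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


# Invented sample rows; a real loop would build one record after each env.step call.
records = [
    StepRecord(seed=17, step=1, action=0, reward=1.0, terminated=False, truncated=False),
    StepRecord(seed=17, step=2, action=1, reward=1.0, terminated=True, truncated=False,
               info={"contact": True}),
]

buffer = io.StringIO()  # swap for open("trace.jsonl", "w") to write a file
write_jsonl(records, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))                          # prints 2
print(json.loads(lines[-1])["terminated"])  # prints True
```

Keeping terminated and truncated as separate fields preserves the reason an episode ended, which a single "done" flag would erase.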
A render frame tells you what the environment would show an observer. A log record tells you what the policy and trainer consumed. Debugging begins when those two views disagree, such as a video showing contact while info reports no collision.
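The disagreement check can be sketched as a comparison between per-step labels derived from the frames and the logged info fields. The names here are assumptions made for illustration: frame_labels is a hypothetical mapping from step index to whether a reviewer or detector saw contact in the frame, and the "collision" info key is invented rather than a standard Gymnasium field.

```python
def find_disagreements(frame_labels, log_rows):
    """Return step indices where the rendered view and the logged info disagree."""
    logged = {row["step"]: bool(row["info"].get("collision", False)) for row in log_rows}
    mismatches = []
    for step, shows_contact in sorted(frame_labels.items()):
        if step in logged and logged[step] != shows_contact:
            mismatches.append(step)
    return mismatches


# Invented example: the frame at step 2 shows contact, but info reports none.
frame_labels = {1: False, 2: True, 3: True}
log_rows = [
    {"step": 1, "info": {"collision": False}},
    {"step": 2, "info": {"collision": False}},
    {"step": 3, "info": {"collision": True}},
]
mismatches = find_disagreements(frame_labels, log_rows)
print(mismatches)  # prints [2]
```

Each mismatched step is a concrete place to start debugging: either the renderer or the diagnostic is wrong about that transition.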
Worked Example
Code Fragment 10.5.1 uses an ansi render mode so the example works without a graphics window. The render frame gives a human-readable view, while the step return gives the machine-readable trace.
# Use text rendering when a debug check should run without a GUI.
# The step trace still records reward, ending flags, and info keys.
import gymnasium as gym
env = gym.make("FrozenLake-v1", render_mode="ansi", is_slippery=False)
observation, info = env.reset(seed=3)
frame = env.render()
visible = [line for line in frame.splitlines() if line.strip()]
observation, reward, terminated, truncated, info = env.step(1)
clean_row = visible[0].replace("\x1b[41m", "[").replace("\x1b[0m", "]")
print(clean_row)
print({"obs": int(observation), "reward": reward, "ended": terminated or truncated, "info_keys": sorted(info.keys())})
env.close()
The expected output combines a human-readable render frame with a machine-readable step record. Because the frame is rendered before the step, it shows the grid at reset with the agent's start cell bracketed. The dictionary confirms that the chosen action (1, which moves down) did not end the episode and that a transition-probability diagnostic, the prob key, is available in info.
Gymnasium render modes and wrappers such as episode statistics recording turn common debugging needs into standard calls. The shortcut works best when the saved artifact includes both visual evidence and structured fields, rather than only one or the other.
Practical Recipe
- Choose a render mode that matches the artifact: live inspection, saved video, image array, or text trace.
- Log one row per environment step with action, reward, ending flags, and selected info.
- Save the wrapper stack and render mode with the log.
- When a rollout fails, classify the failure before changing the policy.
- Keep two representative failure traces for each reported metric table.
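The step about saving the wrapper stack and render mode can be sketched as a small run manifest. The sketch assumes that each wrapper stores the environment it wraps in an attribute named env, as Gymnasium wrappers do; BaseEnv, Layer, TimeLimitLike, and StatsLike are hypothetical stand-ins so the example runs without Gymnasium, and the manifest keys are illustrative.

```python
import json


class BaseEnv:
    """Hypothetical innermost environment."""
    render_mode = "ansi"


class Layer:
    """Hypothetical wrapper that stores the wrapped environment as .env."""

    def __init__(self, env):
        self.env = env


class TimeLimitLike(Layer):
    pass


class StatsLike(Layer):
    pass


def wrapper_stack(env):
    """List class names from the outermost wrapper down to the base environment."""
    names = []
    current = env
    while current is not None:
        names.append(type(current).__name__)
        current = getattr(current, "env", None)
    return names


def innermost(env):
    """Follow .env links until the base environment is reached."""
    while hasattr(env, "env"):
        env = env.env
    return env


env = StatsLike(TimeLimitLike(BaseEnv()))
manifest = {
    "seed": 17,
    "render_mode": innermost(env).render_mode,
    "wrapper_stack": wrapper_stack(env),
}
print(json.dumps(manifest, indent=2))
```

Writing this manifest next to the step log means a later reader does not have to guess which wrappers shaped the logged rewards and ending flags.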
A usable environment wrapper for this section records the render mode, episode logs, videos, and info dictionary fields, plus the observation and action spaces and the reset seed, so that the evidence can be reproduced.
The common mistake is debugging from aggregate reward alone. A reward curve can improve while the robot learns to exploit a simulator artifact, ignore a safety margin, or complete the task in a way the render trace would immediately expose.
For a grasping policy, save one short video, the step log, and the final info dictionary for every failed evaluation seed. A reviewer can then tell whether failure came from perception drift, action saturation, collision, time limit, or reward mislabeling.
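A first-pass triage of the final log row can be sketched as follows. The rules, the "collision" info key, and the action_limit threshold are assumptions chosen for illustration. Perception drift and reward mislabeling are deliberately left to manual review, because a single row cannot distinguish them without the video.

```python
def classify_failure(final_row, action_limit=1.0):
    """Assign a coarse failure label from the last logged step of a failed episode."""
    info = final_row.get("info", {})
    if info.get("collision"):
        return "collision"
    if final_row["truncated"] and not final_row["terminated"]:
        return "time_limit"
    if abs(final_row["action"]) >= action_limit:
        return "action_saturation"
    # Perception drift and reward mislabeling need the video and the full trace.
    return "needs_manual_review"


# Invented final rows for four failed evaluation seeds.
failed = {
    0: {"action": 0.2, "terminated": True, "truncated": False, "info": {"collision": True}},
    1: {"action": 0.1, "terminated": False, "truncated": True, "info": {}},
    2: {"action": 1.0, "terminated": True, "truncated": False, "info": {}},
    3: {"action": 0.3, "terminated": True, "truncated": False, "info": {}},
}
labels = {seed: classify_failure(row) for seed, row in failed.items()}
print(labels)
```

The label is a starting hypothesis for the reviewer, not a verdict; the saved video and step log remain the evidence.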
For rendering, logging, and debugging, the useful test is simple: could a teammate point to the log line, plot, or trace that shows what changed the agent's next action?
Robot learning evaluation is moving toward richer artifacts: videos, action traces, simulator states, safety events, and human-readable task summaries. The frontier question is how to make those artifacts compact enough to compare at scale while still preserving enough detail to diagnose failures.
If a rollout fails, can you open one artifact and identify the observation, action, reward, ending flag, and visible scene at the failure step? If not, the logging plan is too thin.
Rendering and logging answer different parts of the same question. Rendering says what the environment displays as happening. Logging says what the algorithm saw and optimized. A strong debugging workflow keeps those synchronized by seed and step index.
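Keeping the two views synchronized can be sketched as a join on the (seed, step) key. The row layout and the frame file names below are hypothetical; the point is only that every log row either finds its frame or is reported as missing one.

```python
def join_by_key(log_rows, frame_index):
    """Attach a frame reference to each log row, keyed by (seed, step)."""
    joined, missing = [], []
    for row in log_rows:
        key = (row["seed"], row["step"])
        if key in frame_index:
            joined.append({**row, "frame": frame_index[key]})
        else:
            missing.append(key)
    return joined, missing


# Invented example: the frame for step 3 was never saved.
log_rows = [
    {"seed": 17, "step": 1, "action": 0, "reward": 1.0},
    {"seed": 17, "step": 2, "action": 1, "reward": 1.0},
    {"seed": 17, "step": 3, "action": 1, "reward": 0.0},
]
frame_index = {
    (17, 1): "seed17_step001.png",
    (17, 2): "seed17_step002.png",
}
joined, missing = join_by_key(log_rows, frame_index)
print(len(joined), missing)  # prints 2 [(17, 3)]
```

A non-empty missing list is itself a finding: the rollout cannot be fully audited at those steps.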
The graduate-level habit is to require traceability from a reported number back to at least one representative episode. A success rate without failure traces is fragile because it cannot show which assumptions survived contact with the simulator.
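Traceability from a number to episodes can be sketched by reporting the failing seeds alongside the rate. The summarize helper and the episode records below are hypothetical and invented for illustration.

```python
def summarize(episodes):
    """Return the success rate together with the seeds needed to revisit failures."""
    failures = [episode["seed"] for episode in episodes if not episode["success"]]
    rate = (len(episodes) - len(failures)) / len(episodes)
    return {"success_rate": rate, "failure_seeds": failures}


# Invented evaluation results for four seeds.
episodes = [
    {"seed": 0, "success": True},
    {"seed": 1, "success": False},
    {"seed": 2, "success": True},
    {"seed": 3, "success": True},
]
summary = summarize(episodes)
print(summary)  # prints {'success_rate': 0.75, 'failure_seeds': [1]}
```

With the seeds attached, the reported 0.75 points directly at the episode whose trace and video should be inspected.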
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| human render mode | Live visual inspection | Use locally when a developer needs to watch behavior. |
| rgb_array render mode | Image or video artifact | Use for saved rollouts and publication-quality inspection. |
| ansi render mode | Text artifact | Use for deterministic tests and lightweight debugging. |
| info dictionary | Machine-readable diagnostics | Use for contact flags, reward terms, hidden state checks, and timing. |
| Step log | Episode reconstruction | Use as the common index joining actions, rewards, endings, and render frames. |
A robust debugging implementation starts with a tiny trace format. The trace should be small enough to inspect by hand and structured enough to join with videos, metrics, and safety events.
- Choose the minimal render mode that captures the failure evidence.
- Write one log row per step before training long runs.
- Include terminated, truncated, and selected info fields in each row.
- Save seeds and wrapper stack beside the trace.
- Review a few failure traces before tuning reward or model architecture.
# Record a compact step trace that can be inspected after rollout.
# Each row preserves reward, ending status, and diagnostic keys.
import gymnasium as gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=17)
env.action_space.seed(17)
trace = []
for step_index in range(3):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    trace.append({
        "step": step_index + 1,
        "action": int(action),
        "reward": float(reward),
        "ended": terminated or truncated,
        "info_keys": sorted(info.keys()),
    })
print(trace)
env.close()
The expected output is a short rollout ledger with one dictionary per step. Read it as a minimal debugging artifact: every action, reward, and ending flag is preserved in order, so a later aggregate return can still be traced back to concrete behavior.
When a rollout fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.
Rendering makes behavior visible, logging makes behavior auditable, and debugging needs both views joined by seed and step index.
Run a five-step Gymnasium rollout and save a trace with action, reward, terminated, truncated, and one selected info key. Then write the one failure question that trace can answer.
The next section should inherit the rendering, logging, and debugging contract established here and change only the next environment-design variable under study.
Farama Foundation. "Gymnasium Documentation."
The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to rendering, logging, and debugging when deciding what is reusable, what is benchmark-specific, and what must be remeasured.
Farama Foundation. "PettingZoo Documentation."
PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.
This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.
Brockman, G. et al. (2016). "OpenAI Gym." arXiv.
The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.
Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."
Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.