Section 10.5: Rendering, logging, and debugging | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 10.5A: A debugging workflow for a Gymnasium environment: render() output at a key timestep alongside the logged reward curve and episode-length histogram, pinpointing where the agent gets stuck.

Big Picture

Rendering, logging, and debugging rest on the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.

This section puts the agent-environment interface into practice through render modes, episode logs, videos, info dictionaries, and debugging artifacts. The result is one auditable environment contract that supports RL training, multi-agent experiments, and benchmark evaluation.

What This Section Builds

Rendering, logging, and debugging are the evidence path for an environment. Rendering shows what the environment believes is happening, logging preserves what happened, and debugging connects those records to a concrete failure cause.

The goal is to stop treating a reward curve as the whole story. An embodied environment should produce enough trace evidence to answer which observation arrived, which action was sent, what the environment returned, and why the episode ended.

The Interface Is The Test

This environment is ready when another reader can reset it with the same seed, inspect render modes, episode logs, videos, info dictionaries, and debugging artifacts, reproduce the same rollout, and recover the same logged evidence.

Theory

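Gymnasium environments declare the render modes they support, such as human, rgb_array, or ansi, in their metadata, and the caller selects one with the render_mode argument when the environment is created. The right mode depends on the artifact: a live window is useful for local debugging, an RGB array can be saved as video, and text rendering can be checked in automated tests.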

Logging should sit next to the environment loop rather than after training. Each step record should include step index, seed, action, reward, termination flag, truncation flag, and selected info fields. For robotics, add controller status, contact events, safety margins, and timing.
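A minimal sketch of such a step record follows, assuming a JSON Lines layout with one object per step. The `StepRecord` class and `write_jsonl` helper are hypothetical names chosen for illustration, and the two rows are hand-written rather than produced by a real environment.

```python
import io
import json
from dataclasses import dataclass, asdict, field


@dataclass
class StepRecord:
    step: int
    seed: int
    action: int
    reward: float
    terminated: bool
    truncated: bool
    info: dict = field(default_factory=dict)  # only the selected info fields


def write_jsonl(records, stream):
    """Write one JSON object per line so the log can be read step by step."""
    for record in records:
        stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


# Hand-written rows for illustration only; a real loop would append one record
# immediately after each env.step call.
records = [
    StepRecord(step=0, seed=3, action=1, reward=0.0,
               terminated=False, truncated=False, info={"prob": 1.0}),
    StepRecord(step=1, seed=3, action=2, reward=0.0,
               terminated=True, truncated=False, info={"prob": 1.0}),
]

buffer = io.StringIO()  # stands in for an open log file
write_jsonl(records, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Keeping terminated and truncated as separate fields matters later, because a time-limit ending and a task ending call for different fixes.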

Mechanism

A render frame tells you what the environment would show an observer. A log record tells you what the policy and trainer consumed. Debugging begins when those two views disagree, such as a video showing contact while info reports no collision.
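The disagreement check can be sketched as a scan over step rows that carry both views. The field names `contact_in_frame` (a label obtained from reviewing the frame) and `collision` (an info flag) are assumptions for illustration, and the rows are made up rather than taken from a simulator.

```python
def find_disagreements(rows, frame_key="contact_in_frame", info_key="collision"):
    """Return the step indices where the visual label and the info flag differ."""
    return [
        row["step"]
        for row in rows
        if bool(row[frame_key]) != bool(row["info"].get(info_key, False))
    ]


# Illustrative rows, not real simulator output.
rows = [
    {"step": 0, "contact_in_frame": False, "info": {"collision": False}},
    {"step": 1, "contact_in_frame": True, "info": {"collision": False}},
    {"step": 2, "contact_in_frame": True, "info": {"collision": True}},
]

suspect_steps = find_disagreements(rows)
print(suspect_steps)  # [1]
```

Step 1 is where debugging should start: the frame shows contact one step before the info dictionary reports it.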

Worked Example

Code Fragment 10.5.1 uses an ansi render mode so the example works without a graphics window. The render frame gives a human-readable view, while the step return gives the machine-readable trace.

# Use text rendering when a debug check should run without a GUI.
# The step trace still records reward, ending flags, and info keys.
import gymnasium as gym

env = gym.make("FrozenLake-v1", render_mode="ansi", is_slippery=False)
observation, info = env.reset(seed=3)
frame = env.render()
visible = [line for line in frame.splitlines() if line.strip()]

observation, reward, terminated, truncated, info = env.step(1)
clean_row = visible[0].replace("\x1b[41m", "[").replace("\x1b[0m", "]")

print(clean_row)
print({"obs": int(observation), "reward": reward, "ended": terminated or truncated, "info_keys": sorted(info.keys())})
env.close()

[S]FFF
{'obs': 4, 'reward': 0, 'ended': False, 'info_keys': ['prob']}

The expected output combines a human-readable render frame with a machine-readable step record. The frame was captured right after reset, so it shows the agent on the start tile. The dictionary describes the result of the fixed action 1 (Down): the observation moved to state 4, the episode did not end, and a transition-probability diagnostic is available in info under the key prob.

Code Fragment 10.5.1 pairs a text render frame with a structured step trace. The frame helps a reader see the grid state, while the dictionary records the observation id, reward, ending status, and diagnostic keys.

Library Shortcut

Gymnasium render modes and wrappers such as RecordEpisodeStatistics and RecordVideo turn common debugging needs into standard calls. The shortcut works best when the saved artifact includes both visual evidence and structured fields, rather than only one or the other.

Practical Recipe

Choose a render mode that matches the artifact: live inspection, saved video, image array, or text trace.
Log one row per environment step with action, reward, ending flags, and selected info.
Save the wrapper stack and render mode with the log.
When a rollout fails, classify the failure before changing the policy.
Keep two representative failure traces for each reported metric table.

Gymnasium And PettingZoo Practice

A usable environment wrapper for this section records the render mode, episode logs, videos, info dictionary fields, and debugging artifacts, together with the observation and action spaces and the reset seed, so that the evidence can be reproduced.

Common Failure Mode

The common mistake is debugging from aggregate reward alone. A reward curve can improve while the robot learns to exploit a simulator artifact, ignore a safety margin, or complete the task in a way the render trace would immediately expose.

Practical Example

For a grasping policy, save one short video, the step log, and the final info dictionary for every failed evaluation seed. A reviewer can then tell whether failure came from perception drift, action saturation, collision, time limit, or reward mislabeling.
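A first-pass triage over the final step record might look like the sketch below. The info keys `collision` and `action_clipped` and the label names are assumptions for illustration; a real environment would expose its own diagnostics, and causes such as perception drift or reward mislabeling still need the video and full trace.

```python
def classify_failure(final_row):
    """Map the last step record of a failed episode to a coarse failure label."""
    info = final_row.get("info", {})
    if final_row["truncated"] and not final_row["terminated"]:
        return "time_limit"
    if info.get("collision", False):
        return "collision"
    if info.get("action_clipped", False):
        return "action_saturation"
    return "needs_manual_review"  # fall back to the video and the full trace


# Illustrative final rows from three failed evaluation seeds.
failed_rows = [
    {"seed": 4, "terminated": False, "truncated": True, "info": {}},
    {"seed": 9, "terminated": True, "truncated": False, "info": {"collision": True}},
    {"seed": 12, "terminated": True, "truncated": False, "info": {}},
]

labels = {row["seed"]: classify_failure(row) for row in failed_rows}
print(labels)
# {4: 'time_limit', 9: 'collision', 12: 'needs_manual_review'}
```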

Memory Hook

For rendering, logging, and debugging, the useful test is simple: could a teammate point to the log line, plot, or trace that explains the agent's next action?

Research Frontier

Robot learning evaluation is moving toward richer artifacts: videos, action traces, simulator states, safety events, and human-readable task summaries. The frontier question is how to make those artifacts compact enough to compare at scale while still preserving enough detail to diagnose failures.

Self Check

If a rollout fails, can you open one artifact and identify the observation, action, reward, ending flag, and visible scene at the failure step? If not, the logging plan is too thin.

Rendering and logging answer different parts of the same question. Rendering says what the environment displays as happening. Logging says what the algorithm saw and optimized. A strong debugging workflow keeps those synchronized by seed and step index.
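Synchronizing the two views can be sketched as a join on the pair (seed, step). The record layout and the helper name `join_log_and_frames` are assumptions, and the frame strings stand in for saved images or text frames.

```python
def join_log_and_frames(log_rows, frames):
    """Attach each frame to its log row using (seed, step) as the shared key."""
    frame_by_key = {(f["seed"], f["step"]): f["frame"] for f in frames}
    joined, missing = [], []
    for row in log_rows:
        key = (row["seed"], row["step"])
        if key in frame_by_key:
            joined.append({**row, "frame": frame_by_key[key]})
        else:
            missing.append(key)  # a log row with no visual evidence
    return joined, missing


# Illustrative records only.
log_rows = [
    {"seed": 3, "step": 0, "action": 1, "reward": 0.0},
    {"seed": 3, "step": 1, "action": 2, "reward": 0.0},
    {"seed": 3, "step": 2, "action": 2, "reward": 1.0},
]
frames = [
    {"seed": 3, "step": 0, "frame": "frame_000"},
    {"seed": 3, "step": 2, "frame": "frame_002"},
]

joined, missing = join_log_and_frames(log_rows, frames)
print(len(joined), missing)  # 2 [(3, 1)]
```

Reporting the missing keys is deliberate: a gap in the visual record is itself a finding, since it marks a step that cannot be audited.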

The graduate-level habit is to require traceability from a reported number back to at least one representative episode. A success rate without failure traces is fragile because it cannot show which assumptions survived contact with the simulator.
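One way to keep that traceability is to report the metric together with pointers to failure traces. In the sketch below, the episode summaries, the `trace_path` values, and the `summarize` helper are all hypothetical.

```python
def summarize(episodes, max_examples=2):
    """Report a success rate together with pointers to representative failures."""
    failures = [e for e in episodes if not e["success"]]
    return {
        "success_rate": sum(1 for e in episodes if e["success"]) / len(episodes),
        "num_episodes": len(episodes),
        "failure_traces": [e["trace_path"] for e in failures[:max_examples]],
    }


# Illustrative evaluation summaries; the paths are placeholders.
episodes = [
    {"seed": 0, "success": True, "trace_path": "traces/seed_0.jsonl"},
    {"seed": 1, "success": False, "trace_path": "traces/seed_1.jsonl"},
    {"seed": 2, "success": True, "trace_path": "traces/seed_2.jsonl"},
    {"seed": 3, "success": True, "trace_path": "traces/seed_3.jsonl"},
]

report = summarize(episodes)
print(report)
```

With this layout, the 0.75 success rate in the report never travels without the path to the episode that explains the missing quarter.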

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
`human` render mode	Live visual inspection	Use locally when a developer needs to watch behavior.
`rgb_array` render mode	Image or video artifact	Use for saved rollouts and publication-quality inspection.
`ansi` render mode	Text artifact	Use for deterministic tests and lightweight debugging.
`info` dictionary	Machine-readable diagnostics	Use for contact flags, reward terms, hidden state checks, and timing.
Step log	Episode reconstruction	Use as the common index joining actions, rewards, endings, and render frames.

A robust debugging implementation starts with a tiny trace format. The trace should be small enough to inspect by hand and structured enough to join with videos, metrics, and safety events.

Choose the minimal render mode that captures the failure evidence.
Write one log row per step before training long runs.
Include terminated, truncated, and selected info fields in each row.
Save seeds and wrapper stack beside the trace.
Review a few failure traces before tuning reward or model architecture.

# Record a compact step trace that can be inspected after rollout.
# Each row preserves reward, ending status, and diagnostic keys.
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=17)
env.action_space.seed(17)
trace = []

for step_index in range(3):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    trace.append({
        "step": step_index + 1,
        "action": int(action),
        "reward": float(reward),
        "ended": terminated or truncated,
        "info_keys": sorted(info.keys()),
    })

print(trace)
env.close()

[{'step': 1, 'action': 1, 'reward': 1.0, 'ended': False, 'info_keys': []}, {'step': 2, 'action': 1, 'reward': 1.0, 'ended': False, 'info_keys': []}, {'step': 3, 'action': 0, 'reward': 1.0, 'ended': False, 'info_keys': []}]

The expected output is a short rollout ledger with one dictionary per step. Read it as a minimal debugging artifact: every action, reward, and ending flag is preserved in order, so a later aggregate return can still be traced back to concrete behavior.

Code Fragment 10.5.2 builds a minimal step trace around a Gymnasium loop. The trace preserves action, reward, ending status, and diagnostic keys, which is enough to connect an aggregate metric back to individual rollout behavior.

When a rollout or experiment fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.
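The controlled-perturbation step can be sketched by replaying the same actions with one variable changed and locating the first step where the traces diverge. The toy one-dimensional walk, the `slip_at` perturbation, and both helper names are assumptions made to keep the example self-contained.

```python
def run_episode(actions, slip_at=None):
    """Toy 1-D walk: position changes by the action unless that step slips."""
    position = 0
    trace = []
    for step, action in enumerate(actions):
        moved = 0 if step == slip_at else action
        position += moved
        trace.append({"step": step, "action": action, "position": position})
    return trace


def first_divergence(baseline, perturbed):
    """Return the first step index where two traces differ, or None if none does."""
    for base_row, other_row in zip(baseline, perturbed):
        if base_row != other_row:
            return base_row["step"]
    return None


actions = [1, 1, -1, 1]
baseline = run_episode(actions)
perturbed = run_episode(actions, slip_at=2)  # change exactly one variable

print(first_divergence(baseline, perturbed))  # 2
```

Because the actions are identical in both runs, any difference in the trace is attributable to the perturbation, which is what makes the rerun diagnostic rather than anecdotal.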

Key Takeaway

Rendering makes behavior visible, logging makes behavior auditable, and debugging needs both views joined by seed and step index.

Exercise 10.5.1

Run a five-step Gymnasium rollout and save a trace with action, reward, terminated, truncated, and one selected info key. Then write the one failure question that trace can answer.

What's Next?

The next section should inherit the interface contract established here for rendering, logging, and debugging, and change only the next environment-design variable under study.

Bibliography and Further Reading

Tools And Libraries

Farama Foundation. "Gymnasium Documentation."

The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to rendering, logging, and debugging when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Tool

Farama Foundation. "PettingZoo Documentation."

PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.

Tool

Foundational Papers

Terry, J. K. et al. (2021). "PettingZoo: Gym for Multi-Agent Reinforcement Learning." NeurIPS Datasets and Benchmarks.

This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.

Paper

Brockman, G. et al. (2016). "OpenAI Gym." arXiv.

The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.

Paper

Tools And Libraries

Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."

Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.

Tool