Section 10.6: Evaluation protocol and seeding | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Big Picture

An evaluation protocol and a seed policy build on the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.

This section turns the agent-environment interface into a working practice of seed control, fixed evaluation panels, statistical comparison, and artifact versioning, so that RL training, multi-agent experiments, and benchmark evaluation all rest on one auditable environment contract.

What This Section Builds

Evaluation protocol and seeding are the guardrails for a valid comparison. A seed policy controls the starting randomness; an evaluation protocol controls which tasks, wrappers, metrics, and failure labels are compared.

The goal is one auditable comparison artifact. If two methods are compared, they should be evaluated in one pass on the same environment panel, same wrapper stack, same seed list, and same metric definitions.

The Interface Is The Test

This environment is ready when another reader can reset it with the same seed, inspect the seed policy, evaluation panel, comparison statistics, and artifact version, reproduce the same rollout, and recover the same logged evidence.

Theory

Seeding is not a guarantee that every library, process, and physics engine will behave identically across machines. It is a contract for controlled comparison: the experiment declares how initial conditions, action sampling, environment randomness, and evaluation panels are generated.

Gymnasium supports this discipline by passing a seed into reset and allowing the action space to be seeded for reproducible sampling. A strong protocol records both, then evaluates methods under the same seed list instead of reporting numbers from separate runs.

Mechanism

The protocol is the unit of comparison. It binds environment id, seed list, wrappers, render mode, task panel, metrics, and aggregation rule into one object so later tables cannot mix incompatible numbers.
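One way to make that binding concrete is a small immutable record with a fingerprint. The sketch below is only an illustration under stated assumptions: `EvalProtocol`, its field names, and the `fingerprint` method are hypothetical and are not part of Gymnasium; only the Python standard library is used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical protocol record; field names are illustrative assumptions."""
    env_id: str
    seeds: tuple
    wrappers: tuple
    render_mode: Optional[str]
    max_episode_steps: int
    metrics: tuple
    aggregation: str

    def fingerprint(self) -> str:
        # Serialize with sorted keys so identical fields always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


protocol = EvalProtocol(
    env_id="CartPole-v1",
    seeds=(1, 2, 3, 4),
    wrappers=("TimeLimit",),
    render_mode=None,
    max_episode_steps=500,
    metrics=("reward", "terminated", "truncated"),
    aggregation="mean_over_seeds",
)

# Store this short hash next to every result row produced under the protocol.
print(protocol.fingerprint())
```

Because the record is frozen and hashed, two result tables can be checked for compatibility by comparing fingerprints before any numbers are placed side by side.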

Worked Example

Code Fragment 10.6.1 demonstrates a minimal seed smoke test. The same seed reproduces the first sampled action and transition; a different seed changes the trace.

# Check whether the environment and action sampling are seeded together.
# The same seed should reproduce the first action and first transition.
import gymnasium as gym

def first_step(seed):
    env = gym.make("CartPole-v1")
    observation, info = env.reset(seed=seed)
    env.action_space.seed(seed)
    action = env.action_space.sample()
    next_observation, reward, terminated, truncated, info = env.step(action)
    env.close()
    return round(float(next_observation[0]), 5), int(action), terminated, truncated

print(first_step(21))
print(first_step(21))
print(first_step(22))

(0.02832, 0, False, False)
(0.02832, 0, False, False)
(-0.01397, 1, False, False)

The expected output repeats the first trace exactly under the repeated seed, then changes when the seed changes. That is the correct interpretation for a seed smoke test: determinism within a seed, variation across seeds, and no hidden episode ending in any of the one-step traces.

Code Fragment 10.6.1 uses a one-step trace as a seed check. The repeated seed produces the same tuple, while changing the seed changes the starting transition, which is exactly what an evaluation smoke test should expose.

Library Shortcut

Gymnasium gives the seed hooks, but the protocol is still the author's responsibility. The shortcut is to make seeds and wrappers explicit in code, then emit one artifact containing all compared methods instead of assembling a table from separate files.

Practical Recipe

Declare the seed list before running methods.
Evaluate every compared method on the same environment ids, wrappers, and seeds.
Save per-seed results before aggregating means or confidence intervals.
Report termination, truncation, and failure labels with the primary success metric.
Save one machine-readable artifact that contains all compared numbers.
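The "save raw rows, then aggregate" steps of the recipe above can be sketched as follows. The rows are made-up placeholder values, the file name `panel.jsonl` and the helper `summarize` are hypothetical, and the interval is a rough normal approximation (1.96 standard errors) chosen for brevity; with only a handful of seeds a t-based or bootstrap interval may be more appropriate.

```python
import json
import math
import statistics
import tempfile
from pathlib import Path

# Made-up per-seed rows standing in for real evaluation output.
rows = [
    {"method": "A", "seed": s, "reward": r}
    for s, r in zip([1, 2, 3, 4], [10.0, 12.0, 14.0, 12.0])
] + [
    {"method": "B", "seed": s, "reward": r}
    for s, r in zip([1, 2, 3, 4], [9.0, 11.0, 10.0, 10.0])
]

# Step 1: persist the raw rows first, one JSON object per line.
artifact = Path(tempfile.mkdtemp()) / "panel.jsonl"
artifact.write_text("".join(json.dumps(row) + "\n" for row in rows), encoding="utf-8")

# Step 2: aggregate only from what was saved, not from in-memory state.
saved = [json.loads(line) for line in artifact.read_text(encoding="utf-8").splitlines()]


def summarize(rows, metric):
    """Group rows by method and return n, mean, and an approximate 95% half-width."""
    by_method = {}
    for row in rows:
        by_method.setdefault(row["method"], []).append(row[metric])
    summary = {}
    for method, values in by_method.items():
        # The sample standard deviation needs at least two values.
        sem = statistics.stdev(values) / math.sqrt(len(values)) if len(values) > 1 else float("nan")
        summary[method] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "ci95_half_width": 1.96 * sem,
        }
    return summary


summary = summarize(saved, "reward")
print(summary)
```

Aggregating from the reloaded file rather than from the in-memory list is a deliberate choice: it proves the saved artifact alone is sufficient to reproduce the summary.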

Gymnasium And PettingZoo Practice

A usable environment wrapper for this section records seed control, fixed panels, statistical comparison, and artifact versioning, plus observation and action spaces, reset seed, info dictionary fields, and reproducible evidence artifacts.

Common Failure Mode

The common mistake is comparing method A on one seed panel with method B on another seed panel. Every entry in the table may be backed by a real run, but the comparison is invalid because the numbers were not co-computed under the same protocol.
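A guard against this failure can be run before any aggregation. The sketch below assumes each result row carries `method`, `env_id`, and `seed` keys; those key names and the helper `check_shared_panel` are hypothetical choices for illustration.

```python
def check_shared_panel(rows):
    """Raise ValueError unless every method covers the same (env_id, seed) cases."""
    cases = {}
    for row in rows:
        cases.setdefault(row["method"], set()).add((row["env_id"], row["seed"]))
    if not cases:
        raise ValueError("no rows to compare")

    # Use the first method seen as the reference panel.
    reference_method, reference = next(iter(cases.items()))
    for method, seen in cases.items():
        if seen != reference:
            missing = sorted(reference - seen)
            extra = sorted(seen - reference)
            raise ValueError(
                f"{method} differs from {reference_method}: missing={missing}, extra={extra}"
            )
    return sorted(reference)


good_rows = [
    {"method": "A", "env_id": "CartPole-v1", "seed": 1},
    {"method": "A", "env_id": "CartPole-v1", "seed": 2},
    {"method": "B", "env_id": "CartPole-v1", "seed": 1},
    {"method": "B", "env_id": "CartPole-v1", "seed": 2},
]
# Method B was run on seed 3 instead of seed 2, so the panels do not match.
bad_rows = good_rows[:3] + [{"method": "B", "env_id": "CartPole-v1", "seed": 3}]

print(check_shared_panel(good_rows))
try:
    check_shared_panel(bad_rows)
except ValueError as error:
    print("rejected:", error)
```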

Practical Example

For a navigation benchmark, evaluate all policies on the same 50 start-goal seeds, same obstacle layouts, same wrappers, and same time limit. The artifact should have one row per method and seed, with success, path length, collision count, termination flag, and truncation flag.
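A minimal sketch of such an artifact, written as CSV with one row per method and seed, might look like this. The column names mirror the fields listed above, but the exact names, the helper `write_panel`, and all values are illustrative assumptions rather than output from a real benchmark.

```python
import csv
import io

# Hypothetical column names for the navigation panel.
FIELDS = ["method", "seed", "success", "path_length", "collisions", "terminated", "truncated"]


def write_panel(rows, stream):
    """Write one CSV row per (method, seed); reject rows with missing fields."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        missing = [name for name in FIELDS if name not in row]
        if missing:
            raise ValueError(f"row for seed {row.get('seed')} is missing {missing}")
        writer.writerow(row)  # DictWriter also raises on unexpected extra keys


# Placeholder values for illustration only.
rows = [
    {"method": "policy_a", "seed": 0, "success": True, "path_length": 12.4,
     "collisions": 0, "terminated": True, "truncated": False},
    {"method": "policy_a", "seed": 1, "success": False, "path_length": 30.0,
     "collisions": 2, "terminated": False, "truncated": True},
    {"method": "policy_b", "seed": 0, "success": True, "path_length": 11.9,
     "collisions": 0, "terminated": True, "truncated": False},
    {"method": "policy_b", "seed": 1, "success": True, "path_length": 18.3,
     "collisions": 1, "terminated": True, "truncated": False},
]

buffer = io.StringIO(newline="")
write_panel(rows, buffer)
print(buffer.getvalue())
```

An in-memory buffer keeps the sketch self-contained; in a real run the same function would receive a file opened with `newline=""`.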

Memory Hook

Treat evaluation protocol and seeding like a control-room label. If the label does not tell a future debugger which seeds, wrappers, environment version, and metric code produced a number, it is decoration rather than engineering knowledge.

Research Frontier

Embodied AI benchmarks increasingly stress evaluation protocol design: scenario diversity, randomized starts, sim-to-real transfer, and seed sensitivity all affect conclusions. The open research challenge is not only stronger policies, but protocols that reveal when a policy is robust rather than lucky.

Self Check

Can you identify the exact seed list, wrapper stack, environment version, and metric script used for every number in a comparison table? If not, the protocol is not reproducible enough.

Evaluation protocols prevent accidental storytelling. Without a shared protocol, one method can benefit from easier starts, different time limits, missing wrappers, or a different failure classifier. The table still has numbers, but the comparison does not answer the claimed question.

The graduate-level habit is to make the comparison artifact the source of truth. Every plotted mean, table entry, and claim should be derivable from one saved panel where method, seed, environment, and metric are all columns.
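Treating the artifact as the source of truth implies that reported numbers can be audited mechanically. The sketch below recomputes per-method means from raw rows and flags any reported value that does not match; the helper `audit`, the row keys, and the numbers are hypothetical.

```python
import math


def audit(reported, rows, metric, tol=1e-9):
    """Return {method: (claimed, recomputed)} for every reported mean that fails to match."""
    mismatches = {}
    for method, claimed in reported.items():
        values = [row[metric] for row in rows if row["method"] == method]
        if not values:
            # A reported number with no supporting rows is itself a mismatch.
            mismatches[method] = (claimed, None)
            continue
        recomputed = sum(values) / len(values)
        if not math.isclose(claimed, recomputed, abs_tol=tol):
            mismatches[method] = (claimed, recomputed)
    return mismatches


# Made-up panel: success is 1.0 or 0.0 per seed.
rows = [
    {"method": "A", "seed": s, "success": v} for s, v in zip([1, 2, 3, 4], [1.0, 0.0, 1.0, 1.0])
] + [
    {"method": "B", "seed": s, "success": v} for s, v in zip([1, 2, 3, 4], [0.0, 1.0, 0.0, 0.0])
]

# The table claims 0.30 for B, but the rows support 0.25.
reported = {"A": 0.75, "B": 0.30}
print(audit(reported, rows, "success"))
```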

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
`reset(seed=...)`	Environment initialization	Use to control starting randomness for each episode.
`action_space.seed(...)`	Random action sampling	Use for deterministic smoke tests and random baselines.
Seed panel	Shared evaluation cases	Use the same panel for every compared method.
Per-seed row	Raw evidence	Save before computing means, intervals, or plots.
Protocol hash or config	Reproducibility handle	Save environment id, wrappers, metrics, and version data together.

A robust implementation produces a method-by-seed table first, then derives summaries from it. This keeps each claim traceable to a concrete run under the same protocol.

Write a protocol config with environment id, wrappers, seed list, max steps, and metrics.
Loop over methods inside the same evaluation script.
Loop over seeds inside each method and write one row per episode.
Aggregate only after all raw rows are saved.
Audit every table number by recomputing it from the saved artifact.

# Build a small co-computed evaluation panel.
# Each method is evaluated on the same seed list in one artifact.
import gymnasium as gym

def rollout(method_name, seed):
    env = gym.make("CartPole-v1", max_episode_steps=5)
    observation, info = env.reset(seed=seed)
    env.action_space.seed(seed)
    total_reward = 0.0
    terminated = truncated = False

    while not (terminated or truncated):
        action = 0 if method_name == "always_left" else env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)

    env.close()
    return {"method": method_name, "seed": seed, "reward": total_reward, "truncated": truncated}

panel = [rollout(method, seed) for method in ["always_left", "random"] for seed in [1, 2]]
print(panel)

[{'method': 'always_left', 'seed': 1, 'reward': 5.0, 'truncated': True}, {'method': 'always_left', 'seed': 2, 'reward': 5.0, 'truncated': True}, {'method': 'random', 'seed': 1, 'reward': 5.0, 'truncated': True}, {'method': 'random', 'seed': 2, 'reward': 5.0, 'truncated': True}]

The expected output is one co-computed row per method and seed, not one summary per method. Even though the toy panel gives the same reward in all four rows, the important interpretation is that the evidence format makes a valid comparison possible because every row was produced under the same protocol.

Code Fragment 10.6.2 co-computes two methods on the same seed panel. The list of rows is the evidence artifact; any table should be derived from these rows rather than from separate method-specific runs.
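Because both methods share a seed list, rows can also be paired by seed before any statistic is computed. The sketch below shows one possible paired summary (per-seed differences and their mean); it is not a prescribed test, and the helper `paired_differences` and the reward values are made up for illustration.

```python
def paired_differences(rows, metric, method_a, method_b):
    """Return {seed: metric(method_a) - metric(method_b)} over the shared seeds."""
    by_key = {(row["method"], row["seed"]): row[metric] for row in rows}
    seeds_a = {seed for (method, seed) in by_key if method == method_a}
    seeds_b = {seed for (method, seed) in by_key if method == method_b}
    if seeds_a != seeds_b:
        raise ValueError("methods were not evaluated on the same seeds")
    return {
        seed: by_key[(method_a, seed)] - by_key[(method_b, seed)]
        for seed in sorted(seeds_a)
    }


# Placeholder rewards; a real panel would come from the evaluation loop.
rows = [
    {"method": "a", "seed": 1, "reward": 20.0},
    {"method": "a", "seed": 2, "reward": 15.0},
    {"method": "a", "seed": 3, "reward": 30.0},
    {"method": "b", "seed": 1, "reward": 18.0},
    {"method": "b", "seed": 2, "reward": 17.0},
    {"method": "b", "seed": 3, "reward": 25.0},
]

diffs = paired_differences(rows, "reward", "a", "b")
mean_diff = sum(diffs.values()) / len(diffs)
wins = sum(1 for value in diffs.values() if value > 0)
print(diffs, mean_diff, wins)
```

Pairing removes the variation that comes from easy or hard starting conditions, which is only possible when every row was produced under the same protocol.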

When an evaluation run produces a disappointing result, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Key Takeaway

A comparison is valid when the compared numbers are co-computed under one protocol. Seeds, wrappers, task panel, and metric code are part of the result, not metadata trivia.

Exercise 10.6.1

Design a four-seed evaluation panel for two policies. Specify the environment id, wrapper stack, seed list, metric fields, and the single artifact that will store all per-seed rows.

What's Next?

The next section should inherit the evaluation protocol and seeding contract established here and change only the next environment-design variable under study.

Bibliography and Further Reading

Tools And Libraries

Farama Foundation. "Gymnasium Documentation."

The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to evaluation protocol and seeding when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Farama Foundation. "PettingZoo Documentation."

PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.

Foundational Papers

Terry, J. K. et al. (2021). "PettingZoo: Gym for Multi-Agent Reinforcement Learning." NeurIPS Datasets and Benchmarks.

This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.

Brockman, G. et al. (2016). "OpenAI Gym." arXiv.

The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.

Tools And Libraries

Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."

Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.
