Section 10.7: PettingZoo for multi-agent

A Careful Control Loop
Big Picture

The Gymnasium and PettingZoo interfaces define the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.

This section turns the agent-environment interface into concrete practice for four multi-agent concerns: agent ordering, the choice between the AEC and Parallel APIs, shared-state logging, and simultaneous-action semantics. The aim is one auditable environment contract that serves RL training, multi-agent experiments, and benchmark evaluation.

What This Section Builds

PettingZoo becomes operational when a multi-agent task names its agents, timing, and per-agent returns. Gymnasium assumes one learning agent acts at each step. PettingZoo adds agent identities, per-agent observations, per-agent rewards, and per-agent episode endings.

The goal is to choose the right multi-agent API before writing the task. Use AEC when turn order matters. Use the Parallel API when agents act simultaneously and the environment resolves their joint action at the end of a cycle.

The Interface Is The Test

This environment is ready when another reader can reset it with the same seed, inspect its agent ordering, its AEC or Parallel API choice, its shared-state logs, and its simultaneous-action semantics, then reproduce the same rollout and recover the same logged evidence.
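
One way to make this readiness test concrete is to fingerprint a seeded rollout and compare two runs. The sketch below is a minimal illustration in plain Python, not a PettingZoo utility: `toy_rollout` and `rollout_fingerprint` are hypothetical names, and the random actions stand in for a real environment and policy.

```python
import hashlib
import json
import random

def toy_rollout(seed, steps=5):
    """Return per-step log records for two hypothetical agents."""
    rng = random.Random(seed)  # every random choice flows from the seed
    agents = ["picker", "carrier"]
    log = []
    for t in range(steps):
        actions = {agent: rng.randrange(3) for agent in agents}
        # Record the API choice and agent order next to the actions.
        log.append({"t": t, "api": "parallel", "agent_order": agents, "actions": actions})
    return log

def rollout_fingerprint(log):
    """Hash the canonical JSON form of a log so two runs can be compared."""
    payload = json.dumps(log, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

first = rollout_fingerprint(toy_rollout(seed=42))
second = rollout_fingerprint(toy_rollout(seed=42))
print(first == second)  # True
```

If the two fingerprints differ for the same seed, some source of randomness or ordering is escaping the contract and should be found before any training run is trusted.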

Theory

PettingZoo's AEC API models an Agent Environment Cycle: the environment selects one active agent, that agent observes and acts, and the environment advances to the next agent. That model fits turn-based or order-sensitive settings, such as a robot handing off an object after another robot clears space.

The Parallel API steps all live agents together through dictionaries keyed by agent id. It fits simultaneous control, such as two mobile robots moving in the same timestep. Both APIs keep terminations, truncations, rewards, observations, and infos per agent, because different agents may leave the task for different reasons.

Mechanism

The mental model is one Gymnasium contract per agent, plus a scheduler. AEC makes the scheduler explicit through agent_iter() and last(). Parallel environments hide the scheduler and ask for a single action dictionary with one entry for every live agent.
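
The difference between an explicit and a hidden scheduler can be seen in a small sketch. The code below is plain Python, not PettingZoo code: `step_sequential` and `step_simultaneous` are hypothetical helpers, and the blocking rule (an agent cannot enter an occupied or contested cell) is an assumption chosen only to make the ordering effect visible. The simultaneous rule is deliberately simple and ignores cases such as two agents swapping cells.

```python
def step_sequential(positions, moves, order):
    """AEC-style: agents act one at a time and see earlier agents' effects."""
    positions = dict(positions)
    for agent in order:
        target = positions[agent] + moves[agent]
        occupied = {p for other, p in positions.items() if other != agent}
        if target not in occupied:  # a blocked move leaves the agent in place
            positions[agent] = target
    return positions

def step_simultaneous(positions, moves):
    """Parallel-style: every target is computed from the same pre-step snapshot."""
    targets = {agent: positions[agent] + moves[agent] for agent in positions}
    result = {}
    for agent, target in targets.items():
        contested = any(t == target for other, t in targets.items() if other != agent)
        result[agent] = positions[agent] if contested else target
    return result

start = {"picker": 1, "carrier": 2}
moves = {"picker": 1, "carrier": 1}
print(step_sequential(start, moves, order=["picker", "carrier"]))  # {'picker': 1, 'carrier': 3}
print(step_sequential(start, moves, order=["carrier", "picker"]))  # {'picker': 2, 'carrier': 3}
print(step_simultaneous(start, moves))                             # {'picker': 2, 'carrier': 3}
```

The same intended moves produce different states under the two sequential orders, which is why the schedule belongs in the environment contract rather than in trainer code.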

Worked Example

Code Fragment 10.7.1 builds a tiny Parallel API environment with two agents on a one-dimensional line. It is intentionally small so the return dictionaries are easy to inspect.

# Implement the minimal shape of a PettingZoo ParallelEnv.
# Observations, rewards, terminations, truncations, and infos are keyed by agent.
from gymnasium import spaces
from pettingzoo.utils.env import ParallelEnv

class TwoRobotLine(ParallelEnv):
    metadata = {"name": "two_robot_line_v0"}

    def __init__(self):
        self.possible_agents = ["picker", "carrier"]
        self.observation_spaces = {agent: spaces.Box(0, 4, shape=(1,), dtype=int) for agent in self.possible_agents}
        self.action_spaces = {agent: spaces.Discrete(3) for agent in self.possible_agents}

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        self.positions = {"picker": 0, "carrier": 4}
        observations = {agent: [pos] for agent, pos in self.positions.items()}
        infos = {agent: {} for agent in self.agents}
        return observations, infos

    def step(self, actions):
        moves = {0: -1, 1: 0, 2: 1}
        for agent, action in actions.items():
            self.positions[agent] = min(4, max(0, self.positions[agent] + moves[int(action)]))
        observations = {agent: [pos] for agent, pos in self.positions.items()}
        rewards = {agent: float(self.positions["picker"] == self.positions["carrier"]) for agent in self.agents}
        terminations = {agent: rewards[agent] == 1.0 for agent in self.agents}
        truncations = {agent: False for agent in self.agents}
        infos = {agent: {"position": self.positions[agent]} for agent in self.agents}
        if any(terminations.values()):
            self.agents = []
        return observations, rewards, terminations, truncations, infos

env = TwoRobotLine()
observations, infos = env.reset(seed=42)
print(observations)
print(env.step({"picker": 2, "carrier": 0})[1:4])
{'picker': [0], 'carrier': [4]}
({'picker': 0.0, 'carrier': 0.0}, {'picker': False, 'carrier': False}, {'picker': False, 'carrier': False})

The expected output exposes two independent agent views at reset, then three dictionaries keyed by agent id for rewards, terminations, and truncations after one joint step. Readers should interpret the all-false ending dictionaries as evidence that neither robot has yet reached the meeting condition.

Code Fragment 10.7.1 shows the core PettingZoo Parallel API shape without relying on a bundled game. The important detail is the dictionary contract: rewards, terminations, and truncations are keyed by picker and carrier, not returned as one scalar.
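
The dictionary contract can also be checked mechanically. The sketch below is a small plain-Python checker under stated assumptions: `check_parallel_step` is a hypothetical helper, it only compares dictionary keys against the agents that were live when the actions were submitted, and it does not replace the API tests that PettingZoo itself provides.

```python
def check_parallel_step(live_agents, step_result):
    """Return a list of problems found in a Parallel-style step tuple."""
    names = ["observations", "rewards", "terminations", "truncations", "infos"]
    if len(step_result) != len(names):
        return [f"expected {len(names)} dictionaries, got {len(step_result)}"]
    problems = []
    for name, value in zip(names, step_result):
        if not isinstance(value, dict):
            problems.append(f"{name} is not a dictionary")
        elif set(value) != set(live_agents):
            problems.append(f"{name} keys {sorted(value)} != agents {sorted(live_agents)}")
    return problems

agents = ["picker", "carrier"]
good = (
    {"picker": [1], "carrier": [3]},
    {"picker": 0.0, "carrier": 0.0},
    {"picker": False, "carrier": False},
    {"picker": False, "carrier": False},
    {"picker": {}, "carrier": {}},
)
bad = (good[0], 0.0, good[2], good[3], good[4])  # team reward collapsed to one scalar
print(check_parallel_step(agents, good))  # []
print(check_parallel_step(agents, bad))   # ['rewards is not a dictionary']
```
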
Library Shortcut

PettingZoo supplies the standard multi-agent API so trainers and test utilities can reason about agent ids, action spaces, observations, rewards, and ending flags. The shortcut is to conform to the API rather than inventing a custom dictionary format that every downstream tool must relearn.

Practical Recipe

  1. Choose AEC when turn order affects state, legality, or reward assignment.
  2. Choose Parallel API when all live agents submit actions for the same environment tick.
  3. Define possible_agents, per-agent spaces, and live agents explicitly.
  4. Return observations, rewards, terminations, truncations, and infos keyed by agent id.
  5. Log agent-specific failure labels, because one agent can truncate or terminate before another.
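
The last step of the recipe can be sketched as a small labeling function. This is a minimal illustration in plain Python: `label_endings` and the label strings are hypothetical, and a real task would usually attach a more specific failure reason from each agent's info dictionary.

```python
def label_endings(terminations, truncations):
    """Map each agent to 'terminated', 'truncated', or 'running'."""
    labels = {}
    for agent in terminations:
        if terminations[agent]:
            labels[agent] = "terminated"
        elif truncations.get(agent, False):
            labels[agent] = "truncated"
        else:
            labels[agent] = "running"
    return labels

terminations = {"picker": True, "carrier": False}
truncations = {"picker": False, "carrier": False}
labels = label_endings(terminations, truncations)
live = [agent for agent, label in labels.items() if label == "running"]
print(labels)  # {'picker': 'terminated', 'carrier': 'running'}
print(live)    # ['carrier']
```
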
Gymnasium And PettingZoo Practice

A usable environment wrapper for this section records the agent ordering, the AEC or Parallel API choice, the shared-state log, and the simultaneous-action semantics, along with observation and action spaces, the reset seed, info dictionary fields, and reproducible evidence artifacts.

Common Failure Mode

The common mistake is averaging rewards across agents before debugging. A high team score can hide that one agent learned to wait while another agent does all the work, or that a collision penalty is assigned to the wrong participant.
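
A short sketch shows how the averaged number hides the imbalance. The reward log below is invented for illustration, and `per_agent_returns` is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict

def per_agent_returns(reward_log):
    """Sum rewards separately for each agent across a list of step dictionaries."""
    totals = defaultdict(float)
    for rewards in reward_log:
        for agent, reward in rewards.items():
            totals[agent] += reward
    return dict(totals)

# Invented per-step rewards: the carrier earns everything, the picker nothing.
reward_log = [
    {"picker": 0.0, "carrier": 1.0},
    {"picker": 0.0, "carrier": 1.0},
    {"picker": 0.0, "carrier": 2.0},
]
returns = per_agent_returns(reward_log)
team_mean = sum(returns.values()) / len(returns)
print(returns)    # {'picker': 0.0, 'carrier': 4.0}
print(team_mean)  # 2.0
```

The team mean of 2.0 looks healthy, while the per-agent view shows that the picker contributed no reward at all.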

Practical Example

In a warehouse task with a picker robot and a carrier robot, the Parallel API fits if both robots move once per tick. An AEC design fits if the picker must finish a grasp decision before the carrier is allowed to move. The choice changes the policy interface and the failure analysis.

Memory Hook

A good embodied system makes its multi-agent contract visible twice: once in the design sketch and once in the replay artifact. The second view keeps the first one honest.

Research Frontier

Multi-agent embodied AI raises open questions that single-agent Gymnasium tasks avoid: credit assignment, communication, opponent or teammate modeling, non-stationarity, and fair evaluation when agents have asymmetric roles. PettingZoo standardizes the interface so those scientific questions can be studied without rewriting the environment contract each time.

Self Check

Can you state which agents act together, which act in sequence, and which reward belongs to each agent? If not, the multi-agent API choice is still under-specified.

PettingZoo matters because multi-agent environments are not only bigger Gymnasium environments. The unit of action can be one active agent or a dictionary of simultaneous agent actions. The unit of reward can be individual, shared, or both. The unit of termination can differ by agent.

The graduate-level habit is to state the game form before coding: agents, observation timing, action timing, reward ownership, termination ownership, and whether roles are symmetric. Once those are explicit, the PettingZoo API choice becomes a modeling decision rather than a software afterthought.
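
One way to state the game form before coding is to write it down as data. The sketch below is an assumption-laden illustration in plain Python: the `GameForm` fields, their allowed string values, and `suggest_api` are all hypothetical, and the mapping from action timing to API simply restates the rule of thumb used in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameForm:
    agents: tuple
    observation_timing: str     # "per_turn" or "per_tick"
    action_timing: str          # "sequential" or "simultaneous"
    reward_ownership: str       # "individual", "shared", or "mixed"
    termination_ownership: str  # "per_agent" or "joint"
    symmetric_roles: bool

def suggest_api(form):
    """Pick the API family from the action timing alone."""
    if form.action_timing == "sequential":
        return "AEC"
    if form.action_timing == "simultaneous":
        return "Parallel"
    raise ValueError(f"unknown action timing: {form.action_timing!r}")

warehouse = GameForm(
    agents=("picker", "carrier"),
    observation_timing="per_tick",
    action_timing="simultaneous",
    reward_ownership="shared",
    termination_ownership="joint",
    symmetric_roles=False,
)
print(suggest_api(warehouse))  # Parallel
```

Writing the form as a frozen record makes it easy to log alongside each rollout, so a reviewer can see which modeling decisions the results depend on.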

Practical Tool Choices For This Section
Tool or Library | Role in the Topic | Builder Advice
AEC API | Sequential agent turns | Use when order matters or the active agent changes after each environment update.
Parallel API | Simultaneous actions | Use when all live agents choose actions for the same tick.
possible_agents | Full roster | Use for every agent that can appear in the environment.
agents | Live roster | Update as agents enter, finish, or leave the task.
Per-agent dictionaries | Multi-agent return contract | Keep rewards, terminations, truncations, and infos attributable.

A robust implementation starts with the interaction schedule. After that, write the per-agent spaces and the step return structure. Only then should you connect policies or trainers.

  1. List agents and roles before writing reward code.
  2. Choose AEC or Parallel API from the action timing, not from trainer convenience.
  3. Declare per-agent observation and action spaces.
  4. Return per-agent dictionaries for observations, rewards, terminations, truncations, and infos.
  5. Run PettingZoo API tests before trusting a custom environment.
# Sketch the official AEC loop shape for turn-based environments.
# The active agent receives last(), then submits one action with step().
def run_aec_policy(env, policy):
    env.reset(seed=42)
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        if termination or truncation:
            action = None
        else:
            action = policy(agent, observation, info)
        env.step(action)
    env.close()

print("AEC loop handles one active agent at a time.")
AEC loop handles one active agent at a time.

The expected output is intentionally verbal rather than numeric. It marks the core AEC interpretation: each iteration belongs to exactly one currently active agent, so per-agent observation, reward, and ending logic must be read turn by turn rather than as one simultaneous batch.

Code Fragment 10.7.2 shows the PettingZoo AEC control pattern. The last() call belongs to the current agent, and None is passed when that agent has already terminated or truncated.
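
To watch the loop run without installing a bundled game, the pattern can be exercised against a toy stand-in. The sketch below is plain Python under loud assumptions: `ToyTurnEnv` is a hypothetical class that only mimics the agent_iter(), last(), and step() call shape and is not a PettingZoo AECEnv. It repeats the loop function so the block is self-contained.

```python
class ToyTurnEnv:
    """Hypothetical stand-in that mimics the AEC call pattern."""

    def __init__(self, turns_per_agent=2):
        self.possible_agents = ["picker", "carrier"]
        self.turns_per_agent = turns_per_agent

    def reset(self, seed=None):
        self.agents = self.possible_agents[:]
        self.turns = {agent: 0 for agent in self.agents}
        self.current = self.agents[0]
        self.trace = []  # (agent, action) pairs in the order they were stepped

    def agent_iter(self):
        # Yield the active agent until every agent has been removed.
        while self.agents:
            yield self.current

    def last(self):
        # The observation here is just the number of turns the agent has taken.
        terminated = self.turns[self.current] >= self.turns_per_agent
        return self.turns[self.current], 0.0, terminated, False, {}

    def step(self, action):
        self.trace.append((self.current, action))
        i = self.agents.index(self.current)
        if action is None:
            self.agents.pop(i)  # a finished agent leaves the live roster
        else:
            self.turns[self.current] += 1
            i += 1
        self.current = self.agents[i % len(self.agents)] if self.agents else None

    def close(self):
        pass

def run_aec_policy(env, policy):
    env.reset(seed=42)
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        action = None if termination or truncation else policy(agent, observation, info)
        env.step(action)
    env.close()

env = ToyTurnEnv()
run_aec_policy(env, lambda agent, observation, info: "go")
print(env.trace)
# [('picker', 'go'), ('carrier', 'go'), ('picker', 'go'), ('carrier', 'go'), ('picker', None), ('carrier', None)]
```

The trace shows each finished agent receiving one final turn in which None is submitted, which is the same bookkeeping the loop in Code Fragment 10.7.2 performs.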

When a PettingZoo multi-agent experiment fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Key Takeaway

PettingZoo turns multi-agent interaction into an explicit contract: agent ids, per-agent spaces, per-agent rewards, and per-agent endings. Choose AEC or Parallel API from the interaction timing.

Exercise 10.7.1

Design a two-agent embodied task and decide whether it should use AEC or Parallel API. Specify the agents, each observation space, each action space, reward ownership, and one per-agent termination condition.

Bibliography and Further Reading
Tools And Libraries

Farama Foundation. "Gymnasium Documentation."

The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to their PettingZoo multi-agent work when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Tool

Farama Foundation. "PettingZoo Documentation."

PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.

Tool
Foundational Papers

Terry, J. K. et al. (2021). "PettingZoo: Gym for Multi-Agent Reinforcement Learning." NeurIPS Datasets and Benchmarks.

This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.

Paper

Brockman, G. et al. (2016). "OpenAI Gym." arXiv.

The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.

Paper
Tools And Libraries

Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."

Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.

Tool
What's Next?

Chapter 11 moves from environment APIs to the physics simulators that make embodied tasks run.