Section 10.4: Vectorized environments; wrappers

A Careful Control Loop
Figure 10.4A: Vectorized environment architecture: N independent environment copies run in parallel subprocesses, their observations are stacked into a batch tensor, and the policy processes the batch in a single forward pass.
Big Picture

Vectorized environments and wrappers both modify the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.

This section turns the agent-environment interface into concrete practice around vectorized reset semantics, wrapper order, batch shape, and per-environment info records. That practice prepares RL training, multi-agent experiments, and benchmark evaluation to share one auditable environment contract.

What This Section Builds

Wrappers and vectorized environments change the contract around a base task. Wrappers change what an environment exposes without rewriting the base simulator. Vector environments run several copies of a task behind one batched API.

The goal is to keep throughput improvements honest. A wrapper stack should be declared, ordered, and logged, and a vectorized rollout should preserve per-environment termination, truncation, reward, and info fields.

The Interface Is The Test

This environment is ready when another reader can reset it with the same seed, inspect vectorized reset semantics, wrapper order, batch shape, and per-environment info records, reproduce the same rollout, and recover the same logged evidence.

Theory

A wrapper is an environment transformation. It can alter observations, actions, rewards, metadata, or the behavior of reset and step. The base task remains underneath, but the policy sees the wrapped contract.

A vector environment batches several environment copies so the learner collects experience faster. Gymnasium vector envs return arrays for rewards, terminations, and truncations with one element per sub-environment. Observations are batched according to the observation space, which is why space design from Section 10.2 matters before vectorization.

Mechanism

Think of wrappers as a visible pipeline around the simulator and vectorization as a batch dimension around that pipeline. The audit question is always the same: which contract did the policy actually see?
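
One way to answer that audit question is to record the wrapper stack from the outermost layer down to the base task. The sketch below assumes each wrapper keeps the environment it wraps in an `env` attribute, as Gymnasium's `gym.Wrapper` does. The classes `BaseTask`, `Layer`, `ScaleAction`, and `AddTimeFeature` are hypothetical stand-ins so the example runs without a simulator.

```python
# Walk a wrapper chain from the outermost layer to the base task.
# Assumption: every wrapper stores the environment it wraps in `.env`
# (Gymnasium's gym.Wrapper does); the classes below are illustrative stand-ins.


class BaseTask:
    """Hypothetical base environment with no wrappers."""


class Layer:
    """Hypothetical minimal wrapper that only stores the inner environment."""

    def __init__(self, env):
        self.env = env


class ScaleAction(Layer):
    pass


class AddTimeFeature(Layer):
    pass


def wrapper_stack(env):
    """Return class names from the outermost wrapper down to the base task."""
    names = []
    current = env
    while current is not None:
        names.append(type(current).__name__)
        current = getattr(current, "env", None)  # the base task has no `.env`
    return names


stack = wrapper_stack(AddTimeFeature(ScaleAction(BaseTask())))
print(" -> ".join(stack))  # AddTimeFeature -> ScaleAction -> BaseTask
```

Saving this list next to each result makes the contract the policy saw part of the logged evidence rather than something reconstructed from memory.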

Worked Example

Code Fragment 10.4.1 applies a simple Gymnasium observation wrapper. The wrapper adds time awareness to the observation, so the policy sees a five-value observation instead of the base four-value CartPole state.

# Wrap an environment so the observation includes elapsed time.
# The policy sees the wrapped observation space, not the base one.
import gymnasium as gym
from gymnasium.wrappers import TimeAwareObservation

env = gym.make("CartPole-v1")
wrapped = TimeAwareObservation(env)

observation, info = wrapped.reset(seed=13)
wrapped.action_space.seed(13)
action = wrapped.action_space.sample()
next_observation, reward, terminated, truncated, info = wrapped.step(action)

print(wrapped.observation_space.shape)
print(next_observation.shape, float(reward), terminated, truncated)
wrapped.close()
(5,)
(5,) 1.0 False False

The expected output shows a five-entry observation space, one more than the four-value base CartPole state. The step results confirm that the wrapped environment still returns a legal step tuple, but the policy now receives time information as part of its observation.

Code Fragment 10.4.1 shows that TimeAwareObservation changes the observation contract from four values to five. A result table that omits this wrapper would be incomplete because the policy received extra time information.
Library Shortcut

Gymnasium wrappers replace custom preprocessing glue with named, inspectable transformations. The shortcut is safe only when the wrapper order is saved, because reward clipping before logging and reward clipping after logging produce different evidence.
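
The effect of order on logged evidence can be shown with a library-free sketch. Everything here is a hypothetical stand-in rather than a Gymnasium wrapper: `ConstantRewardTask` returns only a reward from `step`, `ClipReward` clips it into [-1, 1], and `LogReward` records whatever it receives from the layer beneath it.

```python
# Show that wrapper order changes the logged evidence.
# All classes here are hypothetical stand-ins, not Gymnasium wrappers.


class ConstantRewardTask:
    """Toy task whose every step returns a raw reward of 5.0."""

    def step(self, action):
        return 5.0


class ClipReward:
    """Clip rewards from the inner environment into [-1, 1]."""

    def __init__(self, env):
        self.env = env

    def step(self, action):
        reward = self.env.step(action)
        return max(-1.0, min(1.0, reward))


class LogReward:
    """Record whatever reward the inner environment returns."""

    def __init__(self, env):
        self.env = env
        self.log = []

    def step(self, action):
        reward = self.env.step(action)
        self.log.append(reward)
        return reward


# Clip first, then log: the logger sits outside and sees the clipped signal.
log_after_clip = LogReward(ClipReward(ConstantRewardTask()))
log_after_clip.step(0)

# Log first, then clip: the logger sits inside and sees the raw task reward.
log_before_clip = LogReward(ConstantRewardTask())
policy_view = ClipReward(log_before_clip)
policy_view.step(0)

print(log_after_clip.log, log_before_clip.log)  # [1.0] [5.0]
```

Both stacks hand the policy the same clipped reward, yet the two logs disagree, which is why the saved order belongs with the results.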

Practical Recipe

  1. Write the base environment contract before adding wrappers.
  2. Add one wrapper at a time and record how it changes spaces, rewards, or info.
  3. Use vector environments when rollout throughput, not environment semantics, is the bottleneck.
  4. Interpret vector outputs per sub-environment, not as one scalar episode.
  5. Log autoreset mode and final observations when using vector rollouts that reset sub-environments automatically.
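
Steps 4 and 5 can be sketched as per-environment bookkeeping over batched step outputs. The step records below are made-up values shaped like a three-environment batch, not simulator output, and `track_episodes` is a hypothetical helper. The sketch assumes every batch entry is a real transition; depending on the autoreset mode, the step that follows an episode end may be a reset rather than a transition and would need to be skipped.

```python
# Track episode returns per sub-environment from batched step outputs.
# The step records are made-up values shaped like a 3-environment batch;
# they are not output from a real simulator.


def track_episodes(steps, num_envs):
    """Return one record per finished episode, tagged with its env index."""
    returns = [0.0] * num_envs
    lengths = [0] * num_envs
    finished = []
    for rewards, terminations, truncations in steps:
        for i in range(num_envs):
            returns[i] += rewards[i]
            lengths[i] += 1
            if terminations[i] or truncations[i]:
                finished.append({
                    "env_index": i,
                    "return": returns[i],
                    "length": lengths[i],
                    "ended_by": "termination" if terminations[i] else "truncation",
                })
                returns[i], lengths[i] = 0.0, 0  # that copy starts a new episode
    return finished


steps = [
    ([1.0, 1.0, 1.0], [False, False, False], [False, False, False]),
    ([1.0, 1.0, 1.0], [False, True, False], [False, False, False]),
    ([1.0, 1.0, 1.0], [False, False, False], [True, False, False]),
]
episodes = track_episodes(steps, num_envs=3)
for record in episodes:
    print(record)
```

Each record keeps the sub-environment index and the reason the episode ended, so termination and truncation are never merged into a single "done" count.
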
Gymnasium And PettingZoo Practice

A usable environment wrapper for this section records vectorized reset semantics, wrapper order, batch shape, and per-environment info records, plus observation and action spaces, reset seed, info dictionary fields, and reproducible evidence artifacts.

Common Failure Mode

The common mistake is comparing a wrapped run with an unwrapped run as if only the policy changed. If one run clips rewards, normalizes observations, or adds time features, the comparison is no longer construct matched.
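
A cheap guard against this mistake is to compare the declared contracts of two runs before comparing their scores. In the sketch below, `contract_differences` and the field names are hypothetical; substitute whatever fields your own run records contain.

```python
# Compare the declared contracts of two runs before comparing their scores.
# Field names and values are hypothetical; use whatever your run records contain.


def contract_differences(run_a, run_b):
    """Return {field: (value_a, value_b)} for every field where the runs differ."""
    fields = sorted(set(run_a) | set(run_b))
    return {
        name: (run_a.get(name), run_b.get(name))
        for name in fields
        if run_a.get(name) != run_b.get(name)
    }


baseline = {
    "env_id": "CartPole-v1",
    "wrappers": [],
    "observation_shape": (4,),
    "num_envs": 1,
}
candidate = {
    "env_id": "CartPole-v1",
    "wrappers": ["TimeAwareObservation"],
    "observation_shape": (5,),
    "num_envs": 1,
}

differences = contract_differences(baseline, candidate)
for name, (a, b) in differences.items():
    print(f"{name}: {a} != {b}")
```

An empty result is the precondition for claiming that only the policy changed between the two runs.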

Practical Example

A manipulation lab might vectorize 32 simulated arms to collect rollouts faster, then wrap observations with normalization and action scaling. The result artifact should list both the vector environment parameters and the wrapper stack, because both affect what the policy learned.
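
A result artifact of that kind might look like the following sketch. Every field name and value is a hypothetical example, including the environment id `HypotheticalArm-v0` and the innermost-first ordering convention; none of it is a Gymnasium schema.

```python
# Serialize a run manifest that names both the vector settings and the wrapper stack.
# Every field name and value is a hypothetical example, not a Gymnasium schema.
import json

manifest = {
    "env_id": "HypotheticalArm-v0",
    "vector": {"num_envs": 32, "mode": "async"},
    "wrappers": ["NormalizeObservation", "RescaleAction"],  # innermost first
    "reset_seed": 13,
}

text = json.dumps(manifest, indent=2, sort_keys=True)
print(text)

# Round-trip check: the saved artifact restores the same settings.
restored = json.loads(text)
```

Sorting the keys keeps the file stable across runs, so a plain text diff between two manifests shows only real configuration changes.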

Memory Hook

When vectorized environments and wrappers feel abstract, ask what would be different in the next frame of video, the next robot state, or the next safety margin.

Research Frontier

High-throughput robot learning depends on batched simulation, but batching changes the debugging surface. Current work on accelerated simulators and vectorized training stacks makes it easier to collect experience, while increasing the need for per-environment failure traces rather than only aggregate reward curves.

Self Check

Can you write the wrapper stack in order and explain the shape of one batched observation, reward, termination, and truncation array? If not, the vectorized experiment is not yet inspectable.

Wrappers are powerful because they separate task dynamics from interface transformations. That separation also creates a risk: the experiment may claim to evaluate a base environment while the policy actually saw normalized observations, clipped rewards, time features, action rescaling, and a time-limit wrapper.

Vector environments add another layer. They make the rollout batch look like one object, but each sub-environment still has its own episode boundary. Evaluation code should preserve that identity so one unstable instance does not disappear inside an average.
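
The following sketch shows how an average can hide one failing instance. The returns and the threshold are made-up numbers chosen for illustration, not measurements.

```python
# Compare an aggregate return with per-environment returns.
# The returns are made-up numbers chosen to show one failing sub-environment.
from statistics import mean

episode_returns = {0: 200.0, 1: 198.0, 2: 12.0, 3: 201.0}  # env index -> return
threshold = 100.0  # hypothetical minimum acceptable return

average = mean(episode_returns.values())
failing = [i for i, value in episode_returns.items() if value < threshold]

print(f"mean return: {average:.2f}")  # mean return: 152.75
print(f"failing sub-environments: {failing}")  # failing sub-environments: [2]
```

The mean clears the threshold comfortably, while the per-environment view points directly at sub-environment 2 as the trace worth inspecting.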

Practical Tool Choices For This Section
Tool or Library | Role in the Topic | Builder Advice
ObservationWrapper | Observation transformation | Use for time features, resizing, normalization, or sensor projection.
ActionWrapper | Action transformation | Use for rescaling, clipping, or translating policy actions into controller commands.
RewardWrapper | Reward transformation | Use with caution, because it changes the training signal.
SyncVectorEnv | Batched rollout in one process | Use for simple debugging and deterministic batched smoke tests.
AsyncVectorEnv | Batched rollout across processes | Use when environment stepping is expensive enough to justify multiprocessing complexity.

A robust vector implementation proves the single environment first, then creates a batched version with the same wrappers. The first artifact explains semantics; the second artifact explains throughput.

  1. Run one unwrapped environment step and save the return shapes.
  2. Add wrappers and rerun the same seed to show how the contract changed.
  3. Create the vector environment only after the wrapper stack is fixed.
  4. Store per-sub-environment rewards, termination flags, truncation flags, and final info.
  5. Use aggregate plots only after preserving the per-environment traces.
# Step three CartPole environments through one vectorized API call.
# Rewards and ending flags retain one element per sub-environment.
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")
observations, infos = envs.reset(seed=11)
envs.action_space.seed(11)
actions = envs.action_space.sample()
observations, rewards, terminations, truncations, infos = envs.step(actions)

print(observations.shape)
print(actions.tolist())
print(rewards.tolist())
print(terminations.tolist(), truncations.tolist())
envs.close()
(3, 4)
[0, 0, 1]
[1.0, 1.0, 1.0]
[False, False, False] [False, False, False]

The expected output should be read row-wise across three sub-environments: a batched observation tensor of shape (3, 4), three sampled actions, three rewards, and one termination and truncation flag per environment. Nothing is aggregated yet, which is exactly the right interpretation for vectorized evidence.

Code Fragment 10.4.2 uses gym.make_vec to step three environments together. The observation shape (3, 4) shows the batch dimension, and the reward and ending arrays keep one value per sub-environment.

When an experiment that uses vectorized environments or wrappers fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Key Takeaway

Wrappers change the contract and vector environments batch the contract. Treat both as first-class experiment settings, not invisible implementation details.

Exercise 10.4.1

Create a two-environment vector rollout for a Gymnasium task, then add one observation wrapper. Record the observation shape before and after wrapping, and explain which result table fields must mention the wrapper.

What's Next?

The next section should inherit the interface contract for vectorized environments and wrappers established here and change only the next environment-design variable under study.

Bibliography and Further Reading
Tools And Libraries

Farama Foundation. "Gymnasium Documentation."

The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. Readers should connect this source to vectorized environments and wrappers when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Tool

Farama Foundation. "PettingZoo Documentation."

PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.

Tool
Foundational Papers

Terry, J. K. et al. (2021). "PettingZoo: Gym for Multi-Agent Reinforcement Learning." NeurIPS Datasets and Benchmarks.

This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.

Paper

Brockman, G. et al. (2016). "OpenAI Gym." arXiv.

The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.

Paper
Tools And Libraries

Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."

Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.

Tool