A Careful Control Loop
Vectorized environments and wrappers shape the contract an embodied experiment exposes to learning code: observations, actions, rewards, termination, truncation, rendering, and diagnostic info. Gymnasium handles the single-agent version of that contract, while PettingZoo extends the same discipline to multi-agent interaction.
This section puts the agent-environment interface into practice through vectorized reset semantics, wrapper order, batch shape, and per-environment info records, so that RL training, multi-agent experiments, and benchmark evaluation can share one auditable environment contract.
What This Section Builds
Wrappers and vectorized environments change the contract around a base task. Wrappers change what an environment exposes without rewriting the base simulator. Vector environments run several copies of a task behind one batched API.
The goal is to keep throughput improvements honest. A wrapper stack should be declared, ordered, and logged, and a vectorized rollout should preserve per-environment termination, truncation, reward, and info fields.
This environment is ready when another reader can reset it with the same seed, inspect vectorized reset semantics, wrapper order, batch shape, and per-environment info records, reproduce the same rollout, and recover the same logged evidence.
Theory
A wrapper is an environment transformation. It can alter observations, actions, rewards, metadata, or the behavior of reset and step. The base task remains underneath, but the policy sees the wrapped contract.
A vector environment batches several environment copies so the learner collects experience faster. Gymnasium vector envs return arrays for rewards, terminations, and truncations with one element per sub-environment. Observations are batched according to the observation space, which is why space design from Section 10.2 matters before vectorization.
Think of wrappers as a visible pipeline around the simulator and vectorization as a batch dimension around that pipeline. The audit question is always the same: which contract did the policy actually see?
Worked Example
Code Fragment 10.4.1 applies a simple Gymnasium observation wrapper. The wrapper adds time awareness to the observation, so the policy sees a five-value observation instead of the base four-value CartPole state.
```python
# Wrap an environment so the observation includes elapsed time.
# The policy sees the wrapped observation space, not the base one.
import gymnasium as gym
from gymnasium.wrappers import TimeAwareObservation

env = gym.make("CartPole-v1")
wrapped = TimeAwareObservation(env)

observation, info = wrapped.reset(seed=13)
wrapped.action_space.seed(13)
action = wrapped.action_space.sample()
next_observation, reward, terminated, truncated, info = wrapped.step(action)

print(wrapped.observation_space.shape)
print(next_observation.shape, float(reward), terminated, truncated)
wrapped.close()
```
The first printed line reports the wrapped observation space, which has five entries instead of CartPole's usual four. The second line confirms that the wrapped environment still returns a legal step tuple, but the policy now receives elapsed-time information as part of its observation.
TimeAwareObservation changes the observation contract from four values to five. A result table that omits this wrapper would be incomplete because the policy received extra time information.

Gymnasium wrappers replace custom preprocessing glue with named, inspectable transformations. The shortcut is safe only when the wrapper order is saved, because reward clipping before logging and reward clipping after logging produce different evidence.
Practical Recipe
- Write the base environment contract before adding wrappers.
- Add one wrapper at a time and record how it changes spaces, rewards, or info.
- Use vector environments when rollout throughput, not environment semantics, is the bottleneck.
- Interpret vector outputs per sub-environment, not as one scalar episode.
- Log autoreset mode and final observations when using vector rollouts that reset sub-environments automatically.
A usable environment wrapper for this section records vectorized reset semantics, wrapper order, batch shape, and per-environment info records, plus observation and action spaces, reset seed, info dictionary fields, and reproducible evidence artifacts.
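A record of this kind can be as simple as a serializable manifest stored next to the results. The sketch below uses only the standard library; the class name EnvManifest, its field names, and every value filled in are illustrative assumptions rather than a Gymnasium format.

```python
# Store the environment contract as a JSON manifest alongside the results.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EnvManifest:
    """Hypothetical record of the contract the policy actually saw."""

    env_id: str
    wrapper_order: list        # outermost wrapper first
    num_envs: int
    vectorization_mode: str
    autoreset_mode: str
    reset_seed: int
    observation_shape: tuple   # batched shape, batch axis first
    action_space: str
    info_fields: list = field(default_factory=list)


# Illustrative values only; a real run would fill these from the live environment.
manifest = EnvManifest(
    env_id="CartPole-v1",
    wrapper_order=["TimeAwareObservation"],
    num_envs=3,
    vectorization_mode="sync",
    autoreset_mode="next_step",
    reset_seed=11,
    observation_shape=(3, 5),
    action_space="MultiDiscrete([2 2 2])",
    info_fields=[],
)

record = json.dumps(asdict(manifest), sort_keys=True, indent=2)
print(record)
```

Sorting the keys keeps the file stable across runs, which makes two manifests easy to compare with an ordinary text diff.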
The common mistake is comparing a wrapped run with an unwrapped run as if only the policy changed. If one run clips rewards, normalizes observations, or adds time features, the comparison is no longer construct matched.
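A guard against this mistake is to compare the recorded contracts before comparing the scores. The function contract_diff, the dictionary keys, and the policy names below are hypothetical and stand in for whatever a project actually logs.

```python
# Refuse to treat two runs as comparable when their environment contracts differ.
def contract_diff(run_a, run_b, keys=("wrapper_order", "observation_shape", "reward_transform")):
    """Hypothetical check: return the contract fields on which two runs disagree."""
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }


# Illustrative run records; only the policy was meant to change.
baseline = {
    "policy": "policy_small",
    "wrapper_order": [],
    "observation_shape": (4,),
    "reward_transform": None,
}
candidate = {
    "policy": "policy_large",
    "wrapper_order": ["TimeAwareObservation"],
    "observation_shape": (5,),
    "reward_transform": None,
}

diff = contract_diff(baseline, candidate)
print(diff)
```

A non-empty result means the two runs differ in more than the policy, so any gap in their scores cannot be attributed to the policy alone.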
A manipulation lab might vectorize 32 simulated arms to collect rollouts faster, then wrap observations with normalization and action scaling. The result artifact should list both the vector environment parameters and the wrapper stack, because both affect what the policy learned.
When vectorized environments and wrappers feel abstract, ask what would be different in the next frame of video, the next robot state, or the next safety margin.
High-throughput robot learning depends on batched simulation, but batching changes the debugging surface. Current work on accelerated simulators and vectorized training stacks makes it easier to collect experience, while increasing the need for per-environment failure traces rather than only aggregate reward curves.
Can you write the wrapper stack in order and explain the shape of one batched observation, reward, termination, and truncation array? If not, the vectorized experiment is not yet inspectable.
Wrappers are powerful because they separate task dynamics from interface transformations. That separation also creates a risk: the experiment may claim to evaluate a base environment while the policy actually saw normalized observations, clipped rewards, time features, action rescaling, and a time-limit wrapper.
Vector environments add another layer. They make the rollout batch look like one object, but each sub-environment still has its own episode boundary. Evaluation code should preserve that identity so one unstable instance does not disappear inside an average.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| ObservationWrapper | Observation transformation | Use for time features, resizing, normalization, or sensor projection. |
| ActionWrapper | Action transformation | Use for rescaling, clipping, or translating policy actions into controller commands. |
| RewardWrapper | Reward transformation | Use with caution, because it changes the training signal. |
| SyncVectorEnv | Batched rollout in one process | Use for simple debugging and deterministic batched smoke tests. |
| AsyncVectorEnv | Batched rollout across processes | Use when environment stepping is expensive enough to justify multiprocessing complexity. |
A robust vector implementation proves the single environment first, then creates a batched version with the same wrappers. The first artifact explains semantics; the second artifact explains throughput.
- Run one unwrapped environment step and save the return shapes.
- Add wrappers and rerun the same seed to show how the contract changed.
- Create the vector environment only after the wrapper stack is fixed.
- Store per-sub-environment rewards, termination flags, truncation flags, and final info.
- Use aggregate plots only after preserving the per-environment traces.
```python
# Step three CartPole environments through one vectorized API call.
# Rewards and ending flags retain one element per sub-environment.
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync")

observations, infos = envs.reset(seed=11)
envs.action_space.seed(11)
actions = envs.action_space.sample()
observations, rewards, terminations, truncations, infos = envs.step(actions)

print(observations.shape)
print(actions.tolist())
print(rewards.tolist())
print(terminations.tolist(), truncations.tolist())
envs.close()
```
The expected output should be read row-wise across three sub-environments: a batched observation tensor of shape (3, 4), three sampled actions, three rewards, and one termination and truncation flag per environment. Nothing is aggregated yet, which is exactly the right interpretation for vectorized evidence.
The fragment uses gym.make_vec to step three environments together. The observation shape (3, 4) shows the batch dimension, and the reward and ending arrays keep one value per sub-environment.

When an experiment with vectorized environments and wrappers fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.
Wrappers change the contract and vector environments batch the contract. Treat both as first-class experiment settings, not invisible implementation details.
Create a two-environment vector rollout for a Gymnasium task, then add one observation wrapper. Record the observation shape before and after wrapping, and explain which result table fields must mention the wrapper.
The next section should inherit the interface contract for vectorized environments and wrappers and change only the next environment-design variable under study.
Farama Foundation. "Gymnasium Documentation."
The official Gymnasium docs define the reset, step, render, terminated, truncated, and info conventions used by maintained environments. Readers implementing custom environments should use this as the API reference. For vectorized environments and wrappers, read it with an eye to what is reusable, what is benchmark-specific, and what must be remeasured.
Farama Foundation. "PettingZoo Documentation."
PettingZoo defines maintained APIs for multi-agent reinforcement learning. It is directly relevant when a section moves from one embodied agent to turn-based, simultaneous, or mixed multi-agent interaction.
This paper explains why multi-agent environments need explicit agent ordering and interface discipline. It gives researchers the context behind the AEC and parallel API choices described in this chapter.
Brockman, G. et al. (2016). "OpenAI Gym." arXiv.
The original Gym paper explains the environment abstraction that Gymnasium modernizes. It is useful for readers comparing legacy examples with the maintained Farama stack.
Stable-Baselines3 Contributors. "Stable-Baselines3 Documentation."
Stable-Baselines3 gives a practical reference for how environment spaces, vectorized environments, wrappers, and evaluation callbacks are consumed by training code. Engineers should read it when turning a custom environment into a reproducible RL experiment.