Section 2.1: Agents and environments formally

"The environment is the part of the experiment that gets a vote after every action."

Figure 2.1A: The formal agent-environment boundary, labeling observation $o_t$, action $a_t$, next state $s_{t+1}$, and reward $r_t$ as distinct typed signals crossing that boundary each timestep.

Big Picture

Treating agents and environments formally gives the perception-action loop a contract. The agent is the decision-making process, the environment is everything that produces observations and consequences, and the interface is the boundary across which evidence, action, reward, and timing pass.

Figure 2.1. Concept map for Section 2.1: transition functions connect actions to next observations. Environment dynamics are easiest to reason about as a closed loop of evidence (what the agent receives), decision (what the system changes), and consequence (what the next step inherits). Closed-loop feedback makes the next input depend on the last action.

This section turns the agent-environment loop into a precise object you can specify, implement, test, and log. The point is not to admire a loop diagram. The point is to know exactly what happens when a robot receives a sensor packet, chooses an action, waits for the world to respond, and records evidence about the result.

The distinction matters because many embodied failures are interface failures. A policy may be competent, but the environment wrapper may hide time-limit truncation. A simulator may expose privileged state that the real robot never observes. A logger may record reward but omit the action clipping that changed the actual command.

The Interface Is The Experiment

An embodied experiment is only as clear as its transition record: observation, action, reward or score, termination, truncation, timing, and diagnostic info. If any field is vague, later results become hard to interpret.

Theory

At time $t$, an agent receives an observation $o_t$, chooses an action $a_t$, and receives a consequence that usually includes a new observation $o_{t+1}$, a reward or score $r_{t+1}$, and episode status. The environment owns the transition dynamics. The agent owns the decision rule. The evaluator owns the claim about whether behavior was good.
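
The ownership split above can be sketched in a few lines, assuming a one-dimensional toy world. The names `policy`, `transition`, `evaluate`, and `rollout` are hypothetical and chosen only for illustration; they are not part of any library.

```python
# A minimal sketch of who owns what in the loop. All names are hypothetical.
Observation = int   # assumption: the observation is a 1-D position
Action = int        # assumption: -1 (left) or +1 (right)

def policy(observation: Observation) -> Action:
    """Agent: owns the decision rule and sees only the observation."""
    return 1 if observation < 3 else -1

def transition(observation: Observation, action: Action) -> tuple[Observation, float, bool]:
    """Environment: owns the dynamics, the reward emission, and episode status."""
    next_observation = observation + action
    done = next_observation >= 3
    reward = 1.0 if done else -0.01
    return next_observation, reward, done

def evaluate(rewards: list[float], done: bool) -> dict:
    """Evaluator: owns the claim about whether behavior was good."""
    return {"success": done, "return": round(sum(rewards), 2), "steps": len(rewards)}

def rollout(o_t: Observation, max_steps: int = 10) -> dict:
    rewards = []
    done = False
    for _ in range(max_steps):
        a_t = policy(o_t)                          # agent chooses a_t from o_t
        o_t, r_next, done = transition(o_t, a_t)   # environment returns o_{t+1}, r_{t+1}
        rewards.append(r_next)
        if done:
            break
    return evaluate(rewards, done)

print(rollout(0))
```

Keeping the three functions separate makes it harder for the policy to read state it should not see, or for the environment to quietly decide what counts as success.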

For a fully specified single-agent environment, the minimal contract is close to the Gymnasium pattern: reset starts an episode and returns an initial observation plus metadata; step(action) advances the world and returns observation, reward, terminated, truncated, and info. Terminated means the task reached a natural end. Truncated means an external limit, such as a time limit, stopped the episode.
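
A minimal sketch of that contract in plain Python follows. It mirrors the shape of the reset and step calls described above but does not import or subclass Gymnasium; the class name `LineWorld`, its goal position, and its step limit are assumptions made for illustration.

```python
import random
from typing import Optional

class LineWorld:
    """Hypothetical 1-D world that follows the reset/step shape, not Gymnasium itself."""

    def __init__(self, goal: int = 3, max_steps: int = 5) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed: Optional[int] = None):
        # Start an episode: sample an initial state, return observation plus metadata.
        rng = random.Random(seed)
        self.position = rng.choice([0, 1])
        self.steps = 0
        return {"position": self.position}, {"seed": seed}

    def step(self, action: str):
        if action not in ("left", "right"):
            raise ValueError(f"unknown action: {action!r}")
        self.position += 1 if action == "right" else -1
        self.steps += 1
        # Terminated: the task reached its natural end.
        terminated = self.position >= self.goal
        # Truncated: an external time limit stopped the episode.
        truncated = self.steps >= self.max_steps and not terminated
        reward = 1.0 if terminated else -0.01
        return {"position": self.position}, reward, terminated, truncated, {"steps": self.steps}

env = LineWorld()
observation, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    observation, reward, terminated, truncated, info = env.step("right")
print(observation, reward, terminated, truncated, info)
```

In this sketch the two ending flags are mutually exclusive by construction, which matches the definitions in the paragraph above.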

For multi-agent settings, the contract also needs turn order or simultaneous actions. PettingZoo makes that distinction explicit through sequential and parallel APIs. This matters for embodied systems with people, other robots, traffic participants, or adversarial agents.
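
The difference between the two interaction patterns can be sketched without any multi-agent library. The functions `step_parallel` and `step_sequential` and the two toy policies below are hypothetical; they are not PettingZoo calls and only illustrate why turn order belongs in the contract.

```python
# Hypothetical sketch: the same policies give different outcomes under
# simultaneous and turn-based stepping.

def leader_policy(snapshot: dict) -> int:
    return 1  # always move right

def follower_policy(snapshot: dict) -> int:
    # Move right only if the leader is more than one cell ahead.
    return 1 if snapshot["leader"] - snapshot["follower"] > 1 else 0

def step_parallel(positions: dict, policies: dict) -> dict:
    """All agents decide from the same snapshot; moves are applied together."""
    snapshot = dict(positions)
    return {name: snapshot[name] + policies[name](snapshot) for name in snapshot}

def step_sequential(positions: dict, policies: dict, order: list) -> dict:
    """Agents act one at a time; later agents observe earlier agents' moves."""
    current = dict(positions)
    for name in order:
        current[name] += policies[name](current)
    return current

start = {"leader": 1, "follower": 0}
policies = {"leader": leader_policy, "follower": follower_policy}

print(step_parallel(start, policies))
print(step_sequential(start, policies, ["leader", "follower"]))
```

Under parallel stepping the follower decides before seeing the leader move and stays put. Under sequential stepping it sees the larger gap and advances.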

Mechanism

The mechanism is a typed transition boundary. A useful environment does not merely run physics or replay data. It standardizes reset, step, action validation, observation structure, episode endings, random seeds, and diagnostic info so that closed-loop behavior can be reproduced.
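
One way to check the reproducibility part of that claim is to replay the same seed and action sequence and compare trajectories. This is a sketch under stated assumptions: `run_episode` is a hypothetical helper, and the noise model is invented for illustration.

```python
import random

def run_episode(seed: int, actions: list[int]) -> list[float]:
    """Replay a fixed action sequence in a noisy 1-D world driven by one seeded RNG."""
    rng = random.Random(seed)            # environment-owned random source
    position = rng.uniform(-0.5, 0.5)    # reset: sample the initial state
    trace = [position]
    for action in actions:
        position += action + rng.gauss(0.0, 0.1)   # step: noisy dynamics
        trace.append(position)
    return trace

actions = [1, 1, -1, 1]
same = run_episode(7, actions) == run_episode(7, actions)
different = run_episode(7, actions) != run_episode(8, actions)
print(same, different)
```

If two runs with the same seed and actions diverge, some source of randomness sits outside the environment contract.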

Worked Example

Code Fragment 2.1.1 builds a tiny environment contract without any reinforcement learning library. The example is deliberately small so that every returned field is visible.

# Section 2.1: runnable checkpoint for environment dynamics and transition functions.
# Keep the output small so the evidence record can be inspected directly.
from dataclasses import dataclass

VALID_ACTIONS = ("left", "right")

@dataclass
class Transition:
    observation: dict
    action: str
    reward: float
    terminated: bool
    truncated: bool
    info: dict

def step(position, action, time_step):
    # Validate at the boundary instead of silently treating unknown actions as "left".
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    delta = 1 if action == "right" else -1
    next_position = position + delta
    reached_goal = next_position >= 3   # task success: a natural end
    timed_out = time_step >= 4          # external time limit
    reward = 1.0 if reached_goal else -0.01
    return Transition(
        observation={"position": next_position},
        action=action,
        reward=reward,
        terminated=reached_goal,
        truncated=timed_out and not reached_goal,  # never both flags for one ending
        info={"latency_ms": 12, "action_clipped": False},  # placeholder diagnostics
    )

print(step(position=2, action="right", time_step=3))

Code Fragment 2.1.1 defines an explicit transition record with observation, action, reward, termination, truncation, and diagnostic info.

Library Shortcut

Once an environment class exists, the hand-built Code Fragment 2.1.1 reduces to roughly six lines of user-facing interaction with Gymnasium: make the environment, call reset, sample or choose actions, call step, and inspect terminated, truncated, and info. Gymnasium handles spaces, wrappers, seeding, reset semantics, and time-limit conventions internally. The hand-built version remains useful because it exposes exactly what the library standardizes.

Practical Recipe

  1. Name the agent process and the environment process separately.
  2. Specify reset, observation, action, reward, termination, truncation, and info fields before training.
  3. Track latency for observation capture, policy inference, transport, and action execution.
  4. Log the full transition tuple before aggregating success rate or return.
  5. Run a no-op or safe-action baseline to verify that the environment boundary behaves as expected.
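
Step 3 of the recipe can be sketched with a small timing helper, assuming each stage is a plain function call. The helper `timed`, the four stage functions, and the info key names are hypothetical placeholders; a real system would call sensors, a model, and actuators instead.

```python
import time

def timed(stage: str, timings: dict, fn, *args):
    """Run fn(*args) and record its wall-clock duration under '<stage>_ms'."""
    start = time.perf_counter()
    result = fn(*args)
    timings[f"{stage}_ms"] = (time.perf_counter() - start) * 1000.0
    return result

# Placeholder stages standing in for real hardware and model calls.
def capture_observation():
    return {"position": 2}

def infer_action(observation):
    return "right" if observation["position"] < 3 else "left"

def transport(action):
    return action  # stands in for serialization and network delivery

def execute(action):
    return {"executed": action}

def timed_step() -> dict:
    timings: dict = {}
    observation = timed("capture", timings, capture_observation)
    action = timed("inference", timings, infer_action, observation)
    command = timed("transport", timings, transport, action)
    result = timed("execution", timings, execute, command)
    # Latency travels with the transition in info rather than in a separate log.
    return {"action": action, "result": result, "info": timings}

print(timed_step())
```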

Failure Mode

A benchmark score is weak evidence when reset semantics, time limits, action clipping, or termination causes are hidden. A policy can look successful because the wrapper ended difficult episodes early or because the evaluator compared runs from different environment versions.
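
A small sketch shows how hidden truncation can inflate a score. The episode records below are invented for illustration and are not measurements; the function names are hypothetical.

```python
# Hypothetical episode summaries: two successes, one task failure, two time-outs.
episodes = [
    {"terminated": True,  "truncated": False, "success": True},
    {"terminated": False, "truncated": True,  "success": False},
    {"terminated": True,  "truncated": False, "success": True},
    {"terminated": False, "truncated": True,  "success": False},
    {"terminated": True,  "truncated": False, "success": False},
]

def success_rate_all(rows: list[dict]) -> float:
    """Success rate over every episode that was started."""
    return sum(row["success"] for row in rows) / len(rows)

def success_rate_terminated_only(rows: list[dict]) -> float:
    """Success rate after silently dropping truncated episodes."""
    kept = [row for row in rows if row["terminated"]]
    return sum(row["success"] for row in kept) / len(kept)

print(success_rate_all(episodes))
print(success_rate_terminated_only(episodes))
```

The second number is higher only because the hard episodes that timed out were removed from the denominator.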

Practical Example

A warehouse robotics team used a custom simulator and a real cart robot. By writing the interface contract first, they discovered that the simulator returned object pose as privileged ground truth while the real robot returned delayed camera detections. The fix was not a larger model. The fix was to expose state-estimate confidence and sensor delay in the transition info record.
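
The fix described above might look like the following info schema. This is a sketch only: the field names, the `PoseInfo` class, and the example values are assumptions, not the team's actual record.

```python
from dataclasses import dataclass, asdict

@dataclass
class PoseInfo:
    """Hypothetical info fields that make pose provenance explicit."""
    pose_source: str        # e.g. "ground_truth" or "camera_detection"
    pose_confidence: float  # state-estimate confidence in [0, 1]
    sensor_delay_ms: float  # age of the measurement when the agent receives it

def contract_gaps(sim: PoseInfo, real: PoseInfo) -> list[str]:
    """List the info fields on which simulator and robot disagree."""
    sim_fields, real_fields = asdict(sim), asdict(real)
    return [key for key in sim_fields if sim_fields[key] != real_fields[key]]

# Illustrative values only.
sim_info = PoseInfo(pose_source="ground_truth", pose_confidence=1.0, sensor_delay_ms=0.0)
real_info = PoseInfo(pose_source="camera_detection", pose_confidence=0.72, sensor_delay_ms=85.0)

print(contract_gaps(sim_info, real_info))
```

Once both sides emit the same fields, the sim-to-real mismatch shows up as a difference in logged values instead of as an unexplained drop in performance.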

Memorable Shortcut

If the environment wrapper cannot explain why an episode ended, it is not a benchmark yet. It is a suspense story with a CSV file.

Research Frontier

Modern robot learning systems increasingly train from large replay buffers, simulated rollouts, and real-world logs through shared environment-like interfaces. The frontier is making those interfaces rich enough for robot data, multi-camera observations, action chunks, safety monitors, and replayable failure analysis.

Mini Lab

Wrap Code Fragment 2.1.1 in a loop of five actions. Record every transition as JSON, then compute success rate twice: once counting truncation as failure and once reporting truncation separately.

Self Check

Can you name which process owns the observation, which process owns the action choice, which process declares episode ending, and which field records timing evidence?

The formal interface is useful only if it prevents silent changes in the experiment. A policy result should say which environment version produced it, how reset sampled initial state, which action space was accepted, whether the command was clipped, why the episode ended, and whether the ending was a task termination or an external truncation.

The most common mistake is to treat the environment as background code. In embodied AI the environment is part of the scientific claim. If a wrapper changes observations, action limits, time limits, rewards, or diagnostic info, it changes the meaning of the result.
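
A wrapper that changes the command should say so in the record. The sketch below assumes a scalar action with fixed limits; `clip_action`, the info keys, and the version string are hypothetical.

```python
def clip_action(requested: float, low: float, high: float) -> tuple[float, dict]:
    """Clip a scalar command to [low, high] and report what was changed."""
    executed = max(low, min(high, requested))
    info = {
        "requested_action": requested,
        "executed_action": executed,
        "action_clipped": executed != requested,
        "wrapper_version": "clip-v1",  # assumption: the wrapper is versioned by hand
    }
    return executed, info

executed, info = clip_action(1.7, low=-1.0, high=1.0)
print(executed, info)
```

Logging both the requested and the executed action lets a later reader tell whether the policy or the wrapper produced the behavior.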

| Tool or Library | Role in This Topic | Builder Advice |
| --- | --- | --- |
| Gymnasium | Standardizes reset, step, spaces, termination, truncation, wrappers, and seeding | Use it to make single-agent environment contracts inspectable before training. |
| PettingZoo | Separates sequential and parallel multi-agent interaction patterns | Use it when other agents, people, vehicles, or robots change the transition dynamics. |
| ROS 2 | Carries observations, commands, clocks, transforms, and diagnostics across real robot processes | Use it to connect the formal environment contract to real-time deployed components. |

Before training, run an interface audit on one transition. The audit should fail if a required field is missing, if termination and truncation are confused, or if action clipping is hidden.

  1. Declare the reset output and step output as named fields.
  2. Check that observation and action types match the declared spaces.
  3. Log termination and truncation as different fields.
  4. Put latency, clipping, safety gate status, and wrapper version in info.
  5. Save the first transition from every experiment run as a smoke-test artifact.

# Audit one transition record before trusting an environment result.
transition = {
    "observation": {"position": 3},
    "action": "right",
    "reward": 1.0,
    "terminated": True,
    "truncated": False,
    "info": {"latency_ms": 12, "action_clipped": False, "wrapper_version": "v2"},
}

def audit_transition(row: dict[str, object]) -> list[str]:
    """Return a list of contract problems; an empty list means the record passes."""
    required = {"observation", "action", "reward", "terminated", "truncated", "info"}
    problems = [f"missing {key}" for key in sorted(required - row.keys())]
    # One ending should have one cause.
    if row.get("terminated") and row.get("truncated"):
        problems.append("terminated and truncated cannot both explain the same ending")
    info = row.get("info", {})
    if not isinstance(info, dict):
        problems.append("info must be a dict")
        info = {}
    # Diagnostics that must never be hidden.
    for key in ["latency_ms", "action_clipped", "wrapper_version"]:
        if key not in info:
            problems.append(f"info missing {key}")
    return problems

print(audit_transition(transition))

Code Fragment 2.1.2 audits a transition record for required fields, episode-ending semantics, and diagnostic info.

When an agent-environment experiment fails, first inspect the transition boundary. Check reset distribution, action clipping, wrapper order, time-limit handling, reward emission, and diagnostic info before changing the policy.

Key Takeaway

A formal agent-environment interface is the smallest unit of closed-loop evidence. If the transition record is clear, later learning, simulation, logging, and deployment decisions have a stable foundation.

Exercise 2.1.1

Write a transition schema for a door-opening robot. Include one field that belongs to the evaluator but is not visible to the agent.

What's Next?

Section 2.2 distinguishes hidden state from the observations the agent actually receives.

Bibliography & Further Reading

Foundational References For This Section

Bellman, R. "A Markovian Decision Process." (1957). https://doi.org/10.1515/9781400835386-007

The mathematical origin of the state, action, transition, and reward framing.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. "Planning and acting in partially observable stochastic domains." (1998). https://www.sciencedirect.com/science/article/pii/S000437029800023X

A foundational POMDP reference for belief-state reasoning under partial observability.

Farama Foundation. "Gymnasium Documentation." (2024). https://gymnasium.farama.org/

The maintained reference for reset, step, spaces, termination, truncation, wrappers, and reproducible environments.