Section 1.3: Agents, environments, observations, actions, rewards, constraints | Building Embodied AI: From Perception to Autonomous Action

"Interfaces matter because the agent can only act on what the experiment lets it see and do."
A Careful Control Loop

Figure 1.3A. The task contract made physical: what the agent senses (observation), what it commands (action), what the world does in response (transition), what it is graded on (reward), and the boundary it must not cross (constraint).

Big Picture

Six words (agent, environment, observation, action, reward, constraint) are not mere vocabulary; they form a contract that pins down exactly what an embodied learning problem is before any algorithm touches it. The contract is a tuple. Write the tuple down and the hard questions become explicit: is the state observed or only partially sensed, is the action a discrete menu or a torque vector, is the objective a single reward or a reward subject to a cost budget? Get the tuple wrong and every downstream result measures a different problem from the one you meant. This section gives the formal contract at full precision: the MDP, its partially observed extension that almost every embodied task actually inhabits, and the constrained variant that captures the safety and resource limits real robots cannot ignore.

Figure 1.3. The contract as a closed loop: the agent receives an observation (a lossy view of state), chooses an action, and inherits the consequence as the next observation. Reward grades preference along the loop; cost is accounted separately against a budget.

The task contract is a tuple

The base contract for a fully observed embodied task is a Markov decision process, the tuple $(\mathcal{S},\mathcal{A},P,r,\gamma,\rho_0)$. Here $\mathcal{S}$ is the state space (every configuration the world can be in: joint angles, object poses, contact state), $\mathcal{A}$ is the action space, $P(s' \mid s,a)$ is the transition kernel giving the probability of landing in state $s'$ after taking action $a$ in state $s$, $r(s,a)$ is the scalar reward, $\gamma \in [0,1)$ is the discount factor, and $\rho_0$ is the initial-state distribution. A policy $\pi(a \mid s)$ closes the loop. The Markov property is the load-bearing assumption: $s_t$ summarizes all history relevant to the future, so $P$ and $r$ may condition on $s_t$ alone.
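To make the tuple concrete, the sketch below stores a two-state, two-action MDP as plain Python data and samples one trajectory from it. The names `TabularMDP` and `rollout`, and every number in the example, are illustrative assumptions rather than part of any library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TabularMDP:
    """The tuple (S, A, P, r, gamma, rho0) for finite S and A (hypothetical container)."""
    n_states: int
    n_actions: int
    P: list       # P[s][a] is a list of probabilities over next states s'
    r: list       # r[s][a] is the scalar reward
    gamma: float  # discount factor in [0, 1)
    rho0: list    # initial-state distribution

    def validate(self):
        assert 0.0 <= self.gamma < 1.0
        assert abs(sum(self.rho0) - 1.0) < 1e-9
        for s in range(self.n_states):
            for a in range(self.n_actions):
                # Each P(. | s, a) must be a probability distribution.
                assert abs(sum(self.P[s][a]) - 1.0) < 1e-9


def rollout(mdp, policy, horizon, rng):
    """Sample (s_t, a_t, r_t) for `horizon` steps; `policy` maps a state to an action."""
    states = range(mdp.n_states)
    s = rng.choices(states, weights=mdp.rho0)[0]
    trajectory = []
    for _ in range(horizon):
        a = policy(s)
        trajectory.append((s, a, mdp.r[s][a]))
        s = rng.choices(states, weights=mdp.P[s][a])[0]
    return trajectory


# Illustrative numbers only: state 1 is an absorbing goal; action 1 reaches it
# from state 0 with probability 0.8, action 0 stays put.
mdp = TabularMDP(
    n_states=2,
    n_actions=2,
    P=[[[1.0, 0.0], [0.2, 0.8]],
       [[0.0, 1.0], [0.0, 1.0]]],
    r=[[0.0, 0.0], [1.0, 1.0]],
    gamma=0.95,
    rho0=[1.0, 0.0],
)
mdp.validate()
traj = rollout(mdp, policy=lambda s: 1, horizon=5, rng=random.Random(0))
```

Because the policy here reads `s` directly, the sketch is only valid under the full-observability assumption of the MDP.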

The objective is the discounted return, and the policy is graded by its expected value:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}, \qquad J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t \sim \pi,\, s_{t+1}\sim P}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t,a_t)\right].$$

Two features of this objective are routinely overlooked, and overlooking them is costly. First, reward $r_t$ is a per-step quantity, whereas return $G_t$ is the discounted sum the policy actually optimizes; a reward that looks correct step by step can induce a return that favors stalling, oscillating, or whatever behavior accumulates the most discounted reward. Second, the expectation is taken over the policy-induced trajectory distribution, so changing $\pi$ changes the states on which $r$ is sampled; this is the coupling developed in Section 1.1.
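The gap between reward and return is easy to check numerically. The sketch below computes $G_0$ for two hand-written reward sequences; the 0.1 per-step bonus, the 1.0 goal reward, and $\gamma = 0.99$ are assumed values chosen only to make the effect visible.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * r_k, accumulated backward (Horner's rule)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.99
# Hypothetical reward design: +0.1 for every step the episode is still running,
# and 1.0 on the step that reaches the goal, which ends the episode.
finish = discounted_return([0.1, 0.1, 0.1, 0.1, 1.0], gamma)
stall = discounted_return([0.1] * 1000, gamma)

print(f"finish in 5 steps: {finish:.3f}")
print(f"stall for 1000 steps: {stall:.3f}")
```

Each individual reward looks sensible, yet stalling earns the larger return: its value approaches $0.1/(1-\gamma) = 10$, while finishing promptly earns about 1.35. The per-step signal is fine; the return it induces is not.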

Why embodied tasks are partially observed

The MDP assumes the agent sees $s_t$. A robot never does. It sees pixels, joint encoders, force readings, and a language instruction, none of which recover full object pose, mass, friction, or another agent's intent. The honest contract is therefore a partially observable MDP (POMDP), the tuple $(\mathcal{S},\mathcal{A},\mathcal{O},P,Z,r,\gamma)$ (with the initial-state distribution $\rho_0$ carried over from the MDP and left implicit), which adds an observation space $\mathcal{O}$ and an observation function

$$Z(o \mid s',a) = \Pr(o_{t+1}=o \mid s_{t+1}=s',\, a_t=a),$$

the probability of receiving observation $o$ after action $a$ lands the world in state $s'$. Because $Z$ is generally noisy and many states can produce the same observation, $o_t$ is a lossy projection of the state, not the state. An optimal POMDP policy therefore generally cannot be a function of $o_t$ alone; it must act on a sufficient statistic of history, the belief state $b_t(s) = \Pr(s_t = s \mid o_{0:t}, a_{0:t-1})$, which is itself updated by Bayes' rule through $Z$ and $P$. The belief MDP over $b_t$ is the formally correct object, and it is why memory, filtering, and recurrent or transformer policies appear the moment a task is genuinely partially observed.
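For finite state and observation spaces the Bayes update of the belief can be written in a few lines. The sketch below is a minimal discrete filter under assumed table layouts (`P[s][a][s2]` and `Z[s2][a][o]`); the two-state example and its 0.8 sensor accuracy are illustrative, not taken from any particular robot.

```python
def belief_update(b, a, o, P, Z):
    """One Bayes-filter step: b'(s') is proportional to Z(o|s',a) * sum_s P(s'|s,a) * b(s).

    Assumed layout: P[s][a][s2] = P(s2 | s, a) and Z[s2][a][o] = Z(o | s2, a).
    """
    n = len(b)
    # Prediction: push the current belief through the transition kernel.
    predicted = [sum(P[s][a][s2] * b[s] for s in range(n)) for s2 in range(n)]
    # Correction: weight each predicted state by the observation likelihood.
    unnorm = [Z[s2][a][o] * predicted[s2] for s2 in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnorm]


# Two hidden states (0 = left of an obstacle, 1 = right), one action (0 = stay),
# and a side sensor that reports the correct side with probability 0.8.
P = [[[1.0, 0.0]], [[0.0, 1.0]]]
Z = [[[0.8, 0.2]], [[0.2, 0.8]]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, a=0, o=0, P=P, Z=Z)   # first "left" reading
b2 = belief_update(b1, a=0, o=0, P=P, Z=Z)   # second "left" reading
print(b1, b2)
```

A single reading moves the belief in "left" from 0.5 to 0.8, and a second agreeing reading moves it to 16/17. No individual observation carries that certainty, which is why the belief, not $o_t$, is the quantity a policy should condition on.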

Observation is not state

The single most common modeling error in embodied AI is writing a policy $\pi(a \mid o_t)$ and reasoning about it as if $o_t = s_t$. Under partial observation, two distinct states can emit the same observation, so any reactive policy on $o_t$ is provably suboptimal for tasks that require disambiguating them. The fix is not a better network on $o_t$ alone; it is giving the policy access to history, $\pi(a \mid o_{0:t}, a_{0:t-1})$, so it can carry the belief the single observation cannot.
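The suboptimality of reactive policies can be verified by enumeration on a tiny aliased task. In the hypothetical two-step problem below, the first observation reveals a hidden side and the second observation (where the action is taken) is identical for both sides; the task, the observation names, and the helper functions are all assumptions made for illustration.

```python
from itertools import product

CUE_OBS = {"L": "cue_L", "R": "cue_R"}   # t=0: the observation reveals the hidden side
JUNCTION = "junction"                    # t=1: the same observation for both hidden states
ACTIONS = ("L", "R")


def expected_return(policy):
    """Exact expected reward; `policy(history)` receives the observations seen so far."""
    total = 0.0
    for side in ("L", "R"):              # hidden state, uniform prior
        history = [CUE_OBS[side], JUNCTION]
        action = policy(history)         # the only action is taken at the junction
        total += 0.5 * (1.0 if action == side else 0.0)
    return total


# Enumerate every deterministic reactive policy: one fixed action per current observation.
observations = ("cue_L", "cue_R", JUNCTION)
reactive_returns = []
for choice in product(ACTIONS, repeat=len(observations)):
    table = dict(zip(observations, choice))
    reactive_returns.append(expected_return(lambda h, t=table: t[h[-1]]))


def memory_policy(history):
    """History-dependent policy: act on the cue seen at t=0."""
    return "L" if history[0] == "cue_L" else "R"


best_reactive = max(reactive_returns)
with_memory = expected_return(memory_policy)
print(best_reactive, with_memory)
```

Every reactive policy must commit to one action at the junction, so none exceeds an expected return of 0.5 (randomizing at the junction does not help either), while the policy that remembers the cue earns 1.0. A larger network on the junction observation alone cannot close that gap.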

Action spaces are a design decision, not a given

The set $\mathcal{A}$ is chosen, and the choice determines what can be learned. Common forms: a discrete menu ($\mathcal{A}=\{1,\dots,n\}$, e.g. a fixed set of grasp primitives); a continuous motor space ($\mathcal{A}\subseteq\mathbb{R}^d$, e.g. joint torques or end-effector velocities, almost always box-bounded by actuator limits); and a hierarchical space, where a high-level policy emits a subgoal or skill index and a low-level policy emits the motor command that realizes it, factoring the overall policy as $\pi = \pi_{\text{lo}} \circ \pi_{\text{hi}}$ (the high level acts first and the low level refines its output). The same physical robot can be posed as a discrete-action problem (pick which skill) or a continuous-control problem (emit the torque), and these are not the same learning problem: they differ in exploration cost, sample complexity, and which failures are even representable.
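The three action-space forms can be seen side by side in a small hand-written controller. The sketch below is a hypothetical hierarchical policy for a 1D reaching task: the skill names, the velocity bound `V_MAX`, and the 0.02 tolerance are assumptions, and the policy reads the true position `x` only to keep the example short.

```python
SKILLS = ("approach", "retreat", "hold")   # discrete menu for the high level
V_MAX = 0.1                                # box bound on the continuous command


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def pi_hi(x, goal):
    """High-level policy: choose a skill index from the discrete menu."""
    if abs(goal - x) < 0.02:
        return SKILLS.index("hold")
    return SKILLS.index("approach")


def pi_lo(skill, x, goal):
    """Low-level policy: realize the chosen skill as a bounded velocity command."""
    name = SKILLS[skill]
    if name == "approach":
        v = goal - x              # step toward the goal
    elif name == "retreat":
        v = -(goal - x)           # step away from the goal
    else:
        v = 0.0                   # hold position
    return clip(v, -V_MAX, V_MAX)


def pi(x, goal):
    """Composite policy: the high level picks the skill, the low level emits the command."""
    return pi_lo(pi_hi(x, goal), x, goal)


print(pi(0.1, 0.9))     # far from the goal: saturates at the actuator limit
print(pi(0.895, 0.9))   # within tolerance: the "hold" skill emits zero velocity
```

A learner placed at the `pi_hi` interface explores three options per step; a learner placed at the `pi_lo` output explores a continuous interval. The robot is the same, but the learning problem is not.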

Embodied tasks are almost always constrained

An unconstrained objective lets the optimizer spend anything to gain reward, including force, energy, and collision risk a real platform cannot afford. The contract that captures these limits is the constrained MDP (CMDP) of Altman: an MDP augmented with one or more cost functions $c_i(s,a)\ge 0$ and budgets $d_i$. The agent maximizes return subject to a bound on expected discounted cost,

$$\max_{\pi}\; J(\pi) \quad \text{subject to} \quad J_{c_i}(\pi) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} c_i(s_t,a_t)\right] \le d_i \quad \text{for each } i.$$

The distinction from a reward penalty is structural, not cosmetic. Folding cost into reward as $r - \lambda c$ commits to a single exchange rate $\lambda$ chosen before training; the optimizer is then free to buy any amount of safety violation whenever its price $\lambda c$ is lower than the reward gained. The CMDP keeps $c$ as a separate accounting line with its own budget $d$, so "keep expected discounted collision cost at or below $d$" is a commitment the optimizer must respect rather than a price it may pay. This is why later chapters reach for Lagrangian methods (which adapt $\lambda$ to enforce the budget) and constrained policy optimization rather than a hand-tuned penalty.
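The difference between a fixed penalty and an adapted multiplier shows up even in a one-parameter toy problem. In the sketch below a policy is summarized by a single riskiness level $p \in [0,1]$ with assumed return $J(p) = 2p - p^2$, assumed cost $J_c(p) = p$, and budget $d = 0.3$; these functions and the step size are invented for illustration, and the closed-form `best_response` stands in for the policy-optimization step a real algorithm would run.

```python
def J(p):
    """Toy return: concave and increasing in riskiness p (assumed for illustration)."""
    return 2.0 * p - p * p


def J_c(p):
    """Toy expected cost (assumed for illustration)."""
    return p


def best_response(lam):
    """argmax over p in [0, 1] of J(p) - lam * J_c(p); closed form for this toy problem."""
    return min(1.0, max(0.0, 1.0 - lam / 2.0))


d = 0.3  # cost budget

# Fixed penalties chosen before training: the resulting cost lands wherever lam puts it.
fixed = {lam: J_c(best_response(lam)) for lam in (0.5, 1.0, 2.0)}

# Dual ascent: raise lam while the budget is violated, lower it (down to 0) when there is slack.
lam, eta = 0.0, 0.5
for _ in range(200):
    p = best_response(lam)
    lam = max(0.0, lam + eta * (J_c(p) - d))
p = best_response(lam)

print("cost under fixed penalties:", fixed)
print(f"dual ascent: lam={lam:.3f}  cost={J_c(p):.3f}  return={J(p):.3f}")
```

The fixed penalties yield costs of 0.75, 0.5, and 0.0, none of which equals the budget, while dual ascent settles at $\lambda = 1.4$ and cost exactly 0.3. The multiplier that enforces the budget exists, but it depends on $d$ and on the problem, so it has to be found during optimization rather than guessed beforehand.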

Reward names preference; cost names a boundary

A reward says "more of this is better." A cost with a budget says "you may not cross this line, regardless of how much reward lies beyond it." Merging them into one scalar too early throws away exactly the structure that lets you certify a policy as safe, because once the line is a price, a sufficiently large reward will always pay it.

The contract as runnable code

The environment below instantiates the full contract for a deliberately small embodied task: a point effector on a line must reach a goal, and a wall it should not push through sits between the start and that goal. State is the true effector position; the agent never sees it directly. It receives a noisy position reading, which makes the task a POMDP. The action is a bounded continuous velocity command. Reward is shaped progress toward the goal. Pushing through the wall is possible but is logged as a separate collision cost, not subtracted from reward, which makes the task a CMDP. The episode terminates on reaching the goal and truncates at a step limit.

# A minimal embodied task contract as a Gymnasium-style environment:
# 1D reach-to-goal under partial observation (noisy position) with a separate collision cost.
# obs != state (POMDP); reward and cost are distinct fields (CMDP).
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class Reach1D(gym.Env):
    metadata = {"render_modes": []}

    def __init__(self, obs_noise=0.02, wall_x=0.5, max_steps=100):
        super().__init__()
        self.obs_noise = obs_noise      # std of the position sensor; obs is a lossy view of state
        self.wall_x = wall_x            # wall the effector should not push through (crossing costs 1)
        self.max_steps = max_steps
        self.goal = 0.9
        self.x = 0.1                    # true state; re-initialized by reset()
        self.steps = 0
        # Continuous, box-bounded action: a velocity command clipped to actuator limits.
        self.action_space = spaces.Box(low=-0.1, high=0.1, shape=(1,), dtype=np.float32)
        # Observation is the noisy reading plus the goal; it is NOT the true state x.
        self.observation_space = spaces.Box(
            low=np.array([0.0, 0.0], dtype=np.float32),
            high=np.array([1.0, 1.0], dtype=np.float32),
            dtype=np.float32,
        )

    def _obs(self):
        noisy = self.x + self.np_random.normal(0.0, self.obs_noise)
        return np.array([np.clip(noisy, 0.0, 1.0), self.goal], dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)        # seeds self.np_random
        self.x = 0.1                    # true state, hidden from the agent
        self.steps = 0
        return self._obs(), {"true_x": self.x}   # true_x is exposed for debugging only

    def step(self, action):
        a = float(np.clip(action, self.action_space.low, self.action_space.high)[0])
        prev_x = self.x
        proposed = float(np.clip(self.x + a, 0.0, 1.0))
        # Crossing the wall is allowed but logged as a collision. Comparing sides (rather than
        # the sign of a product) also counts a move that lands exactly on the wall.
        crossed = (prev_x < self.wall_x) != (proposed < self.wall_x)
        cost = 1.0 if crossed else 0.0
        self.x = proposed

        dist = abs(self.goal - self.x)
        reward = (abs(self.goal - prev_x) - dist)   # shaped progress; cost is NOT folded in
        terminated = dist < 0.02
        if terminated:
            reward += 1.0
        self.steps += 1
        truncated = self.steps >= self.max_steps
        info = {"cost": cost, "true_x": self.x}      # cost kept on its own accounting line
        return self._obs(), float(reward), terminated, truncated, info


if __name__ == "__main__":
    env = Reach1D()
    obs, info = env.reset(seed=0)
    total_reward, total_cost = 0.0, 0.0
    for _ in range(env.max_steps):
        action = env.action_space.high           # naive: always push toward the goal
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        total_cost += info["cost"]
        if terminated or truncated:
            break
    print(f"return={total_reward:.3f}  collision_cost={total_cost:.0f}  reached={terminated}")

Code 1.3.1. The contract in executable form. The agent's observation_space exposes a noisy reading, never the true state x (POMDP); the action_space is a box-bounded velocity (continuous control); and collision is returned in info["cost"] as a separate line, never subtracted from reward (CMDP). A naive always-forward policy reaches the goal but logs a nonzero collision cost, exactly the trade a reward-only formulation would hide.

Library shortcut: Gymnasium and PettingZoo

Gymnasium standardizes precisely the tuple above: typed observation_space and action_space (so $\mathcal{O}$ and $\mathcal{A}$ are declared, not implied), the reset()/step() protocol, and the five-value return that separates terminated (the task ended) from truncated (a time limit cut it off), with everything else, including costs, carried in info. PettingZoo extends the same protocol to multiple agents, where each agent has its own observation and action space and, from any one agent's point of view, the others are part of the environment dynamics. For the constrained case, Safety-Gymnasium adds a first-class cost channel, returning cost from step() alongside reward, so the quantity behind $J_c(\pi)\le d$ is exposed rather than buried in info; enforcing the budget remains the learning algorithm's job. These libraries do not choose your abstraction; they make the abstraction you chose explicit and inspectable.
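Once cost travels in info, estimating both sides of the constrained objective takes one loop. The sketch below assumes a Gymnasium-style environment whose step() returns the five values described above and stores the cost under info["cost"]; `discounted_return_and_cost` and `StubEnv` are hypothetical names, and the stub exists only so the example runs without any library installed.

```python
def discounted_return_and_cost(env, policy, gamma, seed=None):
    """Roll out one episode and return (sum gamma^t r_t, sum gamma^t c_t).

    Assumes reset() -> (obs, info) and step(a) -> (obs, reward, terminated,
    truncated, info), with the per-step cost stored in info["cost"].
    """
    obs, info = env.reset(seed=seed)
    g_r, g_c, discount = 0.0, 0.0, 1.0
    done = False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        g_r += discount * reward
        g_c += discount * info.get("cost", 0.0)   # cost accumulated on its own line
        discount *= gamma
        done = terminated or truncated
    return g_r, g_c


class StubEnv:
    """Hypothetical stand-in: three steps, reward 1 each, a cost of 1 on the second step."""

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        cost = 1.0 if self.t == 2 else 0.0
        terminated = self.t == 3
        return self.t, 1.0, terminated, False, {"cost": cost}


g_r, g_c = discounted_return_and_cost(StubEnv(), policy=lambda obs: 0, gamma=0.5)
print(g_r, g_c)
```

Averaging the two numbers over many episodes gives Monte Carlo estimates of $J(\pi)$ and $J_c(\pi)$; comparing the second against the budget $d$ is the check a reward-only evaluation cannot perform.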

Three pitfalls that look like algorithm bugs

Conflating observation with state. Treating $o_t$ as $s_t$ silently assumes the task is an MDP. If two states share an observation but call for different actions, no policy on $o_t$ alone can be optimal, and the symptom (a policy that plateaus and thrashes near ambiguous states) looks like a training failure when it is a modeling failure.

Reward leakage. If the reward can be computed from a quantity that merely correlates with success, the policy will learn to drive the correlate rather than complete the task.

Folding constraints into reward. Writing $r - \lambda c$ converts a boundary into a price; with enough downstream reward the optimizer learns to pay it, and you have built a reward-hacking incentive in the name of safety. Keep $c$ on its own budgeted line.

Research frontier: specifying reward and learning constraints

The contract assumes $r$ and $c$ are given. In practice neither is. Reward specification is a known failure surface: hand-written rewards are routinely game-able, which motivates reward learning from preferences, demonstrations, and language. Constraints are harder still, because the costs that matter (do not crush the object, do not startle the human, do not exceed thermal limits) are often unstated and must be inferred from demonstrations or interaction. Inverse constrained reinforcement learning and safe exploration, which must respect a budget during training and not only at convergence, are active and unsettled. The open question is how to acquire $r$ and the $c_i$ themselves with the same rigor the CMDP applies once they exist.

Key Takeaway

An embodied task is defined by a tuple before it is touched by an algorithm. Real tasks are partially observed, so the honest contract is the POMDP $(\mathcal{S},\mathcal{A},\mathcal{O},P,Z,r,\gamma)$ in which observation is a lossy view of state, and they are resource- and safety-limited, so the honest objective is the CMDP's constrained one, $\max_\pi J(\pi)$ s.t. $J_{c_i}(\pi)\le d_i$. Writing the tuple down, keeping observation distinct from state and cost distinct from reward, is the cheapest reliability investment in the entire pipeline.

Exercise 1.3.1

In Reach1D the wall makes the task partially observed in a second way: the agent's noisy reading cannot tell it which side of the wall it is on near $x=0.5$. Add a second action dimension that toggles a "probe" (zero motion, halved observation noise for that step) and modify the observation to include the last probe result. Does a policy that can probe achieve lower collision cost at equal return than one that cannot? Explain the result in terms of belief states.

Exercise 1.3.2

Reformulate Reach1D two ways: once as a reward-penalty MDP with objective $r - \lambda\,c$, and once as the CMDP it already is with budget $d$ on expected collision cost. For the penalty version, find a value of $\lambda$ for which the optimal policy still collides because the goal bonus outweighs the penalty. Then argue why no single $\lambda$ enforces "at most $d$ collisions in expectation" across all initial conditions, and connect this to why Lagrangian methods adapt $\lambda$ during training.

What's Next?

Section 1.4 separates physical embodiment from simulated embodiment and explains why both matter.

Section References

Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." 2nd ed. (2018). http://incompleteideas.net/book/the-book-2nd.html

The standard reference for the MDP tuple, returns, policies, value functions, and the trajectory-level objective used throughout this section.

Puterman, M. L. "Markov Decision Processes: Discrete Stochastic Dynamic Programming." Wiley (1994).

The rigorous treatment of MDPs: transition kernels, the Markov property, discounting, and existence of optimal policies. The formal backbone for the $(\mathcal{S},\mathcal{A},P,r,\gamma,\rho_0)$ contract.

Altman, E. "Constrained Markov Decision Processes." Chapman & Hall/CRC (1999). https://www-sop.inria.fr/members/Eitan.Altman/PAPERS/h.pdf

The definitive reference for the CMDP: cost functions, budget constraints $J_c(\pi)\le d$, and the Lagrangian and linear-program formulations that enforce them. Source for the constraint-versus-penalty distinction.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. "Planning and Acting in Partially Observable Stochastic Domains." Artificial Intelligence 101 (1998): 99-134. https://www.sciencedirect.com/science/article/pii/S000437029800023X

The canonical POMDP reference: the observation function $Z$, belief states, and the belief-MDP reduction that justifies history-dependent policies under partial observation.