Section 3.1: The canonical stack: sense, perceive, estimate, predict, plan, control, act | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 3.1A: The canonical sense-perceive-estimate-predict-plan-control-act stack shown as a data-flow pipeline, with each stage labeled by its input type, output type, and typical latency budget.

Big Picture

The canonical stack (sense, perceive, estimate, predict, plan, control, act) is one lens on embodied system architectures. We study it because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments.

Figure 3.1. The canonical stack is easiest to reason about as a closed loop of evidence, decision, and consequence: it turns raw sensing into controlled action, and the result of that action comes back as the next observation.

This section turns the technical contract for the canonical stack into a usable mental model. First we define the object of study, then we connect it to the agent loop, and finally we test it with a compact implementation.

The key question for the canonical stack is practical: what must the agent know, what can it observe, which actions are available, and what evidence shows that an action worked under the stated conditions?

Action Is The Test

A representation earns its place only when it measurably changes the action the agent takes. At every stage of the canonical stack, keep asking which decision becomes easier, safer, or more reliable.

Theory

For the canonical stack, the practical design rule is to make every interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.

The canonical stack is useful because it gives each transformation a job that can be tested independently without forgetting the closed loop. A compact version is:

$$o_t \xrightarrow{\text{sense}} x_t \xrightarrow{\text{perceive}} y_t \xrightarrow{\text{estimate}} \hat{s}_t \xrightarrow{\text{predict}} \hat{s}_{t+1:t+H} \xrightarrow{\text{plan}} \tau_t \xrightarrow{\text{control}} a_t \xrightarrow{\text{act}} o_{t+1}.$$

Here $o_t$ is the raw observation, $x_t$ is calibrated sensor data, $y_t$ is a semantic or geometric percept, $\hat{s}_t$ is the agent's best state estimate, $\hat{s}_{t+1:t+H}$ is the predicted future over horizon $H$, $\tau_t$ is the chosen trajectory or skill, and $a_t$ is the executable command. The important assumption is that each arrow preserves enough information for the next stage while reducing ambiguity for the decision. The stack breaks down when one stage silently changes units, coordinate frames, latency, uncertainty, or success criteria.
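One way to make such silent changes visible is to attach the unit, coordinate frame, and timestamp to every value that crosses a stage boundary, and to check them on receipt. The sketch below is a minimal illustration, not a standard message format: the `Handoff` class, the `check_handoff` helper, the frame names, and the 50 ms age limit are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    """One value crossing a stage boundary, plus the metadata that can drift."""
    stage: str      # producing stage, e.g. "estimate"
    value: float
    unit: str       # e.g. "m"
    frame: str      # coordinate frame name (hypothetical labels)
    stamp_s: float  # time the underlying observation was taken, in seconds


def check_handoff(msg, expect_unit, expect_frame, now_s, max_age_s):
    """Return a list of contract violations; an empty list means the handoff is fine."""
    problems = []
    if msg.unit != expect_unit:
        problems.append(f"unit: got {msg.unit!r}, expected {expect_unit!r}")
    if msg.frame != expect_frame:
        problems.append(f"frame: got {msg.frame!r}, expected {expect_frame!r}")
    age_s = now_s - msg.stamp_s
    if age_s > max_age_s:
        problems.append(f"stale: age {age_s:.3f} s exceeds {max_age_s:.3f} s")
    return problems


good = Handoff("estimate", 0.42, "m", "world", stamp_s=10.00)
bad = Handoff("estimate", 42.0, "cm", "base_link", stamp_s=9.50)

print(check_handoff(good, "m", "world", now_s=10.02, max_age_s=0.05))  # prints []
for problem in check_handoff(bad, "m", "world", now_s=10.02, max_age_s=0.05):
    print(problem)  # one line each for the unit, frame, and staleness violations
```

The check returns a list instead of raising, so a logger can record every violation at a boundary, not only the first one found.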

Mechanism

The mechanism in the canonical stack is the contract between representation and action. For each module, name what enters it, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.
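A lightweight way to write that contract down is a table of stage specifications that can be checked mechanically. The sketch below follows the symbols in the equation above. The `StageSpec` class, the `chain_errors` helper, and every latency budget are hypothetical placeholders, not measured values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StageSpec:
    name: str
    consumes: str     # symbol the stage reads
    produces: str     # symbol the stage writes
    budget_ms: float  # hypothetical latency budget, not a measurement


STACK = [
    StageSpec("sense", "o_t", "x_t", 2.0),
    StageSpec("perceive", "x_t", "y_t", 30.0),
    StageSpec("estimate", "y_t", "s_hat_t", 5.0),
    StageSpec("predict", "s_hat_t", "s_hat_future", 10.0),
    StageSpec("plan", "s_hat_future", "tau_t", 40.0),
    StageSpec("control", "tau_t", "a_t", 1.0),
    StageSpec("act", "a_t", "o_t+1", 2.0),
]


def chain_errors(stack):
    """List every boundary where the upstream output is not the downstream input."""
    return [
        f"{up.name} -> {down.name}: produces {up.produces!r}, "
        f"but {down.name} consumes {down.consumes!r}"
        for up, down in zip(stack, stack[1:])
        if up.produces != down.consumes
    ]


total_budget_ms = sum(stage.budget_ms for stage in STACK)
print(chain_errors(STACK))   # prints [] because every handoff lines up
print(total_budget_ms)       # prints 90.0, the end-to-end budget implied by the table

# Simulate a bad handoff: the planner is rewired to read the raw percept.
broken = [replace(s, consumes="y_t") if s.name == "plan" else s for s in STACK]
print(chain_errors(broken))  # reports the predict -> plan boundary
```

Summing the per-stage budgets gives a quick sanity check on whether the whole loop can run at the intended control rate.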

Worked Example

The cleanest way to make the canonical stack concrete is to give every arrow its own function and run each observation through all seven. The toy world below is a 1-D reaching task: the body is at position $x$, a target sits at $x^\star$, and the only actuator is a bounded velocity command. Each stage does a single job and returns a value that the loop can log. To keep the toy small, the prediction is logged for inspection only, and the planner acts on the current estimate.

import numpy as np

rng = np.random.default_rng(0)
TARGET = 1.00          # x*: position the agent must reach (meters)
STEP_MAX = 0.20        # actuator saturation: max move per step (meters)
SENSOR_BIAS = 0.01     # known constant offset in the raw reading (meters)

def sense(true_x):                       # o_t: raw reading = truth + bias + noise
    return true_x + SENSOR_BIAS + rng.normal(0, 0.02)
def perceive(o):                         # calibrated reading: remove the known bias
    return o - SENSOR_BIAS
def estimate(x_meas, x_prev, alpha=0.7): # s_hat: first-order low-pass state filter
    return alpha * x_meas + (1 - alpha) * x_prev
def predict(s_hat, a_prev, H=3):         # position after H steps at the last command
    return s_hat + H * a_prev
def plan(s_hat):                         # tau_t: desired displacement to goal
    return TARGET - s_hat
def control(tau):                        # a_t: saturate to actuator limit
    return float(np.clip(tau, -STEP_MAX, STEP_MAX))
def act(true_x, a):                      # world transition; the next sense() reads it
    return true_x + a

true_x, s_hat, a = 0.0, 0.0, 0.0
for t in range(12):
    o      = sense(true_x)
    x_meas = perceive(o)
    s_hat  = estimate(x_meas, s_hat)
    s_pred = predict(s_hat, a)           # logged only; plan() uses s_hat in this toy
    tau    = plan(s_hat)
    a      = control(tau)
    true_x = act(true_x, a)
    print(f"t={t:2d} o={o:+.3f} s_hat={s_hat:+.3f} "
          f"pred={s_pred:+.3f} a={a:+.3f} x={true_x:+.3f}")

print(f"final error = {TARGET - true_x:+.3f} m")

Code Fragment 3.1.1 runs one rollout through all seven stages of the canonical stack. Each line of the trace records the handoffs for one time step: observation, estimate, prediction, command, and the new world state.

Expected output: the agent brings the error close to zero in under ten steps, with the command $a$ saturating at $\text{STEP\_MAX}=0.20$ on the first moves and then shrinking as $\hat{s}$ approaches the target. Because the low-pass estimate lags the true position, expect a small overshoot before the error settles. The value of the trace is that each column corresponds to an arrow in the equation above: if the robot misses, you can read off which stage first reported a wrong number. Try injecting a stale estimate (skip the estimate update for one step) and watch the error grow even though every other stage is correct.
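The stale-estimate experiment can be sketched in a few lines. The version below is a simplified, noise-free copy of the loop: sensor noise and bias are removed so the comparison is deterministic, which is an assumption of this sketch and not part of Code Fragment 3.1.1. The `rollout` and `overshoot` helpers are hypothetical names.

```python
TARGET, STEP_MAX, ALPHA = 1.0, 0.20, 0.7


def rollout(stale_steps=(), steps=12):
    """Noise-free 1-D reach; returns the true position after each step."""
    true_x, s_hat, positions = 0.0, 0.0, []
    for t in range(steps):
        if t not in stale_steps:
            # Normal case: low-pass update from a perfect measurement.
            s_hat = ALPHA * true_x + (1 - ALPHA) * s_hat
        # On a stale step the previous estimate is reused unchanged.
        a = max(-STEP_MAX, min(STEP_MAX, TARGET - s_hat))
        true_x += a
        positions.append(true_x)
    return positions


def overshoot(positions):
    """Largest distance travelled past the target (0.0 if never passed)."""
    return max(0.0, max(p - TARGET for p in positions))


nominal = rollout()
stale = rollout(stale_steps={5})  # skip the estimate update at t=5 only

print(f"nominal overshoot = {overshoot(nominal):.3f} m")  # about 0.086
print(f"stale overshoot   = {overshoot(stale):.3f} m")    # about 0.200
```

A single skipped update at the moment the body reaches the target more than doubles the overshoot, even though planning and control are unchanged.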

Library Shortcut

In this section the hand-built fragment is a visibility tool. Production work should move to maintained stacks such as Hugging Face Transformers, open VLMs, OpenVLA, openpi, LeRobot, and tool-calling planners, but only once the interface, logging contract, and failure recovery path have been made explicit.

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.
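The structured failure cases in the recipe above can be kept as a small enumeration plus a record type, so that failures can be counted instead of described in free text. This is a minimal sketch: the `FailureLabel` and `FailureCase` names, the field layout, and the example cases are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureLabel(Enum):
    PERCEPTION = "perception_error"
    STATE = "state_error"
    PLANNING = "planning_error"
    CONTROL = "control_error"
    EVALUATION = "evaluation_error"


@dataclass(frozen=True)
class FailureCase:
    episode: int
    step: int
    label: FailureLabel
    note: str  # short human-readable description of what was observed


# Invented example cases, not results from any real run.
cases = [
    FailureCase(3, 5, FailureLabel.STATE, "estimate not updated for one step"),
    FailureCase(7, 2, FailureLabel.CONTROL, "command saturated for whole episode"),
    FailureCase(9, 6, FailureLabel.STATE, "filter lag caused overshoot"),
]

tally = Counter(case.label for case in cases)
for label, count in tally.most_common():
    print(f"{label.value}: {count}")
```

A fixed label set forces each failure to be assigned to one stage, which is what makes a later comparison across runs meaningful.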

Common Failure Mode

The common mistake with the canonical stack is to celebrate a component score before checking the closed-loop handoff. The failure usually appears at a boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.
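One of these boundary failures, the saturated actuator, can be read directly from a logged command trace. The helper below is a hypothetical sketch: `saturation_fraction` is not from any library, and the command values are illustrative numbers in the style of the 1-D reaching task.

```python
def saturation_fraction(commands, limit, tol=1e-9):
    """Fraction of commands whose magnitude sits at the actuator limit."""
    if not commands:
        return 0.0
    hits = sum(1 for a in commands if abs(a) >= limit - tol)
    return hits / len(commands)


# Illustrative command trace (meters per step); the limit is 0.20.
commands = [0.2, 0.2, 0.2, 0.2, 0.2, 0.0855, -0.0342, -0.0462]
frac = saturation_fraction(commands, limit=0.20)
print(f"saturated on {frac:.1%} of steps")  # prints: saturated on 62.5% of steps
```

A high fraction means the planner is asking for more than the actuator can deliver, so a good planning score on its own says little about closed-loop behavior.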

Practical Example

A robotics team using the canonical stack should log not only final success but also intermediate observations, chosen actions, controller status, and recovery events. The logs reveal whether the method is solving the task or merely passing the easiest episodes.
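A per-step log of this kind can be as simple as one JSON object per line. The sketch below writes to an in-memory buffer so that it runs anywhere. The `log_step` helper and the field names (`obs_m`, `action_m`, `controller`, `recovery`) are assumptions for illustration, not a schema from any particular framework.

```python
import io
import json


def log_step(stream, **fields):
    """Append one step record as a single JSON line."""
    stream.write(json.dumps(fields, sort_keys=True) + "\n")


buf = io.StringIO()  # stands in for an open log file
log_step(buf, t=0, obs_m=0.013, action_m=0.2, controller="saturated", recovery=None)
log_step(buf, t=1, obs_m=0.187, action_m=0.2, controller="saturated", recovery=None)
log_step(buf, t=2, obs_m=0.410, action_m=0.0, controller="halted", recovery="retry")

# Reading the log back gives one dictionary per step.
records = [json.loads(line) for line in buf.getvalue().splitlines()]
recoveries = [r["t"] for r in records if r["recovery"] is not None]
print(len(records), "steps logged; recovery events at steps", recoveries)
```

Line-delimited records can be appended while the episode is still running, which keeps the trace intact if the process stops partway through.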

Fun Note

The canonical stack is a relay race where every runner blames the previous handoff until the robot misses the grasp.

Research Frontier

Treat frontier claims about the canonical stack as hypotheses until they expose enough detail to reproduce the result: data boundary, embodiment, controller interface, evaluation panel, and failure cases.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode for a system built on the canonical stack? If not, the system boundary is still too vague.

The canonical stack becomes useful when it is tied to a closed-loop contract for how perception, estimation, planning, learning, and control are arranged into a system. The contract names the observation stream, the action representation, the timing budget, the safety boundary, and the result artifact. That is the bridge between a readable concept and a system a skeptical builder can test.

When evaluating the canonical stack, separate the conceptual claim, the systems claim, and the evidence claim. A good explanation, a clean API, and one successful rollout are different kinds of evidence, and they should be kept distinct.

Tool or Library	Role in This Topic	Builder Advice
ROS 2	separates system modules while preserving message contracts and timing	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
MuJoCo	gives architecture choices a repeatable simulated world for stress tests	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
LeRobot	anchors modern policy architectures in reusable datasets and policy APIs	Use it when the hand-built contract is clear and the experiment needs repeatable runs.

For the canonical stack, a robust implementation starts with one inspectable baseline whose artifact records observations, actions, units, timestamps, seeds, termination reasons, and the perturbation applied. The maintained-tool version is useful only if it preserves that schema and keeps the comparison construct-matched, that is, both versions are evaluated on the same task definition and metric.

Write a one-paragraph task contract with observation, action, success, failure, and safety fields.
Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
Run one deterministic smoke test and one perturbation test before scaling.
Save one artifact containing configuration, seed, metrics, traces, and failure labels.
Compare methods only when the same script evaluates the same panel, split, seed set, and metric.
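The smoke-test and artifact steps in the checklist above can be sketched together. The toy environment, the `run_episode` and `build_artifact` helpers, the key names, and the 0.1 m triage threshold are all assumptions made for this example. A real project would substitute its own simulator and schema.

```python
import json
import random

REQUIRED_KEYS = ("config", "seed", "metrics", "trace", "failure_labels")


def run_episode(seed, config):
    """Toy 1-D reach with a seeded noise source, so reruns are repeatable."""
    rng = random.Random(seed)
    x, trace = 0.0, []
    for t in range(config["steps"]):
        obs = x + rng.gauss(0.0, config["noise_std_m"])
        limit = config["step_max_m"]
        action = max(-limit, min(limit, config["target_m"] - obs))
        x += action
        trace.append({"t": t, "obs_m": obs, "action_m": action, "x_m": x})
    return trace


def build_artifact(seed, perturbation="none"):
    config = {"target_m": 1.0, "step_max_m": 0.2, "noise_std_m": 0.02,
              "steps": 12, "perturbation": perturbation}
    trace = run_episode(seed, config)
    final_error = abs(config["target_m"] - trace[-1]["x_m"])
    # Hypothetical rule: flag the episode for triage if it ends over 0.1 m away.
    labels = ["needs_triage"] if final_error > 0.1 else []
    return {"config": config, "seed": seed,
            "metrics": {"final_error_m": final_error},
            "trace": trace, "failure_labels": labels}


first = build_artifact(seed=0)
second = build_artifact(seed=0)
reproducible = first == second  # deterministic smoke test: same seed, same artifact
missing = [key for key in REQUIRED_KEYS if key not in first]
serialized = json.dumps(first, sort_keys=True, indent=2)  # ready to write to disk

print("reproducible:", reproducible, "| missing keys:", missing)
```

Storing the configuration and seed inside the artifact lets any later comparison confirm that two runs were evaluated under the same conditions.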

When a system built on the canonical stack fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

A practical diagnostic is to freeze the downstream stages and replay the upstream trace. If the controller succeeds when fed the logged plan, the control layer is probably not the first cause. If the planner succeeds when fed a corrected state estimate, the problem moves to perception or estimation. This replay habit turns the stack from a diagram into a fault isolation tool.
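The replay habit can be sketched on a three-step log from the 1-D reaching task. The sketch assumes the log also contains a ground-truth position (`true_x`), which in practice would have to come from a simulator or an external measurement. The log values and the `first_mismatch` helper are invented for illustration.

```python
TARGET, STEP_MAX = 1.0, 0.20


def plan(s_hat):
    return TARGET - s_hat


def control(tau):
    return max(-STEP_MAX, min(STEP_MAX, tau))


def first_mismatch(log, stage_fn, in_key, out_key, tol=1e-9):
    """Replay stage_fn on logged inputs; return the first step whose output differs."""
    for i, rec in enumerate(log):
        if abs(stage_fn(rec[in_key]) - rec[out_key]) > tol:
            return i
    return None


# Invented log: at step 2 the estimate is stale (still 0.7 while the body is at 0.9).
log = [
    {"true_x": 0.5, "s_hat": 0.5, "tau": 0.5, "a": 0.2},
    {"true_x": 0.7, "s_hat": 0.7, "tau": 0.3, "a": 0.2},
    {"true_x": 0.9, "s_hat": 0.7, "tau": 0.3, "a": 0.2},
]

# 1. Feed the controller the logged plan: does it reproduce the logged command?
control_fault = first_mismatch(log, control, "tau", "a")
# 2. Feed the planner the logged estimate: does it reproduce the logged plan?
plan_fault = first_mismatch(log, plan, "s_hat", "tau")
# 3. Feed planner and controller the corrected state: does the command change?
state_fault = first_mismatch(log, lambda x: control(plan(x)), "true_x", "a")

print(control_fault, plan_fault, state_fault)  # prints: None None 2
```

Control and planning both reproduce their logged outputs, while substituting the corrected state changes the command at step 2, so the first cause sits upstream in perception or estimation.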

Key Takeaway

The canonical stack is useful when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 3.1.1

Design a method-matched experiment for the canonical stack. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

What's Next?

Section 3.2 compares this stack with the classical modular robotics pipeline.

Bibliography & Further Reading

Foundational References For This Section

Quigley, M. et al. "ROS: an open-source Robot Operating System." (2009). https://www.ros.org/

The systems reference for modular robot software and message-passing architecture.

Todorov, E., Erez, T., and Tassa, Y. "MuJoCo: A physics engine for model-based control." (2012). https://mujoco.org/

A widely used simulator for architecture and control experiments.

Brohan, A. et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." (2023). https://arxiv.org/abs/2307.15818

A central reference for locating VLM and VLA models in embodied control stacks.