Section 34.9: Action Representations In VLA Systems | Building Embodied AI: From Perception to Autonomous Action

"Every action representation is a promise about what kinds of motion your policy can even imagine."
A Motion-Interface Designer

**Figure 34.9.1:** An embodied policy choosing among discrete action tokens, smooth chunks, and continuous trajectories before handing commands to a safety-aware controller. Action representation is the robot-facing API of a VLA because it decides whether the policy speaks in symbols, chunks, or continuous motor traces.

Big Picture

Action representation decides what a VLA can express before learning even starts. Discrete tokens are compact and work naturally with autoregressive decoding. Continuous chunks preserve geometric fidelity. Diffusion and flow heads preserve multi-modality, while hierarchical skills trade low-level expressivity for longer-horizon structure.

The Design Question

An action head has to compress a continuous physical process into a model-friendly interface. The main trade-off is between compact discrete structure and faithful motor detail. If you discretize too aggressively, small but important control variations vanish because nearby commands fall into the same bin. If you stay fully continuous, inference can become slower, and the head can be harder to integrate with an autoregressive backbone.
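
The loss from aggressive discretization can be made concrete with a minimal sketch of uniform binning. The bin counts, the normalized range of [-1, 1], and the two example commands are illustrative assumptions, not values from any particular VLA.

```python
# Uniform binning of a normalized action in [-1, 1].
# Bin counts and example values are illustrative assumptions.

def quantize(value: float, n_bins: int, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    return min(int((clipped - low) / width), n_bins - 1)


def dequantize(index: int, n_bins: int, low: float = -1.0, high: float = 1.0) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# Two hypothetical wrist commands that differ by a small amount.
a, b = 0.100, 0.130

for n_bins in (8, 256):
    ia, ib = quantize(a, n_bins), quantize(b, n_bins)
    err = abs(dequantize(ia, n_bins) - a)
    print(f"bins={n_bins:3d} same_token={ia == ib} reconstruction_error={err:.4f}")
```

With 8 bins the two commands map to the same token, so the policy cannot distinguish them; with 256 bins they separate, at the cost of a larger vocabulary per action dimension.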

The field currently uses four broad strategies for the low-level action head: per-step discrete tokens, compressed action tokens such as FAST and FAST+, direct continuous chunks, and generative continuous heads such as diffusion or flow matching. Hierarchical skill interfaces sit one level above these and can wrap any of them. The right answer depends on control rate, action smoothness, and how multi-modal the task is.
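
To see why compressed action tokens suit smooth trajectories, the following simplified sketch applies a discrete cosine transform (DCT) to a smooth trajectory and rounds the coefficients. It is a stand-in for frequency-domain compression in general, not the FAST implementation: the later token-compression stage is omitted, and the trajectory and the `scale` value are hypothetical.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) form)."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(
            coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


# Hypothetical smooth 1-second trajectory for one joint, sampled at 50 Hz.
n_steps = 50
trajectory = [0.5 * math.sin(2 * math.pi * t / n_steps) for t in range(n_steps)]

scale = 10.0  # hypothetical quantization scale; larger values keep more detail
quantized = [round(c * scale) for c in dct2(trajectory)]
nonzero = sum(1 for q in quantized if q != 0)

reconstructed = idct2([q / scale for q in quantized])
max_err = max(abs(x - y) for x, y in zip(trajectory, reconstructed))

print(f"timesteps={n_steps} nonzero_coefficients={nonzero} max_error={max_err:.4f}")
```

Because the trajectory is smooth, most high-frequency coefficients round to zero, so far fewer symbols than timesteps carry information. A jerky trajectory would keep more nonzero coefficients, which is one way to see why smoothness appears in the selection criteria above.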

The Representation Chooses The Failure Mode

Token models usually fail through aliasing, where distinct motions collapse into the same symbol, and through sequence length. Continuous heads usually fail through latency, calibration sensitivity, or weaker integration with language-model-style decoders.

A Compact Comparison Formula

A practical comparison uses both fidelity and runtime:

$$J = \alpha \cdot \text{task\_success} - \beta \cdot \text{latency} - \gamma \cdot \text{reconstruction\_error}.$$

Here $\alpha$, $\beta$, and $\gamma$ are nonnegative weights, and their values depend on the application. A dexterous high-rate hand may tolerate more model complexity to reduce reconstruction error. A mobile manipulator with strict runtime bounds may prefer a simpler but faster interface.
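
As a minimal sketch, the score $J$ can be evaluated for two candidate interfaces. Every number below (success rates, latencies, errors, and weights) is a made-up placeholder chosen only to show the arithmetic; the `Candidate` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    task_success: float          # fraction of successful episodes, 0..1
    latency_s: float             # seconds of inference per action chunk
    reconstruction_error: float  # mean absolute action error, normalized units


def score(c: Candidate, alpha: float, beta: float, gamma: float) -> float:
    """J = alpha * task_success - beta * latency - gamma * reconstruction_error."""
    return alpha * c.task_success - beta * c.latency_s - gamma * c.reconstruction_error


# Placeholder measurements, not results from any real system.
candidates = [
    Candidate("tokenized", task_success=0.80, latency_s=0.120, reconstruction_error=0.020),
    Candidate("chunked_continuous", task_success=0.78, latency_s=0.040, reconstruction_error=0.005),
]

# Placeholder weights; beta and gamma absorb the units of their terms.
weights = dict(alpha=1.0, beta=2.0, gamma=5.0)

for c in candidates:
    print(f"{c.name}: J={score(c, **weights):.3f}")

best = max(candidates, key=lambda c: score(c, **weights))
print(f"preferred={best.name}")
```

Under these placeholder weights the slightly less successful but faster interface scores higher. Raising `alpha` or lowering `beta` can reverse the ranking, which is why the coefficients should be fixed before comparing candidates.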

Code Fragment 1 contrasts a naive token budget with a chunked representation.

```python
# Compare how many model outputs are needed for the same 1-second control horizon.
control_hz = 20   # control steps per second
horizon_s = 1.0   # horizon length in seconds
timesteps = int(control_hz * horizon_s)

naive_tokens_per_step = 7  # discrete tokens emitted per control step
chunk_length = 5           # control steps covered by one predicted chunk
chunk_outputs = timesteps // chunk_length

print(f"naive_token_predictions={timesteps * naive_tokens_per_step}")
print(f"chunk_predictions={chunk_outputs}")
```

```
naive_token_predictions=140
chunk_predictions=4
```

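Code Fragment 1: The output-count difference shows why chunking and compression matter. A model that must emit 140 discrete symbols for one second of motion faces a very different latency regime than a model that predicts four chunk-level objects. Each chunk still describes five timesteps of action values, so the saving comes from the number of sequential predictions, not from describing less motion.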

Library Shortcut

The OpenVLA, openpi, and LeRobot toolchains provide maintained implementations of discrete-token, chunked, and continuous policy heads, so you can compare interfaces without rebuilding the entire training stack. That abstraction pays off once you understand the action-interface trade-offs well enough to choose a head deliberately.

When Each Representation Wins

Action Representation Decision Guide

| Representation | Best when | Main risk |
| --- | --- | --- |
| Naive discrete tokens | Low-rate commands or simple proof-of-concept setups | Long sequences and quantization error |
| FAST or FAST+ tokens | Smooth high-rate actions with autoregressive backbones | Tokenizer mismatch across embodiments |
| Continuous chunks | Short-horizon manipulation with explicit controllers | Chunk boundaries can hide mid-course correction needs |
| Diffusion or flow heads | Multi-modal continuous trajectories and dexterous behaviors | Sampling cost and runtime complexity |
| Hierarchical skills | Long-horizon tasks with reusable motion motifs | Low-level nuance may be hidden behind the skill interface |
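
The chunk-boundary risk in the guide can be illustrated with a toy one-dimensional rollout. The sketch compares executing a whole predicted chunk open-loop against replanning after every step (receding-horizon execution). The policy, the environment, and the constant `DRIFT` disturbance are all hypothetical.

```python
from typing import List, Tuple

TARGET = 1.0
CHUNK_LEN = 5
DRIFT = -0.02  # hypothetical unmodeled disturbance applied at every step


def predict_chunk(state: float) -> List[float]:
    """Toy policy: plan CHUNK_LEN actions, each closing half the predicted remaining error."""
    actions, predicted = [], state
    for _ in range(CHUNK_LEN):
        action = 0.5 * (TARGET - predicted)
        actions.append(action)
        predicted += action  # the plan ignores DRIFT, so it slowly diverges from reality
    return actions


def step_env(state: float, action: float) -> float:
    """Toy environment: the commanded motion plus the disturbance."""
    return state + action + DRIFT


def rollout(execute_per_chunk: int, total_steps: int = 20) -> Tuple[float, int]:
    """Execute the first `execute_per_chunk` actions of each chunk, then replan."""
    state, policy_calls, steps_done = 0.0, 0, 0
    while steps_done < total_steps:
        chunk = predict_chunk(state)
        policy_calls += 1
        for action in chunk[:execute_per_chunk]:
            if steps_done == total_steps:
                break
            state = step_env(state, action)
            steps_done += 1
    return state, policy_calls


s_full, calls_full = rollout(execute_per_chunk=CHUNK_LEN)
s_replan, calls_replan = rollout(execute_per_chunk=1)

print(f"open-loop chunks: final_error={abs(TARGET - s_full):.3f} policy_calls={calls_full}")
print(f"replan each step: final_error={abs(TARGET - s_replan):.3f} policy_calls={calls_replan}")
```

Replanning every step ends closer to the target but needs five times as many policy calls in this toy setup. Intermediate values of `execute_per_chunk` trade correction speed against inference load.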

Do Not Choose By Fashion

Diffusion and flow heads are not automatically superior. If your robot runs at low frequency with nearly deterministic action choices, a simpler chunked or tokenized interface may be easier to deploy and debug.
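
A rough budget check makes the sampling-cost argument concrete. The sketch assumes inference blocks execution (no pipelining) and uses placeholder forward-pass times and pass counts; `budget` and the head names are hypothetical helpers, not measurements of any real model.

```python
def budget(forward_pass_s: float, passes_per_chunk: int, chunk_length: int, control_hz: float):
    """Return (inference seconds per chunk, seconds of motion the chunk covers, keeps_up)."""
    inference_s = forward_pass_s * passes_per_chunk
    covered_s = chunk_length / control_hz
    return inference_s, covered_s, inference_s <= covered_s


# Placeholder configurations: sequential forward passes needed to emit one chunk.
heads = {
    "per_step_tokens": dict(forward_pass_s=0.015, passes_per_chunk=7, chunk_length=1),
    "continuous_chunk": dict(forward_pass_s=0.030, passes_per_chunk=1, chunk_length=5),
    "diffusion_chunk": dict(forward_pass_s=0.030, passes_per_chunk=10, chunk_length=5),
}

control_hz = 20
results = {name: budget(control_hz=control_hz, **cfg) for name, cfg in heads.items()}

for name, (inference_s, covered_s, keeps_up) in results.items():
    print(f"{name}: inference={inference_s:.3f}s covers={covered_s:.3f}s keeps_up={keeps_up}")
```

With these placeholders, only the single-pass chunk head keeps up at 20 Hz. The same diffusion configuration does keep up at 5 Hz, so the control rate, not the head type alone, decides whether the extra sampling passes are affordable.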

Practical Example

A household mobile manipulator opening a drawer may benefit from chunked continuous actions because the lower-level controller can smooth short-horizon motion. A bimanual dexterous hand manipulating cables may need a richer generative head or a stronger tokenizer because tiny trajectory details matter far more.

Memory Hook

Choosing an action representation is like choosing whether to speak to the robot in syllables, full sentences, or dance notation. Every choice drops something and gains something.

Self Check

Your robot runs at 50 Hz and needs smooth wrist motion. Which representation class would you rule out first, and why would its failure show up as a runtime or reconstruction problem?

Research Frontier

FAST and FAST+ strengthened the discrete route by showing that action tokenization can respect smooth continuous trajectories, while pi-zero-style flow models strengthened the continuous route. The likely durable outcome is not a single winner but a modular interface layer that lets the same VLA backbone swap action heads by task and embodiment.
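
One way to picture such a modular interface layer is a shared decoding contract that several heads implement. The sketch below is purely illustrative: `ActionHead`, `BinnedTokenHead`, `ContinuousChunkHead`, and `act` are hypothetical names, the "features" are stand-ins for backbone outputs, and none of this mirrors the API of OpenVLA, openpi, or LeRobot.

```python
from typing import Dict, List, Protocol, Sequence

Action = List[float]


class ActionHead(Protocol):
    """Hypothetical contract: turn backbone features into `horizon` actions."""

    def decode(self, features: Sequence[float], horizon: int) -> List[Action]: ...


class BinnedTokenHead:
    """Toy discrete head: snap each value in [-1, 1] to the center of a uniform bin."""

    def __init__(self, n_bins: int = 256) -> None:
        self.n_bins = n_bins

    def _snap(self, x: float) -> float:
        clipped = min(max(x, -1.0), 1.0)
        index = min(int((clipped + 1.0) / 2.0 * self.n_bins), self.n_bins - 1)
        return -1.0 + (index + 0.5) * 2.0 / self.n_bins

    def decode(self, features: Sequence[float], horizon: int) -> List[Action]:
        return [[self._snap(x) for x in features] for _ in range(horizon)]


class ContinuousChunkHead:
    """Toy continuous head: pass the values through unchanged for every step."""

    def decode(self, features: Sequence[float], horizon: int) -> List[Action]:
        return [list(features) for _ in range(horizon)]


HEADS: Dict[str, ActionHead] = {
    "tokens": BinnedTokenHead(),
    "chunk": ContinuousChunkHead(),
}


def act(features: Sequence[float], head_name: str, horizon: int) -> List[Action]:
    """Select a head by name; the caller never depends on a concrete head class."""
    return HEADS[head_name].decode(features, horizon)


features = [0.1, -0.4, 0.9]  # stand-in for backbone output
for name in HEADS:
    print(name, act(features, name, horizon=2))
```

The point of the design is that the downstream controller consumes the same `List[Action]` shape regardless of head, so swapping heads by task or embodiment becomes a configuration choice.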

Key Takeaway

Action representation is not a small implementation detail. It is the part of the VLA that decides how motor intelligence is packaged, how latency accumulates, and which classes of motion error become likely.

Exercise 34.9

For one robot task of your choice, compare a tokenized and a continuous action interface on the same control horizon. Write down the expected latency, reconstruction risk, and controller burden for each before you run anything.

What's Next?

Chapter 35 broadens this action-interface discussion into full robot foundation models and cross-embodiment learning, where the action contract has to survive changes in robot body, sensor tree, and adaptation workflow.

Section References

Pertsch et al. (2025). "FAST: Efficient Action Tokenization for Vision-Language-Action Models."

The central reference for compression-based action tokenization and the FAST+ tokenizer.

Physical Intelligence. "openpi" repository.

Useful for seeing how pi-zero family models package flow-based and token-based action interfaces in open code.

OpenVLA repository.

An open reference for autoregressive VLA training and adaptation workflows.

LeRobot documentation.

Useful for practical policy heads, datasets, and evaluation flows on accessible hardware.