"A VLA policy is a contract between language grounding, image tokens, robot state, and action decoding."
A Grounded AI Agent
A VLA can name an object and still fail the motion. Always evaluate grounding, action accuracy, latency, and recovery as separate properties.
The leap from a VLM to a VLA is a change in output contract. Without an action head, the model stops at scene understanding. Once it emits robot commands, every semantic claim has to survive control timing, workspace limits, and recovery behavior.
Why VLMs Are Not Enough
A vision-language model can describe a scene, answer a question, or identify an object. That is useful, but embodiment adds a harder requirement: the answer must change the world. If the instruction is "put the cup on the coaster," the system must decide where the cup is, where the coaster is, which grasp is feasible, how the arm should move, when the gripper should close, and how to recover if the cup slips.
A VLA model extends the multimodal stack by adding an action channel. Formally, it learns a policy $\pi_\theta(a_{t:t+H} \mid o_{1:t}, q, r_t)$, where $o_{1:t}$ is the history of visual observations, $q$ is the language instruction, $r_t$ is the robot state at time $t$, and $a_{t:t+H}$ is an action chunk over a short horizon $H$. The exact action representation varies across systems: discrete tokens in RT-1 and RT-2, diffusion or flow outputs in RDT and pi-zero, and compressed action tokens in FAST-style autoregressive policies.
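To make the discrete-token route concrete, the sketch below bins each action dimension uniformly and maps the bin index back to a bin-center value. It is a minimal illustration, not the tokenizer of any cited system: the 256-bin count, the per-dimension range `low`/`high`, and the function names `tokenize_actions` and `detokenize_actions` are assumptions made for this example.

```python
import numpy as np

N_BINS = 256  # assumed number of bins per action dimension


def tokenize_actions(chunk: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map a continuous chunk of shape (H, D) to integer tokens in [0, N_BINS - 1]."""
    clipped = np.clip(chunk, low, high)
    scaled = (clipped - low) / (high - low)  # each value now lies in [0, 1]
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)


def detokenize_actions(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map tokens back to the center of their bin."""
    centers = (tokens.astype(np.float64) + 0.5) / N_BINS
    return low + centers * (high - low)


# Hypothetical per-axis displacement limits in meters.
low = np.array([-0.05, -0.05, -0.05])
high = np.array([0.05, 0.05, 0.05])
chunk = np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]])

tokens = tokenize_actions(chunk, low, high)
recovered = detokenize_actions(tokens, low, high)
max_error = np.abs(recovered - chunk).max()
print(tokens.shape, max_error)
```

The round-trip error is bounded by half a bin width, so the bin count and the action range together set the finest motion the token route can express.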
Language tells the policy what counts as progress, vision tells it what the world currently affords, and the action head commits to a motor trajectory. The action head is where a VLA stops being a scene interpreter and becomes an embodied policy.
The Interface Contract
The most useful way to read any VLA paper is to ask four interface questions. What observations enter the model? What robot state is exposed? What action space exits the model? What controller consumes those actions? This contract is the bridge back to Chapter 2 on action representations and Chapter 7 on controllers versus policies.
Code Fragment 1 makes the VLA interface concrete with typed containers. It does not run a neural policy; it shows the contract that every neural policy must satisfy.
# Minimal VLA interface: image features, instruction text, and robot state enter together.
# The policy returns an action chunk, not a single ungrounded language answer.
from dataclasses import asdict, dataclass

import numpy as np


@dataclass
class VLAObservation:
    image_embedding: np.ndarray  # visual features for the current frame
    instruction: str             # natural-language task description
    joint_state: np.ndarray      # proprioceptive robot state

    def as_row(self) -> dict[str, object]:
        # Flatten the observation into a plain dict for logging.
        return asdict(self)


@dataclass
class ActionChunk:
    delta_xyz: np.ndarray     # shape (H, 3): end-effector displacement per step
    gripper_open: np.ndarray  # shape (H,): 1.0 = open, 0.0 = closed


obs = VLAObservation(
    image_embedding=np.array([0.12, 0.88, 0.41]),
    instruction="pick up the red block",
    joint_state=np.array([0.0, 0.4, -0.2, 0.1]),
)
chunk = ActionChunk(
    delta_xyz=np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]]),
    gripper_open=np.array([1.0, 0.0]),
)
print(obs.instruction)
print(chunk.delta_xyz.shape)
pick up the red block
(2, 3)
VLAObservation and ActionChunk classes show the contract between perception, language, proprioception, and control. Notice that the action is a short sequence, which gives the controller enough temporal context to move smoothly.
The hand-built interface above is about 25 lines. With LeRobot or OpenVLA tooling, the same contract is mostly declared through dataset features and policy configuration in a few lines, while the library handles image transforms, action normalization, batching, checkpoint loading, and device placement. Keep the manual version for debugging because it names every boundary that can fail.
# LeRobot shortcut: inspect the observation and action schema before training.
# The dataset object exposes cameras, robot state, language task, and action chunks.
# Note: the import path below is the one given in the source; it may differ across LeRobot versions.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_static_coffee")
print(dataset.features.keys())       # names of observation and action fields
print(dataset.meta.info.get("fps"))  # recording rate of the dataset
LeRobotDataset shortcut replaces custom schema plumbing with a maintained dataset interface. It handles metadata, episode indexing, media decoding, and action field conventions internally.
Figure 34.1 should be read as the minimal VLA interface: instruction, visual observation, proprioception, action representation, and rollout evidence must all be named before behavior is interpreted.
Build And Evaluation Checklist
Curriculum, depth, and self-containment. The shift from VLM to VLA is a change in output contract: the model stops producing descriptions and starts producing robot actions that must satisfy timing and safety constraints. The practical way to read this section is to pin down the interface, the assumptions, a concrete example, and a failure mode before comparing methods.
Production and evaluation contract. A VLA is a policy with language conditioning, not a captioner with a gripper. Treat this section's diagram, code, table, exercise, warning, and references as one evidence packet covering boundary, artifact, tool choice, transfer check, failure mode, and source grounding.
Before accepting a result from this section, name the loop variable that changed, the tool that makes it reproducible, the failure that would fool the metric, and the source that backs the claim.
Write an evidence row for one instruction-conditioned rollout: camera stream, language prompt, robot state, action head, success metric, latency, and the failure label that explains the first bad action.
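One possible shape for such an evidence row is sketched below as a dataclass serialized to a JSON line. The class name `RolloutEvidence`, the field names, and every value are hypothetical placeholders chosen for illustration; they are not measurements from a real rollout.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RolloutEvidence:
    camera_stream: str            # which camera fed the policy
    language_prompt: str          # instruction given to the policy
    robot_state_dim: int          # length of the proprioceptive vector
    action_head: str              # e.g. "discrete_tokens" or "diffusion"
    success: bool                 # task-level success metric
    latency_ms: float             # time from observation to first action
    failure_label: Optional[str]  # explains the first bad action, if any


# Placeholder values for illustration only.
row = RolloutEvidence(
    camera_stream="wrist_rgb",
    language_prompt="pick up the red block",
    robot_state_dim=4,
    action_head="discrete_tokens",
    success=False,
    latency_ms=120.0,
    failure_label="gripper_closed_early",
)

line = json.dumps(asdict(row))  # one JSON object per rollout, easy to append to a log
print(line)
```

Writing one row per rollout keeps the failure label next to the interface details that produced it, which makes later comparison across checkpoints straightforward.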
Failure Modes
A VLA can fail even when its language understanding looks strong. It can hallucinate affordances, issue an action outside the robot workspace, ignore proprioceptive limits, move too slowly for a dynamic scene, or overfit to camera viewpoints seen during data collection. These are not minor implementation details. They are the reasons VLA evaluation must be closed-loop and robot-aware.
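Two of these failures can be screened before an action chunk reaches the controller. The sketch below assumes that actions are Cartesian displacements in meters, that the workspace is an axis-aligned box, and that a chunk must be computed before the previous one finishes executing. The function names and all numeric limits are hypothetical.

```python
import numpy as np


def first_workspace_violation(start_xyz, delta_xyz, lower, upper):
    """Return the index of the first step that leaves the workspace box, or None."""
    positions = start_xyz + np.cumsum(delta_xyz, axis=0)  # commanded position per step
    inside = np.all((positions >= lower) & (positions <= upper), axis=1)
    bad = np.flatnonzero(~inside)
    return int(bad[0]) if bad.size else None


def within_latency_budget(inference_s: float, control_hz: float, chunk_len: int) -> bool:
    """A chunk of chunk_len steps covers chunk_len / control_hz seconds of motion."""
    return inference_s <= chunk_len / control_hz


# Hypothetical tabletop workspace and starting pose, in meters.
lower = np.array([0.20, -0.30, 0.015])
upper = np.array([0.70, 0.30, 0.50])
start = np.array([0.40, 0.00, 0.03])
deltas = np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]])

violation_step = first_workspace_violation(start, deltas, lower, upper)
on_time = within_latency_budget(inference_s=0.12, control_hz=10.0, chunk_len=2)
print(violation_step, on_time)
```

Here the second step would drive the end effector below the assumed table clearance, so the check flags step index 1 even though the chunk arrives within its timing budget. Checks like these do not replace closed-loop evaluation, but they turn two failure modes into explicit labels.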
Before fine-tuning a VLA, write a one-page interface card: camera names and rates, proprioceptive fields, action dimensions, action frequency, controller type, safety limits, dataset license, and the exact success metric. This card prevents a common mistake: training a powerful model against a vague action contract.
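The interface card can also live in code, where an empty entry is easy to detect. The following is a minimal sketch: the `InterfaceCard` class, its field names, and the example values are assumptions that mirror the list above, not a schema from LeRobot or OpenVLA.

```python
from dataclasses import dataclass


@dataclass
class InterfaceCard:
    cameras: dict[str, float]        # camera name -> frame rate in Hz
    proprio_fields: list[str]        # names of proprioceptive signals
    action_dim: int                  # number of action dimensions
    action_hz: float                 # rate at which actions are consumed
    controller: str                  # controller that executes the actions
    safety_limits: dict[str, float]  # limit name -> value
    dataset_license: str
    success_metric: str

    def missing_fields(self) -> list[str]:
        """Names of entries that are empty, zero, or None."""
        return [name for name, value in vars(self).items() if not value]


# Hypothetical card with the license left blank on purpose.
card = InterfaceCard(
    cameras={"wrist_rgb": 30.0, "overhead_rgb": 15.0},
    proprio_fields=["joint_pos", "gripper_width"],
    action_dim=4,
    action_hz=10.0,
    controller="cartesian_impedance",
    safety_limits={"max_speed_m_s": 0.25},
    dataset_license="",
    success_metric="block lifted 5 cm within 20 s",
)
print(card.missing_fields())
```

A non-empty result from `missing_fields()` is a reasonable signal to stop and finish the card before fine-tuning starts.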
Expected output: this section should leave a reproducible VLA evidence trace with checkpoint, action representation, robot interface, metric, and failure label.
A reliable interface appears twice: once in the system diagram and once in the replay logs. If those two views disagree, the policy contract is still too vague.
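A simple way to test that agreement is to compare the shapes declared in the diagram with the shapes found in a logged step. The sketch below is illustrative: the `declared` and `logged` dictionaries and the helper `contract_mismatches` are hypothetical, and a real replay log would need its own loader.

```python
import numpy as np

# Shapes declared in the system diagram (hypothetical values).
declared = {"image_embedding": (3,), "joint_state": (4,), "delta_xyz": (2, 3)}

# One step read back from a replay log (hypothetical values).
logged = {
    "image_embedding": np.zeros(3),
    "joint_state": np.zeros(6),
    "delta_xyz": np.zeros((2, 3)),
}


def contract_mismatches(declared, logged):
    """List every field whose logged shape is missing or differs from the declared one."""
    problems = []
    for name, shape in declared.items():
        if name not in logged:
            problems.append(f"{name}: missing from log")
        elif logged[name].shape != shape:
            problems.append(f"{name}: declared {shape}, logged {logged[name].shape}")
    return problems


mismatches = contract_mismatches(declared, logged)
print(mismatches)
```

In this example the log carries a six-element joint state while the diagram declares four, which is exactly the kind of disagreement that signals a vague contract.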
For a tabletop pick task, name the image inputs, the robot-state vector, the language instruction, the action dimension, and the controller that executes the output. If any answer is unknown, the VLA is not yet a buildable system.
The frontier question is whether action should be another language-like token stream or a continuous trajectory generated by a specialized head. RT-2 made the token route famous, while RDT and pi-zero strengthened the continuous diffusion and flow route. FAST reopens the token route by compressing action sequences before tokenization.
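The idea of compressing an action sequence before tokenization can be illustrated with a discrete cosine transform that keeps only the lowest-frequency coefficients. This is a sketch of the general idea under that assumption, not the FAST implementation; the function names, the horizon of 16 steps, and the synthetic trajectory are all hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat


def compress_chunk(chunk: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` lowest-frequency coefficients for each action dimension."""
    return (dct_matrix(chunk.shape[0]) @ chunk)[:keep]


def decompress_chunk(coeffs: np.ndarray, horizon: int) -> np.ndarray:
    """Zero-pad the dropped coefficients and invert the transform."""
    padded = np.zeros((horizon, coeffs.shape[1]))
    padded[: coeffs.shape[0]] = coeffs
    return dct_matrix(horizon).T @ padded


# Synthetic smooth two-dimensional action chunk over 16 steps.
steps = np.linspace(0.0, 1.0, 16)
chunk = np.stack([0.05 * steps, 0.02 * np.sin(np.pi * steps)], axis=1)

coeffs = compress_chunk(chunk, keep=4)     # 4 x 2 numbers instead of 16 x 2
approx = decompress_chunk(coeffs, horizon=16)
print(coeffs.shape, np.abs(approx - chunk).max())
```

Smooth trajectories concentrate their energy in low frequencies, so fewer numbers need to be quantized into tokens. The trade-off is that sharp motions, such as a sudden gripper close, lose detail when high-frequency coefficients are dropped.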
A VLA is best understood as a policy with a multimodal front end and an action-generating back end. The model name matters less than the observation-action contract it satisfies.
Choose one robot task and write its VLA interface card. Include observation fields, action fields, control rate, success metric, and two failure modes that a static VLM would miss.
What's Next?
Section 34.2 follows the historical path from RT-1 to RT-2 and RT-X, where action tokenization and cross-embodiment data became central ideas.
Brohan et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv.
RT-1 showed that a transformer policy trained on large real robot data could produce discretized low-level robot actions from images and instructions. It is the starting point for the chapter lineage and useful for readers who want the engineering details behind large-scale robot data collection.
Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv.
RT-2 made the action-as-language move explicit by fine-tuning VLM backbones to emit robot actions as tokens. Researchers should read it for the co-training setup, while practitioners should read it for the limits of transferring web semantics into motor control.
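Open X-Embodiment Collaboration (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv.
This paper introduced the cross-institution robot data mixture and RT-X models. It is essential for understanding why embodiment metadata, action normalization, and dataset mixture design matter.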
Hugging Face. "LeRobot." GitHub.
LeRobot is the practical open-source toolkit used here for datasets, policy training, evaluation, and low-cost robot workflows. Engineers should start here before writing custom data loaders or training loops.
Kim et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv.
OpenVLA connects open VLM backbones to robot action generation and provides a practical codebase for fine-tuning. Practitioners should read it alongside the GitHub repository before adapting an open VLA to a new robot.