Section 34.1: From VLMs to VLAs: the core idea | Building Embodied AI: From Perception to Autonomous Action

"A VLA policy is a contract between language grounding, image tokens, robot state, and action decoding."
A Grounded AI Agent

Figure 34.1A: The VLA architecture. A vision encoder and a language encoder feed joint embeddings into a cross-modal transformer, whose output tokens are projected by an action head into a continuous end-effector trajectory.

Do Not Confuse Semantics With Control

A VLA can name an object and still fail the motion. Always evaluate grounding, action accuracy, latency, and recovery as separate properties.

Big Picture

The leap from a VLM to a VLA is a change in output contract. Without an action head, the model stops at scene understanding. Once it emits robot commands, every semantic claim has to survive control timing, workspace limits, and recovery behavior.

Why VLMs Are Not Enough

A vision-language model can describe a scene, answer a question, or identify an object. That is useful, but embodiment adds a harder requirement: the answer must change the world. If the instruction is "put the cup on the coaster," the system must decide where the cup is, where the coaster is, which grasp is feasible, how the arm should move, when the gripper should close, and how to recover if the cup slips.

A VLA model extends the multimodal stack by adding an action channel. Formally, it learns a policy $\pi_\theta(a_{t:t+H} \mid o_{1:t}, q, r_t)$, where $o$ is visual observation, $q$ is the language instruction, $r_t$ is robot state, and $a_{t:t+H}$ is an action chunk over a short horizon. The exact action representation varies across systems: discrete tokens in RT-1 and RT-2, diffusion or flow outputs in RDT and pi-zero, and compressed action tokens in FAST-style autoregressive policies.
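One of the representations named above, discrete action tokens, can be sketched as uniform binning of each action dimension. This is a minimal illustration, not the tokenizer of any specific system: the 256-bin count follows what the RT-1 paper reports, while the bounds of -1 and 1, the function names, and the sample chunk are assumptions made for this example.

```python
import numpy as np

NUM_BINS = 256  # bins per action dimension, as reported for RT-1

def actions_to_tokens(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer bin indices."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    # The upper bound itself would land in bin num_bins, so cap the index.
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def tokens_to_actions(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the continuous value at each bin center."""
    return low + (tokens + 0.5) * (high - low) / num_bins

# Hypothetical chunk: H = 2 steps, 3 action dimensions bounded in [-1, 1].
chunk = np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]])
tokens = actions_to_tokens(chunk, low=-1.0, high=1.0)
recovered = tokens_to_actions(tokens, low=-1.0, high=1.0)
max_error = float(np.max(np.abs(recovered - chunk)))
print(tokens.shape, max_error)
```

The round trip loses at most half a bin width per dimension, which is the price the token route pays for reusing a language-model vocabulary.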

The Core Move

Language tells the policy what counts as progress, vision tells it what the world currently affords, and the action head commits to a motor trajectory. The action head is where a VLA stops being a scene interpreter and becomes an embodied policy.

The Interface Contract

The most useful way to read any VLA paper is to ask four interface questions. What observations enter the model? What robot state is exposed? What action space exits the model? What controller consumes those actions? This contract is the bridge back to Chapter 2 on action representations and Chapter 7 on controllers versus policies.

Code Fragment 1 makes the VLA interface concrete with typed containers. It does not run a neural policy; it shows the contract that every neural policy must satisfy.

# Minimal VLA interface: image features, instruction text, and robot state enter together.
# The policy returns an action chunk, not a single ungrounded language answer.
from dataclasses import asdict, dataclass
import numpy as np

@dataclass
class VLAObservation:
    image_embedding: np.ndarray
    instruction: str
    joint_state: np.ndarray

    def as_row(self) -> dict[str, object]:
        return asdict(self)

@dataclass
class ActionChunk:
    delta_xyz: np.ndarray
    gripper_open: np.ndarray

obs = VLAObservation(
    image_embedding=np.array([0.12, 0.88, 0.41]),
    instruction="pick up the red block",
    joint_state=np.array([0.0, 0.4, -0.2, 0.1]),
)
chunk = ActionChunk(
    delta_xyz=np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]]),
    gripper_open=np.array([1.0, 0.0]),
)
print(obs.instruction)
print(chunk.delta_xyz.shape)

pick up the red block
(2, 3)

Code Fragment 1: The VLAObservation and ActionChunk classes show the contract between perception, language, proprioception, and control. Notice that the action is a short sequence, which gives the controller enough temporal context to move smoothly.
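To see how a controller might consume such a chunk, the sketch below accumulates the per-step deltas into absolute end-effector targets. It assumes each row of `delta_xyz` is relative to the previous step (some systems define deltas relative to the chunk start instead), and the starting pose and the helper name `chunk_to_waypoints` are hypothetical.

```python
import numpy as np

def chunk_to_waypoints(start_xyz, delta_xyz):
    """Accumulate per-step deltas into absolute end-effector targets."""
    return start_xyz + np.cumsum(delta_xyz, axis=0)

start_xyz = np.array([0.40, 0.00, 0.30])  # hypothetical current pose, meters
delta_xyz = np.array([[0.02, 0.00, -0.01], [0.01, 0.01, -0.02]])

waypoints = chunk_to_waypoints(start_xyz, delta_xyz)
print(waypoints.shape)  # one absolute target per step in the chunk
```

Which delta convention applies is exactly the kind of detail the interface contract has to state, because the same array produces different motions under the two readings.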

Library Shortcut

The hand-built interface above is about 25 lines. With LeRobot or OpenVLA tooling, the same contract is mostly declared through dataset features and policy configuration in a few lines, while the library handles image transforms, action normalization, batching, checkpoint loading, and device placement. Keep the manual version for debugging because it names every boundary that can fail.

# LeRobot shortcut: inspect the observation and action schema before training.
# The dataset object exposes cameras, robot state, language task, and action chunks.
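# Import path as written for the LeRobot version used here; newer releases may expose
# the class under lerobot.datasets.lerobot_dataset, so check the installed version.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset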

dataset = LeRobotDataset("lerobot/aloha_static_coffee")
print(dataset.features.keys())
print(dataset.meta.info.get("fps"))

Code Fragment 2: The LeRobotDataset shortcut replaces custom schema plumbing with a maintained dataset interface. It handles metadata, episode indexing, media decoding, and action field conventions internally.

Read Figure 34.1 as the minimal VLA interface: instruction, visual observation, proprioception, action representation, and rollout evidence must all be named before behavior is interpreted.

Figure 34.1: A closed-loop map of the VLM-to-VLA interface. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Curriculum, depth, and self-containment. The VLM-to-VLA shift is a change in output contract: the model stops producing descriptions and starts producing robot actions that must satisfy timing and safety constraints. In practice, pin down the interface, the assumptions, a concrete example, and the failure mode before comparing methods.

Production and evaluation contract. A VLA is a policy with language conditioning, not a captioner with a gripper. Treat this section's diagram, code, table, exercise, warning, and references as one evidence packet: boundary, artifact, tool choice, transfer check, failure mode, and source grounding.

Checklist Memory Anchor

Before accepting any VLA result, name the loop variable that changed, the tool that makes it reproducible, the failure that would fool the metric, and the source that backs the claim.

Mini Audit Exercise

Write an evidence row for one instruction-conditioned rollout: camera stream, language prompt, robot state, action head, success metric, latency, and the failure label that explains the first bad action.
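One possible shape for such an evidence row is a small dataclass serialized as a JSON line. The class name `RolloutEvidence`, the field names, and every value below are hypothetical placeholders for this exercise, not a logging format defined by any library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class RolloutEvidence:
    camera_stream: str
    language_prompt: str
    robot_state: list
    action_head: str
    success: bool
    latency_ms: float
    failure_label: Optional[str]  # None when the rollout succeeded

# Hypothetical values for a single failed rollout.
row = RolloutEvidence(
    camera_stream="wrist_rgb",
    language_prompt="put the cup on the coaster",
    robot_state=[0.0, 0.4, -0.2, 0.1],
    action_head="chunked delta_xyz + gripper",
    success=False,
    latency_ms=85.0,
    failure_label="gripper closed before contact",
)
line = json.dumps(asdict(row))
print(line)
```

Appending one such line per rollout gives a replay log that can be compared field by field against the system diagram.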

Failure Modes

A VLA can fail even when its language understanding looks strong. It can hallucinate affordances, issue an action outside the robot workspace, ignore proprioceptive limits, move too slowly for a dynamic scene, or overfit to camera viewpoints seen during data collection. These are not minor implementation details. They are the reasons VLA evaluation must be closed-loop and robot-aware.
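One of these failures, an action outside the robot workspace, can be caught with a simple check before the controller executes a chunk. The sketch below assumes an axis-aligned box workspace with made-up bounds in meters; real robots usually need kinematic and collision checks as well, and the function name `first_violation` is hypothetical.

```python
import numpy as np

# Hypothetical workspace box in the robot base frame, meters.
WORKSPACE_LOW = np.array([0.20, -0.30, 0.02])
WORKSPACE_HIGH = np.array([0.70, 0.30, 0.50])

def first_violation(waypoints, low=WORKSPACE_LOW, high=WORKSPACE_HIGH):
    """Return the index of the first waypoint outside the box, or None."""
    outside = np.any((waypoints < low) | (waypoints > high), axis=1)
    indices = np.flatnonzero(outside)
    return int(indices[0]) if indices.size else None

safe = np.array([[0.42, 0.00, 0.29], [0.43, 0.01, 0.27]])
unsafe = np.vstack([safe, [0.75, 0.00, 0.27]])  # last target exceeds x limit

print(first_violation(safe))    # None
print(first_violation(unsafe))  # 2
```

Logging the index of the first violating step, rather than only rejecting the chunk, supplies the failure label that explains the first bad action.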

Practical Recipe

Before fine-tuning a VLA, write a one-page interface card: camera names and rates, proprioceptive fields, action dimensions, action frequency, controller type, safety limits, dataset license, and the exact success metric. This card prevents a common mistake: training a powerful model against a vague action contract.
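The interface card can also live next to the training code as a small structure that reports which entries are still blank. This is a sketch under assumptions: the class `InterfaceCard`, its field names, and the example values are invented here to mirror the list above.

```python
from dataclasses import dataclass, field, fields

@dataclass
class InterfaceCard:
    cameras_hz: dict = field(default_factory=dict)      # camera name -> frame rate
    proprio_fields: list = field(default_factory=list)
    action_dim: int = 0
    action_hz: float = 0.0
    controller: str = ""
    safety_limits: dict = field(default_factory=dict)
    dataset_license: str = ""
    success_metric: str = ""

def missing_fields(card):
    """Names of card entries still left at their empty default."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]

# Hypothetical card for a tabletop task; the license has not been filled in.
card = InterfaceCard(
    cameras_hz={"wrist": 30, "overhead": 30},
    proprio_fields=["joint_pos", "gripper_width"],
    action_dim=7,
    action_hz=10.0,
    controller="cartesian impedance",
    safety_limits={"max_speed_m_s": 0.25},
    success_metric="cup on coaster within 2 cm",
)
print(missing_fields(card))  # ['dataset_license']
```

Refusing to start a fine-tuning run while `missing_fields` returns anything is one cheap way to enforce the contract.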

Expected output: this recipe should leave a reproducible VLA evidence trace with checkpoint, action representation, robot interface, metric, and failure label.

Memory Hook

A reliable interface appears twice: once in the system diagram and once in the replay logs. If those two views disagree, the policy contract is still too vague.

Self Check

For a tabletop pick task, name the image inputs, the robot-state vector, the language instruction, the action dimension, and the controller that executes the output. If any answer is unknown, the VLA is not yet a buildable system.

Research Frontier

The frontier question is whether action should be another language-like token stream or a continuous trajectory generated by a specialized head. RT-2 made the token route famous, while RDT and pi-zero strengthened the continuous diffusion and flow route. FAST reopens the token route by compressing action sequences before tokenization.
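The idea of compressing an action sequence before tokenizing it can be illustrated with a frequency-domain transform. The sketch below is written in the spirit of that approach and is not the FAST tokenizer itself: the use of a discrete cosine transform with plain rounding, the `SCALE` constant, and the synthetic trajectory are assumptions of this example, and the real pipeline should be taken from the FAST paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)  # rescale the constant row to unit norm
    return basis

SCALE = 1000   # hypothetical quantization scale: larger keeps more detail
horizon = 16   # steps in the action chunk

# Hypothetical smooth one-dimensional action chunk.
t = np.linspace(0.0, 1.0, horizon)
trajectory = 0.05 * np.sin(np.pi * t)

D = dct_matrix(horizon)
coeffs = D @ trajectory                            # time -> frequency
quantized = np.round(coeffs * SCALE).astype(int)   # integers to tokenize
nonzero = int(np.count_nonzero(quantized))

recovered = D.T @ (quantized / SCALE)              # frequency -> time
max_err = float(np.max(np.abs(recovered - trajectory)))
print(nonzero, "of", horizon, "coefficients are nonzero; max error", max_err)
```

Smooth motions concentrate their energy in a few low-frequency coefficients, so the integer sequence handed to a tokenizer can be much sparser than the per-step bins of the original token route.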

Key Takeaway

A VLA is best understood as a policy with a multimodal front end and an action-generating back end. The model name matters less than the observation-action contract it satisfies.

Exercise 34.1

Choose one robot task and write its VLA interface card. Include observation fields, action fields, control rate, success metric, and two failure modes that a static VLM would miss.

What's Next?

Section 34.2 follows the historical path from RT-1 to RT-2 and RT-X, where action tokenization and cross-embodiment data became central ideas.

Bibliography and Further Reading

Foundational Papers and Reports

Brohan et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv.

RT-1 showed that a transformer policy trained on large real robot data could produce discretized low-level robot actions from images and instructions. It is the starting point for the chapter lineage and useful for readers who want the engineering details behind large-scale robot data collection.

Paper

Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv.

RT-2 made the action-as-language move explicit by fine-tuning VLM backbones to emit robot actions as tokens. Researchers should read it for the co-training setup, while practitioners should read it for the limits of transferring web semantics into motor control.

Paper

Open X-Embodiment Collaboration et al. (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv.

This paper introduced the cross-institution robot data mixture and RT-X models. It is essential for understanding why embodiment metadata, action normalization, and dataset mixture design matter.

Paper

Hugging Face. "LeRobot." GitHub.

LeRobot is the practical open-source toolkit used here for datasets, policy training, evaluation, and low-cost robot workflows. Engineers should start here before writing custom data loaders or training loops.

Tool

Tools, Libraries, and Frontier Notes

Kim et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv.

OpenVLA connects open VLM backbones to robot action generation and provides a practical codebase for fine-tuning. Practitioners should read it alongside the GitHub repository before adapting an open VLA to a new robot.

Paper