"An open VLA checkpoint is a starting artifact, not a deployment claim."
A Grounded AI Agent
Discretized action tokens are convenient for transformer training, but the robot still executes metric motion, gripper commands, and timing. Always preserve the conversion back to physical units in the evaluation artifact.
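The conversion between metric actions and tokens can be sketched as follows. This is a minimal illustration that assumes uniform binning over per-dimension bounds; the bin count, the bounds, and the function names `tokenize` and `detokenize` are hypothetical and should be replaced by the values stored with the checkpoint you actually use.

```python
import numpy as np

N_BINS = 256  # assumed bin count; read the real value from the checkpoint config


def tokenize(action, low, high, n_bins=N_BINS):
    """Map a metric action vector to integer bin indices in [0, n_bins - 1]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # each dimension now lies in [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def detokenize(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to metric units using bin centers."""
    centers = (tokens.astype(np.float64) + 0.5) / n_bins
    return low + centers * (high - low)


# Hypothetical bounds: xyz deltas in meters, rotation deltas in radians, gripper in [0, 1].
low = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
high = np.array([0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0])

action = np.array([0.012, -0.030, 0.004, 0.10, -0.02, 0.0, 1.0])
tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)

# The quantization error per dimension is bounded by half a bin width.
bin_width = (high - low) / N_BINS
max_error = np.abs(recovered - action).max()
print("tokens:", tokens)
print("max round-trip error:", max_error)
```

Storing `low`, `high`, and the bin count next to every evaluation log is what lets a reviewer turn a token-level result back into meters, radians, and gripper state.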
Octo and OpenVLA are valuable because they expose the full builder workflow: dataset schema, observation adapters, action decoding, fine-tuning, and evaluation. Open policies turn VLA research into an auditable engineering path.
Why Open Generalist Policies Matter
Closed demonstrations can inspire a field, but open policies teach it how to build. Octo and OpenVLA are important because they give researchers and engineers code, weights, data recipes, and failure surfaces that can be inspected. They also represent two different answers to the same problem: how should a policy pretrained on diverse robot data adapt to a new task, a new camera setup, or a new robot?
Octo is a generalist robot policy trained on Open X-Embodiment trajectories, with flexible observation and task conditioning. OpenVLA uses an Internet-pretrained VLM backbone and fine-tunes it for action generation. In a design review, compare them by the policy family, the action representation, the data mixture, the fine-tuning path, and the amount of compute needed for adaptation.
An open VLA is valuable not only because you can run it, but because you can inspect the dataset interface, reproduce fine-tuning, measure latency, change the action head, and discover where the policy breaks.
A Practical Selection Guide
| Question | Octo-style answer | OpenVLA-style answer |
|---|---|---|
| Main strength | Robot-data generalist initialization | VLM semantics plus robot action fine-tuning |
| Best early use | Fine-tune on a robot setup with related data | Study language-conditioned manipulation with open tooling |
| Primary risk | Limited semantic transfer outside robot data | Strong semantics without reliable physical grounding |
| Debug handle | Observation and action adapters | Prompt, tokenizer, and action decoding path |
The table above is a starting point, not a leaderboard. Use it to choose experiments. Do not use it to declare a universal winner because the answer depends on robot, task, dataset, and evaluation protocol.
Manual fine-tuning scripts quickly grow past 100 lines once they include video loading, normalization, episode slicing, and checkpointing. LeRobot and OpenVLA tooling reduce that to configuration plus one training command, while handling dataset adapters, transforms, logging, and model loading internally.
# Practical route: use a maintained training entry point instead of custom loaders.
# Check the current repository docs before running because model names evolve.
python -m lerobot.scripts.train configs/smolvla_aloha_static_coffee.yaml
Figure 34.3 should be read as an adaptation pipeline: checkpoint, tokenizer or encoder, robot interface, fine-tuning data, calibration, and rollout logs each require their own version record.
Build And Evaluation Checklist
Curriculum, depth, and self-containment. Octo and OpenVLA represent two open routes: generalist diffusion-policy initialization and open VLM-based action generation. For both, the practical reading is to pin down the interface, the assumptions, a concrete example, and the failure mode before comparing methods.
Production and evaluation contract. Open weights matter because they let readers inspect data adapters, action heads, and fine-tuning recipes. Treat this section's diagram, code, table, exercise, warning, and references as one evidence packet covering boundary, artifact, tool choice, transfer check, failure mode, and source grounding.
Before accepting a result from an open generalist policy such as Octo or OpenVLA, name the loop variable that changed, the tool that makes it reproducible, the failure that would fool the metric, and the source that backs the claim.
Write an evidence row for one open-policy adaptation: base checkpoint, robot observations, action adapter, fine-tuning episodes, evaluation seed, and the failure mode that blocks deployment.
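One possible shape for such an evidence row is a small frozen record that can be serialized next to the rollout logs. The class name `EvidenceRow`, its field names, and every value below are illustrative placeholders, not results from any real run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceRow:
    """One open-policy adaptation, recorded so it can be audited later."""

    base_checkpoint: str
    robot_observations: tuple
    action_adapter: str
    fine_tuning_episodes: int
    evaluation_seed: int
    blocking_failure_mode: str


# Placeholder values for illustration only.
row = EvidenceRow(
    base_checkpoint="<checkpoint name and revision>",
    robot_observations=("front_camera", "wrist_camera", "joint_positions"),
    action_adapter="7-D end-effector delta, gripper 1.0 = open",
    fine_tuning_episodes=50,
    evaluation_seed=0,
    blocking_failure_mode="<first failure that prevents deployment>",
)

serialized = json.dumps(asdict(row), indent=2)
print(serialized)
```

Freezing the record makes it harder to edit a row after the evaluation has run, which keeps the checkpoint, seed, and failure mode tied to the same experiment.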
When Direct Execution Is Impractical
Open VLA models are still heavier than most compact examples in this book. On a small laptop or a 6 GB GPU, readers should start with dataset inspection, schema validation, and a tiny policy head before attempting full fine-tuning. The goal is to understand the data contract first, then scale compute when the contract is clean.
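Schema validation needs no GPU at all. The sketch below checks a single frame against an expected contract; the key names, image resolution, and the helper `validate_frame` are assumptions made for illustration, so substitute the keys and shapes your dataset actually exposes.

```python
import numpy as np

# Hypothetical data contract: three RGB cameras, a proprioceptive state, a 7-D action.
EXPECTED_SCHEMA = {
    "observation.images.front": ((224, 224, 3), np.uint8),
    "observation.images.left": ((224, 224, 3), np.uint8),
    "observation.images.wrist": ((224, 224, 3), np.uint8),
    "observation.state": ((7,), np.float32),
    "action": ((7,), np.float32),
}


def validate_frame(frame, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable schema problems (empty if the frame is clean)."""
    problems = []
    for key, (shape, dtype) in expected.items():
        if key not in frame:
            problems.append(f"missing key: {key}")
            continue
        value = np.asarray(frame[key])
        if value.shape != shape:
            problems.append(f"{key}: shape {value.shape}, expected {shape}")
        if value.dtype != dtype:
            problems.append(f"{key}: dtype {value.dtype}, expected {np.dtype(dtype)}")
    return problems


# A synthetic frame that satisfies the contract.
good_frame = {k: np.zeros(s, dtype=d) for k, (s, d) in EXPECTED_SCHEMA.items()}

# A frame with a missing camera and a 6-D action, a common mismatch between datasets.
bad_frame = dict(good_frame)
del bad_frame["observation.images.wrist"]
bad_frame["action"] = np.zeros(6, dtype=np.float32)

print(validate_frame(good_frame))
print(validate_frame(bad_frame))
```

Running a check like this over every episode before training is cheap, and it surfaces camera and action-dimension mismatches long before they show up as poor rollouts.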
Start with a single task and 20 held-out episodes. Verify that the model consumes the correct cameras, that action normalization inverts cleanly, that inference latency fits the control loop, and that failure videos are saved. Only then increase dataset size or model scale.
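Two of these checks can be scripted before any robot moves. The sketch below assumes mean and standard-deviation normalization and a 10 Hz control loop, and it uses a stand-in `stub_policy` in place of a real model; all three are assumptions to replace with your own statistics, loop rate, and policy call.

```python
import time

import numpy as np


def normalize(actions, mean, std):
    return (actions - mean) / std


def unnormalize(normalized, mean, std):
    return normalized * std + mean


# Synthetic actions stand in for 20 held-out steps of a 7-D action space.
rng = np.random.default_rng(0)
actions = rng.normal(size=(20, 7))
mean = actions.mean(axis=0)
std = actions.std(axis=0) + 1e-8  # small epsilon guards against constant dimensions

# Check 1: normalization must invert cleanly.
restored = unnormalize(normalize(actions, mean, std), mean, std)
round_trip_error = np.abs(restored - actions).max()


def stub_policy(observation):
    """Placeholder for a real policy call; returns a zero action."""
    return np.zeros(7)


def median_latency_s(policy, observation, n_calls=50):
    """Median wall-clock time of one policy call, in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        policy(observation)
        durations.append(time.perf_counter() - start)
    return float(np.median(durations))


# Check 2: inference latency must fit inside one control period.
CONTROL_HZ = 10.0  # assumed control rate
latency = median_latency_s(stub_policy, observation=None)
fits_control_loop = latency < 1.0 / CONTROL_HZ

print("round-trip error:", round_trip_error)
print("fits control loop:", fits_control_loop)
```

With a real model, measure latency on the deployment hardware and include image preprocessing in the timed call, since that is often a large share of the budget.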
If a policy fails, first check whether the gripper convention is inverted. Many dramatic robot failures reduce to one bit meaning "open" in one dataset and "closed" in another.
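A quick way to test the convention is to compare commanded gripper values with the measured gripper opening over one episode. The helper `gripper_high_means_open` and the sample numbers are hypothetical; real logs will also show a lag between command and measurement, so treat the sign of the correlation as a hint to confirm by watching a rollout.

```python
import numpy as np


def gripper_high_means_open(commands, measured_width):
    """Return True if larger command values coincide with a wider gripper."""
    commands = np.asarray(commands, dtype=float)
    measured_width = np.asarray(measured_width, dtype=float)
    if commands.std() == 0 or measured_width.std() == 0:
        raise ValueError("need an episode in which the gripper actually moves")
    correlation = np.corrcoef(commands, measured_width)[0, 1]
    return bool(correlation > 0)


# Illustrative episode: measured finger opening in meters.
width = [0.08, 0.08, 0.01, 0.01, 0.08]

dataset_a_commands = [1, 1, 0, 0, 1]  # here 1 means "open"
dataset_b_commands = [0, 0, 1, 1, 0]  # here 1 means "closed"

a_open_is_high = gripper_high_means_open(dataset_a_commands, width)
b_open_is_high = gripper_high_means_open(dataset_b_commands, width)

# If the conventions disagree, flip one dataset before mixing them.
if a_open_is_high != b_open_is_high:
    dataset_b_commands = [1 - c for c in dataset_b_commands]

print(a_open_is_high, b_open_is_high, dataset_b_commands)
```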
You inherit a robot dataset with three cameras and a 7-dimensional action vector. Which parts of Octo or OpenVLA adaptation would you inspect before training?
Open VLAs are becoming smaller and more accessible. SmolVLA is important because it shifts experimentation from large-lab-only runs toward consumer-hardware fine-tuning, but the same evaluation hygiene still applies.
Open generalist policies turn VLA research into an engineering workflow: inspect the schema, adapt the interface, fine-tune carefully, and evaluate on held-out closed-loop behavior.
Pick Octo, OpenVLA, or SmolVLA. Write a fine-tuning plan for a new tabletop task, including required data fields, compute assumptions, held-out tests, and the first failure video you would inspect.
What's Next?
Section 34.4 explains why several frontier systems use diffusion or flow heads instead of plain action tokens.
Octo Model Team et al. (2024). "Octo: An Open-Source Generalist Robot Policy." arXiv.
Octo is a transformer-based diffusion policy pretrained on Open X-Embodiment trajectories and designed for flexible fine-tuning. It is the clearest open reference for generalist policy initialization before the Internet-pretrained VLA wave.
Kim et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv.
OpenVLA connects open VLM backbones to robot action generation and provides a practical codebase for fine-tuning. Practitioners should read it alongside the GitHub repository before adapting an open VLA to a new robot.
OpenVLA Project. "OpenVLA GitHub Repository." GitHub.
The repository contains training and fine-tuning code for OpenVLA-style policies. It is the implementation reference when the chapter discusses open tooling rather than closed vendor demos.
Hugging Face. "LeRobot." GitHub.
LeRobot is the practical open-source toolkit used here for datasets, policy training, evaluation, and low-cost robot workflows. Engineers should start here before writing custom data loaders or training loops.
SmolVLA is a compact open VLA designed to run on more accessible hardware and fine-tune on LeRobot datasets. It is the best fit for the chapter's hands-on lab because it lowers the barrier to experimentation.
The Open X-Embodiment paper introduced the cross-institution robot data mixture and RT-X models. It is essential for understanding why embodiment metadata, action normalization, and dataset mixture design matter.