"A robot policy is a promise about the next second of the world."
A Grounded AI Agent
Figure 34.6 gives this page a compact map of the interface. Read it left to right, then check whether the surrounding prose names the same observation, action, and evidence contract.
Build And Evaluation Checklist
Curriculum, depth, and self-containment. Co-training blends web semantics with robot demonstrations. The gain is semantic breadth; the risk is a mismatch between internet concepts and executable actions. Before comparing co-training methods, pin down the interface, the assumptions, a concrete example, and the failure mode.
Production and evaluation contract. Keep web data and robot data contributions separable in ablations. Treat the diagram, code, table, exercise, warning, and references in this section as one evidence packet that covers boundary, artifact, tool choice, transfer check, failure mode, and source grounding.
Before accepting a co-training result, name the loop variable that changed, the tool that makes the result reproducible, the failure that would fool the metric, and the source that backs the claim.
For this section, write one evidence row with observation, action, metric, dataset or robot, seed, and failure label. Then explain why comparing that row with a result from a different setup would be invalid.
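One way to make the evidence row concrete is a small record type plus a check that lists where two setups differ. The sketch below is illustrative only: the class name EvidenceRow, the helper setup_mismatches, the choice of which fields count as "setup", and all example values are assumptions, not part of any library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvidenceRow:
    observation: str       # what the policy sees, e.g. cameras and resolution
    action: str            # action representation and control rate
    metric: str
    dataset_or_robot: str
    seed: int
    failure_label: str


# Assumption of this sketch: these fields define the setup and must match
# before two rows are compared. Seed and failure label may differ.
SETUP_FIELDS = ("observation", "action", "metric", "dataset_or_robot")


def setup_mismatches(a: EvidenceRow, b: EvidenceRow) -> list[str]:
    """Return the setup fields on which two evidence rows differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in SETUP_FIELDS if da[name] != db[name]]


# Hypothetical rows; every value is made up for illustration.
row_a = EvidenceRow("2x RGB 224px", "joint positions, 10 Hz",
                    "success rate", "aloha_static", 0, "grasp slip")
row_b = EvidenceRow("1x RGB 224px", "joint positions, 10 Hz",
                    "success rate", "aloha_static", 1, "wrong object")

print(setup_mismatches(row_a, row_b))  # ['observation']
```

A non-empty list is the short answer to why the comparison would be invalid: the two numbers were measured through different interfaces.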
def validate_adaptation_card(payload: dict[str, object]) -> dict[str, object]:
    assert payload, "payload must not be empty"
    return payload


# Keep adaptation metadata beside the checkpoint.
adaptation_card = {
    "base_model": "openvla-7b",
    "robot": "aloha_static",
    "control_hz": 10,
    "action_normalization": "dataset statistics",
}
print(validate_adaptation_card(adaptation_card))
Start adaptation from an open checkpoint and, when available, its official preprocessing code. Doing so avoids mismatched image normalization, tokenizer settings, action scaling, and camera ordering.
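Action scaling is the easiest of these mismatches to show in a few lines. The sketch below assumes per-dimension mean and standard deviation normalization, which is one common choice but not necessarily what a given checkpoint uses. The function names and the toy dataset are hypothetical.

```python
from statistics import mean, pstdev


def fit_stats(actions: list[list[float]]) -> tuple[list[float], list[float]]:
    """Per-dimension mean and standard deviation over a set of action vectors."""
    dims = list(zip(*actions))
    means = [mean(d) for d in dims]
    # Fall back to 1.0 for constant dimensions so we never divide by zero.
    stds = [pstdev(d) or 1.0 for d in dims]
    return means, stds


def normalize(action, means, stds):
    return [(a - m) / s for a, m, s in zip(action, means, stds)]


def unnormalize(action, means, stds):
    return [a * s + m for a, m, s in zip(action, means, stds)]


# Hypothetical two-dimensional actions, e.g. a position delta and a gripper command.
dataset = [[0.0, 0.0], [0.2, 1.0], [0.4, 0.0], [0.6, 1.0]]
means, stds = fit_stats(dataset)

model_output = normalize([0.4, 1.0], means, stds)

# Decoding with the statistics used for training recovers the command.
recovered = unnormalize(model_output, means, stds)

# Decoding with statistics from some other dataset silently rescales it.
wrong = unnormalize(model_output, [0.0, 0.0], [1.0, 1.0])

print(recovered, wrong)
```

The wrongly decoded command is still a plausible-looking number, which is why this mismatch tends to surface as poor robot behavior rather than as an error message.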
Co-training matters because web data can teach object semantics and task language that robot datasets rarely cover at scale. The hard part is preserving the distinction between semantic recognition and the physically grounded action knowledge that only embodied data supplies.
Why Co-Training Exists
Robot data is scarce, expensive, and narrow compared with web-scale image-text data. Web data contains object names, visual categories, spatial language, and common-sense associations, but it rarely contains the forces and trajectories needed to manipulate objects. Co-training tries to keep the semantic breadth of web data while teaching the model which actions change the physical world.
RT-2 is the canonical example: a VLM backbone is trained with both vision-language examples and robot action examples so that action tokens become part of the model vocabulary. π0.5 (pi-zero-point-five) pushes the idea further with heterogeneous sources for open-world mobile manipulation. The same principle appears in closed frontier systems such as Gemini Robotics and humanoid-focused systems such as GR00T.
The question is not whether to use web data or robot data. The question is how to mix them so semantic knowledge improves action without washing out the physical grounding that only embodied trajectories provide.
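To see what "action tokens become part of the model vocabulary" can mean in practice, here is a minimal sketch of uniform binning: each continuous action dimension is clipped to a range and mapped to an integer bin index that a language model could emit as a token. The bin count, the ranges, and the function names are assumptions of this sketch, not a reproduction of any specific system's tokenizer.

```python
NUM_BINS = 256  # assumed bin count for this sketch


def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)
        frac = (clipped - lo) / (hi - lo)
        # The upper edge of the range falls into the last bin.
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens, low, high, num_bins=NUM_BINS):
    """Decode bin indices back to bin-center values."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical three-dimensional action with per-dimension ranges.
low, high = [-1.0, -1.0, 0.0], [1.0, 1.0, 1.0]
action = [0.0, 1.0, 0.25]

tokens = action_to_tokens(action, low, high)
decoded = tokens_to_action(tokens, low, high)

print(tokens)  # [128, 255, 64]
```

Decoding returns bin centers, so the round trip loses at most half a bin width per dimension. That quantization error is part of the price of treating actions as vocabulary items.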
A Mental Model For Mixtures
Think of the training set as three buckets: web vision-language pairs, robot trajectories with language labels, and robot trajectories with weak or synthetic labels. Each bucket teaches a different ability. Web data teaches recognition and instruction semantics. Robot data teaches consequences. Weak labels increase scale but can inject ambiguity. A useful VLA training recipe makes those tradeoffs explicit.
When building a co-training run, create a mixture sheet with columns for source, license, robot embodiment, task family, label quality, action representation, and sampling weight. Review the sheet before training and after evaluation. Many surprising failures are really mixture-design failures.
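The mixture sheet can live in code next to the training config. The sketch below uses one row per bucket, checks that every column is present and that the weights sum to one, and then draws sources according to the sampling weights. All source names, weights, and field values are hypothetical placeholders.

```python
import random
from collections import Counter

COLUMNS = {"source", "license", "embodiment", "task_family",
           "label_quality", "action_repr", "weight"}

# Hypothetical mixture sheet: one row per bucket, all values illustrative.
MIXTURE_SHEET = [
    {"source": "web_image_text", "license": "to be reviewed", "embodiment": None,
     "task_family": "captioning and VQA", "label_quality": "human",
     "action_repr": None, "weight": 0.5},
    {"source": "robot_teleop", "license": "to be reviewed", "embodiment": "aloha_static",
     "task_family": "tabletop pick and place", "label_quality": "human",
     "action_repr": "joint positions", "weight": 0.3},
    {"source": "robot_autolabeled", "license": "to be reviewed", "embodiment": "aloha_static",
     "task_family": "tabletop pick and place", "label_quality": "synthetic",
     "action_repr": "joint positions", "weight": 0.2},
]


def check_sheet(sheet):
    """Fail early if a column is missing or the weights do not sum to one."""
    for row in sheet:
        missing = COLUMNS - set(row)
        assert not missing, f"{row.get('source')}: missing {sorted(missing)}"
    total = sum(row["weight"] for row in sheet)
    assert abs(total - 1.0) < 1e-9, f"weights sum to {total}"


def sample_sources(sheet, n, seed=0):
    """Draw n source names according to the sampling weights."""
    rng = random.Random(seed)
    names = [row["source"] for row in sheet]
    weights = [row["weight"] for row in sheet]
    return Counter(rng.choices(names, weights=weights, k=n))


check_sheet(MIXTURE_SHEET)
counts = sample_sources(MIXTURE_SHEET, 10_000)
print(counts)
```

Logging the realized counts alongside the intended weights gives the post-evaluation review something concrete to compare against.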
Semantic Generalization, Physical Generalization
Semantic generalization means the policy understands a new phrase or object category. Physical generalization means it can act correctly under new geometry, friction, lighting, dynamics, or embodiment. A VLA needs both, but they are not the same. A robot that recognizes a "ceramic mug" can still fail to grasp it if the handle pose is unusual or the camera viewpoint differs from training.
Internet pretraining can teach what objects look like and how people talk about them. It does not directly teach force closure, compliance, torque limits, or the sound of a gripper stalling. Treat web knowledge as semantic support, not as a substitute for embodied data.
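The distinction becomes measurable when each evaluation episode is tagged on both axes and success is reported per cell rather than as one average. The sketch below is a minimal version of that bookkeeping. The tags "seen" and "novel", the function name, and the episode results are hypothetical.

```python
from collections import defaultdict


def success_by_cell(episodes):
    """Success rate for each (semantic condition, physical condition) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, trials]
    for ep in episodes:
        cell = (ep["semantic"], ep["physical"])
        totals[cell][0] += int(ep["success"])
        totals[cell][1] += 1
    return {cell: s / n for cell, (s, n) in totals.items()}


# Hypothetical episodes. "novel" semantics means a new phrase or object
# category; "novel" physics means new geometry, lighting, or dynamics.
episodes = [
    {"semantic": "seen", "physical": "seen", "success": True},
    {"semantic": "seen", "physical": "seen", "success": True},
    {"semantic": "novel", "physical": "seen", "success": True},
    {"semantic": "novel", "physical": "seen", "success": False},
    {"semantic": "seen", "physical": "novel", "success": False},
    {"semantic": "seen", "physical": "novel", "success": False},
]

rates = success_by_cell(episodes)
for cell, rate in sorted(rates.items()):
    print(cell, rate)
```

In this made-up example the policy holds up better under new language than under new physics, a pattern that a single pooled success rate would hide.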
When co-training feels abstract, ask what would be different in the next frame of video, the next robot state, or the next safety margin.
Current frontier systems increasingly combine robot data, web data, synthetic data, human video, simulation, and generated labels. The open research question is how to prove that each source contributes real closed-loop capability rather than better-looking demos.
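A simple starting point for attributing capability to sources is a leave-one-source-out plan: train one run per dropped source and compare closed-loop metrics against the full mixture. The sketch below only generates the ablation mixtures. The source names and weights are hypothetical, and renormalizing the remaining weights is one design choice among several.

```python
def leave_one_out(weights: dict[str, float]) -> dict[str, dict[str, float]]:
    """For each source, drop it and renormalize the remaining weights to sum to one."""
    ablations = {}
    for dropped in weights:
        kept = {name: w for name, w in weights.items() if name != dropped}
        total = sum(kept.values())
        ablations[f"no_{dropped}"] = {name: w / total for name, w in kept.items()}
    return ablations


# Hypothetical full mixture.
weights = {"web": 0.5, "robot": 0.25, "synthetic": 0.25}
runs = leave_one_out(weights)

print(runs["no_web"])  # {'robot': 0.5, 'synthetic': 0.5}
```

Renormalizing keeps the total number of samples fixed but raises the share of every remaining source, so a drop in performance reflects both the missing source and the shifted proportions. Holding the absolute sample count of the kept sources fixed instead is the alternative worth recording on the mixture sheet.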
Expected output: a co-training run should leave a reproducible VLA evidence trace with checkpoint, action representation, robot interface, metric, and failure label.
Give one example where web knowledge helps a robot and one example where only embodied data can teach the missing behavior.
Co-training is useful when it preserves the distinction between knowing what an instruction means and knowing how a robot can physically carry it out.
Design a co-training mixture for a mobile manipulator that tidies a kitchen. Include at least four data sources, a sampling weight for each, and one validation test that isolates semantic generalization from physical generalization.
What's Next?
Section 34.7 turns from training data to prompting and runtime conditioning.
RT-2 made the action-as-language move explicit by fine-tuning VLM backbones to emit robot actions as tokens. Researchers should read it for the co-training setup, while practitioners should read it for the limits of transferring web semantics into motor control.
π0.5 extends π0 through heterogeneous co-training for broader open-world generalization. It is useful for readers studying the frontier between task-specific robot policies and household-scale generalist behavior.
Google DeepMind (2025). "Gemini Robotics: Bringing AI into the Physical World." arXiv.
This technical report documents Gemini Robotics as a generalist VLA model for direct robot control. It belongs in the bibliography because it provides the research framing behind the public product pages.
Gemini Robotics 1.5 is described by Google DeepMind as a VLA model that maps visual information and instructions into motor commands. It is important for frontier context, but readers should distinguish official demonstrations from independently replicated results.
Bjorck et al. (2025). "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." arXiv.
GR00T N1 frames humanoid control as a dual-system VLA architecture with reasoning and fast action generation. It prepares the transition from Chapter 34 into Chapter 35 and the later humanoid chapter.
The Open X-Embodiment paper introduced the cross-institution robot data mixture and the RT-X models. It is essential for understanding why embodiment metadata, action normalization, and dataset mixture design matter.
SmolVLA is a compact open VLA designed to run on more accessible hardware and fine-tune on LeRobot datasets. It is the best fit for the chapter hands-on lab because it lowers the barrier to experimentation.