Section 35.6: Data scale, compute, and the open-vs-closed divide | Building Embodied AI: From Perception to Autonomous Action

"Every scaling curve is secretly a budget memo with philosophical opinions about openness."
A Research Lead With Spreadsheets

Big Picture

Robot foundation models scale with data and compute, but embodied AI adds a third term that language-model discussions often understate: evaluation cost on real embodiments. The open-vs-closed divide is therefore about auditability and iteration speed as much as raw capability.

Why Scaling In Robotics Is Different

Language models can often scale by adding text and compute while keeping the evaluation channel cheap. Robotics is more stubborn. Each extra demonstration may require hardware time, operators, resets, safety review, and embodiment-specific calibration. A compute-heavy training run can still be bottlenecked by the cost of generating trustworthy robot data and validating it on real platforms.

That is why open and closed stacks create different research tempos. Closed systems may report impressive results with inaccessible data or infrastructure. Open stacks may lag in headline numbers while moving faster in community debugging, fine-tuning, and independent validation.

The Hidden Scaling Term

For embodied AI, the expensive term is often not tokens or FLOPs but real-world evidence generation.

A Simple Scaling Lens

A stylized way to think about the trade-off is

$$P \approx f(D, C, E),$$

where $D$ is data diversity and volume, $C$ is compute for training and serving, and $E$ is evaluation throughput on trustworthy scenario panels. Many labs can buy more compute faster than they can buy more credible robot evidence, which means progress saturates on the least glamorous axis.
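One hedged way to make the saturation claim concrete is a toy bottleneck model. The functional form below (a minimum over normalized axes) is an assumption chosen purely for illustration, not something the scaling lens prescribes, and the name `toy_progress` and all input values are hypothetical.

```python
def toy_progress(data, compute, evaluation):
    """Toy bottleneck model: progress is capped by the scarcest axis.

    Inputs are assumed to be pre-normalized to [0, 1] against whatever
    reference scale a lab chooses. The min() form is an illustrative
    assumption, not an empirical scaling law.
    """
    return min(data, compute, evaluation)


# Hypothetical normalized levels for D, C, and E.
baseline = toy_progress(data=0.6, compute=0.5, evaluation=0.2)
more_compute = toy_progress(data=0.6, compute=1.0, evaluation=0.2)
more_eval = toy_progress(data=0.6, compute=0.5, evaluation=0.4)

print(f"baseline:          {baseline:.2f}")      # 0.20
print(f"double compute:    {more_compute:.2f}")  # 0.20
print(f"double evaluation: {more_eval:.2f}")     # 0.40
```

Under this assumed form, doubling compute changes nothing while doubling evaluation throughput doubles the result, which is the sense in which progress saturates on the scarcest axis. A real $f$ is unlikely to be this sharp.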

A more operational view is to model the wall-clock time of one research iteration as

$$T_{\mathrm{iteration}} \approx \max(T_{\mathrm{data\ prep}}, T_{\mathrm{training}}, T_{\mathrm{hardware\ evaluation}}) + T_{\mathrm{failure\ analysis}}.$$

This decomposition makes the open-versus-closed divide more concrete. A closed stack may lower apparent model-development time because the policy arrives pre-integrated, but it often raises attribution cost because the lab cannot inspect which part of the performance came from data curation, architecture, post-training, teleoperation quality, or evaluation filtering. An open stack may start from a weaker absolute capability level while still producing faster scientific learning because failures are easier to localize and adaptation loops are easier to rerun.
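The iteration-time decomposition can be sketched directly in Python. The stage durations below are hypothetical numbers picked to mirror the argument, not measurements, and the reading of `max` as "overlapping stages where the slowest one sets the pace" is an interpretation of the formula rather than a stated assumption of the text.

```python
def iteration_time(data_prep, training, hardware_eval, failure_analysis):
    """Days per research iteration under the stylized formula.

    The first three stages are treated as overlapping, so the slowest
    one dominates (max); failure analysis is added on top.
    """
    return max(data_prep, training, hardware_eval) + failure_analysis


# Hypothetical durations in days.
# Closed stack: policy arrives pre-integrated (no training), but
# attributing failures is slow because internals are hidden.
closed_stack = iteration_time(
    data_prep=2, training=0, hardware_eval=10, failure_analysis=12
)
# Open stack: the lab trains its own policy, but failures are easier to localize.
open_stack = iteration_time(
    data_prep=5, training=6, hardware_eval=10, failure_analysis=4
)

print(f"closed_stack: {closed_stack} days")  # 22 days
print(f"open_stack:   {open_stack} days")    # 14 days
```

With these made-up inputs, hardware evaluation is the dominant stage in both programs, so the difference comes entirely from the failure-analysis term.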

Code Fragment 1 turns this intuition into a toy budget calculation.

# Compare where the budget goes in two robot-foundation-model programs.
# All figures are toy values for illustration.
programs = {
    "open_lab": {"data_hours": 400, "gpu_days": 120, "real_eval_days": 40},
    "closed_frontier": {"data_hours": 5000, "gpu_days": 900, "real_eval_days": 120},
}

for name, vals in programs.items():
    # Real-evaluation days spent per GPU day of training.
    evidence_pressure = vals["real_eval_days"] / vals["gpu_days"]
    print(f"{name}: evidence_pressure={evidence_pressure:.2f}")

Output:

open_lab: evidence_pressure=0.33
closed_frontier: evidence_pressure=0.13

The output shows a higher evidence-pressure ratio for the smaller open program (0.33 versus 0.13), meaning it spends more real-evaluation days per GPU day of training. That does not automatically make the open path worse; it often means the lab is paying for more inspectable evidence per unit of model development.

Code Fragment 1: The `evidence_pressure` ratio is not a universal metric, but it is a useful planning device. Higher values mean that real-world validation consumes more days relative to training compute. The ratio ignores `data_hours`, so it should not be read as a share of the total budget.
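Because `evidence_pressure` compares only two of the three budget lines, a lab may also want each line expressed as a share of total spend. The sketch below does that under explicitly hypothetical cost rates: the `COST_PER_UNIT` values and the helper `budget_shares` are assumptions for illustration and should be replaced with a lab's own figures.

```python
# Hypothetical cost rates in USD per unit; NOT real prices.
COST_PER_UNIT = {"data_hours": 150.0, "gpu_days": 50.0, "real_eval_days": 1000.0}

# Same toy programs as in Code Fragment 1.
programs = {
    "open_lab": {"data_hours": 400, "gpu_days": 120, "real_eval_days": 40},
    "closed_frontier": {"data_hours": 5000, "gpu_days": 900, "real_eval_days": 120},
}


def budget_shares(vals, rates=COST_PER_UNIT):
    """Convert each budget line to cost, then return its share of the total."""
    costs = {line: vals[line] * rate for line, rate in rates.items()}
    total = sum(costs.values())
    return {line: cost / total for line, cost in costs.items()}


shares = {name: budget_shares(vals) for name, vals in programs.items()}

for name, s in shares.items():
    summary = ", ".join(f"{line}={share:.0%}" for line, share in s.items())
    print(f"{name}: {summary}")
```

The resulting shares depend heavily on the assumed rates, which is the point: the conclusion about where the budget goes is only as credible as the cost model behind it.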

Library Shortcut

Open models and tooling such as LeRobot, OpenVLA, openpi, and SmolVLA, together with general-purpose infrastructure such as the Hugging Face Hub and ONNX Runtime, reduce the cost of experimentation by standardizing datasets, training recipes, checkpoint exchange, and evaluation exports. The payoff is more than convenience: more of the lab's budget can go toward real validation instead of custom infrastructure glue.

Mechanisms Behind The Cost Curve

Data scale matters in at least three different ways: the number of embodiments represented, the diversity of tasks and scenes, and the quality of state-action alignment inside each trajectory. Compute interacts with those axes asymmetrically. More FLOPs can help fit larger mixtures or more expressive action models, but they cannot repair missing calibration metadata, poor reset discipline, or under-specified action semantics.
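The three data axes above can be checked with a small audit before any compute is spent. This is a minimal sketch over hypothetical episode records; the field names (`embodiment`, `task`, `has_calibration`, `action_space`) are illustrative assumptions and do not correspond to any particular dataset schema.

```python
from collections import Counter

# Hypothetical episode metadata; field names are illustrative only.
episodes = [
    {"embodiment": "arm_a", "task": "pick", "has_calibration": True, "action_space": "delta_ee"},
    {"embodiment": "arm_a", "task": "place", "has_calibration": True, "action_space": "delta_ee"},
    {"embodiment": "arm_b", "task": "pick", "has_calibration": False, "action_space": None},
    {"embodiment": "arm_a", "task": "pick", "has_calibration": True, "action_space": "joint_pos"},
]


def audit(episodes):
    """Summarize embodiment coverage, task diversity, and metadata gaps."""
    per_embodiment = Counter(ep["embodiment"] for ep in episodes)
    tasks = {ep["task"] for ep in episodes}
    # Episodes that extra FLOPs cannot repair: missing calibration
    # or unspecified action semantics.
    gaps = [
        i for i, ep in enumerate(episodes)
        if not ep["has_calibration"] or ep["action_space"] is None
    ]
    return {
        "num_embodiments": len(per_embodiment),
        "num_tasks": len(tasks),
        "episodes_per_embodiment": dict(per_embodiment),
        "metadata_gaps": gaps,
    }


report = audit(episodes)
print(report)
```

Here the audit flags episode index 2 and shows that `arm_b` contributes a single episode, which is the kind of uneven embodiment coverage that a larger training run would not fix.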

This is where the open and closed worlds separate mechanistically. Open stacks usually expose dataset schema, action conventions, and training code, so their common failure mode is limited scale or uneven embodiment coverage. Closed stacks may demonstrate stronger integrated performance, but their common scientific weakness is that the source of the gain becomes entangled with private data mixtures, private evaluation filters, and hidden post-training procedures.

Open Versus Closed Is A Research Trade-Off

Open And Closed Stack Trade-Offs

Dimension	Open stack	Closed stack
Auditability	High: interfaces, datasets, and code can often be inspected	Low to medium: strongest details may remain vendor-private
Fine-tuning accessibility	High for community hardware and small labs	Usually limited to demos or narrow partner programs
Peak capability	May lag frontier reports	May lead on headline demonstrations
Replication speed	Fast once artifacts are published	Slow if key ingredients are inaccessible
Pedagogical value	Excellent for teaching full pipelines	Useful for frontier awareness and architecture lessons
Attribution of gains	Usually easier to localize to data, adapters, or training recipe	Often confounded by private data mixtures, curation rules, and evaluation infrastructure
Failure analysis depth	High when logs, schemas, and checkpoints are exported	Often shallow if only demos or aggregate metrics are visible

Evaluation Consequences

The choice of stack changes what kind of science a lab can do. Closed systems are often useful as frontier references or upper-bound demonstrations, but they are weak substrates for careful ablations because too many causal variables are hidden. Open systems are usually the right substrate for adaptation studies, mixture design, action-interface debugging, and course-ready assignments because every assumption can be written into a manifest and challenged.

The minimum evidence bundle for any comparison of this kind should include one construct-matched task panel, embodiment labels, dataset provenance, training or fine-tuning config, latency notes, and a failure taxonomy saved in the same artifact. Without that bundle, a stronger demo may still be a weaker scientific claim.
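The evidence bundle listed above can be written down as a small manifest with a completeness check. This is a sketch only: the class name `EvidenceBundle`, its field names, and the example values are hypothetical, chosen to mirror the six items in the text rather than any existing tool's format.

```python
from dataclasses import dataclass, field, fields


@dataclass
class EvidenceBundle:
    """Hypothetical manifest mirroring the six items in the text."""
    task_panel: str = ""
    embodiment_labels: list = field(default_factory=list)
    dataset_provenance: str = ""
    training_config: str = ""
    latency_notes: str = ""
    failure_taxonomy: list = field(default_factory=list)


def missing_fields(bundle):
    """Return the names of manifest entries that are still empty."""
    return [f.name for f in fields(bundle) if not getattr(bundle, f.name)]


# Example values are placeholders, not real artifacts.
bundle = EvidenceBundle(
    task_panel="tabletop_panel_v1",
    embodiment_labels=["arm_a"],
    dataset_provenance="community dataset, snapshot recorded in lab notes",
    training_config="finetune.yaml",
)

print(missing_fields(bundle))  # ['latency_notes', 'failure_taxonomy']
```

A check like this could gate result reporting: a run with a non-empty `missing_fields` list would be treated as a demo rather than as a claim.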

Do Not Confuse Accessibility With Weakness

An open model that a community can fine-tune, probe, and reproduce may generate more durable scientific progress than a closed model with better demos but thinner audit trails.

Practical Example

A small research group choosing between SmolVLA on LeRobot data and a vendor API should ask a blunt question: which path gets us to a reproducible adaptation, a fair evaluation panel, and a clear failure taxonomy within our actual budget? The answer is often the open path, even if the vendor demo looks stronger today.

Memory Hook

Some research programs scale like rockets. Others scale like moving a couch up the stairs. Robot data collection is usually the couch.

Self Check

If you had to cut one budget line tomorrow, which would damage the program more: data collection, GPU time, or real-world evaluation? Your answer reveals the true bottleneck of the project.

Research Frontier

Recent open releases such as SmolVLA and community LeRobot datasets are trying to democratize the field, while frontier vendor systems emphasize richer embodiment transfer and larger-scale post-training. The strategic question is whether the next major capability jump will come from more proprietary scale, better open data interfaces, or cheaper trustworthy evaluation.

Key Takeaway

Data scale and compute matter, but embodied AI progress is governed just as much by who can afford to generate and verify real behavior. Openness changes that equation.

Exercise 35.6

Write a one-page budget memo for an open robot-foundation-model project. Include planned data sources, compute budget, real-evaluation budget, and one reason the open stack would speed up or slow down the research cycle.

What's Next?

Section 35.7 closes the chapter by asking what still breaks even after all these design choices, and which open questions are still blocking truly general robot foundation models.

Bibliography and Further Reading

Open Tooling And Frontier Context

Hugging Face (2025). "SmolVLA."

A strong reference for affordable training and deployment on community-accessible hardware.

Tool report

LeRobot project page.

Useful for understanding how open infrastructure reduces the cost of working with robot datasets and pretrained policies.

Project

NVIDIA Research. "GR00T N1.5."

Relevant as a frontier capability report when thinking about the current closed or semi-closed side of the field.

Official page