"Generalist behavior is easy to announce and hard to certify."
A Safety-Minded Reviewer
Robot foundation models have made real progress, but they still break at the exact places that matter most for deployment: embodiment shifts, contact-rich edge cases, recovery after failure, data provenance, and trustworthy evaluation. This section is about those unresolved edges.
Five Limits That Still Matter
First, data heterogeneity is still under-controlled. We can pool more trajectories than ever, but we still have no general guarantee that one robot's data helps another robot in the ways we think it does. Second, evaluation remains fragile. Many systems are impressive on curated tasks yet degrade under scene perturbation, tight latency bounds, or changed embodiment assumptions.
Third, safety and recovery remain shallow compared with the ambition of "general-purpose" robotics. A large model may know what to do in-distribution and still have poor abstention behavior when the scene violates its assumptions. Fourth, data rights and provenance are becoming central as community datasets grow. Fifth, the open-versus-closed divide still makes it hard to separate architecture lessons from infrastructure advantages.
The hardest unresolved problems are rarely inside one neural block. They sit between data contracts, embodiment adapters, control loops, and evaluation pipelines.
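One way to make the evaluation-fragility point concrete is to report success rates with intervals instead of bare percentages. The sketch below uses the Wilson score interval for a binomial proportion. The panel numbers are hypothetical and chosen only for illustration; they are not results from any system named in this section.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Hypothetical panel results: (successes, trials) per condition.
panel = {"nominal": (18, 20), "perturbed": (11, 20)}

intervals = {name: wilson_interval(s, n) for name, (s, n) in panel.items()}
for name, (lo, hi) in intervals.items():
    s, n = panel[name]
    print(f"{name}: {s}/{n} success, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With only 20 trials per condition the two intervals still overlap, which is a reminder that small curated panels rarely support strong robustness claims in either direction.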
A Failure Taxonomy
A useful diagnostic decomposition is
$$R = R_{\text{perception}} + R_{\text{state}} + R_{\text{action}} + R_{\text{control}} + R_{\text{evaluation}},$$
where $R$ is the total failure rate on one matched scenario panel and each term is the part of it attributed primarily to that layer. The decomposition is not a theorem. It is a discipline for avoiding the lazy conclusion that "the model failed" when the real cause was stale calibration, interface mismatch, or a weak evaluation harness.
Code Fragment 1 computes a tiny version of that taxonomy.
```python
# Count failure causes on one matched scenario panel.
failures = ["perception", "action", "action", "evaluation", "control", "action"]
counts = {}
for name in failures:
    counts[name] = counts.get(name, 0) + 1
for name, value in sorted(counts.items()):
    print(f"{name}: {value}")
```

```
action: 3
control: 1
evaluation: 1
perception: 1
```
The output is a failure histogram in which action-side problems dominate this small panel. That is the useful scientific reading, because it tells the team where to spend the next debugging cycle instead of treating all failures as equally mysterious.
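The counts above can be normalized into the terms of the decomposition. The sketch below assumes that $R$ is read as the overall failure rate on the panel and that each term is the share of panel episodes whose failure was attributed to that layer. The value of `panel_size` is a hypothetical episode count, since the listing above records only the failures.

```python
from collections import Counter

# Same labels as the panel above; panel_size is an assumed total episode count
# (successes included), introduced here only for illustration.
failures = ["perception", "action", "action", "evaluation", "control", "action"]
panel_size = 40

# The five layers named in the decomposition, in the same order.
layers = ["perception", "state", "action", "control", "evaluation"]
counts = Counter(failures)

# Each term: fraction of panel episodes whose failure was attributed to that layer.
rates = {layer: counts[layer] / panel_size for layer in layers}
total_rate = sum(rates.values())

for layer in layers:
    print(f"R_{layer} = {rates[layer]:.3f}")
print(f"R = {total_rate:.3f}")
```

Listing every layer explicitly, including ones with zero failures such as `state`, keeps panels comparable across runs even when a category happens to be empty.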
The counting code is trivial. The hard part is deciding on a stable taxonomy and recording it consistently in the same artifact bundle as video, metrics, prompts, and seeds. That is where maintained evaluation tooling pays off: LeRobot reports, OpenVLA experiments, openpi serving logs, and DROID or LIBERO replay panels.
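As a minimal sketch of what recording the taxonomy consistently could look like, the record below stores a failure label next to the seed, prompt, and video path, and rejects labels outside a fixed set. The class name, field names, and example values are all hypothetical. They do not reflect the schema of LeRobot, OpenVLA, openpi, DROID, or LIBERO.

```python
import json
from dataclasses import dataclass, asdict

# Fixed label set matching the five-term decomposition (illustrative choice).
LAYERS = ("perception", "state", "action", "control", "evaluation")

@dataclass
class FailureRecord:
    """One failed episode, stored alongside the artifacts needed to replay it."""
    episode_id: str
    layer: str
    seed: int
    prompt: str
    video_path: str
    note: str = ""

    def __post_init__(self):
        # Reject free-form labels so the taxonomy stays stable across runs.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer {self.layer!r}; expected one of {LAYERS}")

# Hypothetical example values.
record = FailureRecord(
    episode_id="ep-0007",
    layer="action",
    seed=3,
    prompt="put the mug on the shelf",
    video_path="videos/ep-0007.mp4",
    note="gripper closed before contact",
)
line = json.dumps(asdict(record), sort_keys=True)  # one JSON line per failure
print(line)
```

Validating the label at construction time is a small design choice, but it prevents the slow drift into near-duplicate categories that makes histograms from different runs incomparable.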
For builders, the practical anchor set is concrete: use Hugging Face and LeRobot to standardize dataset cards and checkpoint exchange, PyTorch or JAX to test whether a representation change actually affects training dynamics, Weights & Biases or TensorBoard to track failure slices across runs, and DROID or LIBERO to stress the policy under broader variability. These are not interchangeable labels. Each one helps answer a different open question in the table below.
| Tool or benchmark | Question it helps study | Why it belongs in this section |
|---|---|---|
| LeRobot reports | How adaptation and failure taxonomies should be logged | Turns abstract open questions into inspectable artifact design. |
| OpenVLA or openpi experiments | Which representation or adaptation lever actually changes behavior | Open stacks make causal claims more inspectable than vendor demos. |
| DROID replay panels | Which failure types appear under broader real-world variability | Useful for testing whether nominal benchmark wins survive in-the-wild data. |
| LIBERO task suites | Whether broad multitask competence survives held-out task families | A practical benchmark anchor for the limits of generalization claims. |
Open Questions Worth Caring About
| Layer | Open question | Why it is still hard |
|---|---|---|
| Data | How should heterogeneous robot data be weighted in one mixture? | Helpfulness varies by embodiment, task, and control convention. |
| Representation | What action abstraction transfers best across robots? | Tokens, diffusion, flow, and hierarchical skills each fail differently. |
| Adaptation | How much of a new robot should be solved by metadata versus weight updates? | The cheapest lever changes from case to case. |
| Safety | How should a generalist policy know when not to act? | Abstention in physical systems is not as simple as low confidence in classification. |
| Evaluation | Which scalable benchmarks best predict real-world deployment behavior? | Simulation, synthetic perturbations, and hardware panels still disagree in important ways. |
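For the data row in the table above, one common starting point is to temper sampling weights by dataset size. The sketch below is an illustrative heuristic, not a method this section endorses. The dataset names and trajectory counts are hypothetical, and the exponent `alpha` is an assumed tuning knob.

```python
def mixture_weights(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Sampling weights proportional to size**alpha, normalized to sum to 1.

    alpha=1 samples in proportion to dataset size; alpha=0 samples datasets uniformly.
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    scaled = {name: n ** alpha for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

# Hypothetical trajectory counts per embodiment.
sizes = {"arm_a": 90_000, "arm_b": 9_000, "mobile_base": 1_000}

for alpha in (1.0, 0.5, 0.0):
    weights = mixture_weights(sizes, alpha)
    summary = ", ".join(f"{name}={w:.3f}" for name, w in weights.items())
    print(f"alpha={alpha}: {summary}")
```

Size tempering only corrects for imbalance. It says nothing about whether a given source actually helps the target embodiment, which is the part of the question that remains open.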
Vendor demonstrations and official reports are useful signals, especially for architecture ideas, but they do not replace independent panels, open artifacts, and careful failure accounting.
A mobile manipulator that succeeds in nominal object delivery may still fail the real deployment question if it cannot abstain when a hallway is blocked, a camera is occluded, or the grasped object shifts unexpectedly. The limitation is not "more data needed" in the abstract. It is a missing recovery and uncertainty story.
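One simple way to prototype such an uncertainty story is to gate execution on disagreement among several action proposals, for example from an ensemble or from repeated sampling. The sketch below is an assumed heuristic, not a technique attributed to any system in this section. The function name, the threshold, and the action vectors are hypothetical.

```python
import statistics

def should_abstain(action_samples: list[list[float]], max_spread: float) -> bool:
    """Abstain when any action dimension varies too much across proposals."""
    if len(action_samples) < 2:
        raise ValueError("need at least two action proposals")
    per_dim = zip(*action_samples)  # regroup values by action dimension
    spread = max(statistics.pstdev(values) for values in per_dim)
    return spread > max_spread

# Hypothetical 3-D action proposals from three ensemble members.
agree = [[0.10, 0.00, 0.30], [0.11, 0.01, 0.29], [0.09, -0.01, 0.31]]
disagree = [[0.10, 0.00, 0.30], [0.40, 0.20, 0.05], [-0.20, -0.15, 0.50]]

MAX_SPREAD = 0.05  # assumed threshold; would need tuning per robot and task
print("agree    -> abstain:", should_abstain(agree, MAX_SPREAD))
print("disagree -> abstain:", should_abstain(disagree, MAX_SPREAD))
```

Disagreement is only a proxy. Proposals can agree confidently on a wrong action when the scene is out of distribution, and abstaining is only safe if the controller has a well-defined fallback such as stopping or holding position, which this sketch leaves out.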
Calling a robot foundation model "general" before you understand its abstention behavior is like calling a submarine versatile before asking whether it knows when to surface.
Which of the five limitation categories in this section would worry you most for a household robot, and which artifact would you demand to inspect before trusting it?
Current research is pushing on safer embodied reasoning, richer multi-embodiment transfer, scalable real-to-sim evaluation, and better action tokenization. The hardest scientific gap may be confidence calibration for action: how to make a generalist robot know when its current policy prior no longer applies.
The frontier is not blocked by one missing giant model. It is blocked by unresolved interfaces among data, embodiment, control, recovery, and evidence.
Write a failure taxonomy for a robot foundation model in your target application area. Include at least one abstention-related failure, one embodiment-mismatch failure, and one evaluation-design failure.
What's Next?
Section 35.8 turns those open questions into a builder workflow for serving, fine-tuning, and evaluating open robot foundation models with explicit evidence cards and deployment constraints.
Useful for turning failure taxonomies, dataset cards, and replay artifacts into a reproducible evaluation workflow.
An open reference point for studying which failures arise from representation, adaptation, or deployment choices in VLA systems.
Google DeepMind (2025). "Gemini Robotics: Bringing AI into the Physical World."
Useful both for frontier capabilities and for the safety considerations that accompany direct robot control from multimodal models.
Relevant for the question of which simulated or scaled panels best predict real-world generalist policy behavior.
Pertsch et al. (2025). "FAST: Efficient Action Tokenization for Vision-Language-Action Models."
Useful because action representation remains one of the unresolved transfer bottlenecks for robot foundation models.
A practical anchor for studying how broader, noisier real-world robot data changes the failure profile of foundation policies.
A concrete benchmark anchor for asking whether a generalist model retains competence across held-out task compositions.