Section 35.7: Limitations and open questions

"Generalist behavior is easy to announce and hard to certify."

A Safety-Minded Reviewer

Big Picture

Robot foundation models have made real progress, but they still break at the exact places that matter most for deployment: embodiment shifts, contact-rich edge cases, recovery after failure, data provenance, and trustworthy evaluation. This section is about those unresolved edges.

Five Limits That Still Matter

First, data heterogeneity is still under-controlled. We can pool more trajectories than ever, but we still do not have a universal guarantee that one robot's data helps another robot in the ways we think it does. Second, evaluation remains fragile. Many systems are impressive on curated tasks yet weak under perturbation, latency bounds, or changed embodiment assumptions.

Third, safety and recovery remain shallow compared with the ambition of "general-purpose" robotics. A large model may know what to do in-distribution and still have poor abstention behavior when the scene violates its assumptions. Fourth, data rights and provenance are becoming central as community datasets grow. Fifth, the open-versus-closed divide still makes it hard to separate architecture lessons from infrastructure advantages.
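The fragility of evaluation on small curated panels can be made concrete with a confidence interval on the success rate. The sketch below is illustrative only: the panel conditions and counts are hypothetical, and the Wilson score interval is one standard choice among several, not a method prescribed by any of the benchmarks named in this section.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Hypothetical results: (successes, trials) per evaluation condition.
panel = {
    "nominal": (18, 20),
    "perturbed lighting": (13, 20),
    "changed embodiment": (9, 20),
}

for condition, (successes, trials) in panel.items():
    low, high = wilson_interval(successes, trials)
    rate = successes / trials
    print(f"{condition}: {rate:.2f} (95% interval {low:.2f} to {high:.2f})")
```

With only 20 trials per condition the intervals are wide, which is one reason a strong headline number on a curated panel says little about behavior under perturbation or a changed embodiment.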

Open Questions Live At Interfaces

The hardest unresolved problems are rarely inside one neural block. They sit between data contracts, embodiment adapters, control loops, and evaluation pipelines.

A Failure Taxonomy

A useful diagnostic decomposition is

$$R = R_{\text{perception}} + R_{\text{state}} + R_{\text{action}} + R_{\text{control}} + R_{\text{evaluation}},$$

where $R$ is the overall failure rate on one matched scenario panel and each term is the share of that rate attributed primarily to the named layer. The decomposition is not a theorem. It is a discipline for avoiding the lazy conclusion that "the model failed" when the real cause was stale calibration, interface mismatch, or a weak evaluation harness.

Code Fragment 1 computes a tiny version of that taxonomy.

# Count failure causes on one matched scenario panel.
failures = ["perception", "action", "action", "evaluation", "control", "action"]
counts = {}
for name in failures:
    counts[name] = counts.get(name, 0) + 1

# Print the histogram in alphabetical order of label.
for name, value in sorted(counts.items()):
    print(f"{name}: {value}")

Output:
action: 3
control: 1
evaluation: 1
perception: 1

The output is a structured failure histogram in which action-side problems dominate: three of the six failures on this small panel. That is the useful scientific reading, because it tells the team where to spend the next debugging cycle instead of treating all failures as equally mysterious.

Code Fragment 1: The sorted counts show why taxonomies matter. A team that only records "failure" learns almost nothing. A team that records structured failure labels can see where the stack is actually leaking reliability.
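The counts can also be tied back to the decomposition of $R$. The sketch below takes one possible reading, stated here as an assumption: each term is the number of failed episodes attributed to a layer divided by the total number of episodes run, so the terms sum to the overall failure rate. The panel size `episodes_run` is a hypothetical value chosen for illustration.

```python
from collections import Counter

# One primary label per failed episode (same list as Code Fragment 1).
failures = ["perception", "action", "action", "evaluation", "control", "action"]
episodes_run = 40  # hypothetical total number of episodes on the panel

counts = Counter(failures)
layers = ["perception", "state", "action", "control", "evaluation"]

# Per-layer rate: failed episodes attributed to the layer / all episodes run.
rates = {layer: counts.get(layer, 0) / episodes_run for layer in layers}
total_rate = sum(rates.values())

for layer in layers:
    print(f"R_{layer} = {rates[layer]:.3f}")
print(f"R = {total_rate:.3f}")
```

Layers with no recorded failures, such as `state` here, still appear with a rate of zero. Keeping them in the report makes it visible that the layer was considered rather than forgotten.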

Library Shortcut

The counting code is trivial. The hard part is deciding on a stable taxonomy and recording it consistently in the same artifact bundle as video, metrics, prompts, and seeds. That is where maintained evaluation tooling from LeRobot reports, OpenVLA experiments, openpi serving logs, and DROID or LIBERO replay panels pays off.
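One way to keep the taxonomy stable is to validate the label at the moment an episode is recorded. The sketch below is a minimal illustration using only the standard library. The `EpisodeRecord` class, its field names, and the `to_jsonl` helper are hypothetical and do not correspond to any LeRobot, OpenVLA, or openpi API.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Fixed label set, mirroring the layers in the decomposition of R.
FAILURE_LABELS = ("perception", "state", "action", "control", "evaluation")

@dataclass
class EpisodeRecord:
    """Hypothetical per-episode entry in an evaluation artifact bundle."""
    episode_id: str
    seed: int
    prompt: str
    video_path: str
    success: bool
    failure_label: Optional[str] = None  # None only for successful episodes

    def __post_init__(self):
        # Reject labels outside the agreed taxonomy at write time.
        if self.success and self.failure_label is not None:
            raise ValueError("successful episodes must not carry a failure label")
        if not self.success and self.failure_label not in FAILURE_LABELS:
            raise ValueError(f"unknown failure label: {self.failure_label!r}")

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)

records = [
    EpisodeRecord("ep-0001", 7, "deliver the cup", "videos/ep-0001.mp4", True),
    EpisodeRecord("ep-0002", 8, "deliver the cup", "videos/ep-0002.mp4", False, "action"),
]
print(to_jsonl(records))
```

Rejecting unknown labels when the record is created, rather than cleaning them up at analysis time, is the design choice that keeps histograms comparable across runs.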

For builders, the practical anchor set is concrete: use Hugging Face and LeRobot to standardize dataset cards and checkpoint exchange, PyTorch or JAX to test whether a representation change actually affects training dynamics, Weights & Biases or TensorBoard to track failure slices across runs, and DROID or LIBERO to stress the policy under broader variability. These are not interchangeable labels. Each one helps answer a different open question in the table below.

Concrete Tool Anchors For Studying Open Problems

| Tool or benchmark | Question it helps study | Why it belongs in this section |
|---|---|---|
| LeRobot reports | How adaptation and failure taxonomies should be logged | Turns abstract open questions into inspectable artifact design. |
| OpenVLA or openpi experiments | Which representation or adaptation lever actually changes behavior | Open stacks make causal claims more inspectable than vendor demos. |
| DROID replay panels | Which failure types appear under broader real-world variability | Useful for testing whether nominal benchmark wins survive in-the-wild data. |
| LIBERO task suites | Whether broad multitask competence survives held-out task families | A practical benchmark anchor for the limits of generalization claims. |

Open Questions Worth Caring About

Open Questions By Layer

| Layer | Open question | Why it is still hard |
|---|---|---|
| Data | How should heterogeneous robot data be weighted in one mixture? | Helpfulness varies by embodiment, task, and control convention. |
| Representation | What action abstraction transfers best across robots? | Tokens, diffusion, flow, and hierarchical skills each fail differently. |
| Adaptation | How much of a new robot should be solved by metadata versus weight updates? | The cheapest lever changes from case to case. |
| Safety | How should a generalist policy know when not to act? | Abstention in physical systems is not as simple as low confidence in classification. |
| Evaluation | Which scalable benchmarks best predict real-world deployment behavior? | Simulation, synthetic perturbations, and hardware panels still disagree in important ways. |

Do Not Treat Frontier Reports As Settled Science

Vendor demonstrations and official reports are useful signals, especially for architecture ideas, but they do not replace independent panels, open artifacts, and careful failure accounting.

Practical Example

A mobile manipulator that succeeds in nominal object delivery may still fail the real deployment question if it cannot abstain when a hallway is blocked, a camera is occluded, or the grasped object shifts unexpectedly. The limitation is not "more data needed" in the abstract. It is a missing recovery and uncertainty story.
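The three conditions in this example can be written as an explicit gate that runs before the policy acts. This is a deliberately simple sketch under stated assumptions: the `SceneStatus` fields, the `decide` function, and both thresholds are hypothetical, and a real system would need these signals from its own perception and state estimation stack. It is not a substitute for a full recovery and uncertainty story.

```python
from dataclasses import dataclass

@dataclass
class SceneStatus:
    """Hypothetical precondition signals for the delivery task."""
    path_clear: bool             # False when the hallway is blocked
    camera_visible_frac: float   # fraction of the image that is not occluded
    grasp_shift_mm: float        # estimated slip of the grasped object

def decide(status, min_visible=0.7, max_shift_mm=10.0):
    """Return ("act" or "abstain", reasons); thresholds are illustrative."""
    reasons = []
    if not status.path_clear:
        reasons.append("path blocked")
    if status.camera_visible_frac < min_visible:
        reasons.append("camera occluded")
    if status.grasp_shift_mm > max_shift_mm:
        reasons.append("grasp shifted")
    return ("abstain" if reasons else "act"), reasons

print(decide(SceneStatus(True, 0.95, 2.0)))
print(decide(SceneStatus(False, 0.40, 2.0)))
```

Returning the reasons alongside the decision means every abstention can be logged under a structured label instead of disappearing into a generic "failure" count.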

Memory Hook

Calling a robot foundation model "general" before you understand its abstention behavior is like calling a submarine versatile before asking whether it knows when to surface.

Self Check

Which of the five limitation categories in this section would worry you most for a household robot, and which artifact would you demand to inspect before trusting it?

Research Frontier

Current research is pushing on safer embodied reasoning, richer multi-embodiment transfer, scalable real-to-sim evaluation, and better action tokenization. The hardest scientific gap may be confidence calibration for action: how to make a generalist robot know when its current policy prior no longer applies.

Key Takeaway

The frontier is not blocked by one missing giant model. It is blocked by unresolved interfaces among data, embodiment, control, recovery, and evidence.

Exercise 35.7

Write a failure taxonomy for a robot foundation model in your target application area. Include at least one abstention-related failure, one embodiment-mismatch failure, and one evaluation-design failure.

What's Next?

Section 35.8 turns those open questions into a builder workflow for serving, fine-tuning, and evaluating open robot foundation models with explicit evidence cards and deployment constraints.

Bibliography and Further Reading
Open Questions And Evaluation

LeRobot documentation.

Useful for turning failure taxonomies, dataset cards, and replay artifacts into a reproducible evaluation workflow.

Documentation

OpenVLA repository.

An open reference point for studying which failures arise from representation, adaptation, or deployment choices in VLA systems.

Repository

Google DeepMind (2025). "Gemini Robotics: Bringing AI into the Physical World."

Useful both for frontier capabilities and for the safety considerations that accompany direct robot control from multimodal models.

Report

PolaRiS evaluation report.

Relevant for the question of which simulated or scaled panels best predict real-world generalist policy behavior.

Paper

Pertsch et al. (2025). "FAST: Efficient Action Tokenization for Vision-Language-Action Models."

Useful because action representation remains one of the unresolved transfer bottlenecks for robot foundation models.

Paper

DROID dataset project page.

A practical anchor for studying how broader, noisier real-world robot data changes the failure profile of foundation policies.

Dataset

LIBERO benchmark.

A concrete benchmark anchor for asking whether a generalist model retains competence across held-out task compositions.

Benchmark