"A giant policy is still a small result if you only measured the easiest slice."
An Evaluation Skeptic
Large behavior models promise broader coverage across tasks and embodiments, but their evidence can be surprisingly thin if evaluation collapses everything into one headline metric. This section is about the discipline needed to turn "larger model" into a real claim about behavior.
Why Aggregate Metrics Mislead
A foundation policy evaluated over many tasks can improve its mean success rate while regressing on the exact embodiment or perturbation regime that matters to you. This is not a statistical nuisance; it is a systems fact. Different embodiments stress different parts of the stack: action adapters, contact modeling, semantic grounding, or control latency.
The right evaluation question is therefore not "what is the global success rate?" but "what changed on one matched scenario panel, and how do the slices distribute across robots, tasks, and perturbation types?"
All compared numbers should come from one fixed scenario panel, one metric script, and one artifact bundle. Slicing happens after measurement, not by mixing different evaluation runs into one story.
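One way to enforce the matched-panel rule is to check, before any comparison, that both runs cover exactly the same scenario instances. The sketch below assumes each result row carries `scenario_id` and `seed` fields; those names and the `check_matched` helper are hypothetical and not taken from any particular harness.

```python
# Minimal sketch: refuse to compare two runs unless they share one panel.
# The `scenario_id` and `seed` field names are assumptions for illustration.

def panel_key(row):
    """Identify one scenario instance by its id and seed."""
    return (row["scenario_id"], row["seed"])


def check_matched(baseline_rows, candidate_rows):
    """Return the panel size, or raise ValueError if the panels differ."""
    base_keys = {panel_key(r) for r in baseline_rows}
    cand_keys = {panel_key(r) for r in candidate_rows}
    # A set hides repeats, so compare sizes to catch duplicated rows.
    if len(base_keys) != len(baseline_rows) or len(cand_keys) != len(candidate_rows):
        raise ValueError("duplicate (scenario_id, seed) rows in a run")
    if base_keys != cand_keys:
        only_base = sorted(base_keys - cand_keys)
        only_cand = sorted(cand_keys - base_keys)
        raise ValueError(
            f"panels differ: baseline-only={only_base}, candidate-only={only_cand}"
        )
    return len(base_keys)


baseline = [
    {"scenario_id": "drawer-01", "seed": 0, "success": 0},
    {"scenario_id": "drawer-02", "seed": 0, "success": 1},
]
candidate = [
    {"scenario_id": "drawer-01", "seed": 0, "success": 1},
    {"scenario_id": "drawer-02", "seed": 0, "success": 1},
]
n_matched = check_matched(baseline, candidate)
print(f"matched scenarios: {n_matched}")
```

Failing loudly here is a deliberate choice: a silent inner join over the shared scenarios would quietly change the panel and make the two numbers incomparable.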
Matched Evaluation In Symbols
A simple notation for slice-aware evaluation is
$$J(\pi, S)=\sum_{s \in S} w_s \; \mathbb{E}_{e \sim \mathcal{E}_s}[m(\pi,e)], \qquad \Delta_s = J(\pi_1, S_s)-J(\pi_0, S_s),$$
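where $S$ is a fixed set of slices, such as embodiment, task family, and perturbation category; $w_s$ is the weight on slice $s$; $\mathcal{E}_s$ is the set of evaluation episodes in that slice; $m(\pi,e)$ is the per-episode metric; and $S_s$ denotes the panel restricted to slice $s$, so that $\Delta_s$ is the change from the baseline $\pi_0$ to the candidate $\pi_1$ on that slice alone. The main scalar metric $J$ is useful only if the per-slice deltas $\Delta_s$ remain visible. Otherwise the evaluation is blind to where the model actually got better or worse.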
Code Fragment 1 computes this kind of slice table from one shared result panel.
```python
# Compute matched slice metrics from one evaluation panel.
results = [
    {"embodiment": "arm", "task": "drawer", "success": 1},
    {"embodiment": "arm", "task": "drawer", "success": 0},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
]

# Group episode outcomes by (embodiment, task) slice.
groups = {}
for row in results:
    key = (row["embodiment"], row["task"])
    groups.setdefault(key, []).append(row["success"])

# Report the mean success rate of each slice.
for key, values in groups.items():
    mean_success = sum(values) / len(values)
    print(f"{key}: success={mean_success:.2f}")
```

```
('arm', 'drawer'): success=0.50
('mobile', 'drawer'): success=1.00
```

The expected output is a slice table that exposes embodiment-specific performance rather than burying it in one aggregate score. A reader should immediately see that the overall mean would overstate capability because the arm embodiment is still failing half the time on the same task family.
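The same grouping extends to the per-slice deltas $\Delta_s$ from the notation earlier in this section. The following is a minimal sketch under stated assumptions: the candidate rows, the uniform weights $w_s$, and the helper names `slice_means` and `aggregate` are all invented for illustration.

```python
# Minimal sketch: per-slice deltas and a weighted aggregate on one panel.
# The outcomes and the uniform weights are illustrative assumptions.
from collections import defaultdict


def slice_means(rows):
    """Mean success per (embodiment, task) slice."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["embodiment"], row["task"])].append(row["success"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def aggregate(means, weights):
    """Weighted sum over slices, playing the role of J."""
    return sum(weights[key] * means[key] for key in means)


pi0_rows = [  # baseline policy
    {"embodiment": "arm", "task": "drawer", "success": 1},
    {"embodiment": "arm", "task": "drawer", "success": 0},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
]
pi1_rows = [  # candidate policy on the same scenarios
    {"embodiment": "arm", "task": "drawer", "success": 1},
    {"embodiment": "arm", "task": "drawer", "success": 1},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
    {"embodiment": "mobile", "task": "drawer", "success": 0},
]

m0, m1 = slice_means(pi0_rows), slice_means(pi1_rows)
weights = {key: 1 / len(m0) for key in m0}  # uniform w_s
deltas = {key: m1[key] - m0[key] for key in m0}

print(f"J(pi0)={aggregate(m0, weights):.2f}  J(pi1)={aggregate(m1, weights):.2f}")
for key, delta in deltas.items():
    print(f"{key}: delta={delta:+.2f}")
```

With these made-up rows the aggregate is identical for both policies while the arm slice improves and the mobile slice regresses, which is exactly the situation a single scalar cannot show.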
The grouping code is short because the panel is tiny. In a real evaluation harness, whether the inputs are LeRobot reports, LIBERO task panels, DROID-style replay logs, or your own benchmark runner, each run should save one artifact bundle with video, metrics, prompts, seeds, and slice labels, so you can regenerate the same table without re-running ad hoc notebook cells.
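As a rough illustration of such a bundle, the sketch below writes raw result rows plus a small manifest that pins the results file and the metric script by hash. The file names, manifest fields, and the `write_bundle` helper are assumptions made for this example, not the format of any existing tool; video and prompts are omitted and would be further entries in the same manifest.

```python
# Minimal sketch: save raw rows and a manifest so the slice table can be
# regenerated later. File names and manifest fields are assumptions.
import hashlib
import json
import tempfile
from pathlib import Path


def write_bundle(out_dir, rows, metric_script_text, seeds):
    """Write results.jsonl and manifest.json, returning the manifest dict."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results_path = out_dir / "results.jsonl"
    with results_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, sort_keys=True) + "\n")
    manifest = {
        "results_file": results_path.name,
        "results_sha256": hashlib.sha256(results_path.read_bytes()).hexdigest(),
        "metric_script_sha256": hashlib.sha256(
            metric_script_text.encode("utf-8")
        ).hexdigest(),
        "seeds": list(seeds),
        "slice_keys": ["embodiment", "task"],
        "n_rows": len(rows),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


rows = [
    {"embodiment": "arm", "task": "drawer", "seed": 0, "success": 1},
    {"embodiment": "mobile", "task": "drawer", "seed": 0, "success": 1},
]
with tempfile.TemporaryDirectory() as tmp:
    manifest = write_bundle(
        tmp, rows, "def metric(row): return row['success']", seeds=[0]
    )
    reloaded = json.loads((Path(tmp) / "manifest.json").read_text(encoding="utf-8"))
    print(reloaded["n_rows"], reloaded["slice_keys"])
```

Hashing the metric script alongside the results is one way to notice when two tables that look comparable were in fact scored by different code.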
What To Slice By
| Slice | Why it matters | Typical hidden failure |
|---|---|---|
| Embodiment | Separates transfer quality from architecture hype. | The largest average gains come from the easiest robot. |
| Task family | Different tasks stress different interface layers. | Pick-and-place improves while contact-rich insertion regresses. |
| Perturbation type | Shows whether robustness is semantic, visual, or dynamical. | A model handles paraphrased instructions but fails shifted lighting. |
| Intervention cost | Measures operator burden, not just headline success. | Success rises only because humans rescue more runs. |
| Latency band | Exposes whether improvements survive runtime constraints. | A stronger policy fails once inference is bounded to deployment speed. |
If a larger model only helps on well-lit tabletop tasks but hurts on mobile tasks with delayed sensing, the mean score may still rise. That is not a contradiction. It is the reason slice-aware reporting exists.
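A small arithmetic example makes this concrete. The episode counts below are invented purely for illustration, and the slice names are hypothetical labels.

```python
# Minimal sketch with made-up counts: the pooled mean rises while one slice regresses.
panel = {
    # slice name: (episodes, baseline successes, larger-model successes)
    "tabletop_well_lit": (80, 48, 68),
    "mobile_delayed_sensing": (20, 12, 8),
}

total = sum(n for n, _, _ in panel.values())
base_mean = sum(b for _, b, _ in panel.values()) / total
large_mean = sum(c for _, _, c in panel.values()) / total
print(f"pooled: baseline={base_mean:.2f} larger={large_mean:.2f}")

for name, (n, b, c) in panel.items():
    print(f"{name}: baseline={b / n:.2f} larger={c / n:.2f} delta={(c - b) / n:+.2f}")
```

Because the tabletop slice holds four times as many episodes, its gain dominates the pooled mean even though the mobile slice loses ground.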
A lab evaluating an adapted VLA on LIBERO-style tasks and a real mobile manipulator should not merge those outcomes into one undifferentiated bar chart. The real question is whether the transfer story holds in both regimes and whether the runtime budget changes the answer.
A giant average score is a trench coat. Make it open the coat and show you the slice labels.
Name the three slices you would demand before believing a claim that one robot foundation model "outperforms baselines." If latency or intervention count is absent, what deployment fact might still be hidden?
Recent work on simulation-backed evaluation, paraphrase robustness, and paired real-to-sim panels is trying to make generalist robot-policy evaluation cheaper and more honest. The open problem is correlation: which scalable benchmark slices best predict what happens on real hardware under deployment constraints?
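One simple way to frame the correlation question is to ask, for each scalable slice, how closely its scores track real-hardware success across a set of policies. The sketch below uses a plain Pearson correlation on invented placeholder numbers; the slice names are hypothetical, and a handful of policies is far too few to draw conclusions from in practice.

```python
# Minimal sketch: rank simulated slices by how well they track real success.
# All numbers are invented placeholders, one entry per policy checkpoint.
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


real_success = [0.30, 0.45, 0.60, 0.70]
sim_slices = {
    "sim_paraphrase": [0.50, 0.55, 0.52, 0.58],
    "sim_lighting_shift": [0.20, 0.40, 0.55, 0.75],
}

scores = {name: pearson(vals, real_success) for name, vals in sim_slices.items()}
best = max(scores, key=scores.get)
for name, r in scores.items():
    print(f"{name}: r={r:.2f}")
print(f"most predictive slice in this toy data: {best}")
```

A linear correlation is only one possible criterion; a rank-based measure may matter more when the goal is to order policies correctly rather than to predict absolute success rates.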
Large behavior models deserve large evaluation discipline. The correct deliverable is not one global metric; it is one matched scenario panel with transparent slices that reveal where scale truly helped.
Design an evaluation panel for a cross-embodiment robot policy with at least four slices, one aggregate score, and one rule for handling intervention-assisted successes. Explain which deployment mistake your panel is trying to prevent.
What's Next?
Section 35.5 moves from evaluation back to adaptation and asks how a supposedly general policy should be prompted, conditioned, and locally retuned when you meet a new robot.