Section 35.4: Large behavior models and rigorous evaluation

"A giant policy is still a small result if you only measured the easiest slice."

An Evaluation Skeptic
A leaderboard wall with one large average score peeling back to reveal smaller panels for different robots, tasks, and perturbations.
Figure 35.4A: Large behavior models need sliced evaluation, because one glossy average can hide which robot, task family, or perturbation actually improved.
Big Picture

Large behavior models promise broader coverage across tasks and embodiments, but their evidence can be surprisingly thin if evaluation collapses everything into one headline metric. This section is about the discipline needed to turn "larger model" into a real claim about behavior.

Why Aggregate Metrics Mislead

A foundation policy evaluated over many tasks can improve its mean success rate while regressing on the exact embodiment or perturbation regime that matters to you. This is not a statistical nuisance; it is a systems fact. Different embodiments stress different parts of the stack: action adapters, contact modeling, semantic grounding, or control latency.

The right evaluation question is therefore not "what is the global success rate?" but "what changed on one matched scenario panel, and how do the slices distribute across robots, tasks, and perturbation types?"

One Panel, Many Views

All compared numbers should come from one fixed scenario panel, one metric script, and one artifact bundle. Slicing happens after measurement, not by mixing different evaluation runs into one story.
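
One way to enforce the single-panel rule in code is to pair episodes by scenario identifier and refuse to compare runs whose panels differ. The sketch below assumes each episode record carries a `scenario_id` field; that field name, the `index_by_scenario` and `matched_pairs` helpers, and the toy records are illustrative assumptions rather than part of any particular harness.

```python
# Minimal sketch: pair two runs on one fixed panel before comparing them.
# The "scenario_id" field and both helpers are hypothetical names.

def index_by_scenario(run):
    """Index one run's episodes by scenario id, rejecting duplicate ids."""
    indexed = {}
    for row in run:
        if row["scenario_id"] in indexed:
            raise ValueError(f"duplicate scenario id: {row['scenario_id']}")
        indexed[row["scenario_id"]] = row
    return indexed


def matched_pairs(run_a, run_b):
    """Return (a, b) episode pairs; fail if the runs cover different panels."""
    a = index_by_scenario(run_a)
    b = index_by_scenario(run_b)
    unmatched = sorted(a.keys() ^ b.keys())
    if unmatched:
        raise ValueError(f"runs do not share one panel; unmatched ids: {unmatched}")
    return [(a[sid], b[sid]) for sid in sorted(a)]


# Toy records, made up for illustration.
baseline = [
    {"scenario_id": "arm-drawer-000", "success": 0},
    {"scenario_id": "arm-drawer-001", "success": 1},
]
candidate = [
    {"scenario_id": "arm-drawer-001", "success": 1},
    {"scenario_id": "arm-drawer-000", "success": 1},
]

pairs = matched_pairs(baseline, candidate)
wins = sum(1 for old, new in pairs if new["success"] > old["success"])
losses = sum(1 for old, new in pairs if new["success"] < old["success"])
print(f"matched episodes={len(pairs)} wins={wins} losses={losses}")
```

Failing loudly on an unmatched scenario is a deliberate choice here: a comparison that silently drops episodes is no longer a comparison on one panel.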

Matched Evaluation In Symbols

A simple notation for slice-aware evaluation is

$$J(\pi, S)=\sum_{s \in S} w_s \; \mathbb{E}_{e \sim \mathcal{E}_s}[m(\pi,e)], \qquad \Delta_s = J(\pi_1, S_s)-J(\pi_0, S_s),$$

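where $S$ is a fixed set of slices, such as embodiment, task family, and perturbation category, $w_s$ is the weight assigned to slice $s$, $\mathcal{E}_s$ is the distribution of evaluation episodes in that slice, and $m(\pi,e)$ is the metric (for example, binary success) that policy $\pi$ earns on episode $e$. Here $\pi_0$ is the baseline policy, $\pi_1$ is the candidate, and $S_s$ denotes the panel restricted to slice $s$, so $\Delta_s$ is the per-slice change. The main scalar metric $J$ is useful only if the per-slice deltas $\Delta_s$ remain visible. Otherwise the evaluation is blind to where the model actually got better.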
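
The notation can be made concrete with a short sketch that estimates each per-slice expectation by a sample mean, forms $J$ as the weighted sum, and reports $\Delta_s$ per slice. The equal weights, the function names, and the toy episode outcomes are assumptions for illustration. The sketch also assumes that $\Delta_s$ is the difference of unweighted per-slice means, that is, the restricted panel $S_s$ is treated as a single slice with weight 1.

```python
from collections import defaultdict


def slice_means(rows):
    """Estimate E[m(pi, e)] for each slice as the mean metric over its episodes."""
    buckets = defaultdict(list)
    for slice_name, metric in rows:
        buckets[slice_name].append(metric)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


def aggregate(means, weights):
    """J(pi, S): weighted sum of per-slice means."""
    return sum(weights[name] * means[name] for name in weights)


# Assumed equal slice weights and made-up binary success outcomes.
weights = {"arm": 0.5, "mobile": 0.5}
baseline_rows = [("arm", 1), ("arm", 0), ("arm", 1), ("arm", 0),
                 ("mobile", 1), ("mobile", 1), ("mobile", 1), ("mobile", 0)]
candidate_rows = [("arm", 1), ("arm", 1), ("arm", 1), ("arm", 1),
                  ("mobile", 1), ("mobile", 0), ("mobile", 1), ("mobile", 0)]

means_0 = slice_means(baseline_rows)
means_1 = slice_means(candidate_rows)
j_0 = aggregate(means_0, weights)
j_1 = aggregate(means_1, weights)
deltas = {name: means_1[name] - means_0[name] for name in weights}

print(f"J(pi_0)={j_0:.3f} J(pi_1)={j_1:.3f}")
for name, delta in deltas.items():
    print(f"{name}: delta={delta:+.3f}")
# J(pi_0)=0.625 J(pi_1)=0.750
# arm: delta=+0.500
# mobile: delta=-0.250
```

In this toy panel the aggregate rises from 0.625 to 0.750 even though the mobile slice drops by 0.25, which is exactly the case the per-slice deltas exist to expose.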

Code Fragment 1 computes this kind of slice table from one shared result panel.

# Compute matched slice metrics from one evaluation panel.
results = [
    {"embodiment": "arm", "task": "drawer", "success": 1},
    {"embodiment": "arm", "task": "drawer", "success": 0},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
    {"embodiment": "mobile", "task": "drawer", "success": 1},
]

groups = {}
for row in results:
    key = (row["embodiment"], row["task"])
    groups.setdefault(key, []).append(row["success"])

for key, values in groups.items():
    mean_success = sum(values) / len(values)
    print(f"{key}: success={mean_success:.2f}")

Output:
('arm', 'drawer'): success=0.50
('mobile', 'drawer'): success=1.00

The output is a slice table that exposes embodiment-specific performance rather than burying it in one aggregate score. With only two episodes per slice the numbers are illustrative, but the pattern is already visible: the overall mean would overstate capability because the arm embodiment still fails half the time on the same task family.

Code Fragment 1: The aggregate score for this panel would be 0.75, but the slice table shows a much more important fact: the arm embodiment is still failing half the time on the same task. This is why slice visibility is non-negotiable.
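
Slices this small also carry large uncertainty, so a slice table reads more honestly with an interval next to each rate. The sketch below uses the Wilson score interval for a binomial proportion, which is one common choice and not something this section prescribes. The `wilson_interval` helper name and the 95% level (`z=1.96`) are assumptions, and the counts are taken from the toy panel in Code Fragment 1.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# (successes, trials) per slice, matching the toy panel above.
slice_counts = {
    ("arm", "drawer"): (1, 2),
    ("mobile", "drawer"): (2, 2),
}

intervals = {}
for key, (k, n) in slice_counts.items():
    low, high = wilson_interval(k, n)
    intervals[key] = (low, high)
    print(f"{key}: success={k / n:.2f} interval=[{low:.2f}, {high:.2f}]")
```

With two episodes per slice, the arm interval spans roughly 0.09 to 0.91, so the table shows where to look but cannot yet support a confident claim about either embodiment.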
Library Shortcut

The grouping code is short because the panel is tiny. In a real evaluation harness, whether the results come from LeRobot reports, LIBERO task panels, DROID-style replay logs, or your own benchmark runner, save one artifact bundle with video, metrics, prompts, seeds, and slice labels, so you can regenerate the same table without re-running ad hoc notebook cells.
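
A minimal sketch of such a bundle, using only the Python standard library, is shown here. The JSON layout, every field name (`panel_id`, `metric_script`, `scenario_id`, `video`), and the file paths are hypothetical; none of the tools named above is assumed to emit this format.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical bundle layout: one file holds everything needed to rebuild the table.
bundle = {
    "panel_id": "demo-panel-v0",
    "metric_script": "success_rate_v1",
    "episodes": [
        {"scenario_id": "arm-drawer-000", "seed": 0, "prompt": "open the drawer",
         "embodiment": "arm", "task": "drawer", "success": 1,
         "video": "videos/arm-drawer-000.mp4"},
        {"scenario_id": "arm-drawer-001", "seed": 1, "prompt": "open the drawer",
         "embodiment": "arm", "task": "drawer", "success": 0,
         "video": "videos/arm-drawer-001.mp4"},
        {"scenario_id": "mobile-drawer-000", "seed": 0, "prompt": "open the drawer",
         "embodiment": "mobile", "task": "drawer", "success": 1,
         "video": "videos/mobile-drawer-000.mp4"},
        {"scenario_id": "mobile-drawer-001", "seed": 1, "prompt": "open the drawer",
         "embodiment": "mobile", "task": "drawer", "success": 1,
         "video": "videos/mobile-drawer-001.mp4"},
    ],
}


def slice_table(episodes, keys=("embodiment", "task")):
    """Mean success per slice, where a slice is the tuple of the chosen labels."""
    groups = defaultdict(list)
    for row in episodes:
        groups[tuple(row[k] for k in keys)].append(row["success"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Write the bundle once, then rebuild the table from the saved file alone.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.json"
    path.write_text(json.dumps(bundle, indent=2))
    reloaded = json.loads(path.read_text())

table = slice_table(reloaded["episodes"])
for key, rate in table.items():
    print(f"{key}: success={rate:.2f}")
```

Because the slice labels travel with each episode, a different view (for example `keys=("embodiment",)`) can be produced later from the same file without touching the robot or the simulator.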

What To Slice By

Minimum Evaluation Slices For Large Behavior Models
| Slice | Why it matters | Typical hidden failure |
| --- | --- | --- |
| Embodiment | Separates transfer quality from architecture hype. | The largest average gains come from the easiest robot. |
| Task family | Different tasks stress different interface layers. | Pick-and-place improves while contact-rich insertion regresses. |
| Perturbation type | Shows whether robustness is semantic, visual, or dynamical. | A model handles paraphrased instructions but fails shifted lighting. |
| Intervention cost | Measures operator burden, not just headline success. | Success rises only because humans rescue more runs. |
| Latency band | Exposes whether improvements survive runtime constraints. | A stronger policy fails once inference is bounded to deployment speed. |

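
The intervention and latency slices are the two that a plain success flag cannot express, so they need explicit rules. The sketch below assumes one possible rule (a run counts as a success only if no human intervened) and one hypothetical deployment budget of 100 ms per inference step. The field names, the budget, and the episode records are made up for illustration.

```python
from collections import defaultdict

LATENCY_BUDGET_MS = 100  # hypothetical deployment budget

# Made-up episode records: success flag, operator interventions, inference latency.
episodes = [
    {"success": 1, "interventions": 0, "latency_ms": 40},
    {"success": 1, "interventions": 2, "latency_ms": 60},
    {"success": 1, "interventions": 1, "latency_ms": 150},
    {"success": 0, "interventions": 0, "latency_ms": 180},
]


def unassisted_success(row):
    """Assumed rule: a rescued run does not count as a success."""
    return int(row["success"] == 1 and row["interventions"] == 0)


def latency_band(row, budget_ms=LATENCY_BUDGET_MS):
    """Label an episode by whether inference stayed within the budget."""
    return "within_budget" if row["latency_ms"] <= budget_ms else "over_budget"


raw_rate = sum(row["success"] for row in episodes) / len(episodes)
strict_rate = sum(unassisted_success(row) for row in episodes) / len(episodes)

by_band = defaultdict(list)
for row in episodes:
    by_band[latency_band(row)].append(unassisted_success(row))
band_rates = {band: sum(vals) / len(vals) for band, vals in by_band.items()}

print(f"raw success={raw_rate:.2f} unassisted success={strict_rate:.2f}")
for band, rate in band_rates.items():
    print(f"{band}: unassisted success={rate:.2f}")
# raw success=0.75 unassisted success=0.25
# within_budget: unassisted success=0.50
# over_budget: unassisted success=0.00
```

Other rules are defensible, such as reporting interventions per episode as a separate column; what matters is that the rule is fixed before the run and stored with the results.
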
The Mean Can Lie Politely

If a larger model only helps on well-lit tabletop tasks but hurts on mobile tasks with delayed sensing, the mean score may still rise. That is not a contradiction. It is the reason slice-aware reporting exists.
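
The scenario above is easy to reproduce with made-up counts, and the same sketch shows a simple guard: flag any slice whose success rate dropped, regardless of what the pooled mean did. The slice names, counts, `tolerance` parameter, and helper names are illustrative assumptions.

```python
# (successes, trials) per slice; all counts are made up for illustration.
baseline = {"tabletop_well_lit": (60, 100), "mobile_delayed_sensing": (12, 20)}
candidate = {"tabletop_well_lit": (85, 100), "mobile_delayed_sensing": (8, 20)}


def pooled_rate(counts):
    """Success rate over all episodes, ignoring slice boundaries."""
    successes = sum(k for k, _ in counts.values())
    trials = sum(n for _, n in counts.values())
    return successes / trials


def regressed_slices(old, new, tolerance=0.0):
    """Return slices whose success rate fell by more than `tolerance`."""
    flagged = {}
    for name, (k0, n0) in old.items():
        k1, n1 = new[name]
        delta = k1 / n1 - k0 / n0
        if delta < -tolerance:
            flagged[name] = delta
    return flagged


flagged = regressed_slices(baseline, candidate)
print(f"pooled: {pooled_rate(baseline):.3f} -> {pooled_rate(candidate):.3f}")
for name, delta in flagged.items():
    print(f"regression in {name}: delta={delta:+.2f}")
# pooled: 0.600 -> 0.775
# regression in mobile_delayed_sensing: delta=-0.20
```

The pooled rate improves because the tabletop slice contributes five times as many episodes as the mobile slice, so its gain swamps the mobile regression.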

Practical Example

A lab evaluating an adapted VLA on LIBERO-style tasks and a real mobile manipulator should not merge those outcomes into one undifferentiated bar chart. The real question is whether the transfer story holds in both regimes and whether the runtime budget changes the answer.

Memory Hook

A giant average score is a trench coat. Make it open the coat and show you the slice labels.

Self Check

Name the three slices you would demand before believing a claim that one robot foundation model "outperforms baselines." If latency or intervention count is absent, what deployment fact might still be hidden?

Research Frontier

Recent work on simulation-backed evaluation, paraphrase robustness, and paired real-to-sim panels is trying to make generalist robot-policy evaluation cheaper and more honest. The open problem is correlation: which scalable benchmark slices best predict what happens on real hardware under deployment constraints?

Key Takeaway

Large behavior models deserve large evaluation discipline. The correct deliverable is not one global metric but one matched scenario panel with transparent slices that reveal where scale truly helped.

Exercise 35.4

Design an evaluation panel for a cross-embodiment robot policy with at least four slices, one aggregate score, and one rule for handling intervention-assisted successes. Explain which deployment mistake your panel is trying to prevent.

What's Next?

Section 35.5 moves from evaluation back to adaptation and asks how a supposedly general policy should be prompted, conditioned, and locally retuned when you meet a new robot.

Bibliography and Further Reading
Evaluation Sources

LIBERO benchmark.

A strong reference for multi-task evaluation and why broad behavior must still be tracked by task family.

Benchmark

Li et al. (2024). "Evaluating Real-World Robot Manipulation Policies in Simulation."

Useful for understanding simulation-backed proxies such as SIMPLER and how they relate to real policy evaluation.

Paper

DROID dataset project page.

Relevant because broad in-the-wild data only helps if evaluation can still isolate embodiment and perturbation effects.

Dataset