Section 12.6: Reading a leaderboard without fooling yourself

Big Picture

This section asks what a benchmark number actually measures. A task suite is useful only when its episodes, splits, seeds, metrics, wrappers, and failure labels match the embodied construct the paper claims to evaluate.

When reading a leaderboard, benchmark discipline means freezing the environment API, task panel, success definition, seed policy, and artifact schema before comparing methods.

What This Section Builds

This section makes leaderboard reading operational. It gives you a checklist for deciding whether a result compares methods, or whether it accidentally compares different tasks, splits, seeds, simulators, wrappers, or tuning budgets.

The goal is not cynicism. The goal is disciplined trust: promote only construct-matched metrics that were co-computed in one pass on one configuration, and keep everything else in diagnostics.

Evidence Is The Test

Treat the leaderboard as an instrument: it is interpretable only when the benchmark isolates the capability, fixes the protocol, and records enough context to rerun the evaluation.

Theory

A leaderboard row is a summary of many design choices. The visible number may be success rate, SPL, reward, normalized score, completion rate, or throughput, but the hidden denominator is the episode panel and evaluation protocol. If two rows use different denominators, subtracting them does not estimate a method effect.
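The denominator effect can be made concrete with a small sketch. The episode IDs and outcomes below are hypothetical and chosen only for illustration; the point is that a gap computed over two different episode panels can vanish once both methods are restricted to the episodes they share.

```python
# Hypothetical per-episode outcomes (1 = success). Episode IDs and values are
# illustrative only and do not come from any real benchmark.
method_a = {"ep0": 1, "ep1": 0, "ep2": 1, "ep3": 0, "ep4": 0, "ep5": 1}
method_b = {"ep0": 1, "ep2": 1, "ep5": 1, "ep6": 1}  # a different episode panel


def success_rate(outcomes, episodes=None):
    """Mean success over the given episodes (default: every logged episode)."""
    ids = sorted(outcomes) if episodes is None else sorted(episodes)
    return sum(outcomes[e] for e in ids) / len(ids)


# Naive leaderboard view: each row reports a rate over its own denominator.
naive_gap = success_rate(method_b) - success_rate(method_a)

# Matched view: restrict both methods to the episodes they have in common.
shared = set(method_a) & set(method_b)
matched_gap = success_rate(method_b, shared) - success_rate(method_a, shared)

print(f"naive gap:   {naive_gap:+.2f} (mismatched panels)")
print(f"matched gap: {matched_gap:+.2f} ({len(shared)} shared episodes)")
```

Matching episodes is necessary but not sufficient: the shared-episode gap is a method effect only if the split, seeds, wrappers, simulator, and metric script also agree.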

The practical design rule is simple: table comparisons need one evaluation script, one config, and one artifact that contains all compared rows. Cross-paper numbers can motivate discussion, but they should not be written as direct wins unless the benchmark protocol is demonstrably the same.

Mechanism

The mechanism is a provenance check. For each row, trace the method checkpoint, task panel, split, seed list, wrapper stack, simulator version, metric script, and aggregation rule. A row with missing provenance can be useful context, but it is not a clean comparator.
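One way to sketch this provenance check is a function that lists which required fields a row fails to supply. The field names, the example rows, and the two status labels are assumptions made for illustration, not a schema defined by any leaderboard.

```python
# Hypothetical required-field list, mirroring the provenance items named above.
REQUIRED_FIELDS = (
    "checkpoint", "panel", "split", "seeds", "wrappers",
    "simulator", "metric_script", "aggregation",
)


def missing_provenance(row):
    """Return the required fields that are absent or empty in a row."""
    return [f for f in REQUIRED_FIELDS if row.get(f) in (None, "", (), [])]


# Illustrative rows; all values are made up.
rows = [
    {
        "method": "candidate", "checkpoint": "ckpt_a", "panel": "MT50",
        "split": "official", "seeds": (0, 1, 2), "wrappers": ("time_limit",),
        "simulator": "sim-build-1", "metric_script": "eval.py",
        "aggregation": "mean_over_seeds",
    },
    {
        "method": "external", "panel": "MT50", "split": "official",
        "metric_script": "", "aggregation": "mean_over_seeds",
    },
]

status = {}
for row in rows:
    gaps = missing_provenance(row)
    # A row with any gap stays usable as context but not as a comparator.
    status[row["method"]] = "context only" if gaps else "clean comparator"
    print(row["method"], status[row["method"]], gaps)
```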

Worked Example

Code Fragment 1 checks whether rows can share a paper table. Rows land in the same group only when their panel, split, seed set, and metric all match, so a row with a different split is kept out of the direct comparison even if its name comes from the same benchmark family.

# Group leaderboard rows by the protocol fields that make them comparable.
# A method win is paper-facing only when all compared rows share the key.
# This minimal key covers panel, split, seed set, and metric; a full key
# would also include the wrapper stack and simulator build.
rows = [
    {"method": "baseline", "panel": "MT50", "split": "official", "seeds": (0, 1, 2), "metric": "success"},
    {"method": "candidate", "panel": "MT50", "split": "official", "seeds": (0, 1, 2), "metric": "success"},
    {"method": "ablation", "panel": "MT50", "split": "tuned_validation", "seeds": (0, 1, 2), "metric": "success"},
]

groups = {}
for row in rows:
    key = (row["panel"], row["split"], row["seeds"], row["metric"])
    groups.setdefault(key, []).append(row["method"])

print(groups)

Output:
{('MT50', 'official', (0, 1, 2), 'success'): ['baseline', 'candidate'], ('MT50', 'tuned_validation', (0, 1, 2), 'success'): ['ablation']}

Code Fragment 1: The grouping key separates comparable rows from diagnostic rows. Here baseline and candidate can be compared directly, while ablation stays separate because its split is tuned_validation.

Library Shortcut

Leaderboard tooling can automate this provenance check, but the rule is conceptual. A method claim needs a same-config comparison. A survey claim can cite cross-paper rows, but it should label them as context when their protocols differ.

Practical Recipe

  1. Identify the construct: manipulation generalization, lifelong transfer, long-horizon household progress, navigation efficiency, social safety, or simulation throughput.
  2. Check whether every compared row shares the same panel, split, seeds, wrappers, simulator build, and metric script.
  3. Prefer seed-level values and confidence intervals over one best-run number.
  4. Audit whether validation data, public test episodes, prompt templates, or generated-scene seeds influenced tuning.
  5. Keep mismatched cross-paper numbers in a diagnostic note rather than a win table.
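Step 3 of the recipe can be sketched with a percentile bootstrap over per-seed values. The per-seed success rates are hypothetical, and the bootstrap is one reasonable choice among several, not a method prescribed by any benchmark. With only a handful of seeds the interval is coarse, which is itself worth reporting.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-seed values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-seed success rates for one method on one configuration.
per_seed = [0.62, 0.58, 0.71, 0.55, 0.66]

mean = statistics.fmean(per_seed)
best = max(per_seed)
lo, hi = bootstrap_ci(per_seed)

print(f"best run: {best:.2f}")
print(f"mean over seeds: {mean:.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

Reporting the mean with its interval, alongside the raw per-seed list, lets a reader see how much of a claimed gap could be seed variance; the best-run number alone hides that.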

Benchmark Evidence Rule

Compare only metrics co-computed in one benchmark pass with the same task panel, wrappers, seed policy, success definition, and logged failure labels.

Common Failure Mode

The common mistake is treating a leaderboard as a table of facts while ignoring protocol drift. One row may use a held-out test split, another may use validation episodes, another may change the action wrapper, and another may tune seeds. The numbers are real, but the direct comparison is not.
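Protocol drift between two rows can be surfaced with a field-by-field diff. The sketch below assumes a hypothetical set of protocol fields and made-up row values; an empty result means the two rows were, as far as the recorded fields show, measured the same way.

```python
# Hypothetical protocol fields; a real harness may record more.
PROTOCOL_FIELDS = ("panel", "split", "seeds", "wrappers", "simulator", "metric")


def protocol_drift(row_a, row_b):
    """Map each differing protocol field to its (row_a, row_b) values."""
    return {
        f: (row_a.get(f), row_b.get(f))
        for f in PROTOCOL_FIELDS
        if row_a.get(f) != row_b.get(f)
    }


# Illustrative rows; values are made up.
ours = {
    "method": "candidate", "panel": "MT50", "split": "test",
    "seeds": (0, 1, 2), "wrappers": ("time_limit",),
    "simulator": "sim-build-1", "metric": "success",
}
theirs = {
    "method": "published", "panel": "MT50", "split": "validation",
    "seeds": (0, 1, 2), "wrappers": ("time_limit", "action_repeat"),
    "simulator": "sim-build-1", "metric": "success",
}

drift = protocol_drift(ours, theirs)
for field, (a, b) in drift.items():
    print(f"{field}: {a!r} vs {b!r}")
print("directly comparable" if not drift else "diagnostic only")
```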

Practical Example

A robotics team reading a leaderboard should reconstruct the evaluation manifest for each row: checkpoint, panel, split, seeds, wrappers, simulator, metric script, aggregation rule, and tuning access. If the manifest cannot be reconstructed, the row can inform scouting but should not anchor a paper claim.
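A reconstructed manifest can be reduced to a short fingerprint so that matching protocols are easy to spot. This is a minimal sketch: the key names and manifest values are hypothetical, and the checkpoint is deliberately left out of the fingerprint because it identifies the method rather than the protocol.

```python
import hashlib
import json

# Hypothetical protocol keys; the checkpoint is excluded on purpose.
PROTOCOL_KEYS = (
    "panel", "split", "seeds", "wrappers", "simulator",
    "metric_script", "aggregation", "tuning_access",
)


def protocol_fingerprint(manifest):
    """Hash the protocol part of a manifest.

    Raises KeyError when a protocol field was never reconstructed, which
    marks the row as scouting evidence rather than a comparator.
    """
    protocol = {k: manifest[k] for k in PROTOCOL_KEYS}
    canonical = json.dumps(protocol, sort_keys=True)  # stable key order
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Illustrative manifests; all values are made up.
base = {
    "checkpoint": "ckpt_a", "panel": "panel_v1", "split": "test",
    "seeds": [0, 1, 2], "wrappers": ["time_limit"],
    "simulator": "sim-build-1", "metric_script": "eval.py",
    "aggregation": "mean_over_seeds", "tuning_access": "validation_only",
}
other_method = dict(base, checkpoint="ckpt_b")          # same protocol
drifted = dict(base, simulator="sim-build-2")           # simulator changed
incomplete = {"checkpoint": "ckpt_c", "panel": "panel_v1"}

print(protocol_fingerprint(base) == protocol_fingerprint(other_method))
print(protocol_fingerprint(base) == protocol_fingerprint(drifted))
try:
    protocol_fingerprint(incomplete)
except KeyError as err:
    print("cannot anchor a claim, missing field:", err)
```

Equal fingerprints only show that the recorded fields agree; they do not verify that the recorded fields are accurate.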

Memory Hook

The safest leaderboard question is not "who is first?" It is "which rows were measured by the same ruler?"

Research Frontier

The frontier is reproducible leaderboards that attach executable configs, per-seed traces, and failure labels to every row. As embodied benchmarks grow through generated scenes, large demonstration corpora, GPU-parallel simulators, and language-conditioned tasks, provenance becomes as important as the score.

Self Check

Can you name the task panel, split, seed list, wrapper stack, simulator build, metric script, aggregation rule, tuning access, and failure taxonomy for each row? If not, the comparison is still too vague.

Reading a leaderboard well means converting a ranked list into an evidence map. A ManiSkill3 speed result, a Meta-World success result, a Habitat SPL result, and an Isaac Lab throughput result can all be valuable, but they answer different questions. The audit is to keep each number attached to the question it actually answers.

The graduate-level habit is to distinguish result validity from comparison validity. A row can be valid for its own protocol and still invalid as a direct comparison to another row. A paper-facing claim needs both: the row must be internally reproducible, and the compared rows must be co-computed on one configuration.

Leaderboard Reading Checklist

| Question | Why it matters | Decision |
| --- | --- | --- |
| Same task panel? | Different episodes or generated scenes change the denominator. | If no, compare qualitatively only. |
| Same split and tuning access? | Validation tuning and test evaluation are different claims. | If no, keep rows separate. |
| Same seeds and aggregation? | Best-run reporting can hide variance and seed sensitivity. | If no, rerun or report uncertainty. |
| Same wrappers and simulator build? | Observation, action, termination, and physics changes alter the task. | If no, call it a harness change. |
| Same metric script? | Success, SPL, progress, reward, and throughput answer different questions. | If no, do not subtract the numbers. |

A robust leaderboard entry starts as a machine-readable row with provenance fields. Store the model checkpoint, benchmark version, split, seeds, wrappers, simulator, metric script, hardware when throughput is measured, and the raw per-episode outputs. A reviewer should be able to reconstruct the table from that row.

  1. Build a provenance table before copying any result into prose.
  2. Group rows by task panel, split, seeds, wrappers, simulator, and metric.
  3. Promote only groups with two or more methods to direct comparison.
  4. Attach uncertainty or seed-level values to every promoted comparison.
  5. Write mismatched rows as diagnostic context, not as wins.
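The five steps above can be sketched end to end on a small provenance table. The rows, wrapper names, simulator labels, and per-seed values are hypothetical; the sketch only illustrates grouping by the full protocol key, promoting groups with at least two methods, and attaching seed-level spread to each promoted row.

```python
import statistics

# Hypothetical provenance table with per-seed values attached to each row.
table = [
    {"method": "baseline", "panel": "MT50", "split": "official",
     "seeds": (0, 1, 2), "wrappers": ("time_limit",), "simulator": "sim-v1",
     "metric": "success", "per_seed": (0.40, 0.44, 0.42)},
    {"method": "candidate", "panel": "MT50", "split": "official",
     "seeds": (0, 1, 2), "wrappers": ("time_limit",), "simulator": "sim-v1",
     "metric": "success", "per_seed": (0.50, 0.47, 0.53)},
    {"method": "external", "panel": "MT50", "split": "official",
     "seeds": (0, 1, 2), "wrappers": ("time_limit", "action_repeat"),
     "simulator": "sim-v1", "metric": "success",
     "per_seed": (0.58, 0.61, 0.60)},
]

KEY_FIELDS = ("panel", "split", "seeds", "wrappers", "simulator", "metric")

# Step 2: group rows by the full protocol key.
groups = {}
for row in table:
    groups.setdefault(tuple(row[f] for f in KEY_FIELDS), []).append(row)

# Step 3: promote only groups that contain two or more distinct methods.
promoted = [g for g in groups.values() if len({r["method"] for r in g}) >= 2]
diagnostic = [r["method"] for g in groups.values() if g not in promoted for r in g]

# Step 4: attach seed-level spread to every promoted row.
for group in promoted:
    for r in group:
        mean = statistics.fmean(r["per_seed"])
        spread = statistics.stdev(r["per_seed"])
        print(f"{r['method']}: {mean:.3f} +/- {spread:.3f} over {len(r['seeds'])} seeds")

# Step 5: everything else is context, not a win.
print("diagnostic only:", diagnostic)
```

In this toy table the external row has the highest numbers but a different wrapper stack, so it never enters the comparison.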

Code Fragment 2 shows a row format that supports direct-comparison filtering. It is deliberately boring, because boring provenance is what keeps benchmark claims honest.

# Store leaderboard provenance with the metric value.
# The comparison key is everything except method and value because
# those are the fields a method is allowed to change.
from dataclasses import dataclass, asdict

@dataclass
class LeaderboardRow:
    method: str
    panel: str
    split: str
    seeds: tuple[int, ...]
    simulator: str
    metric: str
    value: float

    def as_row(self) -> dict[str, object]:
        return asdict(self)

row = LeaderboardRow(
    method="candidate",
    panel="Habitat-SocialNav",
    split="unseen_scenes",
    seeds=(0, 1, 2, 3, 4),
    simulator="Habitat 3.0",
    metric="success_weighted_by_path_length",
    value=0.51,
)
print(row.as_row())

Output:
{'method': 'candidate', 'panel': 'Habitat-SocialNav', 'split': 'unseen_scenes', 'seeds': (0, 1, 2, 3, 4), 'simulator': 'Habitat 3.0', 'metric': 'success_weighted_by_path_length', 'value': 0.51}

Code Fragment 2: The LeaderboardRow keeps value attached to the panel, split, seed list, simulator, and metric. A direct comparison is valid only when another row matches these provenance fields and changes only the method and measured value.

Expected output: the printed row should expose the provenance fields needed to reconstruct the comparison. If a leaderboard omits these fields, treat it as scouting evidence until the configuration can be verified.

When a leaderboard comparison looks surprising, audit the denominator before interpreting the method. Check for hidden split changes, seed tuning, wrapper drift, simulator drift, metric changes, hardware changes for throughput, and selective reporting. Only then decide whether the method result needs a scientific explanation.

Key Takeaway

A leaderboard is useful when it helps you promote same-config, construct-matched comparisons to evidence and keep mismatched rows in diagnostics.

Exercise 12.6.1

Take three published or hypothetical leaderboard rows and write their provenance fields: panel, split, seeds, wrappers, simulator, metric, and tuning access. Mark which rows can be directly compared and which must stay diagnostic.

Bibliography and Further Reading
Tools And Libraries

ManiSkill Contributors. "ManiSkill Documentation."

ManiSkill provides manipulation tasks, demonstrations, GPU-parallel workflows, and documentation for robot-learning experiments. It is relevant when this section asks how benchmark design turns simulator capability into comparable evidence. As with every source in this list, use it when deciding what is reusable, what is benchmark-specific, and what must be remeasured before a leaderboard row is trusted.

Tool

RoboCasa Team. "RoboCasa Documentation."

RoboCasa documents everyday manipulation tasks and simulation assets, including the 2024 release lineage and later RoboCasa365 expansion. Readers should use it to study how task diversity and environment generation affect benchmark claims.

Tool

Mandlekar, A. et al. "robomimic Documentation."

robomimic provides datasets and algorithms for learning from demonstrations. It matters here because benchmark evaluation often depends as much on dataset format and split discipline as on simulator physics.

Tool

Datasets And Benchmarks

James, S. et al. (2019). "RLBench: The Robot Learning Benchmark and Learning Environment." arXiv.

RLBench frames a large set of vision-guided manipulation tasks with demonstrations and task variation. It is useful for readers studying few-shot, multi-task, and manipulation benchmark design.

Paper

Stanford Vision and Learning Lab. "BEHAVIOR-1K."

BEHAVIOR-1K grounds household embodied AI tasks in human needs and long-horizon mobile manipulation. It gives benchmark designers a concrete example of task suites that go beyond isolated tabletop success rates.

Dataset

What's Next?

Chapter 13 uses benchmark discipline to design domain randomization and synthetic data that support transfer claims.