Section 52.5: Reproducible evaluation: SIMPLER and sim-as-proxy

A benchmark conclusion survives reruns only when the panel, seed policy, exclusion rules, and raw episode artifacts are inspectable.

An Evaluation Methodologist
Big Picture

Simulation is valuable because it is cheap, repeatable, and broad. It is dangerous when teams forget that proxy quality is itself a hypothesis that must be tested against physical evidence.

Figure 52.5.1: A sim-to-real evaluation ladder shows how simulator panels can be useful proxies when their limits are named rather than ignored.

Why This Matters

Reproducible sim-as-proxy evaluation matters because evaluation choices rewrite the scientific claim. If the metric drops time, energy, or safety terms that the deployment team cares about, the benchmark no longer matches the real decision.

Let $S$ be the simulator score and $R$ the real-world score over matched tasks. A practical proxy check combines the fidelity gap $$\Delta = \left|\mathbb{E}[S] - \mathbb{E}[R]\right|$$ with a rank correlation between methods, such as Spearman's $\rho$ or Kendall's $\tau$, computed over the same task panel. A low average gap without rank preservation is still a weak proxy, because the simulator can match the real mean while ordering the methods differently.
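To make the two checks concrete, here is a minimal sketch using only the Python standard library. It treats each method's panel score as one sample of $S$ and $R$, which is a simplifying assumption, and it uses Spearman's rank correlation as one possible choice of rank statistic. The helper names (`average_ranks`, `fidelity_gap`, `spearman_rho`) and the scores are hypothetical, chosen so that the gap is zero while the ranking is fully inverted.

```python
from statistics import mean

def average_ranks(scores):
    """Return {method: rank}, 1 = best; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        # Extend j across a run of tied scores.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def fidelity_gap(sim, real):
    """|E[S] - E[R]| over the matched methods."""
    return abs(mean(sim.values()) - mean(real.values()))

def spearman_rho(sim, real):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rs, rr = average_ranks(sim), average_ranks(real)
    ms, mr = mean(rs.values()), mean(rr.values())
    cov = sum((rs[k] - ms) * (rr[k] - mr) for k in sim)
    var_s = sum((rs[k] - ms) ** 2 for k in sim)
    var_r = sum((rr[k] - mr) ** 2 for k in sim)
    return cov / (var_s * var_r) ** 0.5  # undefined if every method ties

# Hypothetical scores: identical means, opposite ordering.
sim = {"A": 0.80, "B": 0.70, "C": 0.60}
real = {"A": 0.60, "B": 0.70, "C": 0.80}
print(fidelity_gap(sim, real), spearman_rho(sim, real))
```

With these numbers the fidelity gap vanishes and the rank correlation is at its minimum, so a report that quoted only $\Delta$ would describe a proxy that ranks every method backwards as a perfect one.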

Key Insight

A simulator is not validated by looking plausible. It is validated when the same ranking, failure modes, or sensitivity trends survive on the matched real panel often enough to support the intended decision.

Algorithmic View
  1. Define which decision the simulator proxy is supposed to support, such as model ranking, hyperparameter filtering, or failure-mode search.
  2. Construct a matched sim and real panel with aligned task definitions and artifact schema.
  3. Compute mean gaps, ranking stability, and overlap in failure taxonomy.
  4. Document where the simulator is trustworthy and where it is only exploratory.
  5. Re-check the proxy whenever the robot hardware, perception stack, or environment distribution changes materially.
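The steps above can be sketched as a small audit function. This is an illustration under stated assumptions rather than a reference implementation: the function names, the thresholds `min_agreement` and `min_overlap`, and the failure labels are all hypothetical, ranking stability is measured as the fraction of method pairs ordered the same way in both panels, and taxonomy overlap is measured with a Jaccard index.

```python
def pairwise_agreement(sim, real):
    """Fraction of method pairs that sim and real order the same way."""
    keys = sorted(sim)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    same = sum((sim[a] - sim[b]) * (real[a] - real[b]) > 0 for a, b in pairs)
    return same / len(pairs)

def jaccard(a, b):
    """Overlap between two sets of failure labels (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def audit_proxy(decision, sim, real, sim_failures, real_failures,
                min_agreement=0.8, min_overlap=0.5):
    """Steps 1-4: name the decision, compare matched panels, label the proxy."""
    mean_gap = sum(abs(sim[k] - real[k]) for k in sim) / len(sim)
    agreement = pairwise_agreement(sim, real)
    overlap = jaccard(sim_failures, real_failures)
    trusted = agreement >= min_agreement and overlap >= min_overlap
    return {
        "decision": decision,
        "mean_gap": round(mean_gap, 3),
        "pairwise_agreement": agreement,
        "failure_overlap": overlap,
        "status": "trustworthy" if trusted else "exploratory",
    }

report = audit_proxy(
    decision="model ranking",
    sim={"A": 0.84, "B": 0.76, "C": 0.72},
    real={"A": 0.68, "B": 0.70, "C": 0.61},
    sim_failures={"missed_grasp", "slip"},                    # hypothetical labels
    real_failures={"missed_grasp", "slip", "jam_on_insert"},  # hypothetical labels
)
print(report)
```

Step 5 corresponds to rerunning `audit_proxy` on a fresh matched panel whenever hardware, perception, or the environment distribution changes. The thresholds are a policy choice that should be fixed before the results are seen.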

Worked Example

A simulator may preserve policy ranking on tabletop grasping but fail to preserve contact-rich insertion errors because friction and compliance are mis-modeled. That makes it a good filter for broad candidate screening but a weak final judge for insertion policies.

# Scores for three candidate policies on the matched sim and real panels.
sim_scores = {"A": 0.84, "B": 0.76, "C": 0.72}
real_scores = {"A": 0.68, "B": 0.70, "C": 0.61}

# Rank the policies from best to worst under each evaluation.
ranking_sim = sorted(sim_scores, key=sim_scores.get, reverse=True)
ranking_real = sorted(real_scores, key=real_scores.get, reverse=True)

# Mean absolute per-policy gap between simulator and real scores.
gaps = [abs(sim_scores[k] - real_scores[k]) for k in sim_scores]
mean_gap = round(sum(gaps) / len(gaps), 3)
print({"ranking_sim": ranking_sim, "ranking_real": ranking_real, "mean_gap": mean_gap})

Output:
{'ranking_sim': ['A', 'B', 'C'], 'ranking_real': ['B', 'A', 'C'], 'mean_gap': 0.11}
Code Fragment 52.5.1 compares simulator and real rankings directly, making proxy failure visible even when the mean gap looks moderate.

Reading the output: The proxy loses trust here because the best simulator method (A) is not the best real-world method (B). The mean gap alone would miss that decision-level failure.

Library Shortcut

SIMPLER-style infrastructure, benchmark manifests, and replay artifacts help because they enforce matched schemas across simulation and real execution. The library advantage is standardization, not automatic transfer.

SIMPLER-style sim-as-proxy evaluation needs a correlation audit, not just a simulator score. Pandas aligns simulated and real episodes, SciPy estimates rank agreement and its uncertainty, DVC pins both panels, and MLflow or Weights & Biases links each simulated run to the physical policy it claims to predict.
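The uncertainty part of such an audit can be illustrated without any of those libraries. The sketch below is a standard-library stand-in, not the pandas or SciPy workflow itself: it bootstraps episode outcomes and reports how often the simulator and the real panel agree on the top method. The function name `bootstrap_top1_agreement`, the episode counts, and the outcomes are hypothetical.

```python
import random

def success_rate(episodes):
    """Mean of binary episode outcomes (1 = success, 0 = failure)."""
    return sum(episodes) / len(episodes)

def bootstrap_top1_agreement(sim_eps, real_eps, n_boot=2000, seed=0):
    """Fraction of bootstrap resamples in which sim and real pick the same best method."""
    rng = random.Random(seed)  # fixed seed so the audit itself is reproducible
    methods = sorted(sim_eps)
    hits = 0
    for _ in range(n_boot):
        # Resample episodes with replacement, separately per method and per panel.
        sim_rate = {m: success_rate(rng.choices(sim_eps[m], k=len(sim_eps[m])))
                    for m in methods}
        real_rate = {m: success_rate(rng.choices(real_eps[m], k=len(real_eps[m])))
                     for m in methods}
        # Ties resolve to the alphabetically first method in both panels.
        if max(methods, key=sim_rate.get) == max(methods, key=real_rate.get):
            hits += 1
    return hits / n_boot

# Hypothetical panels: 50 simulated and 20 real episodes per method.
sim_eps = {"A": [1] * 42 + [0] * 8, "B": [1] * 38 + [0] * 12}
real_eps = {"A": [1] * 13 + [0] * 7, "B": [1] * 14 + [0] * 6}
print(bootstrap_top1_agreement(sim_eps, real_eps))
```

A value near 0.5 for two methods means the panels are too small, or the methods too close, for the simulator's top-1 choice to carry information about the real ranking.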

The strongest simulator proxy claims are narrow and explicit. A simulator might be trusted for policy ranking within one morphology and camera setup, but not for fleet-wide energy forecasting or human-interaction safety.

The practical deliverable is a proxy-validity table that names the task family, robot configuration, simulator settings, real-world panel, correlation window, and known mismatch. Without that table, a simulator result is evidence for simulation only.
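One way to keep that table machine-checkable is to give each row a fixed schema. The sketch below assumes a hypothetical `ProxyValidityRow` record whose fields mirror the columns named above, and serializes rows to CSV. All cell values are illustrative placeholders, not measured results.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass(frozen=True)
class ProxyValidityRow:
    """One row of a proxy-validity table (hypothetical schema)."""
    task_family: str
    robot_configuration: str
    simulator_settings: str
    real_world_panel: str
    correlation_window: str
    known_mismatch: str

def to_csv(rows):
    """Serialize rows with a header taken from the dataclass field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ProxyValidityRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()

# Illustrative placeholder entries only.
rows = [
    ProxyValidityRow("tabletop grasping", "single arm, fixed camera",
                     "default physics", "small matched real panel",
                     "last audit cycle", "none observed for ranking"),
    ProxyValidityRow("contact-rich insertion", "single arm, fixed camera",
                     "default physics", "small matched real panel",
                     "last audit cycle", "friction and compliance mis-modeled"),
]
table = to_csv(rows)
print(table)
```

Making every field required means a row cannot be filed without naming its known mismatch, even if the entry is "none observed".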

A common failure mode is to treat a proxy as universally valid after one early correlation result. Proxy validity is local to task family, hardware configuration, and decision type.

Cross-References

This section connects backward to Chapter 20 on sim-to-real transfer and forward to Section 52.6 on benchmark hygiene.

Lab Recipe

Take three candidate policies, evaluate them in simulation and on a small real panel, and compute both mean score gap and ranking agreement. Then write one paragraph naming which decision the simulator can support reliably.

Failure Mode

Do not use simulator-only confidence intervals to justify real-world deployment approval. Proxy evidence can prioritize tests, but it cannot replace the tests whose outcome it is only trying to predict.

Practical Example

For autonomous driving, CARLA may be strong for regression testing and scenario replay but incomplete for real sensor contamination or rare human behavior. For manipulation, MuJoCo or Isaac may screen policies well while missing subtle compliance errors.

Research Frontier

The frontier is richer simulator realism plus validity audits that measure which conclusions transfer and which do not.

Self Check

Can you state one decision for which your simulator is trustworthy and one for which it is not? If not, the proxy contract is still too vague.

Key Takeaway

Sim-as-proxy is a scientific claim about decision support. It earns trust through matched panels, explicit fidelity gaps, and repeated checks against real evidence.

Exercise 52.5.1

Choose one simulator you use. Define the decision it is meant to support, the matched real panel needed to test that claim, and the failure signal that would invalidate the proxy.

Fun Note

Calling a simulator a "proxy" without checking whether it preserves rankings is like calling a map "accurate" because it is printed in color. The validation happens on the road, not on the page.

Section References

Official SIMPLER and related benchmark resources.

Use the project artifacts to see how matched simulator and real evaluation can share manifests and replay structure.

Todorov, E., Erez, T., and Tassa, Y. "MuJoCo: A physics engine for model-based control." IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. https://mujoco.org/

A central simulator lineage for embodied control research.

What's Next

Section 52.6 closes the chapter by moving from metric design to benchmark governance, protocol control, and real-world evaluation hygiene.