"The right simulator is the one whose wrongness you can afford, measure, and explain."
A Reality-Gap Auditor
This section converts the chapter into a decision procedure. You do not choose a simulator by brand, speed, or popularity. You choose it by matching task physics, observation needs, throughput, integration, maturity, and evaluation risk.
Choose a simulator by task contract, not reputation. Record the dominant physical risk, required sensor model, throughput target, asset format, and integration boundary, then run the same task panel before comparing tools.
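One way to make the task contract concrete is to write it down as data before opening any tool. The sketch below is illustrative only: the `TaskContract` class, its field names, and the example values are assumptions made for this example, not part of any simulator's API.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical record of what the task demands from any simulator."""
    task: str
    dominant_risk: str          # e.g. "contact fidelity" or "throughput"
    sensor_model: str           # observations the policy actually consumes
    throughput_target_sps: int  # required simulation steps per second (assumed unit)
    asset_format: str           # e.g. "MJCF", "URDF", "USD"
    integration_boundary: str   # where the simulator meets the rest of the stack
    control_hz: int             # controller frequency held fixed across tools


# Example values are placeholders; replace them with your own task.
contract = TaskContract(
    task="tabletop pushing",
    dominant_risk="contact fidelity",
    sensor_model="joint state plus one RGB camera",
    throughput_target_sps=10_000,
    asset_format="URDF",
    integration_boundary="Gymnasium-style step/reset API",
    control_hz=50,
)

# Serialize once so every candidate is evaluated against the same contract.
print(json.dumps(asdict(contract), indent=2))
```

Freezing the dataclass is a deliberate choice here: the contract should not drift while tools are being compared.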
Simulator claims are only meaningful when measured on your robot, your task, your controller frequency, and your observation pipeline. Treat unmatched benchmark numbers as hints, not evidence.
This section applies the dynamics background from Chapter 6: Dynamics and Simulation Math to the simulator stack introduced in Chapter 9: Why Simulation Is Central. It also prepares the GPU training workflows in Chapter 17: Massively Parallel and GPU RL and the randomization workflow in Chapter 13: Domain Randomization and Synthetic Data by tying tool choice to measurable task risk.
The Decision Table
The table below is a starting map, not a leaderboard. Use it to choose which tools deserve a task-level comparison, then measure those tools on the same robot, task, controller frequency, observation pipeline, and validation script.
| Primary need | Start with | Second option | Reason |
|---|---|---|---|
| Small contact-control experiment | MuJoCo | Drake | Fast, readable, strong dynamics loop |
| JAX-native parallel RL | MJX | Brax or MuJoCo | JAX transformations and batched state are central |
| NVIDIA GPU MuJoCo-style throughput | MuJoCo Warp | Newton | Warp backend targets NVIDIA acceleration |
| USD scenes and sensor-rich robot learning | Isaac Lab | Newton or Genesis | Isaac Lab connects physics, rendering, sensors, and learning workflows |
| Emerging Warp and OpenUSD research | Newton | Isaac Lab | Useful when the frontier engine itself is part of the research question |
| Pythonic multi-physics and generated scenes | Genesis | Isaac Lab | Promising for multi-physics, rendering, and generative workflows |
| Model-based control and verification | Drake | MuJoCo | Optimization and analysis are first-class |
| Manipulation benchmark tasks | SAPIEN or ManiSkill | MuJoCo or Isaac Lab | Task suites and manipulation assets matter |
| ROS 2 system integration | Modern Gazebo | Isaac Sim | Middleware, sensors, and controllers dominate the need |
Recency And Deprecation Table
| Old or risky default | Current direction | Action |
|---|---|---|
| OpenAI Gym | Gymnasium | Use Gymnasium APIs for new environment work |
| Isaac Gym Preview, IsaacGymEnvs, OmniIsaacGymEnvs, Orbit | Isaac Lab | Migrate new robot-learning projects to Isaac Lab |
| Gazebo Classic | Modern Gazebo | Do not start new projects on Classic, which reached end of life in January 2025 |
| Unverified vendor simulator claims | Task-level benchmark artifact | Reproduce performance and fidelity on your own task |
| One simulator for every stage | Multi-tool pipeline | Train, benchmark, integrate, and verify with fit-for-purpose tools |
Compare simulators on one task configuration, one metric script, and one artifact. A throughput number from one robot and a fidelity claim from another robot do not form a valid comparison. The comparison artifact should contain every number used in the table, so a reviewer can trace each recommendation back to the same run.
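The same-configuration rule can be checked mechanically before any numbers are compared. The following is a minimal sketch; the row layout, the `config_id` field, and the example values are hypothetical stand-ins for whatever your own runs record.

```python
from collections import Counter

# Hypothetical measurement rows; in practice these come from your own runs.
rows = [
    {"simulator": "MuJoCo", "config_id": "push-v1", "metric": "steps_per_second"},
    {"simulator": "Isaac Lab", "config_id": "push-v1", "metric": "steps_per_second"},
    {"simulator": "Modern Gazebo", "config_id": "push-v2", "metric": "steps_per_second"},
]


def comparable(rows):
    """Return (ok, offenders).

    ok is True only when every row shares one config and one metric.
    offenders lists simulators whose config differs from the most common one.
    """
    configs = Counter(row["config_id"] for row in rows)
    metrics = {row["metric"] for row in rows}
    majority_config, _ = configs.most_common(1)[0]
    offenders = [row["simulator"] for row in rows if row["config_id"] != majority_config]
    return (len(configs) == 1 and len(metrics) == 1), offenders


ok, offenders = comparable(rows)
print(ok, offenders)
```

In this example the Modern Gazebo row was produced under a different configuration, so the check refuses the comparison and names the row that has to be rerun.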
A Runnable Scoring Rubric
Code Fragment 1 turns simulator choice into a co-computed score. The important part is that all tools are scored on the same criteria in one run, which prevents invalid number-by-number comparisons from different configurations.
# Score simulators on one task using one rubric and one config.
# Higher is better, but the weights must match the task.
# This avoids mixing metrics from different experiments.
weights = {
"physics_fit": 0.30,
"throughput": 0.20,
"sensor_stack": 0.15,
"integration": 0.15,
"maturity": 0.15,
"validation_evidence": 0.05,
}
scores = {
"MuJoCo": {
"physics_fit": 5,
"throughput": 3,
"sensor_stack": 2,
"integration": 3,
"maturity": 5,
"validation_evidence": 5,
},
"Isaac Lab": {
"physics_fit": 4,
"throughput": 5,
"sensor_stack": 5,
"integration": 4,
"maturity": 4,
"validation_evidence": 4,
},
"Drake": {
"physics_fit": 4,
"throughput": 2,
"sensor_stack": 2,
"integration": 3,
"maturity": 5,
"validation_evidence": 5,
},
"Modern Gazebo": {
"physics_fit": 3,
"throughput": 2,
"sensor_stack": 4,
"integration": 5,
"maturity": 4,
"validation_evidence": 4,
},
}
ranked = sorted(
((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
key=lambda item: item[1],
reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
Isaac Lab: 4.35
MuJoCo: 3.85
Modern Gazebo: 3.45
Drake: 3.35
The scoring script is about 50 lines of decision support. In a real project, the equivalent is a reproducible experiment harness: one config file, one launcher, one metrics table, one artifact store, and one report. Hydra, Weights & Biases, MLflow, or plain JSON plus CSV can all store the comparison; the principle does not depend on the tool.
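For the plain JSON plus CSV option, a standard-library sketch might look like the following. The `build_artifact` helper and the `control_hz` and `rubric_version` config fields are hypothetical; the ranking values are the ones printed by Code Fragment 1.

```python
import csv
import io
import json

# Config fields other than the task name are assumed for illustration.
config = {"task": "tabletop pushing", "control_hz": 50, "rubric_version": 1}
ranked = [("Isaac Lab", 4.35), ("MuJoCo", 3.85), ("Modern Gazebo", 3.45), ("Drake", 3.35)]


def build_artifact(config, ranked):
    """Return (json_text, csv_text) describing one comparison run."""
    json_text = json.dumps({"config": config, "ranking": ranked}, indent=2, sort_keys=True)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "simulator", "score"])
    for rank, (name, score) in enumerate(ranked, start=1):
        writer.writerow([rank, name, f"{score:.2f}"])
    return json_text, buffer.getvalue()


json_text, csv_text = build_artifact(config, ranked)
print(csv_text)
```

The sketch builds both texts in memory; writing them to files next to the run's logs is the natural next step, so the report can point at a single directory.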
Hands-On Lab: Benchmark A Simulator Choice
Objective
Build a reproducible simulator-selection artifact for a reaching, pushing, locomotion, or ROS 2 integration task.
What You'll Practice
- Writing a task-specific simulator rubric
- Co-computing comparable scores in one pass
- Adding a deprecation and maintenance check
- Recording validation evidence instead of preference scores
- Producing an engineering recommendation with a fallback
Setup
The required version uses only Python's standard library. Optional extensions can call MuJoCo, Isaac Lab, ManiSkill, or Gazebo after those tools are installed.
Steps
Work through the steps in order so the final recommendation has a visible chain from task risk to weights, scores, maintenance status, validation evidence, and falsification test.
Step 1: Define The Task
Choose one task and write the dominant risk: contact fidelity, throughput, visual sensors, model-based control, manipulation benchmark coverage, or ROS 2 integration.
task = "tabletop pushing"
dominant_risk = "contact fidelity"
requirements = {"contact fidelity", "asset import", "batch rollout", "camera rendering"}
score = {"MuJoCo": 3, "Isaac Lab": 4, "Genesis": 3}
choice = max(score, key=score.get)
print({"task": task, "dominant_risk": dominant_risk, "recommended_stack": choice})
The fields task and dominant_risk force the recommendation to start from the embodied problem.
Hint
If the task fails when objects slip incorrectly, choose contact fidelity. If it fails because training is too slow, choose throughput.
Step 2: Set Weights
Choose weights that match the task. The weights should sum to 1.0 so the final score is interpretable, and one criterion should capture whether the evidence was actually measured on this task.
Hint
A manipulation benchmark usually weights physics and maturity heavily. A synthetic-data task gives more weight to sensors and rendering.
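The two constraints in this step, weights that sum to 1.0 and a criterion for task-specific evidence, can be checked in a few lines. This is a sketch under assumptions: `check_weights` is a hypothetical helper, and the weights are the tabletop-pushing values used in the Complete Solution.

```python
import math

weights = {
    "physics_fit": 0.35,
    "throughput": 0.15,
    "sensor_stack": 0.10,
    "integration": 0.15,
    "maturity": 0.20,
    "validation_evidence": 0.05,
}


def check_weights(weights, evidence_key="validation_evidence"):
    """Raise ValueError if the weights cannot produce an interpretable score."""
    if evidence_key not in weights:
        raise ValueError(f"missing criterion: {evidence_key}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    # Compare with a tolerance because floating-point sums are rarely exact.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0")
    return weights


check_weights(weights)
print("weights ok")
```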
Step 3: Score Candidates
Score at least three simulators from 1 to 5 using the same criteria. Keep the candidates in one table so the comparison is co-computed.
scores = {
"MuJoCo": {
"physics_fit": 5,
"throughput": 3,
"sensor_stack": 2,
"integration": 3,
"maturity": 5,
"validation_evidence": 5,
},
"Isaac Lab": {
"physics_fit": 4,
"throughput": 5,
"sensor_stack": 5,
"integration": 4,
"maturity": 4,
"validation_evidence": 4,
},
"Modern Gazebo": {
"physics_fit": 3,
"throughput": 2,
"sensor_stack": 4,
"integration": 5,
"maturity": 4,
"validation_evidence": 4,
},
}
# Audit the table: every candidate must use the same criteria and stay on the 1 to 5 scale.
criteria = set(next(iter(scores.values())))
for name, vals in scores.items():
    assert set(vals) == criteria, f"{name} uses different criteria"
    assert all(1 <= v <= 5 for v in vals.values()), f"{name} has a score outside 1 to 5"
print({"candidates": sorted(scores), "criteria": sorted(criteria)})
Hint
Do not score a tool highly because it is popular. Score it highly because it fits the written task.
Step 4: Rank And Explain
Compute a ranking and print the recommendation with the exact reason.
ranked = sorted(
((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
key=lambda item: item[1],
reverse=True,
)
print(ranked[0])
('Isaac Lab', 4.25)
The ranked list gives both the winner and the fallback candidates for the written recommendation. The output shown uses the tabletop-pushing weights from the Complete Solution.
Hint
If two tools are close, the fallback recommendation is as important as the winner.
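The "close" case can be made explicit with a margin. The sketch below is one possible convention rather than a standard: the `recommend` helper and the 0.25 default margin are assumptions, and the ranking reuses the values printed by Code Fragment 1.

```python
def recommend(ranked, margin=0.25):
    """Return the winner, the fallback, and whether the gap is too small to call.

    margin is an assumed threshold on the weighted score; tune it to your rubric.
    """
    (winner, top), (fallback, second) = ranked[0], ranked[1]
    gap = top - second
    return {
        "winner": winner,
        "fallback": fallback,
        "gap": round(gap, 2),
        "too_close": gap < margin,
    }


ranked = [("Isaac Lab", 4.35), ("MuJoCo", 3.85), ("Modern Gazebo", 3.45), ("Drake", 3.35)]
result = recommend(ranked)
print(result)
```

When `too_close` is true, the written recommendation should name the experiment that would separate the two tools instead of declaring a winner.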
Step 5: Add A Currency Check
Add one maintenance note for each candidate: current, frontier, legacy, or deprecated. Then add one validation note that states which evidence artifact supports the score.
currency = {
"MuJoCo": "current",
"Isaac Lab": "current",
"Modern Gazebo": "current",
}
validation_artifacts = {
"MuJoCo": "friction sweep replay",
"Isaac Lab": "camera randomization replay",
"Modern Gazebo": "ROS 2 topic and controller log",
}
def audit_release_support(
    currency: dict[str, str], validation_artifacts: dict[str, str]
) -> tuple[dict[str, str], dict[str, str]]:
    # Every candidate needs both a maintenance note and an evidence artifact.
    assert currency.keys() == validation_artifacts.keys()
    return currency, validation_artifacts
currency, validation_artifacts = audit_release_support(currency, validation_artifacts)
print(currency)
print(validation_artifacts)
{'MuJoCo': 'current', 'Isaac Lab': 'current', 'Modern Gazebo': 'current'}
{'MuJoCo': 'friction sweep replay', 'Isaac Lab': 'camera randomization replay', 'Modern Gazebo': 'ROS 2 topic and controller log'}
The currency table keeps recency risk visible, while validation_artifacts prevents the score from becoming unsupported opinion.
Hint
Gazebo Classic should be marked legacy or deprecated, while modern Gazebo should be marked current.
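The legacy side of the currency check can be seeded from the recency and deprecation table. The sketch below assumes a hypothetical `currency_notes` helper; it can only flag names that appear on the legacy list, so anything else still has to be verified against the project's own documentation.

```python
# Legacy names and their successors, copied from the recency and deprecation table.
SUCCESSORS = {
    "OpenAI Gym": "Gymnasium",
    "Isaac Gym Preview": "Isaac Lab",
    "IsaacGymEnvs": "Isaac Lab",
    "OmniIsaacGymEnvs": "Isaac Lab",
    "Orbit": "Isaac Lab",
    "Gazebo Classic": "Modern Gazebo",
}


def currency_notes(candidates):
    """Map each candidate to a maintenance note for the selection artifact."""
    notes = {}
    for name in candidates:
        if name in SUCCESSORS:
            notes[name] = f"legacy: migrate to {SUCCESSORS[name]}"
        else:
            # Absence from the list is not proof of active maintenance.
            notes[name] = "not on the legacy list: verify status in project docs"
    return notes


notes = currency_notes(["MuJoCo", "Gazebo Classic", "Orbit"])
for name, note in notes.items():
    print(f"{name}: {note}")
```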
Expected Output
The finished lab should output a ranked list, a primary recommendation, a fallback, a currency note, and a validation artifact for each candidate. The written paragraph should explain why the tool fits the task and what experiment would change the decision.
Stretch Goals
- Run the rubric twice: once for training and once for deployment testing.
- Add cost, operating system, and GPU memory columns.
- Replace manual scores with measured throughput from two installed simulators.
Each stretch goal should produce a new comparison artifact, not only a changed recommendation sentence.
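For the measured-throughput stretch goal, a timing harness along these lines is one possible starting point. Everything here is an assumption made for illustration: `fake_step` stands in for a real simulator step call, and the linear mapping from steps per second to a 1 to 5 score is only one of many reasonable choices.

```python
import time


def measure_steps_per_second(step_fn, n_steps=1000):
    """Time n_steps calls of step_fn and return steps per second."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps / elapsed if elapsed > 0 else float("inf")


def throughput_score(steps_per_second, target_sps):
    """Map measured throughput to the rubric's 1 to 5 scale.

    Linear in the fraction of the target reached, capped at 5 once the target is met.
    """
    ratio = min(steps_per_second / target_sps, 1.0)
    return max(1, min(5, round(1 + 4 * ratio)))


def fake_step():
    # Placeholder workload; replace with one real simulator step plus observation.
    sum(range(100))


sps = measure_steps_per_second(fake_step, n_steps=1000)
print(f"measured: {sps:.0f} steps/s, score: {throughput_score(sps, target_sps=10_000)}")
```

Run the same harness, with the same step count and observation pipeline, for each installed simulator so the resulting scores belong to one artifact.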
Complete Solution
The solution below keeps the task, rubric, evidence note, and falsification test in one reproducible artifact.
# Complete simulator selection rubric for a tabletop pushing task.
# The scores are co-computed with one config so comparisons are valid.
# Replace scores with measured data when tools are installed locally.
task = "tabletop pushing"
dominant_risk = "contact fidelity"
weights = {
"physics_fit": 0.35,
"throughput": 0.15,
"sensor_stack": 0.10,
"integration": 0.15,
"maturity": 0.20,
"validation_evidence": 0.05,
}
scores = {
"MuJoCo": {
"physics_fit": 5,
"throughput": 3,
"sensor_stack": 2,
"integration": 3,
"maturity": 5,
"validation_evidence": 5,
},
"Isaac Lab": {
"physics_fit": 4,
"throughput": 5,
"sensor_stack": 5,
"integration": 4,
"maturity": 4,
"validation_evidence": 4,
},
"Modern Gazebo": {
"physics_fit": 3,
"throughput": 2,
"sensor_stack": 4,
"integration": 5,
"maturity": 4,
"validation_evidence": 4,
},
}
currency = {
"MuJoCo": "current",
"Isaac Lab": "current",
"Modern Gazebo": "current",
}
validation_artifacts = {
"MuJoCo": "friction sweep replay",
"Isaac Lab": "camera randomization replay",
"Modern Gazebo": "ROS 2 topic and controller log",
}
ranked = sorted(
((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
key=lambda item: item[1],
reverse=True,
)
winner, winner_score = ranked[0]
fallback, fallback_score = ranked[1]
print(f"task: {task}")
print(f"dominant risk: {dominant_risk}")
print(f"recommendation: {winner} ({winner_score:.2f}, {currency[winner]})")
print(f"fallback: {fallback} ({fallback_score:.2f}, {currency[fallback]})")
print(f"evidence: {validation_artifacts[winner]}")
print("falsification test: compare pushing distance under three friction settings")
task: tabletop pushing
dominant risk: contact fidelity
recommendation: Isaac Lab (4.25, current)
fallback: MuJoCo (4.10, current)
evidence: camera randomization replay
falsification test: compare pushing distance under three friction settings
For a pick-and-place project, MuJoCo might win the early controller prototype, ManiSkill might win benchmark comparison, and Isaac Lab might win large-scale policy training with cameras. The correct recommendation can be a pipeline, not a single tool.
A good embodied system makes choosing a simulator visible twice: once in the design sketch and once in the replay artifact. The second view keeps the first one honest.
Can you name the simulator you would use for training, the simulator or stack you would use for integration testing, and the mature baseline you would cite? If not, your tool plan is incomplete.
Repeat the lab for legged locomotion and increase the throughput weight. Then repeat it for ROS 2 deployment testing and increase the integration weight. Explain why the winner changes.
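A compact way to run this exercise is to keep the scores fixed and swap only the weight profile. The sketch below reuses the lab's illustrative scores; the two weight profiles are assumptions chosen for this example, not recommended values.

```python
scores = {
    "MuJoCo": {"physics_fit": 5, "throughput": 3, "sensor_stack": 2,
               "integration": 3, "maturity": 5, "validation_evidence": 5},
    "Isaac Lab": {"physics_fit": 4, "throughput": 5, "sensor_stack": 5,
                  "integration": 4, "maturity": 4, "validation_evidence": 4},
    "Modern Gazebo": {"physics_fit": 3, "throughput": 2, "sensor_stack": 4,
                      "integration": 5, "maturity": 4, "validation_evidence": 4},
}

# Assumed weight profiles: throughput-heavy and integration-heavy. Each sums to 1.0.
profiles = {
    "legged locomotion": {"physics_fit": 0.25, "throughput": 0.40, "sensor_stack": 0.05,
                          "integration": 0.10, "maturity": 0.15, "validation_evidence": 0.05},
    "ROS 2 deployment testing": {"physics_fit": 0.10, "throughput": 0.05, "sensor_stack": 0.10,
                                 "integration": 0.55, "maturity": 0.15, "validation_evidence": 0.05},
}


def rank(weights, scores):
    """Weighted sum per candidate, highest first."""
    return sorted(
        ((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )


winners = {}
for profile, weights in profiles.items():
    ranked = rank(weights, scores)
    winners[profile] = ranked[0][0]
    print(profile, [(name, round(score, 2)) for name, score in ranked])
```

With these assumed weights, the throughput-heavy profile favors Isaac Lab and the integration-heavy profile favors Modern Gazebo. The scores never changed, so the shift in winner comes entirely from the stated task priorities, which is the explanation the exercise asks for.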
The frontier is not one simulator winning every category. The frontier is reproducible simulator composition: train in one fast environment, evaluate in benchmark suites, test integration in ROS 2, and measure reality-gap risk before deployment.
Simulator choice is an experimental design decision. Match the tool to the dominant risk, compare candidates in one co-computed artifact, and document what would change your mind.
Continue to Chapter 12, where simulator choice becomes benchmark choice: what exactly are we measuring, and how do we avoid fooling ourselves?
Google DeepMind. "MuJoCo Documentation."
MuJoCo is the main mature baseline for compact rigid-body robot-learning experiments. Use the docs to verify current APIs, MJX, and MuJoCo Warp options before running comparisons.
Isaac Lab Project. "Isaac Lab Documentation."
Isaac Lab is the current NVIDIA robot-learning framework and a common default for GPU RL and sensor-rich simulation. It is central for readers scaling experiments into thousands of environments.
NVIDIA. "Newton Physics Engine."
Newton represents the emerging Warp and OpenUSD direction for open robot-learning physics. Readers should treat it as a frontier option and validate it against mature baselines.
Open Robotics. "Installing Gazebo With ROS."
This page documents supported ROS and Gazebo combinations. It is especially relevant for system-integration choices because version compatibility can decide whether a simulator plan is practical.
ManiSkill Team. "ManiSkill Documentation."
ManiSkill is the practical bridge from simulator choice to manipulation benchmarks. It helps readers connect SAPIEN-powered simulation to the task-suite discussion in the next chapter.