Section 11.8: Choosing a Simulator | Building Embodied AI: From Perception to Autonomous Action

"The right simulator is the one whose wrongness you can afford, measure, and explain."
A Reality-Gap Auditor

Big Picture

This section converts the chapter into a decision procedure. You do not choose a simulator by brand, speed, or popularity. You choose it by matching task physics, observation needs, throughput, integration, maturity, and evaluation risk.

Simulator Choice Evidence Rule

Choose a simulator by task contract, not reputation. Record the dominant physical risk, required sensor model, throughput target, asset format, and integration boundary, then run the same task panel before comparing tools.
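One way to make the task contract concrete is to write it down as a small record before any tool is opened. The sketch below is only an illustration: the class name `TaskContract`, its field names, and the example values are hypothetical, chosen to mirror the five items in the rule.

```python
from dataclasses import dataclass, asdict

# Hypothetical record of the task contract; field names mirror the rule above.
@dataclass(frozen=True)
class TaskContract:
    task: str
    dominant_physical_risk: str
    sensor_model: str
    throughput_target_steps_per_s: float
    asset_format: str
    integration_boundary: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, so gaps show up before tools are compared."""
        return [name for name, value in asdict(self).items() if value in ("", None)]

contract = TaskContract(
    task="tabletop pushing",
    dominant_physical_risk="contact fidelity",
    sensor_model="RGB camera",          # illustrative value
    throughput_target_steps_per_s=1e4,  # illustrative value, not a measured target
    asset_format="MJCF",                # illustrative value
    integration_boundary="",            # deliberately left blank
)
print(contract.missing_fields())
```

A contract with missing fields is a signal to stop and finish the task description, not to start benchmarking.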

Do Not Rank Tools From Marketing Claims

Simulator claims are only meaningful when measured on your robot, your task, your controller frequency, and your observation pipeline. Treat unmatched benchmark numbers as hints, not evidence.
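The distinction between a hint and evidence can be made mechanical. The following is a minimal sketch under stated assumptions: the key names in `MATCH_KEYS` and the example configurations are hypothetical labels for the four conditions named above, not fields from any real benchmark format.

```python
# Keys that must match before a reported number counts as evidence.
# The key names are hypothetical labels for the four conditions named above.
MATCH_KEYS = ("robot", "task", "control_hz", "observation_pipeline")

def classify_claim(claim: dict, ours: dict) -> str:
    """Label a reported benchmark number as 'evidence' or as a 'hint'."""
    mismatched = [k for k in MATCH_KEYS if claim.get(k) != ours.get(k)]
    if not mismatched:
        return "evidence"
    return "hint (differs on: " + ", ".join(mismatched) + ")"

ours = {"robot": "7-DoF arm", "task": "tabletop pushing",
        "control_hz": 50, "observation_pipeline": "RGB 128x128"}
vendor_claim = {"robot": "quadruped", "task": "flat-ground walking",
                "control_hz": 50, "observation_pipeline": "proprioception"}

print(classify_claim(vendor_claim, ours))  # mismatched configuration
print(classify_claim(dict(ours), ours))    # identical configuration
```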

This section applies the dynamics background from Chapter 6: Dynamics and Simulation Math to the simulator stack introduced in Chapter 9: Why Simulation Is Central. It also prepares the GPU training workflows in Chapter 17: Massively Parallel and GPU RL and the randomization workflow in Chapter 13: Domain Randomization and Synthetic Data by tying tool choice to measurable task risk.

The Decision Table

The table below is a starting map, not a leaderboard. Use it to choose which tools deserve a task-level comparison, then measure those tools on the same robot, task, controller frequency, observation pipeline, and validation script.

Simulator Selection Guide

Primary need	Start with	Second option	Reason
Small contact-control experiment	MuJoCo	Drake	Fast, readable, strong dynamics loop
JAX-native parallel RL	MJX	Brax or MuJoCo	JAX transformations and batched state are central
NVIDIA GPU MuJoCo-style throughput	MuJoCo Warp	Newton	Warp backend targets NVIDIA acceleration
USD scenes and sensor-rich robot learning	Isaac Lab	Newton or Genesis	Isaac Lab connects physics, rendering, sensors, and learning workflows
Emerging Warp and OpenUSD research	Newton	Isaac Lab	Useful when the frontier engine itself is part of the research question
Pythonic multi-physics and generated scenes	Genesis	Isaac Lab	Promising for multi-physics, rendering, and generative workflows
Model-based control and verification	Drake	MuJoCo	Optimization and analysis are first-class
Manipulation benchmark tasks	SAPIEN or ManiSkill	MuJoCo or Isaac Lab	Task suites and manipulation assets matter
ROS 2 system integration	Modern Gazebo	Isaac Sim	Middleware, sensors, and controllers dominate the need

Recency And Deprecation Table

Use Current Tools, Not Old Names

Old or risky default	Current direction	Action
OpenAI Gym	Gymnasium	Use Gymnasium APIs for new environment work
Isaac Gym Preview, IsaacGymEnvs, OmniIsaacGymEnvs, Orbit	Isaac Lab	Migrate new robot-learning projects to Isaac Lab
Gazebo Classic	Modern Gazebo	Do not start new projects on Classic after its January 2025 end of life
Unverified vendor simulator claims	Task-level benchmark artifact	Reproduce performance and fidelity on your own task
One simulator for every stage	Multi-tool pipeline	Train, benchmark, integrate, and verify with fit-for-purpose tools
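The tool-name rows of the table can be turned into a lookup that flags legacy candidates before they enter a comparison. This is a sketch, not an authoritative registry: the mapping is copied from the table above, and the helper name `currency_check` is hypothetical.

```python
# Old name -> current direction, copied from the table above.
SUCCESSORS = {
    "OpenAI Gym": "Gymnasium",
    "Isaac Gym Preview": "Isaac Lab",
    "IsaacGymEnvs": "Isaac Lab",
    "OmniIsaacGymEnvs": "Isaac Lab",
    "Orbit": "Isaac Lab",
    "Gazebo Classic": "Modern Gazebo",
}

def currency_check(candidates: list[str]) -> dict[str, str]:
    """Map each candidate to a migration note, or mark it as not listed as legacy."""
    return {
        name: f"legacy, migrate to {SUCCESSORS[name]}" if name in SUCCESSORS
        else "not listed as legacy"
        for name in candidates
    }

print(currency_check(["MuJoCo", "Gazebo Classic", "IsaacGymEnvs"]))
```

A name that is absent from the lookup is only "not listed"; its maintenance status still has to be confirmed against the project's own release notes.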

One Config, One Comparison

Compare simulators on one task configuration, one metric script, and one artifact. A throughput number from one robot and a fidelity claim from another robot do not form a valid comparison. The comparison artifact should contain every number used in the table, so a reviewer can trace each recommendation back to the same run.
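A comparison artifact of this kind can be as simple as one JSON record. The sketch below assumes a hypothetical helper, `build_artifact`, and placeholder metric names and numbers; it hashes the shared config so every number in the table can be traced to the same run setup, and it rejects tables in which simulators report different metrics.

```python
import hashlib
import json

def build_artifact(config: dict, metrics: dict[str, dict[str, float]]) -> dict:
    """Bundle one config and every simulator's metrics into one traceable record."""
    # Hash the canonical JSON form of the config so all rows point at the same setup.
    config_json = json.dumps(config, sort_keys=True)
    config_hash = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    # Refuse tables where simulators report different metric names.
    metric_names = {frozenset(m) for m in metrics.values()}
    if len(metric_names) != 1:
        raise ValueError("all simulators must report the same metrics")
    return {"config": config, "config_hash": config_hash, "metrics": metrics}

config = {"task": "tabletop pushing", "control_hz": 50, "episodes": 100}  # illustrative
metrics = {  # placeholder numbers, not measurements
    "sim_a": {"steps_per_s": 1000.0, "push_error_m": 0.02},
    "sim_b": {"steps_per_s": 4000.0, "push_error_m": 0.05},
}
artifact = build_artifact(config, metrics)
print(json.dumps(artifact, indent=2))
```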

A Runnable Scoring Rubric

Code Fragment 1 turns simulator choice into a co-computed score. The important part is that all tools are scored on the same criteria in one run, which prevents invalid number-by-number comparisons from different configurations.

# Score simulators on one task using one rubric and one config.
# Higher is better, but the weights must match the task.
# This avoids mixing metrics from different experiments.
weights = {
    "physics_fit": 0.30,
    "throughput": 0.20,
    "sensor_stack": 0.15,
    "integration": 0.15,
    "maturity": 0.15,
    "validation_evidence": 0.05,
}

scores = {
    "MuJoCo": {
        "physics_fit": 5,
        "throughput": 3,
        "sensor_stack": 2,
        "integration": 3,
        "maturity": 5,
        "validation_evidence": 5,
    },
    "Isaac Lab": {
        "physics_fit": 4,
        "throughput": 5,
        "sensor_stack": 5,
        "integration": 4,
        "maturity": 4,
        "validation_evidence": 4,
    },
    "Drake": {
        "physics_fit": 4,
        "throughput": 2,
        "sensor_stack": 2,
        "integration": 3,
        "maturity": 5,
        "validation_evidence": 5,
    },
    "Modern Gazebo": {
        "physics_fit": 3,
        "throughput": 2,
        "sensor_stack": 4,
        "integration": 5,
        "maturity": 4,
        "validation_evidence": 4,
    },
}

ranked = sorted(
    ((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.2f}")

Isaac Lab: 4.35
MuJoCo: 3.85
Modern Gazebo: 3.45
Drake: 3.35

Code Fragment 1: This rubric ranks candidate simulators with one weight vector, one scoring table, and an explicit validation-evidence column. Change the weights for a control-heavy task and Drake may rise; change them for ROS 2 integration and Gazebo may rise.
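To see how sensitive the ranking is to the weights, the same score table can be re-ranked under a second weight vector. The sketch below reuses the scores and base weights from Code Fragment 1; the `ros2` weight vector is a hypothetical example of an integration-heavy task, not a recommended setting.

```python
# Same score table as Code Fragment 1; CRITERIA fixes the order of each tuple.
CRITERIA = ("physics_fit", "throughput", "sensor_stack",
            "integration", "maturity", "validation_evidence")
raw = {
    "MuJoCo":        (5, 3, 2, 3, 5, 5),
    "Isaac Lab":     (4, 5, 5, 4, 4, 4),
    "Drake":         (4, 2, 2, 3, 5, 5),
    "Modern Gazebo": (3, 2, 4, 5, 4, 4),
}
scores = {name: dict(zip(CRITERIA, vals)) for name, vals in raw.items()}

def rank(weights):
    """Return (name, weighted score) pairs sorted from best to worst."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    totals = {n: sum(weights[k] * v[k] for k in CRITERIA) for n, v in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

base = dict(zip(CRITERIA, (0.30, 0.20, 0.15, 0.15, 0.15, 0.05)))  # Code Fragment 1
ros2 = dict(zip(CRITERIA, (0.10, 0.05, 0.15, 0.50, 0.15, 0.05)))  # hypothetical

for label, w in (("base", base), ("ros2", ros2)):
    print(label, [(n, round(s, 2)) for n, s in rank(w)])
```

Under the hypothetical ROS 2 weights, Modern Gazebo moves to first place. In this particular table Drake is scored no higher than MuJoCo on any criterion, so no nonnegative reweighting can place it above MuJoCo; that would take revised scores, which is one more reason to replace manual scores with measured ones.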

Library Shortcut

The scoring script is about 50 lines of decision support. In a real project, the shortcut is a reproducible experiment harness: one config file, one launcher, one metrics table, one artifact store, and one report. Tools such as Hydra, Weights & Biases, MLflow, or plain JSON plus CSV can all store the comparison; whichever you pick, the principle is the same.

Hands-On Lab: Benchmark A Simulator Choice

Duration: about 75 minutes
Difficulty: Intermediate

Objective

Build a reproducible simulator-selection artifact for a reaching, pushing, locomotion, or ROS 2 integration task.

What You'll Practice

Writing a task-specific simulator rubric
Co-computing comparable scores in one pass
Adding a deprecation and maintenance check
Recording validation evidence instead of preference scores
Producing an engineering recommendation with a fallback

Setup

The required version uses only Python's standard library. Optional extensions can call MuJoCo, Isaac Lab, ManiSkill, or Gazebo after those tools are installed.

Steps

Work through the steps in order so the final recommendation has a visible chain from task risk to weights, scores, maintenance status, validation evidence, and falsification test.

Step 1: Define The Task

Choose one task and write the dominant risk: contact fidelity, throughput, visual sensors, model-based control, manipulation benchmark coverage, or ROS 2 integration.

task = "tabletop pushing"
dominant_risk = "contact fidelity"
# Capabilities the simulator must provide for this task.
requirements = {"contact fidelity", "asset import", "batch rollout", "camera rendering"}
# No simulator is named yet; the choice comes after weights and scores.
print({"task": task, "dominant_risk": dominant_risk, "requirements": sorted(requirements)})

Code Fragment 2: This lab starter names the task and the dominant risk before any simulator is chosen. The two variables, task and dominant_risk, force the recommendation to start from the embodied problem.

Hint

If the task fails when objects slip incorrectly, choose contact fidelity. If it fails because training is too slow, choose throughput.

Step 2: Set Weights

Choose weights that match the task. The weights should sum to 1.0 so the final score is interpretable, and one criterion should capture whether the evidence was actually measured on this task.
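A minimal starter for this step, assuming the same weight vector that the Complete Solution uses for tabletop pushing:

```python
# Rubric weights for tabletop pushing; physics_fit carries the most weight
# because the dominant risk is contact fidelity.
weights = {
    "physics_fit": 0.35,
    "throughput": 0.15,
    "sensor_stack": 0.10,
    "integration": 0.15,
    "maturity": 0.20,
    "validation_evidence": 0.05,
}

# Guard against weights that do not sum to 1.0 (allowing for float rounding).
total = sum(weights.values())
assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
print(f"weights sum to {total:.2f}")
```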

Code Fragment 3: This starter defines the rubric weights for simulator selection. The named keys make the tradeoff visible: physics, throughput, sensors, integration, maturity, and validation evidence cannot all dominate at once.

Hint

A manipulation benchmark usually weights physics and maturity heavily. A synthetic-data task gives more weight to sensors and rendering.

Step 3: Score Candidates

Score at least three simulators from 1 to 5 using the same criteria. Keep the candidates in one table so the comparison is co-computed.

scores = {
    "MuJoCo": {
        "physics_fit": 5,
        "throughput": 3,
        "sensor_stack": 2,
        "integration": 3,
        "maturity": 5,
        "validation_evidence": 5,
    },
    "Isaac Lab": {
        "physics_fit": 4,
        "throughput": 5,
        "sensor_stack": 5,
        "integration": 4,
        "maturity": 4,
        "validation_evidence": 4,
    },
    "Modern Gazebo": {
        "physics_fit": 3,
        "throughput": 2,
        "sensor_stack": 4,
        "integration": 5,
        "maturity": 4,
        "validation_evidence": 4,
    },
}

# Confirm every candidate was scored on the same criteria before ranking.
criteria_sets = {frozenset(vals) for vals in scores.values()}
assert len(criteria_sets) == 1, "all candidates must share one set of criteria"
print({"candidates": sorted(scores), "criteria": sorted(criteria_sets.pop())})

Code Fragment 4: This starter stores all candidate simulator scores in one table. Keeping identical criteria for MuJoCo, Isaac Lab, and Modern Gazebo prevents invalid comparisons across different rubrics.

Hint

Do not score a tool highly because it is popular. Score it highly because it fits the written task.

Step 4: Rank And Explain

Compute a ranking and print the recommendation with the exact reason.

ranked = sorted(
    ((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
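# Uses the Step 2 weights and the Step 3 scores; the score is formatted
# so float rounding does not leak into the output.
print(f"{ranked[0][0]}: {ranked[0][1]:.2f}")

Isaac Lab: 4.25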

Code Fragment 5: This starter computes a weighted ranking from the shared rubric. The ranked list gives both the winner and the fallback candidates for the written recommendation.

Hint

If two tools are close, the fallback recommendation is as important as the winner.
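Closeness can be checked explicitly instead of judged by eye. The sketch below assumes a hypothetical helper, `recommend`, and a hypothetical `margin` threshold on the 1-to-5 score scale; the ranked list is illustrative.

```python
def recommend(ranked, margin=0.25):
    """Return winner, fallback, and whether the gap is small enough to need a tiebreak test.

    `margin` is a hypothetical threshold on the 1-to-5 score scale.
    """
    (winner, top), (fallback, second) = ranked[0], ranked[1]
    gap = top - second
    return {"winner": winner, "fallback": fallback,
            "gap": round(gap, 2), "close_call": gap < margin}

# Illustrative ranked list, already sorted by score in descending order.
ranked = [("sim_a", 4.25), ("sim_b", 4.10), ("sim_c", 3.50)]
print(recommend(ranked))
```

When `close_call` is true, the written recommendation should name the experiment that would separate the two tools, not only the winner.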

Step 5: Add A Currency Check

Add one maintenance note for each candidate: current, frontier, legacy, or deprecated. Then add one validation note that states which evidence artifact supports the score.

currency = {
    "MuJoCo": "current",
    "Isaac Lab": "current",
    "Modern Gazebo": "current",
}
validation_artifacts = {
    "MuJoCo": "friction sweep replay",
    "Isaac Lab": "camera randomization replay",
    "Modern Gazebo": "ROS 2 topic and controller log",
}
def audit_release_support(
    currency: dict[str, str], validation_artifacts: dict[str, str]
) -> tuple[dict[str, str], dict[str, str]]:
    assert currency.keys() == validation_artifacts.keys()
    return currency, validation_artifacts

currency, validation_artifacts = audit_release_support(currency, validation_artifacts)
print(currency)
print(validation_artifacts)

{'MuJoCo': 'current', 'Isaac Lab': 'current', 'Modern Gazebo': 'current'}
{'MuJoCo': 'friction sweep replay', 'Isaac Lab': 'camera randomization replay', 'Modern Gazebo': 'ROS 2 topic and controller log'}

Code Fragment 6: This starter adds maintenance status and validation artifacts to the simulator decision. The currency table keeps recency risk visible, while validation_artifacts prevents the score from becoming unsupported opinion.

Hint

Gazebo Classic should be marked legacy or deprecated, while modern Gazebo should be marked current.

Expected Output

The finished lab should output a ranked list, a primary recommendation, a fallback, a currency note, and a validation artifact for each candidate. The written paragraph should explain why the tool fits the task and what experiment would change the decision.

Stretch Goals

Run the rubric twice: once for training and once for deployment testing.
Add cost, operating system, and GPU memory columns.
Replace manual scores with measured throughput from two installed simulators.

Each stretch goal should produce a new comparison artifact, not only a changed recommendation sentence.
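For the third stretch goal, one possible shape is a timing helper plus a mapping from measured throughput to the rubric scale. Everything here is an assumption for illustration: `dummy_step` stands in for a real simulator step call, and the linear mapping to a 1-to-5 score relative to a task target is one choice among many.

```python
import time

def measure_steps_per_second(step_fn, n_steps=10_000):
    """Time n_steps calls of step_fn and return steps per wall-clock second."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps / elapsed

def throughput_score(steps_per_s, target_steps_per_s):
    """Map measured throughput to the 1-to-5 rubric scale relative to the task target.

    The linear mapping and the clamp to [1, 5] are assumptions.
    """
    ratio = steps_per_s / target_steps_per_s
    return max(1, min(5, round(1 + 4 * ratio)))

def dummy_step():
    # Stand-in for one simulator step; replace with a call into an installed simulator.
    sum(range(50))

measured = measure_steps_per_second(dummy_step)
score = throughput_score(measured, target_steps_per_s=1e6)
print(f"{measured:.0f} steps/s -> score {score}")
```

Run the same helper with the same step count and the same task configuration for each installed simulator, so the resulting scores belong in one comparison artifact.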

Complete Solution

The solution below keeps the task, rubric, evidence note, and falsification test in one reproducible artifact.

# Complete simulator selection rubric for a tabletop pushing task.
# The scores are co-computed with one config so comparisons are valid.
# Replace scores with measured data when tools are installed locally.
task = "tabletop pushing"
dominant_risk = "contact fidelity"
weights = {
    "physics_fit": 0.35,
    "throughput": 0.15,
    "sensor_stack": 0.10,
    "integration": 0.15,
    "maturity": 0.20,
    "validation_evidence": 0.05,
}
scores = {
    "MuJoCo": {
        "physics_fit": 5,
        "throughput": 3,
        "sensor_stack": 2,
        "integration": 3,
        "maturity": 5,
        "validation_evidence": 5,
    },
    "Isaac Lab": {
        "physics_fit": 4,
        "throughput": 5,
        "sensor_stack": 5,
        "integration": 4,
        "maturity": 4,
        "validation_evidence": 4,
    },
    "Modern Gazebo": {
        "physics_fit": 3,
        "throughput": 2,
        "sensor_stack": 4,
        "integration": 5,
        "maturity": 4,
        "validation_evidence": 4,
    },
}
currency = {
    "MuJoCo": "current",
    "Isaac Lab": "current",
    "Modern Gazebo": "current",
}
validation_artifacts = {
    "MuJoCo": "friction sweep replay",
    "Isaac Lab": "camera randomization replay",
    "Modern Gazebo": "ROS 2 topic and controller log",
}
ranked = sorted(
    ((name, sum(weights[k] * vals[k] for k in weights)) for name, vals in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
winner, winner_score = ranked[0]
fallback, fallback_score = ranked[1]
print(f"task: {task}")
print(f"dominant risk: {dominant_risk}")
print(f"recommendation: {winner} ({winner_score:.2f}, {currency[winner]})")
print(f"fallback: {fallback} ({fallback_score:.2f}, {currency[fallback]})")
print(f"evidence: {validation_artifacts[winner]}")
print("falsification test: compare pushing distance under three friction settings")

task: tabletop pushing
dominant risk: contact fidelity
recommendation: Isaac Lab (4.25, current)
fallback: MuJoCo (4.10, current)
evidence: camera randomization replay
falsification test: compare pushing distance under three friction settings

Code Fragment 7: This complete solution prints the selected simulator, fallback, currency labels, validation artifact, and falsification test. The output shows why the recommendation is an experiment plan rather than a popularity ranking.

Practical Example

For a pick-and-place project, MuJoCo might win the early controller prototype, ManiSkill might win benchmark comparison, and Isaac Lab might win large-scale policy training with cameras. The correct recommendation can be a pipeline, not a single tool.

Memory Hook

A good embodied-AI project makes its simulator choice visible twice: once in the design sketch and once in the replay artifact. The second view keeps the first one honest.

Self Check

Can you name the simulator you would use for training, the simulator or stack you would use for integration testing, and the mature baseline you would cite? If not, your tool plan is incomplete.

Exercise 11.8

Repeat the lab for legged locomotion and increase the throughput weight. Then repeat it for ROS 2 deployment testing and increase the integration weight. Explain why the winner changes.

Research Frontier

The frontier is not one simulator winning every category. The frontier is reproducible simulator composition: train in one fast environment, evaluate in benchmark suites, test integration in ROS 2, and measure reality-gap risk before deployment.

Key Takeaway

Simulator choice is an experimental design decision. Match the tool to the dominant risk, compare candidates in one co-computed artifact, and document what would change your mind.

What's Next?

Continue to Chapter 12, where simulator choice becomes benchmark choice: what exactly are we measuring, and how do we avoid fooling ourselves?

Bibliography and Further Reading

Tools & Libraries

Google DeepMind. "MuJoCo Documentation."

MuJoCo is the main mature baseline for compact rigid-body robot-learning experiments. Use the docs to verify current APIs, MJX, and MuJoCo Warp options before running comparisons.

Isaac Lab Project. "Isaac Lab Documentation."

Isaac Lab is the current NVIDIA robot-learning framework and a common default for GPU RL and sensor-rich simulation. It is central for readers scaling experiments into thousands of environments.

NVIDIA. "Newton Physics Engine."

Newton represents the emerging Warp and OpenUSD direction for open robot-learning physics. Readers should treat it as a frontier option and validate it against mature baselines.

Open Robotics. "Installing Gazebo With ROS."

This page documents supported ROS and Gazebo combinations. It is especially relevant for system-integration choices because version compatibility can decide whether a simulator plan is practical.

ManiSkill Team. "ManiSkill Documentation."

ManiSkill is the practical bridge from simulator choice to manipulation benchmarks. It helps readers connect SAPIEN-powered simulation to the task-suite discussion in the next chapter.
