Section 20.5: Measuring transfer performance | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

**Figure 20.5A**: Transfer measurement is credible when simulator numbers and hardware numbers are computed from the same task panel, protocol, and success definition. (Illustration: simulated and real robot rollouts enter the same evaluation table, with success, safety, intervention, and failure-label columns checked together.)

Big Picture

Measuring transfer performance is not the same as reporting one real-robot success rate. A transfer result should say how the policy performed in simulation, how it performed on hardware, how much safety supervision it needed, which failures dominated, and whether those numbers were computed on the same task panel.

A transfer report should name the randomized variables, the simulator assumptions, the real-world measurements, and any demonstration-learning handoff in one transfer ledger.

This section develops a transfer evaluation contract. The contract includes task success, return, completion time, safety violations, intervention count, blocked-action count, reset count, hardware health, and failure category. Reporting only success hides how the success was purchased.

The key question is practical: do the metrics compare the same construct under the same protocol, or are we comparing a clean simulator score with a noisy hardware anecdote?

Action Is The Test

A transfer metric earns trust when every compared number is co-computed from one evaluation artifact. Separate scripts, separate task panels, or separate success definitions can produce a table that looks precise while comparing different constructs.

Theory

A basic transfer ratio is $\rho = S_{\text{real}} / S_{\text{sim}}$, where $S$ is the same success metric measured on matched task instances. If a policy succeeds in 90 percent of simulated trials and 63 percent of hardware trials, $\rho = 0.70$. The ratio is useful only when both rates use the same task family and success rule.

Good reports also include gap diagnostics: success gap $S_{\text{sim}} - S_{\text{real}}$, intervention rate, safety-violation rate, blocked-action rate, and failure-category distribution. These metrics tell the reader whether the policy failed because it could not solve the task, because it needed supervision, or because safety gates stopped unsafe commands.
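As a minimal sketch of these gap diagnostics, the fragment below computes them from one table of matched trials. The row fields (`sim_success`, `real_success`, `interventions`, `safety_violations`, `blocked_actions`, `failure_label`) and the five trials are hypothetical, and each rate is assumed to mean "fraction of trials with at least one such event"; a real protocol might instead count events per trial or per action.

```python
from collections import Counter

# Hypothetical artifact rows, one per matched task instance.
# Field names are illustrative assumptions, not a library schema.
rows = [
    {"task": "pose_01", "sim_success": 1, "real_success": 1, "interventions": 0,
     "safety_violations": 0, "blocked_actions": 0, "failure_label": None},
    {"task": "pose_02", "sim_success": 1, "real_success": 0, "interventions": 1,
     "safety_violations": 0, "blocked_actions": 2, "failure_label": "slip"},
    {"task": "pose_03", "sim_success": 1, "real_success": 1, "interventions": 0,
     "safety_violations": 0, "blocked_actions": 0, "failure_label": None},
    {"task": "pose_04", "sim_success": 1, "real_success": 1, "interventions": 0,
     "safety_violations": 0, "blocked_actions": 0, "failure_label": None},
    {"task": "pose_05", "sim_success": 0, "real_success": 0, "interventions": 1,
     "safety_violations": 1, "blocked_actions": 1, "failure_label": "collision_gate"},
]


def trial_rate(rows, key):
    """Fraction of trials in which the field is nonzero."""
    return sum(1 for r in rows if r[key]) / len(rows)


sim_rate = trial_rate(rows, "sim_success")
real_rate = trial_rate(rows, "real_success")

# Every diagnostic is derived from the same rows, so the denominators agree.
report = {
    "n_trials": len(rows),
    "success_gap": sim_rate - real_rate,
    "intervention_rate": trial_rate(rows, "interventions"),
    "safety_violation_rate": trial_rate(rows, "safety_violations"),
    "blocked_action_rate": trial_rate(rows, "blocked_actions"),
    "failure_labels": Counter(
        r["failure_label"] for r in rows if not r["real_success"]
    ),
}

for name, value in report.items():
    print(name, value)
```

Keeping the trial count in the report lets a reader judge how much weight each rate can bear.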

Mechanism

The mechanism is paired evaluation. Select the task panel, freeze the policy checkpoint, run the simulator evaluation, run the hardware evaluation under the documented gate, assign failure labels, and compute every metric from the same artifact table.

Worked Example

Code Fragment 20.5.1 computes a small transfer report from matched simulated and real trials. Notice that success, interventions, and failure labels are computed from the same rows.

```python
# Compute transfer metrics from one matched evaluation artifact.
# Success and safety numbers stay tied to the same task panel.
sim_success = [1, 1, 1, 1, 0]
real_success = [1, 0, 1, 1, 0]
interventions = [0, 1, 0, 0, 1]

sim_rate = sum(sim_success) / len(sim_success)
real_rate = sum(real_success) / len(real_success)
transfer_ratio = real_rate / sim_rate
intervention_rate = sum(interventions) / len(interventions)

print(f"sim_success={sim_rate:.2f}")
print(f"real_success={real_rate:.2f}")
print(f"transfer_ratio={transfer_ratio:.2f}")
print(f"intervention_rate={intervention_rate:.2f}")
```

```text
sim_success=0.80
real_success=0.60
transfer_ratio=0.75
intervention_rate=0.40
```

Code Fragment 20.5.1 computes sim_rate, real_rate, transfer_ratio, and intervention_rate from one matched artifact. The intervention rate changes the interpretation of the transfer ratio, because 0.75 transfer with frequent supervision is not the same result as 0.75 autonomous transfer.

A transfer report should include the simulator rate, the hardware rate, the transfer ratio, and every safety metric together with its denominator (here, interventions over five trials). If the policy required interventions, that fact belongs next to the success rate, not in a separate paragraph.

Library Shortcut

In practical systems, use evaluation harnesses that emit a single table per protocol: configuration, policy checkpoint, task instance, simulator metrics, hardware metrics, safety events, and failure label. Gymnasium wrappers, Isaac Lab task panels, ROS 2 logs, and LeRobot datasets are useful only if they preserve the common artifact schema.

Practical Recipe

Freeze the policy checkpoint and evaluation code before running the comparison.
Define the task panel, success rule, timeout, safety gates, and intervention policy in writing.
Compute simulator and hardware metrics from one artifact schema.
Report uncertainty with trial counts, paired panels, and confidence intervals or bootstrap intervals when sample size permits.
Include failure labels, videos or state traces, and all safety denominators in the same result package.
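One way to report the uncertainty named in the recipe is a paired percentile bootstrap over task instances. The sketch below is illustrative only: the ten matched trials are hypothetical, `paired_bootstrap_ci` is a helper defined here rather than a library function, and the percentile method is one of several reasonable interval choices.

```python
import math
import random

# Hypothetical matched panel: index i is the same task instance in sim and real.
sim_success = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
real_success = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]


def transfer_ratio(pairs):
    """Real success count divided by sim success count on the same instances."""
    sim_wins = sum(s for s, _ in pairs)
    real_wins = sum(r for _, r in pairs)
    return real_wins / sim_wins if sim_wins else float("nan")


def paired_bootstrap_ci(sim, real, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the transfer ratio, resampling task instances.

    Resampling (sim, real) pairs together preserves the pairing, so an
    instance that is hard in both domains stays hard in every resample.
    """
    rng = random.Random(seed)
    pairs = list(zip(sim, real))
    stats = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        rho = transfer_ratio(sample)
        if not math.isnan(rho):  # drop resamples with no simulated successes
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


point = transfer_ratio(list(zip(sim_success, real_success)))
lo, hi = paired_bootstrap_ci(sim_success, real_success)
print(f"transfer_ratio={point:.2f}  95% interval=({lo:.2f}, {hi:.2f})  n={len(sim_success)}")
```

With only ten trials the interval will be wide, which is itself worth reporting: it shows how little a single transfer ratio constrains the true value at small sample sizes.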

Common Failure Mode

The common mistake is to compare the best simulator run with a separate hardware run collected under a different protocol. That table may pass a number-by-number audit while failing the scientific comparison.

Practical Example

A manipulation benchmark should report matched success on the same object poses, number of human interventions, blocked actions, hardware resets, median completion time, and failure labels such as slip, missed grasp, collision gate, delay, or perception miss. Those fields tell a builder what to fix next.

Fun Note

A transfer score without a gap decomposition is a number without an address. Knowing that the policy achieved 70 percent on hardware tells you very little. Knowing that it lost 30 points relative to simulation (20 to friction mismatch, 6 to actuator delay, and 4 to perception noise) tells you exactly where to spend the next week.
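The decomposition in this note can be sketched as a count over labeled failures. The 50 trials below are hypothetical and constructed to reproduce the numbers in the note; the sketch also assumes that each hardware failure receives exactly one label and that no instance fails in simulation but succeeds on hardware, which is what lets the labeled losses sum to the gap.

```python
from collections import Counter

# Hypothetical matched trials: (sim_success, real_success, real failure label).
trials = (
    [(1, 1, None)] * 35
    + [(1, 0, "friction_mismatch")] * 10
    + [(1, 0, "actuator_delay")] * 3
    + [(1, 0, "perception_noise")] * 2
)

n = len(trials)
sim_wins = sum(s for s, _, _ in trials)
real_wins = sum(r for _, r, _ in trials)
gap_points = 100 * (sim_wins - real_wins) / n

# Count only instances that succeeded in simulation but failed on hardware.
lost = Counter(label for s, r, label in trials if s and not r)
points_by_label = {label: 100 * count / n for label, count in lost.items()}

print(f"hardware success: {100 * real_wins / n:.0f}%  gap: {gap_points:.0f} points")
for label, points in sorted(points_by_label.items(), key=lambda kv: -kv[1]):
    print(f"  {label}: {points:.0f} points")
```

In practice the labels come from human or automated failure review, so the attribution is only as reliable as the labeling protocol.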

Research Frontier

The open evaluation problem is not only better benchmarks. It is comparable evidence across robots, labs, simulators, and safety protocols. Transfer diagnostics are becoming as important as headline success rates because they show whether a method is robust, supervised, brittle, or merely lucky on the tested panel.

Self Check

Given a sim-to-real result, can you identify the task panel, policy checkpoint, success rule, trial count, intervention rate, blocked-action rate, and failure taxonomy? If not, the transfer claim is underspecified.
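The self check can be turned into a small completeness test. This is a sketch under assumed field names (`task_panel`, `policy_checkpoint`, and so on are hypothetical keys for a report dictionary, not part of any tool mentioned in this section).

```python
# Hypothetical keys for a transfer claim, mirroring the self-check question.
REQUIRED_FIELDS = (
    "task_panel",
    "policy_checkpoint",
    "success_rule",
    "trial_count",
    "intervention_rate",
    "blocked_action_rate",
    "failure_taxonomy",
)


def missing_fields(claim):
    """Return the required fields that are absent or empty in a claim.

    A rate of 0.0 counts as reported; only None and empty containers
    or strings count as missing.
    """
    return [f for f in REQUIRED_FIELDS if claim.get(f) in (None, "", [], {})]


claim = {
    "task_panel": "tabletop_pick_v2",   # illustrative values only
    "policy_checkpoint": "ckpt_a",
    "success_rule": "object lifted 5 cm for 1 s",
    "trial_count": 50,
    "intervention_rate": 0.0,
}

gaps = missing_fields(claim)
if gaps:
    print("underspecified transfer claim, missing:", gaps)
```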

The idea in this section becomes useful when the result table is built from a single evaluation artifact. The artifact names the checkpoint, simulator version, robot configuration, task panel, safety gate, success rule, raw trials, and failure labels. Without that artifact, a transfer table can be impossible to reproduce or interpret.

The graduate-level habit is to separate three claims. The performance claim says the policy solved the task. The transfer claim says the same construct was evaluated in sim and real. The robustness claim says the result survives held-out dynamics, actuator delay, sensor noise, and initial-condition shifts.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Gymnasium	Metric wrapper consistency	Use it to keep reward, termination, timeout, and info dictionaries stable across evaluation scripts.
Isaac Lab	Task panels and perturbations	Use it to evaluate held-out dynamics, actuator delay, terrain, and sensor perturbations before hardware trials.
ROS 2 bags	Hardware evidence artifact	Use them to preserve synchronized observations, commands, controller states, safety events, and timestamps.
LeRobot	Dataset packaging	Use it to store real-robot episodes with metadata, failure labels, and policy identifiers.
Statistical notebooks	Uncertainty and diagnostics	Use a single notebook to compute all compared metrics from the same artifact table.

A robust implementation starts with the result schema, not with a plotting script. The schema should force every row to contain the policy identifier, task instance, simulator or robot source, success label, safety events, failure label, and trace pointer.

Define a row schema before collecting trials.
Run simulator and hardware evaluation through adapters that emit the same schema.
Compute all metrics, including safety denominators, from that one table.
Attach trace pointers so failures can be replayed and relabeled.
Publish aggregate metrics only with trial counts, uncertainty, and the failure-label distribution.
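The steps above can be sketched as a row type plus two adapters. Everything named here is an assumption for illustration: `TrialRow`, the adapter functions, and the input keys (`task_id`, `is_success`, `rollout_path`, `outcome`, `estop_events`, `operator_label`, `bag_file`) are hypothetical and do not correspond to the actual log formats of any simulator or robot stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrialRow:
    """One evaluated trial, in the same shape for simulator and hardware."""
    policy_id: str
    task_instance: str
    source: str                    # "sim" or "real"
    success: bool
    safety_events: int
    failure_label: Optional[str]   # required for failures, forbidden for successes
    trace_pointer: str

    def __post_init__(self):
        if self.source not in ("sim", "real"):
            raise ValueError(f"unknown source: {self.source!r}")
        if self.success == (self.failure_label is not None):
            raise ValueError("failures need a label; successes must not have one")


def from_sim_record(rec, policy_id):
    """Adapter for a hypothetical simulator log record."""
    return TrialRow(
        policy_id=policy_id,
        task_instance=rec["task_id"],
        source="sim",
        success=bool(rec["is_success"]),
        safety_events=int(rec.get("constraint_violations", 0)),
        failure_label=rec.get("failure"),
        trace_pointer=rec["rollout_path"],
    )


def from_robot_record(rec, policy_id):
    """Adapter for a hypothetical hardware log record."""
    return TrialRow(
        policy_id=policy_id,
        task_instance=rec["task"],
        source="real",
        success=rec["outcome"] == "success",
        safety_events=len(rec.get("estop_events", [])),
        failure_label=rec.get("operator_label"),
        trace_pointer=rec["bag_file"],
    )


def success_rate(table, source):
    """Success rate for one source, computed from the shared table."""
    rows = [r for r in table if r.source == source]
    return sum(r.success for r in rows) / len(rows)


table = [
    from_sim_record({"task_id": "pose_01", "is_success": 1,
                     "rollout_path": "sim/pose_01.npz"}, "ckpt_a"),
    from_sim_record({"task_id": "pose_02", "is_success": 1,
                     "rollout_path": "sim/pose_02.npz"}, "ckpt_a"),
    from_robot_record({"task": "pose_01", "outcome": "success",
                       "bag_file": "bags/pose_01"}, "ckpt_a"),
    from_robot_record({"task": "pose_02", "outcome": "failure",
                       "operator_label": "slip", "estop_events": [12.4],
                       "bag_file": "bags/pose_02"}, "ckpt_a"),
]

print(success_rate(table, "sim"), success_rate(table, "real"))
```

Validating rows at construction time means a failed trial cannot enter the table without a failure label, so the label distribution always has the same denominator as the success rate.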

When a transfer result disappoints, do not stop at the aggregate gap. Split failures by perception miss, state-estimation drift, contact slip, actuator delay, controller saturation, safety gate, timeout, and evaluator disagreement. The failure distribution is often more useful than the mean score.

Evaluation Recipe

For transfer measurement, compare only construct-matched metrics that are co-computed in one pass on one configuration: same task panel, same policy checkpoint, same seed set where applicable, same perturbation suite, same safety gate, and the same success definition. Save the result as one artifact with raw trials, traces, summary statistics, videos or state logs, uncertainty, and failure labels so every number in a later table is backed by the same run.
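A minimal sketch of the construct-matching check compares the protocol fields of the simulator and hardware manifests before any metric is reported. The manifest keys and values are hypothetical, and `protocol_mismatches` is a helper defined here for illustration.

```python
# Hypothetical protocol fields that must agree between sim and real runs.
PROTOCOL_KEYS = (
    "task_panel",
    "policy_checkpoint",
    "seed_set",
    "perturbation_suite",
    "safety_gate",
    "success_rule",
)


def protocol_mismatches(sim_manifest, real_manifest):
    """Return {key: (sim_value, real_value)} for every protocol field that differs."""
    return {
        key: (sim_manifest.get(key), real_manifest.get(key))
        for key in PROTOCOL_KEYS
        if sim_manifest.get(key) != real_manifest.get(key)
    }


sim_manifest = {
    "task_panel": "tabletop_pick_v2",
    "policy_checkpoint": "ckpt_a",
    "seed_set": None,  # seeds not applicable in this illustrative protocol
    "perturbation_suite": "held_out_friction",
    "safety_gate": "workspace_box_v1",
    "success_rule": "object lifted 5 cm for 1 s",
}
# Same protocol except for a stricter hold time on hardware.
real_manifest = dict(sim_manifest, success_rule="object lifted 5 cm for 2 s")

mismatches = protocol_mismatches(sim_manifest, real_manifest)
if mismatches:
    print("metrics are not comparable:", mismatches)
```

A mismatch does not make either run invalid; it only means the two success rates measure different constructs and should not share a row in a transfer table.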

Key Takeaway

A transfer score is credible only when success, safety, interventions, perturbations, and failure labels are measured together under the same protocol.

Exercise 20.5.1

Design a result schema for a real-robot transfer evaluation. Include fields for policy checkpoint, task instance, simulator success, hardware success, intervention count, blocked actions, safety violations, trace pointer, and failure label.

What's Next?

This section turned measuring transfer performance into a testable embodied-learning contract: define the loop, choose the tool, save one comparable artifact, and diagnose failure by interface. Next, continue with Chapter 20, where the same evaluation habit carries into the next reinforcement-learning decision.

References & Further Reading

Foundational Papers, Tools, and Practice References

Tobin, J. et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS.

Demonstrates that training an object detector on simulated images with randomized textures, lighting, and camera poses forces it to learn features invariant to simulator appearance, enabling transfer to a physical robot without real training images. Read to understand the gap between visual sim-to-real and dynamics sim-to-real; this paper focuses on the visual side.

Paper

Peng, X. B. et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.

This paper shows dynamics randomization for transferring learned control policies.

Paper

Kumar, A. et al. (2021). RMA: Rapid Motor Adaptation for Legged Robots. RSS.

Introduces RMA, which pairs a base policy trained in simulation with privileged environment information and a lightweight adaptation module that estimates that information online from recent states and actions. Read Section 3 for the two-phase training procedure; RMA is one of the clearest demonstrations that explicit adaptation at inference time can outperform domain randomization alone for legged locomotion.

Paper

Tan, J. et al. (2018). Sim-to-Real: Learning Agile Locomotion for Quadruped Robots. RSS.

This work is a clear example of transferring locomotion policies from simulation to hardware.

Paper

NVIDIA Isaac Lab documentation.

NVIDIA's GPU-accelerated robot learning framework that runs thousands of parallel environments on a single GPU. Read the documentation for task configuration, domain randomization APIs, and the sim-to-real export path; massively parallel training with Isaac Lab is how locomotion and dexterous manipulation policies achieve the sample counts needed for sim-to-real transfer.

Tool

Drake documentation.

Drake is relevant when transfer work needs explicit dynamics, constraints, and system identification.

Tool