Section 20.2: What transfers and what does not | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

**Figure 20.2A**: Transfer is selective. High-level task structure may travel well, while sensing, timing, contact, and safety margins need fresh evidence on the robot. (Illustration: robot skills cross from simulation to hardware while fragile details such as friction, lighting, latency, and success thresholds are inspected at a checkpoint.)

Big Picture

Deciding what transfers and what does not is the central triage question in sim-to-real RL. Some learned structure is portable: task decomposition, contact-seeking strategies, recovery reflexes, and rough state-feedback patterns. Other parts are local to the simulator: texture cues, friction thresholds, actuator timing, reward shortcuts, and termination artifacts.

To answer that question, a sim-to-real transfer plan should record the randomized variables, the simulator assumptions, the real-world measurements, and the demonstration-learning handoff in one transfer ledger.

An RL policy should be decomposed before deployment into representation, policy structure, value estimates, reward terms, low-level actuation, and evaluation metrics. These pieces do not deserve the same transfer claim because each one couples to different simulator assumptions.

The key question is practical: which learned quantities can be reused, which must be calibrated, and which must be thrown away before the first hardware trial?

Action Is The Test

Transfer is strongest when the learned quantity describes an invariant of the task rather than an accident of the simulator. "Move until contact, then regulate force" is more portable than "move 7.5 cm because the simulated drawer opens at that displacement."
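The contrast between the two rules can be made concrete with a toy one-dimensional contact model. The sketch below is illustrative only: the spring-like force sensor, the handle positions, the stiffness, and the force threshold are all assumed values, not measurements from any robot or simulator.

```python
# Toy one-dimensional contact model. Every name and number here is hypothetical.
def make_force_sensor(contact_at_m, stiffness_n_per_m=500.0):
    """Return a function mapping gripper travel (m) to contact force (N)."""
    def read_force(travel_m):
        return max(0.0, travel_m - contact_at_m) * stiffness_n_per_m
    return read_force


def fixed_displacement_move(read_force, distance_m=0.075):
    """Simulator-specific rule: always travel 7.5 cm, whatever is felt."""
    return distance_m, read_force(distance_m)


def move_until_contact(read_force, force_threshold_n=0.75,
                       step_m=0.001, max_travel_m=0.20):
    """Task-level rule: advance in small steps until the force threshold is met."""
    max_steps = round(max_travel_m / step_m)
    for step in range(1, max_steps + 1):
        travel_m = step * step_m
        force_n = read_force(travel_m)
        if force_n >= force_threshold_n:
            return travel_m, force_n
    return max_travel_m, read_force(max_travel_m)


sim_sensor = make_force_sensor(contact_at_m=0.073)   # assumed handle position in sim
real_sensor = make_force_sensor(contact_at_m=0.090)  # assumed handle position on hardware

for label, sensor in [("sim", sim_sensor), ("real", real_sensor)]:
    _, fixed_force = fixed_displacement_move(sensor)
    guarded_travel, guarded_force = move_until_contact(sensor)
    print(f"{label}: fixed move force={fixed_force:.2f} N, "
          f"guarded move stopped at {guarded_travel * 100:.1f} cm "
          f"with {guarded_force:.2f} N")
```

In this toy setting the fixed 7.5 cm move never touches the handle once it sits 1.7 cm further away, while the guarded move still ends in contact. The guarded rule depends on a force threshold that would itself need calibration on hardware.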

Theory

Let a policy be decomposed as $\pi(a_t\mid z_t)$, where $z_t=f(o_{0:t})$ is the learned state representation. The representation $f$ may transfer if it captures geometry, contact phase, or goal relation. The action distribution $\pi$ may fail if the target robot has different torque limits, command latency, backlash, or compliance.
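A minimal sketch of this split, assuming a trivial encoder and a proportional policy head as stand-ins for learned networks: the same $f$ and $\pi$ are rolled out against two hypothetical actuator models that differ only in torque limit and command latency. The class `ActuatorModel` and all numeric values are assumptions made for illustration.

```python
from collections import deque


def encode(observation_history):
    """Stand-in for z_t = f(o_{0:t}): here, the latest handle offset in metres."""
    return observation_history[-1]


def policy_head(z, gain_nm_per_m=40.0):
    """Stand-in for the mean of pi(a_t | z_t): torque proportional to the offset."""
    return gain_nm_per_m * z


class ActuatorModel:
    """Hypothetical actuator: saturates torque and delays commands by whole steps."""

    def __init__(self, torque_limit_nm, delay_steps):
        self.torque_limit_nm = torque_limit_nm
        # Pre-filled zeros mean the first `delay_steps` applied torques are zero.
        self.pending = deque([0.0] * delay_steps)

    def apply(self, commanded_nm):
        limit = self.torque_limit_nm
        self.pending.append(max(-limit, min(limit, commanded_nm)))
        return self.pending.popleft()


def rollout(offsets_m, actuator):
    """Run the same encoder and policy head against a given actuator model."""
    history, applied = [], []
    for offset in offsets_m:
        history.append(offset)
        applied.append(actuator.apply(policy_head(encode(history))))
    return applied


offsets_m = [0.10, 0.08, 0.02]
sim_torques = rollout(offsets_m, ActuatorModel(torque_limit_nm=5.0, delay_steps=0))
real_torques = rollout(offsets_m, ActuatorModel(torque_limit_nm=2.0, delay_steps=1))
print("sim :", sim_torques)
print("real:", real_torques)
```

The representation and the policy head are identical in both rollouts. Only the applied torques differ, which is why the action path needs its own transfer claim.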

Reward models transfer even less automatically. A simulator reward can be dense, clean, and privileged, while the real robot only exposes sparse events, noisy force estimates, delayed vision, and safety interlocks. A reward term that was useful for training may be invalid as an evaluation metric.
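The gap between a training reward and an evaluation metric can be sketched as two separate functions. The function names, the target angle, the limit switch, and the force limit below are assumptions chosen for illustration, not quantities from a specific robot.

```python
def sim_training_reward(privileged_hinge_angle_rad, target_angle_rad=1.2):
    """Dense simulator reward: uses the true hinge angle, which hardware may not expose."""
    return -abs(target_angle_rad - privileged_hinge_angle_rad)


def hardware_success(limit_switch_closed, peak_force_n, force_limit_n=30.0):
    """Sparse hardware metric: the switch closed and the force limit was never exceeded."""
    return bool(limit_switch_closed) and peak_force_n <= force_limit_n


# Hypothetical trial: nearly perfect by the simulator reward, but too forceful.
trial = {
    "privileged_hinge_angle_rad": 1.19,
    "limit_switch_closed": True,
    "peak_force_n": 42.0,
}

print("simulator reward:", round(sim_training_reward(trial["privileged_hinge_angle_rad"]), 3))
print("hardware success:", hardware_success(trial["limit_switch_closed"], trial["peak_force_n"]))
```

A trial can score well on the dense reward and still fail the hardware metric, so the two should be reported separately.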

Mechanism

The mechanism is a transfer ledger. Put each quantity into one of four buckets: reuse, recalibrate, constrain, or discard. Reuse task-level structure, recalibrate dynamics and sensors, constrain unsafe actions, and discard simulator-only rewards or privileged state channels.

Worked Example

Code Fragment 20.2.1 builds a small transfer ledger for a drawer-opening policy. The output distinguishes reusable task structure from components that need hardware calibration or safety constraints.

# Classify policy components before hardware deployment.
# The ledger prevents simulator-only conveniences from masquerading as transfer.
components = {
    "contact_phase_detector": "reuse",
    "camera_exposure_threshold": "recalibrate",
    "maximum_pull_force": "constrain",
    "privileged_hinge_angle_reward": "discard",
}

for name, decision in components.items():
    print(f"{name}: {decision}")

contact_phase_detector: reuse
camera_exposure_threshold: recalibrate
maximum_pull_force: constrain
privileged_hinge_angle_reward: discard

Code Fragment 20.2.1 classifies contact_phase_detector, camera_exposure_threshold, maximum_pull_force, and privileged_hinge_angle_reward into transfer decisions. The ledger makes the deployment review concrete before any hardware trial begins.

Expected output: each policy component has a deployment decision. A transfer plan that says "deploy the policy" without this inventory hides the most important engineering choices.
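One possible hardening of the ledger, sketched here as an assumption and not as part of Code Fragment 20.2.1, is to restrict decisions to the four buckets with an `Enum`. A vague entry such as "deploy" is then rejected instead of silently recorded. The names `Decision` and `group_by_decision` are hypothetical.

```python
from collections import defaultdict
from enum import Enum


class Decision(Enum):
    REUSE = "reuse"
    RECALIBRATE = "recalibrate"
    CONSTRAIN = "constrain"
    DISCARD = "discard"


def group_by_decision(components):
    """Group component names by bucket; an unknown bucket raises ValueError."""
    grouped = defaultdict(list)
    for name, decision in components.items():
        grouped[Decision(decision)].append(name)
    return dict(grouped)


components = {
    "contact_phase_detector": "reuse",
    "camera_exposure_threshold": "recalibrate",
    "maximum_pull_force": "constrain",
    "privileged_hinge_angle_reward": "discard",
}

for decision, names in group_by_decision(components).items():
    print(f"{decision.value}: {', '.join(names)}")

try:
    group_by_decision({"whole_policy": "deploy"})
except ValueError as error:
    print("rejected:", error)
```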

Library Shortcut

In practical RL stacks, Gymnasium, Isaac Lab, Stable-Baselines3, RSL-RL, and rl_games can preserve policy checkpoints and rollout metadata. They do not decide what transfers. The builder still has to audit observation channels, action scaling, reward definitions, termination rules, and safety limits.
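A minimal sketch of such an audit, using plain dictionaries as stand-ins for whatever interface description the training stack exports. It does not call any Gymnasium or Isaac Lab API. The field names and values are hypothetical.

```python
def audit_interface(training_spec, deployment_spec):
    """Return (field, trained_value, deployed_value) for every field that differs."""
    mismatches = []
    for key in sorted(set(training_spec) | set(deployment_spec)):
        trained, deployed = training_spec.get(key), deployment_spec.get(key)
        if trained != deployed:
            mismatches.append((key, trained, deployed))
    return mismatches


# Hypothetical interface descriptions for the simulator and the robot.
training_spec = {
    "observation_keys": ("joint_positions", "wrist_force", "hinge_angle"),
    "action_scale_nm": 5.0,
    "control_rate_hz": 50,
    "max_episode_steps": 500,
}
deployment_spec = {
    "observation_keys": ("joint_positions", "wrist_force"),
    "action_scale_nm": 2.0,
    "control_rate_hz": 30,
    "max_episode_steps": 500,
}

for field, trained, deployed in audit_interface(training_spec, deployment_spec):
    print(f"{field}: trained with {trained}, deploying with {deployed}")
```

Each reported mismatch is a candidate ledger row: the differing field is either recalibrated, constrained, or removed before the first hardware trial.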

Practical Recipe

List the learned components: encoder, memory state, policy head, value head, reward terms, termination logic, and controller interface.
Classify each component as reuse, recalibrate, constrain, or discard.
Remove privileged simulator inputs from the deployed observation path.
Replace simulator rewards with hardware-measurable evaluation metrics.
Run a small hardware gate for every component marked reuse, especially if it touches force, contact, or delay.
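The step that removes privileged inputs can be sketched as an allow-list filter. The contract contents and channel names below are assumptions for illustration; a real contract would list the channels the robot actually publishes.

```python
# Hypothetical contract: the only channels the deployed policy may read.
HARDWARE_OBSERVATION_CONTRACT = frozenset(
    {"joint_positions", "joint_velocities", "wrist_force"}
)


def to_hardware_observation(sim_observation):
    """Keep contracted channels only and report which channels were dropped."""
    keys = set(sim_observation)
    missing = sorted(HARDWARE_OBSERVATION_CONTRACT - keys)
    if missing:
        raise KeyError(f"observation is missing contracted channels: {missing}")
    dropped = sorted(keys - HARDWARE_OBSERVATION_CONTRACT)
    deployed = {key: sim_observation[key] for key in sorted(HARDWARE_OBSERVATION_CONTRACT)}
    return deployed, dropped


sim_observation = {
    "joint_positions": [0.1, -0.4],
    "joint_velocities": [0.0, 0.2],
    "wrist_force": 3.5,
    "hinge_angle": 0.8,        # privileged: read from simulator state
    "handle_friction": 0.35,   # privileged: a randomized simulator parameter
}

deployed, dropped = to_hardware_observation(sim_observation)
print("deployed channels:", list(deployed))
print("dropped channels:", dropped)
```

An allow-list fails closed: a new simulator channel is excluded by default instead of leaking into deployment because nobody remembered to block it.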

Common Failure Mode

The common mistake is to assume that a transferable representation implies a transferable controller. A vision encoder may localize the handle well while the learned torque policy still fails because the real actuator saturates or its commands arrive late.

Practical Example

For a legged robot, gait phase and foot-contact reflexes may transfer, but ground friction, motor heating, actuator delay, and fall recovery thresholds must be revalidated. The team should not report one "sim-to-real score" until it can show which parts were reused and which were calibrated on hardware.

Fun Note

The simulator is a very convincing liar. It gets contact forces wrong, friction wrong, actuator delay wrong, and camera noise wrong, all at once, all plausibly. The skill that transfers is not the policy. It is knowing which parts of the policy to trust.

Research Frontier

A major research direction is modular transfer: reuse the pieces that capture task invariants, then wrap them with calibrated adapters, safety filters, or residual controllers. The open question is how to identify those transferable pieces before expensive hardware trials.
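One way to picture this wrapping, offered as a sketch and not as a published method, is a reused base action plus a bounded residual, passed through an affine calibration and a final safety clip. The function name `modular_action` and every parameter value are assumptions.

```python
def clip(value, lower, upper):
    return max(lower, min(upper, value))


def modular_action(base_action, residual_action, gain=1.0, offset=0.0,
                   action_limit=1.0, residual_limit=0.2):
    """Compose a reused base action with a bounded residual, calibration, and a safety clip."""
    residual = clip(residual_action, -residual_limit, residual_limit)
    calibrated = gain * (base_action + residual) + offset  # hardware-fitted adapter
    return clip(calibrated, -action_limit, action_limit)   # safety filter applied last


# The residual is bounded first, so it cannot overwhelm the reused base policy.
print(modular_action(base_action=0.6, residual_action=0.5, gain=1.5))
print(modular_action(base_action=0.2, residual_action=-0.05, gain=1.5))
```

The ordering is a design choice: bounding the residual limits how far hardware corrections can drift from the reused behavior, and clipping last keeps the safety limit independent of the calibration.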

Self Check

For a simulated grasping policy, identify one component you would reuse, one you would recalibrate, one you would constrain, and one you would discard. What hardware evidence would justify each decision?

The idea in this section becomes useful when each learned quantity has an explicit deployment status. A policy checkpoint is not a single artifact from a transfer perspective. It contains representations, action distributions, value estimates, normalization statistics, and assumptions about reward and termination.
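Normalization statistics are an easy component to overlook. The sketch below assumes a force channel that was normalized with simulator statistics and then fed biased hardware readings. The mean, standard deviation, clip value, and readings are all made-up numbers.

```python
def normalize(value, mean, std, clip_at=5.0):
    """Standardize a reading with stored statistics, then clip the result."""
    z = (value - mean) / std
    return max(-clip_at, min(clip_at, z))


def fraction_clipped(values, mean, std, clip_at=5.0):
    """Fraction of readings that land on or beyond the clip boundary."""
    clipped = [abs((value - mean) / std) >= clip_at for value in values]
    return sum(clipped) / len(values)


sim_mean_n, sim_std_n = 2.0, 0.5                  # statistics saved with the checkpoint
hardware_readings_n = [6.0, 6.5, 2.1, 7.0]        # hypothetical sensor with a bias

print([normalize(r, sim_mean_n, sim_std_n) for r in hardware_readings_n])
print("fraction clipped:", fraction_clipped(hardware_readings_n, sim_mean_n, sim_std_n))
```

When most readings saturate at the clip boundary, the policy sees almost no variation on that channel, which marks the statistics as a recalibrate item and not a reuse item.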

The graduate-level habit is to separate four claims. The invariance claim says what remains true across sim and real. The calibration claim says what must be measured on hardware. The safety claim says what must be constrained before exploration. The evidence claim says which hardware gate proves the decision was justified.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Gymnasium	Interface compatibility	Use it to check whether the observation and action spaces match between training and evaluation wrappers.
Stable-Baselines3	Policy and normalization artifacts	Use it carefully, because normalization statistics and wrappers are part of what transfers or fails.
RSL-RL	High-throughput locomotion training	Use it when testing which locomotion behaviors survive actuator and terrain changes.
ROS 2	Hardware interface validation	Use it to verify that action scaling, timing, and safety interlocks match the policy assumptions.
LeRobot	Robot data and policy packaging	Use it to keep datasets, policies, and evaluation metadata tied together during transfer reviews.

A robust implementation starts with a transfer ledger and a hardware gate. The ledger records the deployment decision for each component. The gate records the smallest hardware test that can falsify that decision.

Create a component ledger before loading the trained policy on hardware.
Attach a validation gate to every reuse and recalibrate decision.
Prohibit privileged simulator channels from the hardware observation contract.
Save the ledger, gate result, trace, and failure label in one artifact.
Report transfer only for components whose evidence was collected under the same robot protocol.
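The checklist above can be sketched as one small data structure that refuses to serialize a reuse or recalibrate decision without a gate. The `LedgerEntry` fields, the gate descriptions, and the trace values are hypothetical, and the JSON layout is one possible choice.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LedgerEntry:
    component: str
    decision: str                        # reuse, recalibrate, constrain, or discard
    gate: str = ""                       # smallest hardware test that could falsify it
    gate_passed: Optional[bool] = None   # None until the gate has been run


def build_artifact(entries, trace, failure_label=None):
    """Bundle ledger, gate results, trace, and failure label into one JSON string."""
    for entry in entries:
        if entry.decision in ("reuse", "recalibrate") and not entry.gate:
            raise ValueError(f"{entry.component}: '{entry.decision}' needs a hardware gate")
    record = {
        "ledger": [asdict(entry) for entry in entries],
        "trace": trace,
        "failure_label": failure_label,
    }
    return json.dumps(record, indent=2, sort_keys=True)


entries = [
    LedgerEntry("contact_phase_detector", "reuse",
                gate="10 guarded touches, phase label matches force trace", gate_passed=True),
    LedgerEntry("camera_exposure_threshold", "recalibrate",
                gate="handle detected under lab lighting sweep", gate_passed=False),
    LedgerEntry("privileged_hinge_angle_reward", "discard"),
]
artifact = build_artifact(entries, trace=[0.0, 0.4, 1.1], failure_label="exposure_drift")
print(artifact)
```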

When transfer fails, ask which ledger decision was wrong. A reuse failure means the task invariant was overestimated. A recalibration failure means the measurement protocol was too weak. A constrain failure means the safety envelope did not cover the robot's actual behavior. A discard failure means a simulator-only shortcut leaked into deployment.
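This diagnosis rule is simple enough to encode as a lookup. A sketch with hypothetical names:

```python
# Each ledger decision maps to the assumption that most likely failed.
DIAGNOSIS = {
    "reuse": "the task invariant was overestimated",
    "recalibrate": "the measurement protocol was too weak",
    "constrain": "the safety envelope did not cover the robot's actual behavior",
    "discard": "a simulator-only shortcut leaked into deployment",
}


def diagnose(ledger, failed_component):
    """Explain a failure in terms of the ledger decision recorded for the component."""
    decision = ledger[failed_component]
    return f"{failed_component} was marked '{decision}': {DIAGNOSIS[decision]}"


ledger = {
    "contact_phase_detector": "reuse",
    "maximum_pull_force": "constrain",
}
print(diagnose(ledger, "maximum_pull_force"))
```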

Evaluation Recipe

For transfer inventories, compare only metrics that measure the same construct and were computed in one pass on one configuration: the same policy checkpoint, wrapper stack, normalization statistics, hardware gate, and success definition. Save the ledger and rollout traces together so every "transfers" claim is backed by the same run.
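One way to enforce the single-configuration rule, sketched here under assumed field names, is to fingerprint the configuration and refuse to compare runs whose fingerprints differ.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a configuration dict in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a, run_b):
    """Two runs are comparable only if their configurations hash identically."""
    return config_fingerprint(run_a["config"]) == config_fingerprint(run_b["config"])


# Hypothetical run records; the file names and gate name are placeholders.
base_config = {
    "checkpoint": "policy_step_400k",
    "wrappers": ["clip_action", "normalize_observation"],
    "normalization_stats": "stats_v3",
    "hardware_gate": "guarded_touch_x10",
    "success_definition": "limit_switch_and_force_below_30N",
}
run_a = {"config": base_config, "success_rate": 0.8}
run_b = {"config": dict(reversed(list(base_config.items()))), "success_rate": 0.7}
run_c = {"config": {**base_config, "normalization_stats": "stats_v4"}, "success_rate": 0.9}

print("a vs b comparable:", comparable(run_a, run_b))
print("a vs c comparable:", comparable(run_a, run_c))
```

Storing the fingerprint next to each metric makes a mismatched comparison visible at review time instead of after a misleading result has been reported.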

Key Takeaway

Good sim-to-real engineering does not ask whether "the policy transfers" as one block. It asks which learned quantities transfer, which need calibration, which require constraints, and which must be removed.

Exercise 20.2.1

Take a policy trained with privileged simulator state and write a four-row transfer ledger: reuse, recalibrate, constrain, and discard. For each row, name the hardware gate that would validate the decision.

What's Next?

This section turned the question of what transfers and what does not into a testable embodied-learning contract: define the loop, choose the tool, save one comparable artifact, and diagnose failure by interface. Next, continue with Section 20.3, where the same evaluation habit carries into the next reinforcement-learning decision.

References & Further Reading

Foundational Papers, Tools, and Practice References

Tobin, J. et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS.

Demonstrates that an object detector trained only on simulated images with randomized textures, lighting, and camera pose can localize objects accurately enough on a physical robot for grasping, without fine-tuning on real images. Read to understand the gap between visual sim-to-real and dynamics sim-to-real; this paper focuses on the visual side.

Paper

Peng, X. B. et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.

This paper shows dynamics randomization for transferring learned control policies.

Paper

Kumar, A. et al. (2021). RMA: Rapid Motor Adaptation for Legged Robots. RSS.

Introduces RMA, which separates a base policy that receives an encoding of privileged environment parameters during simulation training from a lightweight adaptation module. The adaptation module is also trained in simulation, to estimate that encoding from recent proprioceptive history, and then runs online on the robot. Read Section 3 for the two-phase training procedure; RMA is one of the clearest demonstrations that explicit adaptation at inference time can outperform domain randomization alone for legged locomotion.

Paper

Tan, J. et al. (2018). Sim-to-Real: Learning Agile Locomotion for Quadruped Robots. RSS.

This work is a clear example of transferring locomotion policies from simulation to hardware.

Paper

NVIDIA Isaac Lab documentation.

NVIDIA's GPU-accelerated robot learning framework that runs thousands of parallel environments on a single GPU. Read the documentation for task configuration, domain randomization APIs, and the sim-to-real export path; massively parallel training with Isaac Lab is how locomotion and dexterous manipulation policies achieve the sample counts needed for sim-to-real transfer.

Tool

Drake documentation.

Drake is relevant when transfer work needs explicit dynamics, constraints, and system identification.

Tool