Section 13.3: Curriculum and automatic randomization | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 13.3A: Automatic domain randomization (ADR) as an adaptive loop: the randomization range expands when the policy succeeds and contracts when it fails, maintaining a difficulty level just above the current skill horizon.

Big Picture

Curriculum and automatic randomization solve a pacing problem. If the simulator starts too easy, the policy memorizes a narrow slice of the world; if it starts too hard, the policy never discovers useful behavior. A curriculum widens the domain as competence grows, while automatic randomization adjusts the domain to keep training informative.

For curriculum and automatic randomization, the transfer argument should name which simulator gap is randomized, which real variable it approximates, and which evaluation panel checks whether transfer improved.

What This Section Builds

This section makes curriculum and automatic randomization operational. It explains how to expand ranges only when the policy is stable enough to benefit, and how to shrink or rebalance ranges when training produces only failures.

The goal is to record the difficulty schedule as part of the experiment, not as an invisible training convenience. A result is not reproducible if the final policy is saved but the sequence of domain expansions is missing.

Transfer Is The Test

Automatic randomization is a feedback controller over the training distribution. Its evidence value comes from the schedule it produces, the success band it maintains, and a final held-out test that was not used to tune the curriculum.

Theory

A curriculum defines a time-indexed training distribution $p_k(\theta)$, where $k$ is the training phase and $\theta$ contains randomized domain parameters. Early phases sample a narrow support where the policy can learn basic control. Later phases widen the support toward the deployment envelope.
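As a minimal sketch, assume each phase's support is an axis-aligned box that is linearly interpolated from a narrow starting box to a deployment envelope. Then $p_k(\theta)$ is a uniform sampler over phase-dependent bounds. The factor names (`friction`, `mass_scale`), the numeric bounds, and the linear schedule are illustrative assumptions, not values from this chapter.

```python
import random

# Hypothetical factors and bounds, chosen only for illustration.
START = {"friction": (0.45, 0.55), "mass_scale": (0.95, 1.05)}   # narrow support for phase 0
DEPLOY = {"friction": (0.20, 1.00), "mass_scale": (0.70, 1.30)}  # assumed deployment envelope


def phase_bounds(k, num_phases):
    """Support of p_k(theta): linear interpolation from START (k = 0) to DEPLOY (k = num_phases - 1)."""
    if num_phases < 2 or not 0 <= k < num_phases:
        raise ValueError("need num_phases >= 2 and 0 <= k < num_phases")
    alpha = k / (num_phases - 1)
    bounds = {}
    for name, (lo0, hi0) in START.items():
        lo1, hi1 = DEPLOY[name]
        bounds[name] = (lo0 + alpha * (lo1 - lo0), hi0 + alpha * (hi1 - hi0))
    return bounds


def sample_theta(k, num_phases, rng):
    """Draw one domain-parameter vector theta ~ p_k(theta), uniform inside the phase-k box."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in phase_bounds(k, num_phases).items()}


rng = random.Random(0)
for k in range(4):
    rounded = {n: (round(lo, 3), round(hi, 3)) for n, (lo, hi) in phase_bounds(k, 4).items()}
    print("phase", k, rounded)
print("sample from phase 3:", sample_theta(3, 4, rng))
```

Because each box contains the previous one, the supports are nested, which matches the idea of widening toward the deployment envelope. In a real schedule, $k$ would advance on a competence signal rather than on a fixed step count.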

Automatic domain randomization closes the loop: if rolling success is above the target band, expand the domain; if it is below the band, hold or narrow the domain; if failures cluster in one factor, rebalance that factor rather than widening everything. The design resembles an outer-loop controller whose plant is the learning process itself.
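The three-way rule plus factor rebalancing can be sketched as follows. This is a hedged illustration, not a reference ADR implementation. It assumes that every factor's range is expressed in normalized units on [0, 1] and that each failure episode has already been attributed to one factor. The band, step, minimum width, and "cluster" share (60 percent of failures) are placeholder values.

```python
def widen(bounds, step):
    """Expand a normalized range by `step` on each side, clipped to [0, 1]."""
    lo, hi = bounds
    return (max(0.0, lo - step), min(1.0, hi + step))


def narrow(bounds, step, min_width):
    """Shrink a range toward its center by `step` per side, never below `min_width` and never wider than before."""
    lo, hi = bounds
    center = (lo + hi) / 2
    half = max(min_width / 2, (hi - lo) / 2 - step)
    half = min(half, (hi - lo) / 2)
    return (center - half, center + half)


def dominant_factor(failure_counts, cluster_share):
    """Return the factor that accounts for at least `cluster_share` of failures, else None."""
    total = sum(failure_counts.values())
    if total == 0:
        return None
    name, count = max(failure_counts.items(), key=lambda item: item[1])
    return name if count / total >= cluster_share else None


def adr_update(ranges, rolling_success, failure_counts,
               band=(0.55, 0.80), step=0.05, min_width=0.05, cluster_share=0.6):
    """Expand above the band, narrow only the culprit factor below it, hold inside it."""
    band_low, band_high = band
    culprit = dominant_factor(failure_counts, cluster_share)
    new_ranges = dict(ranges)
    if rolling_success > band_high:
        # Widen every factor except one that still dominates the failure labels.
        for name, bounds in ranges.items():
            if name != culprit:
                new_ranges[name] = widen(bounds, step)
    elif rolling_success < band_low and culprit is not None:
        # Rebalance the clustered factor instead of shrinking everything.
        new_ranges[culprit] = narrow(ranges[culprit], step, min_width)
    # Inside the band, or below it with no clear culprit: hold the current ranges.
    return new_ranges, culprit


ranges = {"friction": (0.4, 0.6), "delay": (0.4, 0.6)}
print(adr_update(ranges, 0.86, {"friction": 1, "delay": 9}))
print(adr_update(ranges, 0.40, {"friction": 1, "delay": 9}))
print(adr_update(ranges, 0.40, {"friction": 5, "delay": 5}))
```

The choice to hold, rather than narrow everything, when failures are spread evenly is one option among several. Narrowing all factors is the other obvious choice, at the cost of discarding mastered coverage.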

Mechanism

The mechanism is difficulty regulation. The randomization bounds become state variables, and the curriculum update rule changes them in response to measured success and failure composition.

Worked Example

The following snippet shows a small automatic randomization update. It expands the friction range when rolling success is high, keeps it fixed inside the target band, and narrows it when the learner is failing too often.

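```python
# Update one curriculum range from rolling success.
# The goal is to keep training challenging without destroying exploration.
def update_friction_range(current_range, rolling_success):
    low, high = current_range
    if rolling_success > 0.80:
        # Mastered: widen both ends by 0.05, clipped to the limits [0.10, 1.00].
        return (max(0.10, low - 0.05), min(1.00, high + 0.05))
    if rolling_success < 0.45:
        # Failing too often: shrink toward the center, capping the half-width at 0.10.
        # Taking the min means an already-narrow range is never widened by this branch.
        center = (low + high) / 2
        half_width = min((high - low) / 2, 0.10)
        return (center - half_width, center + half_width)
    # Inside the target band: keep the current range.
    return current_range


for success in [0.88, 0.63, 0.32]:
    print(success, update_friction_range((0.35, 0.65), success))
```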

```text
0.88 (0.3, 0.7000000000000001)
0.63 (0.35, 0.65)
0.32 (0.4, 0.6)
```

Code Fragment 1: The update_friction_range function turns rolling success into a curriculum decision. High success expands the range, mid-band success keeps it fixed, and low success narrows it so the policy can recover useful behavior.

Library Shortcut

The from-scratch fragment is for understanding the update rule. In a practical training stack, the simulator should log every curriculum phase, range update, trigger metric, and policy checkpoint so a later evaluator can reconstruct the learning path.
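The following is a minimal sketch of such a log. It assumes one JSON object per range update, appended in step order. The field names, the checkpoint naming, and the use of an in-memory buffer (in place of a file the training stack would write) are illustrative assumptions. Replaying the log reconstructs the range set at any training step, which is what a later evaluator needs.

```python
import io
import json


def log_update(stream, step, phase, factor, new_range, trigger_metric, trigger_value, checkpoint):
    """Append one curriculum event as a JSON line; events must be written in increasing step order."""
    record = {
        "step": step,
        "phase": phase,
        "factor": factor,
        "new_range": list(new_range),
        "trigger_metric": trigger_metric,
        "trigger_value": trigger_value,
        "checkpoint": checkpoint,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def ranges_at(log_text, step, initial_ranges):
    """Replay the log to reconstruct every factor's range as of `step`."""
    ranges = dict(initial_ranges)
    for line in log_text.splitlines():
        record = json.loads(line)
        if record["step"] > step:
            break  # relies on the step-ordered write contract above
        ranges[record["factor"]] = tuple(record["new_range"])
    return ranges


log = io.StringIO()
initial = {"friction": (0.35, 0.65)}
# Hypothetical events; trigger values and checkpoint names are placeholders.
log_update(log, 1000, 1, "friction", (0.30, 0.70), "rolling_success", 0.88, "ckpt_001000")
log_update(log, 2000, 2, "friction", (0.25, 0.75), "rolling_success", 0.84, "ckpt_002000")
print(ranges_at(log.getvalue(), 1500, initial))
```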

Practical Recipe

Choose a small set of curriculum-controlled factors, such as goal distance, friction span, clutter count, or lighting range.
Define a target success band before training, for example 55 to 80 percent over the most recent evaluation window.
Expand only the factors whose current band is mastered, and rebalance factors that dominate failure labels.
Freeze a final transfer panel that the curriculum controller never sees.
Save the phase schedule, range updates, trigger metrics, and final policy checkpoint together.
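The rolling success signal and the pre-registered band from this recipe can be sketched with a fixed-length window. The window length, the band values (taken from the 55 to 80 percent example above), and the decision labels are assumptions for illustration. The controller declines to act until the window is full, so early noise does not trigger an expansion.

```python
from collections import deque

TARGET_BAND = (0.55, 0.80)  # fixed before training, per the recipe


class RollingSuccess:
    """Success rate over the most recent `window` evaluation episodes."""

    def __init__(self, window):
        self.outcomes = deque(maxlen=window)  # oldest outcome is dropped automatically

    def add(self, success):
        self.outcomes.append(1 if success else 0)

    def full(self):
        return len(self.outcomes) == self.outcomes.maxlen

    def rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None


def band_decision(tracker, band=TARGET_BAND):
    """Map the rolling success rate to a curriculum action label."""
    if not tracker.full():
        return "wait"  # do not act on a partial window
    rate = tracker.rate()
    if rate > band[1]:
        return "expand"
    if rate < band[0]:
        return "narrow_or_rebalance"
    return "hold"


tracker = RollingSuccess(window=20)
for outcome in [True] * 15 + [False] * 5:
    tracker.add(outcome)
print(tracker.rate(), band_decision(tracker))  # prints: 0.75 hold
```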

Randomization Evidence Rule

A curriculum is evidence only when its update rule, trigger metric, phase schedule, held-out real measurements, and failure labels are saved. Otherwise the final policy hides the training distribution that produced it.

Common Failure Mode

The common mistake is curriculum leakage. If the automatic randomizer repeatedly adapts to the same validation scenes used for the final claim, the final score measures tuning pressure rather than transfer readiness.
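The following sketch makes the leakage rule mechanical, assuming that every evaluation scene has a stable string ID. It fingerprints the held-out panel when the panel is frozen, then refuses to report a final score if the panel changed or if any panel scene also appears in the controller's evaluation set. The scene IDs and function names are hypothetical.

```python
import hashlib


def panel_fingerprint(scene_ids):
    """Order-independent SHA-256 over held-out scene IDs, recorded when the panel is frozen."""
    digest = hashlib.sha256()
    for scene_id in sorted(scene_ids):
        digest.update(scene_id.encode("utf-8"))
        digest.update(b"\n")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return digest.hexdigest()


def check_no_leakage(controller_scene_ids, final_panel_ids, frozen_fingerprint):
    """Raise if the panel changed after freezing or overlaps the scenes the controller adapted to."""
    if panel_fingerprint(final_panel_ids) != frozen_fingerprint:
        raise ValueError("final panel differs from the panel frozen before training")
    overlap = set(controller_scene_ids) & set(final_panel_ids)
    if overlap:
        raise ValueError(f"curriculum leakage: controller saw panel scenes {sorted(overlap)}")
    return True


panel = ["real_scene_07", "real_scene_12", "real_scene_31"]  # hypothetical IDs
frozen = panel_fingerprint(panel)
print(check_no_leakage(["sim_val_01", "sim_val_02"], panel, frozen))
```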

Practical Example

A quadruped team might begin on flat terrain with mild friction variation, then add slopes, payload changes, delay, and rough patches only after stable locomotion appears. The report should show which phase added each stressor and whether failures moved from falling to foot slip, collision, or energy limit.
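The reporting requirement in this example can be sketched as two small helpers. The first lists the stressors active in each phase. The second turns labeled failure episodes into per-phase failure fractions, so a reader can see failures migrate from falling toward foot slip or energy limits. The phase names, stressor names, and episode records below are hypothetical placeholders, not data from a real robot.

```python
from collections import Counter

# Hypothetical phase schedule: (phase name, stressors added in that phase).
PHASES = [
    ("flat", ["mild_friction_variation"]),
    ("slopes", ["slopes"]),
    ("payload", ["payload_changes"]),
    ("delay", ["actuation_delay"]),
    ("rough", ["rough_patches"]),
]


def active_stressors(phase_index):
    """All stressors enabled by the end of the given phase (each phase keeps earlier ones)."""
    active = []
    for _, added in PHASES[: phase_index + 1]:
        active.extend(added)
    return active


def failure_mix(episodes):
    """episodes: iterable of (phase, failure_label or None). Returns per-phase failure fractions."""
    counts = {}
    for phase, label in episodes:
        if label is not None:
            counts.setdefault(phase, Counter())[label] += 1
    return {
        phase: {label: n / sum(c.values()) for label, n in c.items()}
        for phase, c in counts.items()
    }


# Placeholder records only, to show the output shape.
episodes = [("flat", "fall"), ("flat", None), ("slopes", "foot_slip"), ("slopes", "fall")]
print(active_stressors(2))
print(failure_mix(episodes))
```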

Memory Hook

A curriculum is a coach, not a confetti cannon. It adds difficulty when the learner is ready to learn from it.

Research Frontier

The research frontier is shifting toward curricula that adapt per failure mode instead of expanding all ranges uniformly. The open question is how to choose updates that improve real transfer while avoiding overfitting to the simulator's own difficulty signals.

Paper Spotlight

"Solving Rubik's Cube with a Robot Hand" (OpenAI et al., arXiv, 2019) demonstrates Automatic Domain Randomization (ADR) at scale on a Shadow Dexterous Hand. ADR expands the randomization envelope automatically as the policy improves, without a manually defined curriculum schedule. The key finding is that sufficiently wide randomization reduces reliance on precise system identification, because the policy learns to handle a broad family of plausible physical configurations rather than only the measured one.

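The paper's ADR update can be approximated by the simplified sketch below. As we understand the method, each randomized parameter keeps a lower and an upper boundary that start at a nominal value. Some episodes pin one parameter to one boundary, and the resulting performance is buffered per boundary. Once a buffer is full, its mean pushes that boundary outward (above a high threshold) or inward (below a low threshold). The buffer size, thresholds, step size, boundary probability, and the toy `evaluate` callback are placeholder assumptions, not the paper's settings.

```python
import random


class ADRParameter:
    """One randomized parameter with separate lower and upper boundaries and performance buffers."""

    def __init__(self, nominal, step, lower_limit, upper_limit):
        self.low = nominal   # start from a single nominal value (zero-width range)
        self.high = nominal
        self.step = step
        self.lower_limit = lower_limit
        self.upper_limit = upper_limit
        self.buffers = {"low": [], "high": []}


def adr_step(params, evaluate, rng, buffer_size=10, t_low=0.4, t_high=0.8, boundary_prob=0.5):
    """One iteration: sample an environment and, sometimes, test one boundary."""
    values = {name: rng.uniform(p.low, p.high) for name, p in params.items()}
    if rng.random() >= boundary_prob:
        return values  # ordinary training episode; no boundary bookkeeping
    name = rng.choice(sorted(params))
    p = params[name]
    side = rng.choice(["low", "high"])
    values[name] = p.low if side == "low" else p.high
    buffer = p.buffers[side]
    buffer.append(evaluate(values))
    if len(buffer) >= buffer_size:
        mean = sum(buffer) / len(buffer)
        buffer.clear()
        if mean > t_high:      # boundary mastered: push it outward
            if side == "low":
                p.low = max(p.lower_limit, p.low - p.step)
            else:
                p.high = min(p.upper_limit, p.high + p.step)
        elif mean < t_low:     # boundary too hard: pull it inward, never past the other side
            if side == "low":
                p.low = min(p.high, p.low + p.step)
            else:
                p.high = max(p.low, p.high - p.step)
    return values


params = {"friction": ADRParameter(nominal=1.0, step=0.1, lower_limit=0.5, upper_limit=1.5)}
rng = random.Random(0)
for _ in range(2000):
    adr_step(params, evaluate=lambda v: 1.0, rng=rng)  # toy policy that always succeeds
print(params["friction"].low, params["friction"].high)
```

With the toy always-successful policy, both boundaries drift outward until they reach the hard limits. With an always-failing policy, the range stays pinned at the nominal value.
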
Self Check

Can you name the success band, update rule, controlled factors, phase schedule, and final held-out panel for the curriculum? If not, the training process is not reproducible.

Curriculum and automatic randomization become useful when the training distribution is treated as a controlled system. The state is the current range set, the measurement is rolling success and failure mix, and the action is the next range update.

The graduate-level habit is to separate three claims. The learning claim says the schedule lets the policy acquire behavior. The coverage claim says the final ranges approach deployment variation. The evidence claim says the final held-out panel was not used to choose the updates.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Isaac Lab	Parallel curriculum phases	Use it when many environments need phase-specific ranges and logged success windows.
MuJoCo or MJX	Fast dynamics range sweeps	Use it when automatic updates target friction, mass, actuator, or delay parameters.
Hydra or similar config tools	Curriculum schedule tracking	Use it to version range sets, phase names, seeds, and update thresholds.
Weights and Biases or MLflow	Training trace comparison	Use it to plot rolling success, range expansion, and failure labels on the same run.
LeRobot	Final real panel comparison	Use it to evaluate whether the curriculum-trained policy transfers to recorded or live robot episodes.

A robust implementation starts with a curriculum ledger: one artifact that records the update rule, phase count, final range, and leakage guard alongside the evaluation metric.

Write a one-paragraph task contract with observation, action, success, and failure fields.
Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
Run one deterministic smoke test and one perturbation test before scaling.
Save a single result artifact containing configuration, seed, metrics, videos or traces, and failure labels.
Compare methods only when one script evaluates them on the same task panel.
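The following is a minimal sketch of a curriculum ledger, assuming a plain JSON document is an acceptable artifact format. The field names, the `missing_fields` check, and the example values are hypothetical. The metric value is left as `None` because it must come from the real held-out evaluation, not from this example.

```python
import json

REQUIRED_FIELDS = ("update_rule", "final_ranges", "metric", "leakage_guard")


def build_ledger(update_rule, thresholds, phase_log, final_ranges, metric_name, metric_value,
                 controller_scene_ids, holdout_scene_ids, seed, checkpoint):
    """Bundle the curriculum history and the evaluation result into one serializable record."""
    disjoint = not (set(controller_scene_ids) & set(holdout_scene_ids))
    return {
        "update_rule": update_rule,
        "thresholds": thresholds,
        "phase_count": len(phase_log),
        "phase_log": phase_log,
        "final_ranges": {name: list(bounds) for name, bounds in final_ranges.items()},
        "metric": {"name": metric_name, "value": metric_value},
        "leakage_guard": {"holdout_disjoint_from_controller": disjoint},
        "seed": seed,
        "checkpoint": checkpoint,
    }


def missing_fields(ledger):
    """List required fields that are absent; an empty list means the ledger is reportable."""
    return [field for field in REQUIRED_FIELDS if field not in ledger]


ledger = build_ledger(
    update_rule="expand above 0.80, narrow below 0.45, else hold",
    thresholds={"expand_above": 0.80, "narrow_below": 0.45, "step": 0.05},
    phase_log=[{"phase": 1, "friction": [0.30, 0.70]}, {"phase": 2, "friction": [0.25, 0.75]}],
    final_ranges={"friction": (0.25, 0.75)},
    metric_name="held_out_success_rate",
    metric_value=None,  # placeholder: fill in from the real held-out panel
    controller_scene_ids=["sim_val_01"],
    holdout_scene_ids=["real_scene_07"],
    seed=0,
    checkpoint="ckpt_final",  # hypothetical checkpoint name
)
print(json.dumps(ledger, indent=2))
print("missing:", missing_fields(ledger))
```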

Whatever form the ledger takes, its printed trace should expose the update rule, final range, metric, and leakage guard. If any of those fields is missing, the ledger is not yet an evaluation artifact.

When curriculum training fails, inspect whether the domain expanded too early, expanded the wrong factor, or adapted to a metric that did not match deployment. Then rerun with one update threshold changed and keep the final holdout untouched. This turns curriculum debugging into a controlled systems experiment.
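The one-change rerun can be enforced in code rather than by discipline alone. The sketch below assumes a config dictionary with a `thresholds` sub-dictionary and a `holdout_panel` entry (both hypothetical names). It generates variants that differ in exactly one threshold and rejects any variant that touched the holdout or another setting.

```python
import copy


def one_change_variants(base_config, key, values):
    """Copies of base_config that differ only in thresholds[key]."""
    variants = []
    for value in values:
        config = copy.deepcopy(base_config)  # deep copy so the base is never mutated
        config["thresholds"][key] = value
        variants.append(config)
    return variants


def check_controlled(base_config, variant):
    """Return the changed threshold names; raise if the rerun is not a single-change experiment."""
    base_rest = {k: v for k, v in base_config.items() if k != "thresholds"}
    variant_rest = {k: v for k, v in variant.items() if k != "thresholds"}
    if base_rest != variant_rest:
        raise ValueError("a setting outside the thresholds changed (including the holdout panel)")
    base_t, var_t = base_config["thresholds"], variant["thresholds"]
    changed = sorted(k for k in base_t.keys() | var_t.keys() if base_t.get(k) != var_t.get(k))
    if len(changed) > 1:
        raise ValueError(f"more than one threshold changed: {changed}")
    return changed


base = {
    "thresholds": {"expand_above": 0.80, "narrow_below": 0.45, "step": 0.05},
    "holdout_panel": "real_panel_v1",  # hypothetical frozen panel ID
    "seed": 0,
}
for variant in one_change_variants(base, "expand_above", [0.75, 0.85]):
    print(variant["thresholds"], check_controlled(base, variant))
```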

Key Takeaway

Curriculum and automatic randomization are useful when the training distribution widens for documented reasons and the final transfer claim is measured on a panel the curriculum never tuned against.

Exercise 13.3.1

Design a curriculum for two randomized factors. Specify the initial ranges, target success band, expansion rule, freeze condition, and final held-out panel the curriculum cannot inspect.

What's Next?

Section 13.4 moves from when to widen variation to how rendered cameras, labels, and photoreal scenes make perception data trustworthy.

Bibliography and Further Reading

Foundational Papers

Tobin, J. et al. (2017). "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS.

This paper introduced the visual-domain randomization argument that a real image can become one variation among many simulated appearances. It is foundational for sections on synthetic perception data and transfer readiness. Readers should connect this source to curriculum and automatic randomization when deciding what is reusable, what is benchmark-specific, and what must be remeasured.

Peng, X. B. et al. (2018). "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization." ICRA.

This paper studies randomized dynamics for robotic control transfer. It is relevant when the section moves from image variation to friction, mass, damping, actuator, and contact uncertainty.

Research Foundations

Chen, X., Hu, J., Jin, C., Li, L., and Wang, L. (2021). "Understanding Domain Randomization for Sim-to-real Transfer." arXiv.

This work gives a theoretical view of domain randomization as transfer across a family of parameterized MDPs. Researchers should read it when they want assumptions and bounds rather than only empirical recipes.

Tools And Libraries

NVIDIA. "Omniverse Replicator Documentation."

Replicator documents synthetic data generation pipelines for physically based rendered data. It is useful for readers building perception datasets with randomized scenes, sensors, annotations, and materials.

DLR-RM. "BlenderProc Documentation and Examples."

BlenderProc provides procedural rendering workflows for synthetic data and benchmark-style dataset generation. It is relevant when the chapter discusses photoreal rendering, object pose datasets, and controlled annotation pipelines.