A Careful Control Loop
Curriculum and automatic randomization solve a pacing problem. If the simulator starts too easy, the policy memorizes a narrow slice of the world; if it starts too hard, the policy never discovers useful behavior. A curriculum widens the domain as competence grows, while automatic randomization adjusts the domain to keep training informative.
For curriculum and automatic randomization, the transfer argument should name three things: which simulator gap is randomized, which real variable it approximates, and which evaluation panel checks whether transfer improved.
What This Section Builds
This section makes curriculum and automatic randomization operational. It explains how to expand ranges only when the policy is stable enough to benefit, and how to shrink or rebalance ranges when training produces only failures.
The goal is to record the difficulty schedule as part of the experiment, not as an invisible training convenience. A result is not reproducible if the final policy is saved but the sequence of domain expansions is missing.
Automatic randomization is a feedback controller over the training distribution. Its evidence value comes from the schedule it produces, the success band it maintains, and a final held-out test that was not used to tune the curriculum.
Theory
A curriculum defines a time-indexed training distribution $p_k(\theta)$, where $k$ is the training phase and $\theta$ contains randomized domain parameters. Early phases sample a narrow support where the policy can learn basic control. Later phases widen the support toward the deployment envelope.
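As a minimal sketch of a time-indexed distribution $p_k(\theta)$, the following code treats $\theta$ as a single friction coefficient and makes each $p_k$ uniform over a range that widens linearly with the phase index. The range values, the phase count, and the names `phase_range` and `sample_theta` are illustrative assumptions, not part of any particular simulator.

```python
import random

# Hypothetical values chosen for illustration only.
INITIAL_RANGE = (0.45, 0.55)     # narrow support for phase 0
DEPLOYMENT_RANGE = (0.10, 1.00)  # assumed deployment envelope
NUM_PHASES = 5                   # assumed number of curriculum phases


def phase_range(k, num_phases=NUM_PHASES):
    """Support of p_k: linear interpolation from the initial to the deployment range."""
    if num_phases < 2:
        raise ValueError("need at least two phases to interpolate")
    if not 0 <= k < num_phases:
        raise ValueError("phase index out of range")
    t = k / (num_phases - 1)
    low = (1 - t) * INITIAL_RANGE[0] + t * DEPLOYMENT_RANGE[0]
    high = (1 - t) * INITIAL_RANGE[1] + t * DEPLOYMENT_RANGE[1]
    return (low, high)


def sample_theta(k, rng):
    """Draw one friction value from the uniform distribution p_k."""
    low, high = phase_range(k)
    return rng.uniform(low, high)


rng = random.Random(0)  # fixed seed so the schedule can be replayed
for k in range(NUM_PHASES):
    low, high = phase_range(k)
    print(k, round(low, 3), round(high, 3), round(sample_theta(k, rng), 3))
```

A fixed linear schedule like this is the open-loop baseline; the automatic variant described next replaces the phase index with a measured trigger.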
Automatic domain randomization closes the loop: if rolling success is above the target band, expand the domain; if it is below the band, hold or narrow the domain; if failures cluster in one factor, rebalance that factor rather than widening everything. The design resembles an outer-loop controller whose plant is the learning process itself.
The mechanism is difficulty regulation. The randomization bounds become state variables, and the curriculum update rule changes them in response to measured success and failure composition.
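The update rule can be sketched for several factors at once. The version below is one possible reading of the rule above, under stated assumptions: above the band, every factor widens except one that dominates recent failure labels; below the band, the dominant factor is narrowed on its own if there is one, otherwise all factors narrow; inside the band nothing changes. The band, step size, dominance threshold, and all function names are hypothetical.

```python
from collections import Counter

TARGET_BAND = (0.45, 0.80)  # assumed success band
STEP = 0.05                 # assumed change per side per update
DOMINANCE = 0.5             # assumed failure share that marks a factor as dominant


def dominant_factor(failure_labels):
    """Return the factor blamed for at least DOMINANCE of failures, else None."""
    if not failure_labels:
        return None
    factor, count = Counter(failure_labels).most_common(1)[0]
    return factor if count / len(failure_labels) >= DOMINANCE else None


def widen(rng, limit):
    low, high = rng
    return (max(limit[0], round(low - STEP, 4)), min(limit[1], round(high + STEP, 4)))


def narrow(rng):
    low, high = rng
    if high - low <= 2 * STEP:
        return rng  # already at the minimum width; do not invert the range
    return (round(low + STEP, 4), round(high - STEP, 4))


def update_ranges(ranges, limits, rolling_success, failure_labels):
    """One controller step: return (new_ranges, action_label)."""
    band_low, band_high = TARGET_BAND
    blamed = dominant_factor(failure_labels)
    if rolling_success > band_high:
        # Expand everything except a factor that dominates the failure mix.
        new = {f: (r if f == blamed else widen(r, limits[f])) for f, r in ranges.items()}
        return new, "expand"
    if rolling_success < band_low:
        if blamed in ranges:
            new = dict(ranges)
            new[blamed] = narrow(ranges[blamed])
            return new, "rebalance:" + blamed
        return {f: narrow(r) for f, r in ranges.items()}, "narrow"
    return dict(ranges), "hold"


ranges = {"friction": (0.35, 0.65), "payload_kg": (0.0, 1.0)}
limits = {"friction": (0.10, 1.00), "payload_kg": (0.0, 5.0)}
print(update_ranges(ranges, limits, 0.90, []))
print(update_ranges(ranges, limits, 0.30, ["friction", "friction", "payload_kg"]))
print(update_ranges(ranges, limits, 0.60, ["payload_kg"]))
```

Returning the action label alongside the new ranges makes each decision loggable, which matters later when the schedule itself is treated as evidence.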
Worked Example
The following snippet shows a small automatic randomization update. It expands the friction range when rolling success is high, keeps it fixed inside the target band, and narrows it when the learner is failing too often.
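```python
# Update one curriculum range from rolling success.
# The goal is to keep training challenging without destroying exploration.
def update_friction_range(current_range, rolling_success):
    low, high = current_range
    if rolling_success > 0.80:
        # Above the band: widen by 0.05 per side, clamped to [0.10, 1.00].
        return (max(0.10, round(low - 0.05, 2)), min(1.00, round(high + 0.05, 2)))
    if rolling_success < 0.45:
        # Below the band: shrink toward the center, never beyond the current range.
        center = (low + high) / 2
        return (max(low, round(center - 0.10, 2)), min(high, round(center + 0.10, 2)))
    # Inside the band: hold the range fixed.
    return current_range


for success in [0.88, 0.63, 0.32]:
    print(success, update_friction_range((0.35, 0.65), success))
```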
The update_friction_range function turns rolling success into a curriculum decision. High success expands the range, mid-band success keeps it fixed, and low success narrows it so the policy can recover useful behavior.
The from-scratch fragment is for understanding the update rule. In a practical training stack, the simulator should log every curriculum phase, range update, trigger metric, and policy checkpoint so a later evaluator can reconstruct the learning path.
Practical Recipe
- Choose a small set of curriculum-controlled factors, such as goal distance, friction span, clutter count, or lighting range.
- Define a target success band before training, for example 55 to 80 percent over the most recent evaluation window.
- Expand only the factors whose current band is mastered, and rebalance factors that dominate failure labels.
- Freeze a final transfer panel that the curriculum controller never sees.
- Save the phase schedule, range updates, trigger metrics, and final policy checkpoint together.
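The "most recent evaluation window" in the recipe above can be sketched as a fixed-length buffer of episode outcomes. This is a minimal illustration, assuming binary success labels; the class name `RollingSuccess`, the window length, and the decision strings are hypothetical.

```python
from collections import deque


class RollingSuccess:
    """Track success over the most recent `window` episodes."""

    def __init__(self, window=100, band=(0.55, 0.80)):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.band = band

    def record(self, success):
        self.outcomes.append(bool(success))

    def rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def decision(self):
        # Refuse to act until the window is full, so early noise cannot trigger updates.
        if len(self.outcomes) < self.outcomes.maxlen:
            return "wait"
        low, high = self.band
        rate = self.rate()
        if rate > high:
            return "expand"
        if rate < low:
            return "narrow_or_rebalance"
        return "hold"


tracker = RollingSuccess(window=4)
for outcome in [True, True, True, True, False, False, False]:
    tracker.record(outcome)
    print(tracker.rate(), tracker.decision())
```

Fixing the window length and band in the constructor, before training starts, keeps the trigger metric from being tuned after the fact.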
A curriculum is evidence only when its update rule, trigger metric, phase schedule, held-out real measurements, and failure labels are saved. Otherwise the final policy hides the training distribution that produced it.
The common mistake is curriculum leakage. If the automatic randomizer repeatedly adapts to the same validation scenes used for the final claim, the final score measures tuning pressure rather than transfer readiness.
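A simple guard against this kind of leakage is to check that the scenes the controller adapts to are disjoint from the final panel, and to fingerprint the panel so later edits are detectable. The sketch below assumes scenes are identified by string IDs; the function names and the example IDs are hypothetical.

```python
import hashlib


def panel_fingerprint(scene_ids):
    """Order-independent SHA-256 digest of a panel's scene IDs."""
    joined = "\n".join(sorted(scene_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def check_no_leakage(controller_scene_ids, holdout_scene_ids):
    """Raise if any held-out scene was visible to the curriculum controller."""
    overlap = set(controller_scene_ids) & set(holdout_scene_ids)
    if overlap:
        raise ValueError("held-out scenes seen by controller: " + ", ".join(sorted(overlap)))
    return True


controller_scenes = ["val_000", "val_001", "val_002"]  # placeholder IDs
holdout_scenes = ["final_000", "final_001"]            # placeholder IDs

print(check_no_leakage(controller_scenes, holdout_scenes))
print(panel_fingerprint(holdout_scenes))
```

Recording the fingerprint before training begins gives a later evaluator a way to confirm that the panel reported in the final claim is the one that was frozen.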
A quadruped team might begin on flat terrain with mild friction variation, then add slopes, payload changes, delay, and rough patches only after stable locomotion appears. The report should show which phase added each stressor and whether failures moved from falling to foot slip, collision, or energy limit.
A curriculum is a coach, not a confetti cannon. It adds difficulty when the learner is ready enough to learn from it.
The research frontier is shifting toward curricula that adapt per failure mode instead of expanding all ranges uniformly. The open question is how to choose updates that improve real transfer while avoiding overfitting to the simulator's own difficulty signals.
"Solving Rubik's Cube with a Robot Hand" (OpenAI, Science Robotics, 2020) demonstrates Automatic Domain Randomization (ADR) at scale on a 13-DOF dexterous hand. ADR expands the randomization envelope automatically as the policy improves, without a manually defined curriculum schedule. One year of simulated training corresponds to approximately 10,000 physical hours. The key finding: sufficiently wide randomization eliminates the need for system identification, because the policy learns to handle any plausible physical configuration rather than the measured one.
Can you name the success band, update rule, controlled factors, phase schedule, and final held-out panel for the curriculum? If not, the training process is not reproducible.
Curriculum and automatic randomization become useful when the training distribution is treated as a controlled system. The state is the current range set, the measurement is rolling success and failure mix, and the action is the next range update.
The graduate-level habit is to separate three claims. The learning claim says the schedule lets the policy acquire behavior. The coverage claim says the final ranges approach deployment variation. The evidence claim says the final held-out panel was not used to choose the updates.
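The coverage claim is the easiest of the three to quantify. As a rough sketch, and assuming each factor is a one-dimensional interval, coverage can be reported as the fraction of the deployment interval that the final training range overlaps. The function name and the example ranges are hypothetical, and interval overlap says nothing about how densely the range was sampled.

```python
def coverage_fraction(final_range, deployment_range):
    """Fraction of the deployment interval covered by the final training range."""
    width = deployment_range[1] - deployment_range[0]
    if width <= 0:
        raise ValueError("deployment range must have positive width")
    overlap_low = max(final_range[0], deployment_range[0])
    overlap_high = min(final_range[1], deployment_range[1])
    return max(0.0, overlap_high - overlap_low) / width


# Placeholder ranges for illustration, not measured values.
final_ranges = {"friction": (0.20, 0.80), "payload_kg": (0.0, 2.0)}
deployment = {"friction": (0.10, 1.00), "payload_kg": (0.0, 5.0)}

for factor in final_ranges:
    frac = coverage_fraction(final_ranges[factor], deployment[factor])
    print(factor, round(frac, 3))
```

Reporting this number per factor keeps the coverage claim separate from the learning claim: a policy can train well inside ranges that still fall short of deployment variation.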
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Isaac Lab | Parallel curriculum phases | Use it when many environments need phase-specific ranges and logged success windows. |
| MuJoCo or MJX | Fast dynamics range sweeps | Use it when automatic updates target friction, mass, actuator, or delay parameters. |
| Hydra or similar config tools | Curriculum schedule tracking | Use it to version range sets, phase names, seeds, and update thresholds. |
| Weights and Biases or MLflow | Training trace comparison | Use it to plot rolling success, range expansion, and failure labels on the same run. |
| LeRobot | Final real panel comparison | Use it to evaluate whether the curriculum-trained policy transfers to recorded or live robot episodes. |
A robust implementation starts with a curriculum ledger. Code Fragment 2 records the update rule, phase count, final range, and leakage guard in the same artifact as the evaluation metric.
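One way such a ledger might look is sketched below, assuming a single JSON-serializable record per experiment. The class name `CurriculumLedger`, its field names, the checkpoint file names, and the metric value are all hypothetical placeholders rather than measured results or an established format.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CurriculumLedger:
    update_rule: str                       # human-readable description of the rule
    success_band: tuple                    # (low, high) target band
    phases: list = field(default_factory=list)
    holdout_used_for_tuning: bool = False  # must stay False for a valid claim
    holdout_metric: Optional[float] = None

    def log_phase(self, factor, new_range, trigger_success, checkpoint):
        """Append one range update together with the metric that triggered it."""
        self.phases.append({
            "phase": len(self.phases),
            "factor": factor,
            "range": list(new_range),
            "trigger_success": trigger_success,
            "checkpoint": checkpoint,
        })

    def summary(self):
        # The last logged range per factor is the final range.
        final_range = {}
        for entry in self.phases:
            final_range[entry["factor"]] = entry["range"]
        return {
            "update_rule": self.update_rule,
            "success_band": list(self.success_band),
            "phase_count": len(self.phases),
            "final_range": final_range,
            "holdout_metric": self.holdout_metric,
            "leakage_guard": "FAIL" if self.holdout_used_for_tuning else "pass",
        }


ledger = CurriculumLedger(update_rule="expand 0.05 per side above band",
                          success_band=(0.55, 0.80))
ledger.log_phase("friction", (0.30, 0.70), 0.84, "ckpt_phase0.pt")
ledger.log_phase("friction", (0.25, 0.75), 0.82, "ckpt_phase1.pt")
ledger.holdout_metric = 0.71  # placeholder, not a measured result
print(json.dumps(ledger.summary(), indent=2))
```

Keeping the leakage guard in the same record as the metric means the number cannot be quoted without also showing whether the holdout stayed untouched.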
- Write a one-paragraph task contract with observation, action, success, and failure fields.
- Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
- Run one deterministic smoke test and one perturbation test before scaling.
- Save a single result artifact containing configuration, seed, metrics, videos or traces, and failure labels.
- Compare methods only when one script evaluates them on the same task panel.
Expected output: the printed trace should expose the update rule, final range, metric, and leakage guard. If one of those fields is missing, the example is not yet an evaluation artifact.
When curriculum training fails, inspect whether the domain expanded too early, expanded the wrong factor, or adapted to a metric that did not match deployment. Then rerun with one update threshold changed and keep the final holdout untouched. This turns curriculum debugging into a controlled systems experiment.
Curriculum and automatic randomization are useful when the training distribution widens for documented reasons and the final transfer claim is measured on a panel the curriculum never tuned against.
Design a curriculum for two randomized factors. Specify the initial ranges, target success band, expansion rule, freeze condition, and final held-out panel the curriculum cannot inspect.
Section 13.4 moves from when to widen variation to how rendered cameras, labels, and photoreal scenes make perception data trustworthy.