A Careful Control Loop
Teacher-student and privileged-information distillation turns simulator access into training guidance without making the deployed policy depend on simulator-only state. The teacher can see privileged variables; the student must act from onboard observations.
For teacher-student and privileged-information distillation, GPU-based RL results are reproducible only when simulator fidelity settings, PPO rollout semantics, reward terms, and the reset distribution are versioned together in the same training artifact.
This section develops the distillation contract for fast simulator-trained policies. The teacher is trained or evaluated with privileged state such as terrain heights, contact impulses, object poses, or exact base velocity; the student learns to imitate useful teacher actions from deployable signals such as proprioception, commands, history, and exteroception.
The key question is practical: which information is legal at deployment, and how do we prove that the student evaluation uses only that legal interface?
The teacher may see the simulator's answer key, but the student must pass the exam without it. The audit question is whether privileged information shaped the target action without leaking into the deployed observation tensor.
Theory
A simple distillation objective is $$\mathcal{L}_{\text{distill}} = \frac{1}{B}\sum_{i=1}^{B}\| \pi_s(o_i^{\text{deploy}}) - \pi_t(o_i^{\text{priv}}) \|_2^2,$$ where $\pi_t$ is the privileged teacher, $\pi_s$ is the student, $o_i^{\text{priv}}$ includes simulator-only information, and $o_i^{\text{deploy}}$ contains only signals available on the robot.
The formula is easy to write and easy to misuse. The batch must cover the same command, terrain, contact, and disturbance distribution used for evaluation, and the loss should be reported alongside closed-loop student performance, not as a standalone success metric.
The mechanism is two-stage: first train or select a teacher that uses privileged simulator state to produce strong actions, then train a student to match those actions from deployable observations. The final evaluation runs the student only, with privileged tensors removed from the actor path.
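The two stages can be sketched with a toy linear problem. This is only an illustration under stated assumptions: the "teacher" is a fixed linear map standing in for a trained privileged policy, the data are synthetic, and names such as `obs_deploy`, `obs_priv`, and `teacher_weights` are hypothetical. The student is fit by plain gradient descent on the squared-error distillation objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 2 deployable features and 1 privileged feature per step.
n_steps = 512
obs_deploy = rng.normal(size=(n_steps, 2))
# The privileged feature is partly predictable from deployable signals, plus simulator-only noise.
priv = obs_deploy @ np.array([[0.8], [-0.3]]) + 0.1 * rng.normal(size=(n_steps, 1))
obs_priv = np.concatenate([obs_deploy, priv], axis=1)

# Stage 1 stand-in: a fixed linear teacher that acts on privileged observations.
teacher_weights = np.array([[0.5], [-0.2], [1.0]])
teacher_actions = obs_priv @ teacher_weights

# Stage 2: fit a linear student on deployable observations only.
student_weights = np.zeros((2, 1))
learning_rate = 0.1
for _ in range(200):
    residual = obs_deploy @ student_weights - teacher_actions
    grad = 2.0 * obs_deploy.T @ residual / n_steps  # gradient of the mean squared error
    student_weights -= learning_rate * grad

final_loss = float(np.mean((obs_deploy @ student_weights - teacher_actions) ** 2))
print(f"student weights: {student_weights.ravel().round(3)}")
print(f"residual distillation loss: {final_loss:.4f}")
```

The residual loss does not reach zero because part of the teacher's action depends on privileged noise the student cannot observe. That irreducible gap is one reason the loss should be read next to closed-loop student metrics rather than alone.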
Worked Example
Code Fragment 17.5.1 computes a tiny distillation loss. The mask keeps failed or invalid teacher steps from becoming imitation targets, which matters when the teacher is still imperfect.
# Compute a masked action-distillation loss for teacher-student RL.
# The student matches teacher actions only on valid rollout steps.
import numpy as np
teacher_actions = np.array([
[0.20, -0.10, 0.05],
[0.35, -0.08, 0.02],
[1.50, 0.90, -1.20],
])
student_actions = np.array([
[0.18, -0.12, 0.04],
[0.30, -0.10, 0.01],
[0.10, 0.05, -0.02],
])
valid_teacher_step = np.array([1.0, 1.0, 0.0])
squared_error = ((student_actions - teacher_actions) ** 2).mean(axis=1)
masked_loss = (squared_error * valid_teacher_step).sum() / valid_teacher_step.sum()
print(f"per-step error: {squared_error.round(4)}")
print(f"masked distillation loss: {masked_loss:.4f}")
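Expected output: the trace shows the per-step errors and the masked loss. The third step has a large error (about 1.36) but is masked out, so the masked loss averages only the first two steps (about 0.0003 and 0.0010) and comes out near 0.00065. A distillation run that reports only average imitation loss can hide whether failed teacher states were included as targets.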
In Isaac Lab or MJX-style workflows, privileged observations are usually already present for asymmetric critics and diagnostics. The shortcut is to reuse those tensors for teacher training while maintaining a strict exported-student interface that contains only deployable observations.
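A minimal sketch of that separation, assuming hypothetical group names `policy` and `privileged` held in plain dictionaries. This is not the Isaac Lab or MJX API; it only illustrates that the student tensor is built from the deployable group alone, while the teacher may concatenate both.

```python
import numpy as np

num_envs = 4

# Hypothetical observation groups; field names and sizes are illustrative only.
obs_groups = {
    "policy": {
        "joint_pos": np.zeros((num_envs, 12)),
        "joint_vel": np.zeros((num_envs, 12)),
        "command": np.zeros((num_envs, 3)),
    },
    "privileged": {
        "exact_terrain_heights": np.zeros((num_envs, 16)),
        "contact_impulses": np.zeros((num_envs, 4)),
    },
}


def flatten(group: dict[str, np.ndarray]) -> np.ndarray:
    """Concatenate a group's fields in sorted-name order so the layout is stable."""
    return np.concatenate([group[name] for name in sorted(group)], axis=-1)


student_input = flatten(obs_groups["policy"])
teacher_input = np.concatenate(
    [student_input, flatten(obs_groups["privileged"])], axis=-1
)

print(student_input.shape, teacher_input.shape)
```

Keeping the student tensor as a strict prefix of the teacher tensor is one possible design choice; it makes the exported interface easy to check by shape alone.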
Practical Recipe
- List privileged fields and deployable fields before training the teacher.
- Train or select a teacher whose advantage comes from privileged state, not from evaluation leakage.
- Collect teacher actions on a diverse rollout panel and mark invalid teacher steps.
- Train the student from deployable observations using action matching, feature matching, or DAgger-style relabeling when needed.
- Evaluate only the student on held-out seeds with privileged actor inputs disabled.
The common mistake is to leave a privileged field in the student observation wrapper during evaluation. The run may look excellent, but the exported policy cannot reproduce it on hardware.
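One way to guard against that mistake is a wrapper that refuses any observation dictionary containing a privileged or unknown key. The sketch below is an assumption-laden illustration: the field names and the `build_student_input` helper are hypothetical, and a real project would wire the same check into its own environment wrapper.

```python
import numpy as np

# Hypothetical field names; replace with the project's own schema.
STUDENT_FIELDS = ("joint_pos", "joint_vel", "command", "history")
PRIVILEGED_FIELDS = frozenset(
    {"exact_terrain_heights", "contact_impulses", "base_velocity"}
)


def build_student_input(obs: dict[str, np.ndarray]) -> np.ndarray:
    """Flatten deployable fields into one vector, refusing privileged or unknown keys."""
    leaked = sorted(PRIVILEGED_FIELDS.intersection(obs))
    if leaked:
        raise ValueError(f"privileged fields in student observation: {leaked}")
    unknown = sorted(set(obs) - set(STUDENT_FIELDS))
    if unknown:
        raise ValueError(f"fields outside the student schema: {unknown}")
    missing = [name for name in STUDENT_FIELDS if name not in obs]
    if missing:
        raise KeyError(f"missing student fields: {missing}")
    return np.concatenate(
        [np.asarray(obs[name], dtype=np.float32).ravel() for name in STUDENT_FIELDS]
    )


clean_obs = {
    "joint_pos": np.zeros(12),
    "joint_vel": np.zeros(12),
    "command": np.zeros(3),
    "history": np.zeros(24),
}
student_input = build_student_input(clean_obs)

# A leaked privileged field should make evaluation fail loudly.
leaky_obs = dict(clean_obs, exact_terrain_heights=np.zeros(16))
try:
    build_student_input(leaky_obs)
    leak_rejected = False
except ValueError:
    leak_rejected = True
```

Failing with an exception, rather than silently dropping the extra field, keeps a leaky evaluation from producing a number at all.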
A terrain-walking teacher may observe the exact height map under every foot, while the student receives proprioception, command history, and a noisy local height scan. The evaluation artifact should include a schema diff proving that exact terrain state was removed from the student's actor input.
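A schema diff of this kind can be as small as a few set operations. The field names below are hypothetical labels for the terrain-walking example, not names from any particular codebase.

```python
# Hypothetical actor-input schemas for the terrain-walking example.
teacher_actor_fields = {"proprioception", "command_history", "exact_height_map"}
student_actor_fields = {"proprioception", "command_history", "noisy_height_scan"}
privileged_fields = {"exact_height_map"}

schema_diff = {
    "removed_from_student": sorted(teacher_actor_fields - student_actor_fields),
    "added_for_student": sorted(student_actor_fields - teacher_actor_fields),
    "leaked_into_student": sorted(student_actor_fields & privileged_fields),
}

# The evaluation should refuse to run if any privileged field reaches the student actor.
if schema_diff["leaked_into_student"]:
    raise RuntimeError(f"privileged leak: {schema_diff['leaked_into_student']}")

print(schema_diff)
```

Storing the resulting dictionary next to the evaluation metrics gives a reviewer something concrete to check.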
Privileged distillation is a tutoring session where the teacher can read the answer key, but the final exam confiscates it.
The frontier is moving from simple action matching toward richer policy transfer: privileged critics, latent terrain encoders, history-based students, residual teachers, and datasets that mix expert rollouts with student recovery states. The open question is how to use privileged information without training brittle students that imitate actions they cannot explain from their own sensors.
Can you list teacher inputs, student inputs, critic-only inputs, invalid-step masks, distillation loss, and the evaluation wrapper that removes privileged actor fields? If not, the distillation result is not deployment-safe.
The idea in this section becomes useful when privilege is treated as a controlled variable. A teacher may use more information, but every extra field must be named, justified, and blocked from the exported actor. Otherwise distillation becomes information leakage with a nicer name.
The graduate-level habit is to evaluate three claims separately. The teacher claim says privileged state improves expert behavior. The imitation claim says the student matches useful teacher actions on valid states. The deployment claim says the student still succeeds when privileged actor inputs are removed.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Privileged teacher | Policy or expert with simulator-only state | Use it to generate strong targets, but log every field the teacher sees. |
| Deployable student | Policy with onboard observations only | Use it as the only policy in final evaluation and export. |
| Asymmetric critic | Training-time value function with extra state | Use it to stabilize learning while keeping the actor interface deployment-safe. |
| Invalid-step mask | Filter for teacher falls, resets, or unsafe actions | Use it so the student does not imitate bad teacher moments. |
| Schema diff | Audit of teacher, student, and critic fields | Use it to prove that privileged information did not leak into the student actor. |
A robust implementation starts with an observation schema manifest. The manifest makes leakage visible by listing teacher-only, student, and critic-only fields separately.
- Freeze the deployable observation schema before collecting teacher data.
- Store teacher-only fields and critic-only fields as explicit lists.
- Save the mask rule for excluding failed teacher steps.
- Report imitation loss and closed-loop student metrics from the same held-out panel.
- Export a student checkpoint with a wrapper that rejects privileged actor inputs.
# Record the observation schema for privileged-information distillation.
# The exported actor must accept only fields listed under student_obs.
from dataclasses import dataclass, asdict


@dataclass
class DistillationSchema:
    teacher_only: tuple[str, ...]
    student_obs: tuple[str, ...]
    critic_only: tuple[str, ...]
    mask_rule: str
    eval_wrapper: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)


schema = DistillationSchema(
    teacher_only=("exact_terrain_heights", "contact_impulses"),
    student_obs=("joint_pos", "joint_vel", "command", "history"),
    critic_only=("base_velocity",),
    mask_rule="exclude falls, timeouts, and saturated actions",
    eval_wrapper="student_actor_only",
)
print(schema.as_row())
The student_obs tuple is the deployed actor contract, while the teacher_only and critic_only fields are allowed only during training.

When a distilled student fails, decide whether the issue is teacher quality, state coverage, observation insufficiency, or leakage removal. A student that imitates well offline but fails closed-loop usually needs recovery-state data, history, or an interactive relabeling loop, not another blind epoch over the same expert states.
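An interactive relabeling loop in the DAgger style can be sketched on a toy one-dimensional system. Everything here is an assumption for illustration: the dynamics, the saturating `teacher_action`, and the linear student gain are invented, and the sketch shows only the mechanics (roll out the student, relabel visited states with teacher actions, aggregate, refit), not a performance claim.

```python
import numpy as np


def teacher_action(x):
    """Stand-in teacher: a saturating proportional controller (hypothetical)."""
    return -np.clip(x, -1.0, 1.0)


def step(x, action):
    """Toy dynamics (assumption): the action nudges the state toward zero."""
    return x + 0.5 * action


def fit_gain(states, actions):
    """Least-squares fit of the linear student a = gain * x."""
    return float(states @ actions / (states @ states))


horizon, x0 = 20, 3.0

# Round 0: behavior cloning on states visited by the teacher itself.
x, teacher_states = x0, []
for _ in range(horizon):
    teacher_states.append(x)
    x = step(x, teacher_action(x))
states = np.array(teacher_states)
actions = teacher_action(states)
gain = fit_gain(states, actions)

# Relabeling rounds: the student drives, the teacher labels the visited states.
for _ in range(3):
    x, visited = x0, []
    for _ in range(horizon):
        visited.append(x)
        x = step(x, gain * x)
    visited = np.array(visited)
    states = np.concatenate([states, visited])
    actions = np.concatenate([actions, teacher_action(visited)])
    gain = fit_gain(states, actions)

print(f"aggregated steps: {states.size}, student gain: {gain:.3f}")
```

The point of the loop is that the aggregated dataset contains states the student actually reaches, so teacher labels cover the student's own mistakes rather than only expert trajectories.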
For privileged-information distillation, compare only metrics that measure the same construct and were computed in a single pass on a single configuration: same held-out seed panel, same deployable student wrapper, same command distribution, same perturbation suite, and the same success definition. Save teacher reward, imitation loss, student success, fall rate, schema diff, and leakage checks as one artifact.
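As a rough sketch, the single artifact can be a flat record whose configuration keys must match before two runs are compared. The key names mirror the list above; the helper names and the placeholder values (`None` for metrics not yet measured) are assumptions, not results.

```python
import json

CONFIG_KEYS = (
    "seed_panel", "student_wrapper", "command_distribution",
    "perturbation_suite", "success_definition",
)
METRIC_KEYS = (
    "teacher_reward", "imitation_loss", "student_success",
    "fall_rate", "schema_diff", "leakage_checks",
)


def missing_keys(artifact: dict) -> list[str]:
    """Return required keys that the artifact does not contain."""
    return [key for key in CONFIG_KEYS + METRIC_KEYS if key not in artifact]


def comparable(a: dict, b: dict) -> bool:
    """Two runs are comparable only if every configuration key is present and equal."""
    return all(key in a and key in b and a[key] == b[key] for key in CONFIG_KEYS)


# Placeholder record: configuration labels are hypothetical, metrics are unfilled.
run_a = {
    "seed_panel": [0, 1, 2],
    "student_wrapper": "student_actor_only",
    "command_distribution": "commands_v1",
    "perturbation_suite": "pushes_v1",
    "success_definition": "no_fall_for_full_episode",
    "teacher_reward": None,
    "imitation_loss": None,
    "student_success": None,
    "fall_rate": None,
    "schema_diff": {"leaked_into_student": []},
    "leakage_checks": None,
}
run_b = dict(run_a, perturbation_suite="pushes_v2")

serialized = json.dumps(run_a, sort_keys=True)
print(comparable(run_a, run_a), comparable(run_a, run_b))
```

Refusing to compare runs whose configuration keys differ is a simple way to enforce the matching conditions before any metric is read.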
Privileged-information distillation is useful when simulator secrets improve training targets and then disappear from the deployed actor. The student result is credible only when the evaluation wrapper proves that disappearance.
Design a privileged teacher for rough-terrain locomotion. List teacher-only fields, student observations, critic-only fields, mask rules, distillation loss, and the evaluation check that proves the student actor does not receive privileged state.
What's Next?
This section turned privileged-information distillation into a deployment-safe schema: teacher-only fields, student fields, critic-only fields, valid-step masks, and leakage-free evaluation. Next, continue with Section 17.6, where the same discipline is applied to throughput, wall-clock, GPU memory, and cost.
Isaac Gym matters for this section because privileged simulator state is easiest to collect when thousands of environments already expose internal physics variables. That access is powerful only if the student interface stays deployment-safe.
Brax is relevant when teacher data and student data are generated inside a batched JAX workflow. Its array-based design makes observation schemas and masks explicit, which helps prevent leakage.
NVIDIA Isaac Lab documentation.
Isaac Lab is useful for defining separate observation groups for actors, critics, and diagnostics. That makes it a natural setting for privileged teachers and deployable students when the wrapper contract is audited.
Google DeepMind MuJoCo MJX documentation.
MJX provides another route to simulator state that can support privileged teachers. The section's warning still applies: any exact simulator field used by a teacher must be removed from the exported actor path.
Rudin et al. are relevant because fast locomotion training often combines asymmetric critics, privileged state, and deployable actors. Use the paper to connect distillation and privilege to a real locomotion workload.
RSL-RL is useful for inspecting how locomotion codebases represent actor observations and critic observations. That distinction is exactly what privileged-information distillation must preserve.