Section 17.5: Teacher-student and privileged-information distillation

Figure 17.5A: Privileged-information distillation lets the teacher use simulator secrets during training while the student learns the policy it can actually execute at deployment.
Big Picture

Teacher-student and privileged-information distillation turns simulator access into training guidance without making the deployed policy depend on simulator-only state. The teacher can see privileged variables; the student must act from onboard observations.

A distillation result in GPU RL is reproducible only if the simulator fidelity settings, PPO rollout semantics, reward terms, and reset distribution are versioned together in the same training artifact.

This section develops the distillation contract for fast simulator-trained policies. The teacher is trained or evaluated with privileged state such as terrain heights, contact impulses, object poses, or exact base velocity; the student learns to imitate useful teacher actions from deployable signals such as proprioception, commands, history, and exteroception.

The key question is practical: which information is legal at deployment, and how do we prove that the student evaluation uses only that legal interface?

Privilege Is A Training Tool, Not A Deployment Input

The teacher may see the simulator's answer key, but the student must pass the exam without it. The audit question is whether privileged information shaped the target action without leaking into the deployed observation tensor.

Theory

A simple distillation objective is $$\mathcal{L}_{\text{distill}} = \frac{1}{B}\sum_{i=1}^{B}\| \pi_s(o_i^{\text{deploy}}) - \pi_t(o_i^{\text{priv}}) \|_2^2,$$ where $\pi_t$ is the privileged teacher, $\pi_s$ is the student, $o_i^{\text{priv}}$ includes simulator-only information, and $o_i^{\text{deploy}}$ contains only signals available on the robot.

The formula is easy to write and easy to misuse. The batch must cover the same command, terrain, contact, and disturbance distribution used for evaluation, and the loss should be reported alongside closed-loop student performance, not as a standalone success metric.
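One way to make the coverage requirement checkable is to tag every rollout step with its condition and look for evaluation conditions that the distillation batch never visits. The sketch below is a minimal illustration: the `coverage_gaps` helper and the (terrain, disturbance) tags are hypothetical names introduced here, not part of any library.

```python
from collections import Counter

def coverage_gaps(batch_tags, eval_tags):
    """Return evaluation conditions that never appear in the distillation batch."""
    batch_counts = Counter(batch_tags)
    return sorted(tag for tag in set(eval_tags) if batch_counts[tag] == 0)

# Hypothetical (terrain, disturbance) tags attached to each rollout step.
batch_tags = [("flat", "none"), ("flat", "push"), ("stairs", "none")]
eval_tags = [("flat", "none"), ("stairs", "none"), ("stairs", "push")]

gaps = coverage_gaps(batch_tags, eval_tags)
print(gaps)  # [('stairs', 'push')]
```

A non-empty result means the imitation loss says nothing about those conditions, however low it is on the rest of the batch.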

Mechanism

The mechanism is two-stage: first train or select a teacher that uses privileged simulator state to produce strong actions, then train a student to match those actions from deployable observations. The final evaluation runs the student only, with privileged tensors removed from the actor path.
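The two stages can be sketched on a toy linear problem. Everything in this block is an illustrative assumption: the synthetic observations, the fixed linear teacher `W_teacher`, and the least-squares student stand in for real training and are not the setup of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 deployable features, 2 privileged features, 3 action dims.
n_steps, n_deploy, n_priv, n_act = 512, 4, 2, 3

obs_deploy = rng.normal(size=(n_steps, n_deploy))    # onboard signals
obs_priv_only = rng.normal(size=(n_steps, n_priv))   # simulator-only signals
obs_teacher = np.concatenate([obs_deploy, obs_priv_only], axis=1)

# Stage 1 (stand-in): a fixed linear teacher that reads the full privileged input.
W_teacher = rng.normal(size=(n_deploy + n_priv, n_act))
teacher_actions = obs_teacher @ W_teacher

# Stage 2: fit the student on deployable observations only (least squares).
W_student, *_ = np.linalg.lstsq(obs_deploy, teacher_actions, rcond=None)
student_actions = obs_deploy @ W_student

# Imitation loss in the same form as the distillation objective.
distill_loss = np.mean(np.sum((student_actions - teacher_actions) ** 2, axis=1))
print(f"student weight shape: {W_student.shape}")
print(f"distillation loss: {distill_loss:.3f}")
```

The residual loss here is expected rather than a bug: the privileged columns are independent of the deployable ones in this toy, so no student can recover them. Real students narrow that gap only when history or exteroception carries information about the hidden state.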

Worked Example

Code Fragment 17.5.1 computes a tiny distillation loss. The mask keeps failed or invalid teacher steps from becoming imitation targets, which matters when the teacher is still imperfect.

# Compute a masked action-distillation loss for teacher-student RL.
# The student matches teacher actions only on valid rollout steps.
import numpy as np

teacher_actions = np.array([
    [0.20, -0.10, 0.05],
    [0.35, -0.08, 0.02],
    [1.50, 0.90, -1.20],
])
student_actions = np.array([
    [0.18, -0.12, 0.04],
    [0.30, -0.10, 0.01],
    [0.10, 0.05, -0.02],
])
valid_teacher_step = np.array([1.0, 1.0, 0.0])

squared_error = ((student_actions - teacher_actions) ** 2).mean(axis=1)
masked_loss = (squared_error * valid_teacher_step).sum() / valid_teacher_step.sum()

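# Format explicitly so the printed values do not depend on NumPy's array print options.
per_step = ", ".join(f"{e:.4f}" for e in squared_error)
print(f"per-step error: [{per_step}]")
print(f"masked distillation loss: {masked_loss:.5f}")

per-step error: [0.0003, 0.0010, 1.3583]
masked distillation loss: 0.00065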
Code Fragment 17.5.1 demonstrates masked teacher-student action matching. The third teacher action is excluded because it represents an invalid rollout step, preventing a bad privileged teacher moment from training the deployable student.

Expected output: the trace should show both per-step error and the masked loss. A distillation run that reports only average imitation loss can hide whether failed teacher states were included as targets.
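Written as an equation, the masked loss in Code Fragment 17.5.1 takes the following form. The mask symbol $m_i$ and the action dimension $d$ are notation introduced here for illustration. The $1/d$ factor reflects the fragment's per-dimension mean, which differs from the plain squared norm in the Theory objective by a constant.

$$\mathcal{L}_{\text{masked}} = \frac{\sum_{i=1}^{B} m_i \, \frac{1}{d} \| \pi_s(o_i^{\text{deploy}}) - \pi_t(o_i^{\text{priv}}) \|_2^2}{\sum_{i=1}^{B} m_i}, \qquad m_i \in \{0, 1\}.$$

With the fragment's values, $d = 3$ and $m = (1, 1, 0)$, and the two valid per-step errors are $0.0003$ and $0.001$, so $\mathcal{L}_{\text{masked}} = (0.0003 + 0.001)/2 = 0.00065$. The denominator must be guarded in practice, since a batch with no valid teacher steps would divide by zero.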

Library Shortcut

In Isaac Lab or MJX-style workflows, privileged observations are usually already present for asymmetric critics and diagnostics. The shortcut is to reuse those tensors for teacher training while maintaining a strict exported-student interface that contains only deployable observations.

Practical Recipe

  1. List privileged fields and deployable fields before training the teacher.
  2. Train or select a teacher whose advantage comes from privileged state, not from evaluation leakage.
  3. Collect teacher actions on a diverse rollout panel and mark invalid teacher steps.
  4. Train the student from deployable observations using action matching, feature matching, or DAgger-style relabeling when needed.
  5. Evaluate only the student on held-out seeds with privileged actor inputs disabled.
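Step 4 mentions DAgger-style relabeling; a minimal sketch of that loop on a toy one-dimensional system follows. The dynamics, the linear teacher, and the history feature are all assumptions made for illustration. The point is the data flow: the student drives the rollout, the teacher only labels the states the student actually visits, and the student is refit on the aggregated dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_action(x, bias):
    # Privileged teacher: reads the hidden disturbance directly.
    return -0.5 * x - bias

def student_obs(x, prev_x, prev_a):
    # Deployable observation: position plus a disturbance estimate from history.
    return np.array([x, x - prev_x - prev_a])

def rollout(w, horizon=20):
    """Run the student closed-loop and relabel its visited states with the teacher."""
    bias = rng.uniform(-0.3, 0.3)   # simulator-only; never enters student_obs
    x = rng.uniform(-1.0, 1.0)
    prev_x, prev_a = x, 0.0         # no history yet, so the estimate starts at 0
    obs_rows, labels = [], []
    for _ in range(horizon):
        obs = student_obs(x, prev_x, prev_a)
        action = float(obs @ w)                  # the student drives the rollout
        obs_rows.append(obs)
        labels.append(teacher_action(x, bias))   # the teacher only supplies labels
        prev_x, prev_a = x, action
        x = x + action + bias                    # toy dynamics (assumed)
    return obs_rows, labels

w = np.zeros(2)                 # untrained linear student
all_obs, all_labels = [], []    # dataset aggregated across iterations
for iteration in range(5):
    for _ in range(20):
        obs_rows, labels = rollout(w)
        all_obs.extend(obs_rows)
        all_labels.extend(labels)
    # Refit the student on every state it has visited so far.
    w, *_ = np.linalg.lstsq(np.array(all_obs), np.array(all_labels), rcond=None)

print("student weights:", np.round(w, 2))
```

In this toy the student can approach the teacher's gains because one step of history reveals the disturbance. When no deployable signal carries the hidden state, relabeling alone cannot close the gap.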
Common Failure Mode

The common mistake is to leave a privileged field in the student observation wrapper during evaluation. The run may look excellent, but the exported policy cannot reproduce it on hardware.

Practical Example

A terrain-walking teacher may observe the exact height map under every foot, while the student receives proprioception, command history, and a noisy local height scan. The evaluation artifact should include a schema diff proving that exact terrain state was removed from the student's actor input.
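A schema diff for this example can be as small as a few set operations. The sketch below uses hypothetical field names that mirror the terrain-walking description; a real pipeline would read them from its observation configuration.

```python
def schema_diff(teacher_fields, student_fields):
    """Return fields removed from, shared with, and added to the student input."""
    teacher, student = set(teacher_fields), set(student_fields)
    return {
        "removed_from_student": sorted(teacher - student),
        "shared": sorted(teacher & student),
        "student_only": sorted(student - teacher),
    }

# Hypothetical field names for the terrain-walking example.
teacher_fields = ["proprioception", "command_history", "exact_height_map"]
student_fields = ["proprioception", "command_history", "noisy_height_scan"]

diff = schema_diff(teacher_fields, student_fields)
print(diff)
# {'removed_from_student': ['exact_height_map'],
#  'shared': ['command_history', 'proprioception'],
#  'student_only': ['noisy_height_scan']}
```

Storing this dictionary next to the evaluation metrics lets a reviewer confirm at a glance that the exact height map never reached the student actor.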

Memory Hook

Privileged distillation is a tutoring session where the teacher can read the answer key, but the final exam confiscates it.

Research Frontier

The frontier is moving from simple action matching toward richer policy transfer: privileged critics, latent terrain encoders, history-based students, residual teachers, and datasets that mix expert rollouts with student recovery states. The open question is how to use privileged information without training brittle students that imitate actions they cannot explain from their own sensors.

Self Check

Can you list teacher inputs, student inputs, critic-only inputs, invalid-step masks, distillation loss, and the evaluation wrapper that removes privileged actor fields? If not, the distillation result is not deployment-safe.

The idea in this section becomes useful when privilege is treated as a controlled variable. A teacher may use more information, but every extra field must be named, justified, and blocked from the exported actor. Otherwise distillation becomes information leakage with a nicer name.

The graduate-level habit is to evaluate three claims separately. The teacher claim says privileged state improves expert behavior. The imitation claim says the student matches useful teacher actions on valid states. The deployment claim says the student still succeeds when privileged actor inputs are removed.

Practical Tool Choices For This Section
| Tool or Library | Role in the Topic | Builder Advice |
| --- | --- | --- |
| Privileged teacher | Policy or expert with simulator-only state | Use it to generate strong targets, but log every field the teacher sees. |
| Deployable student | Policy with onboard observations only | Use it as the only policy in final evaluation and export. |
| Asymmetric critic | Training-time value function with extra state | Use it to stabilize learning while keeping the actor interface deployment-safe. |
| Invalid-step mask | Filter for teacher falls, resets, or unsafe actions | Use it so the student does not imitate bad teacher moments. |
| Schema diff | Audit of teacher, student, and critic fields | Use it to prove that privileged information did not leak into the student actor. |

A robust implementation starts with an observation schema manifest. The manifest makes leakage visible by listing teacher-only, student, and critic-only fields separately.

  1. Freeze the deployable observation schema before collecting teacher data.
  2. Store teacher-only fields and critic-only fields as explicit lists.
  3. Save the mask rule for excluding failed teacher steps.
  4. Report imitation loss and closed-loop student metrics from the same held-out panel.
  5. Export a student checkpoint with a wrapper that rejects privileged actor inputs.
# Record the observation schema for privileged-information distillation.
# The exported actor must accept only fields listed under student_obs.
from dataclasses import dataclass, asdict

@dataclass
class DistillationSchema:
    teacher_only: tuple[str, ...]
    student_obs: tuple[str, ...]
    critic_only: tuple[str, ...]
    mask_rule: str
    eval_wrapper: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

schema = DistillationSchema(
    teacher_only=("exact_terrain_heights", "contact_impulses"),
    student_obs=("joint_pos", "joint_vel", "command", "history"),
    critic_only=("base_velocity",),
    mask_rule="exclude falls, timeouts, and saturated actions",
    eval_wrapper="student_actor_only",
)
print(schema.as_row())
{'teacher_only': ('exact_terrain_heights', 'contact_impulses'), 'student_obs': ('joint_pos', 'joint_vel', 'command', 'history'), 'critic_only': ('base_velocity',), 'mask_rule': 'exclude falls, timeouts, and saturated actions', 'eval_wrapper': 'student_actor_only'}
Code Fragment 17.5.2 records the schema that prevents privileged-information leakage. The student_obs tuple is the deployed actor contract, while teacher_only and critic_only fields are allowed only during training.
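The schema becomes enforceable once the exported actor checks its inputs against it. The following is a minimal sketch of such a wrapper, assuming observations arrive as a dictionary of arrays. `StudentActorWrapper` and the zero-action stand-in policy are hypothetical and belong to no particular framework.

```python
import numpy as np

class StudentActorWrapper:
    """Hypothetical export wrapper: accepts exactly the deployable fields."""

    def __init__(self, policy, student_obs):
        self.policy = policy
        self.student_obs = tuple(student_obs)

    def __call__(self, obs: dict) -> np.ndarray:
        extra = sorted(set(obs) - set(self.student_obs))
        missing = sorted(set(self.student_obs) - set(obs))
        if extra or missing:
            raise ValueError(
                f"observation schema mismatch: extra={extra}, missing={missing}"
            )
        # Concatenate in the frozen schema order so the layout never drifts.
        flat = np.concatenate([np.ravel(obs[name]) for name in self.student_obs])
        return self.policy(flat)

# Stand-in policy for illustration: a zero action for a 3-DoF robot.
actor = StudentActorWrapper(
    lambda flat: np.zeros(3),
    student_obs=("joint_pos", "joint_vel", "command", "history"),
)

clean = {"joint_pos": np.zeros(3), "joint_vel": np.zeros(3),
         "command": np.zeros(2), "history": np.zeros(6)}
print(actor(clean).shape)

# A privileged field in the actor input is an error, not a silent extra feature.
leaky = dict(clean, exact_terrain_heights=np.zeros(4))
try:
    actor(leaky)
except ValueError as err:
    print("rejected:", err)
```

Raising on extra fields, rather than silently dropping them, is a deliberate choice in this sketch: a dropped field hides the leak in the upstream observation pipeline, while an exception forces it to be fixed there.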

When a distilled student fails, decide whether the issue is teacher quality, state coverage, observation insufficiency, or leakage removal. A student that imitates well offline but fails closed-loop usually needs recovery-state data, history, or an interactive relabeling loop, not another blind epoch over the same expert states.

Evaluation Recipe

For privileged-information distillation, compare only metrics that measure the same construct and were computed in one pass under one configuration: the same held-out seed panel, deployable student wrapper, command distribution, perturbation suite, and success definition. Save teacher reward, imitation loss, student success, fall rate, schema diff, and leakage checks as one artifact.
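One way to keep those numbers together is to store them in a single record keyed by a hash of the evaluation configuration, so two results can be compared only when their hashes match. The sketch below is an assumption about how such an artifact could look. The field names are placeholders, and the metric values are left as `None` to be filled in by a real run.

```python
import hashlib
import json

def build_eval_artifact(config: dict, metrics: dict) -> dict:
    """Bundle one evaluation pass into a single record keyed by a config hash."""
    config_blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "config_hash": hashlib.sha256(config_blob).hexdigest()[:12],
        "config": config,
        "metrics": metrics,
    }

def comparable(a: dict, b: dict) -> bool:
    """Two artifacts are comparable only if they share the same configuration."""
    return a["config_hash"] == b["config_hash"]

# Placeholder configuration; none of these names come from a real benchmark.
config = {
    "seed_panel": [101, 102, 103],
    "wrapper": "student_actor_only",
    "command_distribution": "heldout_v1",
    "perturbation_suite": "pushes_v1",
    "success_definition": "no fall for 20 s",
}
# Metric slots only; values come from the evaluation run, not from this sketch.
metrics = {
    "teacher_reward": None,
    "imitation_loss": None,
    "student_success": None,
    "fall_rate": None,
    "schema_diff": None,
    "leakage_checks_passed": None,
}

artifact = build_eval_artifact(config, metrics)
other = build_eval_artifact({**config, "seed_panel": [201, 202, 203]}, metrics)

print(json.dumps(artifact, indent=2))
print("comparable:", comparable(artifact, other))  # False: the seed panels differ
```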

Key Takeaway

Privileged-information distillation is useful when simulator secrets improve training targets and then disappear from the deployed actor. The student result is credible only when the evaluation wrapper proves that disappearance.

Exercise 17.5.1

Design a privileged teacher for rough-terrain locomotion. List teacher-only fields, student observations, critic-only fields, mask rules, distillation loss, and the evaluation check that proves the student actor does not receive privileged state.

What's Next?

This section turned privileged-information distillation into a deployment-safe schema: teacher-only fields, student fields, critic-only fields, valid-step masks, and leakage-free evaluation. Next, continue with Section 17.6, where the same discipline is applied to throughput, wall-clock, GPU memory, and cost.

References & Further Reading
Foundational Papers, Tools, and Practice References

Makoviychuk, V. et al. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. arXiv.

Isaac Gym matters for this section because privileged simulator state is easiest to collect when thousands of environments already expose internal physics variables. That access is powerful only if the student interface stays deployment-safe.

Paper

Freeman, C. D. et al. (2021). Brax: A Differentiable Physics Engine for Large Scale Rigid Body Simulation. arXiv.

Brax is relevant when teacher data and student data are generated inside a batched JAX workflow. Its array-based design makes observation schemas and masks explicit, which helps prevent leakage.

Paper

NVIDIA Isaac Lab documentation.

Isaac Lab is useful for defining separate observation groups for actors, critics, and diagnostics. That makes it a natural setting for privileged teachers and deployable students when the wrapper contract is audited.

Tool

Google DeepMind MuJoCo MJX documentation.

MJX provides another route to simulator state that can support privileged teachers. The section's warning still applies: any exact simulator field used by a teacher must be removed from the exported actor path.

Tool

Rudin, N. et al. (2022). Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. CoRL.

Rudin et al. are relevant because fast locomotion training often combines asymmetric critics, privileged state, and deployable actors. Use the paper to connect distillation and privilege to a real locomotion workload.

Paper

RSL-RL repository.

RSL-RL is useful for inspecting how locomotion codebases represent actor observations and critic observations. That distinction is exactly what privileged-information distillation must preserve.

Tool