Section 18.5: Human preferences and learned reward models (RLHF for control)

A Careful Control Loop
Figure 18.5A: Human preference learning for robot control: a human compares two short trajectory clips, their preference signal trains a reward model, and that reward model drives PPO fine-tuning of the robot policy.
Big Picture

Preference learning replaces a brittle hand-written reward with a learned judgment model. Humans compare short clips, trajectories, or state summaries, and a reward model learns which behaviors people prefer. The policy then optimizes that learned model, so the model becomes part of the control loop.

In preference-based control, reward design should expose each objective term, its interaction with safety, its effect on exploration, and its deployment risk, rather than hiding them inside one scalar return.

This section develops the contract for reward models in embodied control. A learned reward model maps a trajectory segment $\tau$ to a scalar score $\hat R(\tau)$. The training data is not a numeric reward label for every state. It is usually a set of comparisons such as "clip A is safer, smoother, or more successful than clip B."

The key question is practical: when should the builder trust learned preferences more than a hand-written reward, and how can the builder detect when the policy starts exploiting the reward model?

The Judge Becomes The Reward

Once a reward model is trained, the policy optimizes the model's scores, not the human raters directly. The model therefore needs the same audit discipline as any other reward: calibration, coverage, disagreement checks, and adversarial rollouts.

Theory

A common preference model uses a Bradley-Terry form. If a rater prefers trajectory $\tau_A$ over $\tau_B$, the model predicts

$$P(\tau_A \succ \tau_B)=\frac{\exp(\hat R(\tau_A))}{\exp(\hat R(\tau_A))+\exp(\hat R(\tau_B))}.$$

Training minimizes the negative log probability of the observed preference. In embodied control, the rater prompt must be precise: should the rater prioritize task completion, smoothness, speed, clearance from humans, object damage, or recovery behavior? A vague prompt trains a vague reward model.
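It can help to rewrite the Bradley-Terry form in terms of the score difference. The short derivation below uses only the equation above; the shorthand $\Delta$ and the logistic function $\sigma$ are notation introduced here for convenience.

$$\Delta=\hat R(\tau_A)-\hat R(\tau_B),\qquad P(\tau_A \succ \tau_B)=\frac{1}{1+\exp(-\Delta)}=\sigma(\Delta).$$

$$\mathcal{L}=-\log \sigma(\Delta)=\log\bigl(1+\exp(-\Delta)\bigr),\qquad \frac{\partial \mathcal{L}}{\partial \hat R(\tau_A)}=-(1-\sigma(\Delta)),\qquad \frac{\partial \mathcal{L}}{\partial \hat R(\tau_B)}=1-\sigma(\Delta).$$

Two consequences follow. Only score differences matter, so adding a constant to every $\hat R(\tau)$ leaves the predictions unchanged. And the gradient shrinks toward zero as $\sigma(\Delta)$ approaches 1, so pairs the model already ranks confidently contribute little to training.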

Mechanism

Preference learning moves reward design from equation writing to data design. The core artifacts are the rater protocol, the comparison dataset, the reward-model validation set, and the policy rollouts that test whether the model is being exploited.
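One way to make the comparison dataset concrete is to fix a record schema before any labels are collected. The sketch below is a hypothetical layout, not a standard format: the class name ComparisonRecord, its field names, the file paths, and the "tie" label are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed label vocabulary; a real rubric may define ties differently.
VALID_LABELS = {"A", "B", "tie"}

@dataclass(frozen=True)
class ComparisonRecord:
    """One rater judgment on one pair of trajectory clips (hypothetical schema)."""
    pair_id: str
    clip_a_path: str
    clip_b_path: str
    preferred: str        # "A", "B", or "tie"
    rater_id: str
    rubric_version: str   # ties the label to the rubric the rater actually saw

    def __post_init__(self) -> None:
        # Reject labels outside the agreed vocabulary at creation time.
        if self.preferred not in VALID_LABELS:
            raise ValueError(f"preferred must be one of {sorted(VALID_LABELS)}")

example = ComparisonRecord(
    pair_id="pair_001",
    clip_a_path="clips/grasp_slow.mp4",
    clip_b_path="clips/grasp_fast.mp4",
    preferred="A",
    rater_id="rater_07",
    rubric_version="v1",
)
print(example)
```

Storing the rubric version with every label makes it possible to separate a change in rater behavior from a change in instructions.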

Worked Example

Suppose raters prefer a slower grasp that avoids scraping the table over a faster grasp that succeeds but collides. Code Fragment 1 computes the probability assigned to the preferred clip and the corresponding loss.

# Compute one preference-model loss for two trajectory clips.
# The preferred clip should receive the higher learned reward score.
from math import exp, log

reward_safe = 1.4
reward_fast_collision = 0.2
prob_safe_preferred = exp(reward_safe) / (exp(reward_safe) + exp(reward_fast_collision))
loss = -log(prob_safe_preferred)

print("P(safe preferred)=", round(prob_safe_preferred, 3))
print("preference_loss=", round(loss, 3))
P(safe preferred)= 0.769
preference_loss= 0.263
Code Fragment 1: The variables reward_safe and reward_fast_collision stand in for reward-model scores on two clips. The preference_loss of 0.263 is below log 2 ≈ 0.693, the loss when both clips receive equal scores, which shows that the model assigns higher probability to the rater's preferred safe trajectory.

Expected behavior: the probability rises as the preferred clip's score exceeds the other clip's score by a wider margin. If the score gap grows on training pairs but the model misranks held-out clips, the reward model has memorized the comparison set rather than learned the rater's criterion.
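The memorization check can be sketched as a comparison of pair accuracy on training and held-out comparisons. The helper name pair_accuracy and all scores below are made up for illustration; in practice the scores would come from the trained reward model.

```python
def pair_accuracy(score_pairs):
    """Fraction of pairs where the rater-preferred clip gets the higher score.

    score_pairs: list of (score_preferred, score_rejected) tuples.
    """
    if not score_pairs:
        raise ValueError("need at least one comparison pair")
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)

# Illustrative scores only, not measurements.
train_pairs = [(1.4, 0.2), (2.1, -0.3), (0.9, 0.1), (1.7, 0.4)]
heldout_pairs = [(0.6, 0.8), (1.1, 0.3), (0.2, 0.5), (0.9, 0.4)]

train_acc = pair_accuracy(train_pairs)
heldout_acc = pair_accuracy(heldout_pairs)
generalization_gap = train_acc - heldout_acc

print("train_acc=", train_acc)
print("heldout_acc=", heldout_acc)
print("generalization_gap=", generalization_gap)
```

In this toy data the model ranks every training pair correctly but only half of the held-out pairs, which is the pattern that should block policy optimization until the comparison set is broadened.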

Library Shortcut

Use preference-learning or RLHF tooling for batching comparisons, training reward models, and logging rater agreement, but keep the embodied artifacts close: clips, state traces, collision logs, intervention flags, and final task metrics. The tool can train the scorer; it cannot decide what the rater should value.

Practical Recipe

  1. Define the rater rubric before collecting comparisons.
  2. Sample diverse trajectory pairs, including failures, recoveries, near misses, and easy successes.
  3. Track rater disagreement and remove or investigate ambiguous pairs.
  4. Validate the reward model on held-out clips and adversarial high-score rollouts.
  5. Report policy performance using task success and safety costs, not reward-model score alone.
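Step 3 of the recipe can be sketched as a per-pair disagreement rate. The function name, the pair IDs, the votes, and the 0.4 threshold are assumptions for illustration; a real protocol would set the threshold from the rubric and the number of raters per pair.

```python
from collections import Counter

def disagreement_rate(votes):
    """Fraction of raters who disagree with the majority label for one pair."""
    if not votes:
        raise ValueError("need at least one vote")
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - majority_count / len(votes)

# Hypothetical votes from four raters per pair.
pair_votes = {
    "pair_001": ["A", "A", "A", "A"],
    "pair_002": ["A", "B", "A", "B"],
    "pair_003": ["B", "B", "A", "B"],
}

AMBIGUITY_THRESHOLD = 0.4  # assumed cutoff for "investigate this pair"

flagged = sorted(
    pair_id
    for pair_id, votes in pair_votes.items()
    if disagreement_rate(votes) >= AMBIGUITY_THRESHOLD
)
print("pairs to investigate:", flagged)
```

Flagged pairs are worth reading before deleting: a cluster of ambiguous pairs often points to a missing tie rule in the rubric rather than to careless raters.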
Common Failure Mode

A learned reward model can be hacked by the policy that optimizes it. If the model learned that smooth-looking video implies safety, the policy may learn visually smooth motions that hide contact forces or damage. Always evaluate optimized policies with independent physical metrics.
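A minimal exploitation probe, assuming each rollout is logged with both a reward-model score and an independent physical measurement, is to inspect the highest-scoring rollouts for violations the raters could not see. The rollout tuples, the force values, and the 15 N limit below are invented for illustration.

```python
# Each entry: (rollout_id, reward_model_score, peak_contact_force_newtons).
# All values are hypothetical.
rollouts = [
    ("r1", 2.9, 3.0),
    ("r2", 3.4, 21.5),
    ("r3", 1.2, 25.0),
    ("r4", 3.1, 4.2),
    ("r5", 0.4, 2.0),
]

TOP_K = 3              # how many top-scoring rollouts to inspect
FORCE_LIMIT_N = 15.0   # assumed contact-force limit for this task

# Rank by reward-model score, highest first, and keep the top K.
top_scored = sorted(rollouts, key=lambda r: r[1], reverse=True)[:TOP_K]

# A suspected exploit scores well but violates the independent physical limit.
suspected_exploits = [rid for rid, _score, force in top_scored if force > FORCE_LIMIT_N]

print("top-scored rollouts:", [rid for rid, _score, _force in top_scored])
print("suspected exploits:", suspected_exploits)
```

Rollout r3 also exceeds the force limit, but the reward model already scores it low, so it is an ordinary failure rather than an exploit. The probe targets the cases where the judge and the physics disagree.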

Practical Example

For a home-assistance robot, raters might compare two clips of placing a cup on a table. The rubric should say whether a slow but careful placement beats a fast placement with a hard contact, and the logged artifact should include force or contact proxies so the reward model can be checked against physical evidence.

Memory Hook

A reward model is a judge with a very large clipboard. The policy will eventually learn which boxes on the clipboard matter and which real-world details the judge forgot to ask about.

Research Frontier

Preference learning for control is expanding from clip comparisons toward multimodal feedback, language critiques, active query selection, and reward models trained across robot datasets. The frontier challenge is robustness under optimization pressure: the reward model must keep judging correctly as the policy searches for unusual high-score behavior.

Self Check

Can you state the rater rubric, the disagreement rate, the held-out validation result, and the independent deployment metric? If not, the learned reward model is under-audited.

Learned rewards are most useful when the desired behavior is hard to write as a formula but easy for trained raters to compare. Smooth recovery, respectful distance, gentle contact, and task style often fit this pattern. They are least useful when the rater cannot observe the relevant evidence, such as hidden force, internal wear, or a delayed safety consequence.

The graduate-level habit is to treat reward-model optimization as a source of distribution shift. The model is trained on comparison clips, then the policy searches for actions that maximize it, which pushes rollouts away from the clips the raters labeled. High reward-model scores should therefore trigger adversarial review, uncertainty checks, and independent embodied metrics.
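One common form of uncertainty check, sketched here under the assumption that several reward models were trained on the same comparisons (an ensemble), is to flag rollouts whose mean score is high while the ensemble members disagree. The function name, thresholds, and scores are illustrative assumptions.

```python
from statistics import mean, pstdev

def needs_review(member_scores, score_threshold=2.5, spread_threshold=0.5):
    """Flag a rollout whose ensemble mean is high but whose members disagree."""
    high_score = mean(member_scores) >= score_threshold
    uncertain = pstdev(member_scores) >= spread_threshold
    return high_score and uncertain

# Hypothetical scores from a three-member reward-model ensemble.
rollout_scores = {
    "rollout_a": [3.0, 3.1, 2.9],   # high score, members agree
    "rollout_b": [4.5, 1.0, 3.5],   # high score, members disagree
    "rollout_c": [0.5, 0.4, 0.6],   # low score
}

flagged = [rid for rid, scores in rollout_scores.items() if needs_review(scores)]
print("rollouts for adversarial review:", flagged)
```

Ensemble spread is only a proxy for being off-distribution. Agreement among members does not prove the score is right, so the independent embodied metrics remain necessary.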

Practical Tool Choices For This Section
| Tool or Library | Role in the Topic | Builder Advice |
| --- | --- | --- |
| Preference data tool | Comparison collection | Use it to present clips consistently and store rater IDs, rubric version, and disagreement. |
| Gymnasium | Policy optimization | Use consistent environments so reward-model scores and task metrics come from the same rollouts. |
| MuJoCo | Hidden physical checks | Log contact, force proxies, and object state so visual preferences can be audited. |
| LeRobot | Clip and dataset management | Connect comparison labels to demonstrations, replay videos, and policy rollouts. |
| ROS 2 | Hardware validation | Record real controller and safety topics when reward-model policies move beyond simulation. |

A robust implementation records the human-labeling process as carefully as the RL run. Without the rater protocol and comparison distribution, a learned reward score is not interpretable.

  1. Write a rater rubric with ranked criteria and tie rules.
  2. Collect comparison pairs across easy, hard, unsafe, and ambiguous cases.
  3. Train the reward model and report held-out pair accuracy plus calibration.
  4. Optimize the policy under the reward model while saving high-score rollouts.
  5. Evaluate with independent success, safety, and intervention metrics.
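The calibration report in step 3 can be sketched with a binned calibration error over held-out pairs. This is one possible metric, not the only one; the function names and the choice of five equal-width bins are assumptions made for this example.

```python
from math import exp

def preference_prob(score_a, score_b):
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + exp(-(score_a - score_b)))

def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted mean of |mean predicted probability - observed frequency| per bin.

    probs: predicted P(A preferred) for each held-out pair.
    labels: 1 if the rater preferred A, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_p - freq)
    return ece

# Toy check: a model that says 0.5 on pairs raters split evenly is calibrated,
# while a model that says 1.0 on the same pairs is overconfident.
print(expected_calibration_error([0.5, 0.5], [1, 0]))
print(expected_calibration_error([1.0, 1.0], [1, 0]))
```

Pair accuracy and calibration answer different questions: a model can rank most pairs correctly while still being overconfident on the pairs it gets wrong, and overconfident scores are the ones a policy is most likely to exploit.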

Code Fragment 2 records the minimum audit fields for a learned reward used in control.

# Build one preference-reward audit record for control.
# The fields connect human labels, model validation, and policy evaluation.
from dataclasses import dataclass, asdict

@dataclass
class PreferenceRewardAudit:
    section: str
    rater_rubric: str
    validation_check: str
    exploitation_probe: str
    deployment_metrics: list[str]

    def as_row(self) -> dict[str, object]:
        return asdict(self)

record = PreferenceRewardAudit(
    section="18.5",
    rater_rubric="prefer task success, then gentle contact, then smooth recovery",
    validation_check="held-out pair accuracy plus disagreement review",
    exploitation_probe="top reward-model rollouts inspected for hidden contact",
    deployment_metrics=["task_success", "contact_cost", "human_intervention_rate"],
)
print(record.as_row())
{'section': '18.5', 'rater_rubric': 'prefer task success, then gentle contact, then smooth recovery', 'validation_check': 'held-out pair accuracy plus disagreement review', 'exploitation_probe': 'top reward-model rollouts inspected for hidden contact', 'deployment_metrics': ['task_success', 'contact_cost', 'human_intervention_rate']}
Code Fragment 2: The PreferenceRewardAudit record ties the rater rubric to validation, exploitation probes, and deployment metrics. The exploitation_probe field is essential because optimizing a reward model creates new behaviors the raters may never have labeled.

When a learned reward policy fails, assign the failure to rater ambiguity, dataset coverage, reward-model overfitting, optimization exploit, or missing physical evidence. Then add comparison pairs that target that failure and rerun the policy evaluation on the same seed panel.

Evaluation Recipe

For learned reward models, compare reward-model score, task success, safety cost, rater-agreement diagnostics, and intervention rate only when they are computed from the same rollouts under a single configuration. Save comparison data, rubric version, reward-model checkpoint, policy checkpoint, traces, and failure labels in one artifact.
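A simple guard for this recipe, sketched with hypothetical run records, is to refuse a comparison unless two runs share a configuration and seed panel and both log every required metric. The dictionary keys, metric names, and config IDs below are assumptions for illustration.

```python
# Metric names are assumed; they mirror the quantities listed in the recipe.
REQUIRED_METRICS = {
    "reward_model_score",
    "task_success",
    "safety_cost",
    "rater_agreement",
    "intervention_rate",
}

def comparable(run_a, run_b):
    """True if both runs use the same config and seeds and log all required metrics."""
    same_setup = (
        run_a["config_id"] == run_b["config_id"]
        and run_a["seeds"] == run_b["seeds"]
    )
    complete = all(REQUIRED_METRICS <= set(run["metrics"]) for run in (run_a, run_b))
    return same_setup and complete

# Hypothetical run records with placeholder metric values.
full_metrics = {name: 0.0 for name in REQUIRED_METRICS}
baseline = {"config_id": "cfg_1", "seeds": [0, 1, 2], "metrics": dict(full_metrics)}
candidate = {"config_id": "cfg_1", "seeds": [0, 1, 2], "metrics": dict(full_metrics)}
other_seeds = {"config_id": "cfg_1", "seeds": [3, 4, 5], "metrics": dict(full_metrics)}

print(comparable(baseline, candidate))
print(comparable(baseline, other_seeds))
```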

Key Takeaway

A learned reward model is useful when it captures human judgment that a formula missed, and safe enough only when optimized policies still pass independent embodied metrics.

Exercise 18.5.1

Design a rater rubric for a robot pouring task. Include three comparison criteria, one tie rule, one hidden physical metric the rater cannot see, and one exploitation probe for the trained reward model.

What's Next?

This section showed how preferences can become a learned reward while introducing new audit requirements. Next, Section 18.6 separates reward maximization from hard safety constraints and cost budgets.

References & Further Reading
Foundational Papers, Tools, and Practice References

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations. ICML.

Potential-based shaping is a formal contrast to learned reward models. It shows a case where reward changes have a policy-invariance guarantee, while preference rewards require empirical validation.

Paper

Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.

HER is a useful comparison point because it changes labels in replay, while preference learning changes the reward estimator. Both require clear separation between training signal and final evaluation.

Paper

Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.

The reward-hacking categories apply directly to learned reward models. A policy can exploit the model's blind spots even when the original labels came from humans.

Paper

Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.

This is the central paper for learning rewards from human preferences. It motivates the comparison-loss formulation and the audit need for held-out clips, rater agreement, and optimized-policy review.

Paper

Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning. OpenAI.

Safety Gym is relevant because human raters may not observe every safety cost. Explicit cost channels give an independent check on policies optimized against learned rewards.

Paper

Farama Foundation Safety Gymnasium documentation.

Safety Gymnasium helps evaluate learned-reward policies with separate safety costs. It is a practical way to catch reward-model exploitation that looks acceptable in video comparisons.

Tool