Section 18.3: Goal-conditioned policies; hindsight experience replay | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 18.3A: Hindsight experience replay in action. A failed trajectory (the agent reached the wrong block) is relabeled with the wrong block as the goal, turning the failure into a successful transition for that relabeled goal.

Big Picture

Goal-conditioned reinforcement learning asks one policy to solve many tasks. Instead of training a separate policy for every target pose, the agent receives the desired goal as part of the input and learns what action is useful for the current state-goal pair.

For goal-conditioned policies trained with hindsight experience replay, reward design should keep four things visible rather than folding them into one scalar return: the objective term, its interaction with safety constraints, its effect on exploration, and the deployment risk.

This section develops the contract for policies of the form $\pi(a \mid s,g)$, where $g$ is a desired goal such as a target object pose, waypoint, door angle, drawer position, or language-grounded task state. The reward is also goal-indexed: $r_g(s,a,s')$ asks whether the transition moved the world toward that particular goal.
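As a minimal sketch of what a goal-indexed reward and a goal-conditioned policy input can look like, assume goals are points in task coordinates and success means landing within a tolerance. The names `goal_reward` and `policy_input`, the 0/1 reward convention, and the 0.05 tolerance are illustrative assumptions, not part of any library.

```python
import math

def goal_reward(achieved_goal, desired_goal, tolerance=0.05):
    """Sparse goal-indexed reward: 1 if the achieved goal is within tolerance, else 0."""
    distance = math.dist(achieved_goal, desired_goal)
    return 1 if distance <= tolerance else 0

def policy_input(state, goal):
    """Concatenate state and goal to form the input of pi(a | s, g)."""
    return tuple(state) + tuple(goal)

# Object xy position (assumed to be in meters) after one transition.
next_achieved = (0.70, 0.20)

print(goal_reward(next_achieved, (1.00, 0.20)))  # requested goal, 0.30 away -> 0
print(goal_reward(next_achieved, (0.72, 0.21)))  # nearby goal, about 0.022 away -> 1
print(policy_input((0.10, 0.20), (1.00, 0.20)))
```

Because the reward depends only on the achieved goal, the candidate goal, and the tolerance, it can be recomputed later for any goal without replaying the episode.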

The key question is practical: how can a robot learn from an attempt that missed the requested goal but still reached a different, useful state?

Every Attempt Reaches Something

Hindsight Experience Replay does not pretend a failed attempt solved the original task. It says the same transition can be valid supervision for a different goal, the one the agent actually achieved.

Theory

A goal-conditioned replay buffer stores transitions as $(s_t, a_t, s_{t+1}, g, r_g)$. In sparse-goal tasks, most early trials fail, so the original reward is zero for nearly every stored transition. HER adds extra replay entries by replacing the desired goal $g$ with a goal $\tilde g$ that was achieved later in the same episode, then recomputing the reward $r_{\tilde g}$.

The mechanism works because the physical transition did happen. The same push, step, or grasp attempt can teach the value of moving from $s_t$ to $s_{t+1}$ under a different goal label. The evaluation, however, must remain on the original goal distribution; otherwise relabeling becomes a way to lower the bar rather than learn the task.
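The relabeling rule can be sketched in a few lines. The sketch below assumes a one-dimensional integer goal, an exact-match 0/1 reward, and a hypothetical helper `relabel_future` that samples `k` hindsight goals per transition from goals achieved at that step or later in the same episode. The field names and the value of `k` are illustrative choices.

```python
import random

def reward(achieved_goal, goal):
    # Assumed sparse reward on a 1-D integer goal: exact match only.
    return int(achieved_goal == goal)

def relabel_future(episode, k=2, seed=0):
    """Return each original transition plus k hindsight copies of it."""
    rng = random.Random(seed)
    replay = []
    for t, step in enumerate(episode):
        # The original entry keeps the requested goal.
        replay.append({**step,
                       "reward": reward(step["next_achieved"], step["goal"]),
                       "hindsight": False})
        # Candidate hindsight goals: goals achieved from this step onward.
        future = [s["next_achieved"] for s in episode[t:]]
        for new_goal in rng.choices(future, k=k):
            replay.append({**step,
                           "goal": new_goal,
                           "reward": reward(step["next_achieved"], new_goal),
                           "hindsight": True})
    return replay

episode = [
    {"state": 0, "action": "push", "next_achieved": 3, "goal": 10},
    {"state": 3, "action": "push", "next_achieved": 7, "goal": 10},
]
for entry in relabel_future(episode):
    print(entry)
```

Note that a hindsight entry is not automatically a success: the first push relabeled with goal 7 still earns reward 0, because the block was at 3 after that step. The recomputed reward, not the relabeling itself, decides which entries are positive.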

Paper Spotlight

Hindsight Experience Replay (Andrychowicz et al., NeurIPS 2017): relabeling failed trajectories with the goals they actually achieved enables learning from binary sparse reward on robotic manipulation. It is the technique that lets a robot extract useful supervision from every attempt, even the many that miss the requested goal.

Mechanism

HER turns sparse-reward failures into a denser supply of positive training examples by changing labels, not physics. It is most natural when the environment can report both desired_goal and achieved_goal and when reward can be recomputed from those fields.

Worked Example

Suppose a block was supposed to end at position 10, but the robot pushed it to position 7. The original transition failed. HER can relabel part of the replay entry with goal 7, because that is what the trajectory actually achieved. Code Fragment 1 shows the relabeling step.

# Relabel failed transitions with goals the episode actually achieved.
# The original task remains unchanged for final evaluation.
episode = [
    # achieved_goal is the block position after the action is applied.
    {"state": 0, "action": "push", "achieved_goal": 3},
    {"state": 3, "action": "push", "achieved_goal": 7},
]
requested_goal = 10

for transition in episode:
    achieved = transition["achieved_goal"]
    # Reward under the goal the robot was asked to reach.
    original_reward = int(achieved == requested_goal)
    # Hindsight goal: the position this transition actually reached,
    # so the recomputed reward is 1 by construction.
    relabeled_goal = achieved
    hindsight_reward = int(achieved == relabeled_goal)
    print(transition["action"], "g=", requested_goal, "r=", original_reward,
          "her_g=", relabeled_goal, "her_r=", hindsight_reward)

push g= 10 r= 0 her_g= 3 her_r= 1
push g= 10 r= 0 her_g= 7 her_r= 1

Code Fragment 1: The requested_goal remains 10, so the original reward is zero. The HER entries use each achieved_goal as a relabeled goal, producing extra positive training examples without changing the final evaluation task.

Expected output: the same physical transitions are visible under two labels: failed for goal 10, successful for goals 3 and 7. The relabeled reward is a replay trick, not a deployment metric.

Library Shortcut

Use goal-aware Gymnasium environments and replay buffers that expose observation, desired_goal, and achieved_goal. Stable-Baselines3 and related RL libraries can handle the replay mechanics, but the builder must ensure reward recomputation matches the environment's goal semantics.

Practical Recipe

Define the goal representation in task coordinates, not only in pixels.
Log both desired and achieved goals at every step.
Write a reward function that can be recomputed for any candidate goal.
Relabel replay entries with achieved goals from the same episode.
Evaluate only on held-out requested goals, with the original success threshold.

Common Failure Mode

HER can create a misleading sense of progress if the relabeled goals are much easier than the requested goals. A robot that often bumps the block somewhere learns many hindsight successes, but it still may not learn precise placement unless evaluation remains tied to the original goal distribution.
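A small numeric sketch makes the gap concrete. The episode values, the 0.5 tolerance, and the helper name `success` below are made up for illustration; the point is only that success under relabeled goals says nothing about success under requested goals.

```python
def success(achieved, goal, tolerance=0.5):
    """True if the achieved 1-D position is within tolerance of the goal."""
    return abs(achieved - goal) <= tolerance

# Hypothetical evaluation episodes: (requested_goal, final_achieved_goal).
episodes = [(10.0, 7.0), (10.0, 9.8), (4.0, 6.5), (8.0, 3.0)]

# Success measured against the goals the robot was asked to reach.
requested_rate = sum(success(a, g) for g, a in episodes) / len(episodes)

# Relabeling each episode with its own final achieved goal always "succeeds".
hindsight_rate = sum(success(a, a) for _, a in episodes) / len(episodes)

print(f"requested-goal success: {requested_rate:.2f}")  # 0.25
print(f"hindsight success:      {hindsight_rate:.2f}")  # 1.00
```

Only the first number belongs in a results table. The second is a property of the relabeling rule, not of the policy.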

Practical Example

In a drawer-opening task, the desired goal might be a handle pose or drawer opening angle. If the robot opens the drawer halfway instead of fully, HER can relabel the episode as a success for the halfway goal, while the deployment metric still asks for the commanded angle.

Memory Hook

HER is the robot saying, "I missed your target, but I did hit this other one. Please file that under useful experience, not victory."

Research Frontier

Goal-conditioned policies now connect to language-conditioned control, foundation-model goal proposals, and offline robot datasets with many task labels. The hard part is still grounding: a goal embedding must correspond to a verifiable state change, not only to a plausible instruction string.

Self Check

Can your environment recompute reward from achieved_goal and desired_goal without peeking at training history? If not, HER will be hard to make reproducible.

The main design choice is the goal space. A pose goal, image goal, language goal, and contact-state goal create different generalization problems. Pose goals are easy to score but may miss semantic intent. Image goals can capture rich state but may reward visual coincidence. Language goals are flexible but need grounding into measurable state.

The graduate-level habit is to separate relabeling from evaluation. Relabeling changes the training distribution in the replay buffer. Evaluation samples desired goals from the task distribution and computes success without hindsight. Mixing these two distributions invalidates the comparison because the training trick becomes part of the reported task.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Gymnasium goal API	Goal fields	Use `desired_goal` and `achieved_goal` fields so reward can be recomputed cleanly.
Stable-Baselines3 HER	Replay relabeling	Use maintained replay buffers once the reward function has been tested by hand.
MuJoCo	Goal measurement	Compute achieved goals from simulator state, such as object pose or joint angle, with explicit units.
LeRobot	Dataset goals	Use demonstration metadata to check whether goals are observable and consistently labeled.
ROS 2	Hardware goals	Publish desired and achieved goal topics so controller logs can be audited after real rollouts.

A robust implementation starts by testing the reward recomputation function independently. If the function cannot score arbitrary achieved-goal and desired-goal pairs, the replay buffer cannot relabel transitions safely.

Choose a goal representation and success tolerance.
Verify reward recomputation on hand-written state-goal pairs.
Store desired goal, achieved goal, action, next achieved goal, and done flag.
Relabel a controlled fraction of replay samples with future achieved goals.
Report original-goal success, hindsight sample ratio, and goal-distribution coverage.
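The reporting step can be sketched as follows, assuming xy goals in meters on a 1 m by 1 m workspace and a coarse 0.25 m grid for coverage. The replay entries, the grid size, and the helper names `goal_cells` and `coverage` are hypothetical and chosen only to keep the arithmetic easy to check.

```python
def goal_cells(goals, cell=0.25):
    """Map xy goals to the set of grid cells they fall into."""
    return {tuple(int(c // cell) for c in goal) for goal in goals}

def coverage(goals, cell=0.25, workspace=1.0):
    """Fraction of workspace grid cells touched by at least one goal."""
    cells_per_axis = int(workspace / cell)
    return len(goal_cells(goals, cell)) / cells_per_axis ** 2

# Hypothetical replay sample: each entry records its goal and its origin.
replay = [
    {"goal": (0.90, 0.90), "hindsight": False},
    {"goal": (0.30, 0.10), "hindsight": True},
    {"goal": (0.35, 0.20), "hindsight": True},
    {"goal": (0.60, 0.10), "hindsight": True},
    {"goal": (0.90, 0.90), "hindsight": False},
]

hindsight_ratio = sum(e["hindsight"] for e in replay) / len(replay)
relabeled = [e["goal"] for e in replay if e["hindsight"]]
requested = [e["goal"] for e in replay if not e["hindsight"]]
# Requested-goal cells that no hindsight goal has reached yet.
uncovered = goal_cells(requested) - goal_cells(relabeled)

print("hindsight_ratio:", hindsight_ratio)           # 0.6
print("relabeled coverage:", coverage(relabeled))    # 0.125
print("requested cells never reached:", uncovered)   # {(3, 3)}
```

In this made-up sample the hindsight goals cluster near the start of the workspace while the requested goal sits in a cell no relabeled goal reaches, which is the pattern to look for when hindsight success rises but requested-goal success does not.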

Code Fragment 2 captures the audit fields that make a HER run interpretable.

# Build one HER audit record for a goal-conditioned policy.
# The record separates replay relabeling from deployment evaluation.
from dataclasses import dataclass, asdict

@dataclass
class HERAudit:
    section: str
    goal_space: str
    relabel_strategy: str
    reward_recompute_test: str
    report_metrics: list[str]

    def as_row(self) -> dict[str, object]:
        return asdict(self)

record = HERAudit(
    section="18.3",
    goal_space="object xy pose in meters",
    relabel_strategy="future achieved goals from the same episode",
    reward_recompute_test="score three hand-written achieved/desired goal pairs",
    report_metrics=["requested_goal_success", "hindsight_ratio", "goal_coverage"],
)
print(record.as_row())

{'section': '18.3', 'goal_space': 'object xy pose in meters', 'relabel_strategy': 'future achieved goals from the same episode', 'reward_recompute_test': 'score three hand-written achieved/desired goal pairs', 'report_metrics': ['requested_goal_success', 'hindsight_ratio', 'goal_coverage']}

Code Fragment 2: The HERAudit record names the goal space, the relabeling rule, and the reward recomputation test. The report_metrics list protects the final report from presenting hindsight success as requested-goal success.

When HER fails, check whether the achieved goals are too narrow, too noisy, or not physically meaningful. Then inspect whether the relabeled goals match states the robot can intentionally reproduce. The failure label should distinguish exploration failure, goal-representation failure, reward-recompute bug, and evaluation-distribution mismatch.

Evaluation Recipe

For goal-conditioned policies, compare requested-goal success, hindsight relabel ratio, safety cost, and goal coverage only when all four were computed from the same run under the same configuration. Save desired goals, achieved goals, relabeled goals, reward values, and failure labels so every number can be traced back to the same replay and evaluation settings.

Key Takeaway

HER is powerful because it learns from missed attempts, but the final claim must still be measured on the goals the system was actually asked to achieve.

Exercise 18.3.1

For a pushing or drawer task, define desired_goal, achieved_goal, and the success tolerance. Then write two legal HER relabels and one illegal relabel that would corrupt evaluation.

What's Next?

This section showed how goal conditioning and hindsight relabeling reuse failed experience without changing the requested task. Next, Section 18.4 turns to the failure mode that appears when a policy finds a shortcut in the reward itself.

References & Further Reading

Foundational Papers, Tools, and Practice References

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations. ICML.

Potential-based shaping is useful contrast for HER. Shaping adds dense reward through a potential, while HER changes replay labels and recomputes goal-conditioned rewards.

Paper

Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.

This is the canonical paper for Hindsight Experience Replay. It explains how failed attempts become training data for achieved goals while evaluation remains tied to requested goals.

Paper

Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.

The safety categories matter for goal-conditioned policies because relabeled success can hide side effects. A HER experiment still needs constraint and failure labels for the original task distribution.

Paper

Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.

Preference data can help define goals that are hard to express as poses or thresholds. It also introduces the same grounding problem: the goal label must correspond to verifiable state change.

Paper

Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning. OpenAI.

Safety Gym is relevant when goal-conditioned exploration creates unsafe intermediate states. It encourages reporting requested-goal success alongside safety costs.

Paper

Safety-Gymnasium documentation (PKU-Alignment).

Safety Gymnasium can be used to test goal-conditioned policies under explicit cost channels. It is especially useful when relabeling improves learning but may also increase risky exploration.

Tool