Section 26.3: Skill discovery and hierarchical RL | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 26.3.A: Skill discovery via unsupervised bottleneck detection. Subgoal states are identified where trajectories frequently converge, option policies are trained to reach each subgoal, and a meta-controller selects among the options.

Big Picture

Skill discovery and hierarchical RL treat action as a hierarchy rather than a flat stream of motor commands. A skill gives the planner a reusable temporal abstraction with an initiation condition, an internal policy, a termination rule, and a verification contract.

Why Hierarchy Matters

Hierarchy separates timing, contact, recovery, and sequencing into distinct layers, so a high-level planner can select skills without pretending that every low-level policy is deterministic.

Skill discovery asks whether useful behaviors can be learned before the final task reward is known. Hierarchical RL then decides how to select and compose those behaviors for a downstream objective.

Skill Equals Promise

Treat each skill as an interface whose fields are all explicit: initiation set, internal controller, progress signal, termination rule, verifier, and recovery status.
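The interface fields listed above can be written down as a small typed record. The following is a minimal sketch, not a prescribed API: the class name `SkillContract`, the flat dictionary state, and the field names and thresholds in the `grasp_handle` example are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical state representation: a flat dictionary of named scalar readings.
State = Dict[str, float]


@dataclass(frozen=True)
class SkillContract:
    """One skill expressed as an explicit interface (illustrative field names)."""
    name: str
    can_start: Callable[[State], bool]    # initiation set
    policy: Callable[[State], str]        # internal controller, returns an action label
    progress: Callable[[State], float]    # progress signal in [0, 1]
    should_stop: Callable[[State], bool]  # termination rule
    verify: Callable[[State], bool]       # verifier for the postcondition
    recovery: str                         # name of the recovery route


# Illustrative thresholds only; real values depend on the robot and gripper.
grasp_handle = SkillContract(
    name="grasp_handle",
    can_start=lambda s: s["handle_visible"] > 0.5 and s["distance_m"] < 0.6,
    policy=lambda s: "close_gripper" if s["distance_m"] < 0.05 else "approach",
    progress=lambda s: max(0.0, 1.0 - s["distance_m"] / 0.6),
    should_stop=lambda s: s["grip_force_n"] > 5.0,
    verify=lambda s: s["grip_force_n"] > 5.0 and s["distance_m"] < 0.05,
    recovery="open_gripper_and_retry",
)

state = {"handle_visible": 1.0, "distance_m": 0.3, "grip_force_n": 0.0}
print(grasp_handle.can_start(state))  # True
print(grasp_handle.policy(state))     # approach
print(grasp_handle.progress(state))   # 0.5
```

Keeping every field as a separate callable means a planner or a test harness can query the initiation set or the verifier without running the controller.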

Formal Contract

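Use the option tuple as an audit checklist: the initiation states, internal policy, termination probability, and verifier must each match the robot task. In the options framework of Sutton, Precup, and Singh (1999), an option consists of an initiation set, a policy, and a termination condition; the verifier extends that tuple for robot deployment.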

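One objective that combines skill discovery with task learning adds a diversity term to the expected task return:

$$\max_{\pi,z}\; I(z;s_T) + \mathbb{E}\left[\sum_t r_{\mathrm{task}}(s_t,a_t)\right],$$

where $z$ is a latent skill variable, $s_T$ is the final state of a trajectory, $I(z;s_T)$ is the mutual information between them, $\pi$ is the skill-conditioned policy, and $r_{\mathrm{task}}$ is the downstream task reward. The first term rewards skills that end in distinguishable states; the second rewards progress on the task.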
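To make the mutual-information term $I(z;s_T)$ concrete, the sketch below computes a plug-in estimate from paired samples of a discrete skill index and a discretized final state. It is a toy under stated assumptions: the function name `mutual_information`, the binning of final states, and the rollout data are hypothetical, and practical discovery methods usually optimize a learned variational bound instead of counting.

```python
import numpy as np


def mutual_information(z, s, n_z, n_s):
    """Plug-in estimate of I(z; s) in nats from paired discrete samples."""
    joint = np.zeros((n_z, n_s))
    for zi, si in zip(z, s):
        joint[zi, si] += 1.0
    joint /= joint.sum()                   # empirical joint p(z, s)
    pz = joint.sum(axis=1, keepdims=True)  # marginal p(z), shape (n_z, 1)
    ps = joint.sum(axis=0, keepdims=True)  # marginal p(s), shape (1, n_s)
    mask = joint > 0                       # skip empty cells to avoid log(0)
    ratio = joint[mask] / (pz @ ps)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))


# Hypothetical rollouts: skill index and discretized final-state bin per episode.
z = np.array([0, 0, 1, 1])
s_distinct = np.array([0, 0, 1, 1])   # each skill ends in its own bin
s_collapsed = np.array([0, 0, 0, 0])  # every skill ends in the same bin

mi_distinct = mutual_information(z, s_distinct, 2, 2)
mi_collapsed = mutual_information(z, s_collapsed, 2, 2)
print(mi_distinct)   # log 2, about 0.693 nats
print(mi_collapsed)  # 0.0
```

Skills that end in separate regions of state space score the maximum for two equally likely skills, while skills that collapse to the same final state score zero, which is the behavior the diversity term is meant to penalize.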

Map the option fields onto behavior trees, task graphs, finite-state machines, or task-and-motion planning nodes so that the start, act, stop, and verify stages remain inspectable.

Figure 26.3.B: The diagram highlights the two-stage nature of hierarchical RL: discover or define reusable behaviors, then learn a selector that composes them for a task.

Worked Implementation

Before a skill is connected to ROS 2, BehaviorTree.CPP, Drake, or a learned policy, its code should expose initiation, progress, termination, verification, and failure reporting. Code Fragment 1 covers an earlier step: assigning candidate skill labels to short trajectory summaries.

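# Label short trajectory summaries as candidate skills.
# This toy discovery pass applies fixed thresholds to displacement and contact
# evidence; it stands in for a learned clustering method.
import numpy as np

# Each row: [forward displacement, lateral displacement, contact evidence].
summaries = np.array([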
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 0.9],
])
skill_names = []
for forward, lateral, contact in summaries:
    if contact > 0.5:
        skill_names.append("manipulate")
    elif forward > 0.5:
        skill_names.append("navigate")
    else:
        skill_names.append("unknown")
print(skill_names)

['navigate', 'navigate', 'manipulate', 'manipulate']

The output is useful only because each group now carries a stable semantic label. A higher-level controller can select navigate or manipulate as reusable skills, whereas unlabeled embeddings would still be hard to schedule or verify.

Code Fragment 1: The tiny labeling rule illustrates the interface that learned discovery methods must eventually provide. A discovered behavior becomes useful only when it can be named, verified, and selected by a higher-level policy.
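Code Fragment 1 assigns labels with fixed thresholds. As a contrast, the sketch below groups the same four summaries with a plain k-means pass and only names the clusters afterward. This is one possible stand-in for a learned discovery method, not the method the section prescribes; the helper names `kmeans` and `name_cluster` and the farthest-point initialization are assumptions chosen to keep the example deterministic.

```python
import numpy as np


def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers


def name_cluster(center):
    """Attach a semantic label to a centroid (same thresholds as Code Fragment 1)."""
    forward, _lateral, contact = center
    if contact > 0.5:
        return "manipulate"
    if forward > 0.5:
        return "navigate"
    return "unknown"


# Same rows as Code Fragment 1: [forward, lateral, contact].
summaries = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 0.9],
])
labels, centers = kmeans(summaries, k=2)
cluster_names = [name_cluster(c) for c in centers]
skill_names = [cluster_names[i] for i in labels]
print(skill_names)  # ['navigate', 'navigate', 'manipulate', 'manipulate']
```

The clustering step produces only integer cluster indices; the naming step is what turns them into skills a higher-level policy can select and verify.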

Algorithm: Verified Skill Execution

1. Check whether the current state satisfies the skill initiation predicate.
2. Execute the skill policy while monitoring progress, time, force, and perception confidence.
3. Terminate when the skill succeeds, violates a safety guard, or reaches a timeout.
4. Run a verifier that checks the postcondition in sensor space and task space.
5. Return success, retry, fallback, or escalate to the high-level planner.
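The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: the function name `execute_skill`, the `Outcome` enum, the dictionary state, and the toy one-dimensional approach task are hypothetical. The mapping from each failure to an outcome (failed initiation escalates, a guard violation falls back, a timeout or failed verification retries) is one possible policy that the algorithm itself does not fix.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


def execute_skill(state, can_start, step, succeeded, guard_violated, verify,
                  max_steps=50):
    """Run one skill with initiation, monitoring, termination, and verification."""
    # Step 1: initiation predicate.
    if not can_start(state):
        return Outcome.ESCALATE, state
    # Steps 2 and 3: execute while monitoring; stop on success, guard, or timeout.
    for _ in range(max_steps):
        state = step(state)
        if guard_violated(state):
            return Outcome.FALLBACK, state
        if succeeded(state):
            break
    else:
        return Outcome.RETRY, state  # timeout: the loop ended without success
    # Step 4: verify the postcondition independently of the termination test.
    if verify(state):
        return Outcome.SUCCESS, state
    # Step 5: report the result to the caller.
    return Outcome.RETRY, state


# Toy task: close a 0.5 m gap in 0.1 m steps while watching a force guard.
can_start = lambda s: s["distance"] <= 1.0
step = lambda s: {**s, "distance": max(0.0, s["distance"] - 0.1)}
succeeded = lambda s: s["distance"] <= 0.05
guard_violated = lambda s: s["force"] > 20.0
verify = lambda s: s["distance"] <= 0.05

start = {"distance": 0.5, "force": 0.0}
outcome, final_state = execute_skill(start, can_start, step, succeeded,
                                     guard_violated, verify)
print(outcome.value)  # success
```

Returning a typed outcome instead of a boolean lets the high-level planner route each failure mode to a different branch.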

Practical Recipe

Name each skill with a verb and object: navigate_to_station, grasp_handle, dock_drone, or change_lane.
Write preconditions, postconditions, safety guards, timeout, and recovery behavior before training a policy.
Represent sequencing as a finite-state graph, behavior tree, or task-and-motion plan so failures have explicit routes.
Use language as a planner only after commands are grounded into a typed skill library with affordance checks.
Evaluate composition, not only individual success. Many failures occur when two correct skills meet at a bad boundary.
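The last recipe item, evaluating composition, can be partly automated before any robot moves. The sketch below checks whether each skill's preconditions are covered by the facts established earlier in the sequence. It is a simplified illustration: the `SkillSpec` class, the `check_sequence` helper, and the fact names are hypothetical, and facts are only ever added, so effects that invalidate earlier facts are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillSpec:
    """Symbolic pre- and postconditions for one skill (hypothetical schema)."""
    name: str
    preconditions: frozenset
    postconditions: frozenset


def check_sequence(skills, initial_facts):
    """Return (skill name, missing facts) for each skill whose preconditions are unmet."""
    facts = set(initial_facts)
    problems = []
    for skill in skills:
        missing = skill.preconditions - facts
        if missing:
            problems.append((skill.name, sorted(missing)))
        facts |= skill.postconditions  # simplification: facts are never retracted
    return problems


navigate = SkillSpec("navigate_to_station",
                     frozenset({"map_loaded"}),
                     frozenset({"base_at_station"}))
inspect = SkillSpec("inspect_handle",
                    frozenset({"base_at_station"}),
                    frozenset({"handle_pose_fresh"}))
grasp = SkillSpec("grasp_handle",
                  frozenset({"base_at_station", "handle_pose_fresh"}),
                  frozenset({"handle_grasped"}))

bad_plan = check_sequence([navigate, grasp], {"map_loaded"})
good_plan = check_sequence([navigate, inspect, grasp], {"map_loaded"})
print(bad_plan)   # [('grasp_handle', ['handle_pose_fresh'])]
print(good_plan)  # []
```

Both `navigate_to_station` and `grasp_handle` can be individually correct, yet the two-skill plan fails at the boundary because nothing refreshes the handle pose in between.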

Library Shortcut

Use BehaviorTree.CPP, ROS 2 lifecycle nodes, Drake systems, or a task-and-motion planner to handle scheduling and fallback, while keeping each skill contract explicit.

Practical Example

Decompose a household command into navigation, inspection, reachability, grasp, carry, and handoff only if each subskill exposes a verifier and a recovery route.

Skill Interface Checklist

Field	Question	Example For A Mobile Manipulator
Initiation	When may it start?	Object detected, arm clear, base within reach.
Policy	What controller runs?	Visual servoing plus impedance control.
Termination	When does it stop?	Grasp force stable for 0.5 seconds.
Verification	How is success proved?	Object pose follows gripper during lift.
Recovery	What happens after failure?	Open gripper, re-localize, retry from a safer pose.

Composition Failure

Test for hierarchy failures caused by mismatched postconditions, hidden frames, stale perception, and planners that treat probabilistic skills as deterministic.
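A short calculation shows why treating probabilistic skills as deterministic is risky. If skills succeed independently, the success rate of a chain is the product of the per-skill rates. The numbers below are hypothetical, not measurements, and the independence assumption is itself optimistic, because boundary failures tend to correlate.

```python
import math

# Hypothetical per-skill success probabilities for a six-step household task.
success_rates = {
    "navigate": 0.98,
    "inspect": 0.95,
    "grasp": 0.90,
    "carry": 0.97,
    "handoff": 0.93,
}


def with_retries(p, retries):
    """Success probability when a skill may be retried, assuming independent attempts."""
    return 1.0 - (1.0 - p) ** (retries + 1)


chain = math.prod(success_rates.values())
chain_one_retry = math.prod(with_retries(p, 1) for p in success_rates.values())

print(f"{chain:.3f}")  # 0.756
print(chain_one_retry > chain)  # True
```

Every skill in this example succeeds at least 90 percent of the time, yet the chain completes only about three times in four, which is why the verifier and recovery route of each skill matter to the planner.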

Research Frontier

Connect skill learning to vision-language-action (VLA) models and task-and-motion planning only when feasibility, verification, and recovery are represented for the specific robot body and scene.

Self Check

The test is whether you can write down the initiation set, internal policy, termination rule, verifier, and recovery route for the target robot skill.

Key Takeaway

Skill discovery and hierarchical RL are useful when they make the perception-action loop more reliable, not when they merely add a more impressive model name.

Exercise 26.3.1

Design an experiment matched to skill discovery and hierarchical RL. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

What's Next

This section grounded skill discovery and hierarchical RL in an explicit skill contract: initiation, policy, termination, verification, and recovery. The next reading step is Section 26.4, which carries the same contract forward.

References & Further Reading

Foundational Papers

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning.

This paper formalizes options as temporally extended actions with initiation, policy, and termination conditions. It is the canonical reference for the chapter's skill hierarchy vocabulary.

Bacon, P. L., Harb, J., and Precup, D. (2017). The Option-Critic Architecture.

Option-Critic learns options end to end within reinforcement learning. It helps readers compare hand-specified skills with learned temporal abstractions.

Eysenbach, B. et al. (2018). Diversity is All You Need: Learning Skills Without a Reward Function.

DIAYN studies unsupervised skill discovery by maximizing distinguishable behaviors. It is useful for understanding when skills can be learned before a downstream task is specified.

Technical Reports and Project Pages

Open X-Embodiment and RT-X Project Website.

Cross-embodiment datasets make skill reuse a practical question rather than only a theory topic. The project helps readers connect hierarchy to robot foundation models and shared behavior repertoires.

Tools and Libraries

BehaviorTree.CPP Documentation.

Behavior trees are a production-friendly way to compose skills with fallback and monitoring logic. They complement learned policies by making high-level task decomposition explicit and inspectable.
