Section 26.2: The options framework

A Careful Control Loop
Figure 26.2.A: Each option has an initiation set (where it can start), a policy (what it does), and a termination condition (when it ends). Together, options compose into a semi-MDP over skills.
Big Picture

The options framework treats action as a hierarchy rather than a flat stream of motor commands. A skill gives the planner a reusable temporal abstraction with an initiation condition, an internal policy, a termination rule, and a verification contract.

Why Hierarchy Matters

Hierarchy separates timing, contact, recovery, and sequencing, so a high-level planner can select skills without pretending that every low-level policy is deterministic.

Options are the mathematical bridge between reinforcement learning and robot skills. They let a high-level policy choose temporally extended actions while the low-level policy handles many primitive steps before returning control.

Skill Equals Promise

Treat each skill as an interface whose initiation set, internal controller, progress signal, termination rule, verifier, and recovery status are all explicit.

Formal Contract

An option $\omega$ is a triple

$$\omega = (I,\, \pi,\, \beta),$$

where $I \subseteq \mathcal{S}$ is the initiation set (the states from which $\omega$ may be selected), $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ is the intra-option policy (the primitive-action distribution the skill follows while running), and $\beta: \mathcal{S} \to [0,1]$ is the termination function (the probability of handing control back to the high-level policy in each state). The option executes for a random duration $\tau$ determined by $\beta$: at each step, it terminates with probability $\beta(s_t)$ and continues otherwise. The resulting process is a semi-MDP (SMDP) at the high level, because transitions take variable time.
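The triple can be sketched directly as a data structure. The example below is a minimal illustration, not a reference implementation: the names `Option`, `run_option`, `can_start`, and `stop_probability` are hypothetical, the state space is a toy integer line, and the policy is deterministic for brevity even though $\pi$ is a distribution in general.

```python
import random
from dataclasses import dataclass
from typing import Callable

State = int   # hypothetical: states are integer positions on a line
Action = int  # hypothetical: actions are signed unit steps


@dataclass(frozen=True)
class Option:
    """An option (I, pi, beta) over a toy integer state space."""
    can_start: Callable[[State], bool]           # membership test for I
    policy: Callable[[State], Action]            # pi, deterministic here
    stop_probability: Callable[[State], float]   # beta


def run_option(option, state, step, rng, max_steps=100):
    """Run the option until beta fires; return (final_state, duration)."""
    if not option.can_start(state):
        raise ValueError(f"state {state} is outside the initiation set")
    for duration in range(1, max_steps + 1):
        state = step(state, option.policy(state))
        # Terminate with probability beta(s) in the state just reached.
        if rng.random() < option.stop_probability(state):
            return state, duration
    # Safety cap on duration (an addition, not part of the formal triple).
    return state, max_steps


# Toy option: walk right from any state below 5 and stop for sure at 5.
walk_to_five = Option(
    can_start=lambda s: s < 5,
    policy=lambda s: +1,
    stop_probability=lambda s: 1.0 if s >= 5 else 0.0,
)

final_state, duration = run_option(
    walk_to_five, state=2, step=lambda s, a: s + a, rng=random.Random(0)
)
print(final_state, duration)  # 5 3
```

The duration returned by `run_option` plays the role of $\tau$: the caller sees one transition from state 2 to state 5 that happened to take three primitive steps.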

The SMDP Q-function over options values starting state $s$ and selected option $\omega$:

$$Q_\Omega(s,\omega)=\mathbb{E}\left[\sum_{k=0}^{\tau-1}\gamma^k r_{t+k}+\gamma^\tau V_\Omega(s_{t+\tau})\mid s_t=s,\omega_t=\omega\right].$$

Here $\tau$ is the (random) option duration, the discounted sum accumulates primitive rewards over the option's execution, and $V_\Omega(s_{t+\tau})$ is the value, under the high-level policy, of the state where the option terminates. The high-level policy selects from $\Omega$ (the set of all options), restricted to options whose initiation set contains the current state, and it does so only at option boundaries, not at every primitive time step. Use the tuple as an audit checklist when designing a skill: if any field of $(I, \pi, \beta)$ is implicit or missing, the option is likely to fail silently at task boundaries.
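One common way to estimate $Q_\Omega$ from experience is SMDP Q-learning, which applies the equation above as a sample-based update each time an option terminates. The sketch below assumes a tabular Q-function and a greedy estimate $V_\Omega(s') = \max_{\omega'} Q_\Omega(s', \omega')$; the function names, state labels, and reward values are illustrative and not taken from the section.

```python
from collections import defaultdict


def discounted_option_return(rewards, gamma):
    """Sum over k of gamma^k * r_{t+k} for one option execution."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def smdp_q_update(q, state, option, rewards, next_state, options, gamma, alpha):
    """Apply one SMDP Q-learning update after an option terminates."""
    tau = len(rewards)  # option duration in primitive steps
    # Greedy estimate of V_Omega at the termination state (an assumption).
    # A fuller version would restrict the max to options that may start there.
    next_value = max(q[(next_state, o)] for o in options)
    target = discounted_option_return(rewards, gamma) + (gamma ** tau) * next_value
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


q = defaultdict(float)
q[("goal", "stay")] = 10.0
options = ["go", "stay"]

# The option "go" ran for 3 primitive steps with reward -1 each.
new_q = smdp_q_update(
    q, "start", "go", rewards=[-1.0, -1.0, -1.0],
    next_state="goal", options=options, gamma=0.5, alpha=1.0,
)
print(new_q)  # -0.5
```

With $\gamma = 0.5$ the in-option return is $-1 - 0.5 - 0.25 = -1.75$ and the bootstrap term is $0.5^3 \times 10 = 1.25$, so the target is $-0.5$. The exponent on $\gamma$ is the option duration, which is what distinguishes this update from one-step Q-learning.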

[Figure: Hierarchical robot policy from mission goal to task graph to verified skills. Labels: Mission goal; Task graph (ordering and fallback); Navigate; Manipulate; Recover; Verifier.]
Figure 26.2.B: The diagram maps the option equation onto a robot execution stack: a high-level choice expands into several low-level actions before the verifier returns control.

Worked Implementation

Before a skill is connected to ROS 2, BehaviorTree.CPP, Drake, or a learned policy, it should expose initiation, progress, termination, verification, and failure reporting. Code Fragment 1 illustrates the termination and reporting parts with a toy one-dimensional option.

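# Simulate an option that moves through several primitive control ticks.
# The high-level controller sees one option outcome, not every inner action.
def execute_option(start_position, goal_position, max_steps=5):
    position = start_position
    trace = []  # positions visited by the inner loop, kept for inspection only
    for _ in range(max_steps):
        # Clamp the step in both directions so a goal behind the start
        # is also approached at no more than 0.4 per tick.
        delta = max(-0.4, min(0.4, goal_position - position))
        position += delta
        trace.append(round(position, 2))
        # Termination condition: within tolerance of the goal.
        if abs(goal_position - position) < 0.05:
            return "terminated", trace
    # Step budget exhausted before the goal was reached.
    return "timeout", trace

status, trace = execute_option(0.0, 1.0)
print(status, trace)
# Output: terminated [0.4, 0.8, 1.0]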

The expected output trace shows exactly why the option is useful to a high-level planner: three internal control updates are compressed into one symbolic result. The final position reaches the goal within tolerance, so the option terminates cleanly instead of exposing each intermediate motor step.

Code Fragment 1: The function compresses three primitive moves into one option outcome. This is the temporal abstraction that lets a planner reason about reaching a waypoint instead of choosing every control tick.
Algorithm: Verified Skill Execution
  1. Check whether the current state satisfies the skill initiation predicate.
  2. Execute the skill policy while monitoring progress, time, force, and perception confidence.
  3. Terminate when the skill succeeds, violates a safety guard, or reaches a timeout.
  4. Run a verifier that checks the postcondition in sensor space and task space.
  5. Return success, retry, fallback, or escalate to the high-level planner.
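
The five steps above can be sketched as a single wrapper function. This is a minimal illustration under stated assumptions: `Outcome`, `run_verified_skill`, and the callback names are hypothetical, the mapping from each failure to retry, fallback, or escalate is one possible policy rather than a prescribed one, and the toy skill closes a gripper whose opening is tracked in whole millimetres.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


def run_verified_skill(state, can_start, step, is_done, is_safe, verify,
                       max_steps=10, retries_left=1):
    """Follow the five steps for one skill invocation; return (outcome, state)."""
    # Step 1: initiation predicate.
    if not can_start(state):
        return Outcome.ESCALATE, state
    # Steps 2 and 3: run the policy under a safety guard and a step budget.
    for _ in range(max_steps):
        state = step(state)
        if not is_safe(state):
            return Outcome.FALLBACK, state
        if is_done(state):
            break
    else:
        # The loop ran out of steps without terminating: a timeout.
        return (Outcome.RETRY if retries_left > 0 else Outcome.ESCALATE), state
    # Step 4: independent check of the postcondition.
    if verify(state):
        return Outcome.SUCCESS, state
    # Step 5: report a failed verification to the caller.
    return (Outcome.RETRY if retries_left > 0 else Outcome.ESCALATE), state


# Toy skill: close a gripper from 50 mm in 10 mm steps until it is at most 20 mm.
outcome, opening_mm = run_verified_skill(
    state=50,
    can_start=lambda s: s > 20,
    step=lambda s: s - 10,
    is_done=lambda s: s <= 20,
    is_safe=lambda s: s >= 0,
    verify=lambda s: 10 <= s <= 20,
)
print(outcome.value, opening_mm)  # success 20
```

Keeping `is_done` and `verify` as separate callbacks mirrors the distinction between steps 3 and 4: the skill can believe it has finished while the verifier still rejects the result.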

Practical Recipe

  1. Name each skill with a verb and object: navigate_to_station, grasp_handle, dock_drone, or change_lane.
  2. Write preconditions, postconditions, safety guards, timeout, and recovery behavior before training a policy.
  3. Represent sequencing as a finite-state graph, behavior tree, or task-and-motion plan so failures have explicit routes.
  4. Use language as a planner only after commands are grounded into a typed skill library with affordance checks.
  5. Evaluate composition, not only individual success. Many failures occur when two correct skills meet at a bad boundary.
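
Items 2 and 5 of the recipe can be checked mechanically before any policy is trained. The sketch below assumes that preconditions and postconditions are sets of symbolic facts and that facts only accumulate along a plan; `SkillSpec`, `boundary_gaps`, and the fact names are hypothetical, and a real system would need typed predicates and delete effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillSpec:
    name: str
    preconditions: frozenset
    postconditions: frozenset
    timeout_s: float
    recovery: str


def boundary_gaps(plan, initial_facts):
    """Return (skill_name, missing_facts) for every unmet precondition set."""
    facts = set(initial_facts)
    gaps = []
    for skill in plan:
        missing = skill.preconditions - facts
        if missing:
            gaps.append((skill.name, sorted(missing)))
        # Simplification: facts only accumulate; nothing is ever deleted.
        facts |= skill.postconditions
    return gaps


navigate = SkillSpec(
    name="navigate_to_station",
    preconditions=frozenset({"map_localized"}),
    postconditions=frozenset({"base_at_station"}),
    timeout_s=60.0,
    recovery="re-localize and replan",
)
grasp = SkillSpec(
    name="grasp_handle",
    preconditions=frozenset({"base_at_station", "handle_detected"}),
    postconditions=frozenset({"handle_grasped"}),
    timeout_s=15.0,
    recovery="open gripper and retry",
)

print(boundary_gaps([navigate, grasp], initial_facts={"map_localized"}))
# [('grasp_handle', ['handle_detected'])]
```

Both skills are individually well specified here, yet the plan has a gap: nothing establishes `handle_detected` before the grasp starts. That is the kind of boundary failure item 5 asks you to evaluate.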
Library Shortcut

Use BehaviorTree.CPP, ROS 2 lifecycle nodes, Drake systems, or a task-and-motion planner to handle scheduling and fallback while preserving explicit skill contracts.

Practical Example

Decompose a household command into navigation, inspection, reachability, grasp, carry, and handoff subskills only if each one exposes a verifier and a recovery route.

Skill Interface Checklist
| Field | Question | Example for a mobile manipulator |
| --- | --- | --- |
| Initiation | When may it start? | Object detected, arm clear, base within reach. |
| Policy | What controller runs? | Visual servoing plus impedance control. |
| Termination | When does it stop? | Grasp force stable for 0.5 seconds. |
| Verification | How is success proved? | Object pose follows gripper during lift. |
| Recovery | What happens after failure? | Open gripper, re-localize, retry from a safer pose. |
Composition Failure

Test for hierarchy failures caused by mismatched postconditions, hidden frames, stale perception, and planners that treat probabilistic skills as deterministic.
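
A short calculation shows why treating probabilistic skills as deterministic is costly. The sketch below assumes that skill outcomes are independent and that every skill must succeed for the task to succeed; the 95% figure is a hypothetical per-skill success rate, not a measured one.

```python
import math


def chain_success_probability(skill_success_rates):
    """Probability that every skill in a sequence succeeds, assuming independence."""
    return math.prod(skill_success_rates)


# Six chained skills, each hypothetically 95% reliable in isolation.
rates = [0.95] * 6
print(f"{chain_success_probability(rates):.3f}")  # 0.735
```

A planner that assumes each skill is deterministic would predict certain success, while under these assumptions roughly one run in four fails somewhere in the chain. Explicit verification and recovery routes are what let the planner notice and respond to those failures.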

Research Frontier

Connect skill learning to vision-language-action (VLA) models and task-and-motion planning only when feasibility, verification, and recovery are represented for the specific robot body and scene.

Self Check

The test is whether you can write down the initiation set, internal policy, termination rule, verifier, and recovery route for the target robot skill.

Key Takeaway

The options framework is useful when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 26.2.1

Design an experiment that tests the options framework on a robot task of your choice. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

What's Next

This section grounded the options framework in an explicit skill contract: initiation set, intra-option policy, termination condition, verifier, and recovery route. Section 26.3 carries the same contract forward.

References & Further Reading
Foundational Papers

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning.

This paper formalizes options as temporally extended actions with initiation, policy, and termination conditions. It is the canonical reference for the chapter's skill hierarchy vocabulary.

Bacon, P. L., Harb, J., and Precup, D. (2017). The Option-Critic Architecture.

Option-Critic learns options end to end within reinforcement learning. It helps readers compare hand-specified skills with learned temporal abstractions.

Eysenbach, B. et al. (2018). Diversity is All You Need: Learning Skills Without a Reward Function.

DIAYN studies unsupervised skill discovery by maximizing distinguishable behaviors. It is useful for understanding when skills can be learned before a downstream task is specified.

Technical Reports and Project Pages

Open X-Embodiment and RT-X Project Website.

Cross-embodiment datasets make skill reuse a practical question rather than only a theory topic. The project helps readers connect hierarchy to robot foundation models and shared behavior repertoires.

Tools and Libraries

BehaviorTree.CPP Documentation.

Behavior trees are a production-friendly way to compose skills with fallback and monitoring logic. They complement learned policies by making high-level task decomposition explicit and inspectable.
