Section 33.5: ReKep: relational keypoint constraints | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 33.5A: ReKep relational keypoint constraints. An LLM specifies constraints between pairs of keypoints (for example, gripper tip within 2 cm of bottle cap), a 6-DOF optimization satisfies the constraint set, and the resulting pose is tracked by a low-level controller.

Read the figure as a relational-constraint contract. ReKep-style keypoints help when relations such as near, above, aligned, and graspable are grounded to tracked 3D points that survive motion and occlusion.

Figure 33.5: A closed-loop map for ReKep: relational keypoint constraints. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Depth and self-containment. Readers should understand why keypoint relations can express tasks more compactly than dense maps or free-text constraints, and what assumptions that representation makes about perception quality.

Production and evaluation contract. The artifact should include the keypoints, the relational constraints, the cost value, and the executed trajectory so one can see whether failure came from perception or optimization.

Checklist Memory Anchor

For ReKep: relational keypoint constraints, name the language interface, grounded world state, executable action contract, and evidence artifact before trusting any claimed improvement.

Mini Audit Exercise

For ReKep: relational keypoint constraints, write one evidence row recording instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first under command misunderstanding.

Big Picture

ReKep uses relational keypoints to express manipulation goals in a form that optimizers can understand directly. It is a middle ground between symbolic steps and dense spatial maps.

This section explains how language-guided agents can turn a command into a small set of geometric constraints over keypoints and then solve the resulting optimization problem.

The practical question is when relational keypoints are expressive enough to encode the task and when they become too brittle under clutter or contact.

Action Is The Test

Keypoint constraints are powerful because they capture geometry with far less state than a full scene map, but they rely on the keypoints being the right abstraction.

Theory

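Let keypoints be $k_1, \ldots, k_n$ and let a task be encoded by costs over relations among them. A trajectory optimizer can solve $$\tau^* = \arg\min_\tau \sum_j w_j c_j\bigl(k_{a_j}(\tau), k_{b_j}(\tau)\bigr),$$ where $\tau$ is the candidate trajectory, $k_i(\tau)$ is the position of keypoint $i$ along it, $a_j$ and $b_j$ index the two keypoints that relation $j$ couples, $w_j \ge 0$ is the weight of that relation, and each $c_j$ measures a relation such as distance, alignment, or ordering implied by the language command.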

This representation is attractive because it compresses a task into a few geometric relations that classical optimizers handle well. It is risky because errors in keypoint detection or object identity propagate directly into the objective, which can yield confident but wrong trajectories.
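To make the weighted sum concrete, the sketch below collapses the trajectory to a single 3D gripper waypoint and minimizes two relational costs with plain gradient descent. It is a minimal illustration under stated assumptions, not the ReKep solver: the cost functions `distance_cost` and `clearance_cost`, the weights, the coordinates (assumed to be in metres), and the step size are all invented for the example.

```python
# Sketch of tau* = argmin sum_j w_j c_j(k_a, k_b) with one waypoint as "trajectory".
# All names, weights, and coordinates are illustrative assumptions.

def distance_cost(a, b):
    """Squared Euclidean distance between two keypoints."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def clearance_cost(a, b, min_gap=0.05):
    """Penalize keypoint a for sitting less than min_gap above keypoint b (z is index 2)."""
    return max(0.0, min_gap - (a[2] - b[2])) ** 2

# Fixed scene keypoints, assumed to be already detected.
handle = (0.47, 0.21, 0.10)
table = (0.47, 0.21, 0.00)

# Each term is (w_j, c_j, k_b); the first keypoint k_a is always the gripper here.
terms = [
    (1.0, distance_cost, handle),
    (5.0, clearance_cost, table),
]

def objective(gripper):
    """Weighted sum of relational costs for one gripper position."""
    return sum(w * c(gripper, kb) for w, c, kb in terms)

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient, to avoid hand-derived derivatives."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

gripper = [0.42, 0.18, 0.02]
start_cost = objective(gripper)
for _ in range(200):
    g = numeric_grad(objective, gripper)
    gripper = [x - 0.1 * gi for x, gi in zip(gripper, g)]
final_cost = objective(gripper)

print({"start_cost": round(start_cost, 4), "final_cost": round(final_cost, 6)})
```

Because the two relations are compatible here, the cost falls to essentially zero. If the handle keypoint were mis-detected, the same loop would converge just as confidently to the wrong place, which is the risk described above.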

Mechanism

A good mental model is language to geometric predicates. The LLM or VLM identifies which relations matter, the vision system instantiates the keypoints, and the optimizer pushes the robot toward states that satisfy those relations.
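One way to picture this hand-off is a small registry that maps relation names to cost functions. In the sketch below, `spec` stands in for what a language model might emit and `keypoints` for what a vision system might return; both structures, the predicate names, and the 2 cm tolerance (borrowed from the figure caption) are assumptions for illustration, not the interface of the ReKep codebase.

```python
import math

def near(a, b, tol=0.02):
    """Zero when a is within tol metres of b, growing linearly beyond that."""
    return max(0.0, math.dist(a, b) - tol)

def above(a, b, min_gap=0.0):
    """Zero when a is at least min_gap higher than b along z (index 2)."""
    return max(0.0, min_gap - (a[2] - b[2]))

# Registry from relation name to geometric cost (hypothetical).
PREDICATES = {"near": near, "above": above}

# Hypothetical language-model output: which relations matter, over which keypoints.
spec = [
    {"relation": "near", "a": "gripper_tip", "b": "bottle_cap"},
    {"relation": "above", "a": "gripper_tip", "b": "bottle_cap"},
]

# Hypothetical vision-system output: named 3D keypoints in a shared frame.
keypoints = {
    "gripper_tip": (0.30, 0.10, 0.25),
    "bottle_cap": (0.30, 0.10, 0.24),
}

def evaluate(spec, keypoints):
    """Instantiate each predicate on the detected keypoints and return per-relation costs."""
    results = {}
    for item in spec:
        fn = PREDICATES[item["relation"]]
        label = f'{item["relation"]}({item["a"]}, {item["b"]})'
        results[label] = fn(keypoints[item["a"]], keypoints[item["b"]])
    return results

costs = evaluate(spec, keypoints)
print(costs)
```

Keeping the relation names separate from the cost functions means a wrong relation choice and a wrong keypoint can be inspected independently.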

Worked Example

Code Fragment 1 evaluates a tiny relational cost between two keypoints. The specific numbers are simple, but they show how the language-derived objective becomes a concrete quantity an optimizer can minimize.

# Compute a simple relational keypoint cost for a grasp target.
# The task prefers the gripper keypoint to align closely with the mug handle.
# Small geometric costs are what the optimizer ultimately tries to drive down.
gripper = (0.42, 0.18)
handle = (0.47, 0.21)

cost = round(abs(gripper[0] - handle[0]) + abs(gripper[1] - handle[1]), 3)
print({"l1_alignment_cost": cost})

{'l1_alignment_cost': 0.08}

The expected output is a small geometric cost that operationalizes the verbal relation "align the gripper with the handle." A low value means the gripper keypoint is already close to satisfying the relation. A high value means the optimizer still has distance to close or, if it stays high after optimization, that the keypoints or the relation itself are not yet actionable.

Code Fragment 1: This cost turns a language-grounded relation, align the gripper with the handle, into a concrete optimization target. Once the relation is instantiated geometrically, a standard optimizer can work with it without reading the original sentence again.

Library Shortcut

Keypoint detectors, VLM-grounded correspondences, and optimization libraries can assemble this pipeline in a few lines. The shortcut removes most of the tensor and geometry plumbing, but it cannot remove the need to decide which relations matter for the task.

Practical Recipe

Choose keypoints that correspond to task-relevant geometry such as handles, rims, hinges, or contact patches.
Translate the command into a small set of relational costs rather than a bag of verbal hints.
Estimate keypoint confidence and reject tasks whose geometry is too uncertain for safe optimization.
Use a planner or optimizer that can expose the final cost breakdown for debugging.
Compare keypoint-based and map-based formulations on the same task to see which abstraction is more stable.
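The confidence gate and the cost breakdown from the recipe can be sketched together. The detection format, the 0.6 confidence threshold, and the helper names `uncertain_keypoints` and `cost_breakdown` are assumptions chosen for illustration; a real system would calibrate the threshold against its own detector.

```python
import math

# Hypothetical detector output: position plus a confidence score per keypoint.
detections = {
    "gripper_tip": {"xyz": (0.42, 0.18, 0.10), "confidence": 0.98},
    "mug_handle": {"xyz": (0.47, 0.21, 0.10), "confidence": 0.35},
}

def uncertain_keypoints(detections, min_confidence=0.6):
    """Names of keypoints whose confidence falls below the (assumed) threshold."""
    return sorted(n for n, d in detections.items() if d["confidence"] < min_confidence)

def cost_breakdown(detections, terms):
    """Per-term weighted distance costs plus their total, for debugging."""
    rows = {}
    for name, weight, a, b in terms:
        rows[name] = weight * math.dist(detections[a]["xyz"], detections[b]["xyz"])
    rows["total"] = sum(rows.values())
    return rows

# Each term: (label, weight, first keypoint, second keypoint).
terms = [("gripper_near_handle", 1.0, "gripper_tip", "mug_handle")]

rejected = uncertain_keypoints(detections)
if rejected:
    # Refuse to optimize over geometry the detector is unsure about.
    decision = {"status": "rejected", "uncertain_keypoints": rejected}
else:
    decision = {"status": "ok", "costs": cost_breakdown(detections, terms)}
print(decision)
```

Here the handle detection is too uncertain, so the task is rejected before any optimization runs. Raising its confidence above the threshold would instead return the per-term costs.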

Common Failure Mode

Relational keypoints can look elegant in sparse scenes and fragile in clutter. If the wrong point is chosen or a keypoint disappears under occlusion, the optimizer may happily satisfy the wrong relation.

Practical Example

For 'open the drawer by the handle,' a keypoint formulation can attach one point to the drawer handle and another to the gripper target, then optimize the relative pose. This is often far lighter than maintaining a dense 3D objective over the entire scene.
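A rough sketch of the drawer case, assuming the task is split into a grasp relation and a pull relation. The two-stage split, the pull axis, the 0.20 m target travel, and every coordinate are illustrative assumptions rather than details from the ReKep paper.

```python
import math

def grasp_cost(gripper, handle):
    """Distance between the gripper keypoint and the handle keypoint."""
    return math.dist(gripper, handle)

def pull_cost(handle, handle_start, pull_axis, target_travel):
    """Gap between how far the handle moved along the pull axis and the desired travel."""
    travel = sum((h - s) * a for h, s, a in zip(handle, handle_start, pull_axis))
    return abs(travel - target_travel)

handle_start = (0.60, 0.00, 0.40)   # handle keypoint before the pull (assumed)
pull_axis = (-1.0, 0.0, 0.0)        # unit vector pointing out of the cabinet (assumed)
target_travel = 0.20                # desired opening distance in metres (assumed)

gripper_at_grasp = (0.60, 0.00, 0.40)
handle_after_pull = (0.45, 0.00, 0.40)

stage_costs = {
    "grasp": grasp_cost(gripper_at_grasp, handle_start),
    "pull": round(pull_cost(handle_after_pull, handle_start, pull_axis, target_travel), 3),
}
print(stage_costs)
```

Only two keypoints and one axis are tracked. The remaining pull cost shows the drawer is not yet fully open, without any dense scene model.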

Memory Hook

Keypoints are the minimalist's answer to scene understanding: why carry the whole kitchen in memory if three strategically chosen points already tell you where the handle is?

Research Frontier

Recent relational-manipulation work explores stronger keypoint discovery, temporally stable correspondences, and hybrid systems that switch between keypoints and denser scene maps when clutter or contact demands it.

Self Check

Can you explain which keypoints in your task are semantically meaningful and which ones are merely easy for a detector to find but irrelevant for control?

ReKep is appealing because it lets language specify relations rather than every detail of a trajectory. That makes it a strong bridge between semantic tasking and numeric optimization, especially for manipulation tasks with a few dominant geometric constraints.

The limit is representational mismatch. Some tasks really are low dimensional in terms of keypoints. Others depend on extended surfaces, fluids, or occluded contacts. A strong engineer knows when the compact representation is a help and when it is a trap.

Tool Choices For Relational Keypoint Planning

Tool or Library	Role in the Topic	Builder Advice
ReKep paper and code	Reference implementation of keypoint-constrained planning.	Use it when you want a concrete manipulation example built around relational costs.
Keypoint or correspondence detector	Instantiates task-relevant geometric anchors.	Use a temporally stable detector when the task spans several viewpoints.
Optimization library or MPC	Consumes the relational cost.	Use it when the keypoint objective should become a physically feasible trajectory.
MoveIt 2	Motion-planning shell around relational goals.	Use it when keypoint constraints need collision-aware trajectory generation.
Open3D	Coordinate transforms and geometry utilities.	Use it when keypoints must be reconciled across frames or sensors.

Code Fragment 2 saves the keypoint relation and its cost value as part of the experiment record. This is the right level of detail for deciding whether failure came from the vision front end or the optimizer.

Log keypoint identities, coordinates, confidence, and frame.
Store the relational costs that define the objective, not only the final trajectory.
Record whether the keypoints were visible, predicted, or carried from memory.
Compare optimizer output against the same keypoints under repeated seeds or perturbations.
Fall back to clarification or a denser representation when keypoint confidence collapses.
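A minimal sketch of such a record follows. The coordinates and the relation name `align_and_approach` come from this section; the field names, confidence values, frame label, visibility tags, and optimizer name are placeholders assumed for illustration.

```python
import json

# Keypoints reuse the coordinates from Code Fragment 1; the other fields are assumed.
keypoints = [
    {"name": "gripper", "xy": [0.42, 0.18], "confidence": 0.97,
     "frame": "base", "source": "visible"},
    {"name": "handle", "xy": [0.47, 0.21], "confidence": 0.88,
     "frame": "base", "source": "visible"},
]

# Same L1 alignment cost as Code Fragment 1, computed from the logged keypoints.
(gx, gy), (hx, hy) = keypoints[0]["xy"], keypoints[1]["xy"]
l1_cost = round(abs(gx - hx) + abs(gy - hy), 3)

record = {
    "keypoints": keypoints,
    "relation": "align_and_approach",
    "cost": {"l1_alignment_cost": l1_cost},
    "optimizer": "trajectory_optimizer",  # placeholder name, not a real library
}

print(json.dumps(record, indent=2))
```

Computing the cost from the logged keypoints, rather than copying it in by hand, keeps the record internally consistent when it is replayed later.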

The expected output is a compact interface record: named keypoints, an explicit relational contract, the induced cost, and the downstream optimizer. This is the right evidence shape because it reveals whether a failure came from unstable keypoints, a poor relation choice, or a trajectory optimizer that could not satisfy a valid constraint.

Code Fragment 2: This record makes the geometric abstraction explicit: the optimizer is not trying to satisfy the whole language command directly, but the relation `align_and_approach` over a small keypoint set. That clarity is essential for debugging the perception-planning interface.

If ReKep-style planning fails, inspect keypoint quality first, then relation design, then trajectory optimization. It is easy to blame the optimizer for what is actually a mis-specified or unstable geometric abstraction.
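That inspection order can be written as a small triage function. The thresholds, the failure labels, and the check of the relation cost at a pose known to succeed are assumptions made for this sketch, not part of ReKep itself.

```python
def triage(min_keypoint_confidence, cost_at_known_good_pose, final_cost,
           conf_threshold=0.6, cost_tolerance=0.01):
    """Label a failed run by checking perception, then relation design, then optimization."""
    if min_keypoint_confidence < conf_threshold:
        # The geometric anchors themselves are unreliable.
        return "perception"
    if cost_at_known_good_pose > cost_tolerance:
        # Even a pose known to succeed scores badly, so the relation is mis-specified.
        return "relation_design"
    if final_cost > cost_tolerance:
        # Keypoints and relation look sound, but the optimizer did not satisfy them.
        return "optimization"
    return "no_failure_detected"

labels = [
    triage(0.35, 0.00, 0.30),
    triage(0.95, 0.12, 0.12),
    triage(0.95, 0.00, 0.30),
    triage(0.95, 0.00, 0.00),
]
print(labels)
```

The early returns encode the ordering: an optimizer is only blamed once the keypoints and the relation have both passed their checks.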

Key Takeaway

Relational keypoints are a strong language-to-optimization interface when the task geometry is low dimensional and the keypoints are semantically meaningful.

Exercise 33.5.1

Choose a manipulation task and propose three keypoints plus two relational costs that express it. Then describe one scene variation where this abstraction would likely break and need a denser representation.

Bibliography and Further Reading

Primary Sources and Tools

Huang et al. (2024). "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation." arXiv.

This is the primary ReKep reference for expressing manipulation tasks through relational keypoint constraints.


MoveIt 2 Documentation.

MoveIt is relevant when relational constraints must become collision-aware robot trajectories.


Open3D Documentation and Repository.

Open3D is a practical geometry toolkit for point and keypoint manipulation across frames.
