A Careful Control Loop
Read the figure as a relational-constraint contract. ReKep-style keypoints help when relations such as near, above, aligned, and graspable are grounded to tracked 3D points that survive motion and occlusion.
Build And Evaluation Checklist
Depth and self-containment. Readers should understand why keypoint relations can express tasks more compactly than dense maps or free-text constraints, and what assumptions that representation makes about perception quality.
Production and evaluation contract. The artifact should include the keypoints, the relational constraints, the cost value, and the executed trajectory so one can see whether failure came from perception or optimization.
For ReKep-style relational keypoint constraints, name the language interface, the grounded world state, the executable action contract, and the evidence artifact before trusting any claimed improvement.
For the same pipeline, write one evidence row recording the instruction, the world-state estimate, the chosen action, the verifier result, and the failure label. Then identify which field would change first if the command were misunderstood.
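One way to make the evidence row concrete is a small typed record. The sketch below is only illustrative: the class name `EvidenceRow`, the field names, and the example values are assumptions chosen to mirror the five items listed above, not a format defined by ReKep.

```python
from dataclasses import dataclass, asdict

@dataclass
class EvidenceRow:
    """One row of evidence for a single language-conditioned attempt (hypothetical schema)."""
    instruction: str            # the command as given to the system
    world_state_estimate: dict  # keypoint name -> estimated coordinates
    chosen_action: str          # the action the planner committed to
    verifier_result: bool       # did an independent check accept the outcome?
    failure_label: str          # "none" or a short failure category

# Illustrative values only; coordinates reuse the mug example from this section.
row = EvidenceRow(
    instruction="align the gripper with the mug handle",
    world_state_estimate={"gripper": (0.42, 0.18), "handle": (0.47, 0.21)},
    chosen_action="move_gripper_to(handle)",
    verifier_result=True,
    failure_label="none",
)
print(asdict(row))
```

Keeping the row flat makes it easy to compare two runs field by field when an instruction is reworded.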
ReKep uses relational keypoints to express manipulation goals in a form that optimizers can understand directly. It is a middle ground between symbolic steps and dense spatial maps.
This section explains how language-guided agents can turn a command into a small set of geometric constraints over keypoints and then solve the resulting optimization problem.
The practical question is when relational keypoints are expressive enough to encode the task and when they become too brittle under clutter or contact.
Keypoint constraints are powerful because they capture geometry with far less state than a full scene map, but they rely on the keypoints being the right abstraction.
Theory
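Let keypoints be $k_1, \ldots, k_n$ and let a task be encoded by costs over relations among them. A trajectory optimizer can solve $$\tau^* = \arg\min_\tau \sum_j w_j c_j\bigl(k_{a_j}(\tau), k_{b_j}(\tau)\bigr),$$ where $w_j$ weights the $j$-th relation, $a_j$ and $b_j$ index the two keypoints that relation involves, $k_i(\tau)$ is the position of keypoint $i$ under trajectory $\tau$, and each $c_j$ measures a relation such as distance, alignment, or ordering implied by the language command.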
This representation is attractive because it compresses a task into a few geometric relations that classical optimizers handle well. It is risky because errors in keypoint detection or object identity propagate directly into the objective, which can yield confident but wrong trajectories.
A good mental model is language to geometric predicates. The LLM or VLM identifies which relations matter, the vision system instantiates the keypoints, and the optimizer pushes the robot toward states that satisfy those relations.
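The weighted sum in the objective can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the keypoint names, coordinates, weights, and the two cost functions `near` and `above` are hypothetical, and a finite set of candidate gripper positions stands in for the trajectory search that a real optimizer would perform.

```python
import math

# Hypothetical keypoints in a shared world frame (metres); names are illustrative.
scene = {
    "handle": (0.47, 0.21, 0.12),
    "table": (0.47, 0.21, 0.00),
}

def near(a, b):
    """Cost for 'a near b': Euclidean distance between the two keypoints."""
    return math.dist(a, b)

def above(a, b, margin=0.05):
    """Cost for 'a above b': zero once a is at least `margin` higher than b."""
    return max(0.0, (b[2] + margin) - a[2])

# Each relation is (w_j, c_j, a_j, b_j), mirroring the terms of the sum.
relations = [
    (1.0, near, "gripper", "handle"),
    (0.5, above, "gripper", "table"),
]

def objective(gripper_xyz):
    """Weighted sum of relational costs for one candidate gripper position."""
    kps = {**scene, "gripper": gripper_xyz}
    return sum(w * c(kps[a], kps[b]) for w, c, a, b in relations)

# A finite candidate set stands in for the trajectory search space.
candidates = {
    "start": (0.30, 0.10, 0.25),
    "at_handle": (0.47, 0.21, 0.12),
    "below_margin": (0.47, 0.21, 0.02),
}
best_name = min(candidates, key=lambda name: objective(candidates[name]))
print(best_name, round(objective(candidates[best_name]), 3))
```

With these numbers the search selects `at_handle`, the only candidate that satisfies both relations at once. Changing the weights or the margin changes which relation dominates, which is the kind of design decision the language-to-predicate step has to get right.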
Worked Example
Code Fragment 1 evaluates a tiny relational cost between two keypoints. The specific numbers are simple, but they show how the language-derived objective becomes a concrete quantity an optimizer can minimize.
# Compute a simple relational keypoint cost for a grasp target.
# The task prefers the gripper keypoint to align closely with the mug handle.
# Small geometric costs are what the optimizer ultimately tries to drive down.
gripper = (0.42, 0.18)
handle = (0.47, 0.21)
cost = round(abs(gripper[0] - handle[0]) + abs(gripper[1] - handle[1]), 3)
print({"l1_alignment_cost": cost})
The expected output is `{'l1_alignment_cost': 0.08}`, a small geometric cost that operationalizes the verbal relation "align the gripper with the handle." A low value means the gripper keypoint is already close to satisfying the relation, while a high value means the optimizer still has work to do. Neither value says whether the keypoints were detected correctly or whether the relation was the right one to encode, so those checks have to happen separately.
Off-the-shelf keypoint detectors, VLM-grounded correspondences, and optimization libraries can produce the same pipeline in a few lines. That shortcut removes most of the tensor and geometry plumbing, but it cannot remove the need to decide which relations matter for the task.
Practical Recipe
- Choose keypoints that correspond to task-relevant geometry such as handles, rims, hinges, or contact patches.
- Translate the command into a small set of relational costs rather than a bag of verbal hints.
- Estimate keypoint confidence and reject tasks whose geometry is too uncertain for safe optimization.
- Use a planner or optimizer that can expose the final cost breakdown for debugging.
- Compare keypoint-based and map-based formulations on the same task to see which abstraction is more stable.
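The confidence check and the cost breakdown from the recipe above can be combined in one small function. This is a sketch under assumptions: the detection dictionary layout, the `MIN_CONF` threshold of 0.6, the relation labels, and the function name `plan_or_reject` are all hypothetical, and an L1 distance stands in for whatever relational costs the task actually needs.

```python
# Hypothetical detections: a position plus a detector confidence in [0, 1].
detections = {
    "gripper": {"xyz": (0.42, 0.18, 0.30), "conf": 0.97},
    "handle": {"xyz": (0.47, 0.21, 0.28), "conf": 0.91},
    "rim": {"xyz": (0.45, 0.25, 0.35), "conf": 0.38},
}
MIN_CONF = 0.6  # assumed threshold; tune per detector and task

def l1(a, b):
    """L1 distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Each relation is (label, keypoint a, keypoint b, weight).
relations = [
    ("gripper_near_handle", "gripper", "handle", 1.0),
    ("gripper_near_rim", "gripper", "rim", 0.2),
]

def plan_or_reject(dets, rels, min_conf=MIN_CONF):
    """Reject when any required keypoint is uncertain, else return per-relation costs."""
    needed = {name for _, a, b, _ in rels for name in (a, b)}
    uncertain = sorted(n for n in needed if dets[n]["conf"] < min_conf)
    if uncertain:
        return {"status": "rejected", "uncertain_keypoints": uncertain}
    breakdown = {
        label: w * l1(dets[a]["xyz"], dets[b]["xyz"]) for label, a, b, w in rels
    }
    return {"status": "ok", "cost_breakdown": breakdown, "total": sum(breakdown.values())}

print(plan_or_reject(detections, relations))      # rejected: "rim" is below threshold
print(plan_or_reject(detections, relations[:1]))  # ok: only confident keypoints are needed
```

Returning the per-relation breakdown rather than a single scalar is what makes it possible to see which relation the optimizer is failing to satisfy.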
Relational keypoints can look elegant in sparse scenes and fragile in clutter. If the wrong point is chosen or a keypoint disappears under occlusion, the optimizer may happily satisfy the wrong relation.
For 'open the drawer by the handle,' a keypoint formulation can attach one point to the drawer handle and another to the gripper target, then optimize the relative pose. This is often far lighter than maintaining a dense 3D objective over the entire scene.
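One possible encoding of the drawer example uses two stage-wise costs over the same pair of keypoints. The handle position, the pull axis, the pull distance, and the function names below are assumptions made for illustration; a real system would estimate them from perception and the drawer's kinematics.

```python
import math

handle = (0.60, 0.00, 0.40)   # hypothetical handle keypoint (metres)
pull_axis = (-1.0, 0.0, 0.0)  # assumed drawer opening direction (unit vector)
pull_distance = 0.15          # assumed opening distance (metres)

def grasp_cost(gripper):
    """Stage 1: the gripper keypoint should coincide with the handle keypoint."""
    return math.dist(gripper, handle)

def pull_cost(gripper):
    """Stage 2: the gripper should reach the handle position after the pull."""
    target = tuple(h + pull_distance * a for h, a in zip(handle, pull_axis))
    return math.dist(gripper, target)

# Evaluate both stages at the handle and at the intended end of the pull.
for name, gripper in {"at_handle": handle, "after_pull": (0.45, 0.00, 0.40)}.items():
    print(name, round(grasp_cost(gripper), 3), round(pull_cost(gripper), 3))
```

The whole objective depends on two 3D points and one direction, which is the sense in which the keypoint formulation is lighter than a dense scene objective.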
Keypoints are the minimalist's answer to scene understanding: why carry the whole kitchen in memory if three strategically chosen points already tell you where the handle is?
Recent relational-manipulation work explores stronger keypoint discovery, temporally stable correspondences, and hybrid systems that switch between keypoints and denser scene maps when clutter or contact demands it.
Can you explain which keypoints in your task are semantically meaningful and which ones are merely easy for a detector to find but irrelevant for control?
ReKep is appealing because it lets language specify relations rather than every detail of a trajectory. That makes it a strong bridge between semantic tasking and numeric optimization, especially for manipulation tasks with a few dominant geometric constraints.
The limit is representational mismatch. Some tasks really are low dimensional in terms of keypoints. Others depend on extended surfaces, fluids, or occluded contacts. A strong engineer knows when the compact representation is a help and when it is a trap.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| ReKep paper and code | Reference implementation of keypoint-constrained planning. | Use it when you want a concrete manipulation example built around relational costs. |
| Keypoint or correspondence detector | Instantiates task-relevant geometric anchors. | Use a temporally stable detector when the task spans several viewpoints. |
| Optimization library or MPC | Consumes the relational cost. | Use it when the keypoint objective should become a physically feasible trajectory. |
| MoveIt 2 | Motion-planning shell around relational goals. | Use it when keypoint constraints need collision-aware trajectory generation. |
| Open3D | Coordinate transforms and geometry utilities. | Use it when keypoints must be reconciled across frames or sensors. |
Code Fragment 2 saves the keypoint relation and its cost value as part of the experiment record. This is the right level of detail for deciding whether failure came from the vision front end or the optimizer.
- Log keypoint identities, coordinates, confidence, and frame.
- Store the relational costs that define the objective, not only the final trajectory.
- Record whether the keypoints were visible, predicted, or carried from memory.
- Compare optimizer output against the same keypoints under repeated seeds or perturbations.
- Fall back to clarification or a denser representation when keypoint confidence collapses.
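A minimal sketch of such an experiment record, reusing the gripper and handle coordinates from the worked example, might look like the following. The field names, the confidence values, the `"camera"` frame label, and the `"placeholder-optimizer"` string are assumptions for illustration rather than a schema used by ReKep.

```python
import json

gripper = (0.42, 0.18)
handle = (0.47, 0.21)
cost = round(abs(gripper[0] - handle[0]) + abs(gripper[1] - handle[1]), 3)

# Hypothetical record layout: keypoints, relation, induced cost, optimizer.
record = {
    "keypoints": {
        "gripper": {"xy": gripper, "confidence": 0.97, "frame": "camera", "source": "visible"},
        "handle": {"xy": handle, "confidence": 0.91, "frame": "camera", "source": "visible"},
    },
    "relation": "align(gripper, handle)",
    "cost": {"l1_alignment_cost": cost},
    "optimizer": "placeholder-optimizer",  # name of whichever solver consumed the cost
}

# json.dumps writes tuples as JSON arrays, so the record round-trips as lists.
print(json.dumps(record, indent=2))
```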
The expected output is a compact interface record: named keypoints, an explicit relational contract, the induced cost, and the downstream optimizer. This is the right evidence shape because it reveals whether a failure came from unstable keypoints, a poor relation choice, or a trajectory optimizer that could not satisfy a valid constraint.
If ReKep-style planning fails, inspect keypoint quality first, then relation design, then trajectory optimization. It is easy to blame the optimizer for what is actually a mis-specified or unstable geometric abstraction.
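The inspection order can be written down as a small triage function so that it is applied the same way every time. This is a sketch under assumptions: the record fields `confidence`, `relation_verified`, and `final_cost`, the thresholds, and the returned labels are all hypothetical.

```python
def triage(record, min_conf=0.6, cost_tol=0.01):
    """Return the first stage that looks responsible, checked in the order given above."""
    # 1. Keypoint quality: any low-confidence keypoint is suspected first.
    low = [n for n, k in record["keypoints"].items() if k["confidence"] < min_conf]
    if low:
        return "keypoint_quality"
    # 2. Relation design: did a reviewer or verifier confirm the relation matches the command?
    if not record["relation_verified"]:
        return "relation_design"
    # 3. Trajectory optimization: valid inputs, but the cost was not driven down.
    if record["final_cost"] > cost_tol:
        return "trajectory_optimization"
    return "no_failure_detected"

# Illustrative record: confident keypoints, verified relation, unsatisfied cost.
example = {
    "keypoints": {"gripper": {"confidence": 0.97}, "handle": {"confidence": 0.91}},
    "relation_verified": True,
    "final_cost": 0.08,
}
print(triage(example))
```

Only when the first two checks pass does the function point at the optimizer, which guards against blaming it for an unstable abstraction.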
Relational keypoints are a strong language-to-optimization interface when the task geometry is low dimensional and the keypoints are semantically meaningful.
Choose a manipulation task and propose three keypoints plus two relational costs that express it. Then describe one scene variation where this abstraction would likely break and need a denser representation.
ReKep paper and code.
This is the primary ReKep reference for expressing manipulation tasks through relational keypoint constraints.
MoveIt 2.
MoveIt is relevant when relational constraints must become collision-aware robot trajectories.
Open3D Documentation and Repository.
Open3D is a practical geometry toolkit for point and keypoint manipulation across frames.