"Affordance is the promise that perception can name what the body can actually do."
A Patient Embodied AI Agent
Affordances and graspable regions asks what the robot can do with a visible region. Category names are optional; action possibilities such as grasp, push, pull, pour, step, wipe, or avoid are the core output.
Problem First: Why This Representation Exists
Affordance prediction must bind perception to embodiment. A graspable region, traversable gap, or pushable surface is meaningful only relative to gripper geometry, base footprint, force limits, and task goal.
The contract here maps visual evidence to executable possibilities: affordance label, contact frame, body constraint, confidence, action primitive, and success predicate.
An affordance becomes embodied knowledge when it narrows the action set to options that the current robot can execute under the current constraints.
Figure 27.5.1 should be read as an affordance contract: candidate region, contact frame, gripper constraint, confidence, and verifier decide whether the grasp or push is admissible.
Mathematical Core
An affordance map scores actions over image or 3D regions under robot constraints.
$A(r,a)=P(\mathrm{success}\mid \phi(r),a,\theta_{\mathrm{robot}}),\quad a^*=\arg\max_{a\in\mathcal A} A(r,a)-\lambda C(a)$
The feature vector $\phi(r)$ may include mask shape, depth, surface normal, material cues, and semantic features. The cost term $C(a)$ penalizes collision risk, reach limits, force limits, or task time.
- Extract candidate regions from masks, depth discontinuities, or learned heatmaps.
- Estimate local geometry, including surface normals and approach directions.
- Score each region-action pair for success probability and execution cost.
- Send the selected region, pose, and uncertainty to the skill controller.
| Design Choice | Use When | Control Risk |
|---|---|---|
| Heatmap | Fast image-space grasp or push proposals | Must be lifted to metric pose before control. |
| Object-centric affordance | Reusable skills such as open, pour, wipe | Object identity can hide local contact constraints. |
| 3D contact affordance | Dexterous manipulation and humanoid hands | Requires reliable geometry and force-aware execution. |
Worked Miniature
Code Fragment 27.5.1 scores candidate grasp regions by combining learned affordance probability with reach and collision costs. The small table-like arrays stand in for a vision model output.
# Choose a graspable region from affordance and cost terms.
# The selected region must be promising and executable by the robot.
import numpy as np
affordance = np.array([0.88, 0.74, 0.67])
reach_cost = np.array([0.35, 0.08, 0.05])
collision_cost = np.array([0.10, 0.06, 0.30])
score = affordance - 0.6 * reach_cost - 0.8 * collision_cost
print(np.round(score, 3))
print(int(score.argmax()))
The expected output array tells you region 1 is best only after embodiment costs are included, not because it has the strongest raw affordance score. The index 1 therefore means "most executable grasp region under current geometry," not simply "most visually grasp-like patch."
In practice, grasp pipelines often combine learned heatmaps, depth processing in Open3D, and robot-specific inverse kinematics. The libraries shorten perception and geometry work, but the system still needs an explicit region-to-skill contract.
An affordance is not a permission slip. A mug handle may be visually graspable but unreachable from the current arm pose, blocked by clutter, or unsafe under the current force limit.
A service robot loading a dishwasher should score regions by grasp success, collision with nearby dishes, wrist clearance, and whether the chosen contact leaves the object in a stable orientation for placement.
For Affordances and graspable regions, the perception result must answer what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.
Debugging And Evaluation
Evaluate affordances through attempted actions: record candidate region, body model, predicted primitive, force or clearance margin, execution result, and affordance failure label.
Perturb object pose, gripper size, surface friction, occlusion, and task goal, then check whether the affordance changes when the executable action should change.
Affordance learning is moving from category-specific grasp detection toward open-vocabulary, language-conditioned, and 3D contact-aware skill grounding. The hard robotics problem is making those affordances executable under real robot kinematics and contact limits.
Section 27.6 asks a deeper question: rather than passively scoring what is already visible, the robot can choose where to look next, turning perception itself into a deliberate action with an information cost.
Section References
Open3D. Geometry documentation. https://www.open3d.org/docs/release/tutorial/geometry/index.html
Provides practical primitives for normals, point clouds, and geometry processing used in affordance pipelines.
Meta AI. Segment Anything Model 2. https://ai.meta.com/research/sam2/
Promptable masks can provide candidate regions, but robotics still needs affordance and constraint scoring.
Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for Affordances and graspable regions? If any one is missing, the section is not yet ready for a robot replay log.
Affordances are action-conditioned predictions over regions, and they become useful only after reach, collision, contact, and uncertainty constraints are attached.
Pick one household object and list three visible regions. For each region, score grasp, push, and avoid actions, then name the robot constraint that could veto the top score.