"A mask is an action region only after geometry, tracking, and uncertainty survive contact with the robot."
A Patient Embodied AI Agent
Detection, segmentation, and the Segment Anything family turn boxes and masks into regions a robot can track, avoid, grasp, wipe, pour into, or ignore. A mask is only a beginning; the embodied question is whether the region remains stable, metric, and useful under motion.
Problem First: Why This Representation Exists
Promptable masks from Segment Anything-style models are powerful, but a robot cannot grasp or avoid a masked region unless that region is metric, tracked, calibrated, and filtered for task relevance.
The contract here maps detector or segmenter output to robot-safe regions: box or mask, camera frame, calibration, confidence, temporal stability, and allowed action consumer.
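One way to make this contract concrete is a small record type plus a gate that a consumer calls before acting. The sketch below is illustrative only: the `ActionRegion` class, the `usable_by` function, the field names, the frame name, and every threshold are hypothetical choices, not part of any particular detector, segmenter, or ROS interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ActionRegion:
    """Hypothetical container for one detector or segmenter output."""
    mask: np.ndarray           # boolean H x W mask in image coordinates
    camera_frame: str          # frame the image was captured in
    calibrated: bool           # True only if camera calibration is available
    confidence: float          # detector or segmenter score in [0, 1]
    temporal_stability: float  # e.g. track confidence in [0, 1]
    stamp_s: float             # capture time in seconds
    allowed_consumer: str      # e.g. "grasp", "avoid", "inspect"


def usable_by(region, consumer, now_s,
              max_age_s=0.2, min_confidence=0.5, min_stability=0.5):
    """Return True only if every contract field passes (thresholds are assumed)."""
    fresh = (now_s - region.stamp_s) <= max_age_s
    return bool(
        region.calibrated
        and fresh
        and region.allowed_consumer == consumer
        and region.confidence >= min_confidence
        and region.temporal_stability >= min_stability
    )


mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
region = ActionRegion(mask, "camera_color_optical_frame", True, 0.9, 0.8, 10.0, "grasp")

print(usable_by(region, "grasp", now_s=10.1))  # fresh and permitted
print(usable_by(region, "grasp", now_s=11.0))  # too old for control
print(usable_by(region, "wipe", now_s=10.1))   # consumer not allowed
```

The gate is deliberately conjunctive: a region that fails any single field is rejected rather than down-weighted, so a stale or uncalibrated mask never reaches a controller.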
A segmentation mask becomes embodied knowledge when it defines a forbidden zone, grasp patch, placement area, or inspection target with enough geometry for control.
Figure 27.2.1 should be read as a detection and mask contract: object extent, mask quality, pose frame, uncertainty, latency, and downstream grasp or navigation consumer must be explicit.
Mathematical Core
Segmentation quality is usually measured by overlap, but control also needs stability and action utility.
$\mathrm{IoU}(M,\hat M)=\frac{|M\cap\hat M|}{|M\cup\hat M|},\quad q_{\mathrm{act}}=\mathrm{IoU}\cdot p_{\mathrm{track}}\cdot \mathbf 1[\mathrm{clearance}>\epsilon]$
Intersection-over-union rewards geometric overlap. The action score multiplies it by temporal track confidence and a clearance constraint, because a beautiful mask that flickers or violates clearance can still produce a bad grasp.
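As a minimal sketch of the two formulas above, the following computes IoU directly from boolean masks and then applies the track and clearance factors. The function names, the convention of returning 0 for an empty union, and the example numbers are assumptions made for illustration.

```python
import numpy as np


def mask_iou(pred, ref):
    """Intersection-over-union of two boolean masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 0.0  # assumed convention: two empty masks score zero
    return float(np.logical_and(pred, ref).sum() / union)


def action_score(iou, p_track, clearance_m, eps_m):
    """q_act = IoU * p_track * 1[clearance > eps]."""
    return iou * p_track * float(clearance_m > eps_m)


ref = np.zeros((4, 4), dtype=bool)
ref[1:3, 1:3] = True    # 4 reference pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True   # 6 predicted pixels, 4 of them overlapping

iou = mask_iou(pred, ref)                       # 4 / 6
print(round(iou, 3))
print(round(action_score(iou, 0.9, 0.04, 0.03), 3))  # clearance passes
print(action_score(iou, 0.9, 0.02, 0.03))            # clearance fails, score is 0.0
```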
- Generate boxes, prompts, or masks from the image stream.
- Filter masks by area, stability, boundary quality, and temporal identity.
- Lift candidate mask pixels through depth or a known support plane.
- Score each mask by affordance, clearance, latency, and track consistency.
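The lifting step can be sketched with a standard pinhole camera model, assuming a depth image aligned to the color image and known intrinsics. The function name `lift_mask_to_points` and the toy intrinsics below are hypothetical; a real pipeline would take them from the camera calibration and then transform the points into the robot base frame.

```python
import numpy as np


def lift_mask_to_points(mask, depth_m, fx, fy, cx, cy):
    """Back-project masked pixels to 3D points in the camera frame.

    Pixels with missing or non-positive depth are dropped rather than guessed.
    """
    mask = np.asarray(mask, dtype=bool)
    depth_m = np.asarray(depth_m, dtype=float)
    valid = mask & np.isfinite(depth_m) & (depth_m > 0)
    v, u = np.nonzero(valid)          # v = row index, u = column index
    z = depth_m[v, u]
    x = (u - cx) * z / fx             # pinhole model: x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape (N, 3), metres


mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
depth = np.full((4, 4), 1.0)
depth[1, 1] = 0.0                     # simulate a depth hole inside the mask

points = lift_mask_to_points(mask, depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points.shape)                   # three of the four masked pixels survive
print(points[0])
```

Dropping invalid depth instead of interpolating it keeps the point set conservative, which matters for transparent or reflective objects where depth holes tend to cluster.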
| Design Choice | Use When | Control Risk |
|---|---|---|
| Detector box | Fast object proposals and coarse avoidance | Box can include unsafe background or miss thin handles. |
| Instance mask | Grasping, pouring, wiping, and contact planning | Boundary errors become contact errors near clutter. |
| SAM or SAM 2 promptable mask | Interactive data creation, video tracking, open-world regions | Prompt sensitivity and temporal drift need explicit validation. |
Worked Miniature
Code Fragment 27.2.1 computes a small action score for three candidate masks. The variables mimic what a ROS or PyTorch pipeline should publish after detection and segmentation.
```python
# Convert mask quality into an action-safe ranking.
# IoU alone is not enough; temporal stability and clearance gate execution.
import numpy as np

mask_iou = np.array([0.91, 0.78, 0.86])
track_confidence = np.array([0.62, 0.95, 0.88])
clearance_m = np.array([0.015, 0.060, 0.035])

safe = clearance_m > 0.03
action_score = mask_iou * track_confidence * safe

print(np.round(action_score, 3))
print(int(action_score.argmax()))
```
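The first printed line gives scores of 0, 0.741, and 0.757: candidate 0 is zeroed out by the safety gate before ranking, despite having the highest overlap score, because its 0.015 m clearance is below the 0.03 m threshold. The second line matters operationally: index 2 is the chosen mask. Candidates 1 and 2 are both trackable and action-safe, and candidate 2 narrowly wins because its higher IoU outweighs its slightly lower track confidence.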
A practical stack can pair an object detector with SAM 2 style promptable segmentation, then publish masks through ROS 2 image messages. That reduces custom mask generation to a few calls while the system still owns the action score, temporal checks, and clearance threshold.
Promptable segmentation can make a mask look authoritative even when it is action-ambiguous. Always test whether the same mask identity survives camera motion, partial occlusion, and contact.
For a mobile manipulator clearing a table, boxes are useful for object proposals, masks are useful for contact boundaries, and tracks are useful for deciding whether an object moved after the last command. The robot should store all three.
For detection, segmentation, and the Segment Anything family, every perception result must answer three questions: what action it changed, what uncertainty it changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.
Debugging And Evaluation
Evaluate masks in the downstream skill: record prompt or detector class, mask polygon, depth association, frame transform, selected grasp or avoidance action, and mask-induced failure label.
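A minimal sketch of such a record, serialized as one JSON line per decision, is shown here. The class name `MaskEvalRecord`, the field names, and the example values are hypothetical; they only mirror the fields listed above and are not taken from any existing logging schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass
class MaskEvalRecord:
    """Hypothetical per-decision log entry for downstream mask evaluation."""
    source: str                               # prompt text or detector class
    mask_polygon: List[Tuple[float, float]]   # pixel vertices of the mask outline
    depth_association: str                    # how depth was attached to the mask
    frame_transform: str                      # source and target frame names
    selected_action: str                      # grasp or avoidance action chosen
    failure_label: Optional[str]              # None when the action succeeded


record = MaskEvalRecord(
    source="mug",
    mask_polygon=[(10.0, 12.0), (40.0, 12.0), (40.0, 55.0), (10.0, 55.0)],
    depth_association="aligned_depth_median",
    frame_transform="camera_link->base_link",
    selected_action="grasp",
    failure_label=None,
)

line = json.dumps(asdict(record), sort_keys=True)  # one line per decision
restored = json.loads(line)
print(restored["selected_action"], restored["failure_label"])
```

Keeping the failure label in the same record as the mask and transform makes it possible to attribute a failed grasp to the mask rather than to planning or control.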
Perturb clutter, transparent objects, overlapping boundaries, prompt wording, and camera viewpoint, then check whether the mask stays stable enough for the same action primitive.
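One simple proxy for "stable enough" is the IoU between the masks of the same track in consecutive frames. The sketch below assumes the masks are already associated to one track and expressed in a common image frame; the function name `consecutive_iou` and the 0.5 stability threshold are illustrative assumptions, and under strong camera motion the masks would first need to be motion-compensated.

```python
import numpy as np


def consecutive_iou(masks):
    """IoU between each pair of consecutive boolean masks in a track."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    ious = []
    for prev, curr in zip(masks[:-1], masks[1:]):
        union = np.logical_or(prev, curr).sum()
        inter = np.logical_and(prev, curr).sum()
        ious.append(float(inter / union) if union else 0.0)
    return np.array(ious)


frame0 = np.zeros((4, 4), dtype=bool)
frame0[1:3, 1:3] = True
frame1 = frame0.copy()                 # unchanged mask
frame2 = np.zeros((4, 4), dtype=bool)
frame2[1:3, 2:4] = True                # mask jumps one column to the right

ious = consecutive_iou([frame0, frame1, frame2])
stable = bool(ious.min() >= 0.5)       # assumed threshold for this primitive
print(np.round(ious, 3), stable)
```

Using the minimum rather than the mean makes a single flicker visible, which matches the concern that one bad frame at the moment of contact is enough to spoil an action.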
SAM 2 extended promptable segmentation from images to video using a streaming memory, which is especially relevant to robotics because robots need mask identity across time. The research frontier is turning these open-world masks into calibrated, persistent, and safety-aware action regions.
Section 27.3 grounds the masks from this section in metric space: once you have a stable pixel region, you need to know exactly how far away it is before the robot can plan a safe approach.
Section References
Ravi, N. et al. SAM 2: Segment Anything in Images and Videos. arXiv, 2024. https://arxiv.org/abs/2408.00714
Introduces SAM 2, including streaming memory for video segmentation and interactive correction.
Meta AI. Introducing Segment Anything Model 2. https://ai.meta.com/research/sam2/
Official overview of SAM 2 capabilities and video memory behavior.
Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for a detection or segmentation output? If any one is missing, the output is not yet ready for a robot replay log.
Detection finds candidates, segmentation shapes them, tracking stabilizes them, and action scoring decides whether the robot can use them.
Choose a cluttered manipulation task and define one mask-quality metric, one tracking metric, and one action-safety metric. Explain which one should veto execution.