Section 27.2: Detection, segmentation, and the Segment Anything family | Building Embodied AI: From Perception to Autonomous Action

"A mask is an action region only after geometry, tracking, and uncertainty survive contact with the robot."
A Patient Embodied AI Agent

**Figure 27.2A**: A mask earns trust only after it stays trackable, metric, and safe enough for the next skill. (Illustration: a robot follows a segmented object region through video while checking clearance before reaching, connecting masks to action-safe regions.)

Big Picture

Detection, segmentation, and the Segment Anything family turn boxes and masks into regions a robot can track, avoid, grasp, wipe, pour into, or ignore. A mask is only a beginning; the embodied question is whether the region remains stable, metric, and useful under motion.

Problem First: Why This Representation Exists

Promptable masks from Segment Anything-style models are powerful, but a robot cannot act on a mask until the mask is metric, tracked, calibrated, and filtered for task relevance.

The contract here maps detector or segmenter output to robot-safe regions: box or mask, camera frame, calibration, confidence, temporal stability, and allowed action consumer.
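One way to make this contract concrete is a small record type that carries every field listed above. The sketch below is only an illustration: the class name `ActionRegion`, its field names, and the example frame and calibration identifiers are assumptions made for this example, not part of any library or of a pipeline described in this section.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class ActionRegion:
    """Hypothetical contract record: one perception output plus what a consumer needs."""
    mask: np.ndarray                      # HxW boolean mask in image coordinates
    box_xyxy: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    camera_frame: str                     # frame the mask is expressed in
    calibration_id: str                   # which calibration produced the geometry
    confidence: float                     # detector or segmenter confidence in [0, 1]
    temporal_stability: float             # track stability in [0, 1]
    allowed_consumer: str                 # the one skill allowed to act on this region

    def is_complete(self) -> bool:
        """True only if every contract field is present and in range."""
        return (
            self.mask.dtype == bool
            and bool(self.mask.any())
            and bool(self.camera_frame)
            and bool(self.calibration_id)
            and 0.0 <= self.confidence <= 1.0
            and 0.0 <= self.temporal_stability <= 1.0
            and bool(self.allowed_consumer)
        )


# Usage: a 4x4 image with a 2x2 object region (all values are made up).
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
region = ActionRegion(
    mask=mask,
    box_xyxy=(1, 1, 2, 2),
    camera_frame="camera_color_optical_frame",
    calibration_id="calib_example",
    confidence=0.9,
    temporal_stability=0.8,
    allowed_consumer="grasp",
)
print(region.is_complete())
```

Keeping the consumer inside the record is a design choice: a region approved for avoidance is not silently reused for grasping.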

Action Is The Unit Of Meaning

A segmentation mask becomes embodied knowledge when it defines a forbidden zone, grasp patch, placement area, or inspection target with enough geometry for control.

Figure 27.2.1 should be read as a detection and mask contract: object extent, mask quality, pose frame, uncertainty, latency, and downstream grasp or navigation consumer must be explicit.

Figure 27.2.1: From promptable masks to robot-safe action regions. The dashed feedback path reminds the reader that perception quality is judged by action consequences and replayable diagnostics.

Mathematical Core

Segmentation quality is usually measured by overlap, but control also needs stability and action utility.

Formal Object

$\mathrm{IoU}(M,\hat M)=\frac{|M\cap\hat M|}{|M\cup\hat M|},\quad q_{\mathrm{act}}=\mathrm{IoU}\cdot p_{\mathrm{track}}\cdot \mathbf 1[\mathrm{clearance}>\epsilon]$

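Intersection-over-union rewards geometric overlap between the reference mask $M$ and the predicted mask $\hat M$. The action score multiplies it by the temporal track confidence $p_{\mathrm{track}}$ and gates it with an indicator that equals 1 only when clearance exceeds the threshold $\epsilon$, because a beautiful mask that flickers or violates clearance can still produce a bad grasp.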
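As a minimal sketch of the formal object, the following computes IoU directly from two boolean masks and then applies the gated action score. The function names `binary_mask_iou` and `action_score`, the toy 4x4 masks, and the numeric values are all assumptions chosen for illustration.

```python
import numpy as np


def binary_mask_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty masks: define IoU as 0 to avoid dividing by zero
    return float(np.logical_and(a, b).sum() / union)


def action_score(iou, p_track, clearance_m, epsilon_m):
    """q_act = IoU * p_track * 1[clearance > epsilon]."""
    return iou * p_track * float(clearance_m > epsilon_m)


# Toy masks: two 2x2 squares that share one column (2 shared pixels, 6 in the union).
reference = np.zeros((4, 4), dtype=bool)
reference[0:2, 0:2] = True
predicted = np.zeros((4, 4), dtype=bool)
predicted[0:2, 1:3] = True

iou = binary_mask_iou(reference, predicted)
print(round(iou, 3))                                   # 2 / 6
print(action_score(iou, p_track=0.9, clearance_m=0.05, epsilon_m=0.03))
print(action_score(iou, p_track=0.9, clearance_m=0.01, epsilon_m=0.03))  # gated to 0.0
```

Because the clearance term is an indicator rather than a weight, no amount of overlap or track confidence can compensate for a violated clearance threshold.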

Mask-to-affordance pipeline

Generate boxes, prompts, or masks from the image stream.
Filter masks by area, stability, boundary quality, and temporal identity.
Lift candidate mask pixels through depth or a known support plane.
Score each mask by affordance, clearance, latency, and track consistency.
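The lifting step in the pipeline above can be sketched with a pinhole camera model. This example assumes a depth image in metres that is already aligned to the image the mask came from, and intrinsics `fx`, `fy`, `cx`, `cy` taken from calibration; the function name `lift_mask_to_points` and the toy values are hypothetical.

```python
import numpy as np


def lift_mask_to_points(mask, depth_m, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame XYZ (metres)."""
    valid = mask & np.isfinite(depth_m) & (depth_m > 0)  # drop holes in the depth image
    v, u = np.nonzero(valid)                             # v = row index, u = column index
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)                   # shape (N, 3)


# Toy 4x4 depth image at 1 m with one invalid (zero-depth) pixel.
depth = np.ones((4, 4))
depth[2, 2] = 0.0
mask = np.zeros((4, 4), dtype=bool)
mask[1, 3] = True   # valid depth, kept
mask[2, 2] = True   # zero depth, dropped

points = lift_mask_to_points(mask, depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points)
```

The returned points are in the camera frame; a frame transform is still needed before a planner in the robot base frame can use them.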

Mask Choices For Action

Design Choice	Use When	Control Risk
Detector box	Fast object proposals and coarse avoidance	Box can include unsafe background or miss thin handles.
Instance mask	Grasping, pouring, wiping, and contact planning	Boundary errors become contact errors near clutter.
SAM or SAM 2 promptable mask	Interactive data creation, video tracking, open-world regions	Prompt sensitivity and temporal drift need explicit validation.

Worked Miniature

Code Fragment 27.2.1 computes a small action score for three candidate masks. The variables mimic what a ROS or PyTorch pipeline should publish after detection and segmentation.

# Convert mask quality into an action-safe ranking.
# IoU alone is not enough; temporal stability and clearance gate execution.
import numpy as np

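# Three candidate masks, one entry per candidate.
mask_iou = np.array([0.91, 0.78, 0.86])          # overlap score per candidate
track_confidence = np.array([0.62, 0.95, 0.88])  # temporal track confidence in [0, 1]
clearance_m = np.array([0.015, 0.060, 0.035])    # clearance in metres
safe = clearance_m > 0.03                        # boolean gate: more than 3 cm of clearance

# The boolean gate multiplies as 0 or 1, so unsafe candidates score exactly zero.
action_score = mask_iou * track_confidence * safe
print(np.round(action_score, 3))
print(int(action_score.argmax()))                # index of the chosen candidate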

[0.    0.741 0.757]
2

The expected output shows that candidate 0 is zeroed out by the safety gate before ranking, even though it has the highest overlap score (0.91): its 0.015 m clearance falls below the 0.03 m threshold. The printed index matters operationally: index 2 is the chosen mask because it remains both trackable and action-safe.

Code Fragment 27.2.1: The `safe` gate removes a high-overlap mask that is too close to an obstacle. The winning mask is not simply the one with the largest IoU; it is the region that remains trackable and leaves enough clearance for action.

Library Shortcut

A practical stack can pair an object detector with SAM 2 style promptable segmentation, then publish masks through ROS 2 image messages. That reduces custom mask generation to a few calls while the system still owns the action score, temporal checks, and clearance threshold.

Failure Mode To Test

Promptable segmentation can make a mask look authoritative even when it is action-ambiguous. Always test whether the same mask identity survives camera motion, partial occlusion, and contact.
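One simple proxy for this test is the IoU between consecutive masks of the same track. The sketch below assumes the masks have already been associated to one track and share one image size; the function names and the 0.5 threshold are illustrative assumptions, and frame-to-frame IoU will also drop under fast legitimate motion, so it is a screening signal rather than a verdict.

```python
import numpy as np


def frame_to_frame_iou(masks):
    """IoU between each consecutive pair of boolean masks in one track."""
    ious = []
    for prev, curr in zip(masks[:-1], masks[1:]):
        union = np.logical_or(prev, curr).sum()
        inter = np.logical_and(prev, curr).sum()
        ious.append(float(inter / union) if union else 0.0)
    return ious


def track_survives(masks, min_iou=0.5):
    """True if no consecutive pair falls below the (assumed) stability threshold."""
    return all(iou >= min_iou for iou in frame_to_frame_iou(masks))


# Toy track on a 1x6 strip: a small shift, then a jump to a different region.
frame0 = np.zeros((1, 6), dtype=bool)
frame0[0, 0:4] = True
frame1 = np.zeros((1, 6), dtype=bool)
frame1[0, 1:5] = True
frame2 = np.zeros((1, 6), dtype=bool)
frame2[0, 4:6] = True

ious = frame_to_frame_iou([frame0, frame1, frame2])
print(np.round(ious, 2))                          # 3/5 then 1/5
print(track_survives([frame0, frame1, frame2]))   # the second pair falls below 0.5
```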

Practical Example

For a mobile manipulator clearing a table, boxes are useful for object proposals, masks are useful for contact boundaries, and tracks are useful for deciding whether an object moved after the last command. The robot should store all three.
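To illustrate how the three representations relate, the sketch below derives a box from a mask and uses the mask centroid as a crude movement check between two observations. The helper names, the 2-pixel tolerance, and the toy masks are assumptions for this example; both helpers assume a non-empty mask, and a real system would compare metric positions rather than pixels.

```python
import numpy as np


def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) pixel box around a non-empty boolean mask."""
    rows, cols = np.nonzero(mask)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())


def centroid(mask):
    """(x, y) pixel centroid of a non-empty boolean mask."""
    rows, cols = np.nonzero(mask)
    return float(cols.mean()), float(rows.mean())


def object_moved(mask_before, mask_after, tol_px=2.0):
    """True if the centroid shifted by more than tol_px between observations."""
    (x0, y0), (x1, y1) = centroid(mask_before), centroid(mask_after)
    return float(np.hypot(x1 - x0, y1 - y0)) > tol_px


# Toy observations before and after a command: the object shifts 4 pixels right.
before = np.zeros((8, 8), dtype=bool)
before[2:4, 2:4] = True
after = np.zeros((8, 8), dtype=bool)
after[2:4, 6:8] = True

print(mask_to_box(before))          # (2, 2, 3, 3)
print(object_moved(before, after))  # True
```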

Memory Hook

For Detection, segmentation, and the Segment Anything family, the perception result must answer what action changed, what uncertainty changed, and what log would reproduce the decision. Otherwise the output is still visualization, not embodied evidence.

Debugging And Evaluation

Evaluate masks in the downstream skill: record prompt or detector class, mask polygon, depth association, frame transform, selected grasp or avoidance action, and mask-induced failure label.
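A replay log entry covering these fields could look like the following sketch. The record type `MaskReplayRecord`, its field names, and the example values are hypothetical; the point is only that each item listed above becomes an explicit, serializable field.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class MaskReplayRecord:
    """Hypothetical per-decision log entry for mask-driven skills."""
    prompt_or_class: str               # prompt text or detector class
    mask_polygon_px: List[List[int]]   # polygon vertices as [x, y] pixel pairs
    depth_associated: bool             # whether depth was attached to the mask
    frame_transform: str               # source and target frame of the transform used
    selected_action: str               # chosen grasp or avoidance action
    failure_label: Optional[str] = None  # mask-induced failure label, if any


record = MaskReplayRecord(
    prompt_or_class="mug",
    mask_polygon_px=[[10, 12], [40, 12], [40, 55], [10, 55]],
    depth_associated=True,
    frame_transform="camera_color_optical_frame->base_link",
    selected_action="top_grasp",
    failure_label="boundary_leak",
)

# One JSON object per line keeps the log easy to append to and to replay.
line = json.dumps(asdict(record), sort_keys=True)
restored = MaskReplayRecord(**json.loads(line))
print(restored == record)
```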

Perturb clutter, transparent objects, overlapping boundaries, prompt wording, and camera viewpoint, then check whether the mask stays stable enough for the same action primitive.

Research Frontier

SAM 2 extended promptable segmentation from images to video by adding a streaming memory, which is especially relevant to robotics because robots need mask identity across time. The research frontier is turning these open-world masks into calibrated, persistent, and safety-aware action regions.

What's Next

Section 27.3 grounds the masks from this section in metric space: once you have a stable pixel region, you need to know exactly how far away it is before the robot can plan a safe approach.

Section References

Ravi, N. et al. SAM 2: Segment Anything in Images and Videos. arXiv, 2024. https://arxiv.org/abs/2408.00714

Introduces SAM 2, including streaming memory for video segmentation and interactive correction.

Meta AI. Introducing Segment Anything Model 2. https://ai.meta.com/research/sam2/

Official overview of SAM 2 capabilities and video memory behavior.

Self Check

Can you name the representation, the consuming action, the uncertainty or freshness field, and the failure label for Detection, segmentation, and the Segment Anything family? If any one is missing, the section is not yet ready for a robot replay log.

Key Takeaway

Detection finds candidates, segmentation shapes them, tracking stabilizes them, and action scoring decides whether the robot can use them.

Exercise 27.2.1

Choose a cluttered manipulation task and define one mask-quality metric, one tracking metric, and one action-safety metric. Explain which one should veto execution.