Section 23.3: Handheld and in-the-wild collection (UMI) | Building Embodied AI: From Perception to Autonomous Action

"I collected the demonstration in a kitchen. The robot insists it has never been to one."
A Portable Gripper

**Figure 23.3A**: Handheld collection separates where humans demonstrate from where robots deploy, so the action representation must carry the bridge.

Big Picture

UMI-style collection uses handheld grippers to gather rich manipulation demonstrations away from the robot. The method is powerful because it scales scene diversity, but it works only if the policy interface converts human-held trajectories into actions a robot can execute with matching latency and geometry.

From Human Motion To Robot Action

The core abstraction is a relative trajectory. Instead of asking the policy to imitate absolute human pose, the dataset can store local end-effector displacements over a short horizon:

$$\Delta x_{t:t+H} = (T_t^{-1}T_{t+1}, T_t^{-1}T_{t+2}, \ldots, T_t^{-1}T_{t+H}),$$

where $T_t$ is the gripper pose at time $t$. Relative actions reduce dependence on a specific room coordinate frame and make it easier to deploy the learned behavior on a robot with a different base pose.
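The frame-independence claim can be checked numerically. The following is a minimal sketch, assuming planar poses `(x, y, theta)` instead of full SE(3) so that it needs only the standard library. The helper names `compose`, `relative`, and `relative_trajectory`, the sample poses, and the room offset are all hypothetical illustrations, not part of any UMI tooling.

```python
import math

def compose(a, b):
    """Return the planar pose a * b, where each pose is (x, y, theta)."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, ath + bth)

def relative(a, b):
    """Return a^{-1} * b: pose b expressed in the local frame of pose a."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    dx, dy = bx - ax, by - ay
    return (c * dx + s * dy, -s * dx + c * dy, bth - ath)

def relative_trajectory(poses, t, horizon):
    """Return [T_t^{-1} T_{t+1}, ..., T_t^{-1} T_{t+H}] for a list of poses."""
    return [relative(poses[t], poses[t + k]) for k in range(1, horizon + 1)]

# Hypothetical absolute gripper poses recorded in one room frame.
poses = [(0.40, 0.00, 0.0), (0.43, 0.00, 0.0), (0.47, 0.01, 0.1)]

# The same motion seen from a different room frame (a different base pose).
room_offset = (2.0, -1.0, math.pi / 2)
shifted_poses = [compose(room_offset, p) for p in poses]

rel_a = relative_trajectory(poses, t=0, horizon=2)
rel_b = relative_trajectory(shifted_poses, t=0, horizon=2)

# Relative actions should agree even though the absolute poses differ.
frame_invariant = all(
    math.isclose(u, v, abs_tol=1e-9)
    for pa, pb in zip(rel_a, rel_b)
    for u, v in zip(pa, pb)
)
print("frame invariant:", frame_invariant)
```

Because the room offset cancels inside $T_t^{-1}T_{t+k}$, both trajectories yield the same relative actions, which is the property that lets a robot with a different base pose reuse them.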

Portable Does Not Mean Uncalibrated

Handheld systems trade robot hardware cost for calibration discipline. Camera extrinsics, gripper geometry, timing, and object scale become the bridge between human demonstration and robot execution.

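Code Fragment 1 shows a compact version of the relative-action transform used in many trajectory datasets. The example uses one-dimensional poses so the arithmetic is inspectable, but the same idea extends to full SE(3) transforms from Chapter 4. Note that the fragment stores step-to-step increments, each position relative to the one before it, whereas the equation above expresses every pose in the horizon relative to the single reference pose $T_t$. In one dimension the two encodings carry the same information, because summing the increments recovers the displacement from the reference.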

# Convert absolute gripper positions into local relative actions.
# This teaches the representation before a full SE(3) trajectory library hides it.
positions_m = [0.40, 0.43, 0.47, 0.46]
relative_actions = []

for current, nxt in zip(positions_m, positions_m[1:]):
    relative_actions.append(round(nxt - current, 3))

print(relative_actions)
print("largest step:", max(abs(x) for x in relative_actions), "m")

[0.03, 0.04, -0.01]
largest step: 0.04 m

Code Fragment 1: The relative_actions list stores local motion increments rather than absolute positions. This is the small numeric version of the relative-trajectory interface used to make handheld demonstrations deployable on robot hardware.
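To see why increments are deployable from a different starting pose, the sketch below replays the `relative_actions` values from Code Fragment 1 from two start positions. The `replay` helper and the second start position (0.10 m) are hypothetical, chosen only for illustration.

```python
from itertools import accumulate

# Increments produced by Code Fragment 1.
relative_actions = [0.03, 0.04, -0.01]

def replay(start_m, actions_m):
    """Integrate relative actions from a start position (rounded for display)."""
    return [round(p, 3) for p in accumulate(actions_m, initial=start_m)]

human_path = replay(0.40, relative_actions)  # reproduces the recorded positions
robot_path = replay(0.10, relative_actions)  # same motion from another start

print(human_path)
print(robot_path)
```

The first replay reproduces the recorded positions, and the second traces the same shape of motion from a new origin, which is the one-dimensional analogue of deploying on a robot with a different base pose.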

Library Shortcut

The pedagogical transform above is a few lines because it is one-dimensional. In production, use a robotics transform library, UMI tooling, or LeRobot conversion utilities to handle SE(3), camera streams, timestamps, and serialization consistently.

Practical Example

A team teaching towel-folding can collect handheld demonstrations in several real kitchens, then deploy the policy on one robot cell. The dataset card should state which towels, table heights, lighting conditions, and gripper calibrations were seen during collection.

Data Collection Recipe

Record synchronized egocentric video, gripper pose, gripper width, and task instruction.
Calibrate gripper geometry and camera extrinsics before each session.
Store both absolute sensor poses and relative policy actions.
Match inference-time latency during training so the policy sees the same delay it will experience on hardware.
Validate with a robot replay protocol before counting the episode as deployable data.
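The latency-matching item in the recipe can be sketched as an alignment step: for each action timestamp, pair the action with the newest observation that would already have arrived given a measured sensing delay. This is a minimal sketch under stated assumptions, not the UMI implementation. The function name, the timestamps, and the 65 ms delay are hypothetical.

```python
from bisect import bisect_right

def latest_available_observation(obs_times_s, action_time_s, latency_s):
    """Index of the newest observation captured at or before action_time_s - latency_s.

    Returns None when no observation would have arrived yet.
    obs_times_s must be sorted in ascending order.
    """
    cutoff = action_time_s - latency_s
    idx = bisect_right(obs_times_s, cutoff) - 1
    return idx if idx >= 0 else None

obs_times_s = [0.00, 0.05, 0.10, 0.15, 0.20]  # sorted camera timestamps (s)
action_times_s = [0.05, 0.12, 0.21]           # times at which actions are issued (s)
latency_s = 0.065                             # assumed measured sensing delay

pairs = [
    (t, latest_available_observation(obs_times_s, t, latency_s))
    for t in action_times_s
]
print(pairs)
```

Training on pairs built this way means the policy never sees an observation that would not yet have been available on hardware at the moment the action is issued.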

Pitfall: The Human Solves Hidden Subtasks

In-the-wild demonstrations often include human body motion, gaze, and tactile feedback that the robot will not have. The dataset card should state which cues are available to the robot and which cues were only available to the person collecting data.
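One lightweight way to make that dataset-card statement auditable is to list the cues on each side and take the difference. The sketch below is illustrative only. The cue names come from this section, but how they split between collector and robot is an assumption that each project must verify for its own hardware.

```python
# Cues the human collector had access to during demonstration (assumed list).
human_cues = {
    "egocentric_video", "gripper_pose", "gripper_width",
    "gaze", "tactile_feedback", "body_motion",
}

# Cues the robot policy will actually receive at deployment (assumed list).
robot_cues = {"egocentric_video", "gripper_pose", "gripper_width"}

# Cues available only to the person collecting data.
collector_only = sorted(human_cues - robot_cues)
print("collector-only cues:", collector_only)
```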

Hands-On Lab: Audit A Handheld Demonstration Manifest

Duration: ~45 minutes
Difficulty: Intermediate

Objective

Build a manifest that decides whether handheld demonstrations are ready for robot policy training.

What You'll Practice

Representing relative actions.
Checking calibration and latency fields.
Separating collection metadata from training splits.

Setup

pip install pandas

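Code Fragment 2: The setup command installs pandas so the manifest can be viewed as a table. The steps below build the manifest from plain Python dictionaries, which pandas can load with pd.DataFrame(episodes) when a tabular view is wanted. The lab uses a table because dataset-readiness decisions should be inspectable before training starts.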

Steps

Step 1: Create the manifest rows

Start with two episodes. In the second one, set the calibration field to the placeholder string "latency_review" instead of True or False, so the reader must decide what evidence is missing before the episode can be used.

# Build a handheld-demonstration manifest with one deliberate audit field.
# The example includes the stress condition explicitly so the audit can run end to end.
episodes = [
    {"id": "umi_001", "camera_calibrated": True, "latency_ms": 65, "split": "train"},
    {"id": "umi_002", "camera_calibrated": "latency_review", "latency_ms": 140, "split": "stress"},
]
print(episodes)

Code Fragment 3: The starter manifest exposes camera calibration and latency as first-class fields. Episode umi_002 should not enter the clean training split until its calibration status is resolved.

Step 2: Add a readiness rule

Flag episodes with missing calibration or excessive latency.

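# Apply a readiness rule that separates clean training data from episodes needing review.
# An episode is ready only if calibration is confirmed (exactly True) and latency is at most 100 ms.
for episode in episodes:
    ready = episode["camera_calibrated"] is True and episode["latency_ms"] <= 100
    episode["split"] = "train" if ready else "review"
# Iterate over a tuple rather than a set so the summary keys print in a fixed order.
summary = {split: sum(ep["split"] == split for ep in episodes) for split in ("train", "review")}
print(summary)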

Code Fragment 4: The readiness rule combines calibration and latency instead of relying on a single success flag. That makes the manifest useful for both training and postmortem review.

Expected Output

Step 2 should print a summary that counts one train episode and one review episode. After running it, write a short note explaining whether the review row belongs in a stress split or should be recollected.

Stretch Goals

Add relative-action range checks for gripper translation and rotation.
Add language annotations and verify that each instruction matches the visible task.
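As a starting point for the first stretch goal, the sketch below flags translation steps whose magnitude exceeds a threshold. The function name, the sample values, and the 0.05 m limit are hypothetical. A rotation check would apply the same pattern to per-step angle changes.

```python
def out_of_range_steps(relative_actions_m, max_step_m):
    """Return (index, value) pairs whose magnitude exceeds max_step_m."""
    return [
        (i, step)
        for i, step in enumerate(relative_actions_m)
        if abs(step) > max_step_m
    ]

# The last value is a deliberate outlier, e.g. a tracking jump.
relative_actions_m = [0.03, 0.04, -0.01, 0.12]
flagged = out_of_range_steps(relative_actions_m, max_step_m=0.05)
print(flagged)
```

A non-empty result could be written back into the manifest as another field, so an episode with implausible jumps is routed to review alongside calibration and latency failures.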

Complete Solution

# Complete manifest audit for handheld demonstrations.
# It keeps clean training data separate from risky but useful stress examples.
episodes = [
    {"id": "umi_001", "camera_calibrated": True, "latency_ms": 65, "split": "train"},
    {"id": "umi_002", "camera_calibrated": False, "latency_ms": 140, "split": "stress"},
]
for episode in episodes:
    ready = episode["camera_calibrated"] is True and episode["latency_ms"] <= 100
    route = "clean-train" if ready else "stress-or-recollect"
    print(episode["id"], route)

Code Fragment 5: The complete solution routes umi_002 away from clean training because both calibration and latency are problematic. Keeping that row as stress data can still help evaluate robustness.

Research Frontier

UMI demonstrates that data can be collected outside the robot deployment site, but the open research question is how far this portability can go. As tasks become deformable, tool-heavy, or force-sensitive, missing tactile and compliance information may become the limiting factor.

Self Check

Can you state which information exists in the handheld demonstration but will not exist at robot deployment time? Those missing cues are the first place to look when zero-shot transfer fails.

Key Takeaway

Handheld collection scales diversity by decoupling demonstration from robot deployment. It succeeds only when relative actions, calibration, and latency make that decoupling explicit.

Exercise 23.3.1

Write a dataset-card paragraph that explains why a handheld-gripper split tests object generalization rather than memorization of one collector's motion style.

What's Next

Section 23.4 studies immersive and VR teleoperation, where active visual feedback changes what the operator can perceive during collection.

References & Further Reading

Teleoperation Systems

Zhao, T. Z. et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.

Introduces ALOHA and ACT, making the connection between low-cost bimanual teleoperation, action chunking, and real-world manipulation data explicit.

Wu, P. et al. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators.

A kinematically matched leader device study that directly compares teleoperation ergonomics and reliability against other low-cost interfaces.

Chi, C. et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots.

Defines the handheld gripper approach, latency matching, and relative-trajectory action interface used in portable demonstration collection.

Cheng, X. et al. (2024). Open-TeleVision: Teleoperation with Immersive Active Visual Feedback.

A current reference for immersive visual feedback, active perception, and VR-style operator embodiment in data collection.

Tools

Hugging Face LeRobot Documentation.

Documents dataset conversion, policy training, and robot-control utilities that turn teleoperation logs into reusable learning artifacts.
