"I collected the demonstration in a kitchen. The robot insists it has never been to one."
A Portable Gripper
UMI-style collection uses handheld grippers to gather rich manipulation demonstrations away from the robot. The method is powerful because it scales scene diversity, but it works only if the policy interface converts human-held trajectories into actions a robot can execute with matching latency and geometry.
From Human Motion To Robot Action
The core abstraction is a relative trajectory. Instead of asking the policy to imitate absolute human pose, the dataset can store local end-effector displacements over a short horizon:
$$\Delta x_{t:t+H} = (T_t^{-1}T_{t+1}, T_t^{-1}T_{t+2}, \ldots, T_t^{-1}T_{t+H}),$$
where $T_t$ is the gripper pose at time $t$. Relative actions reduce dependence on a specific room coordinate frame and make it easier to deploy the learned behavior on a robot with a different base pose.
Handheld systems trade robot hardware cost for calibration discipline. Camera extrinsics, gripper geometry, timing, and object scale become the bridge between human demonstration and robot execution.
Code Fragment 1 shows a compact version of the relative-action transform used in many trajectory datasets. The example uses one-dimensional poses so the arithmetic is inspectable: each action is the displacement from one pose to the next, which is the $H = 1$ case of the equation above. The same idea extends to full SE(3) transforms from Chapter 4.
# Convert absolute gripper positions into local relative actions.
# This teaches the representation before a full SE(3) trajectory library hides it.
positions_m = [0.40, 0.43, 0.47, 0.46]
relative_actions = []
for current, nxt in zip(positions_m, positions_m[1:]):
    relative_actions.append(round(nxt - current, 3))
print(relative_actions)
print("largest step:", max(abs(x) for x in relative_actions), "m")
The pedagogical transform above is a few lines because it is one-dimensional. In production, use a robotics transform library, UMI tooling, or LeRobot conversion utilities to handle SE(3), camera streams, timestamps, and serialization consistently.
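Between the one-dimensional fragment and a full SE(3) library, a planar version can make the horizon form $T_t^{-1}T_{t+k}$ and its frame independence concrete. The sketch below is a simplification for illustration, not UMI's implementation: poses are hypothetical `(x, y, theta)` tuples in SE(2), and the names `relative_pose`, `relative_chunk`, and `move_frame` are made up for this example. It expresses the same handheld trajectory in two different world frames and computes the relative action chunk in each.

```python
import math

def relative_pose(ref, target):
    """Return T_ref^{-1} T_target for planar poses given as (x, y, theta)."""
    dx, dy = target[0] - ref[0], target[1] - ref[1]
    c, s = math.cos(ref[2]), math.sin(ref[2])
    # Rotate the world-frame offset into the reference gripper frame.
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    # Wrap the heading difference into [-pi, pi).
    dtheta = (target[2] - ref[2] + math.pi) % (2 * math.pi) - math.pi
    return (local_x, local_y, dtheta)

def relative_chunk(poses, t, horizon):
    """Relative actions from pose t to each of the next `horizon` poses."""
    return [relative_pose(poses[t], poses[t + k]) for k in range(1, horizon + 1)]

def move_frame(pose, shift_x, shift_y, yaw):
    """Express a pose in a world frame that is rotated and shifted."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, theta = pose
    return (c * x - s * y + shift_x, s * x + c * y + shift_y, theta + yaw)

# Hypothetical handheld trajectory recorded in a kitchen frame.
kitchen_poses = [(0.40, 0.10, 0.0), (0.43, 0.10, 0.1), (0.47, 0.12, 0.2), (0.46, 0.15, 0.2)]
# The same motion expressed in a robot-cell frame with a different origin and heading.
lab_poses = [move_frame(p, 2.0, -1.0, math.pi / 2) for p in kitchen_poses]

chunk_kitchen = relative_chunk(kitchen_poses, t=0, horizon=3)
chunk_lab = relative_chunk(lab_poses, t=0, horizon=3)

for k_step, l_step in zip(chunk_kitchen, chunk_lab):
    print([round(v, 3) for v in k_step], [round(v, 3) for v in l_step])
```

The two chunks agree up to floating-point error because the world-frame change cancels in $T_t^{-1}T_{t+k}$, which is the property that lets a policy trained on kitchen data run on a robot with a different base pose.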
A team teaching towel-folding can collect handheld demonstrations in several real kitchens, then deploy the policy on one robot cell. The dataset card should state which towels, table heights, lighting conditions, and gripper calibrations were seen during collection.
Data Collection Recipe
- Record synchronized egocentric video, gripper pose, gripper width, and task instruction.
- Calibrate gripper geometry and camera extrinsics before each session.
- Store both absolute sensor poses and relative policy actions.
- Match inference-time latency during training so the policy sees the same delay it will experience on hardware.
- Validate with a robot replay protocol before counting the episode as deployable data.
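One possible reading of the latency-matching step can be sketched in a few lines. This is an assumption about how such alignment might be done, not a description of UMI's pipeline: the function name `latency_matched_chunk`, the 10 Hz timestamps, and the 150 ms latency are all hypothetical. The idea is that the training target for an observation starts at the first action the robot could actually execute after the delay has elapsed.

```python
from bisect import bisect_left

def latency_matched_chunk(obs_time_s, action_times_s, actions, latency_s, horizon):
    """Return up to `horizon` actions timestamped at or after obs_time_s + latency_s."""
    # Actions before this index would already be stale when the robot can act.
    start = bisect_left(action_times_s, obs_time_s + latency_s)
    return actions[start:start + horizon]

# Hypothetical 10 Hz action stream; the labels stand in for relative actions.
action_times_s = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
actions = ["a0", "a1", "a2", "a3", "a4", "a5"]

no_delay = latency_matched_chunk(0.0, action_times_s, actions, latency_s=0.0, horizon=3)
with_delay = latency_matched_chunk(0.0, action_times_s, actions, latency_s=0.15, horizon=3)

print(no_delay)    # ['a0', 'a1', 'a2']
print(with_delay)  # ['a2', 'a3', 'a4']
```

With the delay applied, the first two actions are dropped from the target, so the policy is never trained to issue commands that would arrive too late on hardware.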
In-the-wild demonstrations often include human body motion, gaze, and tactile feedback that the robot will not have. The dataset card should state which cues are available to the robot and which cues were only available to the person collecting data.
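A dataset card can make this distinction mechanical. The sketch below assumes hypothetical cue names and simply lists what the collector had that the robot will not have.

```python
# Hypothetical cue inventories for a dataset card; the names are illustrative only.
demonstration_cues = {"wrist_camera", "gripper_pose", "gripper_width", "human_gaze", "human_touch"}
robot_cues = {"wrist_camera", "gripper_pose", "gripper_width"}

# Cues present during collection but absent at deployment.
human_only_cues = sorted(demonstration_cues - robot_cues)
print(human_only_cues)  # ['human_gaze', 'human_touch']
```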
Hands-On Lab: Audit A Handheld Demonstration Manifest
Objective
Build a manifest that decides whether handheld demonstrations are ready for robot policy training.
What You'll Practice
- Representing relative actions.
- Checking calibration and latency fields.
- Separating collection metadata from training splits.
Setup
pip install pandas
Steps
Step 1: Create the manifest rows
Start with two episodes. In the second, set the camera_calibrated field to the placeholder string "latency_review" instead of a boolean, so the reader must decide what evidence is missing.
# Build a handheld-demonstration manifest with one deliberate audit field.
# The example includes the stress condition explicitly so the audit can run end to end.
episodes = [
    {"id": "umi_001", "camera_calibrated": True, "latency_ms": 65, "split": "train"},
    {"id": "umi_002", "camera_calibrated": "latency_review", "latency_ms": 140, "split": "stress"},
]
print(episodes)
Step 2: Add a readiness rule
Flag episodes with missing calibration or excessive latency.
# Apply a readiness rule that separates clean data from stress or repair data.
for episode in episodes:
    ready = episode["camera_calibrated"] is True and episode["latency_ms"] <= 100
    episode["split"] = "train" if ready else "review"
# Count episodes per split; a tuple keeps the printed key order stable.
summary = {split: sum(ep["split"] == split for ep in episodes) for split in ("train", "review")}
print(summary)
Expected Output
Step 2 should print a summary with one train episode and one review episode. Then write a short note explaining whether the review row belongs in a stress split or should be recollected.
Stretch Goals
- Add relative-action range checks for gripper translation and rotation.
- Add language annotations and verify that each instruction matches the visible task.
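As a possible starting point for the first stretch goal, the sketch below flags relative actions whose translation or rotation exceeds a per-step limit. The limits, the `(dx, dy, dz, rotation_angle_rad)` tuple layout, and the function name `out_of_range_steps` are assumptions chosen for illustration; real limits depend on the robot and the control rate.

```python
import math

# Hypothetical per-step limits; tune them to the target robot and control rate.
MAX_STEP_M = 0.05
MAX_STEP_RAD = math.radians(15)

def out_of_range_steps(relative_actions):
    """Return indices of steps whose translation or rotation exceeds the limits.

    Each action is assumed to be (dx, dy, dz, rotation_angle_rad).
    """
    flagged = []
    for index, (dx, dy, dz, angle) in enumerate(relative_actions):
        translation = math.sqrt(dx * dx + dy * dy + dz * dz)
        if translation > MAX_STEP_M or abs(angle) > MAX_STEP_RAD:
            flagged.append(index)
    return flagged

steps = [
    (0.03, 0.00, 0.0, 0.05),  # within both limits
    (0.04, 0.00, 0.0, 0.02),  # within both limits
    (0.00, 0.09, 0.0, 0.01),  # translation too large
    (0.01, 0.00, 0.0, 0.60),  # rotation too large
]
print(out_of_range_steps(steps))  # [2, 3]
```

An episode with flagged steps is a candidate for the review split, since a large jump often points to a tracking glitch or a dropped frame rather than a real motion.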
Complete Solution
# Complete manifest audit for handheld demonstrations.
# It keeps clean training data separate from risky but useful stress examples.
episodes = [
    {"id": "umi_001", "camera_calibrated": True, "latency_ms": 65, "split": "train"},
    {"id": "umi_002", "camera_calibrated": False, "latency_ms": 140, "split": "stress"},
]
for episode in episodes:
    ready = episode["camera_calibrated"] is True and episode["latency_ms"] <= 100
    route = "clean-train" if ready else "stress-or-recollect"
    print(episode["id"], route)
UMI demonstrates that data can be collected outside the robot deployment site, but the open research question is how far this portability can go. As tasks become deformable, tool-heavy, or force-sensitive, missing tactile and compliance information may become the limiting factor.
Can you state which information exists in the handheld demonstration but will not exist at robot deployment time? Those missing cues are the first place to look when zero-shot transfer fails.
Handheld collection scales diversity by decoupling demonstration from robot deployment. It succeeds only when relative actions, calibration, and latency make that decoupling explicit.
Write a dataset-card paragraph that explains why a handheld-gripper split tests object generalization rather than memorization of one collector's motion style.
What's Next
Section 23.4 studies immersive and VR teleoperation, where active visual feedback changes what the operator can perceive during collection.
Zhao, T. Z. et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
Introduces ALOHA and ACT, making the connection between low-cost bimanual teleoperation, action chunking, and real-world manipulation data explicit.
A kinematically matched leader device study that directly compares teleoperation ergonomics and reliability against other low-cost interfaces.
Defines the handheld gripper approach, latency matching, and relative-trajectory action interface used in portable demonstration collection.
Cheng, X. et al. (2024). Open-TeleVision: Teleoperation with Immersive Active Visual Feedback.
A current reference for immersive visual feedback, active perception, and VR-style operator embodiment in data collection.
Hugging Face LeRobot Documentation.
Documents dataset conversion, policy training, and robot-control utilities that turn teleoperation logs into reusable learning artifacts.