"Every dataset has a junk drawer. The question is whether you labeled it."
A Patient Data Curator
Teleoperation produces raw experience, not automatically useful training data. Quality gates decide whether each episode is clean supervision, recovery supervision, stress evaluation, relabeling work, or discard.
Quality As A Labeling Problem
An episode has at least three truth layers: what the operator intended, what the robot executed, and what the environment actually did. Labels should distinguish these layers. "Failure" is too coarse; "object slipped after contact", "operator intervened", "camera dropped frames", and "reset distribution mismatch" teach different lessons.
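One way to keep the three truth layers apart is to store them as separate fields on each episode record. The sketch below is a toy encoding, not a prescribed schema: the class name `EpisodeTruth`, its fields, and the idea of summarizing each layer as a symbolic end state are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EpisodeTruth:
    """Hypothetical record that keeps the three truth layers separate."""
    episode_id: str
    intended: str   # what the operator intended (symbolic end state)
    executed: str   # what the robot was commanded to reach
    observed: str   # what the environment actually ended up as
    failure_label: Optional[str] = None  # mechanism, not just "failure"

    def first_divergence(self) -> Optional[str]:
        """Name the first layer that departs from the one before it."""
        if self.executed != self.intended:
            return "execution"
        if self.observed != self.executed:
            return "environment"
        return None


ep = EpisodeTruth(
    episode_id="ep-001",
    intended="cup_on_shelf",
    executed="cup_on_shelf",
    observed="cup_on_table",
    failure_label="object slipped after contact",
)
print(ep.first_divergence(), "|", ep.failure_label)
```

Here the operator's intent and the robot's command agree, so the divergence is attributed to the environment layer, and the mechanism label says why.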
A useful data-quality score can be written as a weighted checklist:
$$Q(e) = w_s S(e) + w_c C(e) + w_t T(e) + w_l L(e) - w_i I(e),$$
where $S$ is task success, $C$ calibration validity, $T$ timing validity, $L$ label completeness, and $I$ intervention burden. The weights should be stated in the dataset card rather than rediscovered from code.
A failed rollout can be more useful than a success if it localizes the missing skill. Label the mechanism of failure, not only the outcome.
Use dataset validators, Pydantic schemas, and LeRobot metadata checks to automate the boring parts of quality control. Human review should be reserved for semantic labels, ambiguous failures, and split decisions that require task knowledge.
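A minimal sketch of such an automated check follows. To stay dependency-free it uses a standard-library dataclass with `__post_init__` as a stand-in for a Pydantic schema; it does not use Pydantic or LeRobot APIs. The field names, the allowed outcome values, and the rule that every non-success needs a failure mechanism are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed outcome vocabulary for this sketch.
ALLOWED_OUTCOMES = {"success", "partial", "fail", "abort"}


@dataclass
class EpisodeRecord:
    """Hypothetical episode row; validation runs on construction."""
    episode_id: str
    outcome: str
    failure_mechanism: str  # empty string when the episode succeeded
    calibration_version: str

    def __post_init__(self) -> None:
        if self.outcome not in ALLOWED_OUTCOMES:
            raise ValueError(f"{self.episode_id}: unknown outcome {self.outcome!r}")
        if self.outcome != "success" and not self.failure_mechanism:
            raise ValueError(f"{self.episode_id}: non-success needs a failure mechanism")
        if not self.calibration_version:
            raise ValueError(f"{self.episode_id}: missing calibration version")


def validate_rows(rows):
    """Return (valid records, error messages) without stopping at the first bad row."""
    valid, errors = [], []
    for row in rows:
        try:
            valid.append(EpisodeRecord(**row))
        except (TypeError, ValueError) as exc:
            errors.append(str(exc))
    return valid, errors


rows = [
    {"episode_id": "a", "outcome": "success", "failure_mechanism": "", "calibration_version": "v3"},
    {"episode_id": "b", "outcome": "fail", "failure_mechanism": "", "calibration_version": "v3"},
    {"episode_id": "c", "outcome": "sucess", "failure_mechanism": "", "calibration_version": "v3"},
]
valid, errors = validate_rows(rows)
print(len(valid), errors)
```

Collecting errors instead of raising on the first one gives the reviewer a complete repair list, which leaves human attention for the semantic cases the schema cannot decide.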
The following example scores episodes and routes them into splits. Notice that the rule keeps stress data rather than deleting every imperfect episode.
# Route episodes by quality gates instead of using one vague keep flag.
# Stress and repair examples stay useful because their failure type is explicit.
episodes = [
    {"id": "a", "success": 1, "calibrated": 1, "synced": 1, "labels": 1, "interventions": 0},
    {"id": "b", "success": 0, "calibrated": 1, "synced": 1, "labels": 1, "interventions": 2},
    {"id": "c", "success": 1, "calibrated": 0, "synced": 0, "labels": 0, "interventions": 0},
]
for e in episodes:
    # Unit weights on the four validity checks; each intervention costs 0.5.
    score = e["success"] + e["calibrated"] + e["synced"] + e["labels"] - 0.5 * e["interventions"]
    # Below threshold, fully labeled episodes become stress data; unlabeled ones go to repair.
    route = "train" if score >= 3.5 else ("stress" if e["labels"] else "repair")
    print(e["id"], score, route)
The expected output sends episode a to training, episode b to stress evaluation, and episode c to repair. That routing is the mechanism behind useful curation: the pipeline does not ask whether an episode is good or bad in the abstract. It asks what scientific role the episode can play after calibration, synchronization, labels, and interventions are known.
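As a check on that routing, the score line can be read as the weighted checklist $Q(e)$ with the weights implied by the example code, $w_s = w_c = w_t = w_l = 1$ and $w_i = 0.5$. These values come from the example only and are not a recommended setting.

$$Q(a) = 1 + 1 + 1 + 1 - 0.5 \cdot 0 = 4, \qquad Q(b) = 0 + 1 + 1 + 1 - 0.5 \cdot 2 = 2, \qquad Q(c) = 1 + 0 + 0 + 0 - 0.5 \cdot 0 = 1.$$

Only episode a clears the 3.5 threshold. Episode b falls back to stress because its labels are complete, and episode c goes to repair because they are not.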
Annotation Schema
| Field | Examples | Use |
|---|---|---|
| Task outcome | success, partial, fail, abort | Evaluation and filtering. |
| Failure mechanism | slip, collision, missed grasp, timeout, perception error | Error analysis and recovery training. |
| Intervention | none, human correction, emergency stop | Safety and autonomy measurement. |
| Data health | synced, dropped frames, calibration stale | Quality routing. |
| Instruction | language command, goal image, task id | Language-conditioned policy training. |
- Validate timestamps and stream lengths.
- Check calibration version against the collection session.
- Run automated label sanity checks.
- Manually inspect a stratified sample by task, operator, and outcome.
- Freeze split assignment and save the manifest hash.
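The first and last items of this checklist can be sketched with the standard library alone. The function names `check_streams` and `manifest_hash` are hypothetical, and the length comparison assumes all streams have already been resampled to a common rate; streams recorded at different native rates would need a different check.

```python
import hashlib
import json


def check_streams(streams):
    """Return a list of data-health problems for one episode.

    `streams` maps a stream name to its list of timestamps in seconds.
    Assumes every stream was resampled to the same rate.
    """
    problems = []
    lengths = {name: len(ts) for name, ts in streams.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"stream lengths differ: {lengths}")
    for name, ts in streams.items():
        if any(later <= earlier for earlier, later in zip(ts, ts[1:])):
            problems.append(f"{name}: timestamps not strictly increasing")
    return problems


def manifest_hash(split_assignment):
    """Hash a split manifest deterministically (sorted keys, fixed separators)."""
    payload = json.dumps(split_assignment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


good = {"camera": [0.0, 0.1, 0.2], "joints": [0.0, 0.1, 0.2]}
bad = {"camera": [0.0, 0.1], "joints": [0.0, 0.2, 0.1]}
print(check_streams(good))
print(check_streams(bad))

manifest = {"train": ["a"], "stress": ["b"], "repair": ["c"]}
print(manifest_hash(manifest))
```

Serializing with sorted keys means the hash depends on the split contents rather than on dictionary insertion order, so a stored hash can later confirm that the frozen manifest has not changed.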
Mechanism: Separating Outcome From Cause
Outcome labels answer whether the task succeeded. Cause labels answer why it succeeded or failed. A robot can fail because the perception system localized the object incorrectly, because the gripper command saturated, because the operator intervened late, or because the reset placed the object outside the intended distribution. These cases should not be merged into one negative class because they imply different repairs.
For training, cause labels can support recovery data, filtering, or curriculum design. For evaluation, they allow per-failure reporting: a new policy may reduce missed grasps while leaving occlusion failures unchanged. That is a more useful research result than a single aggregate score with no diagnosis.
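Per-failure reporting of this kind needs little more than a counter keyed by cause label. The sketch below uses made-up rollouts for two hypothetical policies; the numbers are illustrative and not results from any experiment.

```python
from collections import Counter


def failure_breakdown(rollouts):
    """Count non-successful rollouts by cause label; successes are skipped."""
    return Counter(r["cause"] for r in rollouts if r["outcome"] != "success")


# Illustrative data only.
old_policy = [
    {"outcome": "fail", "cause": "missed grasp"},
    {"outcome": "fail", "cause": "missed grasp"},
    {"outcome": "fail", "cause": "occlusion"},
    {"outcome": "success", "cause": None},
]
new_policy = [
    {"outcome": "fail", "cause": "occlusion"},
    {"outcome": "success", "cause": None},
    {"outcome": "success", "cause": None},
    {"outcome": "success", "cause": None},
]

before = failure_breakdown(old_policy)
after = failure_breakdown(new_policy)
for cause in sorted(set(before) | set(after)):
    print(f"{cause}: {before[cause]} -> {after[cause]}")
```

The aggregate success rate rises from one in four to three in four, but the breakdown shows that all of the gain came from missed grasps, while occlusion failures did not move.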
Do not place near-duplicate episodes from the same collection burst into both train and validation. The model can appear to generalize while merely repeating a neighboring trajectory.
For a bin-picking dataset, the validation split should hold out object instances or clutter layouts, not merely every tenth video. Otherwise validation measures replay familiarity rather than deployment readiness.
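The difference between the two split strategies can be made concrete with a small leak check. In this sketch the episode layout, the `object` field, and the function names are assumptions: thirty hypothetical episodes are collected in three bursts of ten, one burst per object instance.

```python
def split_every_nth(episodes, n=10):
    """Naive split: every n-th episode goes to validation."""
    return {ep["id"]: ("val" if i % n == 0 else "train") for i, ep in enumerate(episodes)}


def split_by_group(episodes, group_key, held_out):
    """Group-aware split: whole groups go to validation, never single episodes."""
    return {ep["id"]: ("val" if ep[group_key] in held_out else "train") for ep in episodes}


def leaked_groups(episodes, assignment, group_key):
    """Return the groups that appear in more than one split."""
    seen = {}
    for ep in episodes:
        seen.setdefault(ep[group_key], set()).add(assignment[ep["id"]])
    return sorted(group for group, splits in seen.items() if len(splits) > 1)


# 30 hypothetical episodes: ids 0-9 use obj_0, 10-19 use obj_1, 20-29 use obj_2.
episodes = [{"id": i, "object": f"obj_{i // 10}"} for i in range(30)]

naive = split_every_nth(episodes)
grouped = split_by_group(episodes, "object", held_out={"obj_2"})

print(leaked_groups(episodes, naive, "object"))
print(leaked_groups(episodes, grouped, "object"))
```

The every-tenth split puts each object instance in both train and validation, while the group-aware split leaks none. The same check applies with a collection-burst or clutter-layout id as the group key.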
Large robot datasets increasingly need active data selection: choose the next demonstrations that reduce uncertainty, fill coverage gaps, or stress known failure modes. The research challenge is to make this selection reliable without overfitting to a narrow benchmark.
Can each failure in your dataset be routed to perception, calibration, action representation, timing, contact dynamics, or task specification? If not, the labels are too coarse for serious improvement.
Data quality is not the absence of failures. It is the presence of typed, synchronized, split-aware evidence that tells the learner and the researcher what each episode means.
Write five failure labels for a pouring task and specify which labels belong in train, stress validation, or repair queues.
What's Next
Section 23.6 shows how a standardized dataset format, especially LeRobotDataset, turns these quality decisions into reusable files and metadata.