Section 23.5: Data quality, diversity, and labeling | Building Embodied AI: From Perception to Autonomous Action

"Every dataset has a junk drawer. The question is whether you labeled it."
A Patient Data Curator

**Figure 23.5A**: Data quality gates turn messy demonstrations into deliberate train, validation, stress, and repair assets.

Big Picture

Teleoperation produces raw experience, not automatically useful training data. Quality gates decide whether each episode is clean supervision, recovery supervision, stress evaluation, relabeling work, or discard.

Quality As A Labeling Problem

An episode has at least three truth layers: what the operator intended, what the robot executed, and what the environment actually did. Labels should distinguish these layers. "Failure" is too coarse; "object slipped after contact", "operator intervened", "camera dropped frames", and "reset distribution mismatch" teach different lessons.

A useful data-quality score can be written as a weighted checklist:

$$Q(e) = w_s S(e) + w_c C(e) + w_t T(e) + w_l L(e) - w_i I(e),$$

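where $S$ is task success, $C$ is calibration validity, $T$ is timing validity, $L$ is label completeness, and $I$ is intervention burden. The scale of each term (for example, 0 or 1 for the validity checks and a count for interventions) and the nonnegative weights $w_s, w_c, w_t, w_l, w_i$ should be stated in the dataset card rather than rediscovered from code.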

Failures Are Typed Evidence

A failed rollout can be more useful than a success if it localizes the missing skill. Label the mechanism of failure, not only the outcome.

Library Shortcut

Use dataset validators, Pydantic schemas, and LeRobot metadata checks to automate the boring parts of quality control. Human review should be reserved for semantic labels, ambiguous failures, and split decisions that require task knowledge.

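The following example scores episodes with unit weights on the four positive terms and a weight of 0.5 on the intervention count, then routes them into splits with a threshold of 3.5. Notice that the rule keeps stress data rather than deleting every imperfect row.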

# Route episodes by quality gates instead of using one vague keep flag.
# Stress and repair examples stay useful because their failure type is explicit.
episodes = [
    {"id": "a", "success": 1, "calibrated": 1, "synced": 1, "labels": 1, "interventions": 0},
    {"id": "b", "success": 0, "calibrated": 1, "synced": 1, "labels": 1, "interventions": 2},
    {"id": "c", "success": 1, "calibrated": 0, "synced": 0, "labels": 0, "interventions": 0},
]

for e in episodes:
    score = e["success"] + e["calibrated"] + e["synced"] + e["labels"] - 0.5 * e["interventions"]
    route = "train" if score >= 3.5 else "stress" if e["labels"] else "repair"
    print(e["id"], score, route)

a 4.0 train
b 2.0 stress
c 1.0 repair

Code Fragment 1: The routing rule separates clean training examples from labeled stress examples and unlabeled repair work. This is better than deleting every failure, because the stress split preserves evidence about where policies break.

The expected output sends episode a to training, episode b to stress evaluation, and episode c to repair. That routing is the mechanism behind useful curation: the pipeline does not ask whether an episode is good or bad in the abstract. It asks what scientific role the episode can play after calibration, synchronization, labels, and interventions are known.

Annotation Schema

Minimum Episode Labels

| Field | Examples | Use |
|---|---|---|
| Task outcome | success, partial, fail, abort | Evaluation and filtering. |
| Failure mechanism | slip, collision, missed grasp, timeout, perception error | Error analysis and recovery training. |
| Intervention | none, human correction, emergency stop | Safety and autonomy measurement. |
| Data health | synced, dropped frames, calibration stale | Quality routing. |
| Instruction | language command, goal image, task id | Language-conditioned policy training. |
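
The fields in this table can be captured as a typed record so that inconsistent labels are rejected when they are written rather than discovered during training. The following is a minimal sketch using only the Python standard library. The class names, the vocabularies, and the rule that a non-success outcome must name a mechanism are illustrative assumptions, not a LeRobot or Pydantic schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAIL = "fail"
    ABORT = "abort"


class Intervention(Enum):
    NONE = "none"
    HUMAN_CORRECTION = "human correction"
    EMERGENCY_STOP = "emergency stop"


# Hypothetical vocabularies copied from the table's example values.
FAILURE_MECHANISMS = {"slip", "collision", "missed grasp", "timeout", "perception error"}
DATA_HEALTH_FLAGS = {"synced", "dropped frames", "calibration stale"}


@dataclass(frozen=True)
class EpisodeLabels:
    episode_id: str
    outcome: Outcome
    intervention: Intervention
    data_health: frozenset
    instruction: str
    failure_mechanism: Optional[str] = None

    def __post_init__(self):
        # Assumed rule: any non-success outcome must name its mechanism.
        if self.outcome is not Outcome.SUCCESS and self.failure_mechanism is None:
            raise ValueError(f"{self.episode_id}: non-success outcome needs a failure mechanism")
        if self.failure_mechanism is not None and self.failure_mechanism not in FAILURE_MECHANISMS:
            raise ValueError(f"{self.episode_id}: unknown mechanism {self.failure_mechanism!r}")
        unknown = set(self.data_health) - DATA_HEALTH_FLAGS
        if unknown:
            raise ValueError(f"{self.episode_id}: unknown data-health flags {sorted(unknown)}")


ok = EpisodeLabels("ep-001", Outcome.FAIL, Intervention.NONE,
                   frozenset({"synced"}), "pick up the red block", "slip")
print(ok.outcome.value, ok.failure_mechanism)
```

Keeping outcome and mechanism as separate fields means a filter on outcome never silently discards the cause label that error analysis needs.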

Quality Gate Before Training

1. Validate timestamps and stream lengths.
2. Check calibration version against the collection session.
3. Run automated label sanity checks.
4. Manually inspect a stratified sample by task, operator, and outcome.
5. Freeze split assignment and save the manifest hash.
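
The automatable parts of this gate (timestamp and stream-length validation, the calibration check, and the manifest hash) can be sketched as below. The episode layout, field names, and the 0.1 s gap tolerance are assumptions made for illustration. Label sanity checks and manual inspection are left out because they depend on the task.

```python
import hashlib
import json


def timestamps_valid(stamps, max_gap_s):
    """True if timestamps strictly increase and no gap exceeds max_gap_s."""
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return all(0 < g <= max_gap_s for g in gaps)


def gate_episode(episode, session_calibration, max_gap_s=0.1):
    """Return a list of failed checks; an empty list means the episode passes."""
    problems = []
    lengths = {name: len(s) for name, s in episode["streams"].items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"stream length mismatch: {lengths}")
    if not timestamps_valid(episode["timestamps"], max_gap_s):
        problems.append("non-monotonic or gappy timestamps")
    if episode["calibration_version"] != session_calibration:
        problems.append("calibration version does not match session")
    return problems


def manifest_hash(split_assignment):
    """Hash a canonical JSON encoding so the frozen split can be verified later."""
    payload = json.dumps(split_assignment, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical episode record used only to exercise the checks.
episode = {
    "streams": {"camera": [0, 1, 2], "joints": [0, 1, 2]},
    "timestamps": [0.00, 0.05, 0.10],
    "calibration_version": "cal-v3",
}
print(gate_episode(episode, session_calibration="cal-v3"))  # passes: no failed checks
print(gate_episode(episode, session_calibration="cal-v4"))  # calibration mismatch
print(manifest_hash({"ep-001": "train", "ep-002": "val"})[:12])
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for rejection available for the data-health label.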

Mechanism: Separating Outcome From Cause

Outcome labels answer whether the task succeeded. Cause labels answer why it succeeded or failed. A robot can fail because the perception system localized the object incorrectly, because the gripper command saturated, because the operator intervened late, or because the reset placed the object outside the intended distribution. These cases should not be merged into one negative class because they imply different repairs.

For training, cause labels can support recovery data, filtering, or curriculum design. For evaluation, they allow per-failure reporting: a new policy may improve missed grasps while leaving occlusion failures unchanged. That is a more useful research result than a single aggregate score with no diagnosis.
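
Per-failure reporting needs little more than a count per mechanism for each policy. The sketch below uses made-up failure logs for a baseline and a candidate policy, and it assumes both were evaluated on the same number of rollouts so that raw counts are comparable.

```python
from collections import Counter

# Hypothetical evaluation logs: one failure-mechanism label per failed rollout.
baseline_failures = ["missed grasp", "missed grasp", "missed grasp",
                     "occlusion", "occlusion", "slip"]
candidate_failures = ["missed grasp", "occlusion", "occlusion", "slip"]


def failure_report(before, after):
    """Return {mechanism: (count_before, count_after)} over all observed mechanisms."""
    b, a = Counter(before), Counter(after)
    return {m: (b[m], a[m]) for m in sorted(set(b) | set(a))}


for mechanism, (n_before, n_after) in failure_report(baseline_failures, candidate_failures).items():
    print(f"{mechanism:<14} {n_before} -> {n_after}")
```

In this toy data, missed grasps drop from three to one while occlusion failures stay at two, which is exactly the pattern an aggregate success rate would hide.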

Pitfall: Label Leakage

Do not place near-duplicate episodes from the same collection burst into both train and validation. The model can appear to generalize while merely repeating a neighboring trajectory.

Practical Example

For a bin-picking dataset, the validation split should hold out object instances or clutter layouts, not merely every tenth video. Otherwise validation measures replay familiarity rather than deployment readiness.
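
One way to enforce this is to assign splits by group rather than by episode, so that every episode sharing an object instance, clutter layout, or collection burst lands on the same side. The following is a minimal sketch. The `object_id` grouping key, the episode records, and the hash-bucket rule are assumptions chosen for illustration.

```python
import hashlib


def split_for_group(group_id, val_fraction=0.2):
    """Deterministically assign a whole group to 'train' or 'val' from its ID hash."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer bucket in 0..9999
    return "val" if bucket < val_fraction * 10_000 else "train"


# Hypothetical episodes; 'object_id' is the grouping key chosen for this sketch.
episodes = [
    {"id": "ep-001", "object_id": "mug-03"},
    {"id": "ep-002", "object_id": "mug-03"},
    {"id": "ep-003", "object_id": "bolt-11"},
    {"id": "ep-004", "object_id": "bolt-11"},
]

# Every episode inherits the split of its group, so no object spans both splits.
assignment = {e["id"]: split_for_group(e["object_id"]) for e in episodes}
print(assignment)
```

Because the assignment depends only on the group ID, it stays stable when new episodes are added. With only a few groups, the realized validation share can differ noticeably from `val_fraction`, so the resulting split sizes should still be checked.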

Research Frontier

Large robot datasets increasingly need active data selection: choose the next demonstrations that reduce uncertainty, fill coverage gaps, or stress known failure modes. The research challenge is to make this selection reliable without overfitting to a narrow benchmark.
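
The simplest version of coverage-gap filling is purely count-based: enumerate the condition cells you care about and collect next in the emptiest ones. The sketch below illustrates only that baseline, with made-up axes and episode counts. It does not model uncertainty and is not a published selection method.

```python
from collections import Counter
from itertools import product

# Hypothetical coverage axes and already-collected (task, lighting) episodes.
tasks = ["pick", "pour"]
lighting = ["bright", "dim"]
collected = [("pick", "bright"), ("pick", "bright"), ("pick", "dim"), ("pour", "bright")]


def next_cells(collected, axes, k=2):
    """Return the k condition cells with the fewest recorded episodes."""
    counts = Counter(collected)
    cells = list(product(*axes))
    # Sort by count first, then by cell name so ties break deterministically.
    return sorted(cells, key=lambda c: (counts[c], c))[:k]


print(next_cells(collected, [tasks, lighting]))
```

Here the never-recorded `("pour", "dim")` cell is proposed first, followed by one of the cells that has a single episode.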

Self Check

Can each failure in your dataset be routed to perception, calibration, action representation, timing, contact dynamics, or task specification? If not, the labels are too coarse for serious improvement.

Key Takeaway

Data quality is not the absence of failures. It is the presence of typed, synchronized, split-aware evidence that tells the learner and the researcher what each episode means.

Exercise 23.5.1

Write five failure labels for a pouring task and specify which labels belong in train, stress validation, or repair queues.

What's Next

Section 23.6 shows how a standardized dataset format, especially LeRobotDataset, turns these quality decisions into reusable files and metadata.

References & Further Reading

Teleoperation Systems

Zhao, T. Z. et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.

Introduces ALOHA and ACT, making the connection between low-cost bimanual teleoperation, action chunking, and real-world manipulation data explicit.


Wu, P. et al. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators.

A kinematically matched leader device study that directly compares teleoperation ergonomics and reliability against other low-cost interfaces.


Chi, C. et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots.

Defines the handheld gripper approach, latency matching, and relative-trajectory action interface used in portable demonstration collection.


Cheng, X. et al. (2024). Open-TeleVision: Teleoperation with Immersive Active Visual Feedback.

A current reference for immersive visual feedback, active perception, and VR-style operator embodiment in data collection.


Tools

Hugging Face LeRobot Documentation.

Documents dataset conversion, policy training, and robot-control utilities that turn teleoperation logs into reusable learning artifacts.
