Section 24.5: Curating and mixing data

"More data helped until the easy task brought all its friends."

A Mixture Weight Optimizer
Figure 24.5A: Curating and mixing data is a balancing act between coverage, bias, licensing, and the evaluation claim.
Big Picture

Data mixing decides which experiences a policy sees often, rarely, or not at all. A mixture can improve generalization by broadening coverage, or it can damage learning by drowning rare but important tasks in easy high-volume data.

Mixture Weights

Let dataset sources be $D_1,\ldots,D_K$ with sampling weights $p_1,\ldots,p_K$. Training samples source $k$ with probability $p_k$, then samples an episode from that source. Uniform-over-episodes favors large datasets; uniform-over-sources favors small datasets; temperature sampling interpolates between them:

$$p_k = \frac{n_k^{\tau}}{\sum_j n_j^{\tau}},$$

where $n_k$ is the size of source $k$ and $\tau$ controls how strongly size influences sampling. $\tau = 1$ is proportional to size, while $\tau = 0$ is uniform over sources.

The Mixture Is A Claim

A data mixture says what the model should care about. If the mixture is undocumented, the experiment hides one of its most important design choices.

Library Shortcut

Use a dataset mixer in PyTorch, LeRobot, or WebDataset-style pipelines once the source weights are written down. The tool can sample efficiently across shards, but it should read weights from a manifest that reviewers can inspect.
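One way to keep the sampler tied to a reviewable manifest is sketched below using only the Python standard library. The manifest layout (`sampling_rule`, `sources`, `weight`, `episodes`) and the episode IDs are hypothetical and do not correspond to a LeRobot or WebDataset format; a real pipeline would pass the same weights to its own mixer.

```python
import json
import random

# Hypothetical manifest; in practice this would be a file saved with the run config.
MANIFEST_JSON = """
{
  "sampling_rule": "temperature, tau=0.5",
  "sources": {
    "OpenX":        {"weight": 0.658, "episodes": ["ox_0", "ox_1", "ox_2"]},
    "DROID":        {"weight": 0.181, "episodes": ["dr_0", "dr_1"]},
    "BridgeDataV2": {"weight": 0.161, "episodes": ["br_0", "br_1"]}
  }
}
"""

def load_manifest(text):
    """Parse the manifest and check that the source weights form a distribution."""
    manifest = json.loads(text)
    total = sum(src["weight"] for src in manifest["sources"].values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"source weights sum to {total}, expected 1.0")
    return manifest

def sample_episode(manifest, rng):
    """Two-stage sampling: pick a source by weight, then an episode uniformly within it."""
    names = list(manifest["sources"])
    weights = [manifest["sources"][name]["weight"] for name in names]
    source = rng.choices(names, weights=weights, k=1)[0]
    episode = rng.choice(manifest["sources"][source]["episodes"])
    return source, episode

manifest = load_manifest(MANIFEST_JSON)
rng = random.Random(0)  # fixed seed so the sampling stream is reproducible
draws = [sample_episode(manifest, rng) for _ in range(10_000)]
openx_share = sum(1 for source, _ in draws if source == "OpenX") / len(draws)
print(f"empirical OpenX share: {openx_share:.3f}")
```

The sampler never hard-codes a weight: changing the mixture means changing the manifest, which is the artifact reviewers can inspect.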

Code Fragment 1 computes temperature-sampled mixture weights. This is the smallest practical tool for making a mixing policy explicit.

# Compute source-sampling weights for a mixed robot dataset.
# Temperature tau controls whether large sources dominate the training stream.
sources = {"OpenX": 1_000_000, "DROID": 76_000, "BridgeDataV2": 60_096}
tau = 0.5
raw = {name: count ** tau for name, count in sources.items()}
total = sum(raw.values())
weights = {name: round(value / total, 3) for name, value in raw.items()}
print(weights)
{'OpenX': 0.658, 'DROID': 0.181, 'BridgeDataV2': 0.161}
Code Fragment 1: Temperature sampling keeps OpenX largest without letting raw trajectory count completely dominate the training stream. Changing tau is an experimental choice that should be saved with the run config.

The expected output shows a middle ground: OpenX remains the largest source, but DROID and BridgeData V2 receive enough probability to affect training. If $\tau$ were 1, sampling would be proportional to raw size and OpenX alone would receive about 88% of the draws; if $\tau$ were 0, each of the three sources would receive one third regardless of size. No value of $\tau$ is correct in general: the choice depends on the deployment claim and should be fixed before looking at final evaluation scores.
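To see how the weights move between the two endpoints, the same formula can be swept over several temperatures. This is a minimal sketch; the function name `temperature_weights` and the particular grid of $\tau$ values are illustrative choices, not part of any library.

```python
def temperature_weights(sizes, tau):
    """Return p_k = n_k**tau / sum_j n_j**tau for a dict of source sizes."""
    raw = {name: count ** tau for name, count in sizes.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

sizes = {"OpenX": 1_000_000, "DROID": 76_000, "BridgeDataV2": 60_096}

# Sweep from uniform-over-sources (tau=0) to proportional-to-size (tau=1).
for tau in (0.0, 0.25, 0.5, 0.75, 1.0):
    weights = temperature_weights(sizes, tau)
    row = ", ".join(f"{name}={p:.3f}" for name, p in weights.items())
    print(f"tau={tau:.2f}: {row}")
```

Printing the whole sweep next to the chosen value makes it easier to justify $\tau$ in the run config, because the alternatives that were rejected are visible.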

Bias And Coverage Audits

Mixture Audit Questions

| Audit | Question | Repair |
| --- | --- | --- |
| Task balance | Are some skills overrepresented because they are easy to collect? | Task-aware sampling or capped repeats. |
| Embodiment balance | Does one robot body dominate the action statistics? | Embodiment-aware batches and per-robot metrics. |
| Scene balance | Are labs overrepresented relative to homes or offices? | Held-out scene splits and source weights. |
| License compatibility | Can all sources be mixed and redistributed together? | Separate training recipes or exclude incompatible sources. |

Protocol: Build A Mixing Manifest
  1. List every source with license, robot, task families, scenes, and trajectory count.
  2. Choose a sampling rule and save the weights.
  3. Run per-source and aggregate validation.
  4. Inspect failure cases by source, task, and embodiment.
  5. Report wins only when the same evaluation artifact supports every compared method.
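
The first two steps of the protocol can be sketched as a small data structure plus a validation pass. Everything named here (`SourceEntry`, `build_manifest`, `validate`, the source names, and the placeholder licenses and counts) is hypothetical and stands in for whatever schema a team actually adopts.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SourceEntry:
    name: str
    license: str
    robot: str
    task_families: list
    scenes: list
    trajectory_count: int

def build_manifest(entries, tau):
    """Attach temperature-sampled weights to the listed sources."""
    raw = {e.name: e.trajectory_count ** tau for e in entries}
    total = sum(raw.values())
    return {
        "sampling_rule": "temperature",
        "tau": tau,
        "weights": {name: value / total for name, value in raw.items()},
        "sources": [asdict(e) for e in entries],
    }

def validate(manifest):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []
    total = sum(manifest["weights"].values())
    if abs(total - 1.0) > 1e-6:
        problems.append(f"weights sum to {total:.6f}, expected 1.0")
    for src in manifest["sources"]:
        if not src["license"]:
            problems.append(f"{src['name']}: missing license")
        if src["trajectory_count"] <= 0:
            problems.append(f"{src['name']}: trajectory_count must be positive")
    return problems

# Placeholder values for illustration only; real entries come from each source's release notes.
entries = [
    SourceEntry("source_a", "LICENSE-TBD", "robot_x", ["pick_place"], ["lab"], 50_000),
    SourceEntry("source_b", "LICENSE-TBD", "robot_y", ["tool_use"], ["kitchen"], 5_000),
]
manifest = build_manifest(entries, tau=0.5)
problems = validate(manifest)
print(json.dumps(manifest, indent=2))
print("problems:", problems)
```

The validation here only checks that the manifest is well formed. Whether the licenses are actually compatible is a human review step that the manifest makes possible but does not automate.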

Mechanism: Mixtures Change Gradient Pressure

During training, a source with higher sampling probability contributes more gradient updates. That means the mixture determines which visual backgrounds, robot bodies, task verbs, and action ranges the model practices most often. If one large source contains mostly easy pick-and-place episodes, the model can become excellent at those motions while under-practicing rare tool-use or recovery behaviors.
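
A rough way to quantify this pressure is to compute how often each individual episode is expected to be drawn. The sketch below assumes two-stage sampling (source by weight, then a uniform episode within the source); the total of 2,000,000 draws is a hypothetical run length chosen only for illustration.

```python
def expected_repeats(sizes, weights, total_samples):
    """Expected number of times each individual episode is drawn when source k is
    chosen with probability weights[k] and episodes are uniform within the source."""
    return {name: weights[name] * total_samples / sizes[name] for name in sizes}

sizes = {"OpenX": 1_000_000, "DROID": 76_000, "BridgeDataV2": 60_096}
tau = 0.5
raw = {name: count ** tau for name, count in sizes.items()}
raw_total = sum(raw.values())
weights = {name: value / raw_total for name, value in raw.items()}

total_samples = 2_000_000  # hypothetical number of episode draws in one training run
repeats = expected_repeats(sizes, weights, total_samples)
for name in sizes:
    print(f"{name}: p={weights[name]:.3f}, expected repeats per episode={repeats[name]:.2f}")
```

Under temperature sampling the per-episode repeat ratio between two sources is $(n_a / n_b)^{1-\tau}$, so at $\tau = 0.5$ a BridgeData V2 episode is drawn roughly four times as often as an OpenX episode. Upweighting a small source therefore also means repeating its episodes, which is where a cap on repeats becomes relevant.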

A good mixing manifest therefore records both source weights and batch composition rules. Some teams use per-source batches so every update sees multiple embodiments. Others use task-balanced sampling so rare skills are not drowned out. Either choice is defensible when the manifest makes it reproducible and the evaluation reports per-source outcomes.
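
Task-balanced sampling can be sketched with inverse-frequency episode weights. The episode index and task names below are hypothetical, and equal mass per task is only one possible balancing rule; a team might prefer a softer rule that still leaves common tasks somewhat more frequent.

```python
from collections import Counter
import random

# Hypothetical episode index: (episode_id, task) pairs, heavily skewed toward pick_place.
episodes = (
    [(f"pp_{i}", "pick_place") for i in range(900)]
    + [(f"tu_{i}", "tool_use") for i in range(80)]
    + [(f"rc_{i}", "recovery") for i in range(20)]
)

def task_balanced_weights(episodes):
    """Weight each episode by 1 / (num_tasks * count_of_its_task) so that every
    task receives the same total probability mass."""
    counts = Counter(task for _, task in episodes)
    num_tasks = len(counts)
    return [1.0 / (num_tasks * counts[task]) for _, task in episodes]

weights = task_balanced_weights(episodes)
rng = random.Random(0)
batch = rng.choices(episodes, weights=weights, k=3000)
task_counts = Counter(task for _, task in batch)
print(task_counts)
```

With this rule each of the 20 recovery episodes is drawn 45 times as often as a pick-and-place episode (900 / 20), so the manifest should record the rule alongside any limit on repeats.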

Failure Analysis For Data Mixtures

When a mixed-data policy fails, the first question is whether the failure came from lack of coverage, negative transfer, or source conflict. Lack of coverage means the deployment condition barely appears in any source. Negative transfer means another source teaches a behavior that is actively wrong for the target robot or task. Source conflict means two datasets use similar observations or instructions but incompatible action semantics, success definitions, or reset distributions.

The repair depends on the diagnosis. Lack of coverage calls for new data or higher sampling weight on the relevant source. Negative transfer calls for source conditioning, adapter layers, or per-source filtering. Source conflict calls for schema repair and split redesign before another training run. This is why a mixture manifest should include not only weights, but also the intended role of each source: pretraining diversity, target-domain supervision, stress evaluation, or recovery examples.
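
Before choosing a repair, it helps to localize where failures concentrate. The sketch below groups evaluation rollouts by source, task, and embodiment; the record fields and the example rollouts are hypothetical. It does not decide between lack of coverage, negative transfer, and source conflict, it only shows which groups to inspect first.

```python
from collections import defaultdict

def make_rollouts(source, task, embodiment, outcomes):
    """Build one hypothetical evaluation record per rollout outcome."""
    return [
        {"source": source, "task": task, "embodiment": embodiment, "success": ok}
        for ok in outcomes
    ]

rollouts = (
    make_rollouts("internal", "open_drawer", "arm_a", [True, True, True])
    + make_rollouts("public", "open_drawer", "arm_b", [True, True, True])
    + make_rollouts("public", "wipe", "arm_b", [False, False])
)

def success_table(rollouts, keys):
    """Group rollouts by the given keys and return {group: (successes, trials, rate)}."""
    tallies = defaultdict(lambda: [0, 0])
    for r in rollouts:
        group = tuple(r[k] for k in keys)
        tallies[group][0] += int(r["success"])
        tallies[group][1] += 1
    return {g: (s, n, s / n) for g, (s, n) in tallies.items()}

aggregate = success_table(rollouts, keys=())[()]
by_group = success_table(rollouts, keys=("source", "task", "embodiment"))
print("aggregate:", aggregate)
for group, stats in sorted(by_group.items(), key=lambda item: item[1][2]):
    print(group, stats)
```

In this toy log the aggregate rate looks acceptable while one group fails every rollout, which is the pattern that per-source reporting is meant to expose.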

Mixture Debugging Signal

Train a small source classifier on policy embeddings or sampled batches. If the classifier can identify source from irrelevant visual artifacts such as background, camera border, or compression pattern, the model may learn dataset identity instead of task-relevant structure. That signal does not automatically invalidate the mixture, but it tells the researcher where to inspect bias before claiming generalization.
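
A minimal version of this check is sketched below. It swaps in a nearest-centroid classifier for the "small source classifier" and uses synthetic Gaussian vectors as a stand-in for policy embeddings, so the source names, dimensions, and the per-source offset are all assumptions made for illustration.

```python
import random

def make_embeddings(rng, n, dim, offset):
    """Synthetic stand-in for policy embeddings: unit Gaussian noise plus a per-source offset."""
    return [[rng.gauss(offset, 1.0) for _ in range(dim)] for _ in range(n)]

def centroid(rows):
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(train):
    """train maps source name -> list of embeddings; returns source name -> centroid."""
    return {source: centroid(rows) for source, rows in train.items()}

def predict(centroids, x):
    return min(centroids, key=lambda source: sq_dist(centroids[source], x))

def accuracy(centroids, test):
    pairs = [(source, x) for source, rows in test.items() for x in rows]
    return sum(predict(centroids, x) == source for source, x in pairs) / len(pairs)

rng = random.Random(0)
dim = 4
# The offset of 2.0 simulates a source-specific artifact leaking into the embedding.
train = {
    "source_a": make_embeddings(rng, 200, dim, 0.0),
    "source_b": make_embeddings(rng, 200, dim, 2.0),
}
test = {
    "source_a": make_embeddings(rng, 100, dim, 0.0),
    "source_b": make_embeddings(rng, 100, dim, 2.0),
}
centroids = fit(train)
acc = accuracy(centroids, test)
chance = 1 / len(train)
print(f"held-out source accuracy: {acc:.2f} (chance {chance:.2f})")
```

Accuracy far above chance shows only that the sources are separable in embedding space. Deciding whether the separating features are irrelevant artifacts or legitimate task differences still requires looking at the examples.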

Pitfall: More Data Can Amplify Shortcuts

If a source contains a visual shortcut, such as a unique table color for one task, oversampling it may teach the policy the shortcut more confidently. Curating means auditing correlations, not merely maximizing rows.
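
One simple correlation audit is to ask, for each value of a nuisance attribute, how concentrated the task labels are. The metadata fields, the toy counts, and the 0.95 purity threshold below are hypothetical; a real audit would draw attributes from the dataset's own annotations.

```python
from collections import Counter, defaultdict

# Hypothetical per-episode metadata: green tables appear only with the wipe task.
episodes = (
    [{"task": "wipe", "table_color": "green"}] * 40
    + [{"task": "pick_place", "table_color": "white"}] * 30
    + [{"task": "stack", "table_color": "white"}] * 20
    + [{"task": "wipe", "table_color": "white"}] * 10
)

def attribute_purity(episodes, attribute, label="task"):
    """For each attribute value, return (dominant label, share of that label, episode count)."""
    by_value = defaultdict(Counter)
    for ep in episodes:
        by_value[ep[attribute]][ep[label]] += 1
    report = {}
    for value, counts in by_value.items():
        top_label, top_count = counts.most_common(1)[0]
        total = sum(counts.values())
        report[value] = (top_label, top_count / total, total)
    return report

report = attribute_purity(episodes, "table_color")
# Flag attribute values that almost perfectly predict one task.
flagged = {value: stats for value, stats in report.items() if stats[1] >= 0.95}
print(report)
print("possible shortcuts:", flagged)
```

A flagged value is a candidate shortcut, not a verdict: the repair might be collecting the task on other tables, downweighting the source, or holding out that attribute value for evaluation.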

Practical Example

A kitchen robot team might mix DROID-style in-the-wild data with a smaller internal dataset. The internal data can get a higher sampling weight if it matches the deployment robot, while DROID contributes visual and scene diversity.

Research Frontier

Open problems include automatic mixture optimization, dataset deduplication across public releases, source-aware policy evaluation, and methods that learn when to trust or downweight a source. These questions are now central because robot foundation models increasingly depend on heterogeneous public and private data.

Self Check

Can you reproduce the exact source weights that trained a policy checkpoint? If not, the checkpoint's behavior cannot be traced back to its data diet.
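
One lightweight way to make the answer "yes" is to store a fingerprint of the mixing manifest with every checkpoint. The sketch below hashes a canonical JSON encoding with SHA-256; the manifest contents and the `mixture_manifest_sha256` metadata key are hypothetical.

```python
import hashlib
import json

def manifest_fingerprint(manifest):
    """SHA-256 of a canonical JSON encoding, so equal manifests give equal fingerprints
    regardless of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

manifest = {
    "sampling_rule": "temperature",
    "tau": 0.5,
    "weights": {"OpenX": 0.658, "DROID": 0.181, "BridgeDataV2": 0.161},
}
fingerprint = manifest_fingerprint(manifest)

# Hypothetical checkpoint metadata saved next to the model weights.
checkpoint_metadata = {"step": 10_000, "mixture_manifest_sha256": fingerprint}
print(checkpoint_metadata)
```

The fingerprint identifies the manifest but does not replace it, so the manifest file itself still needs to be archived with the run.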

Key Takeaway

Data curation is policy design through the training distribution. The best mixture is the one whose weights, licenses, coverage, and per-source outcomes match the deployment claim.

Exercise 24.5.1

Create a three-source mixing manifest and choose a value of $\tau$. Explain which source you are protecting from being drowned out and why.

What's Next

Chapter 25 uses these curated datasets for offline reinforcement learning and dataset-based robot learning, where logged actions become the training world.

References & Further Reading
Robot Datasets

Open X-Embodiment Collaboration. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.

The central reference for cross-embodiment robot data, standardized dataset release, and RT-X style transfer across robot bodies.

Dataset

Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.

Provides an in-the-wild manipulation dataset with diverse scenes, collectors, tasks, and detailed hardware reproduction guidance.

Dataset

Walke, H. R. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale.

A large manipulation dataset designed around open-vocabulary multi-task learning, goal images, language, and data-scale experiments.

Dataset

Google DeepMind Open X-Embodiment Repository.

Shows the released dataset structure and RLDS episode organization used by the Open X-Embodiment ecosystem.

Repository
Tools

LeRobotDataset v3.0 Documentation.

The practical reference for standardized multimodal robot time-series data, metadata, indexing, and Hub visualization.

Tool