Section 24.1: The major datasets: Open X-Embodiment, DROID, BridgeData V2, RH20T, RoboMIND, AgiBot World | Building Embodied AI: From Perception to Autonomous Action

"A million trajectories sound large until you ask how many robot bodies they remember."
A Dataset Cartographer

Figure 24.1A: Major datasets differ not only in size, but in robot bodies, scenes, tasks, annotations, and split philosophy. (Illustration: a cartoon scene connecting major robot datasets to robot demonstrations, operator decisions, recorded trajectories, and later policy evaluation.)

Big Picture

Robot datasets are infrastructure for comparing policies, studying generalization, and training robot foundation models. Open X-Embodiment, DROID, BridgeData V2, RoboNet, RH20T, RoboMIND, and AgiBot-style releases should be read as different answers to the same question: what physical experience should a general robot policy inherit?

Dataset Families

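Open X-Embodiment pools many robots and tasks to study RT-X-style cross-robot transfer. DROID emphasizes in-the-wild collection across diverse scenes and data collectors. BridgeData V2 emphasizes large-scale manipulation trajectories with language and goal-conditioning compatibility. RH20T adds contact-rich manipulation recorded with force and other multimodal sensing. RoboMIND collects teleoperated trajectories across several embodiments, including dual-arm and humanoid platforms, under a unified collection protocol. AgiBot World scales real-world collection across a fleet of robots and deployment scenarios. RoboNet is historically important because it pushed multi-robot video prediction and self-supervised interaction data before the current foundation-model wave.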

How To Read A Robot Dataset

Dataset Question	Why It Matters	Example Evidence
Embodiments	Determines whether cross-robot transfer is being tested.	Robot type, arm count, gripper, base, sensors.
Scenes	Determines visual and physical diversity.	Homes, labs, kitchens, offices, institutions.
Tasks	Determines skill coverage.	Pick, place, open, wipe, pour, tool use.
Annotations	Determines what policies can condition on.	Language, goal images, success, failure, interventions.
Splits	Determines what generalization means.	Held-out tasks, held-out scenes, held-out objects, held-out robots.
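
The five questions in the table can be captured as a small record. The sketch below uses a hypothetical `DatasetCard` dataclass; the class name, field names, and example values are illustrative assumptions, not a released schema. It only shows how an unanswered question becomes visible once each table row is a field.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetCard:
    """Hypothetical card with one field per row of the table above."""
    name: str
    embodiments: list[str] = field(default_factory=list)
    scenes: list[str] = field(default_factory=list)
    tasks: list[str] = field(default_factory=list)
    annotations: list[str] = field(default_factory=list)
    splits: list[str] = field(default_factory=list)

    def unanswered(self) -> list[str]:
        """Return the table questions this card leaves empty."""
        return [f.name for f in fields(self)
                if f.name != "name" and not getattr(self, f.name)]


# Illustrative values only; a real card should copy them from the release.
card = DatasetCard(
    name="example-tabletop-set",
    embodiments=["single arm, parallel gripper"],
    scenes=["lab kitchen"],
    tasks=["pick", "place", "wipe"],
    annotations=["language"],
)
print(card.unanswered())
```

A card that reports an empty `splits` field has not yet said what generalization it can test, which is the gap the last table row is meant to expose.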

Dataset Size Is Multidimensional

A dataset with fewer trajectories can be more valuable for a specific research question if it covers the held-out factor that deployment changes. Size, diversity, annotation richness, and split design are separate axes.

Library Shortcut

Use project repositories and Hugging Face dataset loaders before writing custom download code. The maintained loaders preserve released split names, metadata conventions, and feature schemas that custom scripts often flatten away.

The snippet below encodes a small dataset card and computes a crude coverage score. It is not a substitute for careful evaluation, but it forces the reader to separate trajectory count from embodiment and task diversity.

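# Compare dataset cards by coverage factors, not only trajectory count.
# The score is a teaching proxy for reading dataset claims critically.
# Counts are headline figures and are not defined identically across releases.
# For Open X-Embodiment, 527 is the reported skill count and 21 is the number
# of contributing institutions, used here as a stand-in for scenes.
# Verify every number against the release before reusing it.
datasets = {
    "Open X-Embodiment": {"robots": 22, "tasks": 527, "scenes": 21},
    "DROID": {"robots": 1, "tasks": 86, "scenes": 564},
    "BridgeData V2": {"robots": 1, "tasks": 24, "scenes": 24},
}

for name, card in datasets.items():
    # Multiply the three factors; a single small factor shrinks the whole score.
    coverage_proxy = card["robots"] * card["tasks"] * card["scenes"]
    print(name, coverage_proxy)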

Open X-Embodiment 243474
DROID 48504
BridgeData V2 576

Code Fragment 1: The coverage_proxy is intentionally simple, but it changes the reading habit. A dataset card should expose the factors behind "large" so a researcher can ask whether scale comes from robots, tasks, scenes, or repeated trajectories.

The expected output ranks Open X-Embodiment highest because the proxy multiplies robots, tasks, and scenes. That does not mean it is best for every research question. It means the dataset card exposes more cross-factor coverage under this particular proxy, while DROID's high scene count may be more relevant for in-the-wild visual generalization and BridgeData V2's task framing may be more relevant for language-conditioned tabletop studies.
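
One way to see what the product hides is to rank the same three cards one factor at a time. The sketch below reuses the numbers from Code Fragment 1 and is only a reading aid; the helper name `leader_per_axis` is an assumption made for illustration.

```python
datasets = {
    "Open X-Embodiment": {"robots": 22, "tasks": 527, "scenes": 21},
    "DROID": {"robots": 1, "tasks": 86, "scenes": 564},
    "BridgeData V2": {"robots": 1, "tasks": 24, "scenes": 24},
}


def leader_per_axis(cards):
    """Return, for each factor, the dataset with the largest count."""
    axes = next(iter(cards.values())).keys()
    return {axis: max(cards, key=lambda name: cards[name][axis]) for axis in axes}


for axis, name in leader_per_axis(datasets).items():
    print(f"{axis}: {name}")
```

Under these numbers the scene axis is led by DROID even though its product is smaller, which is the kind of per-axis reading the single score cannot show.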

Mechanism: What The Dataset Teaches

Each major dataset teaches a policy through a different pressure. Open X-Embodiment pressures the model to find representations that survive changes in robot body and task source. DROID pressures the model to survive scene, collector, and household variation. BridgeData V2 pressures the model to connect manipulation behavior with language or goal-conditioned task descriptions. RoboNet-style video-interaction datasets pressure predictive models to learn how robot actions change visual futures.

This distinction matters when choosing pretraining data. A model trained on many robots may learn robust visual affordances but still need an adapter for a new action space. A model trained in many homes may learn visual robustness but remain tied to one hardware platform. A model trained with rich language may follow instructions better while still failing on contact dynamics that the dataset under-sampled.

Failure Analysis By Dataset Source

When a policy trained on a major dataset fails, ask which distribution did not transfer: visual scene, object category, task instruction, action representation, robot morphology, or operator style. This source-aware diagnosis is stronger than saying "the dataset was too small" because it points to the next data collection or adaptation step.
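
A source-aware diagnosis can be as simple as grouping evaluation rollouts by the factor that was shifted. The sketch below assumes a hypothetical evaluation log in which each rollout carries a `shift` tag; the tag names and the toy outcomes are invented for illustration and are not results from any dataset.

```python
from collections import defaultdict

# Hypothetical evaluation log: each rollout is tagged with the one factor
# that differs from the training distribution ("none" = in-distribution).
rollouts = [
    {"shift": "none", "success": True},
    {"shift": "none", "success": True},
    {"shift": "visual_scene", "success": True},
    {"shift": "visual_scene", "success": False},
    {"shift": "robot_morphology", "success": False},
    {"shift": "robot_morphology", "success": False},
    {"shift": "task_instruction", "success": True},
]


def success_by_shift(log):
    """Group rollouts by shifted factor and return the success rate per group."""
    totals = defaultdict(lambda: [0, 0])  # shift -> [successes, trials]
    for rollout in log:
        totals[rollout["shift"]][0] += int(rollout["success"])
        totals[rollout["shift"]][1] += 1
    return {shift: wins / trials for shift, (wins, trials) in totals.items()}


rates = success_by_shift(rollouts)
# Print the weakest factor first so it is the first thing the reader sees.
for shift, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{shift:18s} {rate:.2f}")
```

In this toy log the lowest rate belongs to `robot_morphology`, which would point toward action-space adaptation or more multi-robot data rather than simply more trajectories.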

Toolchain And Split Advice

A practical researcher should load the official release first, preserve its native split names, then create a project-specific split manifest only after deciding what generalization claim is being tested. For Open X-Embodiment, that may be held-out robot or held-out task. For DROID, it may be held-out scene, collector, or object. For BridgeData V2, it may be held-out instruction, environment, or goal image. A split that ignores the dataset's main source of diversity wastes the reason to use that dataset.
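
The advice above can be sketched as a small manifest builder. The episode records, the field names (`episode_id`, `official_split`, `scene`), and the `make_manifest` helper are assumptions for illustration; real releases expose their own metadata fields, which should be substituted in.

```python
import json

# Hypothetical episode metadata; field names are assumptions, not a released schema.
episodes = [
    {"episode_id": "ep0", "official_split": "train", "scene": "kitchen_a"},
    {"episode_id": "ep1", "official_split": "train", "scene": "kitchen_b"},
    {"episode_id": "ep2", "official_split": "train", "scene": "office_a"},
    {"episode_id": "ep3", "official_split": "val", "scene": "kitchen_a"},
]


def make_manifest(episodes, factor, held_out_values):
    """Build a project split manifest while keeping the official split name."""
    manifest = {
        "held_out_factor": factor,
        "held_out_values": sorted(held_out_values),
        "episodes": [],
    }
    for ep in episodes:
        if ep[factor] in held_out_values:
            project_split = f"test_held_out_{factor}"
        else:
            project_split = ep["official_split"]  # fall back to the native split
        manifest["episodes"].append({
            "episode_id": ep["episode_id"],
            "official_split": ep["official_split"],  # preserved, never overwritten
            "project_split": project_split,
        })
    return manifest


manifest = make_manifest(episodes, factor="scene", held_out_values={"office_a"})
print(json.dumps(manifest, indent=2))
```

Recording the held-out factor inside the manifest keeps the generalization claim attached to the split, so a later reader does not have to infer it from file names.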

Algorithm: Choosing A Dataset

1. Name the deployment gap: robot, scene, object, task, language, or contact regime.
2. Choose the dataset whose diversity axis matches that gap.
3. Preserve official metadata and split fields during loading.
4. Create one held-out split that changes the deployment gap explicitly.
5. Report per-source results before reporting aggregate success.
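
The first two steps can be sketched as a lookup from deployment gap to candidate datasets. The axis labels below paraphrase the split advice earlier in this section and are assumptions chosen for the example, not an official taxonomy; `candidates_for_gap` is a hypothetical helper.

```python
# Illustrative mapping from each dataset to the diversity axes this section
# attributes to it; the labels are assumptions, not official metadata.
DIVERSITY_AXES = {
    "Open X-Embodiment": {"robot", "task"},
    "DROID": {"scene", "collector", "object"},
    "BridgeData V2": {"language", "environment", "goal_image"},
}


def candidates_for_gap(gap, axes=DIVERSITY_AXES):
    """Return datasets whose main diversity axes include the deployment gap."""
    return sorted(name for name, covered in axes.items() if gap in covered)


for gap in ["robot", "scene", "language", "contact"]:
    matches = candidates_for_gap(gap)
    print(gap, "->", matches or "no match: look for or collect other data")
```

An empty result, as for `contact` here, is itself useful: it says that none of the listed sources was chosen for that axis, so the gap needs different data rather than more of the same.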

Pitfall: Comparing Dataset Numbers Across Papers

Trajectory counts, task counts, scene counts, and language labels are often defined differently. Treat them as dataset descriptors, not as directly comparable leaderboard numbers unless one audit script normalizes them.
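
A minimal version of such an audit script applies one counting rule to every release. The two toy releases and the normalization rule below (lowercase, collapse whitespace) are assumptions for illustration; a real audit would need to decide, and document, what counts as the same task or scene.

```python
# Hypothetical episode records from two releases with different conventions.
release_a = [
    {"task": "Pick up the cup", "scene": "Kitchen-1"},
    {"task": "pick up the cup ", "scene": "kitchen-1"},
    {"task": "open drawer", "scene": "Kitchen-2"},
]
release_b = [
    {"task": "open drawer", "scene": "lab"},
    {"task": "wipe table", "scene": "lab"},
]


def normalize(label):
    """Lowercase a label and collapse whitespace so near-duplicates merge."""
    return " ".join(label.lower().split())


def audit(episodes):
    """Count trajectories, distinct tasks, and distinct scenes with one rule."""
    return {
        "trajectories": len(episodes),
        "tasks": len({normalize(e["task"]) for e in episodes}),
        "scenes": len({normalize(e["scene"]) for e in episodes}),
    }


for name, release in [("release_a", release_a), ("release_b", release_b)]:
    print(name, audit(release))
```

Counting raw strings would report three tasks for `release_a`; the shared rule reports two, which is the kind of discrepancy that makes cross-paper numbers hard to compare.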

Practical Example

A researcher studying kitchen generalization may prefer DROID-style scene diversity. A researcher studying robot-body transfer may prefer Open X-Embodiment. A researcher studying language-conditioned tabletop manipulation may begin with BridgeData V2.

Research Frontier

RT-X style work asks whether robotics can benefit from the same broad pretraining pattern that transformed language and vision. The unsolved part is that robot embodiments change the action space itself, so pooling data requires more than tokenizing observations.
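
A toy example makes the action-space problem concrete. The sketch below pads actions from two hypothetical robots to a shared length with a validity mask; this is one simple convention assumed here for illustration, not the method used by RT-X or any particular release, and the action layouts are invented.

```python
MAX_ACTION_DIM = 8  # assumed upper bound for this toy example


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Pad an action to max_dim and return (padded_action, validity_mask)."""
    if len(action) > max_dim:
        raise ValueError("action is longer than max_dim")
    pad = max_dim - len(action)
    return list(action) + [0.0] * pad, [1] * len(action) + [0] * pad


# Invented layouts: an arm with six pose deltas plus a gripper command,
# and a mobile base with a three-value planar velocity command.
arm_action = [0.1, -0.2, 0.0, 0.3, 0.0, 0.1, 1.0]
base_action = [0.5, 0.0, 0.1]

for label, action in [("arm", arm_action), ("base", base_action)]:
    padded, mask = pad_action(action)
    print(label, padded, mask)
```

After padding, both robots produce vectors of the same shape, yet index 0 is an end-effector delta for one and a base velocity for the other. Matching shapes does not give matching meanings, which is why pooled training needs more than a shared tokenization.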

Self Check

For each major dataset, can you name the held-out factor its evaluation most strongly tests? If not, you know the dataset name but not the scientific claim.

Key Takeaway

Major robot datasets are not interchangeable warehouses. Each encodes a choice about embodiment, scenes, tasks, annotations, and what kind of generalization deserves evidence.

Exercise 24.1.1

Choose two robot datasets and write a two-row dataset card comparing robot bodies, scenes, tasks, labels, license, and split design.

What's Next

Section 24.2 turns those comparison questions into a concrete schema for dataset structure, metadata, licensing, and dataset cards.

References & Further Reading

Robot Datasets

Open X-Embodiment Collaboration. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.

The central reference for cross-embodiment robot data, standardized dataset release, and RT-X style transfer across robot bodies.

Dataset

Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.

Provides an in-the-wild manipulation dataset with diverse scenes, collectors, tasks, and detailed hardware reproduction guidance.

Dataset

Walke, H. R. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale.

A large manipulation dataset designed around open-vocabulary multi-task learning, goal images, language, and data-scale experiments.

Dataset

Google DeepMind Open X-Embodiment Repository.

Shows the released dataset structure and RLDS episode organization used by the Open X-Embodiment ecosystem.

Repository

Tools

LeRobotDataset v3.0 Documentation.

The practical reference for standardized multimodal robot time-series data, metadata, indexing, and Hub visualization.

Tool