Section 24.2: Dataset structure, embodiment metadata, and licensing | Building Embodied AI: From Perception to Autonomous Action

"The policy read the video. The researcher read the license. Only one of them was ready to ship."
A Dataset Card Lawyer

[Figure: cartoon scene connecting dataset structure and metadata to robot demonstrations, operator decisions, recorded trajectories, and later policy evaluation.]
**Figure 24.2A**: A robot dataset card binds technical fields, embodiment metadata, splits, and licensing into one reproducible artifact.

Big Picture

A robot dataset schema is the contract between data collectors, model trainers, evaluators, and downstream users. It must describe the time series, the robot body, the task semantics, the splits, and the legal permission to use the data.

Minimum Schema

At frame $t$ in episode $i$, a dataset row should identify $i$, $t$, observation features $o_{i,t}$, action features $a_{i,t}$, and metadata $m_i$. The metadata is not optional bookkeeping; it defines whether two rows are comparable.
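The row layout can be sketched with two small dataclasses. This is a minimal illustration, not a library format: the class names `EpisodeMetadata` and `FrameRow`, their fields, and the helper `comparable` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class EpisodeMetadata:
    """Hypothetical m_i: facts shared by every frame of episode i."""
    robot: str
    action_units: str
    control_rate_hz: float


@dataclass(frozen=True)
class FrameRow:
    """Hypothetical row holding (i, t, o_{i,t}, a_{i,t}, m_i)."""
    episode_index: int
    frame_index: int
    observation: Mapping[str, Sequence[float]]
    action: Sequence[float]
    metadata: EpisodeMetadata


def comparable(a: FrameRow, b: FrameRow) -> bool:
    """Treat two rows as comparable only if robot and action units match."""
    return (
        a.metadata.robot == b.metadata.robot
        and a.metadata.action_units == b.metadata.action_units
    )


delta_meta = EpisodeMetadata("franka", "end_effector_delta_m_rad", 15.0)
joint_meta = EpisodeMetadata("franka", "joint_velocity_rad_s", 15.0)
row_a = FrameRow(0, 0, {"joint_pos": [0.0] * 7}, [0.0] * 7, delta_meta)
row_b = FrameRow(1, 0, {"joint_pos": [0.1] * 7}, [0.0] * 7, joint_meta)

print(comparable(row_a, row_a), comparable(row_a, row_b))  # True False
```

Both rows carry a seven-dimensional action on the same robot, yet only the metadata reveals that one is an end-effector delta and the other a joint velocity.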

Dataset Card Fields

Category	Required Fields	Failure If Missing
Embodiment	robot model, joints, gripper, base, control mode	Action labels cannot be interpreted.
Sensors	camera names, resolution, fps, extrinsics, proprioception	Observations cannot be aligned or reproduced.
Actions	units, frame, rate, saturation, command versus executed action	Policy output space is ambiguous.
Task	language, goal, reset, success, failure labels	Evaluation semantics drift.
Governance	license, consent constraints, redistribution rules	The dataset cannot be safely reused.

Metadata Is Part Of The Data

If the robot type, action units, camera calibration, or split policy is stored outside the dataset, it will eventually be separated from the results that depend on it.

Library Shortcut

LeRobotDataset v3 already provides conventions for multimodal time-series data, metadata, indexing, and Hub visualization. Use it after writing the dataset card, so the standard format implements a scientific contract rather than replacing one.

Code Fragment 1 validates a dataset card before training. The example uses plain Python so the logic is visible; in production, a Pydantic schema or LeRobot metadata validator would enforce the same contract.
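A minimal plain-Python sketch of such a check follows. The categories mirror the Dataset Card Fields table, but the specific keys (`robot_model`, `camera_fps`, `rate_hz`, and so on) and the function name `validate_card` are illustrative assumptions, not LeRobot's schema.

```python
# Hypothetical required keys per category, loosely following the table above.
REQUIRED_FIELDS = {
    "embodiment": ["robot_model", "control_mode"],
    "sensors": ["camera_names", "camera_fps"],
    "actions": ["units", "rate_hz"],
    "task": ["success_label"],
    "governance": ["license"],
}


def validate_card(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the card passes."""
    problems = []
    for category, fields in REQUIRED_FIELDS.items():
        section = card.get(category)
        if not isinstance(section, dict):
            problems.append(f"missing category: {category}")
            continue
        for field in fields:
            value = section.get(field)
            # Treat absent, empty-string, and empty-list values as missing.
            if value is None or value == "" or value == []:
                problems.append(f"missing field: {category}.{field}")
    return problems


card = {
    "embodiment": {"robot_model": "franka", "control_mode": "cartesian_delta"},
    "sensors": {"camera_names": ["wrist", "front"], "camera_fps": 30},
    "actions": {"units": "end_effector_delta_m_rad", "rate_hz": 15},
    "task": {"success_label": "binary"},
    # "governance" is deliberately omitted to show a failing check.
}

problems = validate_card(card)
print(problems)  # ['missing category: governance']
```

A training script could refuse to start whenever the returned list is non-empty, which turns the card from documentation into a gate.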

Hands-On Lab: Build A Robot Dataset Card

Duration: ~50 minutes
Difficulty: Intermediate

Objective

Create a dataset card and validation rule for a small robot learning dataset.

What You'll Practice

Specifying embodiment metadata.
Writing split and license fields.
Checking a schema before training.

Setup

pip install "pydantic>=2"

Code Fragment 2: The setup command installs Pydantic 2 or later, which the lab needs because it calls model_dump (introduced in Pydantic 2). The lab uses it to convert a dataset card from prose into a checked object.

Steps

Step 1: Define the schema

Write a card model with robot, sensor, action, split, and license fields.

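# Define a dataset card that makes embodiment and governance explicit.
from pydantic import BaseModel


class RobotDatasetCard(BaseModel):
    robot: str          # robot model or platform name
    camera_fps: int     # camera frame rate in frames per second
    action_units: str   # units and frame of the action vector
    split_policy: str   # how training and evaluation data are separated
    license: str        # license identifier for the dataset

    def as_row(self) -> dict[str, object]:
        # Return the card as a plain dictionary that can be stored with the data.
        return self.model_dump()


# Placeholder values; Step 2 fills in a concrete card.
robot_dataset_card = RobotDatasetCard(
    robot="mobile_manipulator",
    camera_fps=30,
    action_units="action_units_example",
    split_policy="split_policy_example",
    license="CC-BY-4.0",
)
print(robot_dataset_card.as_row())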

Code Fragment 3: The RobotDatasetCard model forces split_policy and license to appear beside technical fields. Add calibration_version so replay and training can identify stale sensor geometry.

Step 2: Instantiate and inspect

Create one card and print the normalized representation.

# Create one card and inspect its normalized dictionary.
card = RobotDatasetCard(
    robot="franka",
    camera_fps=30,
    action_units="end_effector_delta_m_rad",
    split_policy="held_out_scenes",
    license="CC-BY-4.0",
)
print(card.model_dump())

Code Fragment 4: The card.model_dump call produces the artifact that should travel with the dataset. The split policy makes the generalization claim explicit.

Expected Output

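The lab should print a complete dataset-card dictionary. For the card in Step 2, the output is:

{'robot': 'franka', 'camera_fps': 30, 'action_units': 'end_effector_delta_m_rad', 'split_policy': 'held_out_scenes', 'license': 'CC-BY-4.0'}

Alongside it, write a short note explaining what kind of generalization the split tests. Here, held_out_scenes means evaluation happens in scenes that never appear in training.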

Stretch Goals

Add separate licenses for video, robot state, and language annotations.
Add a validator that rejects unknown action units.
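One way to approach the second stretch goal is sketched below with a standard-library dataclass so the logic stays visible; in Pydantic 2 the same check would typically live in a `field_validator`. The allow-list `KNOWN_ACTION_UNITS` and the class name `CheckedCard` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allow-list; a real project would define its own vocabulary.
KNOWN_ACTION_UNITS = {
    "end_effector_delta_m_rad",
    "joint_position_rad",
    "joint_velocity_rad_s",
}


@dataclass(frozen=True)
class CheckedCard:
    robot: str
    action_units: str
    license: str

    def __post_init__(self) -> None:
        # Reject units outside the allow-list at construction time.
        if self.action_units not in KNOWN_ACTION_UNITS:
            raise ValueError(f"unknown action_units: {self.action_units!r}")


ok = CheckedCard("franka", "end_effector_delta_m_rad", "CC-BY-4.0")

try:
    CheckedCard("franka", "centimeters_maybe", "CC-BY-4.0")
    rejected = False
except ValueError as err:
    rejected = True
    print(err)  # unknown action_units: 'centimeters_maybe'
```

Failing at construction time means a card with ambiguous units never reaches the training pipeline.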

Complete Solution

# Complete dataset-card solution with calibration metadata.
# This is enough structure for a small internal robot dataset.
from pydantic import BaseModel

class RobotDatasetCard(BaseModel):
    robot: str
    camera_fps: int
    action_units: str
    split_policy: str
    license: str
    calibration_version: str

    def as_row(self) -> dict[str, object]:
        return self.model_dump()

card = RobotDatasetCard(
    robot="franka",
    camera_fps=30,
    action_units="end_effector_delta_m_rad",
    split_policy="held_out_scenes",
    license="CC-BY-4.0",
    calibration_version="calib_2026_06_21",
)
print(card.as_row())

Code Fragment 5: The complete solution adds calibration_version so sensor geometry is versioned with the dataset. This keeps replay, training, and evaluation tied to the same physical setup.

Pitfall: License Afterthoughts

License and redistribution terms must be known before a dataset is mixed with other data. Once derived checkpoints exist, separating incompatible sources can become impossible.

Practical Example

A public tabletop dataset may allow research use but restrict commercial redistribution of videos. A dataset card that separates raw video license, derived feature license, and checkpoint training permission prevents accidental misuse later.
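A card along these lines might record each permission as its own field. The sketch below is illustrative only: `ComponentLicenses`, its field names, and the "research-only" label are assumptions rather than a standard vocabulary or a real license identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentLicenses:
    """Hypothetical per-component governance fields for one dataset."""
    raw_video: str
    derived_features: str
    checkpoint_training_allowed: bool
    commercial_video_redistribution: bool


def can_train_checkpoint(lic: ComponentLicenses) -> bool:
    # Training permission is tracked separately from redistribution.
    return lic.checkpoint_training_allowed


def can_ship_videos_commercially(lic: ComponentLicenses) -> bool:
    return lic.commercial_video_redistribution


tabletop = ComponentLicenses(
    raw_video="research-only",  # placeholder label, not a real license ID
    derived_features="CC-BY-4.0",
    checkpoint_training_allowed=True,
    commercial_video_redistribution=False,
)

print(can_train_checkpoint(tabletop), can_ship_videos_commercially(tabletop))
# True False
```

Keeping the two questions in separate fields lets a pipeline answer "may I train?" without also implying "may I ship the videos?".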

Research Frontier

Dataset cards for robot learning are becoming as important as model cards. The next step is machine-readable cards that training pipelines can use to block unsafe mixes, report held-out factors, and preserve provenance automatically.
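As a rough sketch of what such a pipeline gate could look like, the function below reads a list of card dictionaries and reports whether the mix should proceed. The function name `check_mix`, the card keys, and the dataset names are hypothetical, and the rule (one allowed-license set, one shared action unit) is deliberately simplistic.

```python
def check_mix(cards: list[dict], allowed_licenses: set[str]) -> dict:
    """Sketch of a pre-training gate over machine-readable cards."""
    blocked = [c["name"] for c in cards if c["license"] not in allowed_licenses]
    units = sorted({c["action_units"] for c in cards})
    return {
        # Proceed only if every license is allowed and action units agree.
        "ok": not blocked and len(units) == 1,
        "blocked_by_license": blocked,
        "action_units_seen": units,
        "held_out_factors": sorted({c["split_policy"] for c in cards}),
        "provenance": [c["name"] for c in cards],
    }


cards = [
    {"name": "lab_a", "license": "CC-BY-4.0",
     "action_units": "end_effector_delta_m_rad",
     "split_policy": "held_out_scenes"},
    {"name": "lab_b", "license": "research-only",  # placeholder label
     "action_units": "end_effector_delta_m_rad",
     "split_policy": "held_out_objects"},
]

report = check_mix(cards, allowed_licenses={"CC-BY-4.0"})
print(report["ok"], report["blocked_by_license"])  # False ['lab_b']
```

The same report doubles as a provenance record: it lists which sources entered the mix and which held-out factors each one contributes.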

Self Check

Could a new lab reproduce your dataset split and action units from the card alone? If not, the card is a brochure rather than a scientific artifact.
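One way to make a split reproducible from the card alone is to derive it deterministically from fields the card records. The sketch below assumes a held-out-scenes policy and hypothetical card fields `salt` and `test_fraction`; the hashing scheme is an illustration, not a convention of any particular dataset.

```python
import hashlib


def split_for_scene(scene_id: str, salt: str, test_fraction: float) -> str:
    """Assign a whole scene to 'train' or 'test' from card fields alone."""
    digest = hashlib.sha256(f"{salt}:{scene_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical card fields that pin the split down.
split_spec = {"policy": "held_out_scenes", "salt": "v1", "test_fraction": 0.2}

scenes = [f"scene_{k:03d}" for k in range(10)]
assignment = {
    s: split_for_scene(s, split_spec["salt"], split_spec["test_fraction"])
    for s in scenes
}
print(assignment)
```

Because the assignment depends only on the scene identifier and the recorded salt and fraction, another lab can regenerate the same split without receiving a separate index file.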

Key Takeaway

A robot dataset is reusable only when its schema, embodiment metadata, split policy, and license are explicit enough to travel with the files.

Exercise 24.2.1

Draft a dataset card for a bimanual manipulation dataset and include one field that would prevent an invalid comparison.

What's Next

Section 24.3 studies the hardest schema problem: pooling data across different embodiments without pretending all action spaces are the same.

References & Further Reading

Robot Datasets

Open X-Embodiment Collaboration. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.

The central reference for cross-embodiment robot data, standardized dataset release, and RT-X style transfer across robot bodies.

Dataset

Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.

Provides an in-the-wild manipulation dataset with diverse scenes, collectors, tasks, and detailed hardware reproduction guidance.

Dataset

Walke, H. R. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale.

A large manipulation dataset designed around open-vocabulary multi-task learning, goal images, language, and data-scale experiments.

Dataset

Google DeepMind Open X-Embodiment Repository.

Shows the released dataset structure and RLDS episode organization used by the Open X-Embodiment ecosystem.

Repository

Tools

LeRobotDataset v3.0 Documentation.

The practical reference for standardized multimodal robot time-series data, metadata, indexing, and Hub visualization.

Tool