"The policy read the video. The researcher read the license. Only one of them was ready to ship."
A Dataset Card Lawyer
A robot dataset schema is the contract between data collectors, model trainers, evaluators, and downstream users. It must describe the time series, the robot body, the task semantics, the splits, and the legal permission to use the data.
Minimum Schema
At frame $t$ in episode $i$, a dataset row should identify $i$, $t$, observation features $o_{i,t}$, action features $a_{i,t}$, and metadata $m_i$. The metadata is not optional bookkeeping; it defines whether two rows are comparable.
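One way to picture this row structure is the sketch below. The class and field names (`EpisodeMetadata`, `FrameRow`, `comparable`) and the sample values are illustrative assumptions, not part of any particular dataset format. The point is only that comparability is decided by the metadata $m_i$, not by the observation or action values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeMetadata:
    """Per-episode metadata m_i (hypothetical field names)."""
    robot: str
    action_units: str
    camera_fps: int


@dataclass(frozen=True)
class FrameRow:
    """One dataset row: episode i, frame t, o_{i,t}, a_{i,t}, and m_i."""
    episode_index: int
    frame_index: int
    observation: tuple[float, ...]
    action: tuple[float, ...]
    metadata: EpisodeMetadata


def comparable(a: FrameRow, b: FrameRow) -> bool:
    """Two rows are comparable only if their episode metadata agrees."""
    return a.metadata == b.metadata


franka = EpisodeMetadata("franka", "end_effector_delta_m_rad", 30)
mobile = EpisodeMetadata("mobile_manipulator", "joint_velocity_rad_s", 30)

row_a = FrameRow(0, 0, (0.1, 0.2), (0.01, 0.0), franka)
row_b = FrameRow(1, 5, (0.3, 0.4), (0.02, 0.0), franka)
row_c = FrameRow(2, 0, (0.1, 0.2), (0.01, 0.0), mobile)

print(comparable(row_a, row_b))  # True: same embodiment and action units
print(comparable(row_a, row_c))  # False: identical numbers, different meaning
```

Rows `row_a` and `row_c` carry the same numbers, yet the action vector means something different on each robot, which is why the metadata has to travel with every row.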
| Category | Required Fields | Failure If Missing |
|---|---|---|
| Embodiment | robot model, joints, gripper, base, control mode | Action labels cannot be interpreted. |
| Sensors | camera names, resolution, fps, extrinsics, proprioception | Observations cannot be aligned or reproduced. |
| Actions | units, frame, rate, saturation, command versus executed action | Policy output space is ambiguous. |
| Task | language, goal, reset, success, failure labels | Evaluation semantics drift. |
| Governance | license, consent constraints, redistribution rules | The dataset cannot be safely reused. |
If the robot type, action units, camera calibration, or split policy lives outside the dataset, it will eventually be separated from the results that depend on it.
LeRobotDataset v3 already provides conventions for multimodal time-series data, metadata, indexing, and Hub visualization. Use it after writing the dataset card, so the standard format implements a scientific contract rather than replacing one.
Code Fragment 1 validates a dataset card before training. The example uses plain Python so the logic is visible; in production, a Pydantic schema or LeRobot metadata validator would enforce the same contract.
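A minimal sketch of what such a pre-training check might look like in plain Python follows. The required-field names mirror the Minimum Schema table, but the exact keys, the helper names `missing_fields` and `validate_card`, and the sample card values are assumptions made for illustration rather than a standard.

```python
# Required fields per category, following the Minimum Schema table.
# The key names are illustrative assumptions, not a standard.
REQUIRED_FIELDS = {
    "embodiment": ("robot_model", "joints", "gripper", "base", "control_mode"),
    "sensors": ("camera_names", "resolution", "fps", "extrinsics", "proprioception"),
    "actions": ("units", "frame", "rate", "saturation", "command_vs_executed"),
    "task": ("language", "goal", "reset", "success", "failure_labels"),
    "governance": ("license", "consent_constraints", "redistribution_rules"),
}


def missing_fields(card: dict) -> list[str]:
    """Return 'category.field' for every required entry that is absent or empty."""
    missing = []
    for category, fields in REQUIRED_FIELDS.items():
        section = card.get(category) or {}
        for field in fields:
            if section.get(field) in (None, "", [], {}):
                missing.append(f"{category}.{field}")
    return missing


def validate_card(card: dict) -> None:
    """Raise before training starts if the card is incomplete."""
    missing = missing_fields(card)
    if missing:
        raise ValueError("Incomplete dataset card: " + ", ".join(missing))


# A hypothetical card that omits camera extrinsics and redistribution rules.
card = {
    "embodiment": {"robot_model": "franka", "joints": 7, "gripper": "parallel_jaw",
                   "base": "fixed", "control_mode": "end_effector_delta"},
    "sensors": {"camera_names": ["wrist", "front"], "resolution": [640, 480],
                "fps": 30, "proprioception": ["joint_positions"]},
    "actions": {"units": "end_effector_delta_m_rad", "frame": "base", "rate": 10,
                "saturation": "clipped", "command_vs_executed": "commanded"},
    "task": {"language": "pick up the block", "goal": "block in bin",
             "reset": "manual", "success": "binary", "failure_labels": ["dropped"]},
    "governance": {"license": "CC-BY-4.0", "consent_constraints": "none recorded"},
}

print(missing_fields(card))
# ['sensors.extrinsics', 'governance.redistribution_rules']
```

Calling `validate_card(card)` on this card raises a `ValueError` that names both gaps, so a training script fails before any data is loaded.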
Hands-On Lab: Build A Robot Dataset Card
Objective
Create a dataset card and validation rule for a small robot learning dataset.
What You'll Practice
- Specifying embodiment metadata.
- Writing split and license fields.
- Checking a schema before training.
Setup
pip install pydantic
Steps
Step 1: Define the schema
Write a card model with robot, sensor, action, split, and license fields.
# Define a dataset card that makes embodiment and governance explicit.
from pydantic import BaseModel


class RobotDatasetCard(BaseModel):
    robot: str
    camera_fps: int
    action_units: str
    split_policy: str
    license: str

    def as_row(self) -> dict[str, object]:
        return self.model_dump()


robot_dataset_card = RobotDatasetCard(
    robot="mobile_manipulator",
    camera_fps=30,
    action_units="action_units_example",
    split_policy="split_policy_example",
    license="CC-BY-4.0",
)
print(robot_dataset_card.as_row())
Step 2: Instantiate and inspect
Create one card and print the normalized representation.
# Create one card and inspect its normalized dictionary.
card = RobotDatasetCard(
    robot="franka",
    camera_fps=30,
    action_units="end_effector_delta_m_rad",
    split_policy="held_out_scenes",
    license="CC-BY-4.0",
)
print(card.model_dump())
Expected Output
The lab should print a complete dataset-card dictionary. Accompany it with a short note explaining what kind of generalization the split tests.
Stretch Goals
- Add separate licenses for video, robot state, and language annotations.
- Add a validator that rejects unknown action units.
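The second stretch goal could be approached along the following lines, using Pydantic v2's `field_validator`. The allow-list `KNOWN_ACTION_UNITS` and the class name `CheckedDatasetCard` are hypothetical; a real project would define its own vocabulary of action units.

```python
from pydantic import BaseModel, ValidationError, field_validator

# Hypothetical allow-list; replace with the units your lab actually uses.
KNOWN_ACTION_UNITS = {
    "end_effector_delta_m_rad",
    "joint_position_rad",
    "joint_velocity_rad_s",
}


class CheckedDatasetCard(BaseModel):
    robot: str
    camera_fps: int
    action_units: str
    split_policy: str
    license: str

    @field_validator("action_units")
    @classmethod
    def action_units_must_be_known(cls, value: str) -> str:
        # Reject anything outside the agreed vocabulary.
        if value not in KNOWN_ACTION_UNITS:
            raise ValueError(f"unknown action units: {value!r}")
        return value


try:
    CheckedDatasetCard(
        robot="franka",
        camera_fps=30,
        action_units="furlongs",
        split_policy="held_out_scenes",
        license="CC-BY-4.0",
    )
except ValidationError as exc:
    # Pydantic reports which field failed and why.
    print(exc.errors()[0]["loc"], exc.errors()[0]["msg"])
```

Because the check runs at construction time, a card with unrecognized units never reaches the training pipeline.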
Complete Solution
# Complete dataset-card solution with calibration metadata.
# This is enough structure for a small internal robot dataset.
from pydantic import BaseModel


class RobotDatasetCard(BaseModel):
    robot: str
    camera_fps: int
    action_units: str
    split_policy: str
    license: str
    calibration_version: str

    def as_row(self) -> dict[str, object]:
        return self.model_dump()


card = RobotDatasetCard(
    robot="franka",
    camera_fps=30,
    action_units="end_effector_delta_m_rad",
    split_policy="held_out_scenes",
    license="CC-BY-4.0",
    calibration_version="calib_2026_06_21",
)
print(card.as_row())
License and redistribution terms must be known before a dataset is mixed with other data. Once derived checkpoints exist, separating incompatible sources can become impossible.
A public tabletop dataset may allow research use but restrict commercial redistribution of videos. A dataset card that separates raw video license, derived feature license, and checkpoint training permission prevents accidental misuse later.
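The separation described above can be sketched as a small record with one entry per artifact, plus a check against an intended use. Everything here is a toy encoding under stated assumptions: the names `LicenseTerms`, `IntendedUse`, and `violations`, the boolean flags, and the license strings are hypothetical, and real license terms are rarely reducible to two booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Per-artifact permissions for one dataset (hypothetical field names)."""
    name: str
    raw_video_license: str
    derived_feature_license: str
    allows_checkpoint_training: bool
    allows_commercial_video_redistribution: bool


@dataclass(frozen=True)
class IntendedUse:
    """What a project plans to do with the data."""
    train_checkpoint: bool
    redistribute_video_commercially: bool


def violations(terms: LicenseTerms, use: IntendedUse) -> list[str]:
    """List every planned use that the recorded terms do not permit."""
    problems = []
    if use.train_checkpoint and not terms.allows_checkpoint_training:
        problems.append(f"{terms.name}: checkpoint training not permitted")
    if (use.redistribute_video_commercially
            and not terms.allows_commercial_video_redistribution):
        problems.append(f"{terms.name}: commercial video redistribution not permitted")
    return problems


# Illustrative terms for a public tabletop dataset.
tabletop = LicenseTerms(
    name="tabletop_public",
    raw_video_license="research-only (illustrative)",
    derived_feature_license="CC-BY-4.0",
    allows_checkpoint_training=True,
    allows_commercial_video_redistribution=False,
)

research = IntendedUse(train_checkpoint=True, redistribute_video_commercially=False)
product = IntendedUse(train_checkpoint=True, redistribute_video_commercially=True)

print(violations(tabletop, research))  # []
print(violations(tabletop, product))
# ['tabletop_public: commercial video redistribution not permitted']
```

Keeping the permissions as separate fields means the research use passes while the product use is flagged, instead of a single license string silently covering both.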
Dataset cards for robot learning are becoming as important as model cards. The next step is machine-readable cards that training pipelines can use to block unsafe mixes, report held-out factors, and preserve provenance automatically.
Could a new lab reproduce your dataset split and action units from the card alone? If not, the card is a brochure rather than a scientific artifact.
A robot dataset is reusable only when its schema, embodiment metadata, split policy, and license are explicit enough to travel with the files.
Draft a dataset card for a bimanual manipulation dataset and include one field that would prevent an invalid comparison.
What's Next
Section 24.3 studies the hardest schema problem: pooling data across different embodiments without pretending all action spaces are the same.
The central reference for cross-embodiment robot data, standardized dataset release, and RT-X style transfer across robot bodies.
Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.
Provides an in-the-wild manipulation dataset with diverse scenes, collectors, tasks, and detailed hardware reproduction guidance.
Walke, H. R. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale.
A large manipulation dataset designed around open-vocabulary multi-task learning, goal images, language, and data-scale experiments.
Google DeepMind Open X-Embodiment Repository.
Shows the released dataset structure and RLDS episode organization used by the Open X-Embodiment ecosystem.
LeRobotDataset v3.0 Documentation.
The practical reference for standardized multimodal robot time-series data, metadata, indexing, and Hub visualization.