A Careful Control Loop
This section, on the lifelong and language-conditioned suites LIBERO, CALVIN, and Meta-World, asks what a benchmark number actually measures. A task suite is useful only when its episodes, splits, seeds, metrics, wrappers, and failure labels match the embodied construct the paper claims to evaluate.
For these suites, benchmark discipline means freezing the environment API, task panel, success definition, seed policy, and artifact schema before comparing methods.
What This Section Builds
This section makes lifelong and language-conditioned benchmarks operational. LIBERO tests knowledge transfer across language-conditioned manipulation suites, CALVIN tests long-horizon language-conditioned tabletop behavior, and Meta-World tests multi-task and meta-learning across a structured set of manipulation tasks.
The goal is to evaluate learning over distributions, not isolated episodes. That means freezing task order, language templates, held-out goals, adaptation budget, replay budget, seeds, and the rule for measuring forgetting.
Treat the leaderboard for these suites as an instrument: it is interpretable only when the benchmark isolates the capability, fixes the protocol, and records enough context to rerun the evaluation.
Theory
A lifelong benchmark is a sequence of task distributions rather than one static test set. Let $S_{k,t}$ be success on task family $k$ after training stage $t$. A strong final average can hide catastrophic forgetting if earlier tasks lose success after later training, so the evidence artifact should include the whole task-by-stage matrix, not only the last column.
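One way to make the forgetting claim precise is sketched below. It is stated as an assumption, not as the definition used by any particular suite. Let $T$ be the final training stage (a symbol introduced here). Forgetting on family $k$ is then the gap between its best success at any earlier stage and its success after the final stage:

$$
F_k = \max_{t < T} S_{k,t} - S_{k,T}
$$

A positive $F_k$ means family $k$ lost success after later training, and a value near zero means the skill was retained. The quantity is only meaningful for families that were trained before stage $T$. This is the difference that Code Fragment 1 computes from the task-by-stage matrix.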
Language conditioning adds another split boundary. A policy can overfit to instruction templates, object names, or goal phrasing while failing to ground new combinations. The evaluation must say whether language, objects, spatial relations, goals, or task families are held out.
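The split audit described above can be sketched as a small check over instruction templates and object names. The templates, objects, and label names below are hypothetical placeholders, not taken from LIBERO, CALVIN, or Meta-World. The point is only that each held-out item gets an explicit label for what is actually new about it.

```python
# Hypothetical (template, object) pairs; real suites define their own instructions.
train_pairs = {
    ("pick up the {obj}", "mug"),
    ("pick up the {obj}", "bowl"),
    ("push the {obj} left", "mug"),
}
heldout_pairs = {
    ("push the {obj} left", "bowl"),  # seen template, seen object, unseen pairing
    ("pick up the {obj}", "mug"),     # identical to a training pair
}


def audit_split(train, heldout):
    """Label each held-out pair by what is new relative to the training pairs."""
    train_templates = {template for template, _ in train}
    train_objects = {obj for _, obj in train}
    report = {}
    for template, obj in sorted(heldout):
        if (template, obj) in train:
            label = "leaked_pair"
        elif template in train_templates and obj in train_objects:
            label = "new_combination"
        elif template in train_templates:
            label = "new_object"
        elif obj in train_objects:
            label = "new_template"
        else:
            label = "new_template_and_object"
        report[(template, obj)] = label
    return report


split_report = audit_split(train_pairs, heldout_pairs)
for pair, label in split_report.items():
    print(pair, "->", label)
```

Any `leaked_pair` entry means the held-out panel overlaps with training, so a result on that panel cannot support a claim about new language-object combinations.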
The mechanism is transfer under controlled exposure. LIBERO separates kinds of knowledge shift, such as objects, spatial relations, goals, and mixtures. CALVIN emphasizes language-conditioned task sequences. Meta-World separates multi-task training from meta-learning adaptation, with held-out tasks in the meta-learning settings. Each suite needs a different split audit.
Worked Example
Code Fragment 1 computes a simple forgetting diagnostic. A method that improves the final task while damaging earlier tasks needs a different claim than a method that retains old skills and transfers to new ones.
# Measure forgetting from a task-by-stage success matrix.
# Rows are task families, columns are training stages after each
# new family has been introduced in the same order for every method.
success_by_stage = {
    "spatial": [0.68, 0.61, 0.55],
    "object": [0.00, 0.64, 0.58],
    "goal": [0.00, 0.00, 0.71],
}

forgetting = {}
for task, scores in success_by_stage.items():
    # Best success at any stage before the final one.
    best_before_final = max(scores[:-1])
    if best_before_final == 0:
        # Zero success at every earlier stage is treated as "not yet trained".
        forgetting[task] = "not_previously_trained"
    else:
        forgetting[task] = round(best_before_final - scores[-1], 2)

print(forgetting)
The success_by_stage matrix preserves the evidence that a final average would hide. The forgetting values show how much success on earlier task families dropped after later training stages, which is central for LIBERO-style lifelong claims. Here the spatial family loses 0.13 and the object family loses 0.06 relative to their best earlier stage, while the goal family is only trained at the final stage.

The maintained suites provide task definitions and loaders, but they do not protect you from unfair transfer comparisons. The evaluation script must enforce the same task order, adaptation steps, replay buffer, prompt templates, demonstration access, and seed list for every method.
Practical Recipe
- State the transfer claim: new objects, new spatial relations, new goals, new task families, longer language-conditioned chains, or faster adaptation.
- Freeze task order, train/test task split, instruction templates, adaptation budget, replay budget, seeds, and stopping rule.
- Report the full task-by-stage matrix for lifelong learning, plus final average success.
- For Meta-World, separate multi-task performance from meta-learning adaptation to held-out tasks.
- Label failures as language grounding error, task-order forgetting, adaptation overfit, object confusion, goal ambiguity, or controller failure.
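The failure labels in the last recipe step work best as a closed vocabulary, so that two evaluators cannot describe the same failure with different strings. The sketch below is one way to do that. The enum name, its values, and the example labels are illustrative assumptions rather than part of any suite's API.

```python
from collections import Counter
from enum import Enum


class FailureLabel(Enum):
    # Hypothetical vocabulary mirroring the failure categories listed above.
    LANGUAGE_GROUNDING = "language_grounding_error"
    TASK_ORDER_FORGETTING = "task_order_forgetting"
    ADAPTATION_OVERFIT = "adaptation_overfit"
    OBJECT_CONFUSION = "object_confusion"
    GOAL_AMBIGUITY = "goal_ambiguity"
    CONTROLLER_FAILURE = "controller_failure"


# Hypothetical labels assigned to failed episodes during review.
raw_labels = [
    "object_confusion",
    "controller_failure",
    "object_confusion",
    "task_order_forgetting",
]

# FailureLabel(value) raises ValueError for any label outside the vocabulary,
# so free-text labels cannot slip into the results table.
failure_counts = Counter(FailureLabel(label) for label in raw_labels)
for label, count in failure_counts.most_common():
    print(label.value, count)
```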
For these suites, compare only metrics computed together in one benchmark pass with the same task panel, wrappers, seed policy, success definition, and logged failure labels.
The common mistake is comparing methods with different task orders or different adaptation budgets. A learner that sees more replay data, extra prompt variants, or additional tuning episodes is not being compared on the same lifelong benchmark, even if the final success table uses the same task names.
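A comparison script can refuse to tabulate methods whose protocols differ. The sketch below assumes each method's run configuration is available as a plain dictionary. The field names and values are hypothetical and would need to match whatever your evaluation harness actually records.

```python
# Protocol fields that must match before two methods share a results table.
PROTOCOL_FIELDS = (
    "task_order",
    "adaptation_steps",
    "replay_budget",
    "prompt_variants",
    "seeds",
)

# Hypothetical run configurations for two methods.
runs = {
    "method_a": {
        "task_order": ("spatial", "object", "goal"),
        "adaptation_steps": 0,
        "replay_budget": 200,
        "prompt_variants": 1,
        "seeds": (0, 1, 2),
    },
    "method_b": {
        "task_order": ("spatial", "object", "goal"),
        "adaptation_steps": 50,
        "replay_budget": 200,
        "prompt_variants": 3,
        "seeds": (0, 1, 2),
    },
}


def protocol_mismatches(runs, reference):
    """Return, per method, the protocol fields that differ from the reference run."""
    ref = runs[reference]
    mismatches = {}
    for name, config in runs.items():
        differing = [field for field in PROTOCOL_FIELDS if config[field] != ref[field]]
        if differing:
            mismatches[name] = differing
    return mismatches


mismatches = protocol_mismatches(runs, "method_a")
print(mismatches)  # {'method_b': ['adaptation_steps', 'prompt_variants']}
```

An empty result is the precondition for putting the methods in one table. A non-empty result names the exact fields that make the comparison unfair.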
A team evaluating LIBERO, CALVIN, or Meta-World should log task family, task order, language instruction, held-out condition, adaptation steps, replay source, seed, per-stage success, and final success. Those fields reveal whether a method transfers knowledge, memorizes prompt templates, or forgets earlier skills.
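The logged fields above can be captured as one record per evaluation episode, from which per-stage success is derived rather than typed by hand. The sketch below is a minimal version under stated assumptions: the `EpisodeRecord` class, the `make_record` helper, and all values are hypothetical, and the instruction text is a placeholder.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EpisodeRecord:
    suite: str
    task_family: str
    task_order: tuple
    instruction: str
    heldout_condition: str
    adaptation_steps: int
    replay_source: str
    seed: int
    stage: int
    success: bool


def make_record(task_family, seed, stage, success):
    # Fill in the protocol fields shared by every episode in this toy run.
    return EpisodeRecord(
        suite="LIBERO",
        task_family=task_family,
        task_order=("spatial", "object"),
        instruction="placeholder instruction",
        heldout_condition="none",
        adaptation_steps=0,
        replay_source="none",
        seed=seed,
        stage=stage,
        success=success,
    )


records = [
    make_record("spatial", seed=0, stage=0, success=True),
    make_record("spatial", seed=1, stage=0, success=False),
    make_record("spatial", seed=0, stage=1, success=False),
    make_record("spatial", seed=1, stage=1, success=False),
]

# One JSON object per line, ready to append to a log file.
log_lines = [json.dumps(asdict(record), sort_keys=True) for record in records]

# Aggregate successes and episode counts per (task family, stage) cell.
totals = defaultdict(lambda: [0, 0])
for record in records:
    cell = totals[(record.task_family, record.stage)]
    cell[0] += int(record.success)
    cell[1] += 1
per_stage_success = {key: wins / count for key, (wins, count) in totals.items()}
print(per_stage_success)
```

Because the task-by-stage matrix is computed from the episode log, any cell can be traced back to the seeds and instructions that produced it.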
A lifelong benchmark is a diary, not a trophy photo. The final row matters, but the intermediate pages tell you whether the robot kept its old skills.
The frontier is moving from single-task success toward policies that retain, compose, and extend skills under language guidance. The open evaluation problem is making transfer claims precise enough that gains cannot be explained by easier task orders, prompt leakage, extra replay, or hidden adaptation.
Can you name the task order, held-out condition, language-template split, adaptation budget, replay budget, seed list, per-stage metric, and forgetting measure? If not, the experiment boundary is still too vague.
Lifelong and language-conditioned benchmarks become useful when they expose the sequence of learning, not only the end state. A LIBERO result should show which knowledge shift was tested. A CALVIN result should show how language-conditioned chains were sampled and scored. A Meta-World result should say whether it evaluates multi-task training or adaptation to held-out tasks.
The graduate-level habit is to separate final competence from learning dynamics. Final success asks whether the policy can solve the last evaluation panel. Forward transfer asks whether earlier training helps later tasks. Backward transfer and forgetting ask whether later training preserves earlier skills. These are different constructs and must be co-computed from one task-by-stage artifact.
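The three constructs can be computed from one matrix, as the sketch below shows. It follows one common convention from the continual-learning literature, and individual suites may define the metrics differently. Backward transfer compares final success on each earlier task with success right after that task was learned. Forward transfer compares success on a task just before it is trained with an untrained baseline. The numbers and the `baseline` values are hypothetical.

```python
# success[k][t]: success on task k after training stage t (hypothetical numbers).
# Task k is introduced at stage k, so success[k][k] is "just learned" performance.
success = [
    [0.68, 0.61, 0.55],
    [0.10, 0.64, 0.58],
    [0.05, 0.12, 0.71],
]
# Hypothetical success of an untrained policy on each task.
baseline = [0.02, 0.04, 0.03]

num_tasks = len(success)
final = num_tasks - 1  # index of the last training stage

# Final competence: average success after the last stage.
final_average = sum(success[k][final] for k in range(num_tasks)) / num_tasks

# Backward transfer: change on earlier tasks between "just learned" and the end.
backward_transfer = sum(
    success[k][final] - success[k][k] for k in range(final)
) / final

# Forward transfer: success on a task just before training on it, minus baseline.
forward_transfer = sum(
    success[k][k - 1] - baseline[k] for k in range(1, num_tasks)
) / final

print(round(final_average, 3), round(backward_transfer, 3), round(forward_transfer, 3))
```

With these numbers, backward transfer is negative (earlier tasks lost success) while forward transfer is positive. A final average alone would not reveal that combination.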
| Suite | Primary construct | Protocol detail to freeze |
|---|---|---|
| LIBERO | Lifelong robot learning and knowledge transfer under language-conditioned manipulation tasks | Task suite, order, prompt templates, replay budget, adaptation budget, and forgetting metric. |
| CALVIN | Long-horizon language-conditioned tabletop behavior | Instruction distribution, chain length, start states, horizon, success predicate, and seed list. |
| Meta-World MT settings | Multi-task learning across manipulation tasks | Task set, shared observation/action interface, per-task success, and aggregation rule. |
| Meta-World ML settings | Meta-learning and adaptation to held-out tasks or goals | Train/test task split, support episodes, adaptation steps, and evaluation episodes. |
| Language-conditioned variants | Grounding instructions into manipulation behavior | Template split, paraphrase access, object vocabulary, and held-out language-object combinations. |
A robust transfer benchmark starts with a schedule file. The schedule file lists each training stage, task family, prompt set, replay source, adaptation budget, and evaluation panel. Every method reads the same schedule, which keeps the comparison tied to learning dynamics rather than run-specific choices.
- Write a schedule with task order, prompt templates, train/test tasks, adaptation budget, replay budget, and seeds.
- Evaluate after each stage on the same held-out panel, not only after final training.
- Save the task-by-stage success matrix, final average, forward transfer, backward transfer, and forgetting.
- Stratify results by task family or language condition so one easy family does not dominate the claim.
- Keep prompt variants and demonstration access identical across methods.
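The stratification step in the list above matters because families rarely have equal episode counts. The sketch below uses hypothetical counts to contrast a pooled success rate, which weights every episode equally, with a per-family average, which weights every family equally.

```python
# Hypothetical (successes, episodes) per task family.
results = {
    "spatial": (360, 400),
    "object": (60, 100),
    "goal": (15, 50),
}

# Stratified view: one success rate per family.
per_family = {family: wins / total for family, (wins, total) in results.items()}

# Pooled rate: dominated by whichever family has the most episodes.
pooled = sum(wins for wins, _ in results.values()) / sum(
    total for _, total in results.values()
)

# Macro average: each family counts once regardless of episode count.
macro = sum(per_family.values()) / len(per_family)

print(per_family)
print(round(pooled, 3), round(macro, 3))
```

Here the large, easy family pulls the pooled rate well above the macro average. Reporting the per-family rates alongside either aggregate keeps that effect visible.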
Code Fragment 2 records a transfer schedule. This is the file you want reviewers to inspect before they trust a lifelong-learning number.
# Define one transfer schedule shared by every compared method.
# This schedule fixes task order, adaptation budget, replay budget,
# and held-out evaluation so final averages are construct-matched.
from dataclasses import dataclass, asdict


@dataclass
class TransferSchedule:
    suite: str
    task_order: tuple[str, ...]
    heldout_condition: str
    adaptation_steps: int
    replay_budget: int
    seeds: tuple[int, ...]

    def as_row(self) -> dict[str, object]:
        return asdict(self)


schedule = TransferSchedule(
    suite="LIBERO",
    task_order=("spatial", "object", "goal"),
    heldout_condition="new_language_object_combinations",
    adaptation_steps=0,
    replay_budget=200,
    seeds=(0, 1, 2),
)

print(schedule.as_row())
TransferSchedule makes task order and adaptation budget explicit before any lifelong result is reported. The heldout_condition field tells the reader whether the number tests new language-object combinations, new goals, new tasks, or another transfer boundary.

Expected output: the printed schedule should expose the suite, task order, held-out condition, adaptation budget, replay budget, and seeds. If two methods use different schedules, their final success values do not belong in one comparison.
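One way to enforce the shared-schedule rule is to store a fingerprint of the schedule next to every result. The sketch below assumes the schedule has been serialized to a plain dictionary with the same fields as the schedule above. The `schedule_fingerprint` helper is hypothetical, and hashing a canonical JSON string is just one reasonable choice.

```python
import hashlib
import json


def schedule_fingerprint(schedule: dict) -> str:
    """Hash a canonical JSON encoding so equal schedules get equal fingerprints."""
    canonical = json.dumps(schedule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schedule_a = {
    "suite": "LIBERO",
    "task_order": ["spatial", "object", "goal"],
    "heldout_condition": "new_language_object_combinations",
    "adaptation_steps": 0,
    "replay_budget": 200,
    "seeds": [0, 1, 2],
}
# Same schedule except for a larger replay budget.
schedule_b = dict(schedule_a, replay_budget=500)

same_protocol = schedule_fingerprint(schedule_a) == schedule_fingerprint(schedule_b)
print(same_protocol)  # False
```

Results whose fingerprints differ were produced under different schedules and should not appear in the same comparison row.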
When a lifelong or language-conditioned experiment fails, inspect the matrix before changing the model. A low final average may come from one hard task family, a prompt-grounding failure, forgetting of earlier tasks, insufficient adaptation budget, or a controller failure unrelated to transfer. The failure label should say which one.
Lifelong and language-conditioned benchmarks are useful when they report task order, held-out conditions, adaptation budget, replay budget, per-stage success, and forgetting from one shared evaluation schedule.
Design a LIBERO, CALVIN, or Meta-World comparison for a transfer claim. Specify the task order, held-out condition, adaptation budget, replay budget, seed list, and the forgetting metric you would report.
Section 12.4 moves from transfer schedules to long-horizon household tasks, where predicate progress and failure points must be saved with final success.