Section 51.1: Closed- vs. open-world tasks | Building Embodied AI: From Perception to Autonomous Action

A closed-world benchmark is a tidy kitchen; deployment is the drawer where somebody put batteries, tape, and one mysterious screw.
A Junk Drawer Deployment

Technical illustration for Section 51.1: Closed- vs. open-world tasks. — Figure 51.1A: Closed-world tasks (a fixed set of objects, instructions, and goals) vs. open-world tasks (novel objects encountered at test time), with a benchmark suite axis showing how evaluation protocols change across the two settings.

Big Picture

An open-world robot encounters objects, environments, and task variations outside its training distribution. Unlike closed-world benchmarks with fixed categories, deployment is a moving target: the categories themselves expand. This chapter addresses novelty detection, graceful degradation under distribution shift, and the triggers for on-the-fly adaptation.

closed- vs. open-world tasks becomes useful when it is tied to a named interface, a replayable scenario, a failure diagnostic, and an artifact that records what changed in the action loop.

The key question is practical: What is fixed by the benchmark, what may change in deployment, and what evidence proves recovery rather than memorization?

Action Is The Test

A representation earns its place when it changes the measurable action interface. In closed- vs. open-world tasks, the reader should keep asking which decision becomes easier, safer, or more reliable.

Theory

For Closed- vs. open-world tasks, the practical design rule is to make the interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.

Mechanism

The mechanism in Closed- vs. open-world tasks is the contract between representation and action. Name what enters the module, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.

Worked Example

Consider a mobile manipulator trained to pick known cups from known shelves. In the open world, the cup may be translucent, the shelf may move, and the instruction may refer to a new location.

# pip install gymnasium
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=7)
for step in range(5):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    print(step, action, reward, terminated or truncated)

Expected output: five short transition records with action, reward, and termination status for the seeded environment.

Code Fragment 51.1.1 turns Closed- vs. open-world tasks into an executable trace with explicit observation, action, and outcome fields.

Library Shortcut

The Gymnasium fragment is a few lines for a closed environment loop. For open-world work, pair Gymnasium-style controlled shifts with simulator variation, LeRobot datasets, and deployment logs; the tools handle repeatable panels and replay while the simple loop shows what a closed task fixes.

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.

Common Failure Mode

The common mistake in Closed- vs. open-world tasks is to celebrate the component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or metric that ignores the real task cost.

Practical Example

An open-world evaluation should save the known-task score, novelty type, detection signal, adaptation action, retained old-task score, and failure label in one artifact.

Research Frontier

Research studies out-of-distribution detection, embodied foundation models, adaptive policies, and evaluation panels that mix old and new tasks. The strongest claims separate novelty detection from safe adaptation.

DreamerV3 (Hafner et al., 2023) is a landmark result for open-world generalization: a single world-model agent with one fixed hyperparameter set achieves strong performance across a diverse suite of tasks spanning Atari, continuous control, and 3D environments, without any task-specific tuning. This demonstrates that a sufficiently expressive world model can internalize enough structure to handle task diversity rather than relying on curriculum or domain randomization. GR00T N1.5 (NVIDIA, 2024) extends this principle to physical robots: a cross-embodiment foundation model pretrained on diverse robot demonstrations and fine-tuned with small per-robot datasets, showing that open-world generalization in manipulation benefits from scale and embodiment diversity in the training distribution.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode for closed- vs. open-world tasks? If not, the system boundary is still too vague.

Closed- vs. open-world tasks becomes useful when it is tied to a closed-loop contract for Open-World and Novelty-Robust Embodiment. The contract names the participants, observations, action authority, timing budget, logging artifact, and recovery rule. Without that contract, a system can look capable in a notebook while failing the first time a partner delays, a person corrects it, or a deployment scene changes.

For Closed- vs. open-world tasks, separate the conceptual claim, the systems claim, and the evidence claim. A plausible mechanism, a clean interface, and a closed-loop result are different claims; the section should keep their evidence separate.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Gymnasium	Closed- vs. open-world tasks	Create controlled shifts that separate closed-world competence from open-world recovery.
LeRobot	Closed- vs. open-world tasks	Reuse recorded robot episodes for replay, adaptation, and regression checks.
ROS 2	Closed- vs. open-world tasks	Log deployment events and safety interventions while the environment changes.
MuJoCo	Closed- vs. open-world tasks	Inject object, contact, and dynamics variation before real deployment.
PettingZoo	Closed- vs. open-world tasks	Model open-world interaction when other agents create changing goals or hazards.

For Closed- vs. open-world tasks, the baseline and maintained-tool version should produce the same artifact schema and run on one task panel. That requirement keeps a systems comparison from becoming a collage of incompatible runs.

Write a one-paragraph task contract with observation, action, success, and failure fields.
Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
Run one deterministic smoke test and one perturbation test before scaling.
Save a single result artifact containing configuration, seed, metrics, videos or traces, and failure labels.
Compare methods only when one script evaluates them on the same task panel.

When Closed- vs. open-world tasks fails, avoid labeling the whole method as weak. First assign the failure to perception, communication, human input, memory, planning, control, timing, data coverage, safety, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

Agent Checklist Applied

The 42-agent production pass treats closed- vs. open-world tasks as a buildable system, not a definition. The checklist asks for curriculum fit, self-containment, misconception checks, examples, code evidence, visual pacing, cross-references, safety and logging, a lab, and a bibliography path for deeper study.

Cross-Reference Trail

For Closed- vs. open-world tasks, connect partial observability, exploration, memory, robustness, and evaluation through a lifelong-learning log that records what changed and how the robot noticed.

Misconception Check

A common misconception is that a larger training set automatically makes the task open-world. The diagnostic question is: what does the agent do when it knows it does not know?

Mini Lab

Build a two-panel evaluation: one familiar object and one shifted object. Report success, novelty flag, fallback action, and old-task retention together.

Memory Hook

A closed-world benchmark is a tidy kitchen; deployment is the drawer where somebody put batteries, tape, and one mysterious screw.

Technical Core

Closed- vs. open-world tasks needs a topic-native core: variables, equations or system contracts, an algorithmic procedure, an expected output, and a failure diagnosis. Figure 51.1.T summarizes the chain this section must preserve when moving from a teaching example to a real embodied system.

Figure 51.1.T: The technical core for Closed- vs. open-world tasks connects assumptions, model, algorithm, evidence, and failure analysis.

Formal Object

$f_\theta(o)\to (\hat y,\hat p),\quad \text{act if } \max_k \hat p_k \ge \tau,\quad \text{otherwise abstain and query}$

The defining difference between closed-world and open-world tasks is not model size, it is the right to abstain. In a closed-world benchmark the label space and task graph are fixed. In open-world embodiment the agent must detect when the current observation lies outside that contract and choose a safe fallback.

Open-set action gate

Train the base policy on the known task set, then calibrate confidence on held-out known data.
Create novelty panels with unseen objects, scene layouts, instructions, and interaction partners.
Choose a threshold $\tau$ that balances false alarms against unsafe confident errors.
Define the abstention action: ask, slow down, hand off, or switch to exploration mode.

Closed-World Success Versus Open-World Competence

Evaluation Item	Closed-World Reading	Open-World Reading
High success rate	Policy solves the benchmark.	Only meaningful if novelty detection is also measured.
Confidence score	Convenient ranking statistic.	Launch gate for safe action or abstention.
Replay trace	Debugging artifact.	Evidence for recovery after distribution shift.
Failure	One more bad episode.	Potential proof that the task contract was violated.

# Decide whether to act or abstain.
confidence = {"known_mug": 0.91, "unknown_object": 0.42}
threshold = 0.70

for item, score in confidence.items():
    decision = "act" if score >= threshold else "query_or_fallback"
    print(item, score, decision)

known_mug 0.91 act
unknown_object 0.42 query_or_fallback

Code Fragment 51.1.T captures the core open-world move: the policy must sometimes refuse to pretend the benchmark still applies.

The second line is the one readers should remember. Open-world embodiment is not defined by solving new cases immediately; it is defined by detecting that a new case arrived and choosing a controllable recovery path rather than an overconfident action.

Failure Mode To Test

An open-world system fails when confidence is reported but never connected to action gating. Always test whether low confidence changes behavior, because a detector that does not alter control is only a dashboard ornament.

Key Takeaway

Closed-world competence is the baseline; open-world embodiment is measured by detection, recovery, and retained skill.

Exercise 51.1.1

Design a method-matched experiment for Closed- vs. open-world tasks. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

Section References

Parisi, G. I. et al. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 2019.

Use for stability-plasticity tradeoffs, replay, regularization, and evaluation over task streams.

Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.

Use for elastic weight consolidation and the limits of parameter-importance methods.