A Careful Control Loop
Household and long-horizon benchmarks such as BEHAVIOR-1K, with OmniGibson as the simulator, raise the question of what a benchmark number actually measures. A task suite is useful only when its episodes, splits, seeds, metrics, wrappers, and failure labels match the embodied construct the paper claims to evaluate.
For BEHAVIOR-1K and OmniGibson, benchmark discipline means freezing the environment API, task panel, success definition, seed policy, and artifact schema before comparing methods.
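One way to make "freezing" concrete is to hash the protocol and store the hash with every result row. The sketch below is a minimal illustration, not part of BEHAVIOR-1K or OmniGibson: the `EvalProtocol` class, its field names, and all example values are assumptions chosen to mirror the five items listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical container for settings that must not change between methods."""
    env_api_version: str            # placeholder label, not a real simulator version
    task_panel: tuple[str, ...]
    success_definition: str
    seeds: tuple[int, ...]
    artifact_schema_version: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal protocols always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


protocol = EvalProtocol(
    env_api_version="sim-api-v1",
    task_panel=("put_away_tableware",),
    success_definition="all_goal_predicates_true_at_final_step",
    seeds=(0, 1, 2),
    artifact_schema_version="v1",
)

# Changing any frozen field, here a single seed, yields a different fingerprint.
drifted = replace(protocol, seeds=(0, 1, 3))
print(protocol.fingerprint() == drifted.fingerprint())  # prints False
```

Two result tables are comparable under this scheme only when their rows carry the same fingerprint.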
What This Section Builds
This section makes household and long-horizon benchmarks operational. It focuses on tasks whose success depends on object states, room layouts, navigation, manipulation, and ordered subgoals rather than one isolated grasp.
The goal is to avoid a misleading binary score. A household policy can fail the final task while making meaningful predicate progress, or pass a task through a shortcut that ignores the intended behavior. The evaluation artifact must record both final success and subgoal evidence.
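The difference between a clean success and a shortcut can be checked mechanically once subgoal evidence is recorded. The sketch below assumes a hypothetical `first_true_step` mapping (predicate name to the first step at which it held) and an expected subgoal order; neither is an actual BEHAVIOR-1K data structure.

```python
def audit_episode(
    goal_satisfied: bool,
    expected_order: tuple[str, ...],
    first_true_step: dict[str, int],
) -> str:
    """Label an episode using final success and subgoal evidence together."""
    missing = [p for p in expected_order if p not in first_true_step]
    steps = [first_true_step[p] for p in expected_order if p in first_true_step]
    in_order = all(a <= b for a, b in zip(steps, steps[1:]))

    if goal_satisfied and (missing or not in_order):
        # The goal holds, but the intended subgoals were skipped or reordered.
        return "suspect_shortcut"
    if goal_satisfied:
        return "clean_success"
    return "partial_progress" if steps else "no_progress"


expected = ("find_mug", "open_cabinet", "place_mug", "close_cabinet")

# Goal satisfied although the cabinet was never observed open: worth a replay review.
print(audit_episode(True, expected, {"find_mug": 12, "place_mug": 80, "close_cabinet": 95}))
# prints suspect_shortcut
```

A "suspect_shortcut" label is only a prompt to inspect the replay; some tasks legitimately admit more than one subgoal order.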
For BEHAVIOR-1K and OmniGibson, treat the leaderboard as an instrument: it is interpretable only when the benchmark isolates the capability, fixes the protocol, and records rerunnable context.
Theory
A long-horizon household benchmark is a predicate sequence over a simulated home. The policy must move through rooms, manipulate objects, change states, and satisfy a goal condition such as objects being placed, cleaned, opened, or arranged. The aggregate score should therefore include final success, predicate completion, path or action efficiency, and failure point.
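As a rough illustration of such an aggregate, the sketch below summarizes a handful of made-up episodes. The episode dictionaries, their keys, and the numbers are illustrative assumptions, not benchmark output.

```python
from collections import Counter
from statistics import mean

# Illustrative episodes only; values are invented for the example.
episodes = [
    {"progress": 1.00, "final_success": True, "actions": 180, "failure_step": None},
    {"progress": 0.75, "final_success": False, "actions": 400, "failure_step": "close_cabinet"},
    {"progress": 0.25, "final_success": False, "actions": 400, "failure_step": "open_cabinet"},
    {"progress": 0.75, "final_success": False, "actions": 400, "failure_step": "close_cabinet"},
]


def summarize(episodes: list[dict]) -> dict:
    """Report success, progress, efficiency, and failure points side by side."""
    successes = [e for e in episodes if e["final_success"]]
    return {
        "success_rate": len(successes) / len(episodes),
        "mean_progress": mean(e["progress"] for e in episodes),
        # Efficiency is averaged over successes only, so timeouts do not dominate it.
        "mean_actions_on_success": mean(e["actions"] for e in successes) if successes else None,
        "failure_points": Counter(e["failure_step"] for e in episodes if not e["final_success"]),
    }


summary = summarize(episodes)
print(summary["success_rate"], summary["mean_progress"])  # prints 0.25 0.6875
```

Here a 25% success rate sits next to a mean progress of about 0.69, and the failure-point counts show that most failures occur at the final cabinet-closing step.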
The split must protect scene and object generalization. If train and test share the same layouts, object placements, or task templates, a policy may learn household shortcuts rather than robust behavior. If the same simulator seed appears in tuning and testing, the result is no longer a clean held-out household evaluation.
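The split rule can be enforced with a simple overlap check before any evaluation runs. This is a minimal sketch; the split keys and identifiers such as `home_a` are hypothetical placeholders rather than real scene names.

```python
def split_leaks(dev: dict[str, set], test: dict[str, set]) -> dict[str, set]:
    """Return every identifier shared between the development and test splits."""
    leaks = {}
    for key in dev.keys() & test.keys():
        overlap = dev[key] & test[key]
        if overlap:
            leaks[key] = overlap
    return leaks


# "dev" covers everything used for training or tuning.
dev_split = {
    "scenes": {"home_a", "home_b"},
    "task_templates": {"put_away_tableware"},
    "seeds": {0, 1, 2},
}
test_split = {
    "scenes": {"home_b", "home_c"},
    "task_templates": {"clean_countertop"},
    "seeds": {2, 3},
}

leaks = split_leaks(dev_split, test_split)
print(sorted(leaks))  # prints ['scenes', 'seeds']
```

An empty result is the precondition for calling the evaluation held-out; any non-empty entry names exactly what leaked.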
The mechanism is a staged evaluation trace: scene sampling, initial-state sampling, object-state predicates, action execution, partial-progress scoring, and final success. OmniGibson supplies rich interactive household simulation, while BEHAVIOR-1K defines human-need-inspired task suites where long horizons and object states are central to the construct.
Worked Example
Code Fragment 1 turns long-horizon evaluation into predicate accounting. The key idea is that final success and partial progress should be co-computed from the same replay, not reconstructed from separate logs.
```python
# Score household progress from a single ordered predicate trace.
# Final success and partial progress come from the same episode,
# which prevents mixing incompatible logs in one result table.
required_predicates = ("find_mug", "open_cabinet", "place_mug", "close_cabinet")
completed_predicates = ("find_mug", "open_cabinet", "place_mug")

completed = set(completed_predicates)
progress = sum(predicate in completed for predicate in required_predicates) / len(required_predicates)
final_success = all(predicate in completed for predicate in required_predicates)
print(f"progress={progress:.2f}, final_success={final_success}")
```
The required_predicates trace distinguishes partial progress from final success: the fragment prints progress=0.75, final_success=False. That distinction matters in BEHAVIOR-1K and OmniGibson because a long-horizon policy can make real household progress while still failing the last state predicate.
The maintained household suite should provide the scene, physics, task predicates, and assets. Your evaluation layer should still record the task template, scene split, object instance split, initial-state seed, predicate trace, final state, video, and failure point.
Practical Recipe
- Define the household construct: navigation plus manipulation, object-state reasoning, task planning, recovery, or full long-horizon completion.
- Freeze scene layouts, object instances, task templates, initial-state seeds, action horizon, and predicate definitions.
- Report final success together with predicate progress, failure step, action count, and recovery events.
- Stratify results by task length and scene novelty so short tasks do not hide long-horizon failures.
- Save replays that show the object-state predicates changing over time.
For BEHAVIOR-1K and OmniGibson, compare only metrics co-computed in one benchmark pass with the same task panel, wrappers, seed policy, success definition, and logged failure labels.
The common mistake is reducing a household benchmark to one final binary score. That hides whether the policy failed at search, grasping, state change, planning order, recovery, or final predicate scoring.
A household benchmark run should log scene ID, task template, object instances, initial-state seed, action horizon, predicate trace, final success, progress score, failure step, and replay path. Those fields reveal whether the method solves long-horizon household behavior or passes easier layouts and shorter tasks.
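A minimal sketch of such a log line follows, assuming a JSON Lines file with one episode per line. The field names follow the list above, while the validator, the example values, and the replay path are hypothetical.

```python
import json

REQUIRED_FIELDS = (
    "scene_id", "task_template", "object_instances", "initial_state_seed",
    "action_horizon", "predicate_trace", "final_success", "progress",
    "failure_step", "replay_path",
)


def to_jsonl_line(record: dict) -> str:
    """Serialize one episode, refusing records that omit a required field."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"episode record is missing fields: {missing}")
    return json.dumps(record, sort_keys=True)


record = {
    "scene_id": "home_c",                        # placeholder scene name
    "task_template": "put_away_tableware",
    "object_instances": ["mug_01", "cabinet_03"],
    "initial_state_seed": 7,
    "action_horizon": 400,
    "predicate_trace": {"find_mug": 12, "open_cabinet": 45, "place_mug": 80},
    "final_success": False,
    "progress": 0.75,
    "failure_step": "close_cabinet",
    "replay_path": "replays/episode_0007.json",  # hypothetical path
}
print(to_jsonl_line(record))
```

Rejecting incomplete records at write time is a deliberate choice: a missing seed or replay path is far cheaper to catch during the run than during review.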
For long-horizon tasks, the scorecard should read like a checklist on a refrigerator: which chores are complete, which one blocked progress, and whether the robot noticed.
The frontier is household evaluation with richer assets, longer horizons, and policies that combine language, perception, planning, and manipulation. The hard research problem is scoring progress without rewarding shortcuts that satisfy predicates while missing the intended household behavior.
Can you name the scene split, object split, task template split, initial-state seeds, predicate definitions, horizon, progress metric, final success rule, and failure taxonomy? If not, the experiment boundary is still too vague.
Household and long-horizon benchmarks become useful when they preserve the causal story of the episode. The policy may find the object, move it, change its state, and still fail because a cabinet remains open or a target condition is not satisfied. A single final success number loses that structure.
The graduate-level habit is to co-compute final success, progress, and failure labels from one replay. A method that improves progress on hard scenes has a meaningful result even when final completion remains hard. A method that improves final success by exploiting easy templates needs a narrower claim.
| Evidence field | Why it matters | Failure it catches |
|---|---|---|
| Scene split | Tests whether the policy works in unseen homes or layouts. | Memorized navigation routes and familiar object placements. |
| Object split | Tests whether object handling transfers across instances. | Recognition or grasp policies tuned to familiar assets. |
| Predicate trace | Shows which subgoals became true during the episode. | Binary scores that hide partial progress or shortcut behavior. |
| Failure step | Localizes where the long horizon broke. | Misdiagnosing a planning failure as a manipulation failure. |
| Replay artifact | Lets reviewers inspect state changes and recovery behavior. | Metrics that cannot be traced back to an episode. |
A robust household evaluation starts by writing the predicate schema. The schema should list the subgoals, the state variables that make each predicate true, and the time step at which each predicate is evaluated. That schema becomes the bridge between long-horizon behavior and a reproducible metric.
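One possible shape for that schema is sketched below. `PredicateSpec`, the `evaluate_at` labels, and the toy state dictionaries are assumptions made for illustration; a real suite would read these values from the simulator's object states.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, object]


@dataclass(frozen=True)
class PredicateSpec:
    name: str
    state_variables: tuple[str, ...]  # variables the check is allowed to read
    check: Callable[[State], bool]
    evaluate_at: str                  # "any_step" or "final_step"


SCHEMA = (
    PredicateSpec("open_cabinet", ("cabinet_open",), lambda s: s["cabinet_open"] is True, "any_step"),
    PredicateSpec("place_mug", ("mug_location",), lambda s: s["mug_location"] == "cabinet", "final_step"),
    PredicateSpec("close_cabinet", ("cabinet_open",), lambda s: s["cabinet_open"] is False, "final_step"),
)


def evaluate(schema: tuple[PredicateSpec, ...], states: list[State]) -> dict[str, bool]:
    """Evaluate each predicate over the time window its spec declares."""
    results = {}
    for spec in schema:
        window = states if spec.evaluate_at == "any_step" else states[-1:]
        results[spec.name] = any(spec.check(state) for state in window)
    return results


states = [
    {"cabinet_open": False, "mug_location": "counter"},
    {"cabinet_open": True, "mug_location": "counter"},
    {"cabinet_open": True, "mug_location": "cabinet"},
]
print(evaluate(SCHEMA, states))
# prints {'open_cabinet': True, 'place_mug': True, 'close_cabinet': False}
```

Declaring the evaluation step per predicate matters: in this toy trace the cabinet is closed at step 0, so evaluating close_cabinet at any step would wrongly mark it satisfied.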
- Write the predicate schema and final success rule before running policies.
- Freeze scene, object, task-template, and initial-state splits.
- Run every method with the same horizon, action interface, and predicate evaluator.
- Save per-predicate completion, final success, failure step, action count, and replay path.
- Aggregate by task length and scene novelty, not only by a global mean.
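The last step of the checklist can be sketched as a small grouping routine. The rows and the "short"/"long" and "seen"/"unseen" labels are invented for the example.

```python
from collections import defaultdict

# Illustrative per-episode rows; not benchmark data.
rows = [
    {"task_length": "short", "scene": "seen", "final_success": True},
    {"task_length": "short", "scene": "seen", "final_success": True},
    {"task_length": "long", "scene": "seen", "final_success": True},
    {"task_length": "long", "scene": "seen", "final_success": False},
    {"task_length": "long", "scene": "unseen", "final_success": False},
    {"task_length": "long", "scene": "unseen", "final_success": False},
]


def stratified_success(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Success rate per (task length, scene novelty) stratum."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["task_length"], row["scene"])].append(row["final_success"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


global_mean = sum(row["final_success"] for row in rows) / len(rows)
strata = stratified_success(rows)
print(global_mean)                 # prints 0.5
print(strata[("long", "unseen")])  # prints 0.0
```

The global mean of 0.5 hides the fact that every long task in an unseen scene failed, which is exactly the case the stratified report is meant to expose.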
Code Fragment 2 records a household evaluation result with both progress and final success. This lets the paper say exactly what improved.
```python
# Record household evidence with scene split and predicate progress.
# Long-horizon results need the failure step and replay path because
# final success alone does not explain where the episode broke.
from dataclasses import dataclass, asdict


@dataclass
class HouseholdResult:
    suite: str
    scene_split: str
    task_template: str
    progress: float
    final_success: bool
    failure_step: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)


result = HouseholdResult(
    suite="BEHAVIOR-1K",
    scene_split="unseen_homes",
    task_template="put_away_tableware",
    progress=0.75,
    final_success=False,
    failure_step="close_cabinet",
)
print(result.as_row())
```
HouseholdResult ties final success to a scene split, task template, progress value, and failure step. Those fields let a reader distinguish an almost-complete long-horizon rollout from a short failure with the same binary outcome.
Expected output: the printed result should expose scene split, task template, progress, final success, and failure step. Without those fields, a household benchmark hides the structure of the episode.
When a household experiment fails, inspect the first unsatisfied predicate. If the robot never found the object, test perception and navigation. If it found the object but could not change its state, test manipulation and physics. If it changed the object state but missed the final predicate, test the evaluator and task definition.
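That triage rule can be written down as a lookup on the first unsatisfied predicate. The mapping from predicates to subsystems below is an assumption for the mug-and-cabinet example, not a fixed taxonomy.

```python
from typing import Optional

# Hypothetical mapping from each predicate to the subsystem to inspect first.
STAGE_OF = {
    "find_mug": "perception_and_navigation",
    "open_cabinet": "manipulation_and_physics",
    "place_mug": "manipulation_and_physics",
    "close_cabinet": "manipulation_and_physics",
}


def triage(
    required: tuple[str, ...],
    completed: set[str],
    final_success: bool,
) -> tuple[Optional[str], Optional[str]]:
    """Return (first unsatisfied predicate, subsystem to test first)."""
    for predicate in required:
        if predicate not in completed:
            return predicate, STAGE_OF.get(predicate, "unclassified")
    if not final_success:
        # Every subgoal held, yet the episode was scored as a failure.
        return None, "evaluator_and_task_definition"
    return None, None


required = ("find_mug", "open_cabinet", "place_mug", "close_cabinet")
print(triage(required, {"find_mug", "open_cabinet", "place_mug"}, final_success=False))
# prints ('close_cabinet', 'manipulation_and_physics')
```

The result is a starting point for inspection, since a manipulation failure can still have an upstream cause such as a poor approach pose.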
Household and long-horizon benchmarks are useful when final success, predicate progress, scene novelty, object novelty, and failure points are saved from the same replay artifact.
Choose a household task and write its predicate schema. Specify scene split, object split, initial-state seeds, final success rule, progress metric, and the failure labels you would attach to replays.
Section 12.5 turns from household predicates to navigation and social interaction, where path efficiency and safety must be measured together.
ManiSkill Contributors. "ManiSkill Documentation."
ManiSkill provides manipulation tasks, demonstrations, GPU-parallel workflows, and documentation for robot-learning experiments. It is relevant when this section asks how benchmark design turns simulator capability into comparable evidence. Readers should connect this source to household and long-horizon evaluation with BEHAVIOR-1K and OmniGibson when deciding what is reusable, what is benchmark-specific, and what must be remeasured.
RoboCasa Team. "RoboCasa Documentation."
RoboCasa documents everyday manipulation tasks and simulation assets, including the 2024 release lineage and later RoboCasa365 expansion. Readers should use it to study how task diversity and environment generation affect benchmark claims.
Mandlekar, A. et al. "robomimic Documentation."
robomimic provides datasets and algorithms for learning from demonstrations. It matters here because benchmark evaluation often depends as much on dataset format and split discipline as on simulator physics.
James, S. et al. (2019). "RLBench: The Robot Learning Benchmark and Learning Environment." arXiv.
RLBench frames a large set of vision-guided manipulation tasks with demonstrations and task variation. It is useful for readers studying few-shot, multi-task, and manipulation benchmark design.
Stanford Vision and Learning Lab. "BEHAVIOR-1K."
BEHAVIOR-1K grounds household embodied AI tasks in human needs and long-horizon mobile manipulation. It gives benchmark designers a concrete example of task suites that go beyond isolated tabletop success rates.