A Careful Control Loop
This section asks what a navigation or social benchmark number from Habitat 3.0, AI2-THOR, or ProcTHOR actually measures. A task suite is useful only when its episodes, splits, seeds, metrics, wrappers, and failure labels match the embodied construct the paper claims to evaluate.
For these suites, benchmark discipline means freezing the environment API, task panel, success definition, seed policy, and artifact schema before comparing methods.
What This Section Builds
This section makes navigation and social benchmarks operational. It distinguishes pure navigation success from efficient navigation, rearrangement progress, and social behavior such as following a humanoid while maintaining safe distance.
The goal is to keep path metrics, scene splits, generated-house seeds, and human-interaction rules attached to the result. Without those fields, a leaderboard row can hide whether a policy learned navigation, memorized layouts, or benefited from easier generated homes.
Treat the leaderboard as an instrument: it is interpretable only when the benchmark isolates the capability, fixes the protocol, and records enough context to rerun the evaluation.
Theory
Navigation benchmarks often report success and path efficiency together. If the shortest path length is $L$ and the agent path length is $P$, the per-episode success weighted by path length (SPL) is $S \cdot L / \max(P, L)$, where $S$ is 1 for success and 0 for failure. A failed episode scores zero, and a successful one scores less than one whenever the agent's path is longer than the shortest path, so a policy does not get full credit for eventually reaching the goal through an inefficient route.
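Averaged over a panel, the per-episode quantity becomes the reported SPL. The expression below uses the same symbols as the text; the episode index $i$ and the panel size $N$ are notation added here for illustration.

```latex
\[
\mathrm{SPL} \;=\; \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{L_i}{\max(P_i,\, L_i)}
\]
```

As a quick check, a successful episode with $L_i = 6$ and $P_i = 7.5$ contributes $6 / 7.5 = 0.8$, while a failed episode contributes $0$ whatever its path length.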
Social navigation adds another constraint: the robot should achieve its goal without crowding, blocking, or colliding with the humanoid or human collaborator. Habitat 3.0 makes this explicit through collaborative tasks such as social navigation and social rearrangement, while AI2-THOR and ProcTHOR emphasize interactive indoor environments and procedurally generated houses.
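One way to make the crowding constraint measurable is to count timesteps at which the robot is closer to the humanoid than a chosen threshold. The sketch below assumes synchronized 2D positions and a 1.0 m threshold; the function name, the threshold, and the per-timestep counting rule are illustrative assumptions, not the Habitat 3.0 metric definition.

```python
import math

def count_social_violations(robot_xy, humanoid_xy, threshold_m):
    """Count timesteps where the planar robot-humanoid distance is below threshold_m.

    Assumption: both trajectories are sampled at the same timesteps.
    """
    if len(robot_xy) != len(humanoid_xy):
        raise ValueError("trajectories must have the same number of timesteps")
    violations = 0
    for (rx, ry), (hx, hy) in zip(robot_xy, humanoid_xy):
        if math.hypot(rx - hx, ry - hy) < threshold_m:
            violations += 1
    return violations

# Hypothetical four-step trajectories, positions in meters.
robot = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0)]
humanoid = [(3.0, 0.0), (2.5, 0.0), (1.6, 0.0), (1.8, 0.0)]

violations = count_social_violations(robot, humanoid, threshold_m=1.0)
print(violations)  # 2: the last two timesteps are closer than 1.0 m
```

Whatever rule is chosen, the threshold and the counting convention belong in the frozen protocol, because a looser threshold lowers the violation count without changing the policy.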
The mechanism is a scene-sampled path evaluation. The harness samples a house or generated layout, places the agent and target, runs the policy, measures success, path length, collisions, object-state changes, and social-distance violations, then aggregates by scene split and seed.
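The aggregation step at the end of that pipeline can be sketched as follows. The record fields and the (split, seed) grouping key are assumptions chosen to mirror the description above; a real harness would fill these records from simulator rollouts rather than a hand-written list.

```python
from collections import defaultdict

def episode_spl(success, shortest, actual):
    """Per-episode success weighted by path length (shortest must be > 0)."""
    return success * shortest / max(actual, shortest)

def aggregate(records):
    """Group episode records by (split, seed) and average the metrics."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["split"], rec["seed"])].append(rec)
    summary = {}
    for key, eps in sorted(groups.items()):
        n = len(eps)
        summary[key] = {
            "episodes": n,
            "success_rate": sum(e["success"] for e in eps) / n,
            "mean_spl": sum(
                episode_spl(e["success"], e["shortest"], e["actual"]) for e in eps
            ) / n,
            "collisions_per_episode": sum(e["collisions"] for e in eps) / n,
        }
    return summary

# Hypothetical episode records; values are illustrative only.
records = [
    {"split": "seen", "seed": 0, "success": 1, "shortest": 4.0, "actual": 4.0, "collisions": 0},
    {"split": "seen", "seed": 0, "success": 1, "shortest": 6.0, "actual": 8.0, "collisions": 1},
    {"split": "unseen", "seed": 0, "success": 0, "shortest": 5.0, "actual": 9.0, "collisions": 2},
    {"split": "unseen", "seed": 1, "success": 1, "shortest": 3.0, "actual": 6.0, "collisions": 0},
]

summary = aggregate(records)
for key, row in summary.items():
    print(key, row)
```

Keeping the group key in the output makes it obvious when a headline number is dominated by one split or one seed.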
Worked Example
Code Fragment 1 computes success weighted by path length for three episodes. The same pass also keeps collisions visible, because a socially unsafe route should not be celebrated as an efficient route.
# Compute navigation efficiency from the same episode table as safety.
# Success weighted by path length rewards reaching the goal efficiently,
# while collision counts keep unsafe routes visible.
episodes = [
    {"success": 1, "shortest": 6.0, "actual": 7.5, "collisions": 0},
    {"success": 1, "shortest": 5.0, "actual": 12.0, "collisions": 2},
    {"success": 0, "shortest": 8.0, "actual": 10.0, "collisions": 1},
]
spl_values = [
    ep["success"] * ep["shortest"] / max(ep["actual"], ep["shortest"])
    for ep in episodes
]
mean_spl = sum(spl_values) / len(spl_values)
total_collisions = sum(ep["collisions"] for ep in episodes)
print(f"mean_spl={mean_spl:.2f}, total_collisions={total_collisions}")
The spl_values calculation rewards successful short paths and gives zero to failed episodes. For these three episodes the script prints mean_spl=0.41, total_collisions=3. The separate total_collisions count keeps social and physical safety visible when interpreting Habitat, AI2-THOR, or ProcTHOR navigation results.

The suite should provide scenes, sensors, a navigation graph or physics, and task definitions. Your evaluation layer should still record scene split, generated-house seed, target type, path-length rule, collision rule, social-distance threshold, and whether human or humanoid behavior was scripted, sampled, or interactive.
Practical Recipe
- Choose the construct: point navigation, object navigation, rearrangement, social navigation, or social rearrangement.
- Freeze scene split, generated-house seed list, target sampling, sensor suite, action horizon, and path-length normalization.
- Report success, path efficiency, collisions, timeout rate, and social-distance violations from the same evaluation pass.
- For ProcTHOR-style generation, save generator version and house seeds so the panel can be reconstructed.
- For Habitat 3.0-style social tasks, record humanoid policy, interaction mode, and safety threshold.
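The freezing steps in this recipe can be sketched as an immutable protocol object with a short fingerprint, so that every result row can state which protocol produced it. The class name, field names, and all values below are hypothetical; they follow the recipe items, not any suite's configuration API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class NavProtocol:
    construct: str
    scene_split: str
    house_seeds: tuple
    generator_version: str
    sensor_suite: tuple
    action_horizon: int
    path_length_rule: str
    humanoid_policy: str
    social_distance_m: float

    def fingerprint(self) -> str:
        """Short hash of every frozen field, with keys sorted for stability."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Illustrative values only.
baseline = NavProtocol(
    construct="social_navigation",
    scene_split="unseen_homes",
    house_seeds=(11, 12, 13),
    generator_version="gen-v1",
    sensor_suite=("rgb", "depth"),
    action_horizon=500,
    path_length_rule="geodesic",
    humanoid_policy="scripted",
    social_distance_m=1.0,
)

# Changing a single field, here the safety threshold, changes the fingerprint.
looser = replace(baseline, social_distance_m=0.5)
same_protocol = baseline.fingerprint() == looser.fingerprint()
print(baseline.fingerprint(), looser.fingerprint(), same_protocol)
```

Because the dataclass is frozen, a protocol cannot be edited in place between runs; a changed protocol is a new object with a new fingerprint.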
Compare only metrics that were co-computed in one benchmark pass with the same task panel, wrappers, seed policy, success definition, and logged failure labels.
The common mistake is comparing navigation numbers without checking the path-length rule and scene split. A policy evaluated on familiar houses, easier generated seeds, or a looser collision threshold may outrank a better policy measured under a stricter protocol.
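A small guard can catch this mistake before two rows reach the same table. The sketch below lists the protocol fields on which two result rows disagree; the field names and row values are hypothetical.

```python
# Fields that must match before two results are compared (illustrative list).
PROTOCOL_FIELDS = ("scene_split", "seed_panel", "path_length_rule", "collision_rule")

def protocol_mismatches(row_a, row_b, fields=PROTOCOL_FIELDS):
    """Return the protocol fields on which two result rows disagree."""
    return [name for name in fields if row_a.get(name) != row_b.get(name)]

baseline = {
    "method": "baseline",
    "scene_split": "unseen_homes",
    "seed_panel": "panel_a",
    "path_length_rule": "geodesic",
    "collision_rule": "any_contact",
    "mean_spl": 0.48,
}
candidate = {
    "method": "candidate",
    "scene_split": "seen_homes",
    "seed_panel": "panel_a",
    "path_length_rule": "geodesic",
    "collision_rule": "contact_above_force_threshold",
    "mean_spl": 0.55,
}

mismatches = protocol_mismatches(baseline, candidate)
if mismatches:
    print("not comparable, differing fields:", mismatches)
else:
    print("comparable")
```

Here the candidate's higher mean_spl was measured on seen homes under a looser collision rule, so the two rows do not support a ranking.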
A navigation team should log scene ID, generated-house seed, start and goal, shortest path length, actual path length, success, collisions, timeout, social-distance violations, humanoid behavior source, and replay path. Those fields reveal whether a method navigates robustly or benefits from familiar layouts and forgiving interaction rules.
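A minimal sketch of such a log, assuming one JSON object per line and field names taken from the list above. The class name, units, and example values are illustrative, and the in-memory buffer stands in for a log file.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class EpisodeLog:
    scene_id: str
    house_seed: Optional[int]      # None for hand-authored scenes
    start: List[float]
    goal: List[float]
    shortest_path_m: float
    actual_path_m: float
    success: bool
    collisions: int
    timeout: bool
    social_distance_violations: int
    humanoid_behavior_source: str  # e.g. "scripted", "sampled", "interactive"
    replay_path: str

log = EpisodeLog(
    scene_id="house_0007",
    house_seed=7,
    start=[0.0, 0.0],
    goal=[4.0, 3.0],
    shortest_path_m=5.0,
    actual_path_m=6.2,
    success=True,
    collisions=0,
    timeout=False,
    social_distance_violations=1,
    humanoid_behavior_source="scripted",
    replay_path="replays/house_0007_ep0.json",
)

# Write one JSON line per episode, then read it back to confirm it round-trips.
buffer = io.StringIO()
buffer.write(json.dumps(asdict(log), sort_keys=True) + "\n")
restored = json.loads(buffer.getvalue().splitlines()[0])
print(restored["scene_id"], restored["social_distance_violations"])
```

One line per episode keeps the log appendable during long evaluation runs and easy to filter by scene or seed afterwards.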
A robot that reaches the goal by walking through the crowd has solved the map but failed the room.
The frontier is moving from static navigation to interactive homes with generated layouts, object rearrangement, and human-aware agents. The evaluation challenge is to keep efficiency, task progress, and social safety co-computed so one metric cannot hide regressions in another.
Can you name the scene split, generated-house seeds, target sampling rule, shortest-path definition, collision rule, social-distance threshold, humanoid policy, and aggregation metric? If not, the experiment boundary is still too vague.
Navigation and social benchmarks become useful when they preserve both route quality and interaction quality. A Habitat 3.0 social-navigation result should not report only whether the robot found and followed the person. It should also report whether the robot maintained safe distance, avoided blocking, and completed the task under the same humanoid behavior model as the baseline.
The graduate-level habit is to separate map competence from social competence. AI2-THOR and ProcTHOR can stress generalization across interactive indoor scenes and generated homes. Habitat 3.0 can stress collaboration with humanoid or human behavior. A paper-facing comparison should say which competence is measured and which scene or human-behavior distribution was held out.
| Benchmark family | Primary construct | Protocol detail to freeze |
|---|---|---|
| Habitat 3.0 | Social navigation and social rearrangement with humanoid or human interaction | Humanoid behavior model, social-distance threshold, scene split, action horizon, and collaboration metric. |
| Habitat navigation tasks | Point, object, and embodied navigation in 3D scenes | Scene split, sensor suite, start-goal sampling, shortest-path metric, and success radius. |
| AI2-THOR | Interactive indoor navigation, object interaction, and rearrangement | Scene set, object states, interaction actions, horizon, and task predicate definitions. |
| ProcTHOR | Scale through procedurally generated houses | Generator version, house seeds, train/test generated panels, and zero-shot evaluation scenes. |
| Social evaluation overlays | Safety around people or humanoid agents | Collision rule, personal-space threshold, blocking definition, and human-in-the-loop condition. |
A robust navigation evaluation starts with the episode panel. The panel should list each scene or generated house, start and target, shortest path length, seed, humanoid behavior if present, and the metric rule. Every method should consume that panel unchanged.
- Write an episode panel with scene ID, generated-house seed, start, target, and shortest path length.
- Freeze sensors, action space, success radius, collision rule, horizon, and social-distance threshold.
- Evaluate every method through one script that computes success, efficiency, collisions, and social violations.
- Save per-episode replays and stratify results by seen versus unseen scenes or generated houses.
- For social tasks, report both task completion and human-aware safety metrics.
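The first and fourth steps can be sketched together: check that a method's results cover exactly the frozen panel, then stratify success by seen versus unseen scenes. The panel layout, field names, and values below are hypothetical.

```python
from collections import defaultdict

# Frozen episode panel (illustrative values).
panel = [
    {"episode_id": "ep0", "scene_id": "house_01", "house_seed": 1, "split": "seen", "shortest": 4.0},
    {"episode_id": "ep1", "scene_id": "house_02", "house_seed": 2, "split": "seen", "shortest": 6.0},
    {"episode_id": "ep2", "scene_id": "house_90", "house_seed": 90, "split": "unseen", "shortest": 5.0},
    {"episode_id": "ep3", "scene_id": "house_91", "house_seed": 91, "split": "unseen", "shortest": 3.0},
]

# Per-episode outcomes from one method, keyed by the panel's episode IDs.
results = [
    {"episode_id": "ep0", "success": 1},
    {"episode_id": "ep1", "success": 1},
    {"episode_id": "ep2", "success": 0},
    {"episode_id": "ep3", "success": 1},
]

def check_panel_coverage(panel, results):
    """Report episodes the method skipped or added relative to the panel."""
    panel_ids = {ep["episode_id"] for ep in panel}
    result_ids = {r["episode_id"] for r in results}
    return {
        "missing": sorted(panel_ids - result_ids),
        "extra": sorted(result_ids - panel_ids),
    }

def success_by_split(panel, results):
    """Average success per split, using the panel as the source of split labels."""
    split_of = {ep["episode_id"]: ep["split"] for ep in panel}
    outcomes = defaultdict(list)
    for r in results:
        outcomes[split_of[r["episode_id"]]].append(r["success"])
    return {split: sum(vals) / len(vals) for split, vals in sorted(outcomes.items())}

coverage = check_panel_coverage(panel, results)
by_split = success_by_split(panel, results)
print(coverage)
print(by_split)
```

Reading the split label from the panel, not from the method's own output, prevents a method from relabeling episodes.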
Code Fragment 2 records the navigation protocol fields that make a result replayable. The same schema works for a static navigation panel or a ProcTHOR-generated panel.
# Record a navigation result with path and social metrics together.
# Co-computing these fields prevents a method from optimizing route
# efficiency while hiding collisions or personal-space violations.
from dataclasses import dataclass, asdict

@dataclass
class NavigationResult:
    suite: str
    scene_split: str
    generated_seed_panel: str
    success_rate: float
    mean_spl: float
    social_violations_per_episode: float

    def as_row(self) -> dict[str, object]:
        return asdict(self)

result = NavigationResult(
    suite="Habitat 3.0",
    scene_split="unseen_homes",
    generated_seed_panel="proc_panel_v2",
    success_rate=0.74,
    mean_spl=0.51,
    social_violations_per_episode=0.18,
)
print(result.as_row())
NavigationResult stores success_rate, mean_spl, and social violations under one scene split and generated-seed panel. This prevents path efficiency and human-aware safety from being reported as separate, non-comparable diagnostics.

Expected output: the printed row should expose the suite, scene split, generated-seed panel, success rate, path efficiency, and social violations. If any protocol field differs between methods, the comparison should stay in diagnostics and not enter the paper table.
When a navigation or social experiment fails, replay the path and tag the first failure mode: wrong target, inefficient route, collision, timeout, blocked humanoid, social-distance violation, object-state miss, or generated-scene mismatch. The tag tells you whether to fix mapping, planning, interaction policy, or the evaluation panel.
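The tagging step can be sketched as a function that scans a replay's event list and returns the earliest event whose tag is a known failure label. The event format, a list of (timestep, tag) pairs, and the tag spellings are assumptions made for illustration.

```python
from collections import Counter

# Failure labels from the list above, spelled as identifiers.
FAILURE_TAGS = {
    "wrong_target", "inefficient_route", "collision", "timeout",
    "blocked_humanoid", "social_distance_violation",
    "object_state_miss", "generated_scene_mismatch",
}

def first_failure(events):
    """Return the tag of the earliest failure event, or None if there is none.

    events: iterable of (timestep, tag) pairs from a replay, in any order.
    """
    failures = [(step, tag) for step, tag in events if tag in FAILURE_TAGS]
    if not failures:
        return None
    return min(failures, key=lambda item: item[0])[1]

# Hypothetical replays: each episode maps to its logged events.
replays = {
    "ep0": [(12, "social_distance_violation"), (40, "collision"), (500, "timeout")],
    "ep1": [(500, "timeout")],
    "ep2": [(3, "goal_visible"), (80, "goal_reached")],
    "ep3": [(95, "collision"), (20, "blocked_humanoid")],
}

tags = {ep: first_failure(events) for ep, events in replays.items()}
tally = Counter(tag for tag in tags.values() if tag is not None)
print(tags)
print(tally.most_common())
```

Tagging only the first failure keeps downstream symptoms, such as a timeout that follows a blocked humanoid, from inflating the wrong category.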
Navigation and social benchmarks are useful when success, path efficiency, collisions, generated-scene splits, and human-aware safety metrics are co-computed from one episode panel.
Design a Habitat, AI2-THOR, or ProcTHOR comparison. Specify scene split, generated-house seeds if any, start-target sampling, success radius, path metric, collision rule, social-distance threshold, and one failure label.
Section 12.6 brings the chapter together by showing how to read leaderboards without mixing incompatible panels, splits, seeds, metrics, or wrappers.