Section 12.5: Navigation and social: Habitat 3.0, AI2-THOR / ProcTHOR

A Careful Control Loop
[Image: cartoon navigation robot choosing between a short unsafe route through people and a longer safe route through an unseen generated home.]
Figure 12.5A: A navigation score should not hide the route, the generated scene seed, or the cost paid in social safety.
Big Picture

This section asks what a navigation or social-navigation benchmark number actually measures. A task suite is useful only when its episodes, splits, seeds, metrics, wrappers, and failure labels match the embodied construct the paper claims to evaluate.

For Habitat 3.0, AI2-THOR, and ProcTHOR, benchmark discipline means freezing the environment API, task panel, success definition, seed policy, and artifact schema before comparing methods.

What This Section Builds

This section makes navigation and social benchmarks operational. It distinguishes pure navigation success from efficient navigation, rearrangement progress, and social behavior such as following a humanoid while maintaining safe distance.

The goal is to keep path metrics, scene splits, generated-house seeds, and human-interaction rules attached to the result. Without those fields, a leaderboard row can hide whether a policy learned navigation, memorized layouts, or benefited from easier generated homes.

Evidence Is The Test

Treat a Habitat 3.0, AI2-THOR, or ProcTHOR leaderboard as an instrument: it is interpretable only when the benchmark isolates the capability, fixes the protocol, and records enough context to rerun the evaluation.

Theory

Navigation benchmarks often report success and path efficiency together. If the shortest path length is $L$ and the agent path length is $P$, success weighted by path length is $S \cdot L / \max(P, L)$, where $S$ is 1 for success and 0 for failure. This prevents a policy from getting full credit for eventually reaching the goal through an inefficient route.
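
Averaged over an evaluation panel, the per-episode term above gives the aggregate score. The episode index $i$ and the episode count $N$ are notation introduced here for illustration; everything else follows the definitions in the preceding paragraph.

```latex
\[
\mathrm{SPL} \;=\; \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{L_i}{\max(P_i, L_i)}
\]
```

For example, a successful episode with $L = 5$ and $P = 12$ contributes $5/12 \approx 0.42$, while a failed episode contributes 0 regardless of how far the agent traveled.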

Social navigation adds another constraint: the robot should achieve its goal without crowding, blocking, or colliding with the humanoid or human collaborator. Habitat 3.0 makes this explicit through collaborative tasks such as social navigation and social rearrangement, while AI2-THOR and ProcTHOR emphasize interactive indoor environments and procedurally generated houses.
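
One way to make the crowding constraint measurable is to count the timesteps at which the robot is closer to the humanoid than some personal-space threshold. The sketch below is a minimal illustration under stated assumptions: the function name, the 2D positions, and the 1.0 m threshold are all hypothetical, and it does not reproduce the metric defined by Habitat 3.0 or any other suite.

```python
import math

# Hypothetical personal-space threshold in meters; a real suite defines its own rule.
SOCIAL_DISTANCE_M = 1.0


def count_social_violations(robot_xy, human_xy, threshold=SOCIAL_DISTANCE_M):
    """Count timesteps where the robot is closer than `threshold` to the human."""
    if len(robot_xy) != len(human_xy):
        raise ValueError("trajectories must have the same number of timesteps")
    return sum(
        1
        for (rx, ry), (hx, hy) in zip(robot_xy, human_xy)
        if math.hypot(rx - hx, ry - hy) < threshold
    )


# Synthetic trajectories: the robot walks along the x-axis past a nearly static human.
robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
human = [(3.0, 0.5), (2.5, 0.5), (2.5, 0.5), (2.5, 0.5)]

violations = count_social_violations(robot, human)
print(f"violations={violations} of {len(robot)} steps")  # violations=2 of 4 steps
```

Counting per timestep is only one possible rule. A protocol could instead count contiguous violation events or measure minimum distance, which is why the threshold and the counting rule belong in the frozen protocol.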

Mechanism

The mechanism is a scene-sampled path evaluation. The harness samples a house or generated layout, places the agent and target, runs the policy, measures success, path length, collisions, object-state changes, and social-distance violations, then aggregates by scene split and seed.
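
The loop described above can be sketched as follows. This is an illustration only: `run_episode` is a hypothetical stand-in that returns synthetic metrics instead of calling a simulator, and the panel entries, house IDs, and field names are placeholders.

```python
import random
from collections import defaultdict


def run_episode(scene_id: str, seed: int) -> dict:
    """Hypothetical stand-in for a simulator rollout; returns synthetic metrics."""
    rng = random.Random(f"{scene_id}:{seed}")  # deterministic per (scene, seed)
    shortest = rng.uniform(4.0, 10.0)
    return {
        "success": int(rng.random() < 0.7),
        "shortest": shortest,
        "actual": shortest * rng.uniform(1.0, 2.0),
        "collisions": rng.randint(0, 2),
        "social_violations": rng.randint(0, 3),
    }


def spl(row: dict) -> float:
    """Success weighted by path length for one episode."""
    return row["success"] * row["shortest"] / max(row["actual"], row["shortest"])


def aggregate(rows: list[dict]) -> dict:
    """Compute every metric from the same rows so none is reported in isolation."""
    n = len(rows)
    return {
        "episodes": n,
        "success_rate": sum(r["success"] for r in rows) / n,
        "mean_spl": sum(spl(r) for r in rows) / n,
        "collisions_per_episode": sum(r["collisions"] for r in rows) / n,
        "social_violations_per_episode": sum(r["social_violations"] for r in rows) / n,
    }


# Placeholder panel: (scene split, scene or generated-house ID, seed).
panel = [
    (split, scene_id, seed)
    for split, scene_id in [
        ("seen", "house_001"), ("seen", "house_002"),
        ("unseen", "house_101"), ("unseen", "house_102"),
    ]
    for seed in (0, 1)
]

# Group episodes by (split, seed), then aggregate each group.
groups = defaultdict(list)
for split, scene_id, seed in panel:
    groups[(split, seed)].append(run_episode(scene_id, seed))

summary = {key: aggregate(rows) for key, rows in groups.items()}
for (split, seed), stats in sorted(summary.items()):
    print(split, seed, stats)
```

Grouping by `(split, seed)` before aggregating keeps seed-to-seed variance and seen-versus-unseen gaps visible instead of folding them into one mean.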

Worked Example

Code Fragment 1 computes success weighted by path length for three episodes. The same pass also keeps collisions visible, because a socially unsafe route should not be celebrated as an efficient route.

# Compute navigation efficiency from the same episode table as safety.
# Success weighted by path length rewards reaching the goal efficiently,
# while collision counts keep unsafe routes visible.
episodes = [
    {"success": 1, "shortest": 6.0, "actual": 7.5, "collisions": 0},
    {"success": 1, "shortest": 5.0, "actual": 12.0, "collisions": 2},
    {"success": 0, "shortest": 8.0, "actual": 10.0, "collisions": 1},
]

spl_values = [
    ep["success"] * ep["shortest"] / max(ep["actual"], ep["shortest"])
    for ep in episodes
]
mean_spl = sum(spl_values) / len(spl_values)
total_collisions = sum(ep["collisions"] for ep in episodes)

print(f"mean_spl={mean_spl:.2f}, total_collisions={total_collisions}")
mean_spl=0.41, total_collisions=3
Code Fragment 1: The spl_values calculation rewards successful short paths and gives zero to failed episodes. The separate total_collisions count keeps social and physical safety visible when interpreting Habitat, AI2-THOR, or ProcTHOR navigation results.
Library Shortcut

The suite should provide scenes, sensors, navigation graph or physics, and task definitions. Your evaluation layer should still record scene split, generated-house seed, target type, path-length rule, collision rule, social-distance threshold, and whether human or humanoid behavior was scripted, sampled, or interactive.

Practical Recipe

  1. Choose the construct: point navigation, object navigation, rearrangement, social navigation, or social rearrangement.
  2. Freeze scene split, generated-house seed list, target sampling, sensor suite, action horizon, and path-length normalization.
  3. Report success, path efficiency, collisions, timeout rate, and social-distance violations from the same evaluation pass.
  4. For ProcTHOR-style generation, save generator version and house seeds so the panel can be reconstructed.
  5. For Habitat 3.0-style social tasks, record humanoid policy, interaction mode, and safety threshold.
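
The freezing steps in the recipe can be made checkable by hashing the protocol. The following is a minimal sketch, not an API from any of the suites: the `NavProtocol` class, its field names, and all example values are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class NavProtocol:
    """Hypothetical record of the fields the recipe asks you to freeze."""
    construct: str
    scene_split: str
    house_seeds: tuple[int, ...]
    generator_version: str
    sensor_suite: tuple[str, ...]
    max_steps: int
    success_radius_m: float
    social_distance_m: float
    humanoid_policy: str

    def fingerprint(self) -> str:
        """Short, stable hash of the protocol; any field change alters it."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# All values below are placeholders, not settings from a real benchmark.
protocol = NavProtocol(
    construct="social_navigation",
    scene_split="unseen_homes",
    house_seeds=(101, 102, 103),
    generator_version="placeholder-1.0",
    sensor_suite=("rgb", "depth"),
    max_steps=500,
    success_radius_m=0.2,
    social_distance_m=1.0,
    humanoid_policy="scripted_waypoints",
)

# A looser safety threshold is a different protocol, so the fingerprint changes.
looser = replace(protocol, social_distance_m=0.5)
print(protocol.fingerprint(), looser.fingerprint())
```

Storing the fingerprint next to every result row gives a cheap test for whether two numbers were produced under the same frozen protocol.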
Benchmark Evidence Rule

When comparing Habitat 3.0, AI2-THOR, or ProcTHOR results, compare only metrics that were co-computed in one benchmark pass with the same task panel, wrappers, seed policy, success definition, and logged failure labels.

Common Failure Mode

The common mistake is comparing navigation numbers without checking the path-length rule and scene split. A policy evaluated on familiar houses, easier generated seeds, or a looser collision threshold may outrank a better policy measured under a stricter protocol.
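
A small guard can catch this mistake before two numbers land in the same table. The sketch below assumes each result row carries its protocol fields alongside its metrics; the field names and values are hypothetical.

```python
# Hypothetical protocol fields that must match before metrics are compared.
PROTOCOL_FIELDS = (
    "scene_split",
    "seed_panel",
    "path_length_rule",
    "collision_rule",
    "success_radius_m",
)


def protocol_mismatches(row_a: dict, row_b: dict, fields=PROTOCOL_FIELDS) -> list[str]:
    """Return the protocol fields on which two result rows disagree."""
    return [name for name in fields if row_a.get(name) != row_b.get(name)]


method_a = {
    "scene_split": "unseen", "seed_panel": "panel_v1",
    "path_length_rule": "geodesic", "collision_rule": "any_contact",
    "success_radius_m": 0.2, "mean_spl": 0.51,
}
method_b = {
    "scene_split": "seen", "seed_panel": "panel_v1",
    "path_length_rule": "geodesic", "collision_rule": "sustained_contact",
    "success_radius_m": 0.2, "mean_spl": 0.58,
}

mismatches = protocol_mismatches(method_a, method_b)
if mismatches:
    print("not comparable; differing fields:", mismatches)
else:
    print("comparable under the same protocol")
```

Here `method_b` reports the higher `mean_spl`, but it was measured on seen scenes under a different collision rule, so the guard flags the pair as not comparable.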

Practical Example

A navigation team should log scene ID, generated-house seed, start and goal, shortest path length, actual path length, success, collisions, timeout, social-distance violations, humanoid behavior source, and replay path. Those fields reveal whether a method navigates robustly or benefits from familiar layouts and forgiving interaction rules.
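
A per-episode log with these fields might look like the sketch below. The `EpisodeLog` class, its field names, the example values, and the replay paths are illustrative assumptions; the log is written as JSON Lines to an in-memory buffer so the example stays self-contained.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EpisodeLog:
    """Hypothetical per-episode record covering the fields listed above."""
    scene_id: str
    house_seed: Optional[int]   # None for hand-authored scenes
    start: list[float]
    goal: list[float]
    shortest_path_m: float
    actual_path_m: float
    success: bool
    collisions: int
    timeout: bool
    social_violations: int
    humanoid_source: str        # "scripted", "sampled", or "interactive"
    replay_path: str


# Placeholder episodes; none of these values come from a real run.
logs = [
    EpisodeLog("house_101", 4217, [0.0, 0.0], [3.0, 4.0], 6.0, 7.5,
               True, 0, False, 1, "scripted", "replays/ep_0000.json"),
    EpisodeLog("house_102", 4218, [1.0, 2.0], [5.0, 2.0], 8.0, 10.0,
               False, 1, True, 0, "scripted", "replays/ep_0001.json"),
]

buffer = io.StringIO()  # swap for an open file handle when writing to disk
for log in logs:
    buffer.write(json.dumps(asdict(log)) + "\n")

# Reading the log back recovers every field needed to recompute the metrics.
restored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(restored[0]["scene_id"], restored[0]["social_violations"])  # house_101 1
```

One record per line keeps the log appendable during long evaluation runs and lets any metric be recomputed later from the raw episodes.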

Memory Hook

A robot that reaches the goal by walking through the crowd has solved the map but failed the room.

Research Frontier

The frontier is moving from static navigation to interactive homes with generated layouts, object rearrangement, and human-aware agents. The evaluation challenge is to keep efficiency, task progress, and social safety co-computed so one metric cannot hide regressions in another.

Self Check

Can you name the scene split, generated-house seeds, target sampling rule, shortest-path definition, collision rule, social-distance threshold, humanoid policy, and aggregation metric? If not, the experiment boundary is still too vague.

Navigation and social benchmarks become useful when they preserve both route quality and interaction quality. A Habitat 3.0 social-navigation result should not report only whether the robot found and followed the person. It should also report whether the robot maintained safe distance, avoided blocking, and completed the task under the same humanoid behavior model as the baseline.

The graduate-level habit is to separate map competence from social competence. AI2-THOR and ProcTHOR can stress generalization across interactive indoor scenes and generated homes. Habitat 3.0 can stress collaboration with humanoid or human behavior. A paper-facing comparison should say which competence is measured and which scene or human-behavior distribution was held out.

Navigation And Social Benchmark Audit Fields
Benchmark family | Primary construct | Protocol detail to freeze
Habitat 3.0 | Social navigation and social rearrangement with humanoid or human interaction | Humanoid behavior model, social-distance threshold, scene split, action horizon, and collaboration metric.
Habitat navigation tasks | Point, object, and embodied navigation in 3D scenes | Scene split, sensor suite, start-goal sampling, shortest-path metric, and success radius.
AI2-THOR | Interactive indoor navigation, object interaction, and rearrangement | Scene set, object states, interaction actions, horizon, and task predicate definitions.
ProcTHOR | Scale through procedurally generated houses | Generator version, house seeds, train/test generated panels, and zero-shot evaluation scenes.
Social evaluation overlays | Safety around people or humanoid agents | Collision rule, personal-space threshold, blocking definition, and human-in-the-loop condition.

A robust navigation evaluation starts with the episode panel. The panel should list each scene or generated house, start and target, shortest path length, seed, humanoid behavior if present, and the metric rule. Every method should consume that panel unchanged.

  1. Write an episode panel with scene ID, generated-house seed, start, target, and shortest path length.
  2. Freeze sensors, action space, success radius, collision rule, horizon, and social-distance threshold.
  3. Evaluate every method through one script that computes success, efficiency, collisions, and social violations.
  4. Save per-episode replays and stratify results by seen versus unseen scenes or generated houses.
  5. For social tasks, report both task completion and human-aware safety metrics.
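
The panel-first workflow in the steps above can be sketched as a check that each method consumed the panel unchanged, followed by a seen-versus-unseen breakdown. Episode IDs, house IDs, and outcomes here are made up for illustration.

```python
# Placeholder episode panel; a real panel would also carry start, target, and seed.
panel = [
    {"episode_id": "e0", "scene_id": "house_001", "split": "seen", "shortest_m": 6.0},
    {"episode_id": "e1", "scene_id": "house_002", "split": "seen", "shortest_m": 5.0},
    {"episode_id": "e2", "scene_id": "house_101", "split": "unseen", "shortest_m": 8.0},
    {"episode_id": "e3", "scene_id": "house_102", "split": "unseen", "shortest_m": 7.0},
]

# Hypothetical per-episode success flags for two methods.
outcomes = {
    "method_a": {"e0": 1, "e1": 1, "e2": 1, "e3": 0},
    "method_b": {"e0": 1, "e1": 0, "e2": 0, "e3": 0},
}


def stratified_success(panel: list[dict], method_outcomes: dict) -> dict:
    """Success rate per split, after verifying the method ran the exact panel."""
    panel_ids = {ep["episode_id"] for ep in panel}
    if set(method_outcomes) != panel_ids:
        raise ValueError("method did not consume the panel unchanged")
    totals = {}
    for ep in panel:
        hits, count = totals.get(ep["split"], (0, 0))
        totals[ep["split"]] = (hits + method_outcomes[ep["episode_id"]], count + 1)
    return {split: hits / count for split, (hits, count) in totals.items()}


report = {name: stratified_success(panel, result) for name, result in outcomes.items()}
for name, by_split in report.items():
    print(name, by_split)
# method_a {'seen': 1.0, 'unseen': 0.5}
# method_b {'seen': 0.5, 'unseen': 0.0}
```

Raising an error on a missing or extra episode is a deliberate choice in this sketch: silently dropping episodes is one way a panel drifts between methods.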

Code Fragment 2 records the navigation protocol fields that make a result replayable. The same schema works for a static navigation panel or a ProcTHOR-generated panel.

# Record a navigation result with path and social metrics together.
# Co-computing these fields prevents a method from optimizing route
# efficiency while hiding collisions or personal-space violations.
from dataclasses import dataclass, asdict

@dataclass
class NavigationResult:
    suite: str
    scene_split: str
    generated_seed_panel: str
    success_rate: float
    mean_spl: float
    social_violations_per_episode: float

    def as_row(self) -> dict[str, object]:
        return asdict(self)

result = NavigationResult(
    suite="Habitat 3.0",
    scene_split="unseen_homes",
    generated_seed_panel="proc_panel_v2",
    success_rate=0.74,
    mean_spl=0.51,
    social_violations_per_episode=0.18,
)
print(result.as_row())
{'suite': 'Habitat 3.0', 'scene_split': 'unseen_homes', 'generated_seed_panel': 'proc_panel_v2', 'success_rate': 0.74, 'mean_spl': 0.51, 'social_violations_per_episode': 0.18}
Code Fragment 2: The NavigationResult stores success, mean_spl, and social violations under one scene split and generated-seed panel. This prevents path efficiency and human-aware safety from being reported as separate, non-comparable diagnostics.

Expected output: the printed result should expose scene split, generated-seed panel, success, path efficiency, and social violations. If one field changes between methods, the comparison should stay in diagnostics rather than the paper table.

When a navigation or social experiment fails, replay the path and tag the first failure mode: wrong target, inefficient route, collision, timeout, blocked humanoid, social-distance violation, object-state miss, or generated-scene mismatch. The tag tells you whether to fix mapping, planning, interaction policy, or the evaluation panel.
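
Tagging the first failure can be automated once replays are reduced to timestamped events. The sketch below assumes a hypothetical event format (`step`, `tag`) and uses the failure labels listed above as tag names.

```python
from collections import Counter

# Tag names mirror the failure modes listed above; the event format is assumed.
FAILURE_TAGS = {
    "wrong_target", "inefficient_route", "collision", "timeout",
    "blocked_humanoid", "social_distance_violation",
    "object_state_miss", "generated_scene_mismatch",
}


def first_failure_tag(events: list[dict]):
    """Return the tag of the earliest failure event, or None if there is none."""
    failures = [event for event in events if event["tag"] in FAILURE_TAGS]
    if not failures:
        return None
    return min(failures, key=lambda event: event["step"])["tag"]


# Placeholder replays: episode ID -> events, not necessarily in time order.
replays = {
    "ep_0000": [
        {"step": 30, "tag": "collision"},
        {"step": 12, "tag": "social_distance_violation"},
        {"step": 200, "tag": "timeout"},
    ],
    "ep_0001": [],
    "ep_0002": [{"step": 5, "tag": "collision"}],
}

first_tags = {episode: first_failure_tag(events) for episode, events in replays.items()}
tag_counts = Counter(tag for tag in first_tags.values() if tag is not None)
print(first_tags)
print(tag_counts)
```

In `ep_0000` the timeout is a downstream symptom; the first tag points at the social-distance violation at step 12, which is where debugging should start.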

Key Takeaway

Navigation and social benchmarks are useful when success, path efficiency, collisions, generated-scene splits, and human-aware safety metrics are co-computed from one episode panel.

Exercise 12.5.1

Design a Habitat, AI2-THOR, or ProcTHOR comparison. Specify scene split, generated-house seeds if any, start-target sampling, success radius, path metric, collision rule, social-distance threshold, and one failure label.

What's Next?

Section 12.6 brings the chapter together by showing how to read leaderboards without mixing incompatible panels, splits, seeds, metrics, or wrappers.
