Section 30.6: Language- and image-goal navigation | Building Embodied AI: From Perception to Autonomous Action

"A plan is only smart if the wheels, floor, people, and clock all agree to it."
A Local Planner With Commitment Issues

Figure 30.6.1: The navigation loop turns goals into candidate motions, filters them through constraints, and publishes only commands the robot can execute.

Big Picture

Language- and image-goal navigation turns maps and goals into constrained motion. The planner is not searching for a pretty line; it is choosing a feasible commitment under geometry, dynamics, uncertainty, moving obstacles, and recovery rules.

Problem First

A command such as "find the red mug" is not a coordinate. The robot must ground a linguistic or visual goal into possible regions, objects, viewpoints, and stopping conditions.

Goal grounding adds an interpretation layer before planning. Language, image, and object detections produce candidate goals with confidence. The navigation stack then plans to viewpoints that can verify the goal rather than blindly driving to the nearest semantic guess.
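The interpretation layer described above can be sketched in a few lines. This is a minimal illustration, not a real detector pipeline: the `GoalCandidate` type, the `ground` function, the term-overlap fusion rule, and every detection value are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalCandidate:
    label: str          # what the (hypothetical) detector reported
    region: str         # hypothetical semantic-map region name
    confidence: float   # fused grounding confidence in [0, 1]
    viewpoints: tuple   # (x, y) poses from which the object could be verified


def ground(query_terms, detections, min_confidence=0.3):
    """Return candidates that share at least one term with the query, best first."""
    out = []
    for det in detections:
        overlap = len(set(query_terms) & set(det["label"].split()))
        if overlap == 0:
            continue
        # Assumed fusion rule: fraction of query terms matched times detector score.
        conf = round((overlap / len(query_terms)) * det["score"], 3)
        if conf >= min_confidence:
            out.append(GoalCandidate(det["label"], det["region"], conf,
                                     tuple(det["viewpoints"])))
    return sorted(out, key=lambda c: c.confidence, reverse=True)


# Illustrative detections only; a real system would read these from perception.
detections = [
    {"label": "red mug", "region": "kitchen_counter", "score": 0.8, "viewpoints": [(2.0, 1.0)]},
    {"label": "blue mug", "region": "desk", "score": 0.9, "viewpoints": [(5.0, 3.0)]},
    {"label": "red book", "region": "shelf", "score": 0.7, "viewpoints": []},
]
candidates = ground(["red", "mug"], detections)
for c in candidates:
    status = "verifiable" if c.viewpoints else "no viewpoint"
    print(c.label, c.region, c.confidence, status)
```

The full match ranks first, and the partial matches stay in the list with lower confidence. The shelf candidate has no viewpoint, so a planner that insists on verification would have to find one or skip it.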

Feasibility Before Beauty

The best-looking route is not the best robot plan unless the controller can track it, the costmap reflects current hazards, and replanning has a defined trigger. Navigation quality is measured by executed motion, not only by path length.

Formal Model

Most navigation methods can be read as constrained search or optimization:

$$ g^*=\arg\max_g p(g\mid \text{language},\text{image},m),\quad \pi^*=\arg\min_\pi C(\pi,g^*) $$

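Here $g$ ranges over candidate goals, $m$ is the map, $\pi$ is a candidate plan, and $C(\pi, g^*)$ is the cost of executing $\pi$ toward the selected goal. The cost term names what the robot wants. The constraints, which restrict the set of plans the minimization may search, name what reality permits: collision clearance, velocity and acceleration limits, curvature bounds, kinodynamic feasibility, perception confidence, and safety monitors.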
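Read literally, the formal model is a two-stage selection: pick the most probable goal, then pick the cheapest plan that survives the constraints. The sketch below assumes made-up posteriors, plans, and limits (`MIN_CLEARANCE`, `MAX_SPEED`) to show the constraint filter sitting in front of the cost minimization.

```python
# Hypothetical goal posteriors p(g | language, image, m).
posterior = {"kitchen_counter": 0.82, "desk": 0.76, "shelf": 0.91}

# Hypothetical plans per goal: (name, cost, min clearance in m, peak speed in m/s).
plans = {
    "shelf": [
        ("hallway", 15.0, 0.40, 0.8),
        ("shortcut", 11.0, 0.10, 0.8),  # cheaper, but violates clearance
        ("sprint", 12.0, 0.40, 1.6),    # cheaper, but violates the speed limit
    ],
}

MIN_CLEARANCE = 0.25  # assumed collision-clearance constraint, metres
MAX_SPEED = 1.0       # assumed velocity limit, metres per second


def feasible(plan):
    """A plan is admissible only if it satisfies every constraint."""
    _, _, clearance, speed = plan
    return clearance >= MIN_CLEARANCE and speed <= MAX_SPEED


# Stage 1: g* = argmax over goals of the posterior.
g_star = max(posterior, key=posterior.get)

# Stage 2: pi* = argmin of cost over the feasible plans toward g*.
admissible = [p for p in plans[g_star] if feasible(p)]
pi_star = min(admissible, key=lambda p: p[1])

print(g_star, pi_star[0], pi_star[1])
```

The two cheaper plans are rejected before cost is compared, so the selected plan is the most expensive of the three.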

Algorithm: Section 30.6 Planning Loop

1. Parse or embed the goal into object, room, relation, or image-match constraints.
2. Generate candidate map targets with confidence and observability requirements.
3. Plan to a viewpoint that can verify the goal.
4. Stop only when perception evidence satisfies the goal predicate.
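The loop above can be sketched as a function that visits candidates in confidence order and refuses to stop on an unverified guess. The function name `run_goal_loop`, the `observe` callback (standing in for planning, driving, and perceiving), the 0.8 threshold, and all scores are assumptions for illustration.

```python
def run_goal_loop(candidates, observe, threshold=0.8, max_visits=3):
    """Visit candidate viewpoints in confidence order; stop only on verified evidence."""
    log = []
    ordered = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    for cand in ordered[:max_visits]:
        if cand["viewpoint"] is None:
            # No pose from which the goal could be checked: do not drive there blindly.
            log.append((cand["place"], "skipped: not observable"))
            continue
        evidence = observe(cand["viewpoint"])  # stand-in for plan + drive + perceive
        if evidence >= threshold:
            log.append((cand["place"], "verified"))
            return cand["place"], log
        log.append((cand["place"], "rejected"))
    return None, log


# Hypothetical close-range verification scores, keyed by viewpoint.
close_range_evidence = {(9, 2): 0.35, (4, 1): 0.92}
candidates = [
    {"place": "shelf", "confidence": 0.91, "viewpoint": None},
    {"place": "kitchen_counter", "confidence": 0.82, "viewpoint": (9, 2)},
    {"place": "desk", "confidence": 0.76, "viewpoint": (4, 1)},
]
found, log = run_goal_loop(candidates, lambda vp: close_range_evidence[vp])
print(found)
for entry in log:
    print(entry)
```

In this run the highest-confidence candidate is skipped because it cannot be observed, the second is rejected at close range, and the loop stops at the desk only after the evidence crosses the threshold.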

Worked Diagnostic

Code Fragment 1 isolates the goal-selection idea in a tiny runnable example. The aim is not to replace Nav2 or OMPL but to make the trade-off between semantic confidence and travel cost visible before the full stack absorbs it.

# Choose a semantic goal candidate with confidence and travel cost.
# The best target balances semantic match with route cost.
COST_WEIGHT = 0.03  # match units given up per unit of route cost
candidates = [
    {"place": "kitchen_counter", "match": 0.82, "cost": 9.0},
    {"place": "desk", "match": 0.76, "cost": 4.0},
    {"place": "shelf", "match": 0.91, "cost": 15.0},
]
# Score = match - COST_WEIGHT * cost, rounded so float noise stays out of the printout.
ranked = sorted(
    ((round(c["match"] - COST_WEIGHT * c["cost"], 2), c["place"]) for c in candidates),
    reverse=True,
)
print(ranked[0])

(0.64, 'desk')

Expected output interpretation. The desk wins with a score of 0.76 - 0.03 × 4 = 0.64 because its short route outweighs a slightly weaker semantic match. The kitchen counter scores 0.55, and the shelf, despite the strongest match, scores only 0.46 once its travel cost is charged. The output should be interpreted as a grounded-goal decision, not a path plan yet: the robot has selected the destination hypothesis with the best match-to-cost balance, and only then should route generation and viewpoint verification begin.

Code Fragment 1: The scoring rule chooses a goal candidate by combining semantic confidence and route cost. Real systems also include viewpoint quality, uncertainty, and a stopping verifier.

Tool Workflow

Library Shortcut

CLIP-style embeddings, open-vocabulary detectors, semantic maps, Habitat simulators, and Nav2 planners can be joined into a language-goal navigation pipeline. The library shortcut handles perception embeddings and navigation execution while the book's evidence contract preserves auditability.

Keep the small implementation as a regression test. Use the maintained stack for maps, costmaps, behavior trees, controllers, plugins, simulation replay, and deployment telemetry.
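The embedding-matching step in such a pipeline reduces to a nearest-neighbor lookup by cosine similarity. The sketch below uses three-dimensional made-up vectors in place of real CLIP-style embeddings, and the `semantic_map` dictionary is a hypothetical stand-in for a map that stores one embedding per place. No real library API is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors only; a real encoder would produce hundreds of dimensions.
query = [1.0, 0.0, 1.0]
semantic_map = {
    "kitchen_counter": [1.0, 0.0, 1.0],
    "desk": [1.0, 1.0, 0.0],
    "shelf": [0.0, 1.0, 0.0],
}

scores = {place: cosine(query, emb) for place, emb in semantic_map.items()}
best = max(scores, key=scores.get)
for place, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(place, round(score, 2))
print("best match:", best)
```

The similarity is only a match score. It still has to be combined with route cost and checked by a stopping verifier before the robot treats it as a goal.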

Failure Mode To Test

Every planner in this chapter should be replayed with blocked corridors, moving obstacles, localization jumps, stale costmaps, actuator saturation, and recovery failure. A plan that only works in a clean static grid is a sketch, not an embodied system.
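A blocked-corridor replay can be prototyped on a toy grid before moving to a simulator. The sketch below uses breadth-first search as a stand-in for whatever global planner is under test. The grid, the `replay` helper, and the recovery message are assumptions for illustration.

```python
from collections import deque


def bfs_path(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked), or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# The only connection between the top and bottom rows is the right-hand corridor.
clean = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
start, goal = (0, 0), (2, 0)


def replay(grid, blocked_cells):
    """Re-run the planner on a copy of the grid with extra cells blocked."""
    perturbed = [row[:] for row in grid]
    for r, c in blocked_cells:
        perturbed[r][c] = 1
    path = bfs_path(perturbed, start, goal)
    if path is None:
        return "no route: trigger recovery"
    return f"route with {len(path) - 1} steps"


print(replay(clean, []))         # clean static grid
print(replay(clean, [(1, 2)]))   # corridor blocked
```

The clean grid yields a six-step route. Blocking the single corridor cell leaves no route, which is the case where the recovery behavior, not the planner, decides the outcome.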

Practical Example

A delivery robot should log global path, local command, costmap snapshot, controller error, nearest obstacle distance, replan count, and recovery action. Those fields separate a weak route planner from a bad local controller or an outdated perception layer.
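One way to keep those fields together is a single record per control cycle, serialized as a JSON line. The field names, units, identifiers, and the `triage` thresholds below are assumptions chosen for the example, not a standard message format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class NavLogRecord:
    stamp_s: float              # time since mission start
    global_path_id: str         # reference to the stored global path
    cmd_linear_mps: float       # published local command, linear part
    cmd_angular_rps: float      # published local command, angular part
    costmap_snapshot_id: str    # reference to the stored costmap snapshot
    controller_error_m: float   # cross-track error against the global path
    nearest_obstacle_m: float
    replan_count: int
    recovery_action: str        # "none" when no recovery is active


def triage(rec, tracking_tol_m=0.15, replan_limit=5):
    """Rough first-pass attribution; both thresholds are assumed, not tuned."""
    if rec.controller_error_m > tracking_tol_m:
        return "suspect local controller"
    if rec.replan_count > replan_limit:
        return "suspect global planner or stale costmap"
    return "nominal"


record = NavLogRecord(12.5, "path-0007", 0.4, 0.1, "costmap-0031", 0.06, 0.55, 2, "none")
line = json.dumps(asdict(record), sort_keys=True)
print(line)
print(triage(record))
```

Logging references to the path and costmap, rather than the full data, keeps each line small while still allowing the replay to reload the exact snapshot.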

Integration Checklist

Before comparing planners, freeze the robot footprint, inflation radius, controller frequency, maximum velocity, acceleration limits, map resolution, and localization source. Otherwise the comparison silently mixes planner quality with robot configuration. A serious navigation report should also include a route replay, a costmap snapshot at the decision point, the exact recovery behavior that was enabled, and whether the final command respected the same kinodynamic limits used during planning.
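Freezing the configuration can be enforced mechanically by fingerprinting it and refusing to compare runs whose fingerprints differ. The `RobotConfig` fields mirror the list above. The numeric values, the `"amcl"` label, and the helper names are illustrative assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RobotConfig:
    footprint_radius_m: float
    inflation_radius_m: float
    controller_hz: float
    max_velocity_mps: float
    max_accel_mps2: float
    map_resolution_m: float
    localization_source: str


def fingerprint(cfg):
    """Short, stable hash of the configuration for tagging every run."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def comparable(run_a, run_b):
    """Two planner runs may be compared only if the robot configuration matches."""
    return fingerprint(run_a["config"]) == fingerprint(run_b["config"])


base = RobotConfig(0.22, 0.35, 20.0, 0.8, 1.5, 0.05, "amcl")
tweaked = RobotConfig(0.22, 0.50, 20.0, 0.8, 1.5, 0.05, "amcl")  # inflation changed

same = comparable({"planner": "A", "config": base}, {"planner": "B", "config": base})
mixed = comparable({"planner": "A", "config": base}, {"planner": "B", "config": tweaked})
print(same, mixed)
```

A changed inflation radius alone is enough to make the second comparison invalid, which is exactly the silent mixing the checklist warns about.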

Research Frontier

The frontier is reliable grounding: open-vocabulary object search, embodied question answering, memory maps, and stopping rules that do not confuse a plausible detection with task completion.

Memory Hook

A planner that ignores dynamics is a cartographer with excellent handwriting and no driver license.

Self Check

Can you state the search space, cost function, constraints, replanning trigger, controller interface, and failure metric for language- and image-goal navigation? If not, the planner is not specified enough to deploy.

Key Takeaway

Language- and image-goal navigation is ready for embodied use when route quality, dynamic feasibility, local control, and recovery behavior are measured in the same replay.

Exercise 30.6.1

Create a three-scenario planning panel: clear route, blocked route, and dynamic obstacle. Report path cost, minimum clearance, replan count, controller saturation, and final mission outcome for the same robot model.

What's Next?

Continue to Section 30.7: Field navigation under degraded sensing, where this planning contract connects to the next embodied capability.

Section References

LaValle, S. M. "Planning Algorithms." Cambridge University Press, 2006. http://lavalle.pl/planning/

Open textbook reference for graph search, sampling-based planning, configuration spaces, and kinodynamic planning.

OMPL Project. "Open Motion Planning Library." Official documentation. https://ompl.kavrakilab.org/

Primary tool reference for sampling-based planners such as RRT, RRT*, PRM, and kinodynamic variants.

ROS 2 Navigation Project. "Nav2 documentation." Official documentation. https://navigation.ros.org/

Primary documentation for global planners, controllers, costmaps, behavior trees, and recovery behaviors.