Section 31.5: Task planning from language; ambiguity and clarification | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 31.5A: Task planning under language ambiguity. A clarification-request module estimates instruction entropy, asks the user one targeted question when uncertainty is high, and replans with the resolved intent.

Read the figure as an interface check: before accepting the agent behavior described below, identify the language input, grounding evidence, action representation, safety gate, and logged result.

Figure 31.5: A closed-loop map for Task planning from language; ambiguity and clarification. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Depth and self-containment. This section explains when a language-guided agent should act, when it should ask, and how ambiguity propagates into plan quality. It also gives a formal test for whether clarification is worth the latency.

Production and evaluation contract. The minimum artifact records candidate interpretations, plan value under each interpretation, the clarification question if one was asked, and the post-clarification plan revision. Without that record, ambiguity handling cannot be audited.

Checklist Memory Anchor

Before trusting any claimed improvement in language-driven task planning, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.

Mini Audit Exercise

Write one evidence row that records the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.

Big Picture

Task planning from language becomes credible only when the agent can tell the difference between missing information and a difficult plan. Clarification is not an admission of weakness; it is a control action that buys information.

This section connects language planning to active information gathering by showing how an embodied agent should ask before acting when multiple interpretations lead to different risks or trajectories.

The practical question is not 'can the model generate a plan?' but 'should the agent trust the top plan without first reducing ambiguity?'

Action Is The Test

Clarification is rational whenever the expected value of disambiguation exceeds the cost of asking and waiting.

Theory

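Let $m \in \mathcal M$ be a latent meaning of the instruction, distributed according to the agent's current belief, and let $V(\pi, m)$ be the value of executing plan $\pi$ under that meaning. If the agent can ask a question $q$ with cost $c(q)$ and receive a reply $y$, the net value of clarification is $$\operatorname{VoI}(q) = \mathbb E_{y \sim p(y \mid q)}\left[\max_\pi \mathbb E_{m \mid y, q} V(\pi, m)\right] - \max_\pi \mathbb E_m V(\pi, m) - c(q).$$ The first term is the expected value of the best plan once the reply is known, the second is the value of the best plan under the current belief, and the third is the cost of asking and waiting. Ask when this quantity is positive.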
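
The expression above can be evaluated directly when meanings, plans, and replies are all small discrete sets. The following is a minimal sketch under that assumption. The value table `V`, the belief `prior`, the noisy reply model `reply_given_meaning`, and the asking cost are hypothetical toy numbers chosen for illustration, not measurements.

```python
# Hypothetical toy numbers: V[plan][meaning] is the value of a plan under a meaning.
V = {
    "fetch_red": {"red_mug": 0.9, "blue_mug": 0.1},
    "fetch_blue": {"red_mug": 0.2, "blue_mug": 0.8},
}
prior = {"red_mug": 0.6, "blue_mug": 0.4}  # current belief over meanings

# Assumed reply model p(y | m, q): the user answers correctly 90% of the time.
reply_given_meaning = {
    "red_mug": {"says_red": 0.9, "says_blue": 0.1},
    "blue_mug": {"says_red": 0.1, "says_blue": 0.9},
}
ask_cost = 0.05  # c(q), in the same units as V


def best_expected_value(belief):
    """max over plans of the expected value under a belief over meanings."""
    return max(sum(belief[m] * V[plan][m] for m in belief) for plan in V)


def value_of_information(prior, reply_given_meaning, ask_cost):
    """Net VoI: expected best value after the reply, minus best value now, minus cost."""
    replies = sorted({y for dist in reply_given_meaning.values() for y in dist})
    value_after = 0.0
    for y in replies:
        # p(y | q) = sum over m of p(m) * p(y | m, q)
        p_y = sum(prior[m] * reply_given_meaning[m][y] for m in prior)
        if p_y == 0:
            continue
        posterior = {m: prior[m] * reply_given_meaning[m][y] / p_y for m in prior}
        value_after += p_y * best_expected_value(posterior)
    return value_after - best_expected_value(prior) - ask_cost


voi = value_of_information(prior, reply_given_meaning, ask_cost)
decision = "ask" if voi > 0 else "act"
print(round(voi, 3), decision)
```

With these numbers the best plan without a question is worth 0.58, the expected value after the reply is 0.79, and the net value of asking is about 0.16, so the sketch decides to ask. Note that the second term needs the value of each plan under each meaning, not only the best plan per meaning.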

In practice, the agent approximates this computation with confidence gaps, risk heuristics, or plan disagreement. The deeper lesson is that ambiguity should be represented in the planner's state rather than hidden inside the prompt. Otherwise the robot executes one interpretation while the human assumes another.

Mechanism

A clean clarification loop has four steps: detect multiple plausible task objects, estimate how much the best plan changes across them, ask the smallest question that splits the candidate set, then replan under the updated belief. This is active perception applied to language.
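
One way to make "the smallest question that splits the candidate set" concrete is to score each askable slot by the entropy expected to remain after a truthful answer, then ask about the slot with the lowest score. The sketch below assumes perfect replies and uses hypothetical candidate objects, slot names, and probabilities. It is one possible selection rule, not the only one.

```python
import math

# Hypothetical grounded candidates for "bring me the mug", with belief mass.
candidates = {
    "mug_1": {"color": "red", "location": "table", "prob": 0.40},
    "mug_2": {"color": "blue", "location": "table", "prob": 0.35},
    "mug_3": {"color": "red", "location": "shelf", "prob": 0.25},
}


def entropy(probs):
    """Shannon entropy in bits of a normalized distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_entropy_after(slot):
    """Expected remaining entropy if the user truthfully reports this slot's value."""
    groups = {}
    for cand in candidates.values():
        groups.setdefault(cand[slot], []).append(cand["prob"])
    total = 0.0
    for probs in groups.values():
        mass = sum(probs)  # probability of hearing this answer
        total += mass * entropy([p / mass for p in probs])
    return total


current_entropy = entropy([c["prob"] for c in candidates.values()])
askable_slots = ["color", "location"]
best_slot = min(askable_slots, key=expected_entropy_after)
print(best_slot, round(current_entropy, 3), round(expected_entropy_after(best_slot), 3))
```

Here asking about color leaves less expected uncertainty than asking about location, because the color answer isolates one candidate with more belief mass. A fuller version would weight the split by how much the best plan differs across candidates rather than by entropy alone.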

Worked Example

Code Fragment 1 computes a tiny expected-value test for whether to ask before acting. The numbers are synthetic, but the control logic is the same in household dialogue, warehouse dispatch, and mobile manipulation.

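# Ask for clarification when plan value changes sharply across meanings.
# The cost of asking should be compared against the value of better execution.
# A small confidence gap does not matter unless it changes the chosen plan.
candidate_meanings = {
    "bring_red_mug": {"best_plan_value": 0.92},
    "bring_blue_mug": {"best_plan_value": 0.41},
}
ask_cost = 0.05

red_value = candidate_meanings["bring_red_mug"]["best_plan_value"]
blue_value = candidate_meanings["bring_blue_mug"]["best_plan_value"]

# Without a question, the two meanings are treated as equally likely.
no_question_value = 0.5 * red_value + 0.5 * blue_value
# Simplification: after a question, the agent is credited with the higher value.
# The full VoI expression would average the best plan value over possible replies.
after_question_value = max(red_value, blue_value)
voi = round(after_question_value - no_question_value - ask_cost, 2)

print({"no_question": round(no_question_value, 2), "after_question": after_question_value, "voi": voi})
print("ask" if voi > 0 else "act")

{'no_question': 0.67, 'after_question': 0.92, 'voi': 0.2}
ask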

Code Fragment 1: This calculation shows why ambiguity should be treated as a planning variable, not only as a language score. Because the expected value gain from disambiguation exceeds the asking cost, the rational action is to clarify before moving.

Library Shortcut

Dialogue managers, LangGraph state machines, and tool-calling APIs implement the same loop with a few nodes: detect ambiguity, ask, validate the reply, and replan. They hide the bookkeeping around state transitions and logging so the engineer can focus on the ambiguity test itself.

Practical Recipe

Maintain more than one candidate task object whenever the parse is not decisive.
Measure plan disagreement, risk difference, or verifier difference across those candidates.
Ask the smallest clarification question that collapses the uncertainty the most.
Treat the user's reply as a state update, then rerun grounding and planning.
Log the pre-question and post-question plan so ambiguity handling is auditable.
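
The last two steps of the recipe can be sketched as a Bayesian update followed by a replan and a log entry. This is a minimal illustration. The meaning names, the reply likelihoods, and the stand-in planner `plan_for` are assumptions made for the example; a real system would call its own grounding and planning modules at that point.

```python
def update_belief(belief, likelihood):
    """Bayes rule: posterior is proportional to prior times reply likelihood."""
    unnormalized = {m: belief[m] * likelihood.get(m, 0.0) for m in belief}
    total = sum(unnormalized.values())
    if total == 0:
        # The reply is inconsistent with every candidate; keep the old belief.
        return dict(belief)
    return {m: v / total for m, v in unnormalized.items()}


def plan_for(belief):
    """Stand-in planner (hypothetical): commit to the most probable meaning."""
    return max(belief, key=belief.get)


# Belief before asking "Which chart do you mean?"
belief = {"chart_on_bedside_table": 0.45, "chart_on_desk": 0.55}
pre_plan = plan_for(belief)

# Assumed likelihood of the reply "the one by the bed" under each meaning.
reply_likelihood = {"chart_on_bedside_table": 0.95, "chart_on_desk": 0.05}
belief = update_belief(belief, reply_likelihood)
post_plan = plan_for(belief)

# Keep both plans so the effect of the question can be audited later.
log_entry = {
    "pre_question_plan": pre_plan,
    "post_question_plan": post_plan,
    "posterior": belief,
}
print(log_entry)
```

In this example the reply flips the plan from the desk to the bedside table, which is exactly the kind of change the log should make visible.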

Common Failure Mode

A common failure mode is to ask too late, after the robot has already committed to a costly motion. Another is to ask too vaguely, which forces the human to restate the whole task instead of resolving the one missing variable.

Practical Example

In a hospital room, 'bring me the chart on the table' may refer to several documents. If walking to the wrong side of the room is costly or disruptive, a two-second clarification question can save a minute of motion and a socially awkward recovery.

Memory Hook

Humans call it a clarifying question. Robots call it avoiding a future apology tour.

Research Frontier

Research is shifting from one-shot instruction following toward mixed-initiative systems that decide when to ask, point, move for a better view, or request confirmation. The hard open problem is calibrating these interventions so they improve task success without becoming annoying or slow.

Self Check

Can you name one task where the top-1 parse confidence looks high, but the difference between the top two meanings still justifies asking because the wrong choice would be costly or unsafe?
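
A small calculation shows how such a case can arise. The numbers below are hypothetical: the top parse has 90% confidence, but acting on the wrong meaning is irreversible and heavily penalized, and the user's reply is assumed to resolve the ambiguity completely.

```python
# Instruction: "throw away the box on the left" (hypothetical scenario).
p_top_parse_correct = 0.9
value_if_correct = 1.0
value_if_wrong = -10.0  # discarding the wrong box cannot be undone
ask_cost = 0.05

# Acting immediately on the top parse.
act_value = (
    p_top_parse_correct * value_if_correct
    + (1 - p_top_parse_correct) * value_if_wrong
)

# Asking first, assuming the reply fully resolves the ambiguity.
ask_value = value_if_correct - ask_cost

decision = "ask" if ask_value > act_value else "act"
print(round(act_value, 2), round(ask_value, 2), decision)
```

Acting immediately has an expected value near -0.1, while asking first is worth 0.95, so high parse confidence alone does not justify skipping the question.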

Clarification is best understood as a control action that changes the information state. It belongs in the same conceptual family as camera motion for better visibility or probing contact to reduce pose uncertainty. The agent spends time now to improve policy value later.

This framing also clarifies evaluation. A system that asks more questions is not automatically worse. It is worse only if those questions do not buy enough downstream value, such as safer execution, lower path length, or fewer catastrophic failures.
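
One way to operationalize this comparison is to score each policy on the same episodes with a single net-value number that rewards success and charges for questions and catastrophic failures. The sketch below uses made-up episode outcomes and hypothetical weights (`QUESTION_COST`, `CATASTROPHE_PENALTY`); the right weights depend on the deployment.

```python
# Hypothetical per-episode outcomes on the same six episodes for two policies.
results = {
    "always_act": {
        "success": [1, 1, 0, 1, 0, 1],
        "questions": [0, 0, 0, 0, 0, 0],
        "catastrophic": [0, 0, 1, 0, 0, 0],
    },
    "ask_when_voi_positive": {
        "success": [1, 1, 1, 1, 0, 1],
        "questions": [0, 1, 1, 0, 1, 0],
        "catastrophic": [0, 0, 0, 0, 0, 0],
    },
}

QUESTION_COST = 0.05       # assumed cost per question, in success units
CATASTROPHE_PENALTY = 2.0  # assumed penalty per catastrophic failure


def mean(values):
    return sum(values) / len(values)


def net_value(outcomes):
    """Success rate minus the charges for questions and catastrophic failures."""
    return (
        mean(outcomes["success"])
        - QUESTION_COST * mean(outcomes["questions"])
        - CATASTROPHE_PENALTY * mean(outcomes["catastrophic"])
    )


scores = {policy: net_value(outcomes) for policy, outcomes in results.items()}
for policy, score in scores.items():
    print(policy, round(score, 3))
```

With these numbers the asking policy pays for three questions yet scores higher, because the questions remove one failure and the only catastrophic outcome.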

Tool Choices For Clarification and Replanning

Tool or Library	Role in the Topic	Builder Advice
LangGraph	Stateful dialogue and replanning loops.	Use it when ambiguity resolution spans several tool calls and planner updates.
BehaviorTree.CPP	Execution trees with question, wait, and fallback branches.	Use it when clarification is one branch among several recovery actions.
TEACh	Benchmark for dialogue during embodied execution.	Use it when you need a dataset where asking and acting are intertwined.
ROS 2 actions	Cancelable skills during clarification.	Use actions when the robot may need to pause or preempt a running behavior while asking.
Pydantic task objects	Structured storage of multiple candidate meanings.	Use them when ambiguity should survive across planner and verifier modules.

Code Fragment 2 stores the ambiguity state and the chosen clarification question in one artifact. The planner can then compare the original and revised plan without losing the reason the question was asked.

Store the top candidate meanings instead of only the winner.
Attach a question template to the specific slot that needs disambiguation.
Pause or gate dangerous actions until the reply is received or a timeout fires.
After the reply, re-run grounding and planning from the updated task object.
Audit whether clarification improved success, safety, or efficiency on the same episode set.
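
A minimal sketch of such an artifact follows. It uses standard-library dataclasses in place of Pydantic so that it runs without extra dependencies, and every class name, field name, and template string is hypothetical. What should happen after a timeout (for example, a safe fallback plan) is a policy choice the sketch leaves open.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical question templates keyed by the slot that needs disambiguation.
QUESTION_TEMPLATES = {"color": "Which {noun} do you mean: {options}?"}


@dataclass
class ClarificationRecord:
    instruction: str
    candidates: dict          # candidate meaning -> probability (not only the winner)
    ambiguous_slot: str
    question: str
    pre_question_plan: list
    reply: Optional[str] = None
    post_question_plan: Optional[list] = None
    timed_out: bool = False

    def may_execute(self, dangerous: bool) -> bool:
        """Gate dangerous actions until a reply arrives or the timeout fires."""
        return (not dangerous) or self.reply is not None or self.timed_out


record = ClarificationRecord(
    instruction="bring me the mug",
    candidates={"bring_red_mug": 0.5, "bring_blue_mug": 0.5},
    ambiguous_slot="color",
    question=QUESTION_TEMPLATES["color"].format(noun="mug", options="red or blue"),
    pre_question_plan=["navigate_to_table", "grasp_red_mug"],
)
gate_before_reply = record.may_execute(dangerous=True)

# The reply updates the task object; grounding and planning are then rerun.
record.reply = "blue"
record.candidates = {"bring_red_mug": 0.0, "bring_blue_mug": 1.0}
record.post_question_plan = ["navigate_to_table", "grasp_blue_mug"]
gate_after_reply = record.may_execute(dangerous=True)

print(asdict(record))
```

Because the original plan, the question, the reply, and the revised plan live in one record, an auditor can see why the question was asked and what it changed.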

If the clarification loop underperforms, check whether ambiguity was detected too late, whether the wrong slot was queried, or whether the user reply failed to update the internal task object. These are distinct bugs with different remedies.

Key Takeaway

Ambiguity handling is part of planning, not just part of conversation.

Exercise 31.5.1

Construct a two-interpretation task where acting immediately is cheaper but risky, while asking first is slower but safer. Estimate the value of information and decide which policy you would deploy.

Bibliography and Further Reading

Primary Sources and Tools

Padmakumar et al. (2022). "TEACh: Task-driven Embodied Agents that Chat." AAAI.

TEACh is a key source for dialogue-driven clarification and task progress in embodied settings.

LangGraph Documentation.

LangGraph is a practical reference for stateful LLM control loops with explicit replanning and tool routing.

Wang et al. (2025). "EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models as Embodied Agents." arXiv.

EmbodiedBench is useful for thinking about evaluation protocols in embodied LLM systems, including interaction and replanning.
