A Careful Control Loop
For task planning from language under ambiguity, read the figure as an interface check: identify the language input, grounding evidence, action representation, safety gate, and logged result before accepting the agent behavior described below.
Build And Evaluation Checklist
Depth and self-containment. This section explains when a language-guided agent should act, when it should ask, and how ambiguity propagates into plan quality. It also gives a formal test for whether clarification is worth the latency.
Production and evaluation contract. The minimum artifact records candidate interpretations, plan value under each interpretation, the clarification question if one was asked, and the post-clarification plan revision. Without that record, ambiguity handling cannot be audited.
Before trusting any claimed improvement in task planning from language, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.
As an exercise, write one evidence row that records the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.
Task planning from language becomes credible only when the agent can tell the difference between missing information and a difficult plan. Clarification is not an admission of weakness; it is a control action that buys information.
This section connects language planning to active information gathering by showing how an embodied agent should ask before acting when multiple interpretations lead to different risks or trajectories.
The practical question is not 'can the model generate a plan?' but 'should the agent trust the top plan without first reducing ambiguity?'
Clarification is rational whenever the expected value of disambiguation exceeds the cost of asking and waiting.
Theory
Let $m \in \mathcal M$ be a latent meaning of the instruction with prior belief $p(m)$, and let $V(\pi, m)$ be the value of executing plan $\pi$ under that meaning. If the agent can ask a question $q$ with cost $c(q)$ and receive a reply $y$, the net value of clarification is $$\operatorname{VoI}(q) = \mathbb E_{y \sim p(y \mid q)}\left[\max_\pi \mathbb E_{m \mid y, q} V(\pi, m)\right] - \max_\pi \mathbb E_m V(\pi, m) - c(q).$$ The first term is the value of planning after the reply, the second is the value of committing to one plan under the prior, and the agent should ask when the quantity is positive.
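As a worked special case, assume (purely for illustration) that $\mathcal M$ is finite, that the prior over meanings is $p(m)$, and that the reply $y$ reveals the meaning exactly. The posterior then collapses onto a single meaning and the expression reduces to:

```latex
\operatorname{VoI}(q)
  = \sum_{m \in \mathcal M} p(m) \max_\pi V(\pi, m)
  - \max_\pi \sum_{m \in \mathcal M} p(m)\, V(\pi, m)
  - c(q)
```

The first sum lets the plan depend on the meaning, while the second forces one plan to serve every meaning. Their difference is never negative, so under this assumption the decision comes down to whether that gap exceeds $c(q)$. When replies are noisy, the general formula applies instead.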
In practice, the agent approximates this computation with confidence gaps, risk heuristics, or plan disagreement. The deeper lesson is that ambiguity should be represented in the planner's state rather than hidden inside the prompt. Otherwise the robot executes one interpretation while the human assumes another.
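The gap between a confidence heuristic and a value-based heuristic can be sketched in a few lines. The function names, the two plans, and all numbers below are hypothetical and chosen only to illustrate the point. The sketch assumes a small discrete set of meanings and a known value table.

```python
from typing import Dict

ValueTable = Dict[str, Dict[str, float]]  # value[plan][meaning], synthetic numbers


def confidence_gap(prior: Dict[str, float]) -> float:
    """Difference between the two most probable meanings."""
    top = sorted(prior.values(), reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def expected_regret(prior: Dict[str, float], value: ValueTable) -> float:
    """Expected loss from committing to one plan instead of the best plan per meaning."""
    committed = max(value, key=lambda p: sum(prior[m] * value[p][m] for m in prior))
    return sum(
        prior[m] * (max(v[m] for v in value.values()) - value[committed][m])
        for m in prior
    )


# Hypothetical instruction with a confident parse but a costly wrong action.
prior = {"move_box_left": 0.9, "move_box_onto_shelf": 0.1}
value = {
    "push_left": {"move_box_left": 1.0, "move_box_onto_shelf": -4.0},
    "lift_to_shelf": {"move_box_left": 0.2, "move_box_onto_shelf": 1.0},
}

print("confidence gap:", confidence_gap(prior))
print("expected regret:", expected_regret(prior, value))
```

Here the parse looks decisive, yet the expected regret is large compared with a small asking cost, so a confidence threshold alone would skip a question that a value-based test would ask. Under a fully informative reply, this expected regret equals the value of clarification before subtracting the cost of asking.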
A clean clarification loop has four steps: detect multiple plausible task objects, estimate how much the best plan changes across them, ask the smallest question that splits the candidate set, then replan under the updated belief. This is active perception applied to language.
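The third and fourth steps can be sketched as follows. This is one possible reading of "smallest question that splits the candidate set": query the slot whose answer leaves the fewest candidates in expectation. The slot names, candidates, and weights are hypothetical, and the scoring rule is an assumption rather than a prescribed method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Dict[str, str]  # slot name -> slot value


def expected_remaining(slot: str, candidates: List[Candidate], weights: List[float]) -> float:
    """Expected number of candidates still alive after the user answers `slot`."""
    mass: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    for cand, w in zip(candidates, weights):
        mass[cand[slot]] += w
        count[cand[slot]] += 1
    return sum(mass[v] * count[v] for v in mass)


def pick_slot(candidates: List[Candidate], weights: List[float]) -> str:
    """Choose the slot whose answer is expected to prune the candidate set the most."""
    return min(candidates[0].keys(), key=lambda s: expected_remaining(s, candidates, weights))


def update(candidates: List[Candidate], weights: List[float], slot: str,
           answer: str) -> Tuple[List[Candidate], List[float]]:
    """Keep candidates consistent with the reply and renormalize the belief."""
    kept = [(c, w) for c, w in zip(candidates, weights) if c[slot] == answer]
    if not kept:
        raise ValueError("reply matches no candidate; re-ask or re-ground")
    total = sum(w for _, w in kept)
    return [c for c, _ in kept], [w / total for _, w in kept]


candidates = [
    {"object": "mug", "color": "red", "location": "table"},
    {"object": "mug", "color": "blue", "location": "table"},
    {"object": "mug", "color": "blue", "location": "shelf"},
]
weights = [0.4, 0.4, 0.2]

slot = pick_slot(candidates, weights)
remaining, belief = update(candidates, weights, slot, "blue")
print(slot, remaining, belief)
```

After the update, the planner would rerun on the remaining candidates. A value-aware variant would weight each split by how much the best plan differs across the groups instead of counting candidates.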
Worked Example
Code Fragment 1 computes a tiny expected-value test for whether to ask before acting. The numbers are synthetic, but the control logic is the same in household dialogue, warehouse dispatch, and mobile manipulation.
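# Ask for clarification when plan value changes sharply across meanings.
# The cost of asking should be compared against the value of better execution.
# A small confidence gap does not matter unless it changes the chosen plan.
# All numbers are synthetic. The values for executing a plan under the wrong
# meaning (0.10) are illustrative assumptions needed for the no-question baseline.
prior = {"bring_red_mug": 0.5, "bring_blue_mug": 0.5}
plan_value = {  # plan_value[plan][meaning] = V(plan, meaning)
    "fetch_red": {"bring_red_mug": 0.92, "bring_blue_mug": 0.10},
    "fetch_blue": {"bring_red_mug": 0.10, "bring_blue_mug": 0.41},
}
ask_cost = 0.05
# Without a question, the agent commits to the single plan with the best expected value.
no_question_value = max(sum(prior[m] * v[m] for m in prior) for v in plan_value.values())
# After a fully informative reply, the agent picks the best plan for the revealed meaning.
after_question_value = sum(prior[m] * max(v[m] for v in plan_value.values()) for m in prior)
voi = round(after_question_value - no_question_value - ask_cost, 3)
print({"no_question": round(no_question_value, 3), "after_question": round(after_question_value, 3), "voi": voi})
print("ask" if voi > 0 else "act")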
Dialogue managers, LangGraph state machines, and tool-calling APIs can implement the same loop with a few nodes: detect ambiguity, ask, validate the reply, and replan. Such frameworks take over the bookkeeping around state transitions and logging so the engineer can focus on the ambiguity test itself.
Practical Recipe
- Maintain more than one candidate task object whenever the parse is not decisive.
- Measure plan disagreement, risk difference, or verifier difference across those candidates.
- Ask the smallest clarification question that collapses the uncertainty the most.
- Treat the user's reply as a state update, then rerun grounding and planning.
- Log the pre-question and post-question plan so ambiguity handling is auditable.
A common failure mode is to ask too late, after the robot has already committed to a costly motion. Another is to ask too vaguely, which forces the human to restate the whole task instead of resolving the one missing variable.
In a hospital room, 'bring me the chart on the table' may refer to several documents. If walking to the wrong side of the room is costly or disruptive, a two-second clarification question can save a minute of motion and a socially awkward recovery.
Humans call it a clarifying question. Robots call it avoiding a future apology tour.
Research is shifting from one-shot instruction following toward mixed-initiative systems that decide when to ask, point, move for a better view, or request confirmation. The hard open problem is calibrating these interventions so they improve task success without becoming annoying or slow.
Can you name one task where the top-1 parse confidence looks high, but the difference between the top two meanings still justifies asking because the wrong choice would be costly or unsafe?
Clarification is best understood as a control action that changes the information state. It belongs in the same conceptual family as camera motion for better visibility or probing contact to reduce pose uncertainty. The agent spends time now to improve policy value later.
This framing also clarifies evaluation. A system that asks more questions is not automatically worse. It is worse only if those questions do not buy enough downstream value, such as safer execution, lower path length, or fewer catastrophic failures.
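A minimal audit along these lines compares the same episodes with and without clarification and charges each question its cost. The record fields (`value_no_ask`, `value_with_ask`, `questions`) and the numbers are hypothetical, and the sketch assumes a single scalar task value per episode.

```python
from typing import Dict, List


def audit_clarification(episodes: List[Dict[str, float]], ask_cost: float) -> Dict[str, float]:
    """Average value gained by asking, minus the average cost of the questions asked."""
    n = len(episodes)
    mean_gain = sum(e["value_with_ask"] - e["value_no_ask"] for e in episodes) / n
    mean_questions = sum(e["questions"] for e in episodes) / n
    return {
        "mean_gain": mean_gain,
        "mean_questions": mean_questions,
        "net_value": mean_gain - ask_cost * mean_questions,
    }


# Synthetic episode set: the same three tasks run under both policies.
episodes = [
    {"value_no_ask": 0.40, "value_with_ask": 0.90, "questions": 1},
    {"value_no_ask": 0.80, "value_with_ask": 0.80, "questions": 0},
    {"value_no_ask": 0.70, "value_with_ask": 0.75, "questions": 1},
]

report = audit_clarification(episodes, ask_cost=0.05)
print(report)
```

A positive `net_value` means the questions paid for themselves on this episode set. In practice the scalar value would be replaced by whichever downstream measures matter, such as safety violations or path length.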
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| LangGraph | Stateful dialogue and replanning loops. | Use it when ambiguity resolution spans several tool calls and planner updates. |
| BehaviorTree.CPP | Execution trees with question, wait, and fallback branches. | Use it when clarification is one branch among several recovery actions. |
| TEACh | Benchmark for dialogue during embodied execution. | Use it when you need a dataset where asking and acting are intertwined. |
| ROS 2 actions | Cancelable skills during clarification. | Use actions when the robot may need to pause or preempt a running behavior while asking. |
| Pydantic task objects | Structured storage of multiple candidate meanings. | Use them when ambiguity should survive across planner and verifier modules. |
Code Fragment 2 stores the ambiguity state and the chosen clarification question in one artifact. The planner can then compare the original and revised plan without losing the reason the question was asked.
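One way to hold such an artifact is sketched here with standard-library dataclasses. The class and field names are hypothetical, and a Pydantic model would serve equally well where validation is needed. The fields follow the minimum record described earlier: candidate interpretations, plan value under each, the question asked, and the revised plan.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class CandidateMeaning:
    label: str
    probability: float
    best_plan: List[str]
    plan_value: float  # value of best_plan under this meaning


@dataclass
class ClarificationRecord:
    instruction: str
    candidates: List[CandidateMeaning]
    pre_question_plan: List[str]
    queried_slot: Optional[str] = None
    question: Optional[str] = None
    reply: Optional[str] = None
    post_question_plan: Optional[List[str]] = None

    def plan_changed(self) -> bool:
        """True when a revised plan exists and differs from the original one."""
        return (self.post_question_plan is not None
                and self.post_question_plan != self.pre_question_plan)


record = ClarificationRecord(
    instruction="bring me the mug",
    candidates=[
        CandidateMeaning("bring_red_mug", 0.5, ["goto_table", "grasp_red_mug"], 0.92),
        CandidateMeaning("bring_blue_mug", 0.5, ["goto_shelf", "grasp_blue_mug"], 0.41),
    ],
    pre_question_plan=["goto_table", "grasp_red_mug"],
    queried_slot="color",
    question="Do you want the red mug or the blue mug?",
    reply="blue",
    post_question_plan=["goto_shelf", "grasp_blue_mug"],
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
print("plan changed:", record.plan_changed())
```

Keeping the queried slot next to both plans lets an auditor see why the question was asked and whether the reply actually changed the plan.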
- Store the top candidate meanings instead of only the winner.
- Attach a question template to the specific slot that needs disambiguation.
- Pause or gate dangerous actions until the reply is received or a timeout fires.
- After the reply, re-run grounding and planning from the updated task object.
- Audit whether clarification improved success, safety, or efficiency on the same episode set.
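The gating step in the list above can be sketched as a small pure function. The state names and the choice to fall back to a safe behavior on timeout are assumptions for illustration. A real system would map these outcomes onto its own executor, for example by pausing or canceling a running skill.

```python
from typing import Optional


def gate(is_dangerous: bool, reply: Optional[str], waited_s: float, timeout_s: float) -> str:
    """Decide what to do with a pending action while a clarification is outstanding."""
    if not is_dangerous:
        return "execute"   # safe actions need not wait for the reply
    if reply is not None:
        return "replan"    # re-run grounding and planning from the updated task object
    if waited_s >= timeout_s:
        return "fallback"  # assumed policy: take a safe default instead of guessing
    return "hold"          # keep the dangerous action paused


print(gate(is_dangerous=True, reply=None, waited_s=1.0, timeout_s=5.0))
print(gate(is_dangerous=True, reply="blue", waited_s=1.0, timeout_s=5.0))
print(gate(is_dangerous=True, reply=None, waited_s=6.0, timeout_s=5.0))
```

Writing the gate as a function of explicit inputs makes it easy to log each decision alongside the clarification record and to test the timeout path without a robot in the loop.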
If the clarification loop underperforms, check whether ambiguity was detected too late, whether the wrong slot was queried, or whether the user reply failed to update the internal task object. These are distinct bugs with different remedies.
Ambiguity handling is part of planning, not just part of conversation.
Construct a two-interpretation task where acting immediately is cheaper but risky, while asking first is slower but safer. Estimate the value of information and decide which policy you would deploy.
Padmakumar et al. (2022). "TEACh: Task-driven Embodied Agents that Chat." AAAI.
TEACh is a key source for dialogue-driven clarification and task progress in embodied settings.
LangGraph is a practical reference for stateful LLM control loops with explicit replanning and tool routing.
EmbodiedBench is useful for thinking about evaluation protocols in embodied LLM systems, including interaction and replanning.