Section 33.1: What LLMs can and cannot do in embodied tasks

A Careful Control Loop
Figure 33.1A: LLM capability vs. limitation chart for embodied tasks: strong on symbolic planning and language grounding, weak on precise metric spatial reasoning and real-time reactive control at high frequency.

Read the figure as a boundary map. The LLM can propose task structure and language-level intent, but the embodied system still needs state estimation, affordance checks, typed action APIs, execution monitoring, and a verifier that can reject unsupported plans.

[Diagram: a four-stage closed loop connecting Instruction, Planner, Tool API, and Verifier. Observe, decide, act, measure, then feed failure evidence back into the next decision.]
Figure 33.1: A closed-loop map for What LLMs can and cannot do in embodied tasks. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.
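
The loop in the figure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `plan`, `verify`, and `call_tool` are hypothetical stand-ins for the planner, verifier, and tool API, the world is a toy dictionary, and the verifier here runs as a precondition check before the tool call.

```python
# Toy closed loop: instruction -> planner -> verifier -> tool API -> evidence.
# Every function is a hypothetical stand-in, not a real LLM or robot API.

def plan(instruction, evidence):
    """Stand-in planner: reacts to the most recent evidence row."""
    if evidence and not evidence[-1]["ok"] and evidence[-1]["reason"] == "target_not_visible":
        return "open(drawer)"
    return "grasp(mug)"

def verify(action, world):
    """Stand-in verifier: rejects a grasp while the mug is hidden in the drawer."""
    if action == "grasp(mug)" and not world["drawer_open"]:
        return False, "target_not_visible"
    return True, "ok"

def call_tool(action, world):
    """Stand-in tool API: applies the action's effect to the toy world."""
    if action == "open(drawer)":
        world["drawer_open"] = True
    elif action == "grasp(mug)":
        world["holding_mug"] = True

world = {"drawer_open": False, "holding_mug": False}
evidence = []  # one row per decision, including rejected proposals

for _ in range(5):  # bounded number of attempts
    action = plan("fetch the mug", evidence)
    ok, reason = verify(action, world)
    if ok:
        call_tool(action, world)
    evidence.append({"action": action, "ok": ok, "reason": reason})
    if world["holding_mug"]:
        break

for row in evidence:
    print(row)
```

The first proposal is rejected, and the rejection reason is the only thing that steers the planner toward opening the drawer. Without the evidence record, the stand-in planner would repeat the same infeasible grasp.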

Build And Evaluation Checklist

Depth and self-containment. This section separates semantic strengths from physical weaknesses. Readers should come away knowing exactly which parts of embodied competence can be offloaded to an LLM and which parts still demand grounded state, typed tools, and control feedback.

Production and evaluation contract. The artifact here is a planner trace with prompt state, proposed subgoals, tool calls, verifier outcomes, and latency. Without those fields, it is impossible to tell whether the LLM contributed real control value or just plausible narration.

Checklist Memory Anchor

Before trusting any claimed improvement from an LLM in an embodied task, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.

Mini Audit Exercise

Write one evidence row for a single embodied LLM step, recording the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.

Big Picture

What LLMs can and cannot do in embodied tasks is a question about interface boundaries. LLMs are excellent at semantic decomposition, error explanation, and API selection, but weak at direct state estimation, tight feedback control, and physical constraint satisfaction.

This section defines a clear contract for when an LLM should sit inside an embodied loop and when it should stay outside as an advisor or parser.

The practical question is not whether the LLM can describe the right action sequence, but whether that description can be turned into safe, low-latency, grounded behavior.

Action Is The Test

Use the LLM for semantic search over plans and interfaces, not as a substitute for state estimation or servo control.

Theory

A convenient decomposition is $$\pi(a_t \mid h_t) = \kappa\bigl(\phi_{\text{LLM}}(x_t, m_t), \hat s_t\bigr),$$ where $h_t$ is the history available at time $t$, $\phi_{\text{LLM}}$ proposes a symbolic or programmatic plan from language context $x_t$ and memory $m_t$, and $\kappa$ is the grounded executor that consumes the proposal together with the current state estimate $\hat s_t$ to select the action $a_t$. The executor, not the LLM, owns physical validity.

This split explains both the promise and the limits. LLMs compress broad semantic priors into a small number of candidate subtasks or tool sequences. They do not directly measure friction, occlusion, latency, or actuator saturation. When papers claim strong embodied performance, the key question is how much the grounded stack contributes beyond the language model itself.
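
The decomposition can be made concrete with a small sketch. It assumes a deterministic special case of $\kappa$, and the names `phi_llm`, `kappa`, `pi`, and the state fields `poses`, `distance_m`, and `reach_m` are all hypothetical choices for illustration.

```python
from typing import Optional

def phi_llm(x_t: str, m_t: list) -> dict:
    """Hypothetical stand-in for the LLM: language context and memory -> symbolic proposal."""
    target = "red_mug" if "mug" in x_t else "unknown"
    return {"skill": "pick", "target": target}

def kappa(proposal: dict, s_hat: dict) -> Optional[str]:
    """Grounded executor: emits an action only if the state estimate supports the proposal."""
    pose = s_hat["poses"].get(proposal["target"])
    if pose is None:
        return None  # target is not in the state estimate
    if pose["distance_m"] > s_hat["reach_m"]:
        return None  # target is estimated to be out of reach
    return f"{proposal['skill']}@{proposal['target']}"

def pi(x_t: str, m_t: list, s_hat: dict) -> Optional[str]:
    """Policy as the composition kappa(phi_llm(x_t, m_t), s_hat)."""
    return kappa(phi_llm(x_t, m_t), s_hat)

near = {"poses": {"red_mug": {"distance_m": 0.4}}, "reach_m": 0.8}
far = {"poses": {"red_mug": {"distance_m": 1.5}}, "reach_m": 0.8}

print(pi("pick up the mug", [], near))  # pick@red_mug
print(pi("pick up the mug", [], far))   # None
```

The language input is identical in both calls. Only the state estimate changes the outcome, which is the sense in which the executor owns physical validity.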

Mechanism

Treat the LLM as a high-level search policy over task decompositions, code sketches, or tool calls. Every proposal must pass through typed interfaces, state checks, and local controllers that know the robot's embodiment and current scene.

Worked Example

Code Fragment 1 implements the smallest planner boundary: the LLM proposes one symbolic subgoal, but the executor only accepts it if the required tool and state preconditions are satisfied.

# Accept an LLM proposal only when the grounded stack can execute it.
# The executor checks state and tool availability before acting.
# This boundary keeps semantic planning separate from physical validity.
proposal = {"step": "pick(red_mug)", "required_tool": "grasp"}
state = {"target_visible": True, "toolbox": {"grasp", "place"}}

can_execute = state["target_visible"] and proposal["required_tool"] in state["toolbox"]
decision = "execute" if can_execute else "replan"

print({"proposal": proposal["step"], "can_execute": can_execute, "decision": decision})
# Output:
# {'proposal': 'pick(red_mug)', 'can_execute': True, 'decision': 'execute'}

The expected output is an executable proposal whose semantic content and grounded feasibility agree. If `proposal` looked sensible but `can_execute` were `False`, the correct diagnosis would be a grounding or interface failure, and the sensible-looking text should not be counted as a success.

Code Fragment 1: This boundary keeps the language model in the role it is good at, proposing a semantically meaningful step, while the grounded stack decides whether the current world state can support it. The important field is `can_execute`, because a plausible textual plan is worthless if the target is not visible or the tool is unavailable.

Library Shortcut

Modern tool-calling APIs and structured-output runtimes turn the same boundary into a few lines by forcing the LLM to emit typed action objects. They absorb prompt formatting, JSON validation, and retry logic so the engineer can focus on the execution contract and verifier design.
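
To show what such runtimes handle, here is a minimal standard-library sketch of the same idea: raw model text is either parsed into a typed action or rejected. The `Action` class, the `ALLOWED_SKILLS` vocabulary, and `parse_action` are hypothetical names for illustration, not the API of any particular tool-calling library.

```python
import json
from dataclasses import dataclass

ALLOWED_SKILLS = {"pick", "place", "navigate"}  # hypothetical skill vocabulary

@dataclass(frozen=True)
class Action:
    skill: str
    target: str

def parse_action(raw: str) -> Action:
    """Parse raw model text into a typed Action, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"skill", "target"}:
        raise ValueError("expected exactly the fields 'skill' and 'target'")
    if not isinstance(data["skill"], str) or data["skill"] not in ALLOWED_SKILLS:
        raise ValueError(f"unknown skill: {data['skill']!r}")
    if not isinstance(data["target"], str) or not data["target"]:
        raise ValueError("target must be a non-empty string")
    return Action(skill=data["skill"], target=data["target"])

print(parse_action('{"skill": "pick", "target": "red_mug"}'))

for bad in ("pick the red mug", '{"skill": "pour", "target": "cup"}'):
    try:
        parse_action(bad)
    except ValueError as err:
        print("rejected:", err)
```

Free-form text and out-of-vocabulary skills are both rejected before anything reaches the executor. A retry loop, if used, would wrap `parse_action` and feed the error message back to the model.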

Practical Recipe

  1. Write down which variables the LLM sees and which variables only the grounded stack sees.
  2. Require every LLM proposal to map into a typed action or code object.
  3. Attach a verifier to every proposal so textual plausibility never counts as success by itself.
  4. Measure latency separately for planning, execution, and recovery.
  5. Benchmark against strong non-LLM baselines on the same task contract before claiming an embodied gain.
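
Step 4 of the recipe can be sketched with a small timing helper. The `timed` context manager and the phase names are assumptions for illustration, and the `time.sleep` calls stand in for a real LLM call, skill execution, and recovery routine.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per phase, in milliseconds.
latency_ms = {"planning": 0.0, "execution": 0.0, "recovery": 0.0}

@contextmanager
def timed(phase):
    """Add the elapsed time of the enclosed block to the given phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency_ms[phase] += (time.perf_counter() - start) * 1000.0

with timed("planning"):
    time.sleep(0.02)   # stand-in for the LLM planning call
with timed("execution"):
    time.sleep(0.01)   # stand-in for running the skill
with timed("recovery"):
    time.sleep(0.005)  # stand-in for a replan or retry

print({phase: round(ms, 1) for phase, ms in latency_ms.items()})
```

Keeping one counter per phase makes it visible when a system misses its timing budget because of planning rather than because of the controller.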

Common Failure Mode

A frequent failure mode is to confuse descriptive competence with control competence. An LLM may explain how to pour safely while still lacking any grounded estimate of the cup pose, liquid dynamics, or actuator limits needed to perform the action.

Practical Example

In mobile manipulation, an LLM can select the sequence 'navigate to sink, grasp sponge, wipe spill,' but it should not be the module that estimates whether the sponge is currently visible or whether the arm can reach it without collision. Those checks belong to perception and planning tools that expose measurable state.

Memory Hook

Large language models are fantastic interns for whiteboard planning. They are much less convincing when asked to be gravity, friction, and depth sensors all at once.

Research Frontier

Current embodied LLM work is moving from prompt-only planners toward typed tool use, verifier loops, and benchmark suites such as EmbodiedBench. The hard scientific question is whether LLMs add embodied value beyond semantic decomposition once strong world models, VLMs, and classic planners are already in place.

Self Check

Can you point to one subproblem in your stack that genuinely benefits from broad language priors, and one subproblem where replacing a grounded estimator with an LLM would be irresponsible or pointless?

A useful scientific discipline is to evaluate LLM contribution at the intervention boundary. Replace the LLM planner with a hand-written planner, a behavior tree, or a retrieval baseline while keeping execution fixed. If performance barely changes, the embodied value comes from the grounded stack, not from the language model.
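
The intervention test can be sketched as a planner swap over a fixed executor. Everything here is a toy assumption: the two planner functions are hypothetical stand-ins, and the four episodes are invented for illustration, not measured results.

```python
def execute(action, episode):
    """Fixed grounded executor: succeeds only if the scene supports the action."""
    return action in episode["feasible"]

def handwritten_planner(episode):
    """Baseline: a single keyword rule."""
    return "pick" if episode["instruction"].startswith("pick") else "navigate"

def llm_planner_stub(episode):
    """Hypothetical stand-in for an LLM planner with broader paraphrase coverage."""
    text = episode["instruction"]
    return "pick" if ("pick" in text or "grab" in text) else "navigate"

# Invented episodes; "feasible" encodes what the grounded stack could realize.
episodes = [
    {"instruction": "pick up the mug", "feasible": {"pick"}},
    {"instruction": "grab the sponge", "feasible": {"pick"}},
    {"instruction": "go to the sink", "feasible": {"navigate"}},
    {"instruction": "pick up the plate", "feasible": {"navigate"}},  # plate out of reach
]

def success_rate(planner):
    """Success rate of a planner with the executor and episode set held fixed."""
    return sum(execute(planner(ep), ep) for ep in episodes) / len(episodes)

for planner in (handwritten_planner, llm_planner_stub):
    print(planner.__name__, success_rate(planner))
```

Because `execute` and `episodes` never change, any gap between the two rates is attributable to the planner alone. In this toy set the gap comes from one paraphrase, while the out-of-reach plate fails under both planners, which points at the grounded stack rather than the language model.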

This also reframes the hype around end-to-end embodied agents. The core question is not whether a model can emit an action token, but whether it can maintain a physically valid internal state, meet timing budgets, and recover from embodiment-specific failures. Most current systems still rely heavily on specialized modules for those responsibilities.

Tool Choices Around the LLM Boundary
| Tool or Library | Role in the Topic | Builder Advice |
| --- | --- | --- |
| Structured tool calling | Typed action proposals from the LLM. | Use it when free-form text would make execution or evaluation ambiguous. |
| ROS 2 actions | Execution of long-running robot skills with feedback. | Use actions when the LLM proposes skills rather than continuous controls. |
| BehaviorTree.CPP | Explicit fallback and retry logic. | Use it when LLM proposals need a deterministic execution skeleton. |
| MoveIt 2 | Grounded motion planning and collision checking. | Use it to execute geometric subgoals that an LLM can describe but not validate. |
| LangGraph | Memory and planner state transitions. | Use it when the planner must maintain multi-step conversational or tool context. |

Code Fragment 2 saves a planner trace with the fields needed for real evaluation. The key idea is that every proposal is logged alongside its verifier result and latency, so semantic fluency cannot hide execution failure.

  1. Record the prompt context or task card that generated the proposal.
  2. Store the typed action, the verifier result, and the state preconditions checked before execution.
  3. Measure planning latency separately from skill-execution latency.
  4. Tag failures as semantic, state-estimation, tool-interface, or controller failures.
  5. Compare planner variants on the same execution stack and episode set.
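
A trace with these fields might look like the following minimal sketch. The field names `planner_output`, `verifier`, `latency_ms`, and `result` follow the text. The `TraceRecord` class, the `make_record` helper, the `preconditions` field, and the latency values are assumptions made up for illustration.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TraceRecord:
    prompt_context: str   # task card or prompt that produced the proposal
    planner_output: dict  # typed action proposed by the planner
    preconditions: dict   # state checks evaluated before execution
    verifier: bool        # True only if every precondition held
    latency_ms: dict      # planning and execution latency, kept separate
    result: str           # "executed" or "blocked"

def make_record(prompt_context, planner_output, state, planning_ms, execution_ms):
    """Build one trace row; planning_ms and execution_ms are measured elsewhere."""
    preconditions = {
        "target_visible": state["target_visible"],
        "tool_available": planner_output["required_tool"] in state["toolbox"],
    }
    verifier = all(preconditions.values())
    return TraceRecord(
        prompt_context=prompt_context,
        planner_output=planner_output,
        preconditions=preconditions,
        verifier=verifier,
        # A blocked step never reaches the controller, so it has no execution latency.
        latency_ms={"planning": planning_ms, "execution": execution_ms if verifier else None},
        result="executed" if verifier else "blocked",
    )

proposal = {"step": "pick(red_mug)", "required_tool": "grasp"}
visible = {"target_visible": True, "toolbox": ["grasp", "place"]}
hidden = {"target_visible": False, "toolbox": ["grasp", "place"]}

for state in (visible, hidden):
    record = make_record("task: clear the table", proposal, state,
                         planning_ms=412.0, execution_ms=1830.0)
    print(json.dumps(asdict(record)))  # one JSON line per planner step
```

The second row keeps the same proposal but ends with `result` set to `"blocked"`, and its `preconditions` field shows which check failed without replaying the episode.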

The expected output is a co-recorded control trace where the planner decision, the verifier evidence, and the latency measurement all point in the same direction. A useful negative case would keep the same structure but end with `result='blocked'` or `result='replan'`, which would let you localize the failure without replaying the whole episode.

Code Fragment 2: This trace turns a language-model step into an auditable control event. The important engineering discipline is that `planner_output`, `verifier`, and `latency_ms` are co-recorded, which makes it possible to compare semantic quality and execution cost in one artifact.

If an embodied LLM system underperforms, first ask whether the LLM chose the wrong subgoal, whether the tool interface was incomplete, or whether the grounded stack could not realize an otherwise good plan. Those are very different scientific conclusions.
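
That triage order can be written down as a small function. This is a sketch that assumes each episode has already been annotated with three booleans; the field names and the label strings are hypothetical.

```python
from collections import Counter

def triage(episode):
    """Attribute an episode to the first stage that went wrong."""
    if not episode["subgoal_correct"]:
        return "semantic"        # the LLM chose the wrong subgoal
    if not episode["tool_exists"]:
        return "tool-interface"  # no tool could express the subgoal
    if not episode["execution_succeeded"]:
        return "grounded-stack"  # a good plan could not be realized
    return "no-failure"

# Invented annotations for illustration only.
annotated = [
    {"subgoal_correct": False, "tool_exists": True, "execution_succeeded": False},
    {"subgoal_correct": True, "tool_exists": False, "execution_succeeded": False},
    {"subgoal_correct": True, "tool_exists": True, "execution_succeeded": False},
    {"subgoal_correct": True, "tool_exists": True, "execution_succeeded": True},
    {"subgoal_correct": True, "tool_exists": True, "execution_succeeded": False},
]

counts = Counter(triage(ep) for ep in annotated)
print(dict(counts))
```

Checking the stages in this order prevents an execution failure from being blamed on the controller when the subgoal was already wrong.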

Key Takeaway

LLMs help embodied systems most when they are boxed into a typed planning role and surrounded by grounded verifiers and controllers.

Exercise 33.1.1

Pick one embodied task and write a boundary contract for an LLM planner. List exactly which inputs it sees, which outputs it may emit, and which module rejects invalid proposals before execution.

Bibliography and Further Reading
Primary Sources and Tools

Ahn et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv.

SayCan is the cleanest starting point for understanding how an LLM can propose while grounded affordances dispose.


Yang et al. (2025). "EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents." arXiv.

EmbodiedBench is a recent reference for evaluating embodied language systems across navigation and manipulation settings.


BehaviorTree.CPP Documentation. "Integration with ROS2."

This documentation is a practical reference for surrounding language proposals with explicit execution logic and recovery branches.
