A Careful Control Loop
Read the figure as a boundary map. The LLM can propose task structure and language-level intent, but the embodied system still needs state estimation, affordance checks, typed action APIs, execution monitoring, and a verifier that can reject unsupported plans.
Build And Evaluation Checklist
Depth and self-containment. This section separates semantic strengths from physical weaknesses, so that readers know exactly which parts of embodied competence can be offloaded to an LLM and which parts still demand grounded state, typed tools, and control feedback.
Production and evaluation contract. The artifact here is a planner trace with prompt state, proposed subgoals, tool calls, verifier outcomes, and latency. Without those fields, it is impossible to tell whether the LLM contributed real control value or just plausible narration.
Before trusting any claimed improvement in what LLMs can and cannot do in embodied tasks, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.
As an exercise, write one evidence row that records the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.
What LLMs can and cannot do in embodied tasks is a question about interface boundaries. LLMs are excellent at semantic decomposition, error explanation, and API selection, but weak at direct state estimation, tight feedback control, and physical constraint satisfaction.
This section defines a clear contract for when an LLM should sit inside an embodied loop and when it should stay outside as an advisor or parser.
The practical question is not whether the LLM can describe the right action sequence, but whether that description can be turned into safe, low-latency, grounded behavior.
Use the LLM for semantic search over plans and interfaces, not as a substitute for state estimation or servo control.
Theory
A convenient decomposition is $$\pi(a_t \mid h_t) = \kappa\bigl(\phi_\text{LLM}(x_t, m_t), \hat s_t\bigr),$$ where $h_t$ is the history the policy conditions on (here summarized by $x_t$, $m_t$, and $\hat s_t$), $\phi_\text{LLM}$ proposes a symbolic or programmatic plan from language context $x_t$ and memory $m_t$, and $\kappa$ is the grounded executor that consumes the proposal together with the current state estimate $\hat s_t$. The executor, not the LLM, owns physical validity.
This split explains both the promise and the limits. LLMs compress broad semantic priors into a small number of candidate subtasks or tool sequences. They do not directly measure friction, occlusion, latency, or actuator saturation. When papers claim strong embodied performance, the key question is how much the grounded stack contributes beyond the language model itself.
Treat the LLM as a high-level search policy over task decompositions, code sketches, or tool calls. Every proposal must pass through typed interfaces, state checks, and local controllers that know the robot's embodiment and current scene.
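The decomposition above can be sketched in a few lines of Python. This is an illustrative toy, not a reference implementation: `phi_llm` is a hypothetical stand-in that returns a canned proposal instead of calling a model, and the `Proposal` and `StateEstimate` types, along with the visibility and reachability checks in `kappa`, are assumptions chosen to make the split concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proposal:
    skill: str
    target: str


@dataclass(frozen=True)
class StateEstimate:
    visible: frozenset
    reachable: frozenset


def phi_llm(instruction: str, memory: tuple) -> Proposal:
    """Stand-in for the LLM: maps language context x_t and memory m_t to a proposal."""
    # Hypothetical canned output; a real system would query a model here.
    return Proposal(skill="pick", target="red_mug")


def kappa(proposal: Proposal, s_hat: StateEstimate) -> Optional[str]:
    """Grounded executor: emits an action only if the state estimate supports it."""
    if proposal.target not in s_hat.visible:
        return None  # target not perceived, so the proposal is rejected
    if proposal.target not in s_hat.reachable:
        return None  # target perceived but not reachable
    return f"{proposal.skill}({proposal.target})"


def policy(instruction: str, memory: tuple, s_hat: StateEstimate) -> Optional[str]:
    """pi(a_t | h_t) = kappa(phi_LLM(x_t, m_t), s_hat_t)."""
    return kappa(phi_llm(instruction, memory), s_hat)


s_ok = StateEstimate(visible=frozenset({"red_mug"}), reachable=frozenset({"red_mug"}))
s_occluded = StateEstimate(visible=frozenset(), reachable=frozenset({"red_mug"}))

action_ok = policy("bring me the red mug", (), s_ok)
action_blocked = policy("bring me the red mug", (), s_occluded)
print(action_ok, action_blocked)
```

The same proposal yields an action in one scene and `None` in the other, which is the sense in which the executor, not the LLM, owns physical validity.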
Worked Example
Code Fragment 1 implements the smallest planner boundary: the LLM proposes one symbolic subgoal, but the executor only accepts it if the required tool and state preconditions are satisfied.
```python
# Accept an LLM proposal only when the grounded stack can execute it.
# The executor checks state and tool availability before acting.
# This boundary keeps semantic planning separate from physical validity.
proposal = {"step": "pick(red_mug)", "required_tool": "grasp"}
state = {"target_visible": True, "toolbox": {"grasp", "place"}}
can_execute = state["target_visible"] and proposal["required_tool"] in state["toolbox"]
decision = "execute" if can_execute else "replan"
print({"proposal": proposal["step"], "can_execute": can_execute, "decision": decision})
```
The expected output is an executable proposal whose semantic content and grounded feasibility agree. If `proposal` looked sensible but `can_execute` were `False`, the correct diagnosis would be a grounding or interface failure rather than a language-understanding failure.
Modern tool-calling APIs and structured-output runtimes turn the same boundary into a few lines by forcing the LLM to emit typed action objects. They absorb prompt formatting, JSON validation, and retry logic so the engineer can focus on the execution contract and verifier design.
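To show what such a typed boundary does underneath, here is a minimal standard-library sketch that parses raw model output into a typed action or rejects it. It does not use any particular vendor's tool-calling API; the `Action` type, the `ALLOWED_SKILLS` set, and the two-field schema are hypothetical choices for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_SKILLS = {"navigate", "grasp", "place"}  # hypothetical skill vocabulary


@dataclass(frozen=True)
class Action:
    skill: str
    target: str


def parse_action(raw: str) -> Action:
    """Parse raw model output into a typed Action, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"skill", "target"}:
        raise ValueError("expected exactly the fields 'skill' and 'target'")
    skill, target = data["skill"], data["target"]
    if not isinstance(skill, str) or skill not in ALLOWED_SKILLS:
        raise ValueError(f"unknown skill: {skill!r}")
    if not isinstance(target, str) or not target:
        raise ValueError("target must be a non-empty string")
    return Action(skill=skill, target=target)


good = parse_action('{"skill": "grasp", "target": "red_mug"}')

try:
    parse_action('{"skill": "teleport", "target": "red_mug"}')
    rejected = False
except ValueError:
    rejected = True  # the caller can retry or replan here

print(good, rejected)
```

A rejected parse is a tool-interface event, not an execution failure, so it is worth logging separately from anything the verifier or controller reports.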
Practical Recipe
- Write down which variables the LLM sees and which variables only the grounded stack sees.
- Require every LLM proposal to map into a typed action or code object.
- Attach a verifier to every proposal so textual plausibility never counts as success by itself.
- Measure latency separately for planning, execution, and recovery.
- Benchmark against strong non-LLM baselines on the same task contract before claiming an embodied gain.
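The latency item in the recipe can be sketched with a small timing helper. This is one possible approach, assuming three hypothetical phase names (`planning`, `execution`, `recovery`) and using `time.sleep` as a placeholder for real work such as a model call or a skill rollout.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase (phase names are illustrative).
latency_s = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the elapsed wall-clock time of the enclosed block to latency_s[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency_s[phase] += time.perf_counter() - start


with timed("planning"):
    time.sleep(0.01)  # placeholder for an LLM planning call

with timed("execution"):
    time.sleep(0.02)  # placeholder for running a grounded skill

with timed("recovery"):
    pass  # no recovery was needed in this toy episode

print(dict(latency_s))
```

Keeping the phases in separate buckets makes it visible when a system meets its task budget only because the planner is slow and the controller is fast, or the reverse.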
A frequent failure mode is to confuse descriptive competence with control competence. An LLM may explain how to pour safely while still lacking any grounded estimate of the cup pose, liquid dynamics, or actuator limits needed to perform the action.
In mobile manipulation, an LLM can select the sequence 'navigate to sink, grasp sponge, wipe spill,' but it should not be the module that estimates whether the sponge is currently visible or whether the arm can reach it without collision. Those checks belong to perception and planning tools that expose measurable state.
Large language models are fantastic interns for whiteboard planning. They are much less convincing when asked to be gravity, friction, and depth sensors all at once.
Current embodied LLM work is moving from prompt-only planners toward typed tool use, verifier loops, and benchmark suites such as EmbodiedBench. The hard scientific question is whether LLMs add embodied value beyond semantic decomposition once strong world models, VLMs, and classic planners are already in place.
Can you point to one subproblem in your stack that genuinely benefits from broad language priors, and one subproblem where replacing a grounded estimator with an LLM would be irresponsible or pointless?
A useful scientific discipline is to evaluate LLM contribution at the intervention boundary. Replace the LLM planner with a hand-written planner, a behavior tree, or a retrieval baseline while keeping execution fixed. If performance barely changes, the embodied value comes from the grounded stack, not from the language model.
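The planner-swap ablation can be sketched as a harness that holds the executor and the episode set fixed. Everything here is a toy assumption: both planners are hard-coded stand-ins, the `execute` function only checks that each step's object is present in the scene, and the two episodes are invented for illustration rather than drawn from any benchmark.

```python
def llm_planner(task: str) -> list:
    """Stand-in for an LLM planner; a real one would query a model."""
    plans = {"clean spill": ["navigate(sink)", "grasp(sponge)", "wipe(spill)"]}
    return plans.get(task, [])


def handwritten_planner(task: str) -> list:
    """Hand-written baseline for the same task contract."""
    plans = {"clean spill": ["navigate(sink)", "grasp(sponge)", "wipe(spill)"]}
    return plans.get(task, [])


def execute(plan: list, scene: set) -> bool:
    """Fixed grounded executor: succeeds only if every step's object is in the scene."""
    if not plan:
        return False
    return all(step[step.index("(") + 1 : -1] in scene for step in plan)


# Toy episode set: (task, objects present in the scene).
EPISODES = [
    ("clean spill", {"sink", "sponge", "spill"}),
    ("clean spill", {"sink", "spill"}),  # sponge missing, so any plan using it fails
]


def success_rate(planner) -> float:
    """Fraction of episodes the fixed executor completes under the given planner."""
    return sum(execute(planner(task), scene) for task, scene in EPISODES) / len(EPISODES)


rates = {"llm": success_rate(llm_planner), "handwritten": success_rate(handwritten_planner)}
gap = rates["llm"] - rates["handwritten"]
print(rates, gap)
```

In this toy the two planners emit identical plans, so the gap is zero and every failure traces back to the scene rather than the planner. On a real stack, the size of that gap is the quantity to report.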
This also reframes the hype around end-to-end embodied agents. The core question is not whether a model can emit an action token, but whether it can maintain a physically valid internal state, meet timing budgets, and recover from embodiment-specific failures. Most current systems still rely heavily on specialized modules for those responsibilities.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Structured tool calling | Typed action proposals from the LLM. | Use it when free-form text would make execution or evaluation ambiguous. |
| ROS 2 actions | Execution of long-running robot skills with feedback. | Use actions when the LLM proposes skills rather than continuous controls. |
| BehaviorTree.CPP | Explicit fallback and retry logic. | Use it when LLM proposals need a deterministic execution skeleton. |
| MoveIt 2 | Grounded motion planning and collision checking. | Use it to execute geometric subgoals that an LLM can describe but not validate. |
| LangGraph | Memory and planner state transitions. | Use it when the planner must maintain multi-step conversational or tool context. |
Code Fragment 2 saves a planner trace with the fields needed for real evaluation. The key idea is that every proposal is logged alongside its verifier result and latency, so semantic fluency cannot hide execution failure.
- Record the prompt context or task card that generated the proposal.
- Store the typed action, the verifier result, and the state preconditions checked before execution.
- Measure planning latency separately from skill-execution latency.
- Tag failures as semantic, state-estimation, tool-interface, or controller failures.
- Compare planner variants on the same execution stack and episode set.
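A minimal sketch of a trace record covering the checklist above follows. The `TraceRecord` field names, the `verify` helper, and the latency numbers are hypothetical placeholders, not measurements; the point is only that the proposal, the checked preconditions, the verifier result, and both latencies are written in one row.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TraceRecord:
    task_card: str                  # prompt context that produced the proposal
    action: dict                    # typed action emitted by the planner
    preconditions: dict             # state checks evaluated before execution
    result: str                     # "executed" or "blocked" in this sketch
    planning_latency_s: float
    execution_latency_s: float
    failure_label: Optional[str] = None


def verify(preconditions: dict) -> str:
    """Toy verifier: allow execution only if every precondition holds."""
    return "executed" if all(preconditions.values()) else "blocked"


ok_checks = {"target_visible": True, "tool_available": True}
ok_record = TraceRecord(
    task_card="bring the red mug",
    action={"skill": "grasp", "target": "red_mug"},
    preconditions=ok_checks,
    result=verify(ok_checks),
    planning_latency_s=0.42,   # illustrative placeholder value
    execution_latency_s=3.10,  # illustrative placeholder value
)

bad_checks = {"target_visible": False, "tool_available": True}
bad_record = TraceRecord(
    task_card="bring the red mug",
    action={"skill": "grasp", "target": "red_mug"},
    preconditions=bad_checks,
    result=verify(bad_checks),
    planning_latency_s=0.40,   # illustrative placeholder value
    execution_latency_s=0.0,   # nothing ran because the verifier blocked it
    failure_label="state-estimation",
)

# One JSON object per line keeps traces easy to append and to diff.
lines = [json.dumps(asdict(r), sort_keys=True) for r in (ok_record, bad_record)]
print("\n".join(lines))
```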
The expected output is a co-recorded control trace where the planner decision, the verifier evidence, and the latency measurement all point in the same direction. A useful negative case would keep the same structure but end with `result='blocked'` or `result='replan'`, which would let you localize the failure without replaying the whole episode.
If an embodied LLM system underperforms, first ask whether the LLM chose the wrong subgoal, whether the tool interface was incomplete, or whether the grounded stack could not realize an otherwise good plan. Those are very different scientific conclusions.
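One way to make that triage mechanical is a small labeling function that checks the candidate causes in a fixed order. The ordering, the boolean inputs, and the label strings are assumptions for illustration; in practice each boolean would come from a human annotation or an automated check on the trace.

```python
from typing import Optional


def label_failure(
    subgoal_correct: bool,
    tool_exists: bool,
    state_estimate_correct: bool,
    execution_succeeded: bool,
) -> Optional[str]:
    """Assign one failure label per episode, checking upstream causes first."""
    if execution_succeeded:
        return None
    if not subgoal_correct:
        return "semantic"          # the LLM chose the wrong subgoal
    if not tool_exists:
        return "tool-interface"    # no typed action could express the subgoal
    if not state_estimate_correct:
        return "state-estimation"  # perception gave the executor a wrong state
    return "controller"            # good plan, good state, failed realization


labels = [
    label_failure(True, True, True, True),
    label_failure(False, True, True, False),
    label_failure(True, False, True, False),
    label_failure(True, True, False, False),
    label_failure(True, True, True, False),
]
print(labels)
```

Checking upstream causes first means a wrong subgoal is never blamed on the controller, at the cost of hiding downstream problems in episodes that fail for more than one reason.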
LLMs help embodied systems most when they are boxed into a typed planning role and surrounded by grounded verifiers and controllers.
Pick one embodied task and write a boundary contract for an LLM planner. List exactly which inputs it sees, which outputs it may emit, and which module rejects invalid proposals before execution.
Ahn et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv.
SayCan is the cleanest starting point for understanding how an LLM can propose while grounded affordances dispose.
EmbodiedBench is a recent reference for evaluating embodied language systems across navigation and manipulation settings.
BehaviorTree.CPP Documentation. "Integration with ROS2."
This documentation is a practical reference for surrounding language proposals with explicit execution logic and recovery branches.