Section 3.7: Where LLMs, VLMs, and VLAs sit in the stack | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Big Picture

Asking where LLMs, VLMs, and VLAs sit in the stack is one lens on embodied system architectures. We study it because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments.

Figure 3.7. The placement of LLMs, VLMs, and VLAs is easiest to reason about as a closed loop of evidence, decision, and consequence, in which each model family occupies a different interface in the stack.

This section turns the technical contract for where LLMs, VLMs, and VLAs sit in the stack into a usable mental model. First we define the object of study, then we connect it to the agent loop, then we test it with a compact implementation.

The key question is practical: what must the agent know, what can it observe, what action is available, and what evidence shows that the action worked under the stated conditions?

Action Is The Test

A representation earns its place when it changes the measurable action interface. When deciding where an LLM, VLM, or VLA sits in the stack, keep asking which decision becomes easier, safer, or more reliable.

Theory

The practical design rule is to make each interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.
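One way to make that rule concrete is to write the contract down as data before any model is attached. The sketch below is a minimal illustration, not a standard API: the `InterfaceContract` class, its field names, the example units, and the numeric limits are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceContract:
    """Hypothetical record of what one module promises at its boundary."""
    name: str
    inputs: dict            # field name -> unit or type description
    outputs: dict           # field name -> unit or type description
    max_latency_s: float    # timing budget for one call
    bounds: dict            # output field -> (low, high), same units as outputs
    failure_labels: tuple   # labels the module is allowed to report

    def check_output(self, output, latency_s):
        """Return a list of violations; an empty list means the handoff is in contract."""
        problems = []
        if latency_s > self.max_latency_s:
            problems.append(f"latency {latency_s:.3f}s > {self.max_latency_s:.3f}s")
        for key in self.outputs:
            if key not in output:
                problems.append(f"missing output '{key}'")
        for key, (low, high) in self.bounds.items():
            if key in output and not (low <= output[key] <= high):
                problems.append(f"{key}={output[key]} outside [{low}, {high}]")
        return problems


# Illustrative values only; a real grounder would have its own units and limits.
grounder = InterfaceContract(
    name="vlm_grounder",
    inputs={"image": "uint8 HxWx3", "prompt": "text"},
    outputs={"x": "m, robot base frame", "y": "m, robot base frame"},
    max_latency_s=0.2,
    bounds={"x": (0.0, 1.0), "y": (-0.5, 0.5)},
    failure_labels=("not_visible", "out_of_bounds", "timeout"),
)

print(grounder.check_output({"x": 0.4, "y": 0.1}, latency_s=0.05))
print(grounder.check_output({"x": 1.7}, latency_s=0.35))
```

The first call returns an empty list. The second reports three violations (latency, a missing `y`, and an out-of-bounds `x`), which is the kind of record that belongs in the saved artifact next to the rollout.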

LLMs, VLMs, and VLAs are not interchangeable upgrades. They sit at different interfaces in the stack. An LLM maps language context to language, plans, tool calls, or symbolic instructions. A VLM maps images and language to grounded descriptions, detections, affordances, or decisions. A VLA maps visual and language context closer to robot action, often as action tokens, trajectories, or low-level command chunks.

Where Foundation Models Sit In The Embodied Stack

Model family	Typical input	Typical output	Interface risk
LLM	goal text, memory, tool results	plan, code, command, or query	May produce plausible plans that are not grounded in the current scene.
VLM	image, video, text prompt	description, localization, affordance, or decision	May see the object but miss geometry, contact, timing, or calibration constraints.
VLA	image or state plus language goal	action token, trajectory, or controller target	May hide action scaling, embodiment assumptions, and recovery logic inside the model.

The design question is therefore not "Which model is most capable?" The better question is "Which interface needs learned generalization, and which interface still needs an explicit contract?" A strong architecture often uses an LLM for task decomposition, a VLM for scene grounding, and a VLA or controller for execution, with explicit checks between them.
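An explicit check between stages can be as small as a gate on the planner's output. The sketch below assumes a hypothetical skill vocabulary (`ALLOWED_SKILLS`) and a hypothetical convention that each subgoal starts with its skill verb; neither comes from a real library.

```python
ALLOWED_SKILLS = {"grasp", "place"}  # hypothetical skill vocabulary for this example


def check_plan(subgoals):
    """Split an LLM plan into steps the executor knows and steps it must reject."""
    accepted, rejected = [], []
    for step in subgoals:
        words = step.split()
        # Assumed convention: the first word of a subgoal names the skill.
        if words and words[0] in ALLOWED_SKILLS:
            accepted.append(step)
        else:
            rejected.append(step)
    return accepted, rejected


accepted, rejected = check_plan(["grasp apple", "teleport to bowl", "place in bowl"])
print("accepted:", accepted)
print("rejected:", rejected)
```

A gate like this does not make the plan correct. It only ensures that an ungrounded or unsupported step is caught and logged at the language interface instead of surfacing later as a confusing execution failure.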

Mechanism

The mechanism is the contract between representation and action. For each module, name what enters it, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.

Worked Example

The three model families compose into one pipeline: an LLM decomposes the goal, a VLM grounds each subgoal in the image, and a VLA turns a grounded subgoal into an action. The example uses stand-in functions for each model so the structure is visible, then runs the section's key diagnostic, oracle substitution, to locate which interface caused a failure. This is the executable form of the Fun Note: the plan and the pointing only matter if the action survives contact.

# Stand-ins for the three model families. Swap in real models later;
# the contract (what enters, what leaves) is what we are testing.
def llm_plan(goal):                       # language -> ordered subgoals
    return {"put apple in bowl": ["grasp apple", "place in bowl"]}[goal]

def vlm_ground(subgoal, scene):           # image+text -> object + pose
    obj = subgoal.split()[-1]             # "grasp apple" / "place in bowl"
    return scene.get(obj)                 # None if not visible -> grounding gap

def vla_act(grounded):                    # state+goal -> action token
    if grounded is None:
        return None
    return f"move_to({grounded['x']:.2f},{grounded['y']:.2f}) + close"

def run(goal, scene):
    for sub in llm_plan(goal):
        g = vlm_ground(sub, scene)
        a = vla_act(g)
        print(f"  {sub:16s} ground={g} action={a}")
        if a is None:
            return f"FAIL at grounding: {sub}"
    return "ok"

scene_ok  = {"apple": {"x": 0.4, "y": 0.1}, "bowl": {"x": 0.7, "y": 0.0}}
scene_bad = {"bowl": {"x": 0.7, "y": 0.0}}            # apple occluded

print("scene_ok :", run("put apple in bowl", scene_ok))
print("scene_bad:", run("put apple in bowl", scene_bad))
# Oracle substitution: inject the apple pose to confirm the VLM, not the
# LLM or VLA, was the first cause in scene_bad.
scene_fixed = dict(scene_bad, apple={"x": 0.4, "y": 0.1})
print("oracle   :", run("put apple in bowl", scene_fixed))

Code Fragment 3.7.1 composes an LLM planner, a VLM grounder, and a VLA actor, then uses oracle substitution to attribute a failure to the grounding interface rather than to the plan or the controller.

Expected output: the good scene runs end to end; the bad scene fails at grounding because the apple is not visible, and the VLA correctly produces no action rather than reaching for nothing; the oracle run succeeds once the missing pose is injected, which shows that the VLM grounding was the first cause. The lesson matches the section's diagnostic: replace one model output with a verified value and see whether the system recovers, instead of swapping the whole foundation model on a hunch.

Library Shortcut

The hand-built fragment is a visibility tool. Production work should move to maintained stacks such as Hugging Face Transformers, open VLMs, OpenVLA, openpi, LeRobot, and tool-calling planners once the interface, logging contract, and failure recovery path have been made explicit.

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.
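The structured failure cases in the recipe above can be sketched as a small record type. The class and field names here are illustrative assumptions, and the three example cases are made up to show the shape of the data, not measured results.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """The five categories named in the recipe."""
    PERCEPTION = "perception error"
    STATE = "state error"
    PLANNING = "planning error"
    CONTROL = "control error"
    EVALUATION = "evaluation error"


@dataclass
class FailureCase:
    episode: int
    kind: FailureKind
    note: str


def summarize(cases):
    """Count cases per category, reporting zero for categories never seen."""
    counts = Counter(case.kind for case in cases)
    return {kind.value: counts.get(kind, 0) for kind in FailureKind}


# Made-up cases for illustration only.
cases = [
    FailureCase(3, FailureKind.PERCEPTION, "apple occluded"),
    FailureCase(7, FailureKind.CONTROL, "gripper saturated"),
    FailureCase(9, FailureKind.PERCEPTION, "bowl pose in wrong frame"),
]
print(summarize(cases))
```

Reporting the zero counts is deliberate: a category that never appears may mean the system is strong there, or that nothing in the evaluation is able to detect that kind of failure.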

Common Failure Mode

The common mistake when placing LLMs, VLMs, and VLAs in the stack is to celebrate the component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.
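Two of these boundary failures, stale state and wrong frame, can be checked mechanically at the handoff. The sketch below assumes a hypothetical message layout (a dict with `stamp_s` and `frame` keys), an illustrative 0.1 s staleness budget, and `base_link` as the expected frame name, following common ROS naming practice; adjust all three to the actual system.

```python
def check_handoff(msg, now_s, expected_frame="base_link", max_age_s=0.1):
    """Return boundary failure labels for one message passed between modules."""
    problems = []
    if now_s - msg["stamp_s"] > max_age_s:
        problems.append("stale_state")   # the observation is older than the budget
    if msg["frame"] != expected_frame:
        problems.append("wrong_frame")   # the pose is not in the frame the consumer expects
    return problems


# Hypothetical messages: one fresh and correctly framed, one old and in the camera frame.
fresh = {"stamp_s": 9.95, "frame": "base_link", "x": 0.4, "y": 0.1}
old_camera = {"stamp_s": 9.50, "frame": "camera", "x": 0.4, "y": 0.1}

print(check_handoff(fresh, now_s=10.0))
print(check_handoff(old_camera, now_s=10.0))
```

Both messages carry the same coordinates, which is the point: a component score computed on `x` and `y` alone would not distinguish them, while the handoff check does.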

Practical Example

A robotics team that combines LLMs, VLMs, and VLAs in one stack should log not only final success, but also intermediate observations, chosen actions, controller status, and recovery events. The logs reveal whether the method is solving the task or merely passing the easiest episodes.

Fun Note

An LLM can explain the plan, a VLM can point at the object, and a VLA is where the explanation has to survive contact with the gripper.

Research Frontier

Treat frontier claims about LLM, VLM, and VLA architectures as hypotheses until they expose enough detail to reproduce the result: data boundary, embodiment, controller interface, evaluation panel, and failure cases.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode at each point where an LLM, VLM, or VLA sits in your stack? If not, the system boundary is still too vague.

Knowing where LLMs, VLMs, and VLAs sit in the stack becomes useful when it is tied to a closed-loop contract for how perception, estimation, planning, learning, and control are arranged into a system. The contract names the observation stream, the action representation, the timing budget, the safety boundary, and the result artifact. That is the bridge between a readable concept and a system a skeptical builder can test.

When evaluating an architecture built from LLMs, VLMs, and VLAs, separate the conceptual claim, the systems claim, and the evidence claim. A good explanation, a clean API, and one successful rollout are different kinds of evidence, and they should be kept distinct.

Tool or Library	Role in This Topic	Builder Advice
ROS 2	separates system modules while preserving message contracts and timing	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
MuJoCo	gives architecture choices a repeatable simulated world for stress tests	Use it when the hand-built contract is clear and the experiment needs repeatable runs.
LeRobot	anchors modern policy architectures in reusable datasets and policy APIs	Use it when the hand-built contract is clear and the experiment needs repeatable runs.

A robust implementation starts with one inspectable baseline whose artifact records observations, actions, units, timestamps, seeds, termination reasons, and the perturbation applied. The maintained-tool version is useful only if it preserves that schema and lets the comparison remain construct-matched.

Write a one-paragraph task contract with observation, action, success, failure, and safety fields.
Start with the smallest simulator, dataset, or wrapper that exposes the task contract faithfully.
Run one deterministic smoke test and one perturbation test before scaling.
Save one artifact containing configuration, seed, metrics, traces, and failure labels.
Compare methods only when the same script evaluates the same panel, split, seed set, and metric.
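The smoke test, perturbation test, and saved artifact from the steps above can be sketched in a few lines. Everything here is a toy assumption: `run_episode` stands in for a real rollout, the target position 0.4 and the 0.05 tolerance are arbitrary, and the artifact keys are one possible schema.

```python
import json
import random


def run_episode(seed, noise):
    """Toy stand-in rollout: observe a target at x=0.4 under optional sensor noise."""
    rng = random.Random(seed)                  # seeded, so each episode is repeatable
    observed = 0.4 + rng.uniform(-noise, noise)
    error = abs(observed - 0.4)
    success = error < 0.05                     # arbitrary tolerance for this sketch
    return {"seed": seed, "error": round(error, 4), "success": success,
            "failure_label": None if success else "perception"}


def build_artifact(config, seeds):
    """Evaluate one configuration on a fixed seed set and package the result."""
    traces = [run_episode(seed, config["noise"]) for seed in seeds]
    return {
        "config": config,
        "seeds": list(seeds),
        "metrics": {"success_rate": sum(t["success"] for t in traces) / len(traces)},
        "traces": traces,
        "failure_labels": [t["failure_label"] for t in traces if t["failure_label"]],
    }


SEEDS = [0, 1, 2]
smoke = build_artifact({"noise": 0.0}, SEEDS)       # deterministic smoke test
perturbed = build_artifact({"noise": 0.2}, SEEDS)   # same seeds, one perturbation

print(json.dumps(smoke, indent=2))
print("perturbed success rate:", perturbed["metrics"]["success_rate"])
```

Because both runs share one script, one seed set, and one metric, any difference between them is attributable to the perturbation. Writing the JSON string to disk gives the single artifact the steps call for.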

When a stack built from LLMs, VLMs, and VLAs fails, avoid labeling the whole method as weak. First assign the failure to perception, state estimation, planning, control, timing, data coverage, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause. This pattern turns a disappointing rollout into a reusable diagnostic asset.

The quickest diagnostic is to replace one model output with a verified oracle value. If the system succeeds when the LLM plan is manually corrected, the language-level decomposition is the first suspect. If it succeeds only when the VLM grounding is corrected, the problem is scene understanding. If it still fails after both are corrected, inspect action scaling, controller limits, and embodiment-specific assumptions before changing the foundation model.

Key Takeaway

Placing an LLM, VLM, or VLA in the stack is useful when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 3.7.1

Design a method-matched experiment that tests where LLMs, VLMs, and VLAs sit in the stack. Specify the environment, observation schema, action interface, metric, and one perturbation that targets the section's core assumption.

What's Next?

Section 3.8 studies architecture-specific failure modes and how to diagnose them.

Bibliography & Further Reading

Foundational References For This Section

Quigley, M. et al. "ROS: an open-source Robot Operating System." (2009). https://www.ros.org/

The systems reference for modular robot software and message-passing architecture.

Todorov, E., Erez, T., and Tassa, Y. "MuJoCo: A physics engine for model-based control." (2012). https://mujoco.org/

A widely used simulator for architecture and control experiments.

Brohan, A. et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." (2023). https://arxiv.org/abs/2307.15818

A central reference for locating VLM and VLA models in embodied control stacks.