Section 33.7: Memory, state tracking, and hallucination in physical tasks | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Read the figure as a memory-validity audit. A planner may remember prior observations, but physical tasks require timestamps, scope, invalidation rules, and a check that retrieved state still matches the current scene.

Figure 33.7: A closed-loop map for Memory, state tracking, and hallucination in physical tasks. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Depth and self-containment. This section must explain why memory in embodied systems is a state-estimation problem, not only a long-context problem. Readers should leave knowing which facts must be grounded and refreshed from sensors.

Production and evaluation contract. The artifact should record remembered facts, their source, freshness, and whether they were later verified or contradicted by perception. Otherwise hallucination remains a vague label.

Checklist Memory Anchor

For Memory, state tracking, and hallucination in physical tasks, name the language interface, grounded world state, executable action contract, and evidence artifact before trusting any claimed improvement.

Mini Audit Exercise

For Memory, state tracking, and hallucination in physical tasks, write one evidence row recording instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first under command misunderstanding.

Big Picture

Memory and hallucination in embodied agents are about keeping world state synchronized with words. The agent must remember object identities, task progress, and user preferences without turning stale guesses into confident plans.

This section shows how LLM memory should be paired with explicit state tracking so that past context helps planning without silently overriding new sensor evidence.

The practical question is which memories should live as symbolic facts, which should live as scene state, and how hallucinated memories should be caught before action.

Action Is The Test

Embodied memory is only useful if it carries provenance and freshness. A remembered object location with no timestamp is not memory; it is a latent bug.

Theory

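Let memory items be facts $m_i = (f_i, c_i, t_i)$, where $f_i$ is the content, $c_i$ the confidence, and $t_i$ the timestamp. A planner should reason over a belief state $$b_t = p(s_t \mid o_{1:t}, a_{1:t-1}, m_{1:t}),$$ that is, a distribution over the world state $s_t$ given observations $o_{1:t}$, past actions $a_{1:t-1}$, and the memory items $m_{1:t}$ stored up to time $t$, not over free-floating text summaries alone. New observations should update or erase memory items whose confidence is no longer justified.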
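One way to make the update rule concrete is sketched below. It is an illustration, not the section's algorithm: the exponential decay, the 30-second half-life, the 0.2 erase threshold, the 0.95 refresh confidence, and the names `MemoryItem`, `decayed_confidence`, and `update` are all assumptions chosen for the example. Each item mirrors $m_i = (f_i, c_i, t_i)$, and an observation either reconfirms a fact, contradicts it, or says nothing about it.

```python
from dataclasses import dataclass

HALF_LIFE_S = 30.0   # assumed: confidence halves every 30 s without reconfirmation
ERASE_BELOW = 0.2    # assumed: items whose decayed confidence falls below this are erased
REFRESHED_C = 0.95   # assumed: confidence assigned to a freshly observed fact


@dataclass
class MemoryItem:
    fact: str          # f_i: content
    confidence: float  # c_i: confidence at the last confirmation
    timestamp: float   # t_i: time of the last confirmation, in seconds


def decayed_confidence(item: MemoryItem, now: float) -> float:
    """Discount stored confidence by age (illustrative exponential decay)."""
    age = max(0.0, now - item.timestamp)
    return item.confidence * 0.5 ** (age / HALF_LIFE_S)


def update(items, observed, now):
    """Apply one observation.

    `observed` maps fact -> True (reconfirmed) or False (contradicted).
    Facts absent from `observed` were simply not seen this step.
    """
    kept = []
    for item in items:
        seen = observed.get(item.fact)
        if seen is True:
            kept.append(MemoryItem(item.fact, REFRESHED_C, now))  # update
        elif seen is False:
            continue                                              # erase: contradicted
        elif decayed_confidence(item, now) >= ERASE_BELOW:
            kept.append(item)                                     # unseen but still plausible
    return kept


memory = [
    MemoryItem("red_mug_on_counter", 0.9, 0.0),
    MemoryItem("door_is_open", 0.9, 0.0),
    MemoryItem("kettle_is_full", 0.9, 0.0),
]
memory = update(
    memory,
    {"red_mug_on_counter": True, "door_is_open": False},
    now=90.0,
)
print([m.fact for m in memory])  # ['red_mug_on_counter']
```

After 90 seconds the mug fact survives because perception reconfirmed it, the door fact is erased because perception contradicted it, and the kettle fact is erased because nothing has supported it for three half-lives.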

Hallucination in embodied tasks often means one of three things: inventing an object or tool, asserting a stale state as current, or carrying a wrong relational fact across scene changes. The fix is rarely 'better prompting' alone. It is usually a better contract between memory, observation, and verification.

Mechanism

A good memory system separates semantic memory, such as user preference, from dynamic world state, such as object location. The first may persist across episodes; the second should expire quickly or be refreshed from sensors before use.

Worked Example

Code Fragment 1 stores two memories with different freshness and shows how the planner should gate them before use. The example demonstrates why timestamps belong in the memory schema.

# Reject stale world-state memory while keeping durable preference memory.
# Embodied memory should store freshness and source, not just text.
# This keeps old observations from masquerading as current state.
memory = [
    {"fact": "user_prefers_blue_mug", "age_s": 600, "durable": True},
    {"fact": "red_mug_is_on_counter", "age_s": 45, "durable": False},
]

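# Dynamic facts older than this limit must be re-observed before the planner may use them.
MAX_AGE_S = 10
usable = [m["fact"] for m in memory if m["durable"] or m["age_s"] < MAX_AGE_S]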
print(usable)

['user_prefers_blue_mug']

The expected output is a memory subset where durable user preferences survive but stale scene claims do not. The point is that embodied memory should grant planning authority only to facts whose lifespan matches the kind of fact they are, not to every retrieved sentence equally.

Code Fragment 1: This gating rule preserves long-lived preference memory while rejecting stale world-state memory. The key lesson is that not all remembered text should have equal planning authority once the physical scene may have changed.

Library Shortcut

State stores, graph memories, and vector memories can all hold the facts, but they are only safe in robotics when coupled to freshness metadata and sensor-side verification hooks. The library can manage retrieval; it cannot decide which physical facts are still true.

Practical Recipe

Store memory items with source, timestamp, confidence, and type.
Separate durable preferences from dynamic world-state facts.
Refresh or invalidate dynamic facts before high-consequence actions.
Never let retrieved text bypass a verifier when the action depends on current geometry.
Log contradictions between memory and observation as first-class events.
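The recipe above can be sketched as a single gate that every remembered fact passes through before the planner acts on it. This is a minimal illustration under stated assumptions: `Fact`, `FactKind`, `value_for_action`, the 10-second freshness limit, and the `perceive` callback (a stand-in for whatever sensor-side query a real system exposes) are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum


class FactKind(Enum):
    PREFERENCE = "preference"    # durable across episodes
    WORLD_STATE = "world_state"  # dynamic, must be refreshed from sensors


@dataclass
class Fact:
    key: str
    value: str
    source: str        # where the fact came from
    timestamp: float   # seconds, when it was last confirmed
    confidence: float
    kind: FactKind


MAX_AGE_S = 10.0     # assumed freshness limit for dynamic facts
contradictions = []  # contradictions are logged as first-class events


def value_for_action(fact, perceive, now, high_consequence=False):
    """Return a value the planner may act on, re-observing dynamic facts when needed."""
    if fact.kind is FactKind.PREFERENCE:
        return fact.value
    fresh = (now - fact.timestamp) <= MAX_AGE_S
    if fresh and not high_consequence:
        return fact.value
    observed = perceive(fact.key)  # hypothetical sensor-side query
    if observed != fact.value:
        contradictions.append(
            {"key": fact.key, "remembered": fact.value, "observed": observed, "at": now}
        )
    fact.value, fact.source, fact.timestamp = observed, "perception", now
    return observed


def fake_perceive(key):
    """Stand-in for perception: the mug has been moved to the sink."""
    return {"red_mug_location": "sink"}.get(key)


pref = Fact("mug_preference", "blue", "user_utterance", 0.0, 0.9, FactKind.PREFERENCE)
loc = Fact("red_mug_location", "counter", "camera", 0.0, 0.8, FactKind.WORLD_STATE)

print(value_for_action(pref, fake_perceive, now=600.0))  # blue
print(value_for_action(loc, fake_perceive, now=45.0))    # sink
print(len(contradictions))                               # 1
```

The design choice worth noting is that the gate, not the planner, decides when memory is trusted: a durable preference passes through untouched, while a dynamic fact is re-observed once it is too old or the action is marked high-consequence.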

Common Failure Mode

The easiest hallucination to miss is not a novel object. It is a plausible but stale memory, such as believing the mug is still on the counter after another agent already moved it.

Practical Example

A household robot may remember that the user prefers tea in the blue mug across many days, but it should not remember that the blue mug is on the left shelf unless that fact was refreshed by recent perception. One memory is durable preference; the other is dynamic scene state.

Memory Hook

Embodied hallucination is often just nostalgia with a manipulator attached.

Research Frontier

Current research explores memory graphs, learned world models, and verifier-guided long-horizon planning for embodied agents. The open challenge is keeping memories useful across long tasks without allowing stale facts to outrank fresh sensor evidence.

Self Check

Can you list one fact in your system that should persist across sessions and one that should expire within seconds unless perception reconfirms it?

This section connects directly to classical filtering and SLAM. The novelty is that language memories and symbolic task facts must join the same belief-management discipline as geometric state. Otherwise the planner treats a ten-minute-old caption and a ten-millisecond-old sensor reading as equally authoritative.

That is also why hallucination should be decomposed. A model may hallucinate semantically, but many embodied 'hallucinations' are actually stale-state propagation errors. Better memory schemas, not bigger models, are often the right fix.

Tool Choices For Embodied Memory and State Tracking

Tool or Library	Role in the Topic	Builder Advice
LangGraph or explicit state graph	Planner-visible memory state.	Use it when memory items should change planner behavior in transparent ways.
Semantic map or object tracker	Grounded dynamic world state.	Use it when remembered object locations must be refreshed from sensors.
Vector store with metadata	Retrieval of durable semantic context.	Use it for user preferences or long-range task summaries, not raw geometry.
Pydantic schemas	Typed memory records with freshness fields.	Use them to prevent planner logic from consuming untyped memory blobs.
Verifier layer	Checks remembered facts against observation.	Use it whenever an action depends on the present physical world.

Code Fragment 2 stores a memory record with provenance and freshness. This is the minimum structure needed to talk coherently about embodied hallucination instead of merely complaining that the agent 'made something up.'

Tag each memory by type: preference, world state, task progress, or explanation.
Attach timestamps and evidence sources to every remembered fact.
Force memory retrieval to pass through a fact-validity gate before execution.
Record contradiction events when perception and memory disagree.
Evaluate memory systems on tasks with delayed execution and hidden state changes.
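A minimal sketch of such a record follows, assuming a plain Python dataclass rather than any particular library. Only the source identifier `camera_frame_104` comes from the text; the field names, the 45-second age, and the 10-second limit (borrowed from Code Fragment 1) are illustrative assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class MemoryRecord:
    fact: str
    memory_type: str   # preference, world_state, task_progress, or explanation
    source: str        # evidence source for the fact
    age_s: float       # seconds since the fact was last observed
    max_age_s: float   # assumed freshness limit for this type of fact
    verified: bool     # whether perception has reconfirmed the fact

    def allows_direct_action(self) -> bool:
        """A fact may drive an action only if it is both verified and fresh."""
        return self.verified and self.age_s <= self.max_age_s


record = MemoryRecord(
    fact="red_mug_is_on_counter",
    memory_type="world_state",
    source="camera_frame_104",
    age_s=45.0,
    max_age_s=10.0,
    verified=False,
)

report = asdict(record)
report["direct_action_allowed"] = record.allows_direct_action()
print(report)
```

The printed record carries its source and age alongside the fact, and `direct_action_allowed` is `False` because the fact is unverified and older than its limit.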

The expected output is a provenance-rich memory record that blocks direct action because the scene fact is too old. This is exactly the kind of trace you want before calling a behavior a hallucination, since the deeper mechanism is often stale world state rather than fabricated semantics.

Code Fragment 2: This memory record is useful because it keeps provenance and freshness visible. The planner can see that the fact came from `camera_frame_104` and is too old for direct execution, which is a much sharper diagnosis than the generic label 'hallucination.'

When memory-rich agents fail, check whether the wrong fact was retrieved, whether the fact was stale, or whether the verifier failed to challenge it. Those paths lead to very different architectural fixes.
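The three checks above can be written as a small triage function over a logged failure trace. This is a sketch under assumptions: the trace fields (`retrieved_fact`, `needed_fact`, `age_s`, `max_age_s`, `verifier_result`, `fact_true_now`) and the label strings are hypothetical, and a real log would need whatever ground truth is available after the fact.

```python
def triage(trace):
    """Return every failure path that the trace supports.

    `verifier_result` is True (passed), False (challenged), or None (never ran).
    """
    labels = []
    if trace["retrieved_fact"] != trace["needed_fact"]:
        labels.append("wrong_fact_retrieved")
    if trace["age_s"] > trace["max_age_s"]:
        labels.append("stale_fact")
    # The fact was false at action time, yet nothing challenged it.
    if not trace["fact_true_now"] and trace["verifier_result"] is not False:
        labels.append("verifier_did_not_challenge")
    return labels or ["unexplained"]


trace = {
    "retrieved_fact": "red_mug_is_on_counter",
    "needed_fact": "red_mug_is_on_counter",
    "age_s": 45.0,
    "max_age_s": 10.0,
    "verifier_result": None,   # the verifier never ran
    "fact_true_now": False,    # another agent had moved the mug
}
print(triage(trace))  # ['stale_fact', 'verifier_did_not_challenge']
```

Returning a list rather than a single label reflects that these paths can co-occur: here retrieval was correct, but the fact was stale and no verifier challenged it, which points toward freshness gating and verifier coverage rather than retrieval quality.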

Key Takeaway

Embodied memory is valuable only when it behaves like a state-estimation aid rather than an untyped bag of text.

Exercise 33.7.1

Design a memory schema for an embodied assistant that stores both user preferences and object locations. Include the fields needed to keep one durable and the other freshness-limited.

Bibliography and Further Reading

Primary Sources and Tools

Wang et al. (2025). "EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models as Embodied Agents." arXiv.

EmbodiedBench is useful for evaluating long-horizon embodied tasks where memory and replanning matter.


LangGraph Documentation.

LangGraph is a practical reference for explicit stateful agent memory rather than opaque prompt concatenation.


GTSAM Documentation.

GTSAM is a classical reference for state-estimation discipline, useful here as a conceptual comparison for how embodied memory should treat uncertainty and updates.
