"I found the relevant memory only after the planner stopped asking for nostalgia and started asking for actions."
A Retrieval Index With Boundaries
Memory retrieval for planning should be judged by the quality of the plan it produces, not by whether the returned memory looks familiar or linguistically plausible.
The best retrieved memory is the one that changes the next action for the better under current constraints. Similarity is only a proposal signal; the planner still needs utility, freshness, and risk terms before the memory becomes actionable.
Theory
Planning-aware retrieval scores candidate memories by their value to the current decision:
$$s(m;q,g) = \alpha \, \mathrm{sim}(m,q) + \beta \, \mathrm{utility}(m,g) - \gamma \, \mathrm{staleness}(m) - \delta \, \mathrm{risk}(m).$$
High embedding similarity is not enough. A retrieved memory may be semantically close but unsafe for the current embodiment, stale for the current scene, or irrelevant to the current goal horizon.
In practice retrieval is usually two-stage. A vector index such as FAISS, ScaNN, or pgvector proposes candidates, then a planner-aware reranker checks embodiment tags, horizon compatibility, state constraints, and expected control value. That reranker is where memory search becomes part of planning rather than a generic nearest-neighbor service.
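The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not a reference implementation: a brute-force cosine search stands in for the vector index (FAISS, ScaNN, or pgvector would take the place of `propose`), and the memory fields `vec`, `embodiment`, `utility`, `staleness`, and `risk`, the memory ids, and the default weights are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def propose(query_vec, memories, k):
    """Stage 1: top-k candidates by similarity only (stand-in for a vector index)."""
    scored = [(cosine(query_vec, m["vec"]), m) for m in memories]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

def rerank(proposals, embodiment, weights=(1.0, 1.5, 1.0, 1.0)):
    """Stage 2: drop embodiment mismatches, then apply the planning-aware score."""
    a, b, c, d = weights
    kept = []
    for sim, m in proposals:
        if m["embodiment"] != embodiment:
            continue  # hard filter: memory was recorded on a different body
        s = a * sim + b * m["utility"] - c * m["staleness"] - d * m["risk"]
        kept.append((m["id"], s))
    return sorted(kept, key=lambda t: t[1], reverse=True)

# Hypothetical memory store with hand-assigned metadata.
memories = [
    {"id": "kitchen_scene", "vec": [1.0, 0.0], "embodiment": "mobile_manipulator",
     "utility": 0.2, "staleness": 0.1, "risk": 0.1},
    {"id": "blocked_hallway", "vec": [0.8, 0.6], "embodiment": "mobile_manipulator",
     "utility": 0.9, "staleness": 0.0, "risk": 0.1},
    {"id": "drone_flyover", "vec": [0.9, 0.1], "embodiment": "quadrotor",
     "utility": 0.9, "staleness": 0.0, "risk": 0.1},
]

proposals = propose([1.0, 0.0], memories, k=3)
ranked = rerank(proposals, embodiment="mobile_manipulator")
print([mid for mid, _ in ranked])
```

In this toy store the nearest neighbor by similarity is `kitchen_scene`, yet the reranker places `blocked_hallway` first and discards `drone_flyover` outright because it was recorded on a different embodiment. Treating embodiment as a hard filter rather than a soft penalty is a design choice of the sketch; a soft penalty would be equally easy to express through the risk term.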
Retrieval for planning therefore sits between information retrieval and control. The planner needs memories that are not only relevant, but also actionable within current kinematic, temporal, and safety constraints.
Worked Example
A mobile manipulator carrying a tray should prefer memories about blocked hallways and stable carrying postures over memories about semantically similar kitchen scenes that do not affect the route or the controller.
candidates = [
    {"id": "m1", "similarity": 0.89, "utility": 0.30, "staleness": 0.05, "risk": 0.10},
    {"id": "m2", "similarity": 0.75, "utility": 0.92, "staleness": 0.02, "risk": 0.08},
]

def score(x, a=1.0, b=1.5, c=1.0, d=1.0):
    # Weighted similarity and utility, minus staleness and risk penalties.
    return a * x["similarity"] + b * x["utility"] - c * x["staleness"] - d * x["risk"]

# Rank memory ids by rounded score, highest first.
ranked = sorted(((m["id"], round(score(m), 3)) for m in candidates), key=lambda t: t[1], reverse=True)
print(ranked)
# Output: [('m2', 2.03), ('m1', 1.19)]

The output shows why m2 should be preferred despite lower raw similarity. With b = 1.5, m2 scores 0.75 + 1.5(0.92) - 0.02 - 0.08 = 2.03, while m1 scores 0.89 + 1.5(0.30) - 0.05 - 0.10 = 1.19. The planner should value task utility, freshness, and safety more than textual or visual familiarity.
A vector search engine can supply top-k candidates, but planning-aware reranking usually needs custom metadata filters and a model-side utility score. Keep the raw retrieval score, the reranked score, the chosen memory id, and the planner delta in one trace. Without that trace, the team can tell that retrieval changed behavior but cannot audit whether it improved the action choice or simply made the plan look more plausible, and later audits cannot explain why the planner trusted what it trusted.
- Generate a retrieval query from the current goal, state, and action horizon.
- Filter candidates by embodiment, scene, and task-phase metadata.
- Re-rank by expected planning utility, freshness, and risk.
- Attach the chosen memory to the planner output for later audit.
- Log whether retrieval changed the chosen plan and whether that change helped.
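The steps above can be sketched as a single function that returns both the plan and an audit trace. Everything here is hypothetical scaffolding: `make_query`, `toy_planner`, the `phase` and `tags` fields, the route names, and the third candidate are invented for illustration, and the score weights are simply reused from the worked example.

```python
def score(m, a=1.0, b=1.5, c=1.0, d=1.0):
    """Planning-aware score: similarity and utility minus staleness and risk."""
    return a * m["similarity"] + b * m["utility"] - c * m["staleness"] - d * m["risk"]

def make_query(goal, state, horizon_s):
    """Step 1: build the retrieval query from goal, state, and action horizon."""
    return {"goal": goal, "embodiment": state["embodiment"],
            "phase": state["phase"], "horizon_s": horizon_s}

def toy_planner(query, memory):
    """Stand-in planner: detours only if the memory reports a blockage."""
    if memory and "blocked" in memory["tags"]:
        return "detour_via_corridor_b"
    return "direct_route"

def retrieve_for_planning(query, candidates, plan_fn):
    # Step 2: metadata filter on embodiment and task phase.
    eligible = [m for m in candidates
                if m["embodiment"] == query["embodiment"] and m["phase"] == query["phase"]]
    # Step 3: rerank by the planning-aware score.
    ranked = sorted(eligible, key=score, reverse=True)
    chosen = ranked[0] if ranked else None
    baseline = plan_fn(query, None)   # plan without any memory
    plan = plan_fn(query, chosen)     # plan with the chosen memory
    # Steps 4 and 5: attach the memory and log whether it changed the plan.
    trace = {
        "query": query,
        "chosen_id": chosen["id"] if chosen else None,
        "raw_similarity": chosen["similarity"] if chosen else None,
        "reranked_score": score(chosen) if chosen else None,
        "plan": plan,
        "plan_changed": plan != baseline,
        "helped": None,  # unknown until the outcome is observed
    }
    return plan, trace

candidates = [
    {"id": "m1", "similarity": 0.89, "utility": 0.30, "staleness": 0.05, "risk": 0.10,
     "embodiment": "mobile_manipulator", "phase": "transport", "tags": ["kitchen"]},
    {"id": "m2", "similarity": 0.75, "utility": 0.92, "staleness": 0.02, "risk": 0.08,
     "embodiment": "mobile_manipulator", "phase": "transport", "tags": ["blocked"]},
    {"id": "m3", "similarity": 0.95, "utility": 0.95, "staleness": 0.01, "risk": 0.05,
     "embodiment": "quadrotor", "phase": "transport", "tags": ["blocked"]},
]

state = {"embodiment": "mobile_manipulator", "phase": "transport"}
query = make_query("deliver tray", state, horizon_s=30)
plan, trace = retrieve_for_planning(query, candidates, toy_planner)
print(plan, trace["chosen_id"], trace["plan_changed"])
```

Planning once without memory and once with it is what makes `plan_changed` measurable. The `helped` field is deliberately left empty, since whether the change was an improvement can only be recorded after execution.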
A planner can become overconfident in retrieved episodes that are visually similar but dynamically mismatched. Scene resemblance is not the same as action-transfer validity.
A warehouse robot retrieving a deadlock episode should condition on aisle width, current traffic pattern, and payload type. An episode from a wider aisle with no pallet load may be a poor planning guide even if the deadlock geometry looks similar.
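One way to encode this conditioning is a transfer-validity check that runs before any scoring. The sketch below is an assumption-laden illustration: the field names `aisle_width_m`, `traffic`, and `payload`, the 0.3 m width tolerance, and the example episodes are all invented, and a real system would likely use richer dynamics descriptors.

```python
def transfer_valid(episode, current, width_tol_m=0.3):
    """Return (ok, reasons): does a stored episode plausibly transfer to the current state?"""
    reasons = []
    if abs(episode["aisle_width_m"] - current["aisle_width_m"]) > width_tol_m:
        reasons.append("aisle width differs")
    if episode["traffic"] != current["traffic"]:
        reasons.append("traffic pattern differs")
    if episode["payload"] != current["payload"]:
        reasons.append("payload type differs")
    return (not reasons, reasons)

# Hypothetical current state and two stored deadlock episodes.
current = {"aisle_width_m": 2.4, "traffic": "two_way", "payload": "pallet"}
wide_empty = {"aisle_width_m": 3.6, "traffic": "two_way", "payload": "none"}
close_match = {"aisle_width_m": 2.5, "traffic": "two_way", "payload": "pallet"}

print(transfer_valid(wide_empty, current))
print(transfer_valid(close_match, current))
```

Returning the list of mismatch reasons, rather than a bare boolean, keeps the rejection explainable when the episode is later reviewed.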
The evidence artifact for this section should be a retrieval decision card: original query, candidate memories, reranked scores, chosen memory, resulting plan change, and observed outcome. That card supports failure analysis when a memory looked relevant at retrieval time but later caused a bad plan.
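A decision card of this kind could be represented as a small record that serializes to one JSON line. The class name `RetrievalDecisionCard`, its field names, and the example values (including the query text, route names, and outcome string) are hypothetical and chosen only to mirror the six items listed above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RetrievalDecisionCard:
    query: str                               # original retrieval query
    candidates: list                         # candidate memory ids
    reranked_scores: dict                    # memory id -> reranked score
    chosen_memory: Optional[str]             # id of the memory handed to the planner
    plan_before: str                         # plan without the memory
    plan_after: str                          # plan with the memory
    observed_outcome: Optional[str] = None   # filled in after execution

    @property
    def plan_changed(self) -> bool:
        return self.plan_before != self.plan_after

# Illustrative values only.
card = RetrievalDecisionCard(
    query="carry tray to room 12, hallway A status unknown",
    candidates=["m1", "m2"],
    reranked_scores={"m1": 1.19, "m2": 2.03},
    chosen_memory="m2",
    plan_before="direct_route",
    plan_after="detour_via_corridor_b",
)
card.observed_outcome = "arrived without collision"

line = json.dumps(asdict(card), sort_keys=True)
print(line)
```

Storing the plan before and after retrieval in the same record lets the resulting plan change be derived rather than asserted, which is useful when a card is reopened during failure analysis.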
A central open problem is training retrieval systems from downstream control improvement instead of static similarity labels. The difficult part is credit assignment across time: which retrieved episode improved the plan, and under which embodiment or disturbance shift would that same episode become unsafe?
If the top retrieved memory changed the chosen plan, could you justify that choice in one line using utility, freshness, and risk? If not, the retrieval policy is still too opaque for deployment.
Can you write down one retrieval score term that improves planning utility and one term that protects safety? If not, the retrieval objective is still too close to plain similarity search.
Retrieval should be evaluated by decision quality. Similarity alone is not a safe planning criterion.
Define a retrieval score for a drone replanning around wind gusts. Include a similarity term, a utility term, and at least one safety or freshness penalty.
What's Next?
Next, continue with Section 56.4, where the focus shifts from useful memory to stale or unsafe memory.