A Careful Control Loop
Exploration under partial observability addresses the case where the agent cannot directly see the state it needs to reason about. A hallway can look identical from two places, an object can be hidden behind a cabinet door, and a risky contact state can be invisible until the robot moves.
This section builds on reward specification in Chapter 18: Reward Design and Goal Specification, reuses partial observability from Chapter 2: The Agent-Environment Interface, and prepares transfer testing in Chapter 20: Sim-to-Real Transfer.
This section develops the technical contract for exploration when observations are aliases of hidden states. The object of study is the belief state: the agent's maintained distribution or memory over what might be true behind the current sensor reading.
The key question is practical: what hidden variable matters, what observation can disambiguate it, and what evidence shows that the agent explored to reduce uncertainty rather than wandering through aliased views?
A belief representation earns its place when it changes the next action. In partial observability, the reader should keep asking whether the agent turns to inspect a landmark, revisits a checkpoint, stores a memory, or pauses because the hidden state is still ambiguous.
Theory
We can view the agent at time $t$ as receiving an observation $o_t$, maintaining a belief $b_t(s)$ over hidden states, choosing an action $a_t$, and updating the belief after observing $o_{t+1}$. The update is the core move: exploration should choose actions that make important hidden states easier to distinguish.
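For reference, the update described above is commonly written as a Bayes filter. The transition model $T$ and the observation model $O$ are notation assumed here for illustration; the section itself only names $o_t$, $b_t(s)$, and $a_t$.

```latex
b_{t+1}(s') =
  \frac{O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)}
       {\sum_{s''} O(o_{t+1} \mid s'', a_t) \sum_{s} T(s'' \mid s, a_t)\, b_t(s)}
```

When the hidden state does not change between steps, $T$ is the identity and the update reduces to multiplying the prior by the observation likelihood and normalizing.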
The practical design rule is to make aliasing explicit. The specification of inputs, outputs, assumptions, timing, and failure modes should name the hidden variable, the memory horizon, the observation that resolves ambiguity, and the diagnostic that detects belief collapse.
The mechanism is a sequence of transformations: observe, update belief, score information-gathering actions, execute an active-sensing move, and check whether uncertainty fell. Each transformation should have a measurable contract, otherwise a recurrent policy can appear competent while storing the wrong history.
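One way to make the "score information-gathering actions" step concrete is expected entropy reduction. The sketch below is a minimal illustration under stated assumptions: each candidate action yields a binary cue (seen or not seen), and the action names and cue probabilities are hypothetical numbers chosen for the example, not values from a real sensor model.

```python
import math

def entropy(belief):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def posterior(belief, likelihood):
    """Bayes update: multiply the prior by a per-state likelihood, then normalize."""
    unnorm = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

def expected_entropy(belief, p_cue):
    """Expected posterior entropy for a binary cue.

    p_cue[s] is the assumed probability of seeing the cue in hidden state s.
    """
    p_seen = sum(belief[s] * p_cue[s] for s in belief)
    total = 0.0
    if p_seen > 0:
        total += p_seen * entropy(posterior(belief, p_cue))
    if p_seen < 1:
        p_miss = {s: 1 - p_cue[s] for s in belief}
        total += (1 - p_seen) * entropy(posterior(belief, p_miss))
    return total

belief = {"left_corridor": 0.5, "right_corridor": 0.5}

# Hypothetical sensing actions with illustrative cue models.
actions = {
    "inspect landmark": {"left_corridor": 0.85, "right_corridor": 0.20},
    "look at blank wall": {"left_corridor": 0.50, "right_corridor": 0.50},
}

# Expected information gain = current entropy minus expected posterior entropy.
gains = {name: entropy(belief) - expected_entropy(belief, model)
         for name, model in actions.items()}
best_action = max(gains, key=gains.get)
print(gains, best_action)
```

The blank wall has the same cue probability in both states, so its expected gain is zero. That is the measurable contract for this step: an action counts as active sensing only if its cue model differs across the hidden states that matter.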
Worked Example
Code Fragment 19.4.1 shows a tiny belief update for two visually aliased corridors. When the belief remains uncertain, the agent chooses an information-gathering action rather than pretending the observation fully identifies the state.
# Maintain a belief over two hidden states that share similar observations.
# High entropy triggers an active-sensing action instead of blind movement.
import math
belief = {"left_corridor": 0.5, "right_corridor": 0.5}
landmark_likelihood = {"left_corridor": 0.85, "right_corridor": 0.20}
for state in belief:
    belief[state] *= landmark_likelihood[state]
normalizer = sum(belief.values())
belief = {state: value / normalizer for state, value in belief.items()}
entropy = -sum(value * math.log2(value) for value in belief.values())
action = "inspect landmark" if entropy > 0.7 else "move to frontier"
print({state: round(value, 2) for state, value in belief.items()})
print("entropy", round(entropy, 2), "action", action)
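Expected output: the belief moves to roughly 0.81 for left_corridor and 0.19 for right_corridor. The entropy is about 0.702 bits, printed as 0.7 after rounding, which sits just above the 0.7 threshold, so the agent still chooses to inspect the landmark rather than move on. If the trace contains only the latest observation, the agent has no audit trail for partial observability.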
The from-scratch fragment is for understanding. In a practical system, use Gymnasium wrappers for observation masking, Habitat-Lab for embodied navigation with limited sensors, recurrent policy implementations when memory is required, and ROS 2 logs when hidden hardware state must be reconstructed after a run. These libraries remove boilerplate so that engineering attention goes to aliasing, memory, and active-sensing diagnostics.
Practical Recipe
- Name the hidden variable before choosing a model.
- Log the observation, belief or memory state, action, and disambiguating cue together.
- Build a memory-free baseline before adding recurrence or belief tracking.
- Record failures as structured cases: observation aliasing, memory loss, belief collapse, stale map, unsafe active sensing, or evaluation mismatch.
- Run at least one perturbation test that hides or corrupts the disambiguating cue.
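The perturbation item in the recipe above can be sketched as a small check on the belief tracker itself. This is a minimal sketch, assuming the same two-corridor setup as the worked example; the corrupted cue model (equal likelihood in both states) is a hypothetical stand-in for a hidden or damaged landmark.

```python
import math

def update(belief, likelihood):
    """Bayes update: multiply the prior by a per-state likelihood, then normalize."""
    unnorm = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

def entropy(belief):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def run_episode(cue_model, steps=3):
    """Apply the same cue observation `steps` times and return the entropy trace."""
    belief = {"left_corridor": 0.5, "right_corridor": 0.5}
    trace = [entropy(belief)]
    for _ in range(steps):
        belief = update(belief, cue_model)
        trace.append(entropy(belief))
    return trace

# Clean cue: the landmark is much more likely to be seen in the left corridor.
clean_cue = {"left_corridor": 0.85, "right_corridor": 0.20}
# Corrupted cue (assumption): the landmark is equally likely in both states.
corrupted_cue = {"left_corridor": 0.50, "right_corridor": 0.50}

clean_trace = run_episode(clean_cue)
corrupted_trace = run_episode(corrupted_cue)

print("clean:", [round(h, 3) for h in clean_trace])
print("corrupted:", [round(h, 3) for h in corrupted_trace])
```

With the clean cue, entropy should fall step by step. With the corrupted cue, it should stay at one bit. A tracker that grows confident under the corrupted cue is showing belief collapse without evidence, which is one of the structured failure cases listed above.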
The common mistake is to treat each observation as if it fully identifies the state. Under partial observability, two places or contact states can look the same, so a policy that ignores memory may repeat unsafe or uninformative actions.
A service robot team should log observation frames, belief summaries, memory resets, landmark checks, chosen actions, and whether the hidden-state estimate changed after active sensing. The logs reveal whether exploration reduced ambiguity or merely accumulated more aliased images.
Partial observability is where "I have seen this before" and "this looks like something I have seen before" become dangerously different sentences.
A core research frontier is exploration with learned memory: agents that decide when to store history, when to query a map, and when to take an information-gathering action. The hard part is proving that the memory helps under aliasing rather than only improving average reward in easy episodes.
Can you name the hidden variable, belief representation, disambiguating observation, active-sensing action, and most likely aliasing failure? If not, the partial-observability problem is still too vague.
The idea in this section becomes useful when it is tied to a closed-loop belief contract. In this chapter on Exploration in Embodied Worlds, the contract names the observation stream, hidden variable, memory representation, action representation, active-sensing move, and evaluation artifact. Without that contract, a recurrent policy can look capable while using history in a way nobody can diagnose.
The graduate-level habit is to separate three claims. The conceptual claim explains why memory or belief should help. The systems claim explains which state estimate changes before action. The evidence claim records whether aliasing errors fall under the same seed panel and perturbation suite.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Gymnasium | Observation masking tests | Use it to create memory-free and memory-enabled baselines under the same hidden-state contract. |
| Habitat-Lab | Embodied aliasing and landmarks | Use it when corridors, viewpoints, and map coverage make partial observability concrete. |
| ROS 2 | Sensor and state trace replay | Use it to reconstruct what the robot could observe, not what the debugger knows afterward. |
| MuJoCo | Hidden contact state | Use it when proprioception, contact, and actuator state create ambiguity that vision alone cannot resolve. |
| LeRobot | Memory behavior comparison | Use it to compare learned memory policies against demonstrations that include inspection and reorientation moves. |
A robust implementation starts with a tiny, inspectable belief trace and only then moves to a maintained recurrent learner or navigation simulator. The baseline should log observation, belief or memory state, action, hidden-state label if available in simulation, and the cue that resolved ambiguity. The library version should produce the same artifact schema, so the comparison is a same-task comparison rather than a story assembled from separate experiments.
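A shared artifact schema can be as small as one record per step. The sketch below is one possible layout, not a standard format: the class name `BeliefTraceStep`, its field names, and the two example rows are hypothetical and chosen only to mirror the fields listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional

@dataclass
class BeliefTraceStep:
    """One logged step of a belief trace (hypothetical schema)."""
    step: int
    observation: str
    belief: Dict[str, float]
    action: str
    hidden_state_label: Optional[str]  # only available in simulation
    resolving_cue: Optional[str]       # None until a cue disambiguates the state

# Illustrative rows, not measured data.
trace = [
    BeliefTraceStep(0, "grey hallway",
                    {"left_corridor": 0.5, "right_corridor": 0.5},
                    "inspect landmark", "left_corridor", None),
    BeliefTraceStep(1, "landmark seen",
                    {"left_corridor": 0.81, "right_corridor": 0.19},
                    "move to frontier", "left_corridor", "landmark"),
]

# Both the from-scratch baseline and a library-backed learner would emit this shape.
serialized = json.dumps([asdict(row) for row in trace], indent=2)
restored = json.loads(serialized)
print(serialized)
```

Because every row carries the same keys, a single evaluation script can read traces from either implementation without method-specific parsing.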
- Write a one-paragraph belief contract with hidden state, observation, memory, action, success, and failure fields.
- Start with the smallest simulator or wrapper that exposes aliased observations clearly.
- Run one deterministic smoke test and one cue-corruption perturbation before scaling.
- Save a single result artifact containing configuration, seed, belief traces, memory resets, metrics, and failure labels.
- Compare methods only when one script evaluates memory-free and memory-enabled policies on the same task panel.
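The last item in the checklist above can be illustrated with a toy task. This is a sketch under stated assumptions: the two-step episode, the observation strings, and both policies are hypothetical and exist only to show one script scoring a memory-free and a memory-enabled policy on the same seed panel.

```python
import random

def make_episode(seed):
    """Hypothetical task: a cue is visible at step 0, then an aliased junction."""
    rng = random.Random(seed)
    hidden = rng.choice(["left", "right"])
    return hidden, ["cue_" + hidden, "junction"]

def memory_free_policy(observation, memory):
    """Acts on the current observation only; the junction looks identical either way."""
    if observation == "junction":
        return "left", memory
    return "wait", memory

def memory_policy(observation, memory):
    """Stores the cue when it is visible and replays it at the junction."""
    if observation.startswith("cue_"):
        return "wait", observation[len("cue_"):]
    return memory, memory

def evaluate(policy, seeds):
    """Success rate of a policy over a fixed seed panel."""
    successes = 0
    for seed in seeds:
        hidden, observations = make_episode(seed)
        memory, action = None, None
        for obs in observations:
            action, memory = policy(obs, memory)
        successes += int(action == hidden)
    return successes / len(seeds)

seeds = list(range(200))  # the same panel for both policies
results = {
    "memory_free": evaluate(memory_free_policy, seeds),
    "memory_enabled": evaluate(memory_policy, seeds),
}
print(results)
```

The memory-enabled policy succeeds on every episode by construction, while the memory-free policy can only succeed when the hidden state happens to match its fixed guess. The point of the sketch is the shared `seeds` list and the single `evaluate` function, which keep the comparison on the same task panel.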
When partial-observability exploration fails, avoid labeling the whole method as weak. First assign the failure to observation aliasing, memory horizon, belief update, active-sensing choice, stale map, timing, or evaluation. Then rerun one controlled perturbation that isolates the suspected cause.
For exploration under partial observability, compare only metrics that measure the same construct and were computed in a single pass on a single configuration: same environment panel, same policy checkpoint, same seed set, same hidden-state labels where simulation provides them, same aliasing perturbation, and the same success definition. Save reward, coverage, belief entropy, memory resets, active-sensing actions, and failure labels in one artifact so every number in a later table is backed by the same run.
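One way to enforce this rule is to fingerprint the configuration and refuse to compare artifacts whose fingerprints differ. The sketch below is a minimal illustration; the helper names, configuration keys, and metric values are hypothetical placeholders, not measured results.

```python
import hashlib
import json

def config_fingerprint(config):
    """Short, stable hash of the evaluation configuration."""
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def build_artifact(config, metrics):
    """Bundle configuration, fingerprint, and all metrics into one record."""
    return {
        "config": config,
        "fingerprint": config_fingerprint(config),
        "metrics": metrics,
    }

def comparable(a, b):
    """Two artifacts may share a table only if their configurations match."""
    return a["fingerprint"] == b["fingerprint"]

base_config = {
    "environment_panel": ["aliased_corridor"],
    "checkpoint": "policy_a",
    "seeds": [0, 1, 2],
    "perturbation": "cue_removed",
    "success_definition": "reached_goal",
}

# Placeholder values for illustration only.
metrics = {
    "reward": 1.0,
    "coverage": 0.6,
    "belief_entropy_bits": 0.7,
    "memory_resets": 0,
    "active_sensing_actions": 2,
    "failure_labels": [],
}

run_a = build_artifact(dict(base_config), metrics)
run_b = build_artifact(dict(base_config), metrics)
run_c = build_artifact({**base_config, "seeds": [3, 4, 5]}, metrics)

print(comparable(run_a, run_b), comparable(run_a, run_c))
```

Changing only the seed set changes the fingerprint, so `run_c` is flagged as not comparable with `run_a` even though every other field matches.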
Exploration under partial observability succeeds when memory and active sensing reduce consequential uncertainty, not when a recurrent model merely raises average reward on easy episodes.
Design a partial-observability experiment in simulation. Specify the hidden variable, observation alias, belief or memory representation, active-sensing action, success metric, and one perturbation that removes a disambiguating cue.
What's Next?
This section turned partial-observability exploration into a testable belief contract: define the hidden state, update memory, save one comparable artifact, and diagnose failure by aliasing source. Next, return to Chapter 19 to connect reset cost, intrinsic motivation, safety, and belief-aware exploration into one embodied diagnostics panel.
This is the foundational POMDP reference for belief-state decision making. Use it here to connect active sensing and exploration to explicit uncertainty over hidden states.
Bellemare, M. G. et al. (2016). Unifying count-based exploration and intrinsic motivation. NeurIPS.
The paper connects pseudo-counts to intrinsic rewards in high-dimensional spaces. Under partial observability, it raises the question of whether the count belongs to an observation, a belief, or a memory state.
Pathak, D. et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML.
The Intrinsic Curiosity Module rewards forward-model prediction error in a learned feature space. Use it here to ask whether prediction error reflects hidden-state uncertainty or nuisance variation.
Burda, Y. et al. (2018). Exploration by Random Network Distillation. arXiv.
RND is a practical intrinsic reward method based on prediction error. In aliased environments, its error signal should be interpreted beside belief entropy and disambiguating actions.
DD-PPO connects exploration to distributed simulation and navigation evaluation. It is useful here because navigation agents often face viewpoint aliasing, hidden map structure, and memory-dependent recovery.
Habitat-Lab provides embodied navigation and interaction environments. Use it to test landmark checks, map memory, cue removal, and active sensing under a reproducible seed panel.