"Fusion is only useful when it changes what the robot decides next."
A Fusion Engineer's Whiteboard
Combining vision and touch is an estimation and control problem, not just a neural-architecture choice. The modalities differ in range, field of view, latency, and the kinds of uncertainty they expose.
This section covers early, late, and state-level fusion between vision and tactile sensing, with attention to contact onset, occlusion, uncertainty, and action-conditioned confidence.
It ties the chapter together by treating multimodal sensing as a concrete control interface, rather than resting on the general claim that more modalities must be better.
Vision and touch should not always vote equally. The right fusion policy changes with distance to contact, occlusion level, slip risk, and what state variable the controller needs right now.
Theory
Vision offers global context and long-range target localization, while touch offers precise local contact evidence after interaction begins. Fusion should therefore be conditioned on phase and uncertainty rather than forced into a static weighted average.
A useful formulation maintains a latent state with modality-specific observation models. The controller then updates its belief differently before contact, at contact onset, and during sustained manipulation.
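$$ b_{t+1}(s) \propto p(o_{t+1}^{\mathrm{v}}\mid s)\,p(o_{t+1}^{\mathrm{tac}}\mid s)\,\sum_{s'} p(s\mid s', a_t)\,b_t(s'),\qquad \alpha_t = f(\text{contact}_t, \sigma_{\mathrm{v}}, \sigma_{\mathrm{tac}}) $$
Here $b_t$ is the belief over the latent state $s$, and $o^{\mathrm{v}}$ and $o^{\mathrm{tac}}$ are the vision and tactile observations, assumed conditionally independent given $s$. The gating weight $\alpha_t$ is computed from the contact phase and the per-modality uncertainties $\sigma_{\mathrm{v}}$ and $\sigma_{\mathrm{tac}}$.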
The system predicts state from vision before contact, shifts weight toward touch as local interaction begins, and exposes a fused belief to the policy or controller. The decisive engineering choice is the gating logic that decides when each modality should dominate.
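The belief update can be sketched for a small discrete state space. This is a minimal illustration, not a reference implementation: the three-state peg offset, the transition table, and every likelihood value are hypothetical numbers chosen so the effect is easy to see. A flat likelihood stands in for a modality that carries no information in the current phase.

```python
# Minimal discrete Bayes filter with two modality-specific observation models.
# All states and probability tables are hypothetical, chosen for illustration.

STATES = ["left", "centered", "right"]  # lateral peg offset relative to the hole


def predict(belief, transition):
    """Motion update: sum over s' of p(s | s', a) * b(s')."""
    n = len(belief)
    return [
        sum(transition[s_prev][s] * belief[s_prev] for s_prev in range(n))
        for s in range(n)
    ]


def correct(predicted, vision_lik, touch_lik):
    """Measurement update: multiply by both likelihoods, then normalize."""
    unnorm = [p * lv * lt for p, lv, lt in zip(predicted, vision_lik, touch_lik)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# TRANSITION[s_prev][s] for a hypothetical "hold position" action.
TRANSITION = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]

belief = [1 / 3, 1 / 3, 1 / 3]

# Before contact: vision is informative, touch is flat (no contact signal yet).
belief = correct(predict(belief, TRANSITION), [0.2, 0.6, 0.2], [1.0, 1.0, 1.0])

# After contact: occluded vision is flat, touch is sharp.
belief = correct(predict(belief, TRANSITION), [1.0, 1.0, 1.0], [0.05, 0.9, 0.05])

print({s: round(b, 3) for s, b in zip(STATES, belief)})
```

With these tables the belief concentrates on the centered state after the tactile update, even though vision contributed nothing in that step. The same code handles both phases; only the likelihood tables change.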
- Define which state variables are better observed by vision and which by touch.
- Switch or reweight modalities based on contact phase and uncertainty.
- Expose the fused belief, not the raw modalities alone, to the downstream controller where possible.
- Audit failure cases where one modality confidently disagrees with the other.
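One possible form for the reweighting step in the list above is sketched below. The function name `touch_weight`, its parameters, and the choice of inverse-variance weighting during contact are all assumptions made for illustration; a learned gate or a smoother transition at contact onset would be equally valid.

```python
def touch_weight(contact: bool, sigma_vision: float, sigma_touch: float,
                 no_contact_weight: float = 0.0) -> float:
    """Hypothetical gate: weight given to touch, in [0, 1].

    Before contact, touch carries no information about the target, so it
    receives a fixed small weight. During contact, the weight follows the
    inverse-variance rule, so the less uncertain modality dominates.
    """
    if not contact:
        return no_contact_weight
    var_v = sigma_vision ** 2
    var_t = sigma_touch ** 2
    return var_v / (var_v + var_t)


# Free-space approach: vision dominates.
print(round(touch_weight(False, 0.35, 0.12), 3))
# In contact with degraded vision: touch dominates.
print(round(touch_weight(True, 0.35, 0.12), 3))
# In contact but with an unusually sharp view: vision keeps most of the weight.
print(round(touch_weight(True, 0.05, 0.12), 3))
```

The third call is the case worth auditing: contact alone does not mean touch should win, which is why the gate takes the uncertainties as inputs rather than switching on contact only.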
Worked Example
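# Reweight vision and touch after contact onset.
# The weights are hand-picked for illustration, not derived from the sigmas.
vision_sigma = 0.35   # standard deviation of the vision estimate
tactile_sigma = 0.12  # standard deviation of the tactile estimate
contact = True
touch_weight = 0.7 if contact else 0.2
vision_weight = round(1.0 - touch_weight, 3)  # rounding avoids 0.30000000000000004
# Heuristic summary: a weighted blend of the two sigmas, not a propagated variance.
fused_uncertainty = round(vision_weight * vision_sigma + touch_weight * tactile_sigma, 3)
print({"vision_weight": vision_weight, "touch_weight": touch_weight, "fused_uncertainty": fused_uncertainty})
Expected output: {'vision_weight': 0.3, 'touch_weight': 0.7, 'fused_uncertainty': 0.189}. Trust shifts toward touch after contact, which is appropriate when local contact cues become more reliable than vision for the state variable the controller now needs.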
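The worked example summarizes uncertainty as a weighted blend of the two sigmas. If the fused estimate is the weighted average of the two modality estimates and their errors are independent (both assumptions, not stated in the example), the standard deviation propagates in quadrature instead. The sketch below compares the two numbers for the same weights.

```python
import math

vision_sigma = 0.35
tactile_sigma = 0.12
vision_weight, touch_weight = 0.3, 0.7

# Heuristic from the worked example: linear blend of the sigmas.
linear_blend = vision_weight * vision_sigma + touch_weight * tactile_sigma

# Standard deviation of the weighted average, assuming independent errors.
propagated = math.sqrt(
    (vision_weight * vision_sigma) ** 2 + (touch_weight * tactile_sigma) ** 2
)

print(round(linear_blend, 3), round(propagated, 3))  # 0.189 0.134
```

The blend is a convenient monitoring number, but it overstates the uncertainty of the averaged estimate under these assumptions. If a downstream controller consumes the value as a standard deviation, the propagated form is the safer choice.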
ROS 2 message synchronizers, tactile libraries, and multimodal encoders make data transport manageable. The difficult part is still designing the phase-aware trust logic and auditing disagreement cases.
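To make the transport step concrete, here is a library-free sketch of approximate timestamp pairing. It is not the ROS 2 implementation; in a ROS 2 system the `message_filters` package's `ApproximateTimeSynchronizer` plays a comparable role. The function name `pair_by_timestamp`, the sample rates, and the 5 ms tolerance are hypothetical.

```python
from bisect import bisect_left


def pair_by_timestamp(vision_stamps, tactile_stamps, slop):
    """Pair each vision stamp with the nearest tactile stamp within `slop` seconds.

    Both lists must be sorted in ascending order. Vision frames with no
    tactile sample inside the tolerance are dropped rather than paired badly.
    """
    pairs = []
    for tv in vision_stamps:
        i = bisect_left(tactile_stamps, tv)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = tactile_stamps[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda tt: abs(tt - tv))
        if abs(nearest - tv) <= slop:
            pairs.append((tv, nearest))
    return pairs


vision_stamps = [0.000, 0.033, 0.066, 0.100]                  # roughly 30 Hz camera
tactile_stamps = [0.001, 0.011, 0.021, 0.031, 0.041, 0.090]   # tactile stream with a dropout
pairs = pair_by_timestamp(vision_stamps, tactile_stamps, slop=0.005)
print(pairs)
```

The last two vision frames are dropped because the tactile stream has a gap. Logging how many frames are dropped per episode is a cheap way to catch synchronization problems before they look like fusion failures.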
Practical Recipe
- Write down which state variables each modality should dominate before building the fusion model.
- Synchronize timestamps tightly so disagreements are interpretable.
- Use explicit contact-phase gates or learned uncertainty estimates rather than static equal weighting.
- Create disagreement episodes where one modality is wrong and the other is right.
- Evaluate fusion with task metrics and disagreement analysis, not only latent-space visualizations.
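The disagreement analysis in the recipe needs a working definition of "confident disagreement". One simple option, assumed here for illustration, is a normalized gap between the two estimates. The episode records, field names, and the threshold of 3 are hypothetical.

```python
import math


def disagreement_z(x_vision, sigma_vision, x_touch, sigma_touch):
    """Gap between the two estimates in units of their combined standard deviation."""
    return abs(x_vision - x_touch) / math.sqrt(sigma_vision ** 2 + sigma_touch ** 2)


# Hypothetical logged estimates of the same scalar state (for example, offset in mm).
episodes = [
    {"id": "ep0", "x_v": 1.0, "sigma_v": 0.5, "x_t": 1.2, "sigma_t": 0.2},
    {"id": "ep1", "x_v": 4.0, "sigma_v": 0.3, "x_t": 0.5, "sigma_t": 0.2},
    {"id": "ep2", "x_v": 2.0, "sigma_v": 3.0, "x_t": 0.0, "sigma_t": 0.2},
]

Z_THRESHOLD = 3.0  # assumed cutoff; tune on held-out data

flagged = [
    e["id"]
    for e in episodes
    if disagreement_z(e["x_v"], e["sigma_v"], e["x_t"], e["sigma_t"]) > Z_THRESHOLD
]
print(flagged)
```

Only `ep1` is flagged. In `ep2` the estimates are far apart, but vision reports high uncertainty, so the gap is not a confident disagreement and ordinary fusion can absorb it.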
Equal-weight fusion is often lazy engineering. When one modality is uninformative or stale, averaging can be worse than trusting the better source decisively.
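A short variance calculation makes the point concrete. Assume the two estimates are unbiased with independent errors, and write $\sigma_{\mathrm{v}} = 0.35$ and $\sigma_{\mathrm{tac}} = 0.12$ for the vision and tactile standard deviations (illustrative values). The equal-weight average then has standard deviation

$$ \sigma_{\text{avg}} = \tfrac{1}{2}\sqrt{\sigma_{\mathrm{v}}^2 + \sigma_{\mathrm{tac}}^2} = \tfrac{1}{2}\sqrt{0.1225 + 0.0144} = 0.185 \;>\; \sigma_{\mathrm{tac}} = 0.12 $$

Under these assumptions, averaging in the noisier vision estimate makes the result worse than using touch alone.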
In peg insertion, vision places the peg near the hole, while touch takes over for local alignment and slip-free seating once the peg starts interacting with the rim.
Vision and touch are like two strong opinions at a meeting. The trick is not to average them politely; it is to know which one has actually seen the problem from the right distance.
Current work explores learned fusion gates, world models with tactile state, and cross-modal retrieval. The enduring systems question is still whether fusion improves action quality on disagreement-heavy episodes.
When vision and touch disagree in your task, which modality should win, and what evidence supports that choice?
This section is where the chapter's pieces finally connect. The fusion problem is about sensing, control phase, uncertainty, and action consequences all at once, which makes it a compact summary of embodied-system thinking.
A good advanced exercise is to compare early fusion, late fusion, and belief-state fusion on the same task. Students quickly see that the architecture question is inseparable from the control-phase question.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| ROS 2 synchronization tools | Timestamp alignment | Use them to keep multimodal episodes temporally coherent. |
| PyTouch | Tactile features | Useful for constructing tactile state estimates that can be fused with vision. |
| Belief-state or sequence models | Phase-aware fusion | Prefer them when uncertainty and contact phase change the best action materially. |
Build a simple fusion gate that changes modality weights at contact onset. Compare it with static equal weighting on a small insertion or slip-detection benchmark.
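A starting point for this exercise is sketched below as a synthetic one-dimensional comparison. It is a toy under stated assumptions, not a benchmark: the noise levels per phase, the gate weights, and the function name `simulate` are hypothetical, and pre-contact touch is modeled as a very noisy reading to stand in for "no signal".

```python
import math
import random


def simulate(n_trials=2000, seed=0):
    """Return RMSE of equal-weight and contact-gated fusion on synthetic data."""
    rng = random.Random(seed)
    # (contact, vision_sigma, touch_sigma): hypothetical noise levels per phase.
    phases = [(False, 0.3, 5.0), (True, 1.0, 0.1)]
    sq_err = {"equal": 0.0, "gated": 0.0}
    count = 0
    for _ in range(n_trials):
        true_x = rng.uniform(-1.0, 1.0)
        for contact, sigma_v, sigma_t in phases:
            x_v = true_x + rng.gauss(0.0, sigma_v)
            x_t = true_x + rng.gauss(0.0, sigma_t)
            # Static baseline: both modalities always vote equally.
            equal = 0.5 * x_v + 0.5 * x_t
            # Gate: ignore touch before contact, lean on it afterwards.
            w_touch = 0.9 if contact else 0.0
            gated = (1.0 - w_touch) * x_v + w_touch * x_t
            sq_err["equal"] += (equal - true_x) ** 2
            sq_err["gated"] += (gated - true_x) ** 2
            count += 1
    return {name: math.sqrt(total / count) for name, total in sq_err.items()}


rmse = simulate()
print({name: round(value, 3) for name, value in rmse.items()})
```

The gap between the two errors here comes entirely from the assumed noise levels, so the useful next step is to replace the synthetic phases with logged insertion or slip episodes and check whether the gate still wins on the disagreement-heavy ones.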
When fusion fails, inspect timestamp alignment, phase gating, and disagreement handling before changing the neural backbone. Many multimodal failures are systems bugs wearing a representation-learning costume.
Combining vision and touch works when modality trust shifts with contact phase, uncertainty, and the actual state variable the controller needs.
Design a disagreement benchmark in which vision is misleading but touch is informative, and a second benchmark with the opposite property. Explain how your fusion logic should react in each.