Section 44.5: Combining vision and touch | Building Embodied AI: From Perception to Autonomous Action

"Fusion is only useful when it changes what the robot decides next."
A Fusion Engineer's Whiteboard

Big Picture

Combining vision and touch is an estimation and control problem, not just a neural-architecture choice. The modalities differ in range, field of view, latency, and the kinds of uncertainty they expose.

This section covers early, late, and state-level fusion between vision and tactile sensing, with attention to contact onset, occlusion, uncertainty, and action-conditioned confidence.

It synthesizes the whole chapter by turning multimodal sensing into a real control interface rather than a general claim that more modalities must be better.

Action Is The Test

Vision and touch should not always vote equally. The right fusion policy changes with distance to contact, occlusion level, slip risk, and what state variable the controller needs right now.

Figure 44.5.1: Good vision-touch fusion is state-dependent: vision dominates before contact, touch often dominates during local interaction, and the controller should know when to hand over trust.

Theory

Vision offers global context and long-range target localization, while touch offers precise local contact evidence after interaction begins. Fusion should therefore be conditioned on phase and uncertainty rather than forced into a static weighted average.

A useful formulation maintains a latent state with modality-specific observation models. The controller then updates its belief differently before contact, at contact onset, and during sustained manipulation.

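$$ b_{t+1}(s) \propto p(o_{t+1}^{v}\mid s)\,p(o_{t+1}^{t}\mid s)\,\sum_{s'} p(s\mid s', a_t)\,b_t(s'),\qquad \alpha_t = f(\text{contact}, \sigma_v, \sigma_t) $$

Here $b_t$ is the belief over the latent state $s$, $o^{v}$ and $o^{t}$ are the vision and touch observations (the superscript $t$ marks touch; the subscript marks time), $a_t$ is the action, and $\alpha_t$ is the gating weight computed from the contact state and the modality uncertainties $\sigma_v$ and $\sigma_t$.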
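The belief update can be made concrete with a small discrete example. The following is a minimal sketch, assuming three hypothetical peg states and made-up transition and likelihood tables; none of these numbers come from a real sensor, and the function name `belief_update` is illustrative only.

```python
# Minimal discrete belief update with two observation models.
# All states, tables, and names here are illustrative assumptions.

def belief_update(belief, transition, vision_lik, touch_lik):
    """One filter step: predict with the motion model, then correct with both modalities.

    belief:      belief[s_prev] = current belief over n states
    transition:  transition[s_prev][s_next] = p(s_next | s_prev, action)
    vision_lik:  vision_lik[s] = p(vision observation | s)
    touch_lik:   touch_lik[s]  = p(touch observation | s)
    """
    n = len(belief)
    # Prediction: push the old belief through the transition model.
    predicted = [sum(transition[sp][s] * belief[sp] for sp in range(n)) for s in range(n)]
    # Correction: multiply by both modality likelihoods.
    unnorm = [vision_lik[s] * touch_lik[s] * predicted[s] for s in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observations are impossible under the predicted belief")
    return [u / total for u in unnorm]

# Hypothetical states: 0 = left of hole, 1 = aligned, 2 = right of hole.
belief = [1 / 3, 1 / 3, 1 / 3]
transition = [[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
vision_lik = [0.4, 0.4, 0.2]  # partly occluded camera: weakly informative
touch_lik = [0.1, 0.8, 0.1]   # contact signature: strongly favors "aligned"

posterior = belief_update(belief, transition, vision_lik, touch_lik)
print([round(p, 3) for p in posterior])
```

In this toy setting the touch likelihood does most of the work: vision cannot separate "left" from "aligned", so the posterior concentrates on the aligned state only because the tactile model is sharp there.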

Mechanism

The system predicts state from vision before contact, shifts weight toward touch as local interaction begins, and exposes a fused belief to the policy or controller. The decisive engineering choice is the gating logic that decides when each modality should dominate.

Algorithm: Contact-Phase Fusion Gate

Define which state variables are better observed by vision and which by touch.
Switch or reweight modalities based on contact phase and uncertainty.
Expose the fused belief, not the raw modalities alone, to the downstream controller where possible.
Audit failure cases where one modality confidently disagrees with the other.
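One possible shape for the gate in the steps above is sketched here. It swaps in inverse-variance weighting after contact, which is an assumption of this sketch rather than the chapter's prescribed rule; the function names `touch_gate` and `fuse` and the `floor` parameter are hypothetical.

```python
def touch_gate(contact, sigma_v, sigma_t, floor=0.05):
    """Hypothetical gate: return the weight on touch, in [floor, 1 - floor]."""
    if not contact:
        # Before contact the tactile signal says little about the state,
        # so touch keeps only a small floor weight.
        return floor
    # After contact, give touch its inverse-variance share of the total precision.
    precision_v = 1.0 / sigma_v ** 2
    precision_t = 1.0 / sigma_t ** 2
    alpha = precision_t / (precision_v + precision_t)
    # Clamp so neither modality is ever fully silenced.
    return min(1.0 - floor, max(floor, alpha))


def fuse(x_v, x_t, alpha):
    """Convex combination of the vision estimate x_v and the touch estimate x_t."""
    return (1.0 - alpha) * x_v + alpha * x_t


alpha_free = touch_gate(False, sigma_v=0.35, sigma_t=0.12)
alpha_contact = touch_gate(True, sigma_v=0.35, sigma_t=0.12)
fused = fuse(10.0, 12.0, alpha_contact)
print(alpha_free, round(alpha_contact, 3), round(fused, 3))
```

The clamp is a design choice: it keeps a small amount of the weaker modality in the estimate so that a confident disagreement still shows up in the fused value and can be audited.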

Worked Example

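# Reweight vision and touch after contact onset.
# Illustrative per-modality uncertainties (standard deviations, same units).
vision_sigma = 0.35
tactile_sigma = 0.12
contact = True

# Phase gate: trust touch more once contact has begun.
touch_weight = 0.7 if contact else 0.2
# Round to avoid floating-point residue such as 0.30000000000000004.
vision_weight = round(1.0 - touch_weight, 3)
# Heuristic weighted average of the two sigmas, not a formal variance propagation.
fused_uncertainty = round(vision_weight * vision_sigma + touch_weight * tactile_sigma, 3)
print({"vision_weight": vision_weight, "touch_weight": touch_weight, "fused_uncertainty": fused_uncertainty})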

{'vision_weight': 0.3, 'touch_weight': 0.7, 'fused_uncertainty': 0.189}

Code Fragment 44.5.1 demonstrates the most important multimodal lesson in contact-rich control: fusion weights should change when the physics of the task changes.

Expected output: the weights shift trust toward touch after contact (0.7 for touch, 0.3 for vision). That is appropriate when local contact cues become more reliable than vision for the state variable the controller now needs.

Library Shortcut

ROS 2 message synchronizers, tactile libraries, and multimodal encoders make data transport manageable. The difficult part is still designing the phase-aware trust logic and auditing disagreement cases.

Practical Recipe

Write down which state variables each modality should dominate before building the fusion model.
Synchronize timestamps tightly so disagreements are interpretable.
Use explicit contact-phase gates or learned uncertainty estimates rather than static equal weighting.
Create disagreement episodes where one modality is wrong and the other is right.
Evaluate fusion with task metrics and disagreement analysis, not only latent-space visualizations.
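The timestamp step in the recipe can be illustrated without any middleware. The sketch below is a plain-Python stand-in for approximate-time pairing, not the ROS 2 API; the function name `pair_by_timestamp`, the integer-millisecond stamps, and the 4 ms skew limit are all assumptions made for the example.

```python
import bisect


def pair_by_timestamp(vision_stamps, touch_stamps, max_skew):
    """Pair each vision stamp with its nearest touch stamp.

    Both lists must be sorted ascending. Pairs whose time difference
    exceeds max_skew are dropped rather than fused.
    """
    pairs = []
    for tv in vision_stamps:
        i = bisect.bisect_left(touch_stamps, tv)
        # The nearest touch stamp is either just before or just at/after tv.
        candidates = touch_stamps[max(0, i - 1): i + 1]
        if not candidates:
            continue
        tt = min(candidates, key=lambda t: abs(t - tv))
        if abs(tt - tv) <= max_skew:
            pairs.append((tv, tt))
    return pairs


# Hypothetical stamps in integer milliseconds (camera ~30 Hz, tactile ~100 Hz with gaps).
vision_ms = [0, 33, 66, 200]
touch_ms = [1, 11, 21, 31, 41, 61, 71]
pairs = pair_by_timestamp(vision_ms, touch_ms, max_skew=4)
print(pairs)
```

Dropping badly skewed pairs, instead of fusing them anyway, keeps later disagreement analysis interpretable: a flagged disagreement then reflects the sensors and not a stale sample.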

Common Failure Mode

Equal-weight fusion is often lazy engineering. When one modality is uninformative or stale, averaging can be worse than trusting the better source decisively.
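A short calculation shows why. The sketch below assumes the two estimates are unbiased with independent Gaussian errors, which real sensors only approximate, and reuses the illustrative sigmas 0.35 and 0.12; the function names are hypothetical.

```python
import math


def equal_weight_sigma(sigma_v, sigma_t):
    """Standard deviation of the plain average of two independent estimates."""
    return 0.5 * math.sqrt(sigma_v ** 2 + sigma_t ** 2)


def inverse_variance_sigma(sigma_v, sigma_t):
    """Standard deviation of the inverse-variance weighted combination."""
    return math.sqrt(1.0 / (1.0 / sigma_v ** 2 + 1.0 / sigma_t ** 2))


sigma_v, sigma_t = 0.35, 0.12
best_single = min(sigma_v, sigma_t)
equal_sigma = equal_weight_sigma(sigma_v, sigma_t)
ivw_sigma = inverse_variance_sigma(sigma_v, sigma_t)

print(round(equal_sigma, 4), best_single, round(ivw_sigma, 4))
```

Under these assumptions the equal-weight average is noisier than touch alone, while the uncertainty-weighted combination is slightly better than either source. The gap widens as one modality becomes staler or more occluded.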

Practical Example

In peg insertion, vision places the peg near the hole, while touch takes over for local alignment and slip-free seating once the peg starts interacting with the rim.

Memory Hook

Vision and touch are like two strong opinions at a meeting. The trick is not to average them politely but to know which one has actually seen the problem from the right distance.

Research Frontier

Current work explores learned fusion gates, world models with tactile state, and cross-modal retrieval. The enduring systems question is still whether fusion improves action quality on disagreement-heavy episodes.

Self Check

When vision and touch disagree in your task, which modality should win, and what evidence supports that choice?

This section is where the chapter's pieces finally connect. The fusion problem is about sensing, control phase, uncertainty, and action consequences all at once, which makes it a compact summary of embodied-system thinking.

A good advanced exercise is to compare early fusion, late fusion, and belief-state fusion on the same task. Students quickly see that the architecture question is inseparable from the control-phase question.

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
ROS 2 synchronization tools	Timestamp alignment	Use them to keep multimodal episodes temporally coherent.
PyTouch	Tactile features	Useful for constructing tactile state estimates that can be fused with vision.
Belief-state or sequence models	Phase-aware fusion	Prefer them when uncertainty and contact phase change the best action materially.

Mini Lab

Build a simple fusion gate that changes modality weights at contact onset. Compare it with static equal weighting on a small insertion or slip-detection benchmark.
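A possible starting point for this lab is the synthetic comparison below. It is a sketch under strong assumptions: a one-dimensional state fixed at zero, Gaussian noise, vision degrading after contact (as if occluded by the gripper), and touch being nearly uninformative before contact. The noise levels are invented; the 0.7 and 0.2 touch weights are taken from the worked example.

```python
import math
import random


def rmse(errors):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def run_trial(rng, steps=2000, contact_step=1000):
    """Simulate one episode and return (equal_rmse, gated_rmse)."""
    true_x = 0.0
    equal_err, gated_err = [], []
    for k in range(steps):
        contact = k >= contact_step
        # Assumed noise model: vision is good before contact and occluded after;
        # touch is nearly uninformative before contact and sharp after.
        sigma_v = 1.0 if contact else 0.35
        sigma_t = 0.12 if contact else 5.0
        x_v = true_x + rng.gauss(0.0, sigma_v)
        x_t = true_x + rng.gauss(0.0, sigma_t)

        # Baseline: static equal weighting.
        equal = 0.5 * x_v + 0.5 * x_t
        # Contact-phase gate with the worked example's weights.
        w_t = 0.7 if contact else 0.2
        gated = (1.0 - w_t) * x_v + w_t * x_t

        equal_err.append(equal - true_x)
        gated_err.append(gated - true_x)
    return rmse(equal_err), rmse(gated_err)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
equal_rmse, gated_rmse = run_trial(rng)
print({"equal_rmse": round(equal_rmse, 3), "gated_rmse": round(gated_rmse, 3)})
```

For a real benchmark, replace the synthetic noise with logged insertion or slip episodes and report a task metric such as success rate alongside the estimation error.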

When fusion fails, inspect timestamp alignment, phase gating, and disagreement handling before changing the neural backbone. Many multimodal failures are systems bugs wearing a representation-learning costume.
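Disagreement handling is easier to inspect when confident disagreements are flagged explicitly. The sketch below scores each paired reading by how many combined standard deviations separate the two estimates. The threshold of 3.0, the episode records, and the function names are assumptions for illustration.

```python
import math


def disagreement_score(x_v, sigma_v, x_t, sigma_t):
    """Gap between the two estimates in units of their combined standard deviation."""
    return abs(x_v - x_t) / math.sqrt(sigma_v ** 2 + sigma_t ** 2)


def audit(episodes, threshold=3.0):
    """Return the ids of episodes where the modalities confidently disagree."""
    flagged = []
    for ep in episodes:
        score = disagreement_score(ep["x_v"], ep["sigma_v"], ep["x_t"], ep["sigma_t"])
        if score > threshold:
            flagged.append(ep["id"])
    return flagged


# Hypothetical logged readings (same units for positions and sigmas).
episodes = [
    {"id": "a", "x_v": 0.10, "sigma_v": 0.35, "x_t": 0.05, "sigma_t": 0.12},
    {"id": "b", "x_v": 2.00, "sigma_v": 0.35, "x_t": 0.00, "sigma_t": 0.12},
    {"id": "c", "x_v": 0.50, "sigma_v": 0.35, "x_t": 0.00, "sigma_t": 0.12},
]
flagged = audit(episodes)
print(flagged)
```

A flagged episode does not say which modality is wrong. It only marks the cases worth replaying, where timestamp skew, a mis-timed phase gate, or an overconfident sigma can be checked before the model is blamed.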

Section References

PyTouch

Open tactile-learning library relevant for tactile feature extraction and fusion experiments.

DIGIT

Representative optical tactile hardware often fused with vision.

NeuralFeels

Visuo-tactile object-state inference project illustrating cross-modal fusion for manipulation.

Key Takeaway

Combining vision and touch works when modality trust shifts with contact phase, uncertainty, and the actual state variable the controller needs.

Exercise 44.5.1

Design a disagreement benchmark in which vision is misleading but touch is informative, and a second benchmark with the opposite property. Explain how your fusion logic should react in each.