Section 40.2: I-JEPA and V-JEPA | Building Embodied AI: From Perception to Autonomous Action

"A still image can tell you what exists; a video can tell you what is about to matter."
A Temporal Representation Learner

**Figure 40.2A**: I-JEPA learns semantic structure from a single image, while V-JEPA extends the same latent-prediction idea across time. (Illustration: a split scene comparing a masked image patch and a masked video clip, with the video side showing motion arrows and temporal cues that the image side cannot express.)

Big Picture

I-JEPA and V-JEPA share the same philosophy but solve different prediction problems. I-JEPA learns from spatial context inside one image; V-JEPA must preserve motion, temporal causality, and action-relevant persistence across frames.

Why The Image Case Is Not Enough

I-JEPA is already a strong test of semantic prediction because the model cannot succeed by copying pixels. But a robot does not act in static images. It acts in a world where occlusion clears, objects move, and contact depends on temporal continuity. A representation that is excellent at image semantics can still fail to preserve the motion cues needed for action timing or object tracking.

V-JEPA extends the JEPA objective from images to video. The context is now a spatiotemporal block of frames, and the target is a masked future segment or another withheld region of the same clip. The central question is no longer "what is behind the mask?" but "what latent future is consistent with the observed motion and scene dynamics?"

I-JEPA Versus V-JEPA

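The image formulation can be written as context-to-target latent regression over 2D patches. V-JEPA keeps the same latent-loss structure but swaps the input domain to video clips. In both losses, $f_\theta$ encodes the context ($x_c$ for an image, $v_c$ for a clip), $g_\theta$ predicts the latent of target block $k$ from the mask specification $m_k$, and $f_\xi$ is the target encoder whose output passes through the stop-gradient $\operatorname{sg}$: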

$$ \mathcal{L}_{\text{I-JEPA}} = \sum_{k=1}^{K}\left\lVert g_\theta(f_\theta(x_c), m_k) - \operatorname{sg}(f_\xi(x_{t,k})) \right\rVert_2^2 $$

$$ \mathcal{L}_{\text{V-JEPA}} = \sum_{k=1}^{K}\left\lVert g_\theta(f_\theta(v_c), m_k, \Delta t_k) - \operatorname{sg}(f_\xi(v_{t,k})) \right\rVert_2^2 $$

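The extra temporal index $\Delta t_k$ matters. In video, the model must preserve object identity and temporal evolution: velocity, contact onset, object permanence under occlusion, and the difference between a transient appearance change and a real state change. (The published V-JEPA objective regresses targets with an $L_1$ distance; the squared $L_2$ form is kept here so the two losses can be read side by side.)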
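To make the role of $\Delta t_k$ concrete, the following toy sketch evaluates the summed squared latent error for a predictor that ignores the time offset and for one that uses it. Everything here is an illustrative assumption: the 2-D latents, the constant drift `velocity`, and the linear extrapolation standing in for $g_\theta$ are hypothetical, and the target latents are plain constants playing the role of $\operatorname{sg}(f_\xi(v_{t,k}))$.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def latent_loss(predictions: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Sum of squared L2 errors over the K target blocks."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per target block")
    return sum(squared_l2(p, t) for p, t in zip(predictions, targets))


# Toy 2-D latent of an object drifting by 0.5 per frame (made-up numbers).
context_latent = [1.0, 0.0]
velocity = [0.5, 0.0]
time_offsets = [1, 2, 3]  # Delta t_k for K = 3 target blocks

# Target-encoder outputs, treated as constants (the stop-gradient side).
targets = [[1.0 + 0.5 * dt, 0.0] for dt in time_offsets]

# Predictor that ignores Delta t_k: it can only repeat the context latent.
static_preds = [list(context_latent) for _ in time_offsets]

# Predictor conditioned on Delta t_k: hypothetical linear extrapolation.
temporal_preds = [
    [c + v * dt for c, v in zip(context_latent, velocity)]
    for dt in time_offsets
]

static_loss = latent_loss(static_preds, targets)
temporal_loss = latent_loss(temporal_preds, targets)
print(static_loss, temporal_loss)  # 3.5 0.0
```

In this toy setup the time-blind predictor accumulates error as the offset grows, while the time-conditioned one matches every target. Real predictors are learned networks, so the gap will rarely be this clean.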

What V-JEPA Adds

V-JEPA is not "I-JEPA plus more frames." It changes the latent invariances that matter. A good video representation must remain stable under appearance noise while still being sensitive to dynamic events that change the action plan.

Worked Shape Probe

Code Fragment 1 shows the bookkeeping difference between the image and video settings. The tensor shapes are simple, but they make the temporal burden visible.

# Compare the bookkeeping load in image JEPA and video JEPA.
# The video case adds time, which changes what a target mask means
# and what information the predictor must preserve.
import math

image_tokens = (14, 14, 768)        # (height, width, embed_dim)
video_tokens = (16, 14, 14, 768)    # (time, height, width, embed_dim)

ijepa_context = (10, 10, 768)
ijepa_target = (4, 4, 768)
vjepa_context = (8, 10, 10, 768)
vjepa_target = (4, 4, 4, 768)

# Target volume = number of tokens to predict (drop the embedding axis).
print({
    "image_tokens": image_tokens,
    "video_tokens": video_tokens,
    "ijepa_target_volume": math.prod(ijepa_target[:-1]),
    "vjepa_target_volume": math.prod(vjepa_target[:-1]),
})

{'image_tokens': (14, 14, 768), 'video_tokens': (16, 14, 14, 768), 'ijepa_target_volume': 16, 'vjepa_target_volume': 64}

Code Fragment 1: This small shape probe makes the jump from image JEPA to video JEPA concrete. The temporal axis multiplies the target volume, which means the model must preserve dynamic information rather than only static semantics. In practice, that is why masking and context design become even more consequential in the video setting.

The output shows that the video target volume is four times larger than the image one (64 tokens versus 16). That is the first hint that V-JEPA must solve a harder abstraction problem: more latent content, more possible futures, and stronger pressure to learn motion-aware features.
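The same geometry can be checked at the token-index level. The sketch below is a minimal illustration, not any library's masking routine: it enumerates the flat token indices covered by one axis-aligned target block. The helper name `block_indices` and the block positions are assumptions chosen for the example.

```python
from itertools import product
from typing import List, Sequence


def block_indices(grid: Sequence[int], start: Sequence[int], size: Sequence[int]) -> List[int]:
    """Return row-major flat indices of an axis-aligned block in a token grid."""
    if not (len(grid) == len(start) == len(size)):
        raise ValueError("grid, start, and size must have the same number of axes")
    for dim, lo, extent in zip(grid, start, size):
        if lo < 0 or extent <= 0 or lo + extent > dim:
            raise ValueError("block must lie inside the grid")

    # Row-major strides: the last axis varies fastest.
    strides = []
    stride = 1
    for dim in reversed(grid):
        strides.append(stride)
        stride *= dim
    strides.reverse()

    axis_ranges = [range(lo, lo + extent) for lo, extent in zip(start, size)]
    return [
        sum(coord * st for coord, st in zip(coords, strides))
        for coords in product(*axis_ranges)
    ]


# Image grid (H, W) = (14, 14) with a 4x4 target block (position is arbitrary).
image_target = block_indices((14, 14), start=(5, 5), size=(4, 4))

# Video grid (T, H, W) = (16, 14, 14) with a 4x4x4 target block.
video_target = block_indices((16, 14, 14), start=(6, 5, 5), size=(4, 4, 4))

print(len(image_target), len(video_target))  # 16 64
```

Each frame spanned by the video block contributes its own 4x4 patch of indices, so the predictor has to produce latents that stay consistent across the four time steps, not only within one frame.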

Library Shortcut

This probe takes about a dozen lines. A maintained PyTorch implementation does the same shape handling in a few tensor operations while also managing batching and mixed precision. The point of writing the tiny version first is to make it obvious that "video JEPA" means a different target geometry, not just a larger dataset.

When Each One Helps

Choosing Between I-JEPA And V-JEPA

| Setting | I-JEPA strength | V-JEPA strength |
| --- | --- | --- |
| Static object ranking | Strong semantics with cheaper training | Often unnecessary unless motion context matters |
| Action anticipation | Weak, temporal cues are missing | Captures evolving intent and scene dynamics |
| Occlusion-heavy manipulation | Can encode object identity but misses temporal persistence | Better for tracking hidden objects through time |
| Robot video pretraining before planning | Useful initializer | Better aligned with downstream rollout prediction |

This is the main didactic lesson of the section: I-JEPA and V-JEPA are not competitors so much as different levels of abstraction. Use I-JEPA when you need robust spatial semantics and the downstream task is mostly snapshot-based. Use V-JEPA when the downstream policy depends on motion history or on predicting what remains true across a short temporal window.

Practical Example

A mobile manipulator that must grab a moving bin from a conveyor can use I-JEPA features to recognize the bin category, but that does not tell the arm where the handle will be 400 milliseconds later. V-JEPA-style features can encode the drift direction and the persistence of the handle under partial occlusion, which is exactly the signal a short-horizon controller needs.
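A back-of-the-envelope calculation shows why the timing matters. The numbers below are hypothetical (no belt speed, camera rate, or gripper tolerance is specified for this example); they only illustrate how far a handle can drift over a 400 millisecond horizon.

```python
# All constants are illustrative assumptions, not measurements.
BELT_SPEED_M_PER_S = 0.25      # hypothetical conveyor speed
HORIZON_MS = 400               # prediction horizon from the example
CAMERA_FPS = 30                # hypothetical camera frame rate
GRIPPER_TOLERANCE_M = 0.02     # hypothetical allowable grasp offset


def displacement_m(speed_m_per_s: float, horizon_ms: int) -> float:
    """Distance travelled at constant speed over the horizon."""
    return speed_m_per_s * horizon_ms / 1000.0


def frames_in_horizon(fps: int, horizon_ms: int) -> int:
    """Number of whole frames captured during the horizon."""
    return fps * horizon_ms // 1000


drift = displacement_m(BELT_SPEED_M_PER_S, HORIZON_MS)
frames = frames_in_horizon(CAMERA_FPS, HORIZON_MS)
needs_motion_cue = drift > GRIPPER_TOLERANCE_M

print(f"drift={drift:.2f} m over {frames} frames; motion cue needed: {needs_motion_cue}")
```

Under these assumed values the handle moves about 10 cm across 12 frames, several times the assumed grasp tolerance, so a feature that only says "this is a bin" cannot place the gripper on its own.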

What To Measure In Transfer

The right transfer test is not only linear probing accuracy. For I-JEPA, useful downstream probes include depth, object counting, and pose-sensitive retrieval. For V-JEPA, add action anticipation, temporal ordering, state-change detection, and short-horizon planning support. If the video representation does not outperform the image one on a motion-sensitive probe, you may be paying the video-training bill without buying temporal structure.

Algorithm: Transfer Audit

1. Freeze the encoder checkpoint.
2. Run one static-semantic probe and one temporal probe on the same validation split.
3. Compare I-JEPA and V-JEPA under the same head architecture.
4. Promote V-JEPA only if the temporal probe improves enough to justify the extra training and inference cost.
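The four steps can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the `ProbeScores` container, the `min_temporal_gain` threshold, and the example scores are hypothetical, and a real audit would also account for variance across seeds and for the measured training and inference cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeScores:
    """Scores from frozen-encoder probes on one shared validation split."""
    static: float     # static-semantic probe metric (higher is better)
    temporal: float   # temporal probe metric (higher is better)


def audit_transfer(image: ProbeScores, video: ProbeScores, min_temporal_gain: float) -> dict:
    """Compare matched probes and decide whether to promote the video encoder.

    `min_temporal_gain` is a hypothetical stand-in for "enough to justify
    the extra training and inference cost"; choose it per project.
    """
    if min_temporal_gain < 0:
        raise ValueError("min_temporal_gain must be non-negative")
    temporal_gain = video.temporal - image.temporal
    static_gain = video.static - image.static
    promote = temporal_gain >= min_temporal_gain
    return {
        "temporal_gain": temporal_gain,
        "static_gain": static_gain,
        "winner": "vjepa" if promote else "ijepa",
    }


# Made-up scores for illustration only.
ijepa = ProbeScores(static=0.75, temporal=0.5)
vjepa = ProbeScores(static=0.75, temporal=0.75)
result = audit_transfer(ijepa, vjepa, min_temporal_gain=0.125)
print(result["winner"])  # vjepa
```

Keeping the static gain in the report is a deliberate choice: it makes a regression on snapshot semantics visible even when the temporal probe alone would justify promotion.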

Evaluation Contract

Code Fragment 2 below records the minimum contract for an I-JEPA versus V-JEPA transfer comparison.

# Record a matched transfer experiment for image and video JEPA.
# The same downstream head and split keep the comparison fair,
# so any gain can be attributed to temporal representation quality.
from dataclasses import asdict, dataclass

@dataclass
class TransferAudit:
    image_encoder: str
    video_encoder: str
    probe_task: str
    split: str
    metric: str
    accepted_winner: str

    def as_row(self) -> dict[str, object]:
        return asdict(self)

audit = TransferAudit(
    image_encoder="ijepa_vith",
    video_encoder="vjepa_vitl",
    probe_task="short_horizon_action_anticipation",
    split="held_out_conveyor_sequences",
    metric="top5_future_action_recall",
    accepted_winner="pending",
)
print(audit.as_row())

{'image_encoder': 'ijepa_vith', 'video_encoder': 'vjepa_vitl', 'probe_task': 'short_horizon_action_anticipation', 'split': 'held_out_conveyor_sequences', 'metric': 'top5_future_action_recall', 'accepted_winner': 'pending'}

Code Fragment 2: This transfer audit keeps the I-JEPA and V-JEPA comparison construct-matched. The crucial fields are the shared probe task and held-out split, because otherwise the "video helps" claim can collapse into a comparison between different heads or different data slices.

The expected output is a record with `accepted_winner` still marked `pending`. That is healthy. You should not pre-declare V-JEPA as the winner until it proves that temporal pretraining improves the exact motion-sensitive behavior you care about.

Common Failure Mode

A common mistake is to assume that more temporal data automatically yields a better control representation. In practice, a weak masking policy or a downstream task with little temporal content can leave V-JEPA more expensive to train and run while adding little over I-JEPA.
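One inexpensive way to check whether a downstream task carries temporal content at all is to perturb frame order and see whether the probe score moves. The sketch below reverses each clip. It is an assumed diagnostic for illustration, not a procedure taken from the JEPA papers, and `score_fn`, the toy clips, and both toy scorers are hypothetical.

```python
from typing import Callable, List, Sequence

Clip = Sequence[float]  # toy clip: one scalar "feature" per frame


def order_sensitivity(score_fn: Callable[[List[Clip]], float], clips: List[Clip]) -> float:
    """Score drop when every clip is played backwards.

    A value near zero suggests the probe (or task) barely depends on
    temporal order, so video pretraining has little to add there.
    """
    reversed_clips = [list(reversed(clip)) for clip in clips]
    return score_fn(clips) - score_fn(reversed_clips)


def moving_right_accuracy(clips: List[Clip]) -> float:
    """Toy temporal probe: fraction of clips whose feature increases over time."""
    return sum(clip[-1] > clip[0] for clip in clips) / len(clips)


def bright_accuracy(clips: List[Clip]) -> float:
    """Toy static probe: fraction of clips whose mean feature exceeds 1.0."""
    return sum(sum(clip) / len(clip) > 1.0 for clip in clips) / len(clips)


clips = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
print(order_sensitivity(moving_right_accuracy, clips))  # 1.0
print(order_sensitivity(bright_accuracy, clips))        # 0.0
```

If a real probe behaves like the second scorer, the task is effectively snapshot-based and the cheaper image encoder is the natural baseline.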

Research Frontier

Current JEPA research is asking whether video pretraining can produce representations with enough intuitive physics to support planning and action anticipation without dense task labels. The emerging evidence is promising, but the bar for embodied systems is higher: the latent space must survive contact, occlusion, and intervention-heavy rollouts, not just benchmark classification.

Cross-Reference Thread

Return to Section 40.1 for the core JEPA loss. Jump forward to Section 40.3 for the action-conditioned extension in V-JEPA 2. For motion-sensitive control policies, compare this representational route with the direct action-generation route in Chapter 22.

Self Check

Can you name one downstream task where I-JEPA is probably sufficient and one where V-JEPA should win? Can you justify the answer in terms of temporal information rather than model size alone?

I-JEPA is often the cheaper semantic initializer. V-JEPA is the better candidate when the downstream task depends on motion continuity, anticipatory state estimation, or latent prediction under occlusion. A strong engineering pattern is to start with the image baseline, then justify the move to video with one matched temporal benchmark and one closed-loop rollout task.

Memory Hook

I-JEPA is a strong snapshot memory. V-JEPA starts acting like a short movie memory with consequences.

Key Takeaway

I-JEPA and V-JEPA share the same latent-prediction philosophy, but V-JEPA earns its cost only when temporal information changes the downstream decision. The correct comparison is not image versus video in the abstract; it is static semantics versus motion-aware control value.

Exercise 40.2.1

Design a matched probe suite that would fairly compare I-JEPA and V-JEPA for a bin-picking robot. Include one static task, one temporal task, the shared downstream head, and the acceptance rule for promoting the video representation.

Bibliography & Further Reading

Primary References And Tools

LeCun, Y. "A Path Towards Autonomous Machine Intelligence." (2022). https://openreview.net/forum?id=BZ5a1r-kVsf

This position paper frames JEPA as a path toward predictive abstract representations. It gives the conceptual motivation for predicting in representation space rather than reconstructing every sensory detail.

Assran, M. et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." (2023). https://arxiv.org/abs/2301.08243

I-JEPA is the image-based foundation for the joint-embedding predictive idea. It is useful for understanding masking, target encoders, and representation prediction before moving to video.

Bardes, A. et al. "V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video." (2024). https://arxiv.org/abs/2404.08471

V-JEPA extends JEPA-style prediction to video. It grounds the chapter's distinction between predicting latent features and reconstructing pixel-level futures.

Meta AI. "Introducing the V-JEPA 2 World Model and New Benchmarks." (2025). https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/

The official V-JEPA 2 release discusses video-trained world models, benchmarks, and zero-shot robot-control claims. The chapter treats these as important frontier claims that need task-level verification.

Assran, M. et al. "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning." (2025). https://arxiv.org/abs/2506.09985

The V-JEPA 2 paper connects self-supervised video pretraining with action-conditioned latent planning. It is the central technical reference for this chapter's JEPA-to-control bridge.