JEPA Overview: Joint Embedding Predictive Architecture

JEPA (Joint Embedding Predictive Architecture) is a self-supervised learning paradigm that learns by predicting in embedding space rather than reconstructing in pixel/data space.

Key insight: Instead of generating outputs (like VAE/diffusion), JEPA predicts the latent representation of targets, enabling efficient learning of world models without expensive reconstruction.


What is JEPA?

The Core Idea

Traditional generative models (VAE, diffusion):

Input → Encoder → Latent → Decoder → Reconstruction
                              Loss: ||x - x̂||²

JEPA:

Input x → Encoder → z_x
                 Predictor → ẑ_y
Target y → Encoder → z_y
              Loss: ||z_y - ẑ_y||²

Key difference: Predict embeddings, not pixels/counts.
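
A minimal sketch of one training step, with toy MLP stand-ins for the encoder and predictor (all names and sizes here are illustrative assumptions); the EMA target encoder is the anti-collapse mechanism used by I-JEPA/V-JEPA, while VICReg (described below) is an alternative:

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

in_dim, embed_dim = 20000, 256   # e.g. genes → embedding (toy sizes)

# Stand-in context encoder and predictor; real models would be deeper.
encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, 512), nn.GELU(), nn.Linear(512, embed_dim))
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

# Target encoder: an EMA copy, never updated by gradients (I-JEPA/V-JEPA style).
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_step(x, y, ema=0.996):
    z_x = encoder(x)                       # context embedding z_x
    z_y_hat = predictor(z_x)               # predicted target embedding ẑ_y
    with torch.no_grad():
        z_y = target_encoder(y)            # target embedding z_y (no gradient)
    loss = F.mse_loss(z_y_hat, z_y)        # ||z_y - ẑ_y||² in embedding space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                  # EMA update of the target encoder
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema).add_(p, alpha=1 - ema)
    return loss.item()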

Why This Matters

1. Computational efficiency

  • No expensive decoder
  • Prediction in low-dimensional latent space
  • Faster training, less memory

2. Better representations

  • Focus on semantic content, not pixel details
  • Invariant to nuisance factors
  • More robust to noise

3. World modeling

  • Learn dynamics without generation
  • Predict future states efficiently
  • Enable planning and reasoning

4. No contrastive negatives

  • Unlike SimCLR, MoCo, CLIP
  • No need to sample negative pairs
  • Simpler training objective

JEPA vs Other Approaches

Comparison Table

| Method               | What's Predicted | Loss Type   | Negatives?     | Decoder?        |
|----------------------|------------------|-------------|----------------|-----------------|
| VAE                  | Reconstruction   | Pixel-level | No             | Yes (expensive) |
| Diffusion            | Noise/velocity   | Pixel-level | No             | Yes (expensive) |
| Contrastive (SimCLR) | Similarity       | Embedding   | Yes (required) | No              |
| JEPA                 | Embedding        | Embedding   | No             | No              |

Visual Comparison

Generative (VAE/Diffusion):

Learn: p(x|z) — "How to generate x from z"
Goal: Reconstruct/generate realistic samples
Cost: High (decoder, pixel-level loss)

Contrastive (SimCLR/MoCo):

Learn: Similarity in embedding space
Goal: Pull positives together, push negatives apart
Cost: Moderate (need negative sampling)

JEPA:

Learn: Predict z_y from z_x
Goal: Model relationships in embedding space
Cost: Low (no decoder, no negatives)


Key Components

1. Encoder

Maps inputs to embeddings:

$$
z = f_\theta(x)
$$

Shared across all inputs (images, videos, gene expression, etc.)

2. Predictor

Predicts target embedding from context:

$$
\hat{z}_y = g_\phi(z_x, \text{context})
$$

Context can be:

  • Time (predict future from past)
  • Perturbation (predict perturbed state from baseline)
  • Masked regions (predict masked from visible)
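
A minimal sketch of such a conditioned predictor g_φ, here just an MLP over the concatenated context embedding (module name and sizes are illustrative assumptions):

import torch
import torch.nn as nn

class ConditionedPredictor(nn.Module):
    """Predicts the target embedding from (z_x, context). Hypothetical toy module."""
    def __init__(self, embed_dim=256, context_dim=64, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + context_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, z_x, context):
        # context can encode time, a perturbation, or a mask position
        return self.net(torch.cat([z_x, context], dim=-1))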

3. VICReg Regularization

Prevents collapse via three terms:

Variance: Keep embeddings spread out

$$
\mathcal{L}_{\text{var}} = \sum_d \max(0, \gamma - \sqrt{\text{Var}(z_d) + \epsilon})
$$

Invariance: Predictions match targets

$$
\mathcal{L}_{\text{inv}} = \| z_y - \hat{z}_y \|^2
$$

Covariance: Decorrelate dimensions

$$
\mathcal{L}_{\text{cov}} = \sum_{i \neq j} \text{Cov}(z_i, z_j)^2
$$

Total loss:

$$
\mathcal{L} = \lambda_{\text{inv}} \mathcal{L}_{\text{inv}} + \lambda_{\text{var}} \mathcal{L}_{\text{var}} + \lambda_{\text{cov}} \mathcal{L}_{\text{cov}}
$$
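
A sketch of the three terms in PyTorch, following the formulas above. The default weights (λ_inv = λ_var = 25, λ_cov = 1, γ = 1) roughly follow the VICReg paper but should be tuned for a new domain:

import torch
import torch.nn.functional as F

def vicreg_loss(z_pred, z_target, lam_inv=25.0, lam_var=25.0, lam_cov=1.0, gamma=1.0, eps=1e-4):
    # Invariance: predictions match targets
    inv = F.mse_loss(z_pred, z_target)

    # Variance: keep each embedding dimension spread out across the batch
    def variance_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.mean(F.relu(gamma - std))

    # Covariance: penalize off-diagonal covariance to decorrelate dimensions
    def covariance_term(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    var = variance_term(z_pred) + variance_term(z_target)
    cov = covariance_term(z_pred) + covariance_term(z_target)
    return lam_inv * inv + lam_var * var + lam_cov * cov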


Why JEPA for Biology?

Problem with Generative Models in Biology

Reconstruction is often not the goal:

  • We don't need to generate realistic gene expression
  • We care about predictions (perturbations, trajectories)
  • Pixel-level accuracy is overkill

Example: Perturb-seq

  • Given: Baseline cell state + perturbation
  • Want: Predicted perturbed state
  • Don't need: Perfect reconstruction of all 20K genes

JEPA Advantages for Biology

1. Efficient perturbation modeling

Baseline expression → Encoder → z_baseline
Perturbation info → Predictor → ẑ_perturbed
Actual perturbed → Encoder → z_perturbed
                            Loss: ||z - ẑ||²

2. Natural for time-series

x_t → Encoder → z_t
             Predictor → ẑ_{t+1}
x_{t+1} → Encoder → z_{t+1}

3. Handles heterogeneity

  • Cell-level predictions (not population average)
  • Uncertainty in embedding space
  • Robust to technical noise

4. Compositional generalization

  • Learn perturbation operators
  • Combine multiple perturbations
  • Transfer across cell types
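
A hedged sketch of the compositional idea in point 4. It assumes trained encoder, predictor, and perturbation_encoder modules (hypothetical names), and assumes, as something to be validated rather than a given, that perturbation operators compose additively in embedding space:

# encoder, predictor, perturbation_encoder: trained JEPA modules (assumed)
def predict_combo(x_baseline, pert_a, pert_b, encoder, predictor, perturbation_encoder):
    z_base = encoder(x_baseline)
    # Compose two perturbation operators by summing their embeddings (assumption).
    z_pert = perturbation_encoder(pert_a) + perturbation_encoder(pert_b)
    return predictor(z_base, z_pert)   # predicted embedding of the doubly perturbed state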

Joint Latent Spaces: The Goku Insight

The Problem with Separate Models

Traditional approach:

  • Bulk RNA-seq model (static)
  • Time-series model (dynamic)
  • Perturb-seq model (perturbations)
  • Each has its own latent space

Issues:

  • No knowledge transfer
  • Can't combine modalities
  • Redundant learning

Joint Latent Space Solution

Key insight from Goku (ByteDance, 2024):

If two data types differ only by dimensionality or observation density, they probably want the same latent space.

For biology:

  • Bulk RNA-seq = "static image" (baseline state)
  • Time-series = "video" (temporal dynamics)
  • Perturb-seq = "video with interventions"
  • Single-cell snapshots = "variable-length clips"

All map to the same latent manifold:

Bulk RNA-seq ──┐
               ├──> Shared Encoder ──> Joint Latent Space
Time-series ───┤
Perturb-seq ───┘

Benefits:

  • Static data teaches spatial priors (cell types, pathways)
  • Dynamic data teaches temporal dynamics
  • Perturbation data teaches causal relationships
  • All inform the same representation
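
A minimal sketch of how different modalities can map onto one latent manifold: thin per-modality projections feeding a shared trunk. Module names, modalities, and sizes are illustrative assumptions:

import torch
import torch.nn as nn

class SharedLatentEncoder(nn.Module):
    """Modality-specific input projections feeding one shared trunk (hypothetical sketch)."""
    def __init__(self, n_genes=20000, embed_dim=256):
        super().__init__()
        # Each modality gets its own thin projection...
        self.project = nn.ModuleDict({
            "bulk": nn.Linear(n_genes, 512),
            "single_cell": nn.Linear(n_genes, 512),
        })
        # ...but the trunk, and therefore the latent space, is shared.
        self.trunk = nn.Sequential(nn.LayerNorm(512), nn.Linear(512, embed_dim))

    def forward(self, x, modality):
        return self.trunk(self.project[modality](x))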

JEPA Variants

I-JEPA (Image JEPA)

Task: Predict masked image regions in embedding space

Visible patches → Encoder → z_visible
                          Predictor → ẑ_masked
Masked patches → Encoder → z_masked

Key innovation: Masking in embedding space, not pixel space

V-JEPA (Video JEPA)

Task: Predict future video frames in embedding space

Past frames → Encoder → z_past
                     Predictor → ẑ_future
Future frames → Encoder → z_future

V-JEPA 2 (Meta, 2025): Adds planning capabilities

A-JEPA (Audio JEPA)

Task: Predict masked audio segments in embedding space

Bio-JEPA (Proposed)

Task: Predict perturbed/future cell states in embedding space

Baseline + perturbation → Predictor → ẑ_perturbed
Actual perturbed state → Encoder → z_perturbed

JEPA + Generative Models

Hybrid Architecture

JEPA alone doesn't generate samples — it predicts embeddings.

To generate, combine with generative model:

JEPA: x → z → Predictor → ẑ
Generative: ẑ → Decoder/Diffusion → x̂

Example: JEPA + Diffusion

  1. JEPA predicts perturbed embedding
  2. Diffusion generates samples from that embedding
  3. Get both prediction AND uncertainty quantification
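
A hedged sketch of step 2: a DDPM-style denoiser conditioned on the JEPA-predicted embedding ẑ, trained with the standard noise-prediction loss. The CondDenoiser module, dimensions, and noise schedule are all toy assumptions, not a specific published recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class CondDenoiser(nn.Module):
    """Predicts the noise added to x, conditioned on the JEPA-predicted embedding ẑ."""
    def __init__(self, x_dim=2000, z_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x_t, t, z_hat):
        t_feat = (t.float() / T).unsqueeze(-1)             # crude timestep encoding
        return self.net(torch.cat([x_t, z_hat, t_feat], dim=-1))

def diffusion_step(x0, z_hat, denoiser):
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise           # forward noising q(x_t | x_0)
    return F.mse_loss(denoiser(x_t, t, z_hat), noise)      # standard ε-prediction loss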

Benefits:

  • JEPA learns dynamics efficiently
  • Generative model handles distribution
  • Best of both worlds

Applications in Computational Biology

1. Perturbation Prediction (Perturb-seq)

Task: Predict cellular response to genetic/chemical perturbations

JEPA approach:

# Baseline cell
z_baseline = encoder(x_baseline)

# Perturbation embedding
z_pert = perturbation_encoder(perturbation_info)

# Predict perturbed state
z_pred = predictor(z_baseline, z_pert)

# Compare to actual
z_actual = encoder(x_perturbed)
loss = ((z_pred - z_actual) ** 2).mean()   # ||z_pred - z_actual||²

Advantages:

  • No need to reconstruct all 20K genes
  • Learn perturbation operators
  • Compositional (combine perturbations)

2. Trajectory Inference

Task: Predict developmental or disease trajectories

JEPA approach:

# Current state
z_t = encoder(x_t)

# Time embedding
t_emb = time_encoder(t)

# Predict future
z_t1 = predictor(z_t, t_emb)

# Compare to actual
z_actual = encoder(x_t1)
loss = ((z_t1 - z_actual) ** 2).mean()   # ||z_t1 - z_actual||²

3. Multi-omics Integration

Task: Predict one modality from another

JEPA approach:

# RNA-seq
z_rna = encoder_rna(x_rna)

# Predict protein
z_protein_pred = predictor(z_rna)

# Compare to actual
z_protein = encoder_protein(x_protein)
loss = ((z_protein_pred - z_protein) ** 2).mean()   # ||z_protein_pred - z_protein||²

4. Drug Response Prediction

Task: Predict cellular response to drugs

JEPA approach:

# Baseline + drug
z_baseline = encoder(x_baseline)
z_drug = drug_encoder(drug_features)

# Predict response
z_response = predictor(z_baseline, z_drug)

# Compare to actual
z_actual = encoder(x_treated)
loss = ((z_response - z_actual) ** 2).mean()   # ||z_response - z_actual||²


Key Advantages for Biology

1. Efficiency

Generative models:

  • Train decoder on 20K genes
  • Pixel-level reconstruction loss
  • Slow, memory-intensive

JEPA:

  • No decoder
  • Embedding-level loss (e.g., 256-dim)
  • Fast, memory-efficient

Speedup: typically 10-100× faster training, since there is no decoder to train and the loss is computed in a low-dimensional embedding space

2. Robustness

Generative models:

  • Sensitive to technical noise
  • Must model all variation
  • Overfits to batch effects

JEPA:

  • Focuses on semantic content
  • Invariant to nuisance factors
  • More robust representations

3. Interpretability

Generative models:

  • Black box decoder
  • Hard to interpret latents

JEPA:

  • Direct prediction in embedding space
  • Can probe what's being predicted
  • Easier to analyze learned representations

4. Compositional Generalization

Generative models:

  • Learn p(x|condition)
  • Hard to combine conditions

JEPA:

  • Learn perturbation operators
  • Naturally compositional
  • Combine multiple perturbations

Limitations and Challenges

1. No Direct Generation

JEPA predicts embeddings, not samples.

Solution: Combine with generative model (VAE, diffusion) for sampling.

2. Embedding Collapse

Without regularization, embeddings can collapse to constant.

Solution: VICReg regularization (variance + covariance terms).

3. Predictor Capacity

Predictor must be powerful enough to model relationships.

Solution: Use transformer-based predictors with sufficient capacity.
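
A sketch of a small transformer-based predictor, assuming the context and conditioning embeddings are treated as tokens and the prediction is read off a learned query token (sizes and the token layout are illustrative assumptions):

import torch
import torch.nn as nn

class TransformerPredictor(nn.Module):
    """Predicts the target embedding from a short token sequence [z_x, context, query]."""
    def __init__(self, embed_dim=256, depth=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learned "prediction" token

    def forward(self, z_x, context):
        # z_x, context: (batch, embed_dim); stack into a 3-token sequence
        q = self.query.expand(z_x.shape[0], -1, -1)
        tokens = torch.cat([z_x.unsqueeze(1), context.unsqueeze(1), q], dim=1)
        return self.blocks(tokens)[:, -1]   # read the prediction off the query token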

4. Evaluation

How do you evaluate embedding predictions?

Solutions:

  • Downstream task performance
  • Embedding similarity metrics
  • Probe classifiers
  • Generate samples and evaluate
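
For example, the probe-classifier idea: freeze the JEPA encoder, fit a linear classifier on its embeddings, and report held-out accuracy. scikit-learn is just one convenient choice here; the encoder is any trained JEPA encoder:

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@torch.no_grad()
def linear_probe_accuracy(encoder, X, labels):
    """Fit a linear probe on frozen embeddings; higher accuracy = more linearly decodable signal."""
    Z = encoder(torch.as_tensor(X, dtype=torch.float32)).cpu().numpy()
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    return probe.score(Z_te, y_te)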

When to Use JEPA

Use JEPA When:

Prediction is the goal (not generation)

  • Perturbation prediction
  • Trajectory inference
  • Multi-omics translation

Efficiency matters

  • Large-scale datasets
  • Limited compute
  • Need fast training

Robustness is critical

  • Noisy data
  • Batch effects
  • Technical variation

Compositional reasoning needed

  • Combine perturbations
  • Transfer across contexts
  • Causal modeling

Use Generative Models When:

Need actual samples

  • Data augmentation
  • Synthetic data generation
  • Uncertainty quantification

Reconstruction quality matters

  • Image generation
  • High-fidelity synthesis

Distribution modeling is the goal

  • Density estimation
  • Anomaly detection

Best: Hybrid JEPA + Generative

Combine both:

  1. JEPA learns dynamics efficiently
  2. Generative model handles sampling
  3. Get prediction + generation + uncertainty


The Path Forward

Stage 1: Basic JEPA (Current)

  • Understand architecture
  • Implement encoder + predictor
  • VICReg regularization
  • Toy examples

Stage 2: Bio-JEPA

  • Apply to Perturb-seq
  • Perturbation conditioning
  • Evaluate on held-out perturbations

Stage 3: Joint Latent Spaces

  • Combine bulk + single-cell
  • Static + dynamic data
  • Multi-omics integration

Stage 4: JEPA + Generative

  • Add diffusion decoder
  • Uncertainty quantification
  • Full predictive-generative system

Key Takeaways

Conceptual

  1. Predict embeddings, not pixels — more efficient, more robust
  2. No reconstruction needed — focus on semantic content
  3. No contrastive negatives — simpler than SimCLR/MoCo
  4. World models without generation — learn dynamics efficiently
  5. Joint latent spaces — static and dynamic data train each other

Practical

  1. JEPA is not generative — predicts embeddings, not samples
  2. VICReg prevents collapse — variance + covariance regularization
  3. Powerful predictor needed — transformer-based works well
  4. Combine with generative — for sampling and uncertainty
  5. Perfect for biology — perturbations, trajectories, multi-omics

For Computational Biology

  1. Perturb-seq is ideal use case — predict perturbed states
  2. Efficiency matters — 20K genes, millions of cells
  3. Robustness critical — technical noise, batch effects
  4. Compositional reasoning — combine perturbations
  5. Hybrid approach best — JEPA + diffusion for full system


References

JEPA papers:

  • LeCun, "A Path Towards Autonomous Machine Intelligence" (2022)
  • Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (I-JEPA, 2023)
  • Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (V-JEPA, 2024)
  • Meta AI, "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning" (2025)

Joint latent spaces:

  • ByteDance & HKU, "Goku: Native Joint Image-Video Generation" (2024)

VICReg:

  • Bardes et al., "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning" (2022)

Biology applications (to be developed):

  • Perturb-seq prediction
  • Trajectory inference
  • Multi-omics integration