Roadmap Discussion: Adapting GenAI for Gene Expression¶

This document discusses how to adapt the ROADMAP.md for gene expression data (bulk RNA-seq and scRNA-seq).

The roadmap is strong as a "generative AI curriculum" and mostly workable for gene expression. However, a few stages and metrics need domain-specific rewiring to avoid optimizing the wrong thing.

What's Already a Great Fit¶

VAE → cVAE¶

This is exactly the right foundation if you treat gene expression as counts (NB/ZINB) rather than "pixels" (Gaussian/MSE).

Bulk RNA-seq: NB is usually the right default; ZINB is rarely needed
scRNA-seq: NB is often sufficient for UMI data; ZINB can help for extreme sparsity, but it's not automatically "better"

Score Matching → DDPM¶

Good next chapter if you pick the right representation. Directly diffusing raw counts is awkward; practical pipelines diffuse:

Log-normalized expression, or
Learned latent space (latent diffusion)
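As a rough illustration of the first option, here is a minimal sketch of log-normalization in plain Python. The function name `log_normalize` and the `target_sum=1e4` scale are assumptions for illustration (1e4 is a common convention, not something the roadmap prescribes); real pipelines would typically use a library such as scanpy.

```python
import math

def log_normalize(counts, target_sum=1e4):
    """Scale each cell/sample to `target_sum` total counts, then apply log1p.

    `counts` is a list of rows (cells or samples), each a list of gene counts.
    """
    normalized = []
    for row in counts:
        library_size = sum(row)
        if library_size == 0:
            # An empty profile carries no information; keep it as all zeros.
            normalized.append([0.0] * len(row))
            continue
        scale = target_sum / library_size
        normalized.append([math.log1p(x * scale) for x in row])
    return normalized

# Toy 2-cell x 3-gene count matrix (made-up numbers).
counts = [[0, 3, 7], [10, 0, 0]]
log_expr = log_normalize(counts)
```

The output is continuous and roughly comparable across sequencing depths, which is what makes it easier to diffuse than raw counts. The cost is that the count likelihood no longer applies directly.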

JEPA / World Models¶

Especially relevant for biology use cases:

Perturbation prediction (action = drug/KO)
Trajectory modeling (action = time)
Counterfactuals

Keeping JEPA/world models in later stages is sensible.

What to Adjust for Gene Expression¶

Put scVI-Style Likelihoods + Library Size into Stage 1–2¶

For real gene expression work, the first serious milestone should be:

NB decoder for counts (bulk + scRNA)
Explicit handling of library size / sequencing depth (as offset or covariate)
ZINB only after diagnostics show NB underfits zeros

This is the difference between a "VAE toy demo" and a "biology-grade model."

Replace Vision Metrics with Biology Metrics¶

FID and Inception Score (IS) aren't natural for gene expression because both rely on pretrained vision feature extractors. Better metrics:

Category	Metric
Likelihood	Held-out NB/ZINB log-likelihood (or ELBO)
Distribution	Gene-wise mean/variance + zero rate matching (per condition)
Structure	Condition-separation (do generated samples preserve tissue/disease structure?)
Utility	Downstream classifier trained on real+synthetic → tested on real

IWAE: Only if You Hit Posterior Collapse¶

IWAE is a great learning milestone, but for expression you'll get more value from:

KL warmup / free bits
Decoder likelihood correctness (NB)
Conditioning hygiene

IWAE becomes useful when studying inference quality (for example, how tight the bound is), but it's not the highest-ROI "next step" unless diagnostics point to an inference problem.
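For reference, KL warmup and free bits can be sketched in a few lines. This is a minimal, framework-free illustration; the names `kl_weight`, `free_bits_kl`, and `vae_loss`, the linear schedule, and the default values are all assumptions rather than settings from the roadmap.

```python
def kl_weight(step, warmup_steps=1000, max_weight=1.0):
    """Linear KL warmup: ramp the KL weight from 0 to max_weight."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.1):
    """Free bits: clamp each latent dimension's KL from below, then sum.

    `kl_per_dim` is assumed to hold batch-averaged KL values (in nats),
    one per latent dimension. Dimensions below the floor contribute a
    constant, so the optimizer gains nothing by shrinking them further.
    """
    return sum(max(kl, free_bits) for kl in kl_per_dim)

def vae_loss(recon_nll, kl_per_dim, step):
    """Negative ELBO with a warmed-up, free-bits KL term."""
    return recon_nll + kl_weight(step) * free_bits_kl(kl_per_dim)
```

In practice you would pick one of the two first and add the other only if the latent still collapses; combining them, as `vae_loss` does here, is just one possible design.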

Flow Matching: After Deciding Data Space vs Latent Space¶

Flow matching works well if you work in:

Continuous normalized expression space, or
Latent space from a trained encoder

Tie it explicitly to the representation choice.

A Two-Track Roadmap for Gene Expression¶

Track A: Count-Faithful Representation Learning¶

cVAE with NB decoder (bulk + scRNA)
Add conditions (tissue/disease/batch) + counterfactual swap
β-VAE only if you want disentangled residual factors (with good diagnostics)

Track B: High-Fidelity Generation¶

Learn a good continuous representation (normalized or latent)
Score matching / diffusion in that space
Conditional sampling (classifier-free guidance for metadata)
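The conditional sampling step can be illustrated with the core arithmetic of classifier-free guidance. This sketch assumes the model was trained with the metadata condition randomly dropped, so it can produce both a conditional and an unconditional prediction; `guided_prediction` and its arguments are hypothetical names, and the scale convention (1.0 recovers the purely conditional prediction) is one of several in use.

```python
def guided_prediction(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free guidance on one prediction vector.

    Moves from the unconditional prediction toward (and, for scales
    above 1, past) the conditional one:
        guided = uncond + scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

# Toy predictions for a 2-dimensional representation (made-up numbers).
cond = [1.0, 2.0]
uncond = [0.0, 4.0]
guided = guided_prediction(cond, uncond, guidance_scale=2.0)
```

Larger scales push samples harder toward the requested tissue/disease label, usually at some cost in diversity, so the scale is worth sweeping against the distribution metrics rather than fixing up front.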

Unification¶

Both tracks converge for:

Perturbation response (world model)
JEPA-style predictive objectives

This preserves the "VAE → score matching → diffusion → JEPA/world models" arc, but makes it biology-native.

The Key Decision: What Are You Modeling?¶

For your first real dataset, choose one:

Representation	Likelihood	Pros	Cons
Raw counts	NB/ZINB	Biologically faithful	Harder to model
Log-normalized	Gaussian	Easier diffusion/flow	Loses count structure
Learned latent	Gaussian	Best of both worlds	Requires good encoder first

Recommendation: Start with raw counts — it forces you to confront the actual generative problem (library size, overdispersion, sparsity, batch, confounding).

Guardrails for Raw Count Modeling¶

Use NB First, Not ZINB¶

Your MVP should be NB:

$$ x_g \sim \text{NB}(\mu_g, \alpha_g) $$

where $\mu_g$ is the mean and $\alpha_g$ the gene-wise dispersion.

ZINB adds an extra head ($\pi$, the zero-inflation probability) and can soak up modeling mistakes ("everything becomes dropout"). Upgrade to ZINB only if NB fails clear diagnostics.
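For reference, the NB likelihood above can be evaluated directly. The sketch below assumes the mean/dispersion parameterization in which $\text{Var}(x_g) = \mu_g + \alpha_g \mu_g^2$; that convention is an assumption (some libraries use the inverse dispersion $1/\alpha_g$ instead), and `nb_log_pmf` is a hypothetical helper name.

```python
import math

def nb_log_pmf(x, mu, alpha):
    """log NB(x | mean=mu, dispersion=alpha), with Var = mu + alpha * mu**2.

    Requires mu > 0 and alpha > 0; x is a non-negative integer count.
    """
    r = 1.0 / alpha  # inverse dispersion (the NB "size" parameter)
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )

# Log-likelihood of one gene's counts across four cells (made-up numbers).
counts = [0, 2, 5, 11]
total_ll = sum(nb_log_pmf(x, mu=5.0, alpha=0.5) for x in counts)
```

A quick sanity check is that the probabilities sum to one and reproduce the intended mean and variance, which catches most parameterization mix-ups before they reach training.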

Pick Real but Manageable Datasets¶

Datasets should be:

Public, well-described
Not enormous
Have clean metadata (tissue/disease/batch)
Used in prior work (for sanity-checking)

Minimum Data Hygiene for Raw Counts¶

If you skip these, NB models will look worse than they are:

Gene filtering: Remove genes detected in almost no cells/samples (or keep only highly variable genes, HVGs)
Library size factor (must-have): Total counts per sample/cell

Typical NB parameterization:

$$ \mu_g = \ell \cdot \exp(\eta_g) $$

where $\ell$ is library size and $\eta_g$ is what the decoder predicts from $(z, y)$.

Batch covariate: Include batch in $y$ if present (even if you later want invariance)
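The library-size parameterization above can be sketched as follows. The helper names `library_sizes` and `nb_mean` are hypothetical, and `eta` is a stand-in for whatever the decoder outputs from $(z, y)$; the code only shows how $\ell$ enters the mean as a log-space offset.

```python
import math

def library_sizes(counts):
    """Total counts per cell/sample (rows of a cells x genes matrix)."""
    return [sum(row) for row in counts]

def nb_mean(library_size, eta):
    """mu_g = l * exp(eta_g), computed as exp(log l + eta_g).

    Writing it in log space makes the role of l explicit: it is an
    offset added to the decoder output. Requires library_size > 0.
    """
    log_l = math.log(library_size)
    return [math.exp(log_l + e) for e in eta]

# Toy 2-cell x 3-gene count matrix (made-up numbers).
counts = [[0, 3, 7], [20, 0, 0]]
sizes = library_sizes(counts)    # [10, 20]
eta = [-2.0, -1.0, -0.5]         # stand-in for decoder output
mu_per_cell = [nb_mean(l, eta) for l in sizes]
```

With the same `eta`, doubling the library size doubles every expected count, so the decoder only has to learn relative expression rather than sequencing depth.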

Evaluation Metrics for Real Data¶

Forget FID. For counts, track:

Likelihood-Fit Diagnostics

Held-out NB log-likelihood / ELBO
Gene-wise mean/variance vs real (per condition)
Zero rate vs real (per gene; per condition)
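The gene-wise checks in this list are simple to compute. Here is a minimal sketch in plain Python, assuming real and generated data are both cells x genes count matrices for the same condition and gene order; `gene_summary` and `max_abs_gap` are hypothetical helper names.

```python
def gene_summary(counts):
    """Per-gene mean, variance, and zero fraction for a cells x genes matrix."""
    n = len(counts)
    summaries = []
    for gene_values in zip(*counts):  # iterate over columns (genes)
        mean = sum(gene_values) / n
        var = sum((v - mean) ** 2 for v in gene_values) / n  # population variance
        zero_rate = sum(1 for v in gene_values if v == 0) / n
        summaries.append({"mean": mean, "var": var, "zero_rate": zero_rate})
    return summaries

def max_abs_gap(real, generated, key):
    """Largest per-gene absolute difference in one summary statistic."""
    real_stats, gen_stats = gene_summary(real), gene_summary(generated)
    return max(abs(r[key] - g[key]) for r, g in zip(real_stats, gen_stats))

# Toy 2-cell x 2-gene matrices (made-up numbers).
real = [[0, 2], [4, 2]]
generated = [[0, 1], [2, 3]]
gaps = {key: max_abs_gap(real, generated, key) for key in ("mean", "var", "zero_rate")}
```

Run it separately per condition (tissue, disease, batch); pooled statistics can look fine even when individual conditions are badly matched.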

Structure Diagnostics

Train a probe on inferred $z$ to predict batch/tissue (detect leakage/confounding)
Latent collapse monitoring: average KL, active dimensions
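Latent collapse monitoring can be sketched as below for a diagonal-Gaussian encoder with a standard normal prior. Counting a dimension as "active" when its batch-averaged KL exceeds a small threshold is one common heuristic, not the only definition; the threshold of 0.01 and the names `gaussian_kl_per_dim` and `latent_usage` are assumptions.

```python
import math

def gaussian_kl_per_dim(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - lv - 1.0) for m, lv in zip(mu, log_var)]

def latent_usage(batch_mu, batch_log_var, threshold=0.01):
    """Batch-averaged KL per dimension, total KL, and number of active dims."""
    n = len(batch_mu)
    dims = len(batch_mu[0])
    mean_kl = [0.0] * dims
    for mu, log_var in zip(batch_mu, batch_log_var):
        for d, kl in enumerate(gaussian_kl_per_dim(mu, log_var)):
            mean_kl[d] += kl / n
    active = sum(1 for kl in mean_kl if kl > threshold)
    return {"mean_kl_per_dim": mean_kl, "total_kl": sum(mean_kl), "active_dims": active}

# Two cells, two latent dimensions; the second dimension matches the prior.
batch_mu = [[1.0, 0.0], [-1.0, 0.0]]
batch_log_var = [[0.0, 0.0], [0.0, 0.0]]
usage = latent_usage(batch_mu, batch_log_var)
```

A falling `active_dims` count during training is an early warning that the decoder is starting to ignore $z$.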

Usefulness Diagnostics

Downstream classifier trained on synthetic + real, tested on real
Differential expression (DE) signature preservation: correlation of effect sizes (for example, log fold changes) between real and generated data
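One simple way to score signature preservation is to correlate per-gene log fold changes computed on real data with those computed on generated data. The sketch below assumes the inputs are already depth-normalized, uses a pseudocount of 1, and uses mean-based fold changes rather than a proper DE test; `log_fold_change` and `pearson` are hypothetical helper names.

```python
import math

def log_fold_change(group_a, group_b, pseudocount=1.0):
    """Per-gene log2 fold change of mean expression, group_b relative to group_a."""
    lfc = []
    for a_vals, b_vals in zip(zip(*group_a), zip(*group_b)):
        mean_a = sum(a_vals) / len(a_vals)
        mean_b = sum(b_vals) / len(b_vals)
        lfc.append(math.log2((mean_b + pseudocount) / (mean_a + pseudocount)))
    return lfc

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy control/case groups: 2 cells x 3 genes each (made-up numbers).
real_ctrl = [[1, 10, 5], [3, 10, 5]]
real_case = [[7, 2, 5], [5, 4, 5]]
real_lfc = log_fold_change(real_ctrl, real_case)
# In practice, compute gen_lfc from generated control/case samples instead.
gen_lfc = log_fold_change(real_ctrl, real_case)
signature_corr = pearson(real_lfc, gen_lfc)
```

A correlation near 1 means the generator reproduces the direction and relative size of the condition effect; it says nothing about absolute calibration, which the gene-wise checks cover.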

Where Different Methods Shine¶

| Method | Best For | Caveats |
|--------|----------|---------|
| cVAE (NB) | Controlled generation + counterfactual swaps; latents for world-modeling | - |
| β-VAE | Disentangled residual factors | Can hurt reconstruction/log-likelihood |
| Diffusion/Score | High-fidelity generation | Awkward on discrete counts; use latent space |

Practical Staged Plan (Raw Counts)¶

NB cVAE on a real dataset with a single clean condition (tissue only OR disease only)
Add library size modeling explicitly and confirm fit improves
Add batch conditioning and test counterfactual consistency
Test ZINB only if NB underfits zeros (check held-out likelihood + zero-rate calibration)
Add β (β-VAE) only for specific goals: disentangled factors, stable latents
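The zero-rate calibration check in the ZINB step can be sketched per gene as follows. It assumes the NB mean/dispersion parameterization with $\text{Var} = \mu + \alpha\mu^2$, under which $P(x = 0) = (1 + \alpha\mu)^{-1/\alpha}$; `nb_zero_prob` and `zero_rate_gap` are hypothetical helper names, and `mu` stands in for the fitted per-cell means of one gene.

```python
def nb_zero_prob(mu, alpha):
    """P(x = 0) under NB with mean mu and dispersion alpha (Var = mu + alpha * mu**2)."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

def zero_rate_gap(observed_counts, mu, alpha):
    """Observed zero fraction minus the NB-expected zero fraction for one gene.

    `observed_counts` and `mu` are per-cell values for the same gene.
    A large positive gap means NB predicts too few zeros.
    """
    observed = sum(1 for x in observed_counts if x == 0) / len(observed_counts)
    expected = sum(nb_zero_prob(m, alpha) for m in mu) / len(mu)
    return observed - expected

# One gene across four held-out cells (made-up numbers).
gap = zero_rate_gap([0, 0, 1, 4], mu=[5.0, 5.0, 5.0, 5.0], alpha=0.5)
```

If the gap is consistently positive across many genes on held-out data, and held-out likelihood agrees, that is the evidence that justifies trying ZINB; a gap on a handful of genes usually points to a conditioning or library-size problem instead.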

Bulk vs scRNA: Which First?¶

For your first real experiment:

Data Type	Pros	Cons
Bulk RNA-seq	Cleaner per sample; simpler	Fewer samples; harder to train deep models
scRNA-seq	More data points; tests sparsity handling	More nuisance variation

Recommendation: Start with scRNA-seq (PBMC 3k) because:

Smaller, faster iteration: ~3k cells, versus ~17k samples for GTEx
Well-documented: extensively used in tutorials (scanpy, scVI)
Clear ground truth: known cell types for validation
Directly tests NB/ZINB: sparsity is real
Preprocessing script ready: src/genailab/data/sc_preprocess.py

Recommended Datasets¶

scRNA-seq (Start Here)¶

Dataset	Description	Size	Link
PBMC 3k	Classic starter dataset	~3k cells	10x Genomics
Tabula Sapiens	Multi-tissue human atlas	~500k cells	Tabula Sapiens

Bulk RNA-seq (Later)¶

Dataset	Description	Size	Link
GTEx	Multi-tissue, healthy baseline	~17k samples	GTEx Portal
recount3	Uniformly processed public RNA-seq	Massive	recount3

Note: For bulk, NB is typically sufficient. For UMI scRNA, NB is often enough; add ZINB only if NB badly underfits zeros.

References¶

ROADMAP.md — Original GenAI roadmap
VAE-07-NB-ZINB.md — NB vs ZINB likelihood choice
VAE-08-NB-likelihood.md — NB log-likelihood derivation
Lopez et al. (2018) — scVI: Deep generative modeling for single-cell transcriptomics
Eraslan et al. (2019) — DCA: Single-cell RNA-seq denoising using a deep count autoencoder