Variational Autoencoders (VAE)¶
Comprehensive guide to VAEs for gene expression and count data modeling.
Overview¶
Variational Autoencoders (VAEs) are powerful generative models that learn compressed representations of high-dimensional data. For computational biology, VAEs are particularly useful for:
- Dimensionality reduction: Learn low-dimensional latent spaces for gene expression
- Denoising: Remove technical noise from single-cell data
- Generation: Create synthetic samples for data augmentation
- Perturbation prediction: Model drug responses and genetic perturbations
This series covers VAE theory, implementation, and specialized variants for biological count data.
Document Series¶
Core Theory¶
| Document | Topic | Key Concepts |
|---|---|---|
| VAE-01: Overview | Introduction to VAEs | Encoder-decoder, latent space, variational inference |
| VAE-02: ELBO | Evidence Lower Bound | Reconstruction + KL divergence, variational objective |
| VAE-03: Inference | Inference & generation | Posterior q(z|x), prior p(z), sampling |
Gradient Estimation¶
| Document | Topic | Key Concepts |
|---|---|---|
| VAE-04: Reparameterization | Reparameterization trick | Backprop through stochastic nodes |
| VAE-05: Pathwise Derivative | Pathwise gradient estimator | Score function vs. pathwise |
| VAE-05a: Pathwise Details | Implementation details | Practical considerations |
Training & Optimization¶
| Document | Topic | Key Concepts |
|---|---|---|
| VAE-06: Optimization | Training strategies | KL annealing, batch normalization, regularization |
| VAE Model Training | Implementation guide | PyTorch training loops, hyperparameters |
Count Data & Biology¶
| Document | Topic | Key Concepts |
|---|---|---|
| VAE-07: NB & ZINB | Count data decoders | Negative Binomial, Zero-Inflated NB |
| VAE-08: NB Likelihood | NB loss derivation | Gamma-Poisson mixture, dispersion |
Applications¶
| Document | Topic | Key Concepts |
|---|---|---|
| VAE for Prediction | Predictive modeling | Conditioning, perturbation response |
| VAE-09: Roadmap | Extensions & future work | Hierarchical VAE, disentanglement, causal |
Quick Start Guide¶
1. Start with the Basics¶
Read in order: 1. VAE-01: Overview - Understand the overall framework 2. VAE-02: ELBO - Learn the training objective 3. VAE-03: Inference - Understand latent space and sampling
2. Understand Gradients¶
Essential for implementation: 1. VAE-04: Reparameterization - The key trick for backprop 2. VAE-05: Pathwise Derivative - Why it works
3. Train Your First VAE¶
- VAE-06: Optimization - Training strategies
- VAE Model Training - Hands-on implementation
4. Handle Count Data¶
For scRNA-seq and bulk RNA-seq: 1. VAE-07: NB & ZINB - Specialized decoders 2. VAE-08: NB Likelihood - Mathematical details
5. Build Applications¶
- VAE for Prediction - Perturbation modeling
- VAE-09: Roadmap - Advanced topics
VAE Variants Implemented¶
Conditional VAE (CVAE)¶
Use case: Conditional generation (e.g., cell type → expression)
from genailab.model.vae import CVAE
model = CVAE(
input_dim=2000, # genes
latent_dim=10, # compressed representation
condition_dim=5, # cell types
hidden_dims=[512, 256]
)
CVAE with Negative Binomial (CVAE_NB)¶
Use case: Single-cell RNA-seq (count data with overdispersion)
from genailab.model.vae import CVAE_NB
model = CVAE_NB(
input_dim=2000,
latent_dim=10,
condition_dim=5
)
# Predicts mean μ and dispersion r for NB distribution
CVAE with Zero-Inflated NB (CVAE_ZINB)¶
Use case: scRNA-seq with dropout (many zeros)
from genailab.model.vae import CVAE_ZINB
model = CVAE_ZINB(
input_dim=2000,
latent_dim=10,
condition_dim=5
)
# Predicts μ, r, and dropout probability π
Key Concepts¶
Encoder (Recognition Network)¶
Purpose: Map data x to latent distribution q(z|x)
Decoder (Generative Network)¶
Purpose: Map latent z back to data distribution p(x|z)
Decoder types: - Gaussian: MSE loss, for continuous data - Negative Binomial: For count data with variance > mean - Zero-Inflated NB: For sparse count data (scRNA-seq)
ELBO (Evidence Lower Bound)¶
Training objective:
- Reconstruction: How well can we regenerate x from z?
- KL divergence: How close is q(z|x) to prior p(z)?
Applications in Computational Biology¶
1. Denoising scRNA-seq¶
Problem: Technical noise, dropout, batch effects
Solution: VAE learns clean latent representation
Model: CVAE_ZINB with batch/cell type conditioning
2. Drug Response Prediction¶
Problem: Predict perturbed expression from baseline
Solution: Conditional VAE with drug embeddings
Model: CVAE conditioned on [baseline expression, drug ID, dose]
3. Data Augmentation¶
Problem: Limited training samples
Solution: Generate synthetic samples from learned distribution
Model: Sample from p(z), decode to get new x
4. Batch Correction¶
Problem: Technical variation across experiments
Solution: Learn batch-invariant latent space
Model: CVAE with adversarial batch discriminator
5. Cell Type Discovery¶
Problem: Identify novel cell types
Solution: Cluster in learned latent space
Model: VAE → t-SNE/UMAP on z → clustering
Comparison: VAE vs. Other Generative Models¶
| Model | Pros | Cons | Best For |
|---|---|---|---|
| VAE | Fast, stable, explicit latent space | Can be blurry, mode averaging | Representation learning, denoising |
| GAN | Sharp samples, high quality | Training instability, mode collapse | Image generation |
| Diffusion | High quality, stable training | Slow sampling | State-of-the-art generation |
| Flow | Exact likelihood, invertible | Complex architecture | Density estimation |
For gene expression: VAE is often preferred for its: - Interpretable latent space - Fast inference (single forward pass) - Stable training - Uncertainty quantification
Related Topics¶
Within This Project¶
- DDPM - Diffusion models (slower but higher quality)
- Flow Matching - Continuous normalizing flows
- Beta-VAE - Disentangled representations
- Foundation Models - Pre-trained encoders
External Resources¶
- scVI - Industry-standard VAE for scRNA-seq
- CPA (Compositional Perturbation Autoencoder) - Perturbation prediction
- Geneformer - Foundation model (can replace VAE encoder)
Implementation Status¶
| Component | Status | Location |
|---|---|---|
| CVAE (Gaussian) | ✅ Complete | src/genailab/model/vae.py |
| CVAE_NB | ✅ Complete | src/genailab/model/vae.py |
| CVAE_ZINB | ✅ Complete | src/genailab/model/vae.py |
| Training scripts | ✅ Complete | scripts/ |
| Evaluation metrics | ✅ Complete | src/genailab/eval/ |
| Interactive notebooks | 📋 Planned | notebooks/vae/ |
Frequently Asked Questions¶
When should I use VAE vs. Diffusion?¶
Use VAE when: - Fast inference is critical - You need interpretable latent representations - Working with small-to-medium datasets - Uncertainty quantification is important
Use Diffusion when: - Generation quality is top priority - You have large datasets and compute - Slow sampling (100+ steps) is acceptable
How to choose latent dimension?¶
Guidelines: - Gene expression: 10-50 dims (10-20 typical for scRNA-seq) - Rule of thumb: Start with 10-20, increase if reconstruction is poor - Validate: Plot reconstruction error vs. latent dim
What decoder should I use?¶
| Data Type | Decoder | Reason |
|---|---|---|
| Normalized (log-transformed) | Gaussian (MSE) | Simple, fast |
| Raw counts (bulk RNA-seq) | Negative Binomial | Handles overdispersion |
| scRNA-seq (sparse) | ZINB | Handles zeros and overdispersion |
Questions or suggestions? Open an issue on GitHub