Variational Autoencoders (VAE)

Comprehensive guide to VAEs for gene expression and count data modeling.


Overview

Variational Autoencoders (VAEs) are powerful generative models that learn compressed representations of high-dimensional data. For computational biology, VAEs are particularly useful for:

  • Dimensionality reduction: Learn low-dimensional latent spaces for gene expression
  • Denoising: Remove technical noise from single-cell data
  • Generation: Create synthetic samples for data augmentation
  • Perturbation prediction: Model drug responses and genetic perturbations

This series covers VAE theory, implementation, and specialized variants for biological count data.


Document Series

Core Theory

| Document | Topic | Key Concepts |
| --- | --- | --- |
| VAE-01: Overview | Introduction to VAEs | Encoder-decoder, latent space, variational inference |
| VAE-02: ELBO | Evidence Lower Bound | Reconstruction + KL divergence, variational objective |
| VAE-03: Inference | Inference & generation | Posterior q(z\|x), prior p(z), sampling |

Gradient Estimation

| Document | Topic | Key Concepts |
| --- | --- | --- |
| VAE-04: Reparameterization | Reparameterization trick | Backprop through stochastic nodes |
| VAE-05: Pathwise Derivative | Pathwise gradient estimator | Score function vs. pathwise |
| VAE-05a: Pathwise Details | Implementation details | Practical considerations |

Training & Optimization

| Document | Topic | Key Concepts |
| --- | --- | --- |
| VAE-06: Optimization | Training strategies | KL annealing, batch normalization, regularization |
| VAE Model Training | Implementation guide | PyTorch training loops, hyperparameters |

Count Data & Biology

| Document | Topic | Key Concepts |
| --- | --- | --- |
| VAE-07: NB & ZINB | Count data decoders | Negative Binomial, Zero-Inflated NB |
| VAE-08: NB Likelihood | NB loss derivation | Gamma-Poisson mixture, dispersion |

Applications

| Document | Topic | Key Concepts |
| --- | --- | --- |
| VAE for Prediction | Predictive modeling | Conditioning, perturbation response |
| VAE-09: Roadmap | Extensions & future work | Hierarchical VAE, disentanglement, causal |

Quick Start Guide

1. Start with the Basics

Read in order:

  1. VAE-01: Overview - Understand the overall framework
  2. VAE-02: ELBO - Learn the training objective
  3. VAE-03: Inference - Understand latent space and sampling

2. Understand Gradients

Essential for implementation:

  1. VAE-04: Reparameterization - The key trick for backprop
  2. VAE-05: Pathwise Derivative - Why it works
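The reparameterization trick covered in these documents can be sketched in a few lines of NumPy: instead of sampling z directly, sample noise ε and express z as a deterministic function of μ and σ, so gradients can flow through them. The function name is illustrative, not the genailab API.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I).

    The randomness lives in eps, so backprop can differentiate
    through mu and log_var (the reparameterization trick)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
# 100k samples with mu = 2.0, sigma = 0.5 (log_var = log 0.25)
z = reparameterize(np.full(100_000, 2.0), np.log(np.full(100_000, 0.25)), rng)
```

The sample mean and standard deviation of `z` should land close to 2.0 and 0.5, confirming that the transformed noise has the intended distribution.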

3. Train Your First VAE

  1. VAE-06: Optimization - Training strategies
  2. VAE Model Training - Hands-on implementation
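One of the training strategies from VAE-06, KL annealing, amounts to ramping the weight on the KL term from 0 to 1 so the model learns to reconstruct before the regularizer dominates. A minimal linear warm-up sketch; the function name and schedule are illustrative:

```python
def kl_weight(step, warmup_steps=1000):
    """Linear KL annealing: the KL weight (beta) ramps from 0 to 1
    over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

# Inside a training loop the loss would look like:
# loss = recon_loss + kl_weight(step) * kl_div
```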

4. Handle Count Data

For scRNA-seq and bulk RNA-seq:

  1. VAE-07: NB & ZINB - Specialized decoders
  2. VAE-08: NB Likelihood - Mathematical details
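The NB likelihood derived in VAE-08 (Gamma-Poisson form, with mean μ and dispersion r) can be written down in plain Python for a quick sanity check; the function name is illustrative:

```python
from math import exp, lgamma, log

def nb_log_pmf(x, mu, r):
    """Log NB probability of count x with mean mu and dispersion r.

    Gamma-Poisson parameterization: variance = mu + mu**2 / r > mu,
    capturing the overdispersion typical of RNA-seq counts."""
    return (lgamma(x + r) - lgamma(r) - lgamma(x + 1)
            + r * log(r / (r + mu)) + x * log(mu / (r + mu)))

# Sanity check: the pmf should sum to ~1 over the support
total = sum(exp(nb_log_pmf(x, mu=5.0, r=2.0)) for x in range(500))
```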

5. Build Applications

  1. VAE for Prediction - Perturbation modeling
  2. VAE-09: Roadmap - Advanced topics

VAE Variants Implemented

Conditional VAE (CVAE)

Use case: Conditional generation (e.g., cell type → expression)

```python
from genailab.model.vae import CVAE

model = CVAE(
    input_dim=2000,      # genes
    latent_dim=10,       # compressed representation
    condition_dim=5,     # cell types
    hidden_dims=[512, 256]
)
```

CVAE with Negative Binomial (CVAE_NB)

Use case: Single-cell RNA-seq (count data with overdispersion)

```python
from genailab.model.vae import CVAE_NB

model = CVAE_NB(
    input_dim=2000,
    latent_dim=10,
    condition_dim=5
)
# Predicts mean μ and dispersion r for NB distribution
```

CVAE with Zero-Inflated NB (CVAE_ZINB)

Use case: scRNA-seq with dropout (many zeros)

```python
from genailab.model.vae import CVAE_ZINB

model = CVAE_ZINB(
    input_dim=2000,
    latent_dim=10,
    condition_dim=5
)
# Predicts μ, r, and dropout probability π
```

Key Concepts

Encoder (Recognition Network)

Purpose: Map data x to latent distribution q(z|x)

x (gene expression) → Neural Net → μ_z, σ_z → z ~ N(μ_z, σ_z²)

Decoder (Generative Network)

Purpose: Map latent z back to data distribution p(x|z)

z (latent code) → Neural Net → reconstruction x̂

Decoder types:

  • Gaussian: MSE loss, for continuous data
  • Negative Binomial: for count data with variance > mean
  • Zero-Inflated NB: for sparse count data (scRNA-seq)
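The encoder/decoder pair above can be sketched in a few lines of PyTorch with a Gaussian decoder. Layer sizes and names are illustrative, not the genailab implementation:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal Gaussian VAE: the encoder outputs (mu, log_var), and the
    decoder reconstructs x from a reparameterized sample z."""
    def __init__(self, input_dim=2000, latent_dim=10, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.fc_mu = nn.Linear(hidden, latent_dim)
        self.fc_log_var = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

model = TinyVAE()
x_hat, mu, log_var = model(torch.randn(8, 2000))  # batch of 8 profiles
```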

ELBO (Evidence Lower Bound)

Training objective:

\[ \mathcal{L} = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction}} - \underbrace{KL[q(z|x) || p(z)]}_{\text{Regularization}} \]
  • Reconstruction: How well can we regenerate x from z?
  • KL divergence: How close is q(z|x) to prior p(z)?
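For a diagonal Gaussian posterior and a standard normal prior, both ELBO terms have simple closed forms; a NumPy sketch, where a Gaussian decoder makes the reconstruction term a negative squared error up to constants (function names are illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL[ N(mu, sigma^2) || N(0, I) ] for a diagonal
    Gaussian posterior, summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(x, x_hat, mu, log_var):
    """ELBO with a Gaussian decoder: negative squared error stands in
    for E_q[log p(x|z)] up to constants, minus the KL regularizer."""
    recon = -np.sum((x - x_hat) ** 2, axis=-1)
    return recon - gaussian_kl(mu, log_var)
```

When the posterior exactly matches the prior (μ = 0, log σ² = 0) the KL term vanishes, which is a handy unit test.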

Applications in Computational Biology

1. Denoising scRNA-seq

Problem: Technical noise, dropout, batch effects
Solution: VAE learns clean latent representation
Model: CVAE_ZINB with batch/cell type conditioning

2. Drug Response Prediction

Problem: Predict perturbed expression from baseline
Solution: Conditional VAE with drug embeddings
Model: CVAE conditioned on [baseline expression, drug ID, dose]

3. Data Augmentation

Problem: Limited training samples
Solution: Generate synthetic samples from learned distribution
Model: Sample from p(z), decode to get new x
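The sample-then-decode step can be sketched in a few lines of PyTorch; the decoder below is an untrained stand-in for a trained VAE decoder:

```python
import torch

# Stand-in decoder mapping latent z (10-d) -> expression (2000 genes);
# in practice this would be the decoder of a trained VAE.
decoder = torch.nn.Sequential(
    torch.nn.Linear(10, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 2000))

with torch.no_grad():
    z = torch.randn(64, 10)      # sample from the prior p(z) = N(0, I)
    x_synthetic = decoder(z)     # decode into synthetic expression profiles
```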

4. Batch Correction

Problem: Technical variation across experiments
Solution: Learn batch-invariant latent space
Model: CVAE with adversarial batch discriminator

5. Cell Type Discovery

Problem: Identify novel cell types
Solution: Cluster in learned latent space
Model: VAE → t-SNE/UMAP on z → clustering


Comparison: VAE vs. Other Generative Models

| Model | Pros | Cons | Best For |
| --- | --- | --- | --- |
| VAE | Fast, stable, explicit latent space | Can be blurry, mode averaging | Representation learning, denoising |
| GAN | Sharp samples, high quality | Training instability, mode collapse | Image generation |
| Diffusion | High quality, stable training | Slow sampling | State-of-the-art generation |
| Flow | Exact likelihood, invertible | Complex architecture | Density estimation |

For gene expression, VAE is often preferred for its:

  • Interpretable latent space
  • Fast inference (single forward pass)
  • Stable training
  • Uncertainty quantification


External Resources

  • scVI - Industry-standard VAE for scRNA-seq
  • CPA (Compositional Perturbation Autoencoder) - Perturbation prediction
  • Geneformer - Foundation model (can replace VAE encoder)

Implementation Status

| Component | Status | Location |
| --- | --- | --- |
| CVAE (Gaussian) | ✅ Complete | src/genailab/model/vae.py |
| CVAE_NB | ✅ Complete | src/genailab/model/vae.py |
| CVAE_ZINB | ✅ Complete | src/genailab/model/vae.py |
| Training scripts | ✅ Complete | scripts/ |
| Evaluation metrics | ✅ Complete | src/genailab/eval/ |
| Interactive notebooks | 📋 Planned | notebooks/vae/ |

Frequently Asked Questions

When should I use VAE vs. Diffusion?

Use VAE when:

  • Fast inference is critical
  • You need interpretable latent representations
  • Working with small-to-medium datasets
  • Uncertainty quantification is important

Use Diffusion when:

  • Generation quality is the top priority
  • You have large datasets and compute
  • Slow sampling (100+ steps) is acceptable

How to choose latent dimension?

Guidelines:

  • Gene expression: 10-50 dims (10-20 typical for scRNA-seq)
  • Rule of thumb: start with 10-20, increase if reconstruction is poor
  • Validate: plot reconstruction error vs. latent dimension

What decoder should I use?

| Data Type | Decoder | Reason |
| --- | --- | --- |
| Normalized (log-transformed) | Gaussian (MSE) | Simple, fast |
| Raw counts (bulk RNA-seq) | Negative Binomial | Handles overdispersion |
| scRNA-seq (sparse) | ZINB | Handles zeros and overdispersion |

Questions or suggestions? Open an issue on GitHub