Application Guides

This directory contains end-to-end application guides for generative AI in computational biology. Unlike the methodology-focused documentation in the sibling directories (docs/VAE/, docs/DDPM/, docs/JEPA/), these guides are application-first and results-oriented.

Philosophy

Each application guide follows a consistent structure:

Problem definition: What biological question are we answering?
Why generative AI: What do generative models add over discriminative/deterministic approaches?
Architecture choices: Which methods from our toolkit (VAE, JEPA, diffusion, flow) and why?
Implementation roadmap: Step-by-step guide with code examples
Evaluation strategy: Metrics, benchmarks, biological validation
Expected outcomes: Quantitative results and scientific insights

Active Applications

🎯 Perturbation Prediction

Goal: Predict single-cell responses to genetic/chemical perturbations

Status: Active development (Weeks 1-2: Data + VAE Baseline)

Architecture: CVAE_NB → JEPA → Latent Diffusion

Target Dataset: Norman et al. 2019 Perturb-seq (K562 cells)

Key Innovation: A three-stage approach that combines:

- Count-aware modeling (NB decoders)
- Self-supervised prediction (JEPA)
- Uncertainty quantification (diffusion in latent space)
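
To make the count-aware piece concrete, here is a minimal sketch of the negative binomial (NB) log-likelihood that an NB decoder would typically be trained on. It assumes the common mean / inverse-dispersion parameterization (`mu`, `theta`); the function names are illustrative and not taken from the project's code, which may use a different parameterization or a library implementation.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under an NB with mean mu and inverse dispersion theta.

    Under this parameterization the variance is mu + mu**2 / theta,
    so a small theta corresponds to strong overdispersion.
    """
    if x < 0 or mu <= 0 or theta <= 0:
        raise ValueError("x must be >= 0 and mu, theta must be > 0")
    # Log of the normalizing term Gamma(x + theta) / (Gamma(theta) * x!)
    log_norm = math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
    return (
        log_norm
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def nb_nll(counts, mus, theta):
    """Negative log-likelihood summed over genes for one cell (shared theta, an assumption)."""
    return -sum(nb_log_pmf(x, mu, theta) for x, mu in zip(counts, mus))
```

Because the variance is `mu + mu**2 / theta`, a small `theta` lets the likelihood absorb the overdispersion typical of single-cell counts, which a Poisson likelihood cannot represent.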

See: perturbation_prediction.md

Planned Applications

📋 Gene Expression Prediction

Goal: Predict gene expression from metadata with uncertainty quantification

Why Generative AI:

- GEM-1 and similar models learn \(\mathbb{E}[x \mid \text{metadata}]\)
- We target \(p(x \mid \text{metadata})\) for uncertainty quantification
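
One way to state the difference, assuming the supervised predictor is trained with a squared-error loss (the symbols \(f\) and \(m\), for the predictor and the metadata, are introduced here for illustration): the minimizer of that loss is the conditional mean,

\[
f^\star = \arg\min_{f} \; \mathbb{E}\left[\lVert x - f(m) \rVert^2\right], \qquad f^\star(m) = \mathbb{E}[x \mid m],
\]

and the error that remains at the optimum is exactly the spread that a point prediction cannot express:

\[
\mathbb{E}\left[\lVert x - f^\star(m) \rVert^2\right] = \mathbb{E}\left[\operatorname{tr}\,\operatorname{Cov}(x \mid m)\right].
\]

Modeling \(p(x \mid m)\) directly is what makes that residual variability available as an uncertainty estimate.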

Proposed Architecture: Hybrid predictive-generative

- Stage 1: GEM-1-style supervised predictor (learn conditional mean)
- Stage 2: Diffusion on residuals (learn distribution around mean)
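
The two-stage split can be illustrated on toy data. The sketch below is a stand-in, not the proposed implementation: Stage 1 is a per-group average rather than a GEM-1-style network, and Stage 2 resamples observed residuals rather than training a diffusion model. All function names and data values are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def fit_conditional_mean(samples):
    """Stage 1 stand-in: estimate E[x | metadata] as the per-group average."""
    groups = defaultdict(list)
    for meta, x in samples:
        groups[meta].append(x)
    return {meta: mean(xs) for meta, xs in groups.items()}


def collect_residuals(samples, cond_mean):
    """Stage 2 input: what remains after subtracting the Stage 1 prediction."""
    residuals = defaultdict(list)
    for meta, x in samples:
        residuals[meta].append(x - cond_mean[meta])
    return residuals


def sample_expression(meta, cond_mean, residuals, rng, n=1):
    """Draw from an approximation of p(x | metadata) as mean + sampled residual.

    Resampling observed residuals is a placeholder for the learned
    generative model (diffusion in the proposed architecture).
    """
    return [cond_mean[meta] + rng.choice(residuals[meta]) for _ in range(n)]


# Toy data: (tissue label, expression of one gene); the values are made up.
data = [("liver", 5.0), ("liver", 7.0), ("liver", 6.0), ("brain", 1.0), ("brain", 3.0)]
cond_mean = fit_conditional_mean(data)
residuals = collect_residuals(data, cond_mean)
draws = sample_expression("liver", cond_mean, residuals, random.Random(0), n=4)
```

The point of the decomposition is that the Stage 1 predictor carries the systematic effect of the metadata, so the generative model only has to learn the shape of the variation around it.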

Target Dataset: GTEx or harmonized bulk RNA-seq

Status: Next after Perturbation Prediction

📋 Synthetic Biological Data Generation

Goal: Generate realistic synthetic datasets for augmentation and benchmarking

Why Generative AI:

- Data augmentation for rare conditions
- Privacy-preserving data sharing
- Benchmarking computational methods

Proposed Architecture: Conditional diffusion with metadata conditioning
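
As a reference point, and assuming the standard DDPM noise-prediction parameterization (this guide does not fix one), metadata conditioning amounts to passing the metadata \(c\) to the denoising network \(\epsilon_\theta\) alongside the noisy sample and the timestep \(t\):

\[
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t,\; c\right) \right\rVert^2 \right]
\]

Here \(x_0\) is a clean expression profile and \(\bar\alpha_t\) is the cumulative product of the noise schedule; all symbols are notation introduced for this sketch.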

Target Dataset: CellxGene census or scPerturb

Status: After at least one prediction-focused application is complete

Application Selection Criteria

We prioritize applications based on:

Scientific impact: Does it address a central problem in computational biology?
Clear benchmarks: Can we compare against published methods?
Leverages strengths: Does it use our existing infrastructure (VAE, diffusion, JEPA)?
Demonstrates value: Does generative AI add something discriminative models cannot?

Relationship to Methodology Documentation

Directory	Focus	Style
docs/VAE/	VAE theory and derivations	Methodology-first
docs/DDPM/	Diffusion model foundations	Methodology-first
docs/JEPA/	Self-supervised prediction	Methodology-first
docs/applications/	End-to-end biological applications	Application-first

When to use which:

- Learning VAE theory? → docs/VAE/
- Learning JEPA architecture? → docs/JEPA/
- Building a perturbation prediction system? → docs/applications/perturbation_prediction.md

Contributing New Applications

When adding a new application guide:

Start with a clear problem: What biological question?
Justify generative AI: Why not just discriminative/deterministic models?
Choose architectures deliberately: From our validated toolbox
Include implementation details: Code examples, not just ideas
Define success metrics: How do we know it works?
Validate biologically: Not just computational metrics

Template structure:

# Application Name

## Executive Summary
## Background: Why Generative AI for [Problem]?
## Architecture Overview
## Implementation Roadmap
  ### Phase 1: ...
  ### Phase 2: ...
  ### Phase 3: ...
## Expected Outcomes
## Beyond the Flagship: Extensions
## References
## Implementation Checklist
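
A new guide can be checked against this template mechanically. The sketch below is a hypothetical helper (not an existing project script) that lists which top-level template sections are missing from a Markdown file's text. The Background heading is matched by prefix because its wording depends on the problem.

```python
import re

# Top-level sections taken from the template; "Background" is a prefix match
# because the full heading names the specific problem.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Background",
    "Architecture Overview",
    "Implementation Roadmap",
    "Expected Outcomes",
    "Beyond the Flagship: Extensions",
    "References",
    "Implementation Checklist",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required template sections that have no matching '## ' heading."""
    # Only level-2 headings count; '#' titles and '###' phases are ignored.
    headings = re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]
```

Called on the text of a draft guide, an empty list means every template section is present; the check says nothing about whether the sections are filled in.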

Status Dashboard

Application	Stage	Dataset	Next Milestone
Perturbation Prediction	🎯 Active	Norman et al. 2019	VAE baseline + metrics
Gene Expression Prediction	📋 Planned	GTEx	(after Perturb-seq)
Synthetic Data Generation	📋 Planned	CellxGene	(after Perturb-seq)

Last Updated: 2026-01-31