Application Guides
This directory contains end-to-end application guides for generative AI in computational biology. Unlike the methodology-focused documentation in sibling directories such as docs/VAE/, docs/DDPM/, and docs/JEPA/, these guides are application-first and results-oriented.
Philosophy
Each application guide follows a consistent structure:
- Problem definition: What biological question are we answering?
- Why generative AI: What do generative models add over discriminative/deterministic approaches?
- Architecture choices: Which methods from our toolkit (VAE, JEPA, diffusion, flow) and why?
- Implementation roadmap: Step-by-step guide with code examples
- Evaluation strategy: Metrics, benchmarks, biological validation
- Expected outcomes: Quantitative results and scientific insights
Active Applications
🎯 Perturbation Prediction
Goal: Predict single-cell responses to genetic/chemical perturbations
Status: Active development (Weeks 1-2: data + VAE baseline)
Architecture: CVAE_NB → JEPA → Latent Diffusion
Target Dataset: Norman et al. 2019 Perturb-seq (K562 cells)
Key Innovation: Three-stage approach that combines:
- Count-aware modeling (NB decoders)
- Self-supervised prediction (JEPA)
- Uncertainty quantification (diffusion in latent space)
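To make "count-aware modeling" concrete, here is a minimal sketch of the negative binomial (NB) log-likelihood that an NB decoder would typically be trained against. It assumes the common mean / inverse-dispersion parameterization, where the variance is `mu + mu**2 / theta`; the guide itself does not state which parameterization CVAE_NB uses, and the function names `nb_log_prob` and `nb_nll` are illustrative, not part of the project code.

```python
import math


def nb_log_prob(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under NB(mean=mu, inverse dispersion=theta).

    Assumed parameterization: Var[x] = mu + mu**2 / theta.
    """
    if x < 0 or mu <= 0 or theta <= 0:
        raise ValueError("require x >= 0, mu > 0, theta > 0")
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def nb_nll(counts, means, theta: float) -> float:
    """Negative log-likelihood of observed counts given decoder means."""
    return -sum(nb_log_prob(x, m, theta) for x, m in zip(counts, means))
```

In a real decoder the means (and possibly `theta`) would be network outputs per gene and per cell; the scalar version here is only meant to show the shape of the objective.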
See: perturbation_prediction.md
Planned Applications
📋 Gene Expression Prediction
Goal: Predict gene expression from metadata with uncertainty quantification
Why Generative AI:
- GEM-1 and similar models learn the conditional mean \(\mathbb{E}[x \mid \text{metadata}]\)
- We target the full conditional distribution \(p(x \mid \text{metadata})\) for uncertainty quantification
Proposed Architecture: Hybrid predictive-generative
- Stage 1: GEM-1-style supervised predictor (learn conditional mean)
- Stage 2: Diffusion on residuals (learn distribution around mean)
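The two-stage idea can be illustrated with a deliberately tiny sketch: Stage 1 estimates a conditional mean per metadata value, and Stage 2 models what is left over. Everything here is an assumption for illustration: the metadata is a single categorical label, the "predictor" is a per-group average rather than a GEM-1-style network, and empirical residual resampling stands in for the residual diffusion model. The toy numbers and all function names are made up.

```python
import random
from collections import defaultdict
from statistics import fmean


def fit_conditional_mean(labels, values):
    """Stage 1 stand-in: estimate E[x | label] as a per-group average."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return {label: fmean(vals) for label, vals in groups.items()}


def fit_residuals(labels, values, means):
    """Collect residuals x - E[x | label]; Stage 2 would model these."""
    residuals = defaultdict(list)
    for label, value in zip(labels, values):
        residuals[label].append(value - means[label])
    return dict(residuals)


def sample_expression(label, means, residuals, n, rng):
    """Draw from p(x | label) as mean + resampled residual.

    Resampling is a placeholder for sampling from a trained diffusion model.
    """
    return [means[label] + rng.choice(residuals[label]) for _ in range(n)]


# Toy data (hypothetical values, one gene, two tissues).
labels = ["liver", "liver", "liver", "lung", "lung", "lung"]
values = [10.0, 12.0, 14.0, 3.0, 4.0, 5.0]

means = fit_conditional_mean(labels, values)
residuals = fit_residuals(labels, values, means)
samples = sample_expression("liver", means, residuals, n=5, rng=random.Random(0))
```

The point of the split is that the mean predictor alone returns one value per metadata setting, while the residual model turns that point estimate into a distribution from which spread and intervals can be read.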
Target Dataset: GTEx or harmonized bulk RNA-seq
Status: Next after Perturbation Prediction
Related: docs/incubation/generative-ai-for-gene-expression-prediction.md
📋 Synthetic Biological Data Generation
Goal: Generate realistic synthetic datasets for augmentation and benchmarking
Why Generative AI:
- Data augmentation for rare conditions
- Privacy-preserving data sharing
- Benchmarking computational methods
Proposed Architecture: Conditional diffusion with metadata conditioning
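For orientation, the sketch below shows the standard DDPM forward (noising) process that a conditional diffusion model would be trained to invert. It is a generic illustration, not this project's implementation: the linear beta schedule, its default endpoints, and all function names are assumptions. Metadata conditioning does not appear in the forward process; it would enter as an extra input to the denoising network, which is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    Returns the noised vector and the noise used, which is the usual
    regression target for the denoiser.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise
```

A training example would then be the tuple (x_t, t, metadata) with the noise as the target, so the same denoiser can later generate samples for a requested metadata setting.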
Target Dataset: CellxGene census or scPerturb
Status: After at least one prediction-focused application is complete
Application Selection Criteria
We prioritize applications based on:
- Scientific impact: Does it address a central problem in computational biology?
- Clear benchmarks: Can we compare against published methods?
- Leverages strengths: Does it use our existing infrastructure (VAE, diffusion, JEPA)?
- Demonstrates value: Does generative AI add something discriminative models cannot?
Relationship to Methodology Documentation
| Directory | Focus | Style |
|---|---|---|
| docs/VAE/ | VAE theory and derivations | Methodology-first |
| docs/DDPM/ | Diffusion model foundations | Methodology-first |
| docs/JEPA/ | Self-supervised prediction | Methodology-first |
| docs/applications/ | End-to-end biological applications | Application-first |
When to use which:
- Learning VAE theory? → docs/VAE/
- Learning JEPA architecture? → docs/JEPA/
- Building a perturbation prediction system? → docs/applications/perturbation_prediction.md
Contributing New Applications
When adding a new application guide:
- Start with a clear problem: What biological question?
- Justify generative AI: Why not just discriminative/deterministic models?
- Choose architectures deliberately: Pick from our validated toolkit and explain why
- Include implementation details: Code examples, not just ideas
- Define success metrics: How do we know it works?
- Validate biologically: Not just computational metrics
Template structure:
# Application Name
## Executive Summary
## Background: Why Generative AI for [Problem]?
## Architecture Overview
## Implementation Roadmap
### Phase 1: ...
### Phase 2: ...
### Phase 3: ...
## Expected Outcomes
## Beyond the Flagship: Extensions
## References
## Implementation Checklist
Status Dashboard
| Application | Stage | Dataset | Next Milestone |
|---|---|---|---|
| Perturbation Prediction | 🎯 Active | Norman et al. 2019 | VAE baseline + metrics |
| Gene Expression Prediction | 📋 Planned | GTEx | Starts after Perturbation Prediction |
| Synthetic Data Generation | 📋 Planned | CellxGene | Starts after one prediction-focused application is complete |
Last Updated: 2026-01-31