genai-lab¶
Generative AI for Computational Biology: Research into foundation models and generative methods for accelerating drug discovery, understanding treatment responses, and enabling in silico biological experimentation.
Overview¶
This project investigates generative modeling approaches across computational biology, inspired by emerging platforms such as:
- Gene Expression: Synthesize Bio (GEM-1), Deep Genomics (BigRNA)
- DNA Sequence: Arc Institute (Evo 2), InstaDeep (Nucleotide Transformer)
- Single-Cell: Geneformer, scGPT
- Gene Editing: Profluent (OpenCRISPR)
Research Goals:
- Investigate state-of-the-art generative architectures (VAE, flows, diffusion, transformers) for biological sequences and multi-omics data
- Develop reusable, modular components for conditional generation and counterfactual simulation
- Explore causal inference methods for predicting treatment responses and perturbation effects
- Contribute to the growing field of generative biology with reproducible implementations and benchmarks
See docs/INDUSTRY_LANDSCAPE.md for a comprehensive survey of companies and technologies in this space.
Project Structure¶
genai-lab/
├── src/genailab/
│ ├── foundation/ # 🆕 Foundation model adaptation framework
│ │ ├── configs/ # Resource-aware model configs (small/medium/large)
│ │ ├── tuning/ # LoRA, adapters, freezing strategies
│ │ ├── conditioning/ # FiLM, cross-attention, CFG (planned)
│ │ └── recipes/ # End-to-end pipelines (planned)
│ ├── data/ # Data loading, transforms, preprocessing
│ │ ├── paths.py # Standardized data path management
│ │ ├── sc_preprocess.py # scRNA-seq preprocessing (Scanpy)
│ │ └── bulk_preprocess.py # Bulk RNA-seq preprocessing
│ ├── model/ # Encoders, decoders, VAE, diffusion architectures
│ │ ├── vae.py # CVAE, CVAE_NB, CVAE_ZINB
│ │ ├── encoders.py # ConditionEncoder, etc.
│ │ ├── decoders.py # Gaussian, NB, ZINB decoders
│ │ └── diffusion/ # Diffusion models (DDPM, score networks)
│ ├── objectives/ # Loss functions, regularizers
│ │ └── losses.py # ELBO, NB, ZINB losses
│ ├── eval/ # Metrics, diagnostics, plotting
│ ├── workflows/ # Training, simulation, benchmarking
│ └── utils/ # Config, reproducibility
├── docs/ # Theory documents and derivations
│ ├── foundation_models/ # 🆕 Foundation model adaptation
│ ├── DiT/ # 🆕 Diffusion Transformers
│ ├── JEPA/ # 🆕 Joint Embedding Predictive Architecture
│ ├── latent_diffusion/ # 🆕 Latent diffusion for biology
│ ├── DDPM/ # Denoising Diffusion Probabilistic Models
│ ├── VAE/ # VAE theory and derivations
│ ├── EBM/ # Energy-based models
│ ├── score_matching/ # Score matching and energy functions
│ ├── flow_matching/ # Flow matching & rectified flow
│ └── datasets/ # Data preparation guides
├── notebooks/ # Educational tutorials (interactive learning)
│ ├── foundation_models/ # 🆕 Foundation adaptation tutorials
│ ├── diffusion/ # Diffusion models tutorials
│ ├── vae/ # VAE tutorials
│ └── foundations/ # Mathematical foundations
├── examples/ # Production scripts (real-world applications)
│ ├── perturbation/ # Drug response, perturbation prediction
│ └── utils/ # Helper modules for examples
├── scripts/ # Training scripts with CLI
│ └── diffusion/ # Diffusion model training scripts
├── data/ # Local data storage (gitignored)
├── tests/
└── environment.yml # Conda environment specification
Documentation & Learning Resources¶
Theory Documents (docs/)¶
Detailed theory, derivations, and mathematical foundations:
| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, freezing) | leveraging_foundation_models_v2.md |
| 🆕 DiT | Diffusion Transformers (architecture, training, sampling) | README.md |
| 🆕 JEPA | Joint Embedding Predictive Architecture | README.md |
| 🆕 latent_diffusion | Latent diffusion with NB/ZINB decoders | README.md |
| DDPM | Denoising Diffusion Probabilistic Models | README.md |
| VAE | Variational Autoencoders (ELBO, inference, training) | VAE-01-overview.md |
| beta-VAE | VAE with disentanglement (β parameter) | beta_vae.md |
| EBM | Energy-Based Models (Boltzmann, partition functions) | README.md |
| score_matching | Score functions, Fisher vs Stein scores | README.md |
| flow_matching | Flow matching & rectified flow | README.md |
| datasets | Datasets & preprocessing pipelines | README.md |
| incubation | Ideas under development | README.md |
Ideas Under Incubation (docs/incubation/)¶
Exploratory architectural proposals and application ideas not yet implemented:
| Document | Focus |
|---|---|
| joint_latent_space_and_JEPA.md | Joint latent spaces for static/dynamic data, JEPA for Perturb-seq |
| generative-ai-for-gene-expression-prediction.md | Diffusion/VAE/Flow for gene expression with uncertainty |
| generative-ai-for-perturbation-modeling.md | Generative approaches for scPerturb, beyond GEM-1 |
Key insights from incubation:
- Joint latent spaces: Static (bulk RNA-seq) and dynamic (Perturb-seq) data can share the same manifold
- JEPA over reconstruction: Predicting target embeddings rather than raw counts is more robust to the noise in biological measurements
- Hybrid predictive-generative: Combine GEM-1-style predictors with generative wrappers for uncertainty
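The "JEPA over reconstruction" idea above can be sketched in a few lines. This is a toy NumPy illustration, not the project's implementation: the encoders and predictor are fixed random linear maps standing in for neural networks, and the data are synthetic control/perturbed profiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a context encoder, target encoder, and predictor.
# In a real JEPA these are trained networks (the target encoder is often
# an EMA copy of the context encoder); here they are fixed linear maps.
W_ctx = rng.normal(size=(16, 8))   # context encoder: 16 genes -> 8-dim embedding
W_tgt = rng.normal(size=(16, 8))   # target encoder
W_pred = rng.normal(size=(8, 8))   # predictor (would be conditioned on the perturbation)

x_control = rng.normal(size=(32, 16))                       # control profiles
x_perturbed = x_control + 0.1 * rng.normal(size=(32, 16))   # perturbed profiles

# JEPA objective: predict the *embedding* of the perturbed state from the
# control embedding, instead of reconstructing all 16 noisy gene values.
z_ctx = x_control @ W_ctx
z_tgt = x_perturbed @ W_tgt
z_pred = z_ctx @ W_pred

jepa_loss = np.mean((z_pred - z_tgt) ** 2)
```

The loss lives in embedding space, so pixel/count-level noise that the encoders discard never enters the objective.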
Interactive Tutorials (notebooks/)¶
Educational Jupyter notebooks for hands-on learning:
| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, resource management) | README.md |
| diffusion | Diffusion models (DDPM, score-based, flow matching) | 01_ddpm_basics.ipynb |
| vae | VAE tutorials (coming soon) | - |
| foundations | Mathematical foundations (coming soon) | - |
See notebooks/README.md for learning paths and progression.
Production Examples (examples/)¶
Ready-to-use notebooks and scripts for real-world applications:
- 01_bulk_cvae.ipynb — Train CVAE on bulk RNA-seq
- 02_pbmc3k_cvae_nb.ipynb — Train CVAE with NB decoder on scRNA-seq
- perturbation/ — Drug response and perturbation prediction (coming soon)
How to use:
- Learning: Start with notebooks/ for interactive tutorials
- Theory: Reference docs/ for detailed derivations
- Application: Use examples/ for production workflows
- Follow the ROADMAP for structured progression
Installation¶
Using mamba + poetry (recommended)¶
# Create conda environment
mamba create -n genailab python=3.11 -y
mamba activate genailab
# Install poetry if not available
pip install poetry
# Install package in editable mode
poetry install
# Optional: install bio dependencies (scanpy, anndata)
poetry install --with bio
# Optional: install dev dependencies
poetry install --with dev
Quick start¶
# Verify installation
python -c "import genailab; print(genailab.__version__)"
# Run toy training (once implemented)
genailab-train --config configs/cvae_toy.yaml
Milestones¶
Stage 1: Variational Autoencoders ✅¶
- Core CVAE implementation with condition encoding
- Gaussian decoder (MSE reconstruction)
- Negative Binomial decoder for count data (CVAE_NB)
- Zero-Inflated Negative Binomial decoder (CVAE_ZINB)
- ELBO loss with KL annealing support
- Comprehensive documentation (VAE-01 through VAE-09)
- Unit tests for all model variants
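The NB decoder and KL-annealed ELBO from Stage 1 can be illustrated with a minimal NumPy/SciPy sketch. The actual implementations live in src/genailab/objectives/losses.py and may differ in parameterization; this just shows the standard mean/dispersion form used for RNA-seq counts.

```python
import numpy as np
from scipy.special import gammaln

def nb_log_likelihood(x, mu, theta):
    """Negative-binomial log-likelihood with mean mu and dispersion theta,
    the parameterization commonly used for modeling RNA-seq counts."""
    return (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def annealed_elbo(recon_ll, kl, beta):
    """ELBO with a KL-annealing weight beta in [0, 1]: beta is ramped
    from 0 to 1 during training to avoid posterior collapse."""
    return recon_ll - beta * kl

x = np.array([0.0, 3.0, 10.0])            # toy gene counts
ll = nb_log_likelihood(x, mu=4.0, theta=2.0)
elbo = annealed_elbo(recon_ll=ll.sum(), kl=1.5, beta=0.5)
```

Note that at x = 0 the likelihood reduces to theta * log(theta / (theta + mu)), which is why the NB already assigns substantial mass to zeros before any zero-inflation term is added.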
Stage 2: Data Pipeline ✅¶
- Standardized data path management (genailab.data.paths)
- scRNA-seq preprocessing with Scanpy
- Bulk RNA-seq preprocessing (Python + R/recount3)
- Environment setup (conda/mamba + Poetry)
- Data preparation documentation
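The core normalization in the scRNA-seq pipeline can be sketched without Scanpy. This NumPy version mirrors the sc.pp.normalize_total + sc.pp.log1p steps the real sc_preprocess.py uses (which also handles filtering, HVG selection, and AnnData bookkeeping).

```python
import numpy as np

def normalize_and_log(counts, target_sum=1e4):
    """Library-size normalization followed by log1p: scale each cell's
    counts so they sum to target_sum, then log-transform. Mirrors
    scanpy's sc.pp.normalize_total + sc.pp.log1p."""
    lib = counts.sum(axis=1, keepdims=True)   # per-cell library size
    scaled = counts / lib * target_sum
    return np.log1p(scaled)

# Two toy cells with very different sequencing depths:
counts = np.array([[10.0, 0.0, 90.0],
                   [5.0, 5.0, 0.0]])
x = normalize_and_log(counts)
```

After this step every cell contributes the same total signal, so downstream models see expression composition rather than sequencing depth.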
Stage 3: Score Matching & Energy Functions ✅¶
- Score matching overview documentation
- Energy functions deep dive (Boltzmann, partition function)
- VP-SDE and VE-SDE formulations
- Denoising score matching loss
Stage 4: Diffusion Models ✅¶
- Forward/reverse diffusion process (VP-SDE, VE-SDE)
- Score networks (MLP, TabularScoreNetwork, UNet2D, UNet3D)
- Medical imaging diffusion (synthetic X-rays)
- Training scripts with configurable model sizes
- RunPod setup documentation for GPU training
- Comprehensive DDPM documentation series
- Gene expression architectures (latent tokens, pathway tokens)
- Conditional generation with classifier-free guidance
- Flow matching implementation
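The classifier-free guidance used for conditional generation in Stage 4 boils down to one extrapolation step at sampling time. A minimal sketch (the real sampler applies this inside each denoising step, to score or noise predictions):

```python
import numpy as np

def cfg_score(score_cond, score_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional score
    toward the conditional one. guidance_scale = 0 recovers the
    unconditional model, 1 the conditional model, and values > 1
    sharpen adherence to the condition."""
    return score_uncond + guidance_scale * (score_cond - score_uncond)

# Toy conditional and unconditional score estimates for one sample:
s_cond = np.array([1.0, 2.0])
s_uncond = np.array([0.0, 1.0])
s_guided = cfg_score(s_cond, s_uncond, guidance_scale=2.0)
```

Training-wise, CFG only requires randomly dropping the condition (e.g. replacing it with a null token) for a fraction of batches so one network learns both estimates.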
Stage 5: Foundation Model Adaptation ✅¶
- Resource-aware model configurations (small/medium/large)
- Auto-detection of hardware (M1 Mac, RunPod, Cloud)
- LoRA (Low-Rank Adaptation) implementation
- Comprehensive documentation (DiT, JEPA, Latent Diffusion)
- Adapters and freezing strategies
- Conditioning modules (FiLM, cross-attention, CFG)
- Tutorial notebooks for each adaptation pattern
- End-to-end recipes for gene expression tasks
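The LoRA pattern from Stage 5 fits in a few lines. This is a NumPy sketch of a single adapted linear layer, not the framework's API: the pretrained weight stays frozen and only a low-rank update B @ A (scaled by alpha / rank) is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight

# LoRA: learn a low-rank update B @ A instead of touching W.
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection (zero init)
alpha = 8.0                                # scaling hyperparameter

def lora_forward(x):
    """Forward pass: frozen weight plus scaled low-rank update."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
y = lora_forward(x)
frozen = x @ W.T   # with B = 0, adaptation starts as an exact no-op
```

Zero-initializing B guarantees the adapted model starts identical to the pretrained one, and the trainable parameter count is rank * (d_in + d_out) versus d_in * d_out for full fine-tuning.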
Stage 6: Advanced Architectures 📝¶
- DiT (Diffusion Transformers) documentation
- JEPA (Joint Embedding Predictive Architecture) documentation
- Latent Diffusion documentation
- DiT implementation for gene expression
- JEPA implementation for Perturb-seq
- Flow matching implementation
Stage 7: Counterfactual & Causal (Planned)¶
- Counterfactual generation pipeline
- Deconfounding / SCM-flavored latent model
- Causal regularization via invariance
Industry Landscape¶
Companies and platforms pioneering generative AI for drug discovery and biological research:
Gene Expression & Multi-Omics Foundation Models¶
| Company | Focus | Key Technology |
|---|---|---|
| Synthesize Bio | Gene expression generation | GEM-1 foundation model |
| Ochre Bio | Liver disease, RNA therapeutics | Functional genomics + AI |
| Deep Genomics | RNA biology & therapeutics | BigRNA (~2B params) |
| Helical | DNA/RNA foundation models | Helix-mRNA, open-source platform |
| Noetik | Cancer biology | OCTO model for treatment prediction |
Protein & Structure-Based Discovery¶
| Company | Focus | Key Technology |
|---|---|---|
| Isomorphic Labs | Drug discovery (DeepMind spin-off) | AlphaFold 3 |
| EvolutionaryScale | Protein design | ESM3 generative model |
| Generate:Biomedicines | Protein therapeutics | Generative Biology™ platform |
| Chai Discovery | Molecular structure | Chai-1 (antibody design) |
| Recursion | Phenomics + drug discovery | Phenom-Beta, BioHive-2 |
Clinical & Treatment Response¶
| Company | Focus | Key Technology |
|---|---|---|
| Insilico Medicine | End-to-end drug discovery | Pharma.AI, Precious3GPT |
| Tempus | Precision medicine | AI-driven clinical insights |
| Owkin | Clinical trials, pathology | Federated learning |
| Retro Biosciences | Cellular reprogramming | GPT-4b micro (with OpenAI) |
Other Notable Players¶
- BioMap — xTrimo (210B params, multi-modal)
- Ginkgo Bioworks — Synthetic biology + Google Cloud partnership
- Bioptimus — H-Optimus-0 pathology foundation model
- Atomic AI — RNA structure (ATOM-1, PARSE platform)
- Enveda Biosciences — PRISM for small molecule discovery
References¶
Academic¶
- Geneformer — Transfer learning for single-cell biology
- scVI — Probabilistic modeling of scRNA-seq
- CPA — Compositional Perturbation Autoencoder
Related Projects¶
causal-bio-lab — Causal AI/ML for Computational Biology¶
Complementary Focus: While genai-lab focuses on modeling data-generating processes through generative models, causal-bio-lab focuses on uncovering causal structures and estimating causal effects from observational and interventional data.
Synergy:
- Generative models (VAE, diffusion) can learn rich representations but may capture spurious correlations
- Causal methods (probabilistic graphical models, causal discovery, structural equations) ensure models capture true mechanisms, not just statistical patterns
- Together: Causal generative models combine the best of both worlds—realistic simulation with causal guarantees
Key Integration Points:
- Causal representation learning: Learn disentangled latent spaces that respect causal structure (causal VAEs, identifiable VAEs)
- Causal discovery for architecture: Use learned causal graphs to constrain generative model structure
- Counterfactual validation: Use causal inference methods (do-calculus, structural equations) to validate generated predictions
- Causal regularization: Apply invariance principles and interventional consistency losses for better generalization
Example Workflow:
1. Train a CVAE on gene expression data (genai-lab)
2. Discover causal gene regulatory network (causal-bio-lab)
3. Constrain VAE latent space to respect causal structure
4. Generate counterfactual perturbation responses with causal guarantees
5. Estimate treatment effects using both generative and causal methods
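One way to realize the "causal regularization" step in this workflow is an invariance penalty across perturbation conditions. The sketch below uses a V-REx-style variance-of-risks objective as a stand-in; the actual integration with causal-bio-lab may use different losses, and the model risks here are made-up numbers.

```python
import numpy as np

def vrex_objective(env_risks, beta=10.0):
    """V-REx-style objective: mean risk plus a penalty on the variance of
    risks across environments (e.g. perturbation conditions). Penalizing
    risk variance pushes the model toward features whose predictive power
    is invariant across interventions, i.e. more likely causal."""
    risks = np.asarray(env_risks, dtype=float)
    return risks.mean() + beta * risks.var()

# Two hypothetical models evaluated on three perturbation "environments":
invariant_model = vrex_objective([0.50, 0.52, 0.51])  # similar risk everywhere
spurious_model = vrex_objective([0.10, 0.20, 1.10])   # exploits env-specific cues
```

Even though the spurious model has lower average risk, the variance penalty prefers the invariant one, which is the behavior step 3 of the workflow wants from a causally constrained latent space.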
Why This Matters for Computational Biology:
- Drug discovery: Generate realistic molecular perturbations while ensuring causal mechanisms are preserved
- Treatment response: Predict individual-level effects (counterfactuals) with uncertainty quantification
- Target identification: Discover causal drivers, not just biomarkers
- Combination therapy: Model synergistic effects through causal interaction terms
See causal-bio-lab Milestone 0.5 (SCMs) and Milestone D (Causal Representation Learning) for integration work.
License¶
MIT