# Foundation Models for Computational Biology
Adapting large-scale foundation models for gene expression and multi-omics tasks.
## Overview
Foundation models trained on massive biological datasets (DNA, RNA, protein) are emerging as powerful tools for computational biology. This section covers practical strategies for adapting these models to specific tasks without training from scratch.
Key Topics:
- 🎯 **Model Selection**: Choosing the right foundation model for your task
- 🔧 **Adaptation Strategies**: LoRA, adapters, fine-tuning, freezing
- 📊 **Data Preparation**: Handling gene expression, sequences, and multi-omics
- 💻 **Implementation**: Resource-aware configs, hardware optimization
- 🚀 **Deployment**: Inference, serving, and production pipelines
## Key Documents
### Leveraging Foundation Models
Comprehensive guide to foundation model adaptation:
- Overview of available models (Geneformer, scGPT, BigRNA, ESM, etc.)
- When to use foundation models vs. train from scratch
- Adaptation strategies (LoRA, adapters, full fine-tuning)
- Conditioning and control (FiLM, cross-attention, CFG; see the FiLM sketch below)
- Resource management (small/medium/large configs)
Best for: Understanding the landscape and choosing an adaptation strategy
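As a taste of the conditioning strategies covered there, here is a minimal FiLM (feature-wise linear modulation) sketch. The class and all dimensions are illustrative, not from any specific library: a condition embedding (e.g., a drug or cell-type vector) predicts a per-feature scale and shift applied to the backbone's hidden states.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift hidden states
    with parameters predicted from a condition embedding."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # Predict per-feature scale (gamma) and shift (beta) from the condition.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)

# Example: modulate a (2, 128, 256) hidden state with a hypothetical drug embedding.
film = FiLM(hidden_dim=256, cond_dim=64)
h = torch.randn(2, 128, 256)
drug_emb = torch.randn(2, 64)
print(film(h, drug_emb).shape)  # torch.Size([2, 128, 256])
```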
### Data Shape & Tensors
How to prepare your data for foundation models:
- Input representations (tokens, embeddings, sequences)
- Batch shapes and padding strategies
- Attention masks and position encodings
- Cell type, drug, and perturbation conditioning
- Multi-omics integration
Best for: Implementing data loaders and preprocessing pipelines
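For example, a minimal collate function for variable-length gene-token sequences (rank-value tokenization in the style of Geneformer; `PAD_ID` and the shapes here are assumptions for illustration):

```python
import torch

PAD_ID = 0  # illustrative padding token id

def collate_gene_tokens(batch: list, max_len: int = 2048):
    """Pad variable-length gene-token sequences and build an attention mask."""
    seqs = [s[:max_len] for s in batch]
    longest = max(len(s) for s in seqs)
    input_ids = torch.full((len(seqs), longest), PAD_ID, dtype=torch.long)
    attention_mask = torch.zeros(len(seqs), longest, dtype=torch.long)
    for i, s in enumerate(seqs):
        input_ids[i, : len(s)] = torch.tensor(s)
        attention_mask[i, : len(s)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Two cells with different numbers of expressed genes.
ids, mask = collate_gene_tokens([[5, 17, 42], [8, 3, 99, 21, 7]])
print(ids.shape, mask.shape)  # torch.Size([2, 5]) torch.Size([2, 5])
```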
### Implementation Guide
Step-by-step implementation for common tasks:
- Setting up environments and dependencies
- Loading pre-trained models
- Implementing LoRA and adapters
- Training loops with mixed precision
- Evaluation and benchmarking
Best for: Hands-on implementation and code examples
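A minimal sketch of the frozen-backbone pattern, assuming a Hugging Face checkpoint. The checkpoint name, class count, and mean-pooling readout are illustrative; substitute the model and readout your task calls for.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Illustrative checkpoint; substitute the foundation model you actually use.
backbone = AutoModel.from_pretrained("ctheodoris/Geneformer")
for p in backbone.parameters():
    p.requires_grad = False  # freeze: only the head below trains

head = nn.Linear(backbone.config.hidden_size, 10)  # e.g. 10 cell-type classes

def classify(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    out = backbone(input_ids=input_ids, attention_mask=attention_mask)
    # Masked mean-pool over the gene dimension so padding does not dilute the embedding.
    mask = attention_mask.unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
    return head(pooled)
```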
## Why Foundation Models for Biology?

The traditional ML approach trains a task-specific model from scratch on your own, often small, dataset. The foundation model approach instead adapts a model pre-trained on massive biological corpora. Advantages:

- ✅ **Sample efficiency**: Learn from 100s-1000s of examples vs. millions
- ✅ **Transfer learning**: Leverage knowledge from massive pre-training datasets
- ✅ **Generalization**: Better performance on out-of-distribution data
- ✅ **Multi-task**: A single model serves multiple downstream tasks
- ✅ **Interpretability**: Pre-learned biological representations
## Available Foundation Models (2026)
### Gene Expression & Multi-Omics
| Model | Organization | Focus | Size | Open Source |
|---|---|---|---|---|
| GEM-1 | Synthesize Bio | Gene expression generation | Unknown | ❌ |
| BigRNA | Deep Genomics | RNA biology | ~2B params | ❌ |
| Geneformer | Theodoris et al. | Single-cell transfer learning | 10M-100M | ✅ |
| scGPT | Cui et al. | Single-cell foundation | 10M-100M | ✅ |
### DNA & RNA Sequences
| Model | Organization | Focus | Size | Open Source |
|---|---|---|---|---|
| Evo 2 | Arc Institute | DNA sequence (up to 1M context) | 7B-40B params | ✅ |
| Nucleotide Transformer | InstaDeep | Multi-species DNA | 500M-2.5B | ✅ |
| Helix-mRNA | Helical | mRNA sequences | Unknown | ✅ |
### Protein & Structure
| Model | Organization | Focus | Size | Open Source |
|---|---|---|---|---|
| ESM3 | EvolutionaryScale | Protein design | 1.4B-98B | Partial (1.4B open; 7B, 98B closed) |
| AlphaFold 3 | DeepMind / Isomorphic Labs | Protein structure | Unknown | ❌ |
| Chai-1 | Chai Discovery | Biomolecular structure (incl. antibodies) | Unknown | ✅ |
## Typical Workflow
### 1. Choose Your Model
Select based on:

- Input type (expression, sequence, structure)
- Task (generation, prediction, classification)
- Available compute (model size)
- Open source vs. proprietary
### 2. Prepare Your Data
- Tokenize or embed inputs
- Create attention masks
- Add condition labels (cell type, drug, etc.)
- Split train/val/test
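One subtlety worth a sketch: random row-level splits leak information when the same perturbation appears in both train and test. Below is a hedged example of holding out whole drugs instead (the label array is made-up data):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Per-sample drug labels (made-up example data).
drugs = np.array(["dmso", "trametinib", "dabrafenib", "nilotinib", "dmso", "trametinib"])

# Hold out ~20% of *drugs*, not rows, so validation measures
# generalization to unseen perturbations rather than memorization.
unique_drugs = rng.permutation(np.unique(drugs))
n_val = max(1, len(unique_drugs) // 5)
val_drugs = set(unique_drugs[:n_val])

val_idx = np.flatnonzero(np.isin(drugs, list(val_drugs)))
train_idx = np.flatnonzero(~np.isin(drugs, list(val_drugs)))
print(sorted(val_drugs), train_idx, val_idx)
```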
### 3. Select Adaptation Strategy
| Strategy | Data Needed | Compute | Best For |
|---|---|---|---|
| Frozen + Linear Probe | 100s | Low | Quick prototyping |
| LoRA | 1000s | Medium | Most tasks (recommended) |
| Adapter Layers | 1000s | Medium | Multi-task learning |
| Full Fine-Tuning | 10,000s+ | High | Maximum performance |
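To make the LoRA row concrete, here is a hand-rolled sketch of a low-rank adapter wrapped around one linear layer. In practice a library such as `peft` injects these for you; the class and parameter names below are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))

# Example: adapt the query projection of one attention layer.
q_proj = LoRALinear(nn.Linear(256, 256), r=8)
print(sum(p.numel() for p in q_proj.parameters() if p.requires_grad))  # 4096
```

Only the 4,096 low-rank parameters train, versus ~65k in the frozen base layer, which is why LoRA fits the "1000s of examples, medium compute" row.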
### 4. Train & Evaluate
- Use mixed precision (fp16/bfloat16); see the loop sketch below
- Monitor overfitting (small datasets)
- Validate on held-out cell types/drugs
- Compare to baselines
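A minimal mixed-precision training loop, with a tiny stand-in model and dataset so it runs anywhere; swap in your adapted foundation model, loaders, and hyperparameters.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in model and data; replace with your adapted backbone and real loaders.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 64), torch.randint(0, 10, (256,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# GradScaler guards against fp16 gradient underflow; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16  # CPU autocast wants bf16

for xb, yb in loader:
    xb, yb = xb.to(device), yb.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.cross_entropy(model(xb), yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```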
### 5. Deploy
- Quantize for inference (int8, int4); see the sketch below
- Batch predictions for efficiency
- Monitor uncertainty estimates
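A sketch of the low-effort end of this list: PyTorch dynamic int8 quantization of linear layers plus simple batched inference. The stand-in model is illustrative; int4 and GPU serving typically go through dedicated tooling (e.g., bitsandbytes or TensorRT).

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your trained network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Dynamic int8 quantization of Linear layers: a low-effort CPU inference speedup.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

@torch.no_grad()
def predict(inputs: torch.Tensor, batch_size: int = 64) -> torch.Tensor:
    """Stream predictions in batches so large input matrices do not OOM."""
    outs = [quantized(inputs[i : i + batch_size]) for i in range(0, len(inputs), batch_size)]
    return torch.cat(outs)

print(predict(torch.randn(200, 64)).shape)  # torch.Size([200, 10])
```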
## Example Use Cases
### 🧬 Drug Response Prediction
- **Input**: Baseline gene expression + drug ID
- **Output**: Perturbed gene expression
- **Model**: Fine-tuned Geneformer with LoRA
- **Data**: Perturb-seq, LINCS L1000
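A hedged sketch of how the conditioning might be wired for this use case: a learned drug embedding concatenated to the cell embedding produced by the (frozen or LoRA-adapted) backbone. The class name and all dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class DrugResponseHead(nn.Module):
    """Illustrative decoder: predict perturbed expression from a cell
    embedding plus a learned drug embedding."""

    def __init__(self, cell_dim: int, n_drugs: int, drug_dim: int, n_genes: int):
        super().__init__()
        self.drug_emb = nn.Embedding(n_drugs, drug_dim)
        self.decoder = nn.Sequential(
            nn.Linear(cell_dim + drug_dim, 512), nn.GELU(), nn.Linear(512, n_genes)
        )

    def forward(self, cell_embedding: torch.Tensor, drug_id: torch.Tensor) -> torch.Tensor:
        z = torch.cat([cell_embedding, self.drug_emb(drug_id)], dim=-1)
        return self.decoder(z)  # predicted post-perturbation expression

head = DrugResponseHead(cell_dim=256, n_drugs=100, drug_dim=32, n_genes=2000)
pred = head(torch.randn(4, 256), torch.randint(0, 100, (4,)))
print(pred.shape)  # torch.Size([4, 2000])
```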
### 🔬 Cell Type Annotation
- **Input**: Single-cell expression profile
- **Output**: Cell type label
- **Model**: Frozen scGPT + classifier head
- **Data**: Tabula Sapiens, CellxGene
### 💊 Combination Therapy
- **Input**: Expression + drug A + drug B
- **Output**: Synergy score
- **Model**: Multi-task LoRA adapter
- **Data**: DrugComb, O'Neil et al.
### 🧪 RNA Design
- **Input**: Target structure + constraints
- **Output**: RNA sequence
- **Model**: Fine-tuned Helix-mRNA
- **Data**: RNAcentral, Rfam
## Getting Started
Recommended learning path:

1. Start with Leveraging Foundation Models for a conceptual overview
2. Follow Data Shape & Tensors to prepare your data
3. Use the Implementation Guide for hands-on coding
4. Experiment with different adaptation strategies (LoRA → Adapters → Full fine-tuning)
5. Evaluate on held-out data and compare to baselines
Next steps:

- 📓 **Notebooks**: Coming soon; interactive tutorials for each model
- 🔧 **Code examples**: See `examples/foundation_models/` for production scripts
- 📖 **Theory**: Explore DiT, JEPA, and Latent Diffusion for advanced architectures
## References
### Key Papers
- Geneformer: Theodoris et al. (2023) - "Transfer learning enables predictions in network biology"
- scGPT: Cui et al. (2024) - "scGPT: toward building a foundation model for single-cell multi-omics using generative AI"
- ESM3: Hayes et al. (2024) - "Simulating 500 million years of evolution with a language model"
- Nucleotide Transformer: Dalla-Torre et al. (2023) - "The Nucleotide Transformer: building and evaluating robust foundation models for human genomics"
### Industry Reports
- Foundation Models for Computational Biology - Nature Methods
- 17 Companies Pioneering AI Foundation Models in Pharma
- NVIDIA BioNeMo Platform
Questions or suggestions? Open an issue on GitHub