genai-lab: Project Overview
A research codebase exploring generative AI for computational biology — translating state-of-the-art generative architectures into practical applications for drug discovery, treatment-response simulation, and in-silico biological experimentation.
This document expands on the project README with a fuller view of mission, scope, organization, and current state. It's meant for anyone discovering the project — collaborators, reviewers, or readers following the work.
Mission
The project investigates how modern generative methods — the VAE family, diffusion and flow models, Joint Embedding Predictive Architectures (JEPA), and transformers — can address concrete questions in computational biology that classical methods struggle with:
- Predicting cellular responses to genetic and chemical perturbations
- Generating biologically realistic expression states under specified conditions
- Quantifying uncertainty in predictions, not just point estimates
- Reasoning counterfactually about cell state changes
These are problems where the data is high-dimensional, sparse, count-valued, and small relative to the questions being asked — exactly the regime where generative modeling with strong inductive biases (count-aware likelihoods, self-supervised pretraining, latent diffusion) can outperform discriminative baselines.
Scientific Scope
The flagship application area is perturbation response prediction on single-cell Perturb-seq data. The benchmark is the Norman et al. 2019 dataset (K562 cells, CRISPR activation, combinatorial perturbations), a widely used reference benchmark for this task.
The technical approach combines three complementary ideas:
- Count-aware variational autoencoders with negative-binomial decoders for the deterministic baseline — fast point estimates of perturbed cell states, suitable for sanity checks and as a comparison floor.
- Joint Embedding Predictive Architectures (JEPA) with perturbation conditioning for self-supervised prediction in latent space — learning what a cell would look like under a new perturbation without reconstructing the full gene-count profile. (The phrase "predict embeddings, not pixels" comes from JEPA's vision lineage, where the observation is an image; here the observation is a raw gene-count vector of ~20k mostly-zero entries, and the same logic applies — predicting in embedding space avoids spending model capacity fitting dropout noise.)
- Latent diffusion wrapped around the predictive latent for uncertainty quantification — sampling diverse plausible perturbed states rather than committing to a single prediction.
Each component addresses a different epistemic need: the baseline gives interpretable means, JEPA captures the structure of how perturbations deform the latent manifold, and the diffusion layer gives sample-based uncertainty intervals on counterfactual queries.
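The count-aware likelihood behind the first component can be illustrated with a minimal sketch of the negative-binomial log-probability in its mean / inverse-dispersion form. This is not the project's decoder: the function name `nb_log_pmf` and the parameter names `mu` and `theta` are illustrative assumptions, and a real decoder would evaluate this per gene on tensors rather than on scalars.

```python
import math


def nb_log_pmf(k: int, mu: float, theta: float) -> float:
    """Log-probability of count k under an NB with mean mu and inverse dispersion theta.

    The variance is mu + mu**2 / theta, so a small theta means heavy
    overdispersion and a very large theta approaches the Poisson.
    """
    if k < 0 or mu <= 0 or theta <= 0:
        raise ValueError("require k >= 0, mu > 0, theta > 0")
    # Log of the generalized binomial coefficient Gamma(k + theta) / (Gamma(theta) * k!)
    log_norm = math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
    return (
        log_norm
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


# Probability of observing a zero for a lowly expressed gene (hypothetical values).
p_zero = math.exp(nb_log_pmf(0, mu=0.5, theta=0.5))
```

With a low mean and strong overdispersion, most of the probability mass sits at zero, which is why a count-aware likelihood can account for sparse profiles without a separate normalization step.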
A Note on JEPA
JEPA (Joint Embedding Predictive Architecture, from the LeCun-school work including I-JEPA and V-JEPA) is technically a predictive / representation-learning architecture rather than a generative one — it predicts the embedding of a target from the embedding of a context, without reconstructing data or defining an explicit likelihood.
For genai-lab, JEPA serves as the latent-space predictor, and the project's contribution is wrapping its output with a generative head (latent diffusion) so the combined system both predicts and quantifies uncertainty. This pairing — strong predictive latent + generative sampling on top — is the core architectural bet of the project.
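The "predict, then sample" pairing can be sketched in miniature. In the example below, the hypothetical `sample_latents` function stands in for the generative head: it only adds Gaussian noise around a predicted latent, which is not latent diffusion, just a placeholder that makes the downstream step runnable. What the sketch actually illustrates is how a set of samples turns into a per-dimension empirical interval instead of a single point estimate. All names and values are assumptions.

```python
import random
import statistics


def sample_latents(predicted, n_samples, noise_scale=0.1, seed=0):
    """Placeholder sampler (assumption): Gaussian noise around the predicted latent."""
    rng = random.Random(seed)
    return [
        [z + rng.gauss(0.0, noise_scale) for z in predicted]
        for _ in range(n_samples)
    ]


def empirical_interval(samples, lower=0.05, upper=0.95):
    """Per-dimension (lower, median, upper) order statistics across samples."""
    summary = []
    for dim_values in zip(*samples):
        ordered = sorted(dim_values)
        lo = ordered[int(lower * (len(ordered) - 1))]
        hi = ordered[int(upper * (len(ordered) - 1))]
        summary.append((lo, statistics.median(ordered), hi))
    return summary


predicted_latent = [0.2, -1.0, 0.7]  # hypothetical output of the latent predictor
samples = sample_latents(predicted_latent, n_samples=500)
intervals = empirical_interval(samples)
```

Swapping the placeholder for a trained sampler would leave `empirical_interval` unchanged, since it only depends on having multiple samples per query.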
Current Stage
The project is transitioning from broad methodology exploration to focused application consolidation. A previous phase produced documentation, theory derivations, and partial implementations across many architectures; the current phase consolidates one complete vertical (perturbation prediction, benchmarked end-to-end) before expanding.
A five-tier status system separates what is documented, what is implemented, and what is validated throughout the codebase:
| Tier | Symbol | Meaning |
|---|---|---|
| Mature | ✅ | Theory + implementation + validated on a realistic dataset |
| Validated | 🔬 | Theory + implementation, validated on toy/benchmark data |
| Active | 🎯 | Current development focus |
| Prototype | 📝 | Theory complete, implementation pending or partial |
| Planned | 🔮 | Designed, not yet started |
Tier assignments live alongside the components they describe; the project README has the current snapshot.
How the Project Is Organized
A maturity ladder
Work flows through three deliberately gated stages:
Stage 1: Use case under development
    examples/<topic>/ + notebooks/<topic>/
    Scripts run; results may be preliminary
        ↓ promote when documented end-to-end with ≥1 concrete result
Stage 2: Application
    docs/applications/<topic>.md
    Methodology write-up; reproducible by a reader
        ↓ promote when API stabilizes, tests cover inference, and a published baseline is matched
Stage 3: Product
    docs/products/<name>/ + src/genailab/applications/<name>/
    Deployable: stable API, versioned checkpoints, documented limitations
A product has a user contract; an application has only a methodology claim. Promotion is gated by the criteria above and is never automatic; demotion is allowed and expected when the underlying assumptions change. The full promotion criteria live in docs/products/README.md.
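The promotion rules in the ladder can be written down as a small check. This is a sketch only: the authoritative criteria live in docs/products/README.md, and the `TopicStatus` fields and the `next_stage` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class TopicStatus:
    """Hypothetical record of the evidence a topic has accumulated."""
    stage: int  # 1 = use case, 2 = application, 3 = product
    documented_end_to_end: bool = False
    concrete_results: int = 0
    api_stable: bool = False
    tests_cover_inference: bool = False
    published_baseline_matched: bool = False


def next_stage(status: TopicStatus) -> int:
    """Return the stage a topic qualifies for, moving at most one step up."""
    if status.stage == 1:
        ready = status.documented_end_to_end and status.concrete_results >= 1
        return 2 if ready else 1
    if status.stage == 2:
        ready = (
            status.api_stable
            and status.tests_cover_inference
            and status.published_baseline_matched
        )
        return 3 if ready else 2
    return status.stage


# A use case with an end-to-end write-up and one concrete result qualifies for Stage 2.
topic = TopicStatus(stage=1, documented_end_to_end=True, concrete_results=1)
promoted_to = next_stage(topic)
```

The function deliberately never skips a stage, and it does not model demotion, which depends on judgment about changed assumptions rather than on a checklist.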
Repository structure
src/genailab/           Library package (importable)
    foundation/         Foundation-model adaptation (LoRA, configs, hardware detection)
    data/               AnnData loaders, preprocessing
    model/              Encoders, decoders, VAE variants
    diffusion/          VP-SDE, VE-SDE, score networks, training/sampling
    flow_matching/      Rectified flow library
    objectives/         Loss functions (ELBO, NB, ZINB, CFM)
    eval/               Metrics, diagnostics
    applications/       Flagship application code (stable APIs)
    utils/              Config, reproducibility helpers
examples/<topic>/       Production-style scripts per topic
notebooks/<topic>/      Tutorials and exploration, parallel to examples/
ops/                    GPU cluster provisioning (SkyPilot + RunPod)
docs/                   Public documentation (MkDocs)
    applications/       Methodology write-ups
    products/           Mature, deployable applications
    VAE/ DDPM/ JEPA/ …  Methodology-first theory docs by topic
data/                   Datasets (not tracked in git)
runs/                   Per-training-run artifacts (not tracked)
output/                 Cross-run analyses and figures (not tracked)
Each topic in examples/ typically has a parallel directory in notebooks/. A well-developed topic has both: a notebook walking through intuition and a script running the benchmark at scale.
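The examples/notebooks parallel convention is easy to check mechanically. The helper below is a hypothetical sketch, not a script from the repository; it assumes topics are the immediate subdirectories of examples/ and that non-topic directories such as examples/docs/ should be skipped via the `ignore` parameter.

```python
from pathlib import Path


def topics_missing_notebooks(
    repo_root: Path, ignore: tuple[str, ...] = ("docs",)
) -> list[str]:
    """Names of examples/<topic>/ directories with no notebooks/<topic>/ counterpart."""
    examples_dir = repo_root / "examples"
    notebooks_dir = repo_root / "notebooks"
    if not examples_dir.is_dir():
        return []
    return sorted(
        topic.name
        for topic in examples_dir.iterdir()
        if topic.is_dir()
        and topic.name not in ignore
        and not (notebooks_dir / topic.name).is_dir()
    )
```

Calling `topics_missing_notebooks(Path("."))` from the repository root would list topics that have a script but no walkthrough notebook yet.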
Tech stack
- Python 3.10–3.12, PyTorch only (no TensorFlow)
- Single-cell: scanpy, anndata, GEOparse
- Reference methods (optional, per task): scvi-tools, scgen, cpa-tools, diffusers
- Experiment tracking: Weights & Biases
- GPU provisioning: SkyPilot + RunPod for real training; local CPU for development and small tutorials
- Environment: mamba / conda; environment name genailab
Three domain conventions are worth calling out:
- Raw counts are preserved for negative-binomial and zero-inflated negative-binomial decoders; normalization is only applied to copies for descriptive analyses
- Library size is treated as a covariate, not a preprocessing step — computed on the full filtered gene set and passed explicitly to NB-family decoders
- CPU is the local default for correctness; CUDA is reserved for real training on provisioned cloud GPUs
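The first two conventions can be sketched for a single cell in plain Python. The parameterization shown, where the decoder emits per-gene proportions through a softmax and library size scales them into negative-binomial means, is one common choice and an assumption here; the source only states that library size is passed explicitly. The function names, the example counts, and `target_sum=1e4` are likewise illustrative.

```python
import math


def library_size(raw_counts):
    """Total counts for one cell over the full filtered gene set."""
    return sum(raw_counts)


def nb_mean(decoder_logits, lib_size):
    """Scale softmax proportions by library size to get per-gene NB means."""
    shift = max(decoder_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in decoder_logits]
    total = sum(exps)
    return [lib_size * e / total for e in exps]


def log_normalized_copy(raw_counts, target_sum=1e4):
    """Return a normalized copy for descriptive analysis; raw_counts is not modified.

    Assumes empty cells (library size 0) were removed during filtering.
    """
    lib = library_size(raw_counts)
    return [math.log1p(c / lib * target_sum) for c in raw_counts]


raw = [0, 3, 0, 12, 1]  # hypothetical raw counts for one cell
lib = library_size(raw)
mu = nb_mean([0.1, 1.2, -0.5, 2.0, 0.0], lib)  # hypothetical decoder logits
view = log_normalized_copy(raw)
```

Because the means sum to the library size by construction, sequencing depth is handled by the model's covariate rather than by rescaling the counts that the likelihood sees.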
Sibling Projects
genai-lab cross-pollinates with related research codebases in the same broader bio-AI portfolio:
- agentic-spliceai — splice-site prediction with agentic validation workflows. Several engineering patterns (GPU provisioning via SkyPilot, milestone-gated example scripts, session-summary discipline) originated there and were ported here.
- combio-lab — biomolecular design, including protein structure prediction and embeddings. Shares the examples/notebooks parallel convention.
- causal-bio-lab — causal inference for biological systems. Planned integration point: once the flagship application lands, causal methods can be used to validate counterfactual predictions against intervention-derived ground truth.
Where to Read Next
| Topic | File |
|---|---|
| Project entry point | README.md |
| Flagship application (perturbation prediction) | docs/applications/perturbation_prediction.md |
| Industry landscape | docs/INDUSTRY_LANDSCAPE.md |
| Maturity ladder + product criteria | docs/products/README.md |
| GPU workflow | ops/README.md |
| Experiment running + tracking | examples/docs/running_experiments.md, examples/docs/experiment_tracking.md |
| Theory by topic | docs/VAE/, docs/DDPM/, docs/JEPA/, docs/flow_matching/ |
| Dataset background (flagship) | notebooks/perturbation/docs/norman_2019_dataset_tutorial.md |