Datasets
This directory documents datasets used in genai-lab for training and evaluating generative models, along with their preprocessing pipelines and related code.
Overview
| Category | Datasets | Use Cases |
|---|---|---|
| Gene Expression | PBMC 3k/68k, bulk RNA-seq | VAE training, latent diffusion |
| Medical Imaging | Chest X-ray (synthetic & real) | Diffusion models, DiT |
| Perturbation | scPerturb, Replogle, Norman | scPPDM, JEPA, perturbation prediction |
Gene Expression Datasets
Single-cell and bulk RNA-seq datasets for generative modeling.
| Document | Description |
|---|---|
| PBMC.md | PBMC 3k/68k dataset guide |
| data_preparation.md | RNA-seq preprocessing workflows |
Related code:
- src/genailab/data/: Data loading utilities
- src/genailab/data/sc_dataset.py: Single-cell dataset classes
- notebooks/diffusion/04_gene_expression_diffusion/: Gene expression diffusion demo
Medical Imaging Datasets
Datasets for training diffusion models on medical images.
| Document | Description |
|---|---|
| chest_xray.md | Chest X-ray datasets (synthetic & real) |
Related code:
- src/genailab/diffusion/datasets.py: SyntheticXRayDataset, ChestXRayDataset
- notebooks/diffusion/03_medical_imaging_diffusion/: Medical imaging diffusion demo
Perturbation Datasets
Perturb-seq and CRISPR screening datasets for perturbation prediction models.
| Document | Description |
|---|---|
| scperturb.md | scPerturb harmonized collection |
| perturb_seq_guide.md | General Perturb-seq data handling |
Target applications:
- scPPDM: Single-cell Perturbation Prediction via Diffusion Models
- JEPA: Joint Embedding Predictive Architecture for perturbation response
- Counterfactual generation: "What if" perturbation scenarios
Data Pipeline Pattern
Each dataset follows a consistent pipeline:
Raw Data → Preprocessing → PyTorch Dataset → DataLoader → Model
                 ↓
           Normalization
           Quality Control
           Train/Val/Test Split
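The split and dataset stages of this pipeline can be sketched as follows. This is a minimal illustration, not the code in src/genailab/data/: the names `split_indices` and `ArrayDataset` are hypothetical, and the class is duck-typed (only `__len__` and `__getitem__`) so the snippet runs without PyTorch installed. In a real module you would more likely subclass `torch.utils.data.Dataset`.

```python
import numpy as np


def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sample indices once and cut them into train/val/test parts."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1.0:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")
    rng = np.random.default_rng(seed)      # seeded so the split is reproducible
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_frac))
    n_test = int(round(n_samples * test_frac))
    val = order[:n_val]
    test = order[n_val:n_val + n_test]
    train = order[n_val + n_test:]
    return train, val, test


class ArrayDataset:
    """Map-style dataset over a preprocessed (samples x features) array.

    Hypothetical helper: it only implements __len__ and __getitem__,
    which is the protocol a map-style PyTorch dataset is expected to provide.
    """

    def __init__(self, matrix, indices):
        self.matrix = np.asarray(matrix, dtype=np.float32)
        self.indices = np.asarray(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Look up the i-th sample of this split in the full matrix.
        return self.matrix[self.indices[i]]


# Toy data: 20 samples with 2 features each.
matrix = np.arange(40, dtype=np.float32).reshape(20, 2)
train_idx, val_idx, test_idx = split_indices(len(matrix), val_frac=0.1, test_frac=0.2)
train_set = ArrayDataset(matrix, train_idx)
print(len(train_set), len(val_idx), len(test_idx))
```

Splitting on indices rather than copying arrays keeps one preprocessed matrix in memory and makes it easy to store the split alongside the dataset for reproducibility.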
Key considerations:
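- Gene expression: Log-transform, highly variable gene (HVG) selection, negative binomial (NB) or zero-inflated negative binomial (ZINB) likelihoods for raw counts
- Medical imaging: Resize, normalize pixel values to [-1, 1], augmentation
- Perturbation: Control vs. treated pairing, batch correction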
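A few of these steps are small enough to sketch in NumPy. The block below is an illustrative sketch under stated assumptions, not the project's preprocessing code: all function names are hypothetical, the variance ranking is a simplified stand-in for a proper HVG method, and batch correction, resizing, and augmentation are left out. Models that use an NB/ZINB likelihood typically consume raw counts rather than the log-normalized values shown here.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0          # avoid division by zero for empty cells
    return np.log1p(counts / lib_size * target_sum)


def top_variable_genes(log_expr, n_top):
    """Return sorted column indices of the n_top highest-variance genes.

    Simplified stand-in for HVG selection (assumption, not a published method).
    """
    variances = np.asarray(log_expr).var(axis=0)
    return np.sort(np.argsort(variances)[::-1][:n_top])


def scale_image(pixels):
    """Map 8-bit pixel values in [0, 255] to float32 values in [-1, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 127.5 - 1.0


def perturbation_delta(expr, is_control):
    """Pair treated cells with controls by subtracting the mean control profile."""
    expr = np.asarray(expr, dtype=np.float64)
    is_control = np.asarray(is_control, dtype=bool)
    control_mean = expr[is_control].mean(axis=0)
    return expr[~is_control] - control_mean


# Toy inputs (made up for illustration).
counts = np.array([[0, 10, 90], [5, 5, 90], [50, 0, 50]])
log_expr = log_normalize(counts)
hvg_idx = top_variable_genes(log_expr, n_top=2)

image = scale_image(np.array([[0, 255]], dtype=np.uint8))

delta = perturbation_delta([[1, 2], [3, 4], [5, 8]], [True, True, False])
```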
Adding New Datasets
When documenting a new dataset:
- Create a markdown file in the appropriate subdirectory
- Include:
  - Source: Where to download, licensing
  - Description: What the data contains
  - Preprocessing: Required transformations
  - Code: Related modules and notebooks
  - Example usage: Code snippets
- Update this README with a link
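The checklist above can be verified mechanically before a new dataset page is linked. The helper below is a hypothetical sketch, not an existing tool in this repository: it assumes each required item appears as its own `#`-style markdown heading, which may not match how existing pages are laid out.

```python
# Section names taken from the checklist; the heading-based layout is an assumption.
REQUIRED_SECTIONS = ("Source", "Description", "Preprocessing", "Code", "Example usage")


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return the required section names that never appear as a heading line."""
    headings = set()
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Drop the leading '#' characters and compare case-insensitively.
            headings.add(stripped.lstrip("#").strip().lower())
    return [name for name in required if name.lower() not in headings]


draft = """# My Dataset
## Source
## Description
## Preprocessing
"""

print(missing_sections(draft))  # ['Code', 'Example usage']
```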
Related Documentation
- ROADMAP.md — Learning progression
- Latent Diffusion + NB/ZINB — Count data handling
- scPPDM concepts — Perturbation modeling