Datasets

This directory documents datasets used in genai-lab for training and evaluating generative models, along with their preprocessing pipelines and related code.


Overview

| Category | Datasets | Use Cases |
|---|---|---|
| Gene Expression | PBMC 3k/68k, bulk RNA-seq | VAE training, latent diffusion |
| Medical Imaging | Chest X-ray (synthetic & real) | Diffusion models, DiT |
| Perturbation | scPerturb, Replogle, Norman | scPPDM, JEPA, perturbation prediction |

Gene Expression Datasets

Single-cell and bulk RNA-seq datasets for generative modeling.

| Document | Description |
|---|---|
| PBMC.md | PBMC 3k/68k dataset guide |
| data_preparation.md | RNA-seq preprocessing workflows |

Related code:

  • src/genailab/data/ — Data loading utilities
  • src/genailab/data/sc_dataset.py — Single-cell dataset classes
  • notebooks/diffusion/04_gene_expression_diffusion/ — Gene expression diffusion demo

Medical Imaging Datasets

Datasets for training diffusion models on medical images.

| Document | Description |
|---|---|
| chest_xray.md | Chest X-ray datasets (synthetic & real) |

Related code:

  • src/genailab/diffusion/datasets.py — SyntheticXRayDataset, ChestXRayDataset
  • notebooks/diffusion/03_medical_imaging_diffusion/ — Medical imaging diffusion demo

Perturbation Datasets

Perturb-seq and CRISPR screening datasets for perturbation prediction models.

| Document | Description |
|---|---|
| scperturb.md | scPerturb harmonized collection |
| perturb_seq_guide.md | General Perturb-seq data handling |

Target applications:

  • scPPDM: Single-cell Perturbation Prediction via Diffusion Models
  • JEPA: Joint Embedding Predictive Architecture for perturbation response
  • Counterfactual generation: "What if" perturbation scenarios

Data Pipeline Pattern

Each dataset follows a consistent pipeline:

Raw Data → Preprocessing → PyTorch Dataset → DataLoader → Model
           ├─ Normalization
           ├─ Quality Control
           └─ Train/Val/Test Split
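
The later stages of this pipeline can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the project's actual code: `split_indices`, `ArrayDataset`, and `batches` are hypothetical names introduced here. In PyTorch, a map-style dataset is an object that implements `__len__` and `__getitem__`, so a class of this shape is the kind of thing that would normally be passed to `torch.utils.data.DataLoader`; the `batches` helper only stands in for that step so the example runs without PyTorch installed.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle range(n) and cut it into train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


class ArrayDataset:
    """Hypothetical map-style dataset over an in-memory list of samples."""

    def __init__(self, samples, indices, transform=None):
        self.samples = samples
        self.indices = indices
        self.transform = transform  # optional per-sample preprocessing

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x = self.samples[self.indices[i]]
        return self.transform(x) if self.transform else x


def batches(dataset, batch_size):
    """Stand-in for a DataLoader: yield lists of consecutive items."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Toy "raw data": 20 samples with 2 features each.
raw = [[float(i), float(i) * 2.0] for i in range(20)]
train_idx, val_idx, test_idx = split_indices(len(raw))

# Normalization is applied lazily, per sample, inside the dataset.
train_ds = ArrayDataset(raw, train_idx, transform=lambda x: [v / 40.0 for v in x])
train_batches = list(batches(train_ds, batch_size=5))
```

Splitting by index rather than copying the data keeps the three subsets views over one underlying array, which is one common way to keep the train, validation, and test sets disjoint.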

Key considerations:

  1. Gene expression: Log-transform, highly variable gene (HVG) selection, negative binomial (NB) or zero-inflated NB (ZINB) likelihoods for raw counts
  2. Medical imaging: Resize, normalize to [-1, 1], augmentation
  3. Perturbation: Control vs treated pairing, batch correction
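
The considerations above can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration, not the repository's implementation: the function names are hypothetical, `target_sum=1e4` is a common but arbitrary choice, ranking genes by plain variance is a simplification of real HVG selection, and subtracting the mean control profile is only one simple way to pair treated cells with controls. Batch correction is not shown.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def top_variance_genes(expr, n_top):
    """Return sorted column indices of the n_top highest-variance genes."""
    variances = expr.var(axis=0)
    top = np.argsort(variances)[::-1][:n_top]
    return np.sort(top)


def scale_to_unit_range(image_uint8):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return np.asarray(image_uint8, dtype=np.float32) / 127.5 - 1.0


def control_mean_delta(expr, labels, control_label="control"):
    """Subtract the mean control profile from every non-control cell."""
    labels = np.asarray(labels)
    is_control = labels == control_label
    control_mean = expr[is_control].mean(axis=0)
    return expr[~is_control] - control_mean


# Toy data: 4 cells x 3 genes, with hypothetical perturbation labels.
counts = np.array([[10, 0, 5], [0, 0, 20], [3, 3, 3], [8, 1, 1]])
labels = ["control", "KO_A", "control", "KO_A"]

expr = log_normalize(counts)            # gene expression: log-transform
hvg = top_variance_genes(expr, 2)       # gene expression: gene selection
delta = control_mean_delta(expr, labels)  # perturbation: treated vs control

image = np.array([[0, 255], [128, 64]], dtype=np.uint8)
scaled = scale_to_unit_range(image)     # medical imaging: [-1, 1] range
```

Note that `log_normalize` returns a new array and leaves `counts` untouched. That matters for models with NB or ZINB likelihoods, which are fitted on the raw integer counts rather than on log-transformed values.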

Adding New Datasets

When documenting a new dataset:

  1. Create a markdown file in the appropriate subdirectory
  2. Include:
       • Source: Where to download, licensing
       • Description: What the data contains
       • Preprocessing: Required transformations
       • Code: Related modules and notebooks
       • Example usage: Code snippets
  3. Update this README with a link