Data Preparation for Generative Models¶
This document describes how to obtain and preprocess real-world gene expression datasets for training and evaluating generative models (VAE, diffusion, etc.).
1. Why Real Data Matters¶
To objectively compare different generative approaches (VAE vs diffusion, NB vs ZINB, etc.), we need:
- Real count distributions with overdispersion and sparsity
- Meaningful conditions (tissue, disease, cell type) for conditional generation
- Held-out test sets for likelihood-based evaluation
2. Recommended Datasets¶
2.1 scRNA-seq (Start Here)¶
| Dataset | Description | Size | Conditions | Link |
|---|---|---|---|---|
| PBMC 3k | Classic starter dataset | ~2,700 cells | Cell type | 10x Genomics |
| PBMC 68k | Larger PBMC dataset | ~68,000 cells | Cell type | 10x Genomics |
| Tabula Sapiens | Multi-tissue human atlas | ~500k cells | Tissue, cell type, donor | Portal |
| Tabula Muris | Multi-tissue mouse atlas | ~100k cells | Tissue, cell type | Portal |
Note: For UMI-based scRNA-seq, NB is often sufficient. Add ZINB only if NB badly underfits zeros.
2.2 Bulk RNA-seq (Later)¶
| Dataset | Description | Size | Conditions | Link |
|---|---|---|---|---|
| GTEx | Multi-tissue, healthy baseline | ~17k samples | Tissue, sex, age | GTEx Portal |
| recount3 | Uniformly processed public RNA-seq | Massive | Study-dependent | recount3 |
| TCGA | Cancer transcriptomes | ~11k samples | Cancer type, stage | GDC Portal |
Note: For bulk RNA-seq, NB is typically the right likelihood; ZINB is rarely necessary.
3. Preprocessing Scripts¶
3.1 scRNA-seq (Python + Scanpy)¶
Script: src/genailab/data/sc_preprocess.py
What it does:
- Loads 10x MTX format or downloads PBMC3k directly
- Computes QC metrics (n_counts, n_genes, mito %)
- Filters low-quality cells and genes
- Computes library size for NB models
- Saves raw counts as `.h5ad`
Key principle: Do NOT normalize or log-transform if using NB/ZINB likelihood.
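The QC-and-filter steps above can be sketched with plain NumPy on a toy count matrix (in practice `sc.pp.calculate_qc_metrics`, `sc.pp.filter_cells`, and `sc.pp.filter_genes` do this on the AnnData object); the thresholds and the mitochondrial gene mask here are purely illustrative:

```python
import numpy as np

# Toy raw count matrix: 5 cells x 4 genes (cells in rows, genes in columns)
counts = np.array([
    [10, 0, 3, 2],
    [ 0, 0, 0, 1],   # low-quality cell: few counts, few genes
    [ 5, 2, 8, 0],
    [ 7, 1, 0, 4],
    [ 0, 9, 6, 3],
])
mito_genes = np.array([False, False, False, True])  # hypothetical: last gene is mitochondrial

n_counts = counts.sum(axis=1)        # total counts per cell
n_genes = (counts > 0).sum(axis=1)   # detected genes per cell
pct_mito = counts[:, mito_genes].sum(axis=1) / np.maximum(n_counts, 1) * 100

# Filter low-quality cells (illustrative thresholds)
keep_cells = (n_counts >= 5) & (n_genes >= 2)
# Filter genes detected in fewer than 2 of the remaining cells
keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= 2

filtered = counts[keep_cells][:, keep_genes]  # still raw counts, never normalized
print(filtered.shape)
```

Note that `filtered` still holds raw integer counts; no normalization or log-transform is applied at any point.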
3.2 Bulk RNA-seq (R + recount3)¶
Script: src/genailab/data/bulk_recount3_preprocess.R
What it does:
- Downloads uniformly processed counts from recount3
- Extracts counts matrix and sample metadata
- Filters lowly-expressed genes
- Saves as RDS (can convert to CSV for Python)
Reference: recount3 quickstart
3.3 Bulk RNA-seq (Python Alternative)¶
Script: src/genailab/data/bulk_preprocess.py
What it does:
- Loads counts from CSV files (exported from R or downloaded from portals)
- Optionally downloads from GEO using GEOparse
- Computes library size for NB models
- Filters lowly-expressed genes
- Converts to AnnData format (same as scRNA-seq)
Usage examples:
```bash
# From CSV files (e.g., exported from R/recount3)
python -m genailab.data.bulk_preprocess csv \
    --counts bulk_counts.csv \
    --metadata bulk_metadata.csv \
    --output bulk.h5ad

# From GEO (requires: pip install GEOparse)
python -m genailab.data.bulk_preprocess geo \
    --geo-id GSE12345 \
    --output bulk.h5ad
```
Workflow: Use R/recount3 to download uniformly processed counts, export to CSV, then use Python for the ML pipeline.
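The core of the CSV path can be sketched with pandas alone, assuming a genes × samples counts CSV with gene IDs in the first column (the column names here are hypothetical; the actual script additionally wraps the result in an AnnData object and writes `.h5ad`):

```python
import io
import pandas as pd

# Hypothetical genes x samples counts CSV, as exported from R/recount3
counts_csv = io.StringIO(
    "gene,sampleA,sampleB\n"
    "G1,10,0\n"
    "G2,3,7\n"
    "G3,0,2\n"
)
counts = pd.read_csv(counts_csv, index_col=0)

# ML convention: samples in rows, genes in columns (how AnnData stores X)
X = counts.T

# Library size per sample = total raw counts over genes
library_size = X.sum(axis=1)

# Filter lowly-expressed genes: detected in fewer than 2 samples (illustrative)
keep = (X > 0).sum(axis=0) >= 2
X = X.loc[:, keep]
print(X.shape, library_size.to_dict())
```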
4. Wiring Conditions for cVAE¶
Once you have counts and metadata, create a condition table:
4.1 Bulk RNA-seq Conditions¶
| Condition | Type | Example Values |
|---|---|---|
| `tissue` | Categorical | "liver", "brain", "heart" |
| `disease_status` | Categorical | "healthy", "tumor", "treated" |
| `batch` | Categorical | "batch1", "batch2" |
| `sex` | Categorical | "M", "F" |
| `age` | Continuous | 25, 45, 67 |
4.2 scRNA-seq Conditions¶
| Condition | Type | Example Values |
|---|---|---|
| `cell_type` | Categorical | "T cell", "B cell", "Monocyte" |
| `tissue` | Categorical | "blood", "lung", "liver" |
| `donor` | Categorical | "donor1", "donor2" |
| `batch` | Categorical | "10x_v2", "10x_v3" |
These become categorical IDs → embedding tables in the cVAE encoder/decoder.
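The mapping from metadata columns to embedding inputs can be sketched as follows; the random matrix stands in for a learned `nn.Embedding` table, and the embedding dimension is illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical per-cell metadata with categorical conditions
meta = pd.DataFrame({
    "cell_type": ["T cell", "B cell", "T cell", "Monocyte"],
    "batch": ["10x_v2", "10x_v2", "10x_v3", "10x_v3"],
})

# Categorical values -> integer IDs (one vocabulary per condition)
cell_type_ids = meta["cell_type"].astype("category").cat.codes.to_numpy()
batch_ids = meta["batch"].astype("category").cat.codes.to_numpy()

# Stand-in for a learned embedding table: vocab_size x embed_dim
rng = np.random.default_rng(0)
cell_type_emb = rng.normal(size=(meta["cell_type"].nunique(), 8))

# Lookup is row indexing, which is what nn.Embedding does in the cVAE
cond_vectors = cell_type_emb[cell_type_ids]
print(cond_vectors.shape)
```

In the cVAE these per-condition vectors are concatenated (with each other and with `z` on the decoder side) before entering the network.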
5. Critical Checklist for NB/ZINB Models¶
- Keep raw counts in the training tensor (no normalization)
- Compute library size (total counts per sample/cell) as offset or covariate
- Start with NB, upgrade to ZINB only if NB underfits zeros on held-out data
- Include batch as a condition (even if you later want invariance)
- Filter genes: Remove genes expressed in <3 cells/samples
- Filter cells/samples: Remove outliers by QC metrics
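The "upgrade to ZINB only if NB underfits zeros" check from the list above can be sketched by comparing the observed zero fraction to the zero probability implied by a fitted NB, \(P(x=0) = (\theta/(\theta+\mu))^{\theta}\); the fitted values `mu` and `theta` here are hypothetical:

```python
import numpy as np

def nb_zero_prob(mu, theta):
    """P(X = 0) under NB with mean mu and dispersion theta."""
    return (theta / (theta + mu)) ** theta

# Hypothetical fitted NB parameters for one gene, plus held-out counts
mu, theta = 2.0, 0.5
counts = np.array([0, 0, 0, 1, 3, 0, 5, 0, 0, 2])

observed_zero_frac = np.mean(counts == 0)
expected_zero_frac = nb_zero_prob(mu, theta)

# If observed zeros greatly exceed the NB expectation on held-out data,
# consider ZINB; otherwise stay with NB.
print(observed_zero_frac, round(expected_zero_frac, 3))
```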
6. Library Size: Why It Matters¶
Library size (total counts per cell/sample) varies due to technical factors, not biology.
For NB models, the typical parameterization is:

\[
x_g \sim \mathrm{NB}\!\left(\mu_g = \ell \cdot \eta_g,\; \theta_g\right)
\]

where:
- \(\ell\) = library size (or learned size factor)
- \(\eta_g\) = what the decoder predicts from \((z, y)\)
- \(\theta_g\) = gene-wise dispersion
How to compute:
```python
# scRNA-seq (scanpy)
adata.obs["library_size"] = np.array(adata.X.sum(axis=1)).ravel()

# Bulk RNA-seq (pandas): counts is a genes x samples DataFrame
library_size = counts.sum(axis=0)  # sum over genes
```
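The NB likelihood with the library size entering as a multiplicative offset on the mean can be sketched in NumPy (in the actual model this runs in the training framework; `eta` and `theta` below are hypothetical decoder outputs and dispersions):

```python
import numpy as np
from scipy.special import gammaln

def nb_logpmf(x, mu, theta):
    """NB log-pmf in mean/dispersion form: Var(x) = mu + mu**2 / theta."""
    return (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

# Library size scales the decoder output eta_g into the NB mean mu_g
library_size = 1500.0
eta = np.array([0.01, 0.005, 0.002])   # hypothetical decoder output per gene
mu = library_size * eta
theta = np.array([2.0, 1.0, 0.5])      # hypothetical gene-wise dispersions

x = np.array([12, 9, 0])               # observed raw counts
print(np.round(nb_logpmf(x, mu, theta), 3))
```

Because the offset absorbs technical depth differences, the decoder only has to model relative expression.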
7. Output Format for ML¶
Both scRNA-seq and bulk RNA-seq should produce:
| File | Contents |
|---|---|
| `counts.h5ad` or `counts.csv` | Raw count matrix (genes × samples/cells) |
| `metadata.csv` | Sample/cell metadata with conditions |
| `library_size.npy` | Precomputed library sizes |
The .h5ad format (AnnData) is preferred because it stores counts, metadata, and gene info together.
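Producing the three files can be sketched with pandas and NumPy alone (the metadata column is hypothetical; with AnnData available, `adata.write_h5ad(...)` would replace the counts CSV):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

out = Path(tempfile.mkdtemp())

# Raw counts: genes x samples, never normalized
counts = pd.DataFrame(
    [[10, 0], [3, 7], [0, 2]],
    index=["G1", "G2", "G3"], columns=["sampleA", "sampleB"],
)
counts.to_csv(out / "counts.csv")

# Sample metadata with conditions (hypothetical column)
meta = pd.DataFrame({"tissue": ["liver", "brain"]}, index=counts.columns)
meta.to_csv(out / "metadata.csv")

# Precomputed library sizes: sum over genes per sample
np.save(out / "library_size.npy", counts.sum(axis=0).to_numpy())

print(sorted(p.name for p in out.iterdir()))
```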