Data Preparation for Generative Models¶

This document describes how to obtain and preprocess real-world gene expression datasets for training and evaluating generative models (VAE, diffusion, etc.).

1. Why Real Data Matters¶

To compare generative approaches objectively (VAE vs diffusion, negative binomial (NB) vs zero-inflated negative binomial (ZINB) likelihoods, etc.), we need:

Real count distributions with overdispersion and sparsity
Meaningful conditions (tissue, disease, cell type) for conditional generation
Held-out test sets for likelihood-based evaluation
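
A quick way to check the first requirement on a candidate dataset is to measure its sparsity and per-gene variance-to-mean ratio. The sketch below is illustrative only: it assumes a dense cells × genes NumPy array of raw counts, and the function name `count_diagnostics` and the simulated data are hypothetical, not part of the repository.

```python
import numpy as np

def count_diagnostics(counts):
    """Summarize sparsity and overdispersion of a (cells x genes) raw count matrix."""
    counts = np.asarray(counts, dtype=float)
    zero_fraction = float((counts == 0).mean())
    gene_mean = counts.mean(axis=0)
    gene_var = counts.var(axis=0, ddof=1)
    expressed = gene_mean > 0  # skip all-zero genes to avoid dividing by zero
    # Under a Poisson model var/mean is about 1; values well above 1 indicate overdispersion.
    dispersion_index = gene_var[expressed] / gene_mean[expressed]
    return {
        "zero_fraction": zero_fraction,
        "median_var_over_mean": float(np.median(dispersion_index)),
    }

# Simulated comparison: gamma-Poisson (NB) counts vs plain Poisson counts with the same means.
rng = np.random.default_rng(0)
gene_means = rng.gamma(shape=2.0, scale=1.0, size=200)
rates = rng.gamma(shape=0.5, scale=gene_means / 0.5, size=(500, 200))  # E[rates] = gene_means
nb_counts = rng.poisson(rates)
poisson_counts = rng.poisson(np.broadcast_to(gene_means, (500, 200)))

nb_stats = count_diagnostics(nb_counts)
poisson_stats = count_diagnostics(poisson_counts)
```

On real data, a median variance-to-mean ratio far above 1 together with a high zero fraction is the pattern that motivates NB-type likelihoods over Poisson or Gaussian ones.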

2. Recommended Datasets¶

2.1 scRNA-seq (Start Here)¶

Dataset	Description	Size	Conditions	Link
PBMC 3k	Classic starter dataset	~2,700 cells	Cell type	10x Genomics
PBMC 68k	Larger PBMC dataset	~68,000 cells	Cell type	10x Genomics
Tabula Sapiens	Multi-tissue human atlas	~500k cells	Tissue, cell type, donor	Portal
Tabula Muris	Multi-tissue mouse atlas	~100k cells	Tissue, cell type	Portal

Note: For UMI-based scRNA-seq, NB is often sufficient. Add ZINB only if NB badly underfits zeros.

2.2 Bulk RNA-seq (Later)¶

Dataset	Description	Size	Conditions	Link
GTEx	Multi-tissue, healthy baseline	~17k samples	Tissue, sex, age	GTEx Portal
recount3	Uniformly processed public RNA-seq	Massive	Study-dependent	recount3
TCGA	Cancer transcriptomes	~11k samples	Cancer type, stage	GDC Portal

Note: For bulk RNA-seq, NB is typically the right likelihood; ZINB is rarely necessary.

3. Preprocessing Scripts¶

3.1 scRNA-seq (Python + Scanpy)¶

Script: src/genailab/data/sc_preprocess.py

What it does:

Loads 10x MTX format or downloads PBMC3k directly
Computes QC metrics (n_counts, n_genes, mito %)
Filters low-quality cells and genes
Computes library size for NB models
Saves raw counts as .h5ad

Key principle: Do NOT normalize or log-transform the counts if you plan to train with an NB/ZINB likelihood, because these likelihoods are defined on raw integer counts.
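
The steps listed above can be sketched without Scanpy to show what each one computes. This is a minimal illustration, not the contents of sc_preprocess.py: the function name, the thresholds (`min_genes`, `min_cells`, `max_mito_pct`), and the "MT-" prefix convention for human mitochondrial genes are assumptions. In Scanpy, `sc.pp.calculate_qc_metrics`, `sc.pp.filter_cells`, and `sc.pp.filter_genes` cover the same ground.

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=200, min_cells=3, max_mito_pct=5.0):
    """Filter a (cells x genes) raw count matrix; return counts, gene names, library size."""
    counts = np.asarray(counts)
    gene_names = np.asarray(gene_names).astype(str)

    # Per-cell QC metrics.
    n_counts = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)
    is_mito = np.char.startswith(gene_names, "MT-")
    mito_pct = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(n_counts, 1)

    # Drop low-quality cells, then genes detected in too few of the remaining cells.
    keep_cells = (n_genes >= min_genes) & (mito_pct <= max_mito_pct)
    counts = counts[keep_cells]
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]

    # Library size from the retained raw counts (computing it after filtering is a choice).
    library_size = counts.sum(axis=1)
    return counts, gene_names[keep_genes], library_size

# Toy data: 50 cells, 30 genes, two of them mitochondrial.
rng = np.random.default_rng(0)
toy = rng.poisson(1.0, size=(50, 30))
names = np.array([f"GENE{i}" for i in range(28)] + ["MT-CO1", "MT-ND1"])
toy[0] = 0      # an empty cell that should be removed
toy[:, 5] = 0   # a gene that is never detected
filtered, kept_names, lib = qc_filter(toy, names, min_genes=5, max_mito_pct=50.0)
```

The counts stay as untransformed integers throughout, which is what the NB/ZINB likelihood expects.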

3.2 Bulk RNA-seq (R + recount3)¶

Script: src/genailab/data/bulk_recount3_preprocess.R

What it does:

Downloads uniformly processed counts from recount3
Extracts counts matrix and sample metadata
Filters lowly-expressed genes
Saves as RDS (can convert to CSV for Python)

Reference: recount3 quickstart

3.3 Bulk RNA-seq (Python Alternative)¶

Script: src/genailab/data/bulk_preprocess.py

What it does:

Loads counts from CSV files (exported from R or downloaded from portals)
Optionally downloads from GEO using GEOparse
Computes library size for NB models
Filters lowly-expressed genes
Converts to AnnData format (same as scRNA-seq)

Usage examples:

# From CSV files (e.g., exported from R/recount3)
python -m genailab.data.bulk_preprocess csv \
    --counts bulk_counts.csv \
    --metadata bulk_metadata.csv \
    --output bulk.h5ad

# From GEO (requires: pip install GEOparse)
python -m genailab.data.bulk_preprocess geo \
    --geo-id GSE12345 \
    --output bulk.h5ad

Workflow: Use R/recount3 to download uniformly processed counts, export them to CSV, then use Python for the ML pipeline.
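
As an illustration of the CSV path, the sketch below reads a genes × samples count table and a sample metadata table, aligns them, filters lowly-expressed genes, and computes library size. It is a hedged approximation, not the contents of bulk_preprocess.py: the function name, the `min_samples` threshold, and the assumption that the first CSV column holds gene or sample IDs are all illustrative.

```python
import io
import pandas as pd

def load_bulk_csv(counts_file, metadata_file, min_samples=3):
    """Return (samples x genes counts, aligned metadata, per-sample library size)."""
    counts = pd.read_csv(counts_file, index_col=0)      # rows: genes, columns: samples
    metadata = pd.read_csv(metadata_file, index_col=0)  # rows: samples

    # Keep only samples present in both tables, in a consistent order.
    shared = [s for s in counts.columns if s in metadata.index]
    counts = counts[shared]
    metadata = metadata.loc[shared]

    # Remove genes detected in fewer than `min_samples` samples.
    counts = counts.loc[(counts > 0).sum(axis=1) >= min_samples]

    # Transpose to samples x genes, the orientation AnnData uses.
    counts = counts.T
    library_size = counts.sum(axis=1)
    return counts, metadata, library_size

# Toy in-memory files standing in for bulk_counts.csv and bulk_metadata.csv.
counts_csv = io.StringIO(
    "gene,s1,s2,s3,s4\n"
    "A,10,0,3,7\n"
    "B,0,0,0,1\n"
    "C,5,8,2,0\n"
)
metadata_csv = io.StringIO(
    "sample,tissue\n"
    "s1,liver\n"
    "s2,brain\n"
    "s3,heart\n"
    "s5,liver\n"
)
counts, metadata, library_size = load_bulk_csv(counts_csv, metadata_csv, min_samples=2)
```

Samples that appear in only one of the two tables (s4 and s5 here) are dropped, so counts and metadata end up with the same rows in the same order.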

4. Wiring Conditions for cVAE¶

Once you have counts and metadata, create a condition table:

4.1 Bulk RNA-seq Conditions¶

Condition	Type	Example Values
`tissue`	Categorical	"liver", "brain", "heart"
`disease_status`	Categorical	"healthy", "tumor", "treated"
`batch`	Categorical	"batch1", "batch2"
`sex`	Categorical	"M", "F"
`age`	Continuous	25, 45, 67

4.2 scRNA-seq Conditions¶

Condition	Type	Example Values
`cell_type`	Categorical	"T cell", "B cell", "Monocyte"
`tissue`	Categorical	"blood", "lung", "liver"
`donor`	Categorical	"donor1", "donor2"
`batch`	Categorical	"10x_v2", "10x_v3"

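Categorical conditions become integer IDs that index embedding tables in the cVAE encoder/decoder. Continuous conditions such as age are typically passed as scaled numeric inputs instead.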
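
The mapping from metadata columns to integer IDs and embedding lookups might look like the following. This is a sketch under stated assumptions: `encode_conditions` is a hypothetical helper, and the fixed random NumPy tables stand in for trainable embedding layers (for example `torch.nn.Embedding`) in a real model.

```python
import numpy as np
import pandas as pd

def encode_conditions(metadata, categorical_cols):
    """Map each categorical column to integer IDs and return the vocabularies."""
    ids = {}
    vocab = {}
    for col in categorical_cols:
        cat = pd.Categorical(metadata[col])
        ids[col] = cat.codes.astype(np.int64)  # a code of -1 marks a missing value
        vocab[col] = list(cat.categories)
    return ids, vocab

metadata = pd.DataFrame({
    "tissue": ["liver", "brain", "liver", "heart"],
    "batch": ["batch1", "batch2", "batch1", "batch1"],
})
ids, vocab = encode_conditions(metadata, ["tissue", "batch"])

# One embedding table per condition, shaped (number of categories, embedding_dim).
rng = np.random.default_rng(0)
embedding_dim = 4
tables = {c: rng.normal(size=(len(vocab[c]), embedding_dim)) for c in vocab}

# Look up each condition's embedding and concatenate into one vector per sample.
condition_vectors = np.concatenate([tables[c][ids[c]] for c in vocab], axis=1)
```

Save the vocabularies alongside the model so that the same ID assignment is used at training and generation time.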

5. Critical Checklist for NB/ZINB Models¶

Keep raw counts in the training tensor (no normalization)
Compute library size (total counts per sample/cell) as offset or covariate
Start with NB, upgrade to ZINB only if NB underfits zeros on held-out data
Include batch as a condition (even if you later want invariance)
Filter genes: Remove genes expressed in <3 cells/samples
Filter cells/samples: Remove outliers by QC metrics
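
To judge whether NB underfits zeros, one option is to compare the observed zero fraction per gene with the zero probability implied by the fitted NB. The sketch below assumes the model provides a predicted mean `mu` per cell and gene and an inverse dispersion `theta`; both names, the helper functions, and the simulated data are illustrative, not part of the repository.

```python
import numpy as np

def nb_zero_probability(mu, theta):
    """P(y = 0) under an NB with mean mu and inverse dispersion theta."""
    mu = np.asarray(mu, dtype=float)
    return (theta / (theta + mu)) ** theta

def zero_excess(counts, mu, theta):
    """Observed minus NB-expected zero fraction per gene (cells x genes inputs)."""
    observed = (np.asarray(counts) == 0).mean(axis=0)
    expected = nb_zero_probability(mu, theta).mean(axis=0)
    return observed - expected

# Simulated held-out data: three genes drawn from an NB, with extra zeros in the last one.
rng = np.random.default_rng(0)
n_cells, n_genes, theta = 2000, 3, 2.0
mu = np.full((n_cells, n_genes), 3.0)
counts = rng.poisson(rng.gamma(shape=theta, scale=mu / theta))
counts[rng.random(n_cells) < 0.3, 2] = 0  # inject dropout-like zeros
excess = zero_excess(counts, mu, theta)
```

An excess near zero suggests NB already accounts for the zeros. A large positive excess across many genes on held-out data is the signal that ZINB may be worth trying.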

6. Library Size: Why It Matters¶

Library size (total counts per cell/sample) varies largely because of technical factors such as sequencing depth and capture efficiency, rather than the biology of interest.

For NB models, the typical parameterization is:

\[ \mu_g = \ell \cdot \exp(\eta_g) \]

where:

\(\ell\) = library size (or learned size factor)
\(\eta_g\) = what the decoder predicts from \((z, y)\)
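
A minimal NumPy sketch of an NB log-likelihood using this parameterization is shown below. The inverse dispersion `theta` is not defined in the text above and is an assumption here, as is the function name; a training loop would normally use the equivalent expression in its autodiff framework.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)

def nb_log_likelihood(y, library_size, eta, theta):
    """Log NB pmf with mean mu = library_size * exp(eta) and inverse dispersion theta.

    y: (cells x genes) raw counts; library_size: (cells,); eta: (cells x genes).
    """
    y = np.asarray(y, dtype=float)
    log_mu = np.log(library_size)[:, None] + eta  # log(library size) acts as an offset
    mu = np.exp(log_mu)
    return (
        _lgamma(y + theta) - _lgamma(theta) - _lgamma(y + 1.0)
        + theta * (np.log(theta) - np.log(theta + mu))
        + y * (log_mu - np.log(theta + mu))
    )

y = np.array([[0, 3], [10, 1]])
library_size = np.array([1000.0, 4000.0])
# exp(eta) can be read as the expected count per unit of library size.
eta = np.log(np.array([[1e-3, 2e-3], [1e-3, 2e-3]]))
log_lik = nb_log_likelihood(y, library_size, eta, theta=5.0)
```

Because the library size enters only through the offset, the decoder output \(\eta_g\) describes relative expression, and two cells with the same \(\eta_g\) but different depths get different expected counts.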

How to compute:

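import numpy as np

# scRNA-seq (scanpy/AnnData): adata.X is cells × genes, so sum across genes (axis=1)
adata.obs["library_size"] = np.asarray(adata.X.sum(axis=1)).ravel()

# Bulk RNA-seq (pandas): assumes counts is a genes × samples DataFrame
library_size = counts.sum(axis=0)  # sum over genes, one value per sample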

7. Output Format for ML¶

Both scRNA-seq and bulk RNA-seq should produce:

File	Contents
`counts.h5ad` or `counts.csv`	Raw count matrix (AnnData stores cells/samples × genes; CSV exports from R are typically genes × samples)
`metadata.csv`	Sample/cell metadata with conditions
`library_size.npy`	Precomputed library sizes

The .h5ad format (AnnData) is preferred because it stores counts, metadata, and gene info together.
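
When AnnData is not available, the three files in the table can be written with pandas and NumPy alone. The sketch below is an assumption-laden illustration: `save_ml_inputs` is a hypothetical helper, and it expects a counts table whose rows are samples or cells (transpose first if yours is genes × samples).

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def save_ml_inputs(counts, metadata, out_dir):
    """Write counts.csv, metadata.csv and library_size.npy for a samples x genes table."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if not counts.index.equals(metadata.index):
        raise ValueError("counts and metadata must list the same samples in the same order")
    library_size = counts.sum(axis=1).to_numpy()
    counts.to_csv(out_dir / "counts.csv")
    metadata.to_csv(out_dir / "metadata.csv")
    np.save(out_dir / "library_size.npy", library_size)
    return library_size

counts = pd.DataFrame([[10, 0, 5], [2, 3, 0]], index=["s1", "s2"], columns=["A", "B", "C"])
metadata = pd.DataFrame({"tissue": ["liver", "brain"]}, index=["s1", "s2"])
with tempfile.TemporaryDirectory() as tmp:
    save_ml_inputs(counts, metadata, tmp)
    reloaded_counts = pd.read_csv(Path(tmp) / "counts.csv", index_col=0)
    reloaded_library = np.load(Path(tmp) / "library_size.npy")
```

The index check guards against the most common failure with separate files: counts and metadata drifting out of row order.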
