PBMC Datasets for Generative Modeling¶
PBMC 3k and 68k are the MNIST of single-cell biology — standardized, clean, and the right starting point for VAE, cVAE, and score-based models.
What Are PBMC Datasets?¶
PBMC stands for Peripheral Blood Mononuclear Cells — immune cells circulating in blood: T cells, B cells, NK cells, monocytes, and dendritic cells.
Biologically, they're appealing because:
- They're diverse but well-studied
- They have strong, stereotyped transcriptional programs
- Cell types are separable yet overlapping (a perfect stress test for latent models)
Technologically, PBMC datasets come from 10x Genomics and are widely used as benchmarks in single-cell analysis.
PBMC 3k: The "Hello World" of scRNA-seq¶
PBMC 3k contains ~2,700 cells from a healthy donor, sequenced using droplet-based scRNA-seq.
What You Get Per Cell¶
- A vector of raw UMI counts (~32,000 genes in the raw matrix; roughly 13,000–14,000 after basic gene filtering)
- Extreme sparsity (90–95% zeros)
- A total count per cell (library size)
- Latent biological structure visible even with simple models
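The list above can be made concrete with a simulated stand-in (a sketch, not the real PBMC 3k matrix: the counts below are drawn from a toy Gamma–Poisson model, and the sizes are chosen for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 2000  # toy scale; real PBMC 3k is ~2,700 cells

# Simulate NB-like counts: per-gene rates span orders of magnitude,
# so most genes are effectively off in any given cell.
gene_rate = rng.lognormal(mean=-4.0, sigma=2.0, size=n_genes)
lib_size = rng.integers(1_000, 5_000, size=n_cells)            # per-cell depth
mu = lib_size[:, None] * (gene_rate / gene_rate.sum())[None, :]
counts = rng.poisson(rng.gamma(shape=2.0, scale=mu / 2.0))     # Gamma-Poisson = NB draws

sparsity = (counts == 0).mean()
print(f"sparsity: {sparsity:.1%}")           # extreme sparsity, as in real data
print(f"library sizes: {counts.sum(axis=1)[:3]}")
```

The point of the sketch: a raw UMI matrix is integer, mostly zero, and carries a per-cell library size, and all three properties matter to the likelihood.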
Why It's Perfect for This Roadmap¶
- Small enough to iterate fast
- Large enough to expose overdispersion
- Cell types are known and interpretable
- Ideal for validating NB vs ZINB likelihoods
- Excellent for latent space visualization (UMAP/t-SNE)
PBMC 3k lets you verify that your ELBO, KL term, decoder parameterization, and library-size handling are correct before scaling.
From a generative-modeling perspective, PBMC 3k is where you learn to respect the data manifold.
PBMC 68k: Same Biology, Different Regime¶
PBMC 68k is the same kind of data, but in a different computational universe: ~68,000 cells.
What changes is not biology — it's statistics and scaling.
Key Differences¶
- Many more rare cell states
- Much sharper estimates of dispersion parameters
- Clearer separation between biological and technical noise
- Enough data to expose posterior collapse issues in VAEs
- Enough scale to make diffusion / score models meaningful
Where It Matters¶
- Amortized inference starts to pay off
- Batch effects start to bite
- Latent dimensionality becomes nontrivial
- Conditional models (cVAE) become obviously useful
If PBMC 3k asks "does this work?", PBMC 68k asks "does this still work under pressure?"
Why PBMC Is Ideal for Generative Expression Modeling¶
A core goal in computational biology — pursued by companies like Synthesize Bio, insitro, and others — is to:
Generate biologically realistic gene expression states under different conditions
This enables studying treatment responses, predicting perturbation effects, and simulating counterfactual scenarios in silico.
PBMC datasets let you practice exactly that in miniature:
| Concept | PBMC Mapping |
|---|---|
| Condition | Cell type (T cell, monocyte, etc.) |
| Latent | Cell state, activation, continuous variation |
| Generation | Simulate realistic immune profiles |
| Counterfactual | "What if this monocyte were a T cell?" |
That's not toy modeling — that's the same abstraction used in scVI, scGen, and industry-scale expression simulators.
What PBMC Teaches You¶
- How much structure comes from the condition
- How much must live in the latent
- When the model cheats by encoding labels in z
- How overdispersion actually looks in practice
How PBMC Fits Into Our Roadmap¶
Mapping this directly onto our genai-lab roadmap:
VAE (Stage 1)¶
- PBMC 3k: Learn reconstruction + KL balance
- PBMC 68k: Test stability, latent capacity
cVAE¶
- Condition on cell type
- Generate realistic expression given (z, cell_type)
- Perform label swapping to test disentanglement
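The label-swapping idea can be sketched in a few lines (pure NumPy, with a stand-in random linear `decode` function that is illustrative only, not any model's real API): the decoder sees `(z, one_hot(cell_type))`, so swapping the label while holding z fixed produces the counterfactual profile.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n_types, n_genes = 10, 3, 50

# Stand-in decoder: fixed random linear map + softplus -> non-negative rates.
W = rng.normal(size=(latent_dim + n_types, n_genes))

def decode(z, cell_type):
    one_hot = np.eye(n_types)[cell_type]
    h = np.concatenate([z, one_hot]) @ W
    return np.log1p(np.exp(h))  # softplus keeps rates non-negative

z = rng.normal(size=latent_dim)           # one cell's latent state
rate_t_cell = decode(z, cell_type=0)      # decode as "T cell"
rate_monocyte = decode(z, cell_type=1)    # same z, swapped label: the counterfactual

print(rate_t_cell[:3], rate_monocyte[:3])
```

If the two outputs were identical, the condition would be doing nothing; if z alone already predicts the label perfectly, the model is "cheating" in the sense discussed below.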
β-VAE¶
- Explore how much biological variation survives stronger bottlenecks
Score Matching / Diffusion (Stage 3+)¶
- PBMC 68k gives enough density to meaningfully estimate scores
- Noise → denoise → recover biological structure
At each stage, PBMC acts as a known reference system. If the model fails here, it won't magically succeed on GTEx or TCGA.
Critical Point: Keep Raw Counts¶
PBMC data must stay in raw count space if you're serious about generative modeling.
The correct approach (as in data_preparation.md):
- No log1p
- No CPM
- No normalization before modeling
- Library size treated explicitly
This is not pedantry. If you normalize first, you destroy the generative story and force the model to learn artifacts instead of biology.
PBMC is forgiving enough to show you this mistake clearly — another reason it's a great teacher.
Big Picture Takeaway¶
PBMC 3k and 68k are not just datasets — they're didactic instruments.
They teach you:
- How generative assumptions meet biological reality
- How overdispersion emerges naturally
- How conditioning interacts with latent structure
- How scale changes model behavior
Once PBMC feels intuitive, moving to Tabula Sapiens, GTEx, or disease cohorts becomes an engineering problem, not a conceptual one.
Next Steps¶
- Walk through PBMC preprocessing line-by-line
- Sketch the exact cVAE computational graph for PBMC
- Design evaluation metrics for realistic expression generation
Is It Okay to Use Normalized Counts?¶
This is a really important question, and the confusion is extremely common — even among people who use NB/ZINB every day.
Short answer: Normalized counts look like counts, but they are no longer generated by a count process.
The long answer is where the insight lives.
1. What NB/ZINB Are Actually Modeling¶
A Negative Binomial model is not just a curve that fits integers. It encodes a physical data-generation process:
- True expression rate of gene g in cell i
- Sequencing depth / capture efficiency of that cell (library size)
- Sampling noise + biological variability
Mathematically, the canonical scRNA-seq NB story is:
"Given a cell with library size \(\ell_i\), gene g produces counts \(y_{ig}\) drawn from an NB distribution with mean proportional to \(\ell_i\)."
Written plainly:
- Counts increase when you sequence deeper
- Variance increases with the mean (overdispersion)
- Zero inflation comes from biology + dropout, not arithmetic tricks
That story is true in the wet lab. That's why NB works so well.
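The NB mean–variance relationship \(\mathrm{Var}[y] = \mu + \mu^2/\theta\) can be checked empirically with a quick Gamma–Poisson simulation (a sketch; \(\theta\) here is the inverse-dispersion, and the specific values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, theta, n = 5.0, 2.0, 200_000   # mean, inverse-dispersion, sample count

# NB as a Gamma-Poisson mixture: rate ~ Gamma(theta, mu/theta), y ~ Poisson(rate)
rates = rng.gamma(shape=theta, scale=mu / theta, size=n)
y = rng.poisson(rates)

print(f"empirical mean: {y.mean():.2f}  (theory: {mu})")
print(f"empirical var:  {y.var():.2f}  (theory: {mu + mu**2 / theta:.2f})")
```

Note the variance (here 17.5) is far above the Poisson value (5): that gap is the overdispersion the NB exists to model.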
2. What Normalization Actually Does¶
Normalization (CPM, TPM, size-factor normalization, log1p, etc.) rewrites the data to answer a different question:
"What would expression look like if all cells had the same depth?"
That is a deterministic transformation of the raw counts.
Example (CPM): \(\mathrm{CPM}_{ig} = \dfrac{y_{ig}}{\ell_i} \times 10^6\)
Key observation:
- \(\ell_i\) (library size) is removed
- Depth variability is collapsed
- Values are now ratios, not samples from a counting process
You didn't just rescale the data. You changed the random variable.
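Concretely (a toy sketch with simulated counts): after CPM, every cell's values sum to the same constant, so \(\ell_i\) carries no information at all.

```python
import numpy as np

rng = np.random.default_rng(3)
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=(5, 100)))  # toy raw counts
lib = counts.sum(axis=1, keepdims=True)                       # per-cell library size

cpm = counts / lib * 1e6   # deterministic rescale: CPM_ig = y_ig / l_i * 1e6

print(counts.sum(axis=1))  # library sizes vary cell to cell
print(cpm.sum(axis=1))     # all exactly 1e6: depth variation is gone
```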
3. Why "They're Still Counts" Is Misleading¶
After normalization:
- Values may still be non-negative
- They may even be integers (if you round)
- They may look "count-like"
But generatively, they are no longer counts.
| Question | Answer Type |
|---|---|
| "How many molecules did I observe?" | True count |
| "What fraction of my sequencing budget went to this gene?" | Normalized value |
Those are not the same random experiment.
NB/ZINB likelihoods assume:
- Randomness comes from molecular sampling
- Variance depends on the mean and depth
- Library size is a latent or observed covariate
After normalization:
- Depth is fixed by construction
- Variance is artificially homogenized
- Mean–variance coupling is broken
So the likelihood is wrong, even if the curve "fits".
4. The Hidden Problem: Double-Using Library Size¶
This is the subtle killer.
If you:
- Normalize counts to remove library size
- Then train a model with an NB decoder
You've implicitly told the model:
"Pretend depth doesn't matter — but also explain variance as if it did."
That contradiction forces the model to:
- Invent fake dispersion
- Misuse the latent space
- Leak technical effects into z
- Blur biology and noise
This is one of the main reasons people see:
- Posterior collapse
- Mushy latent spaces
- Poor counterfactuals
The model is trying to explain artifacts you injected.
5. Why Normalization Is Fine for Some Tasks¶
Normalization isn't evil. It's just task-specific.
Appropriate for (descriptive tasks):
- Clustering
- Visualization (UMAP / PCA)
- Differential expression heuristics
- Linear models assuming homoskedasticity
Not appropriate for (generative tasks):
- Asserting a data-generating process
- Likelihood-based evaluation
- Counterfactuals that must respect physics
That's why scVI, scGen, and industry-scale models all operate on raw counts with explicit size factors.
6. The Correct Way to "Normalize" in Generative Models¶
Instead of transforming the data, you transform the model.
The canonical trick:
Keep raw counts; include library size as an offset or covariate
Conceptually:
- Decoder predicts a rate
- Library size scales that rate
- NB handles overdispersion naturally
This preserves:
- Correct variance structure
- Biological signal
- Valid sampling semantics
Nothing is thrown away.
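The offset trick can be sketched as follows (NumPy; `rho` stands in for a decoder's normalized per-gene output, in the spirit of scVI-style parameterizations, not any library's exact API): the decoder emits proportions, and the observed library size scales them into the NB mean.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes = 100

# Pretend decoder output: unnormalized scores -> softmax -> per-gene proportions
scores = rng.normal(size=n_genes)
rho = np.exp(scores) / np.exp(scores).sum()   # sums to 1 across genes

lib_size = 4_000.0        # observed depth of this cell: plugged in, not learned
mu = lib_size * rho       # NB mean: rate scaled by depth, data left untouched

theta = 2.0               # inverse-dispersion
var = mu + mu**2 / theta  # the variance the NB likelihood expects at this depth

print(mu.sum())           # means sum to the library size by construction
```

The data never changes; only the model's mean is depth-aware. That is the whole trick.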
7. A Mental Checksum¶
Ask this question:
"Could I plausibly simulate raw sequencing data from this representation?"
- Raw counts → yes
- Normalized counts → no
If you can't simulate sequencing, you're not doing generative biology — you're doing regression on a convenience transform.
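The checksum can be made literal (a sketch with made-up proportions): from per-gene proportions plus a depth you can draw a plausible raw profile; from CPM alone you cannot, because no depth remains to condition on.

```python
import numpy as np

rng = np.random.default_rng(5)

rho = rng.dirichlet(np.ones(50))   # per-gene proportions: a "cell state"
lib_size = 3_000                   # a depth to simulate at

# Raw-count representation -> simulatable: multinomial "sequencing" of reads
simulated = rng.multinomial(lib_size, rho)
print(simulated.sum())             # exactly lib_size, a genuine count vector

# CPM representation: rho * 1e6 with no library size attached.
# There is nothing left to draw from -- the random experiment is gone.
cpm = rho * 1e6
```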
8. Why This Matters for Diffusion Models¶
This becomes even more critical for:
- Score matching
- Diffusion models
- Likelihood-based evaluation
Those methods assume:
- The data distribution is real
- Noise has a physical interpretation
- The model can walk backward to plausible samples
Normalized data breaks that chain completely.
Bottom Line¶
Normalization answers analytical questions. Raw counts answer generative questions.
If your goal is:
- Counterfactual gene expression
- In silico experiments
- Realistic sampling
- Industry-grade generative models
Then raw counts + explicit library size is not a preference — it's a requirement.
This distinction is one of the big conceptual thresholds between using generative models and understanding them.