PBMC Datasets for Generative Modeling¶
PBMC 3k and 68k are the MNIST of single-cell biology — standardized, clean, and the right starting point for VAE, cVAE, and score-based models.
What Are PBMC Datasets?¶
PBMC stands for Peripheral Blood Mononuclear Cells — immune cells circulating in blood: T cells, B cells, NK cells, monocytes, and dendritic cells.
Biologically, they're appealing because:
- They're diverse but well-studied
- They have strong, stereotyped transcriptional programs
- Cell types are separable yet overlapping (a perfect stress test for latent models)
Technologically, PBMC datasets come from 10x Genomics and are widely used as benchmarks in single-cell analysis.
PBMC 3k: The "Hello World" of scRNA-seq¶
PBMC 3k contains ~2,700 cells from a healthy donor, sequenced using droplet-based scRNA-seq.
What You Get Per Cell¶
- A vector of raw UMI counts (~32,000 genes in the raw matrix; roughly 13,000–14,000 after basic gene filtering)
- Extreme sparsity (90–95% zeros)
- A total count per cell (library size)
- Latent biological structure visible even with simple models
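The list above can be made concrete with a simulated stand-in (a sketch, not the real PBMC 3k matrix: the counts below are drawn from a toy Gamma–Poisson model, and the sizes are chosen for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 2000  # toy scale; real PBMC 3k is ~2,700 cells

# Simulate NB-like counts: per-gene rates span orders of magnitude,
# so most genes are effectively off in any given cell.
gene_rate = rng.lognormal(mean=-4.0, sigma=2.0, size=n_genes)
lib_size = rng.integers(1_000, 5_000, size=n_cells)            # per-cell depth
mu = lib_size[:, None] * (gene_rate / gene_rate.sum())[None, :]
counts = rng.poisson(rng.gamma(shape=2.0, scale=mu / 2.0))     # Gamma-Poisson = NB draws

sparsity = (counts == 0).mean()
print(f"sparsity: {sparsity:.1%}")           # extreme sparsity, as in real data
print(f"library sizes: {counts.sum(axis=1)[:3]}")
```

The point of the sketch: a raw UMI matrix is integer, mostly zero, and carries a per-cell library size, and all three properties matter to the likelihood.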
Why It's Perfect for This Roadmap¶
- Small enough to iterate fast
- Large enough to expose overdispersion
- Cell types are known and interpretable
- Ideal for validating NB vs ZINB likelihoods
- Excellent for latent space visualization (UMAP/t-SNE)
PBMC 3k lets you verify that your ELBO, KL term, decoder parameterization, and library-size handling are correct before scaling.
From a generative-modeling perspective, PBMC 3k is where you learn to respect the data manifold.
PBMC 68k: Same Biology, Different Regime¶
PBMC 68k is the same kind of data, but in a different computational universe: ~68,000 cells.
What changes is not biology — it's statistics and scaling.
Key Differences¶
- Many more rare cell states
- Much sharper estimates of dispersion parameters
- Clearer separation between biological and technical noise
- Enough data to expose posterior collapse issues in VAEs
- Enough scale to make diffusion / score models meaningful
Where It Matters¶
- Amortized inference starts to pay off
- Batch effects start to bite
- Latent dimensionality becomes nontrivial
- Conditional models (cVAE) become obviously useful
If PBMC 3k asks "does this work?", PBMC 68k asks "does this still work under pressure?"
Why PBMC Is Ideal for Generative Expression Modeling¶
A core goal in computational biology — pursued by companies like Synthesize Bio, insitro, and others — is to:
Generate biologically realistic gene expression states under different conditions
This enables studying treatment responses, predicting perturbation effects, and simulating counterfactual scenarios in silico.
PBMC datasets let you practice exactly that in miniature:
| Concept | PBMC Mapping |
|---|---|
| Condition | Cell type (T cell, monocyte, etc.) |
| Latent | Cell state, activation, continuous variation |
| Generation | Simulate realistic immune profiles |
| Counterfactual | "What if this monocyte were a T cell?" |
That's not toy modeling — that's the same abstraction used in scVI, scGen, and industry-scale expression simulators.
What PBMC Teaches You¶
- How much structure comes from the condition
- How much must live in the latent
- When the model cheats by encoding labels in z
- How overdispersion actually looks in practice
How PBMC Fits Into Our Roadmap¶
Mapping this directly onto our genai-lab roadmap:
VAE (Stage 1)¶
- PBMC 3k: Learn reconstruction + KL balance
- PBMC 68k: Test stability, latent capacity
cVAE¶
- Condition on cell type
- Generate realistic expression given (z, cell_type)
- Perform label swapping to test disentanglement
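The label-swapping idea can be sketched in a few lines (pure NumPy, with a stand-in random linear `decode` function that is illustrative only, not any model's real API): the decoder sees `(z, one_hot(cell_type))`, so swapping the label while holding z fixed produces the counterfactual profile.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n_types, n_genes = 10, 3, 50

# Stand-in decoder: fixed random linear map + softplus -> non-negative rates.
W = rng.normal(size=(latent_dim + n_types, n_genes))

def decode(z, cell_type):
    one_hot = np.eye(n_types)[cell_type]
    h = np.concatenate([z, one_hot]) @ W
    return np.log1p(np.exp(h))  # softplus keeps rates non-negative

z = rng.normal(size=latent_dim)           # one cell's latent state
rate_t_cell = decode(z, cell_type=0)      # decode as "T cell"
rate_monocyte = decode(z, cell_type=1)    # same z, swapped label: the counterfactual

print(rate_t_cell[:3], rate_monocyte[:3])
```

If the two outputs were identical, the condition would be doing nothing; if z alone already predicts the label perfectly, the model is "cheating" in the sense discussed below.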
β-VAE¶
- Explore how much biological variation survives stronger bottlenecks
Score Matching / Diffusion (Stage 3+)¶
- PBMC 68k gives enough density to meaningfully estimate scores
- Noise → denoise → recover biological structure
At each stage, PBMC acts as a known reference system. If the model fails here, it won't magically succeed on GTEx or TCGA.
Critical Point: Keep Raw Counts¶
PBMC data must stay in raw count space if you're serious about generative modeling.
The correct approach (as in data_preparation.md):
- No log1p
- No CPM
- No normalization before modeling
- Library size treated explicitly
This is not pedantry. If you normalize first, you destroy the generative story and force the model to learn artifacts instead of biology.
PBMC is forgiving enough to show you this mistake clearly — another reason it's a great teacher.
Big Picture Takeaway¶
PBMC 3k and 68k are not just datasets — they're didactic instruments.
They teach you:
- How generative assumptions meet biological reality
- How overdispersion emerges naturally
- How conditioning interacts with latent structure
- How scale changes model behavior
Once PBMC feels intuitive, moving to Tabula Sapiens, GTEx, or disease cohorts becomes an engineering problem, not a conceptual one.
Next Steps¶
- Walk through PBMC preprocessing line-by-line
- Sketch the exact cVAE computational graph for PBMC
- Design evaluation metrics for realistic expression generation
Is It Okay to Use Normalized Counts?¶
This is a really important question, and the confusion is extremely common — even among people who use NB/ZINB every day.
Short answer: Normalized counts look like counts, but they are no longer generated by a count process.
The long answer is where the insight lives.
1. What NB/ZINB Are Actually Modeling¶
A Negative Binomial model is not just a curve that fits integers. It encodes a physical data-generation process:
- True expression rate of gene g in cell i
- Sequencing depth / capture efficiency of that cell (library size)
- Sampling noise + biological variability
Mathematically, the canonical scRNA-seq NB story is:
"Given a cell with library size \(\ell_i\), gene g produces counts \(y_{ig}\) drawn from an NB distribution with mean proportional to \(\ell_i\)."
Written plainly:
- Counts increase when you sequence deeper
- Variance increases with the mean (overdispersion)
- Zero inflation comes from biology + dropout, not arithmetic tricks
That story is true in the wet lab. That's why NB works so well.
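The NB mean–variance relationship \(\mathrm{Var}[y] = \mu + \mu^2/\theta\) can be checked empirically with a quick Gamma–Poisson simulation (a sketch; \(\theta\) here is the inverse-dispersion, and the specific values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, theta, n = 5.0, 2.0, 200_000   # mean, inverse-dispersion, sample count

# NB as a Gamma-Poisson mixture: rate ~ Gamma(theta, mu/theta), y ~ Poisson(rate)
rates = rng.gamma(shape=theta, scale=mu / theta, size=n)
y = rng.poisson(rates)

print(f"empirical mean: {y.mean():.2f}  (theory: {mu})")
print(f"empirical var:  {y.var():.2f}  (theory: {mu + mu**2 / theta:.2f})")
```

Note the variance (here 17.5) is far above the Poisson value (5): that gap is the overdispersion the NB exists to model.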
2. What Normalization Actually Does¶
Normalization (CPM, TPM, size-factor normalization, log1p, etc.) rewrites the data to answer a different question:
"What would expression look like if all cells had the same depth?"
That is a deterministic transformation of the raw counts.
Example (CPM): \(\mathrm{CPM}_{ig} = \dfrac{y_{ig}}{\ell_i} \times 10^6\)
Key observation:
- \(\ell_i\) (library size) is removed
- Depth variability is collapsed
- Values are now ratios, not samples from a counting process
You didn't just rescale the data. You changed the random variable.
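Concretely (a toy sketch with simulated counts): after CPM, every cell's values sum to the same constant, so \(\ell_i\) carries no information at all.

```python
import numpy as np

rng = np.random.default_rng(3)
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=(5, 100)))  # toy raw counts
lib = counts.sum(axis=1, keepdims=True)                       # per-cell library size

cpm = counts / lib * 1e6   # deterministic rescale: CPM_ig = y_ig / l_i * 1e6

print(counts.sum(axis=1))  # library sizes vary cell to cell
print(cpm.sum(axis=1))     # all exactly 1e6: depth variation is gone
```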
3. Why "They're Still Counts" Is Misleading¶
After normalization:
- Values may still be non-negative
- They may even be integers (if you round)
- They may look "count-like"
But generatively, they are no longer counts.
| Question | Answer Type |
|---|---|
| "How many molecules did I observe?" | True count |
| "What fraction of my sequencing budget went to this gene?" | Normalized value |
Those are not the same random experiment.
NB/ZINB likelihoods assume:
- Randomness comes from molecular sampling
- Variance depends on the mean and depth
- Library size is a latent or observed covariate
After normalization:
- Depth is fixed by construction
- Variance is artificially homogenized
- Mean–variance coupling is broken
So the likelihood is wrong, even if the curve "fits".
4. The Hidden Problem: Double-Using Library Size¶
This is the subtle killer.
If you:
- Normalize counts to remove library size
- Then train a model with an NB decoder
You've implicitly told the model:
"Pretend depth doesn't matter — but also explain variance as if it did."
That contradiction forces the model to:
- Invent fake dispersion
- Misuse the latent space
- Leak technical effects into z
- Blur biology and noise
This is one of the main reasons people see:
- Posterior collapse
- Mushy latent spaces
- Poor counterfactuals
The model is trying to explain artifacts you injected.
5. Why Normalization Is Fine for Some Tasks¶
Normalization isn't evil. It's just task-specific.
Appropriate for (descriptive tasks):
- Clustering
- Visualization (UMAP / PCA)
- Differential expression heuristics
- Linear models assuming homoskedasticity
Not appropriate for (generative tasks):
- Asserting a data-generating process
- Likelihood-based evaluation
- Counterfactuals that must respect physics
That's why scVI, scGen, and industry-scale models all operate on raw counts with explicit size factors.
6. The Correct Way to "Normalize" in Generative Models¶
Instead of transforming the data, you transform the model.
The canonical trick:
Keep raw counts; include library size as an offset or covariate
Conceptually:
- Decoder predicts a rate
- Library size scales that rate
- NB handles overdispersion naturally
This preserves:
- Correct variance structure
- Biological signal
- Valid sampling semantics
Nothing is thrown away.
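The offset trick can be sketched as follows (NumPy; `rho` stands in for a decoder's normalized per-gene output, in the spirit of scVI-style parameterizations, not any library's exact API): the decoder emits proportions, and the observed library size scales them into the NB mean.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes = 100

# Pretend decoder output: unnormalized scores -> softmax -> per-gene proportions
scores = rng.normal(size=n_genes)
rho = np.exp(scores) / np.exp(scores).sum()   # sums to 1 across genes

lib_size = 4_000.0        # observed depth of this cell: plugged in, not learned
mu = lib_size * rho       # NB mean: rate scaled by depth, data left untouched

theta = 2.0               # inverse-dispersion
var = mu + mu**2 / theta  # the variance the NB likelihood expects at this depth

print(mu.sum())           # means sum to the library size by construction
```

The data never changes; only the model's mean is depth-aware. That is the whole trick.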
7. A Mental Checksum¶
Ask this question:
"Could I plausibly simulate raw sequencing data from this representation?"
- Raw counts → yes
- Normalized counts → no
If you can't simulate sequencing, you're not doing generative biology — you're doing regression on a convenience transform.
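The checksum can be made literal (a sketch with made-up proportions): from per-gene proportions plus a depth you can draw a plausible raw profile; from CPM alone you cannot, because no depth remains to condition on.

```python
import numpy as np

rng = np.random.default_rng(5)

rho = rng.dirichlet(np.ones(50))   # per-gene proportions: a "cell state"
lib_size = 3_000                   # a depth to simulate at

# Raw-count representation -> simulatable: multinomial "sequencing" of reads
simulated = rng.multinomial(lib_size, rho)
print(simulated.sum())             # exactly lib_size, a genuine count vector

# CPM representation: rho * 1e6 with no library size attached.
# There is nothing left to draw from -- the random experiment is gone.
cpm = rho * 1e6
```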
8. Why This Matters for Diffusion Models¶
This becomes even more critical for:
- Score matching
- Diffusion models
- Likelihood-based evaluation
Those methods assume:
- The data distribution is real
- Noise has a physical interpretation
- The model can walk backward to plausible samples
Normalized data breaks that chain completely.
Bottom Line¶
Normalization answers analytical questions. Raw counts answer generative questions.
If your goal is:
- Counterfactual gene expression
- In silico experiments
- Realistic sampling
- Industry-grade generative models
Then raw counts + explicit library size is not a preference — it's a requirement.
This distinction is one of the big conceptual thresholds between using generative models and understanding them.