Choosing the Likelihood: NB vs ZINB for Gene Expression¶
This document explains why the choice of \(p(x \mid z)\) matters for VAEs in computational biology, and when to use Negative Binomial (NB) vs Zero-Inflated Negative Binomial (ZINB).
1. The Core Principle¶
Choosing \(p(x \mid z)\) means choosing a statistical noise model for your data.
Once you choose it, maximum likelihood training forces a specific loss function.
This is not an ML trick — it is straight probability theory.
2. What is \(x\) in Gene Expression Modeling?¶
For RNA-seq, your observed data \(x\) is usually one of:
- Raw counts: non-negative integers (e.g., gene A has 0, 3, 17 reads)
- Normalized/transformed values: log1p(TPM), logCPM, etc. (continuous)
This choice alone already restricts what likelihoods make sense.
3. Why Gaussian (→ MSE) is Often Wrong for Counts¶
A Gaussian likelihood assumes:
- Continuous data
- Symmetric noise around the mean
- Variance independent of the mean
- Support over negative values
Raw RNA-seq counts violate all of these.
If you write \(p(x \mid z) = \mathcal{N}(x;\ \hat{x}(z), \sigma^2 I)\), i.e., train with MSE, you are implicitly saying:
"Gene expression fluctuates symmetrically around a mean, with constant variance."
This is biologically false for counts.
MSE is only acceptable after heavy preprocessing (log transforms), and even then it's an approximation.
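To make the Gaussian-to-MSE link concrete, here is a minimal numpy sketch (the function name and toy values are illustrative, not from any pipeline): with a fixed variance, the Gaussian negative log-likelihood differs from squared error only by an additive constant, so both give the same gradients.

```python
# Minimal sketch (illustrative values): fixed-variance Gaussian NLL vs squared error.
import numpy as np

def gaussian_nll(x, x_hat, sigma=1.0):
    """Per-element negative log-likelihood of x under N(x_hat, sigma^2)."""
    return 0.5 * ((x - x_hat) ** 2) / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

x = np.array([0.0, 3.0, 17.0])      # raw counts treated (wrongly) as Gaussian
x_hat = np.array([1.0, 2.5, 20.0])  # a decoder's "reconstruction"

sq_err = (x - x_hat) ** 2
print(gaussian_nll(x, x_hat) - 0.5 * sq_err)  # constant 0.5*log(2*pi) everywhere -> same gradients
```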
4. The Negative Binomial Distribution: Standard Interpretation¶
4.1 Classical Definition¶
The Negative Binomial distribution has two standard interpretations:
Interpretation 1: Counting Failures
Count the number of failures before observing \(r\) successes, where each trial has success probability \(p\).
Interpretation 2: Gamma-Poisson Mixture (more relevant for ML)
A Poisson distribution whose rate \(\lambda\) is itself random, drawn from a Gamma distribution:
\(\lambda \sim \text{Gamma}(\text{shape}=\theta,\ \text{rate}=\theta/\mu), \qquad X \mid \lambda \sim \text{Poisson}(\lambda)\)
Marginalizing out \(\lambda\) gives \(X \sim \text{NB}(\mu, \theta)\), with mean \(\mu\) and inverse dispersion \(\theta\).
This second interpretation is why NB is called an overdispersed Poisson — it adds extra variance beyond what Poisson allows.
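A quick simulation makes the mixture view tangible. This is an illustrative numpy sketch (parameter values made up): draw \(\lambda\) from a Gamma with mean \(\mu\) and shape \(\theta\), then Poisson counts, and check that the marginal variance is \(\mu + \mu^2/\theta\), well above the mean.

```python
# Gamma-Poisson mixture simulation (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 5.0, 2.0, 200_000

lam = rng.gamma(shape=theta, scale=mu / theta, size=n)  # heterogeneous rates, E[lam] = mu
x = rng.poisson(lam)                                    # marginal counts are NB(mu, theta)

print(x.mean())            # ~ 5.0
print(x.var())             # ~ 17.5, i.e. mu + mu^2/theta, clearly overdispersed
print(mu + mu**2 / theta)  # theoretical NB variance
```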
4.2 Why NB is Used in ML¶
NB appears whenever you have:
- Count data with more variance than Poisson predicts
- Heterogeneity in the underlying rate (different cells, users, events)
- Clustering or "burstiness" in arrivals
Common applications:
| Domain | Phenomenon Modeled |
|---|---|
| Genomics | Gene expression counts (RNA-seq) |
| NLP | Word counts in documents |
| Epidemiology | Disease case counts |
| E-commerce | Purchase counts per user |
| Insurance | Claim counts per policy |
| Ecology | Species abundance counts |
In all cases, the key property is
\(\text{Var}(X) = \mu + \frac{\mu^2}{\theta} \;>\; \mu = \mathbb{E}[X]\)
This overdispersion (variance > mean) is ubiquitous in real count data.
4.3 NB in Neural Networks¶
In deep learning, NB is used as the output distribution when:
- The target is non-negative integer counts
- A Poisson assumption underestimates variance
- You want a proper probabilistic model (not just MSE)
The network predicts the parameters \((\mu, \theta)\), and the loss is the NB negative log-likelihood.
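A minimal PyTorch sketch of such an output head (module and variable names are ours, not a specific library's API): the mean is kept positive with a softplus, and the inverse dispersion \(\theta\) is a free per-feature parameter, also constrained positive.

```python
# Sketch of an NB output head (assumed names): predicts (mu, theta) per gene.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NBHead(nn.Module):
    def __init__(self, hidden_dim: int, n_genes: int):
        super().__init__()
        self.mu_layer = nn.Linear(hidden_dim, n_genes)
        self.log_theta = nn.Parameter(torch.zeros(n_genes))  # per-gene inverse dispersion

    def forward(self, h: torch.Tensor):
        mu = F.softplus(self.mu_layer(h))   # mean expression, > 0
        theta = torch.exp(self.log_theta)   # inverse dispersion, > 0
        return mu, theta

head = NBHead(hidden_dim=64, n_genes=2000)
mu, theta = head(torch.randn(8, 64))        # batch of 8 hidden states
print(mu.shape, theta.shape)                # torch.Size([8, 2000]) torch.Size([2000])
```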
5. Why Negative Binomial (NB) is the Default for RNA-seq¶
5.1 What NB Models¶
Negative Binomial models:
- Non-negative integer counts
- Variance that grows with the mean (overdispersion)
This matches RNA-seq extremely well.
Formally, for one gene \(g\):
\(x_g \sim \text{NB}(\mu_g, \alpha_g)\)
where:
- \(\mu_g\) = mean (predicted by the decoder)
- \(\alpha_g\) = dispersion parameter (playing the role of \(1/\theta\) from Section 4)
Key property — the mean-variance relationship:
\(\text{Var}(x_g) = \mu_g + \alpha_g \mu_g^2\)
This captures:
- Biological variability
- Technical noise
- Sequencing depth effects
5.2 How NB Becomes the Loss Function¶
In a VAE/cVAE, the decoder predicts \(\mu_g(z, y)\) (and possibly \(\alpha_g\)).
The training objective includes the reconstruction term
\(\mathbb{E}_{q_\phi(z \mid x)}\big[\textstyle\sum_g \log \text{NB}(x_g;\ \mu_g(z, y), \alpha_g)\big]\)
So the reconstruction loss is the negative log-likelihood of the NB distribution.
This replaces MSE. There is no freedom here:
- If you assume NB → maximum likelihood forces this loss
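As a sketch, here is one common way to write this reconstruction loss in PyTorch, using the inverse-dispersion parameterization \(\theta = 1/\alpha\) so that \(\text{Var}(x) = \mu + \mu^2/\theta\) (function name and toy numbers are ours):

```python
# NB negative log-likelihood as the reconstruction loss (sketch, theta = 1/alpha).
import torch

def nb_nll(x, mu, theta, eps=1e-8):
    """-log NB(x; mu, theta), summed over genes; one value per cell."""
    log_theta_mu = torch.log(theta + mu + eps)
    log_prob = (
        torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1.0)
        + theta * (torch.log(theta + eps) - log_theta_mu)   # theta * log(theta / (theta + mu))
        + x * (torch.log(mu + eps) - log_theta_mu)          # x * log(mu / (theta + mu))
    )
    return -log_prob.sum(dim=-1)

x = torch.tensor([[0.0, 3.0, 17.0]])    # observed counts for 3 genes in one cell
mu = torch.tensor([[0.5, 2.0, 15.0]])   # decoder means
theta = torch.tensor([2.0, 2.0, 2.0])   # per-gene inverse dispersion
print(nb_nll(x, mu, theta))             # reconstruction loss for this cell
```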
6. Why ZINB Exists (and When You Need It)¶
6.1 The Zero Problem¶
In single-cell RNA-seq, you see many zeros:
- Some are true biological zeros (gene not expressed)
- Many are dropout (gene expressed but not captured)
NB alone often underestimates zeros.
6.2 Zero-Inflated NB (ZINB)¶
ZINB says:
"Some zeros come from the NB process, but others come from an extra 'always zero' process."
Mathematically, for each gene:
\(p(x_g) = \pi_g\,\delta_0(x_g) + (1 - \pi_g)\,\text{NB}(x_g;\ \mu_g, \alpha_g)\)
where \(\delta_0\) is a point mass at zero.
The decoder now predicts three quantities:
| Parameter | Meaning |
|---|---|
| \(\mu_g(z, y)\) | Mean expression |
| \(\alpha_g\) | Dispersion |
| \(\pi_g(z, y)\) | Zero-inflation probability |
The likelihood of the observed expression is computed gene by gene using a zero-inflated Negative Binomial distribution whose parameters are predicted by the decoder network.
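A sketch of the corresponding ZINB negative log-likelihood (function name is ours; written for clarity rather than maximal numerical stability): a zero can come from the extra point mass with probability \(\pi_g\) or from the NB component, while positive counts can only come from the NB component.

```python
# ZINB negative log-likelihood (sketch; plain rather than fully stabilized form).
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """-log ZINB(x; mu, theta, pi), summed over genes; one value per cell."""
    log_theta_mu = torch.log(theta + mu + eps)
    nb_log_prob = (
        torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1.0)
        + theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
    )
    # x == 0: likelihood is pi + (1 - pi) * NB(0); x > 0: (1 - pi) * NB(x)
    log_prob_zero = torch.log(pi + (1.0 - pi) * torch.exp(nb_log_prob) + eps)
    log_prob_pos = torch.log(1.0 - pi + eps) + nb_log_prob
    log_prob = torch.where(x < eps, log_prob_zero, log_prob_pos)
    return -log_prob.sum(dim=-1)

x = torch.tensor([[0.0, 3.0, 17.0]])
mu = torch.tensor([[0.5, 2.0, 15.0]])
theta = torch.tensor([2.0, 2.0, 2.0])
pi = torch.tensor([[0.3, 0.1, 0.05]])   # per-gene zero-inflation probabilities
print(zinb_nll(x, mu, theta, pi))
```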
7. When to Use NB vs ZINB¶
Use NB if:¶
- Bulk RNA-seq
- Pseudo-bulk scRNA
- UMI-based scRNA with good depth
- Zeros are mostly explained by low expression
Use ZINB if:¶
- Sparse scRNA-seq
- Very shallow sequencing
- Dropout dominates zero counts
- NB underfits zeros badly
Practical advice: Many modern pipelines start with NB and only escalate to ZINB if diagnostics demand it.
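One simple diagnostic you could run (an illustrative sketch, not a standard named test): compare each gene's observed zero fraction with the zero probability \(P(X=0) = (\theta/(\theta+\mu))^\theta\) implied by a moment-matched NB fit. A large positive excess of observed zeros points toward ZINB.

```python
# Illustrative zero-excess diagnostic based on a per-gene moment fit of NB.
import numpy as np

def zero_excess(counts):
    """counts: (cells, genes) array of raw counts; returns observed - expected zero fraction."""
    mu = counts.mean(axis=0)
    var = counts.var(axis=0)
    theta = mu**2 / np.maximum(var - mu, 1e-8)        # from Var = mu + mu^2 / theta
    expected_zeros = (theta / (theta + mu)) ** theta  # NB P(X = 0)
    observed_zeros = (counts == 0).mean(axis=0)
    return observed_zeros - expected_zeros            # large positive -> zero inflation

rng = np.random.default_rng(1)
counts = rng.poisson(rng.gamma(2.0, 2.5, size=(500, 100)))  # NB-like toy data, no dropout
print(zero_excess(counts).max())                            # ~ 0: NB already explains the zeros
```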
8. Where Do NB/ZINB Apply? (Decoder Only)¶
Important clarification: NB and ZINB are used only in the decoder, not the encoder.
Why?¶
The VAE has two parts:
| Component | Input | Output | Distribution |
|---|---|---|---|
| Encoder | \(x\) (observed data) | \(z\) (latent) | \(q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi(x))\) |
| Decoder | \(z\) (latent) | \(x\) (reconstructed) | \(p_\theta(x \mid z) = \text{NB}(\mu_\theta(z), \alpha)\) |
- Encoder: Maps data to latent space. The latent \(z\) is continuous and Gaussian — this is required for the reparameterization trick.
- Decoder: Maps latent back to data space. The likelihood \(p(x \mid z)\) must match the data type — hence NB/ZINB for counts.
The Encoder Stays Gaussian¶
Even when modeling count data, the encoder remains \(q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi(x))\).
This is because:
- Reparameterization requires continuous distributions — you can't easily reparameterize discrete distributions
- The latent space is a learned representation — it doesn't need to match the data distribution
- KL divergence is tractable — Gaussian-to-Gaussian KL has a closed form
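For concreteness, a minimal Gaussian encoder with the reparameterization trick (a sketch with illustrative layer sizes and names):

```python
# Gaussian encoder sketch: count data in, continuous latent out, sampled differentiably.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, n_genes: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(torch.log1p(x))              # log1p tames the raw-count input scale
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                # epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps   # reparameterization: z = mu + sigma * eps
        return z, mu, log_var

enc = GaussianEncoder(n_genes=2000, latent_dim=10)
z, mu, log_var = enc(torch.poisson(torch.full((8, 2000), 3.0)))  # toy count batch
print(z.shape)                                                   # torch.Size([8, 10])
```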
Summary¶
┌─────────────────────────────────────────────────────────────┐
│ ENCODER: q(z|x) │
│ • Always Gaussian (for reparameterization) │
│ • Input: count data x │
│ • Output: continuous latent z │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ DECODER: p(x|z) │
│ • NB or ZINB (matches data type) │
│ • Input: continuous latent z │
│ • Output: count data x │
└─────────────────────────────────────────────────────────────┘
Why Gaussian Prior Even with NB/ZINB Decoder?¶
You might wonder: if we're modeling count data with NB/ZINB, why is the KL divergence still computed against a Gaussian prior \(p(z) = \mathcal{N}(0, I)\)?
Key insight: The latent space \(z\) is separate from the data space \(x\).
The NB/ZINB likelihood only affects \(p(x \mid z)\) — how we model the output. The latent \(z\) is a learned representation, not the data itself.
Why keep the latent Gaussian?
- Reparameterization trick requires continuous, differentiable sampling
  - You can't easily backpropagate through discrete distributions
  - Gaussian allows \(z = \mu + \sigma \cdot \epsilon\) where \(\epsilon \sim \mathcal{N}(0, I)\)
- The prior \(p(z) = \mathcal{N}(0, I)\) is a regularizer
  - Prevents the encoder from "cheating" by encoding each sample at a unique, far-away point
  - Forces the latent space to be smooth and interpolable
  - Without it, the VAE degenerates into a deterministic autoencoder
- Tractable KL divergence
  - Gaussian-to-Gaussian KL has a closed form: \(\text{KL}(q \| p) = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)\)
  - No sampling needed for the KL term
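That closed form is a one-liner in code; a small PyTorch sketch (function name ours):

```python
# Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder (sketch).
import torch

def kl_to_standard_normal(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Summed over latent dimensions; one value per sample in the batch."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```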
What would happen without the Gaussian prior?
- The encoder could map each training sample to a unique, isolated point
- The latent space would have "holes" — regions with no training data
- Sampling from \(p(z)\) would generate garbage
- No smooth interpolation between samples
Summary: The decoder likelihood (NB/ZINB) determines how we model the data. The encoder prior (Gaussian) determines how we regularize the latent space. These are independent design choices.
9. Summary: Likelihood → Loss¶
| Data Representation | Likelihood \(p(x \mid z)\) | Loss Function |
|---|---|---|
| log1p(TPM) | Gaussian | MSE |
| Counts (bulk) | NB | NB NLL |
| Counts (scRNA) | ZINB | ZINB NLL |
You are not choosing a "loss function" — you are choosing a data-generating assumption.
Note on `log1p(TPM)` notation: `log1p(x) = log(1 + x)`. Why use `log1p` instead of `log`?
- Handles zeros: `log1p(0) = log(1) = 0` (vs `log(0) = -∞`)
- Reduces skewness: compresses high values more than low values
- Stabilizes variance: makes variance less dependent on the mean
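A tiny numpy check of the zero-handling point (illustrative values):

```python
import numpy as np

x = np.array([0.0, 3.0, 17.0])
print(np.log1p(x))  # [0.         1.38629436 2.89037176] -- the zero stays finite
# np.log(x) would instead produce -inf (plus a warning) for the zero entry
```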
10. Why This Motivates Score Matching (Preview)¶
You're now seeing the pressure point:
- Biology data is messy
- NB vs ZINB is already a modeling compromise
- Wrong likelihood → biased gradients everywhere
Score matching later says:
"Let's stop committing to a precise likelihood."
But you cannot appreciate that move until you fully understand NB/ZINB — which is what this document covers.
11. Key Takeaway¶
Choosing \(p(x \mid z)\) is choosing how you believe noise enters your biological measurements; maximum likelihood then forces the corresponding loss.
Next Steps¶
The next document (VAE-08-NB-likelihood.md) covers:
- Writing out the NB log-likelihood term explicitly
- Comparing it side by side with MSE
- Interpreting each term biologically (mean, dispersion, zeros)
References¶
- VAE-02-elbo.md — ELBO derivation
- VAE-08-NB-likelihood.md — NB log-likelihood derivation
- Lopez et al. (2018) — "Deep generative modeling for single-cell transcriptomics" (scVI)
- Eraslan et al. (2019) — "Single-cell RNA-seq denoising using a deep count autoencoder" (DCA)
- Cameron & Trivedi (2013) — "Regression Analysis of Count Data"