Denoising Diffusion Probabilistic Models (DDPM): Foundations¶
This document provides a comprehensive mathematical introduction to DDPM, the foundational discrete-time diffusion model introduced by Ho et al. (2020).
Overview¶
Denoising Diffusion Probabilistic Models (DDPM) are a class of generative models that learn to generate data by reversing a gradual noising process. They achieve state-of-the-art results in image generation and have been successfully applied to various domains including gene expression, protein design, and molecular generation.
Key Idea¶
- Forward process: Gradually add Gaussian noise to data over \(T\) steps until it becomes pure noise
- Reverse process: Learn to denoise, step by step, starting from pure noise
- Training: Predict the noise added at each step (equivalently, predict the score)
Why DDPM Matters¶
- Theoretical foundation: Connects variational inference, score matching, and SDEs
- Training stability: Simple MSE loss, no adversarial training
- Sample quality: State-of-the-art FID scores on image generation
- Flexibility: Works for continuous, discrete, and structured data
- Interpretability: Clear probabilistic interpretation via ELBO
The Forward Process (Data → Noise)¶
Definition¶
The forward process is a fixed Markov chain that gradually adds Gaussian noise to data \(x_0 \sim q(x_0)\):

\[
q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)
\]

where:
- \(t = 1, 2, \ldots, T\) (typically \(T = 1000\))
- \(\beta_t \in (0, 1)\) is the variance schedule (how much noise to add)
- \(\beta_1 < \beta_2 < \cdots < \beta_T\) (increasing noise)
Intuition¶
At each step:

- Keep \(\sqrt{1 - \beta_t}\) of the previous signal
- Add \(\sqrt{\beta_t}\) of fresh noise
As \(t \to T\), the signal becomes pure Gaussian noise.
Reparameterization¶
Using the reparameterization trick:

\[
x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0, I)
\]
Closed-Form Forward Process¶
Key Insight¶
Because each step is Gaussian and the process is Markovian, we can jump directly from \(x_0\) to any \(x_t\) without computing intermediate steps.
Notation¶
Define:

- \(\alpha_t := 1 - \beta_t\) (signal retention coefficient)
- \(\bar{\alpha}_t := \prod_{s=1}^t \alpha_s\) (cumulative product)
Closed-Form Marginal¶
\[
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)
\]

Reparameterization form:

\[
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\]
Derivation¶
By induction on \(t\).

Base case (\(t = 1\)):

\[
x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon_0 = \sqrt{\bar{\alpha}_1}\, x_0 + \sqrt{1 - \bar{\alpha}_1}\, \bar{\epsilon}_1
\]

since \(\bar{\alpha}_1 = \alpha_1\).

Inductive step: Assume \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{\epsilon}_t\).

Then:

\[
x_{t+1} = \sqrt{\alpha_{t+1}}\, x_t + \sqrt{1 - \alpha_{t+1}}\, \epsilon_t
= \sqrt{\alpha_{t+1} \bar{\alpha}_t}\, x_0 + \sqrt{\alpha_{t+1}(1 - \bar{\alpha}_t)}\, \bar{\epsilon}_t + \sqrt{1 - \alpha_{t+1}}\, \epsilon_t
\]

The two noise terms combine (sum of independent Gaussians):

\[
\sqrt{\alpha_{t+1}(1 - \bar{\alpha}_t)}\, \bar{\epsilon}_t + \sqrt{1 - \alpha_{t+1}}\, \epsilon_t
\sim \mathcal{N}\!\left(0,\ \left[\alpha_{t+1}(1 - \bar{\alpha}_t) + (1 - \alpha_{t+1})\right] I\right)
\]

Simplify the variance:

\[
\alpha_{t+1}(1 - \bar{\alpha}_t) + 1 - \alpha_{t+1} = 1 - \alpha_{t+1}\bar{\alpha}_t = 1 - \bar{\alpha}_{t+1}
\]

Therefore:

\[
x_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t+1}}\, \bar{\epsilon}_{t+1}, \qquad \bar{\epsilon}_{t+1} \sim \mathcal{N}(0, I)
\]
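The induction above can be checked numerically: iterate the per-step recursion and track the coefficient on \(x_0\) and the accumulated noise variance, then compare with the closed form. This is a minimal NumPy sketch (the linear schedule values are illustrative, not prescribed by the derivation):

```python
import numpy as np

# Iterate x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps, tracking the
# coefficient on x_0 and the total noise variance, and compare with the
# closed-form sqrt(alpha_bar_t) and 1 - alpha_bar_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

signal_coef = 1.0   # coefficient multiplying x_0
noise_var = 0.0     # variance of the accumulated noise
for t in range(T):
    # each step scales the signal by sqrt(alpha_t) ...
    signal_coef *= np.sqrt(alphas[t])
    # ... scales the old noise variance by alpha_t and adds beta_t fresh noise
    noise_var = alphas[t] * noise_var + betas[t]

assert np.isclose(signal_coef, np.sqrt(alpha_bars[-1]))
assert np.isclose(noise_var, 1.0 - alpha_bars[-1])
```

At \(t = T\) the signal coefficient is nearly zero and the noise variance nearly one, matching the claim that \(x_T\) is approximately pure Gaussian noise.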
The Reverse Process (Noise → Data)¶
Goal¶
Learn to reverse the forward process: \(p_\theta(x_{t-1} \mid x_t)\).
If we can sample \(x_T \sim \mathcal{N}(0, I)\) and iteratively apply the reverse process, we can generate new data.
Parameterization¶
Model the reverse process as a Markov chain:

\[
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
\]

where:

- \(p(x_T) = \mathcal{N}(x_T; 0, I)\) is the prior
- \(\mu_\theta\) and \(\Sigma_\theta\) are outputs of a neural network with parameters \(\theta\)
Key Question¶
What should \(\mu_\theta\) and \(\Sigma_\theta\) be?
Answer: We'll derive them from the posterior \(q(x_{t-1} \mid x_t, x_0)\).
Posterior Distribution¶
Bayes' Rule¶
Using Bayes' rule:

\[
q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
\]
Since the forward process is Markovian, \(q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})\).
Gaussian Posterior¶
All three terms are Gaussian, so the posterior is also Gaussian. We can compute it in closed form.
Result:

\[
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
\]

where:

\[
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
\]
Derivation¶
We have:

- \(q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t} x_{t-1}, \beta_t I)\)
- \(q(x_{t-1} \mid x_0) = \mathcal{N}(x_{t-1}; \sqrt{\bar{\alpha}_{t-1}} x_0, (1 - \bar{\alpha}_{t-1}) I)\)
- \(q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I)\)
Using the formula for the product of Gaussians (completing the square), we get the result above.
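The completed square can be verified numerically via the precision-weighted Gaussian product. This sketch (not from the paper's code; the schedule, timestep, and test values are illustrative) checks both the posterior variance and the posterior mean at one timestep:

```python
import numpy as np

# Check tilde_beta_t and tilde_mu_t against the Gaussian product formulas.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                              # arbitrary interior timestep (0-indexed)
beta_t, alpha_t = betas[t], alphas[t]
abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]

# Closed-form posterior variance from the text
beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t

# Gaussian product: posterior precision = likelihood precision + prior precision.
# As a function of x_{t-1}, q(x_t | x_{t-1}) has precision alpha_t / beta_t.
precision = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
assert np.isclose(beta_tilde, 1.0 / precision)

# Posterior mean = variance * (precision-weighted means); compare with the
# closed-form weighted average for a concrete (x_0, x_t).
x0, xt = 0.7, -0.3
mu_tilde = (np.sqrt(abar_prev) * beta_t / (1 - abar_t) * x0
            + np.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t) * xt)
mu_product = beta_tilde * (np.sqrt(alpha_t) / beta_t * xt
                           + np.sqrt(abar_prev) / (1 - abar_prev) * x0)
assert np.isclose(mu_tilde, mu_product)
```

Note that \(\tilde{\beta}_t < \beta_t\): conditioning on \(x_0\) always shrinks the variance relative to a single forward step.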
Key insight: The posterior mean is a weighted average of predictions from \(x_t\) and \(x_0\).
Training Objective: The ELBO¶
Variational Lower Bound¶
DDPM is trained by maximizing the evidence lower bound (ELBO), or equivalently minimizing the variational bound \(L_{\mathrm{VLB}}\):

\[
\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L_{\mathrm{VLB}}
\]
Decomposition¶
The ELBO can be decomposed into three kinds of terms:

\[
L_{\mathrm{VLB}} = \underbrace{D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
+ \sum_{t=2}^{T} \underbrace{\mathbb{E}_q\!\left[D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\right]}_{L_{t-1}}
+ \underbrace{\mathbb{E}_q\!\left[-\log p_\theta(x_0 \mid x_1)\right]}_{L_0}
\]
Interpretation:
- \(L_T\): How close is \(q(x_T \mid x_0)\) to \(p(x_T) = \mathcal{N}(0, I)\)? (Usually negligible)
- \(L_{t-1}\): How well does \(p_\theta\) match the true posterior \(q\)?
- \(L_0\): Reconstruction term (discrete decoder or continuous likelihood)
Simplification¶
Since both \(q(x_{t-1} \mid x_t, x_0)\) and \(p_\theta(x_{t-1} \mid x_t)\) are Gaussian, the KL divergence has a closed form. With the variance fixed to \(\Sigma_\theta(x_t, t) = \sigma_t^2 I\) (Ho et al. use \(\sigma_t^2 = \beta_t\) or \(\sigma_t^2 = \tilde{\beta}_t\)):

\[
L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C
\]

where \(C\) is a constant that does not depend on \(\theta\).
Noise Prediction Parameterization¶
Key Insight¶
Instead of directly predicting \(\mu_\theta(x_t, t)\), we can predict the noise \(\epsilon\) that was added.
Reparameterization¶
Recall:

\[
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
\]

Solving for \(x_0\):

\[
x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\right)
\]

Substituting into \(\tilde{\mu}_t\):

\[
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon\right)
\]
Noise Prediction Network¶
Define:

\[
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)
\]
where \(\epsilon_\theta(x_t, t)\) is a neural network that predicts the noise.
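The substitution can be verified numerically: the \((x_0, x_t)\)-form and the \((x_t, \epsilon)\)-form of \(\tilde{\mu}_t\) must agree whenever \(x_t\) was produced by the closed-form forward process. A small sketch (schedule and timestep illustrative):

```python
import numpy as np

# Check that eliminating x_0 from tilde_mu via
# x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
# gives the epsilon-parameterized mean used to define mu_theta.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 300
beta_t, alpha_t = betas[t], alphas[t]
abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]

rng = np.random.default_rng(0)
x0, eps = rng.normal(), rng.normal()
xt = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps  # closed-form forward sample

# tilde_mu in terms of (x_0, x_t)
mu_x0_form = (np.sqrt(abar_prev) * beta_t / (1 - abar_t) * x0
              + np.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t) * xt)
# tilde_mu in terms of (x_t, eps) after the substitution
mu_eps_form = (xt - beta_t / np.sqrt(1 - abar_t) * eps) / np.sqrt(alpha_t)

assert np.isclose(mu_x0_form, mu_eps_form)
```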
Simplified Loss¶
Dropping the per-timestep weights (as Ho et al., 2020 do empirically), the ELBO simplifies to:

\[
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]
\]
where:
- \(t \sim \text{Uniform}(\{1, \ldots, T\})\)
- \(x_0 \sim q(x_0)\)
- \(\epsilon \sim \mathcal{N}(0, I)\)
- \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\)
This is just MSE on noise prediction!
Algorithm Summary¶
Training¶
1. Sample x_0 ~ q(x_0)
2. Sample t ~ Uniform({1, ..., T})
3. Sample ε ~ N(0, I)
4. Compute x_t = sqrt(α_bar_t) * x_0 + sqrt(1 - α_bar_t) * ε
5. Compute loss = ||ε - ε_θ(x_t, t)||²
6. Update θ via gradient descent
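The training steps above can be sketched in NumPy. Here `eps_theta` is a stand-in for the denoising network (a U-Net in Ho et al., 2020), and the data batch is toy data; a real implementation would use an autodiff framework for step 6:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # placeholder predictor; a real model is a neural network of (x_t, t)
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))                  # step 1: sample a data batch
t = rng.integers(0, T, size=(4, 1))           # step 2: one timestep per example
eps = rng.normal(size=x0.shape)               # step 3: sample noise
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps  # step 4
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)  # step 5: L_simple
# step 6 (the gradient update of theta) needs autodiff and is omitted here
```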
Sampling¶
1. Sample x_T ~ N(0, I)
2. For t = T, T-1, ..., 1:
a. Predict noise: ε_θ(x_t, t)
b. Compute mean: μ_θ = (1/sqrt(α_t)) * (x_t - (β_t/sqrt(1-α_bar_t)) * ε_θ)
c. Sample: x_{t-1} = μ_θ + σ_t * z (z ~ N(0,I) if t>1, else z=0)
3. Return x_0
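The sampling loop can likewise be sketched in NumPy, again with a placeholder noise predictor; \(\sigma_t^2 = \beta_t\) is one of the two variance choices from the paper:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigmas = np.sqrt(betas)            # sigma_t^2 = beta_t

def eps_theta(x_t, t):
    return np.zeros_like(x_t)      # stand-in for the trained network

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # step 1: x_T ~ N(0, I)
for t in range(T - 1, -1, -1):     # step 2: t = T, ..., 1 (0-indexed arrays)
    eps_hat = eps_theta(x, t)                                        # 2a
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
           / np.sqrt(alphas[t])                                      # 2b
    z = rng.normal(size=x.shape) if t > 0 else np.zeros_like(x)
    x = mean + sigmas[t] * z                                         # 2c
# step 3: x is the generated x_0 (meaningless here, since the predictor
# is a placeholder rather than a trained network)
```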
Variance Schedule¶
Linear Schedule (Original DDPM)¶
\[
\beta_t = \beta_1 + \frac{t - 1}{T - 1}\,(\beta_T - \beta_1)
\]

Typical values: \(\beta_1 = 10^{-4}\), \(\beta_T = 0.02\)
Cosine Schedule (Improved DDPM)¶
\[
\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad \beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}
\]

where \(s = 0.008\) is a small offset.
Advantages:
- More uniform signal-to-noise ratio across timesteps
- Better sample quality
- Fewer steps needed
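The two schedules can be compared through \(\bar{\alpha}_t\). A short sketch (the specific endpoint values are the conventional ones, not requirements):

```python
import numpy as np

T = 1000

# Linear: beta_t interpolated from 1e-4 to 0.02
betas_linear = np.linspace(1e-4, 0.02, T)
abar_linear = np.cumprod(1.0 - betas_linear)

# Cosine: define alpha_bar_t directly, then recover beta_t from the ratio
# of consecutive alpha_bars, clipped for numerical stability.
s = 0.008
f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
steps = np.arange(T + 1)
abar_cosine = f(steps) / f(0)
betas_cosine = np.clip(1.0 - abar_cosine[1:] / abar_cosine[:-1], 0.0, 0.999)

# Both schedules destroy essentially all signal by t = T; the cosine
# schedule is designed so alpha_bar_t falls off more gradually near t = 0.
print(abar_linear[-1], abar_cosine[-1])
```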
Connection to Score Matching¶
Score Function¶
The score is the gradient of the log density:

\[
s(x_t) = \nabla_{x_t} \log q(x_t)
\]
Equivalence¶
Predicting the noise \(\epsilon\) is equivalent to predicting the score (up to a time-dependent scaling factor):

\[
\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
\quad \Longrightarrow \quad
s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
\]
Therefore: DDPM training is score matching with a specific noise schedule.
Summary¶
We derived DDPM from first principles:
- Forward process: Fixed Markov chain adding Gaussian noise
- Closed-form marginal: \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\)
- Reverse process: Learned Gaussian transitions
- Training objective: Variational lower bound (ELBO)
- Simplified loss: MSE on noise prediction
- Connection to score matching: Noise prediction ≈ score prediction
Key insights:
- Training is simple: predict the noise that was added
- Sampling is iterative denoising
- The model learns the score function implicitly
- DDPM is a discrete-time approximation of continuous SDEs
Related Documents¶
- DDPM Training Details — Loss functions, architectures, conditioning
- DDPM Sampling Methods — DDPM, DDIM, ancestral sampling
- From DDPM to VP-SDE — Continuous-time limit
- VP-SDE to DDPM — Discretization perspective
References¶
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Sohl-Dickstein, J., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML.
- Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS.
- Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML.