Deriving DDPM from the VP-SDE

This document shows how the discrete DDPM algorithm emerges naturally from the continuous VP-SDE through Euler–Maruyama discretization. We'll derive both the forward noising process and the reverse denoising process, making explicit why DDPM predicts noise rather than the score directly.

Overview

We'll proceed in two parts:

Forward VP-SDE → DDPM forward Markov chain (Euler–Maruyama discretization)
Reverse VP-SDE → DDPM reverse step (why the network predicts noise/score)

Notation

Let's establish precise notation to avoid confusion:

Symbol	Meaning
\(x(t) \in \mathbb{R}^d\)	Data state at continuous time \(t \in [0, T]\)
\(0 = t_0 < t_1 < \cdots < t_N = T\)	Discrete time grid
\(\Delta t_k := t_{k+1} - t_k\)	Time step size
\(w(t)\)	Brownian motion (Wiener process)
\(\Delta w_k := w(t_{k+1}) - w(t_k)\)	Brownian increment, \(\sim \mathcal{N}(0, \Delta t_k I)\)
\(dw(t)\)	Infinitesimal increment: \(dw \approx \sqrt{dt}\,\varepsilon\), \(\varepsilon \sim \mathcal{N}(0, I)\)
\(\beta(t) \geq 0\)	Noise rate schedule (chosen by designer)

Part A: Forward VP-SDE → DDPM Forward Noising

Step 1: The Variance-Preserving SDE

The variance-preserving SDE (VP-SDE) is:

\[ dx(t) = -\frac{1}{2}\beta(t) x(t)\,dt + \sqrt{\beta(t)}\,dw(t) \]

This SDE has:

Drift: \(f(x, t) = -\frac{1}{2}\beta(t) x\)
Diffusion coefficient: \(g(t) = \sqrt{\beta(t)}\)

Step 2: Apply Euler–Maruyama Discretization

Euler–Maruyama is a numerical method for discretizing SDEs. For a general SDE:

\[ dx = f(x, t)\,dt + g(t)\,dw \]

The discrete update from \(t_k \to t_{k+1}\) is:

\[ x_{k+1} = x_k + f(x_k, t_k)\,\Delta t_k + g(t_k)\,\Delta w_k \]

where \(\Delta w_k := w(t_{k+1}) - w(t_k) \sim \mathcal{N}(0, \Delta t_k I)\).

Rewrite the Brownian increment:

\[ \Delta w_k = \sqrt{\Delta t_k}\,\varepsilon_k, \quad \varepsilon_k \sim \mathcal{N}(0, I) \]

Then:

\[ x_{k+1} = x_k + f(x_k, t_k)\,\Delta t_k + g(t_k)\sqrt{\Delta t_k}\,\varepsilon_k \]
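The update above translates almost line for line into code. The following is a minimal NumPy sketch, not a reference implementation: the function name `euler_maruyama_step` and the constant-rate example `beta(t) = 1` are assumptions made for illustration.

```python
import numpy as np

def euler_maruyama_step(x, t, dt, f, g, rng):
    """One Euler-Maruyama update x_k -> x_{k+1} for dx = f(x, t) dt + g(t) dw."""
    eps = rng.standard_normal(x.shape)            # eps_k ~ N(0, I)
    return x + f(x, t) * dt + g(t) * np.sqrt(dt) * eps

# Hypothetical example: a VP-SDE with constant rate beta(t) = 1.
beta = lambda t: 1.0
f = lambda x, t: -0.5 * beta(t) * x               # drift
g = lambda t: np.sqrt(beta(t))                    # diffusion coefficient

rng = np.random.default_rng(0)
x = np.ones(4)
x_next = euler_maruyama_step(x, t=0.0, dt=1e-3, f=f, g=g, rng=rng)
```

Passing `f` and `g` as callables keeps the step generic, so the same function can be reused for any SDE of this form.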

Step 3: Plug in VP-SDE Components

Substitute \(f(x, t) = -\frac{1}{2}\beta(t) x\) and \(g(t) = \sqrt{\beta(t)}\):

\[ x_{k+1} = x_k - \frac{1}{2}\beta(t_k) x_k\,\Delta t_k + \sqrt{\beta(t_k)}\sqrt{\Delta t_k}\,\varepsilon_k \]

Factor out \(x_k\):

\[ x_{k+1} = \left(1 - \frac{1}{2}\beta(t_k)\Delta t_k\right) x_k + \sqrt{\beta(t_k)\Delta t_k}\,\varepsilon_k \]

Define discrete noise parameter:

\[ \beta_k := \beta(t_k)\,\Delta t_k \]

Then:

\[ \boxed{x_{k+1} = \left(1 - \frac{1}{2}\beta_k\right) x_k + \sqrt{\beta_k}\,\varepsilon_k} \quad \text{(Euler–Maruyama form)} \]

This is already a "diffusion-like" forward step!
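To make the definition \(\beta_k := \beta(t_k)\,\Delta t_k\) concrete, the sketch below turns a continuous schedule into discrete noise parameters. The linear schedule and the values `beta_min = 0.1`, `beta_max = 20`, `T = 1`, `N = 1000` are assumptions chosen for illustration; the derivation itself does not fix them.

```python
import numpy as np

def discrete_betas(beta_fn, t_grid):
    """Return beta_k = beta(t_k) * (t_{k+1} - t_k) for k = 0, ..., N-1."""
    dt = np.diff(t_grid)                  # Delta t_k, length N
    return beta_fn(t_grid[:-1]) * dt

# Assumed schedule: beta(t) rises linearly from beta_min to beta_max on [0, T].
beta_min, beta_max, T, N = 0.1, 20.0, 1.0, 1000
beta_fn = lambda t: beta_min + (beta_max - beta_min) * t / T
t_grid = np.linspace(0.0, T, N + 1)       # 0 = t_0 < t_1 < ... < t_N = T

betas = discrete_betas(beta_fn, t_grid)
print(betas.min(), betas.max())
```

With these assumed values every \(\beta_k\) is well below 1, which is the regime in which the first-order arguments in the next step apply.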

Step 4: Why DDPM Uses \(\sqrt{1-\beta_k}\) Instead

The actual DDPM forward step is written as:

\[ \boxed{x_{k+1} = \sqrt{1-\beta_k}\,x_k + \sqrt{\beta_k}\,\varepsilon_k} \]

Where did \(\sqrt{1-\beta_k}\) come from instead of \(1 - \frac{1}{2}\beta_k\)?

Answer: It is a variance-preserving modification whose Taylor expansion agrees with the Euler–Maruyama coefficient to first order in \(\beta_k\):

\[ \sqrt{1-\beta_k} = 1 - \frac{1}{2}\beta_k + O(\beta_k^2) \]

Comparison:

Form	Accuracy	Variance Control
\(1 - \frac{1}{2}\beta_k\)	First-order accurate	Approximate
\(\sqrt{1-\beta_k}\)	First-order accurate	Exact

DDPM uses \(\sqrt{1-\beta_k}\) because:

It agrees with Euler–Maruyama to first order
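It exactly controls variance in discrete time: if \(\mathrm{Var}(x_k) = I\), then \(\mathrm{Var}(x_{k+1}) = (1-\beta_k) I + \beta_k I = I\), whereas the Euler coefficient gives \(\left(1 - \frac{1}{2}\beta_k\right)^2 I + \beta_k I = \left(1 + \frac{1}{4}\beta_k^2\right) I\)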
It keeps the process well-behaved when \(\beta_k\) isn't infinitesimal
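The difference between the two coefficients can be checked without any sampling by propagating the variance through the linear update. This is a small sketch under assumptions: the helper `variance_after_steps` is hypothetical, and the constant \(\beta_k = 0.02\) is chosen only to make the effect visible.

```python
import numpy as np

def variance_after_steps(coeff_fn, betas, v0=1.0):
    """Propagate a per-coordinate variance through x -> a * x + sqrt(beta) * eps."""
    v = v0
    for b in betas:
        a = coeff_fn(b)
        v = a**2 * v + b              # Var(a x + sqrt(b) eps) = a^2 Var(x) + b
    return v

betas = np.full(1000, 0.02)           # assumed constant beta_k, for illustration

v_euler = variance_after_steps(lambda b: 1.0 - 0.5 * b, betas)   # Euler-Maruyama coefficient
v_ddpm = variance_after_steps(lambda b: np.sqrt(1.0 - b), betas) # DDPM coefficient
print(v_euler, v_ddpm)
```

The square-root coefficient keeps the variance at 1 up to floating-point rounding, while the Euler coefficient settles slightly above 1, at the fixed point \(1/(1 - \beta_k/4)\) of the recursion.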

DDPM notation: Define \(\alpha_k := 1 - \beta_k\). Then:

\[ q(x_{k+1} \mid x_k) = \mathcal{N}\left(\sqrt{\alpha_k} x_k, (1-\alpha_k) I\right) \]

Sampling form:

\[ x_{k+1} = \sqrt{\alpha_k}\,x_k + \sqrt{1-\alpha_k}\,\varepsilon_k \]

Key insight: DDPM's forward chain is a renormalized Euler step that preserves variance exactly.
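As an illustration of the sampling form, here is a minimal sketch of the forward chain applied to toy data. The function name `ddpm_forward_step`, the linearly spaced `betas`, and the unit-variance Gaussian "data" are all assumptions made for this example.

```python
import numpy as np

def ddpm_forward_step(x_k, beta_k, eps_k):
    """Sample x_{k+1} from q(x_{k+1} | x_k) = N(sqrt(alpha_k) x_k, (1 - alpha_k) I)."""
    alpha_k = 1.0 - beta_k
    return np.sqrt(alpha_k) * x_k + np.sqrt(1.0 - alpha_k) * eps_k

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed discrete schedule
x = rng.standard_normal((5000, 2))        # toy unit-variance "data"

for beta_k in betas:
    x = ddpm_forward_step(x, beta_k, rng.standard_normal(x.shape))

print(x.var())    # remains close to 1, up to sampling error
```

The noise is passed in explicitly rather than drawn inside the function, which makes the step easy to test with a fixed `eps_k`.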

Part B: Reverse VP-SDE → DDPM Reverse Denoising

Step 1: The Reverse-Time SDE Formula

For a forward SDE:

\[ dx = f(x, t)\,dt + g(t)\,dw \]

The reverse-time SDE (running from \(T \to 0\), so \(dt\) is a negative time increment) is:

\[ dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w} \]

where:

\(p_t(x)\) is the marginal density of \(x(t)\)
\(\nabla_x \log p_t(x)\) is the score (gradient of log density)
\(d\bar{w}\) is Brownian noise in reverse time

Step 2: Apply to VP-SDE

Substitute \(f = -\frac{1}{2}\beta(t) x\) and \(g^2 = \beta(t)\):

\[ \boxed{dx = \left[-\frac{1}{2}\beta(t) x - \beta(t)\nabla_x \log p_t(x)\right]dt + \sqrt{\beta(t)}\,d\bar{w}} \]

This is the reverse diffusion equation.

Key observation: The only unknown term is the score \(\nabla_x \log p_t(x)\).

Solution: Learn a neural network:

\[ s_\theta(x, t) \approx \nabla_x \log p_t(x) \]

Step 3: Discretize the Reverse SDE

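Apply Euler–Maruyama again, but stepping backward in time. Because the time increment is negative (\(t_{k-1} - t_k = -\Delta t\)), the drift term is subtracted, while the noise term keeps its magnitude \(\sqrt{\Delta t}\):

\[ x_{k-1} \approx x_k - \left[-\frac{1}{2}\beta_k x_k - \beta_k s_\theta(x_k, t_k)\right] + \sqrt{\beta_k}\,z_k = x_k + \frac{1}{2}\beta_k x_k + \beta_k s_\theta(x_k, t_k) + \sqrt{\beta_k}\,z_k \]

where \(z_k \sim \mathcal{N}(0, I)\).

(Note: We've absorbed the \(\Delta t\) factors into \(\beta_k\) to match DDPM convention.)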

This is the SDE-solver view of reverse sampling.
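A backward Euler–Maruyama step can be sketched as follows. Since the step runs from \(t_k\) to \(t_{k-1}\), the time increment is negative, so the bracketed reverse drift \(f - g^2 s_\theta\) is subtracted, giving \(x_{k-1} = x_k + \frac{1}{2}\beta_k x_k + \beta_k s_\theta + \sqrt{\beta_k}\,z_k\). The function name `reverse_em_step`, the constant \(\beta_k = 0.01\), and the toy score are assumptions made for illustration; in practice `score_fn` would be a trained network.

```python
import numpy as np

def reverse_em_step(x_k, beta_k, score_fn, t_k, z_k):
    """One backward Euler-Maruyama step x_k -> x_{k-1} of the reverse VP-SDE.

    beta_k already contains the step size, i.e. beta_k = beta(t_k) * |dt|.
    """
    drift = -0.5 * beta_k * x_k - beta_k * score_fn(x_k, t_k)   # [f - g^2 s] * |dt|
    return x_k - drift + np.sqrt(beta_k) * z_k                  # minus sign: dt < 0

# Toy check (assumption): if the data are N(0, I), every VP-SDE marginal is N(0, I)
# and the exact score is -x, so no trained network is needed here.
score_fn = lambda x, t: -x

rng = np.random.default_rng(0)
x = rng.standard_normal((5000, 2))        # start from the N(0, I) prior
for k in reversed(range(1000)):
    x = reverse_em_step(x, 0.01, score_fn, k, rng.standard_normal(x.shape))

print(x.var())    # remains close to 1, as expected for N(0, I) data
```

In this toy setting the step reduces to \(x_{k-1} = (1 - \frac{1}{2}\beta_k)x_k + \sqrt{\beta_k}\,z_k\), the same form as the forward Euler–Maruyama step, which is consistent with \(\mathcal{N}(0, I)\) being stationary under the VP-SDE.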

Step 4: Connection to DDPM's Learned Gaussian

DDPM doesn't present sampling as "Euler–Maruyama on the reverse SDE." Instead, it presents it as a learned Gaussian transition:

\[ p_\theta(x_{k-1} \mid x_k) = \mathcal{N}\left(\mu_\theta(x_k, k), \Sigma_k\right) \]

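These are consistent: an Euler–Maruyama step of an SDE is a Gaussian update where:

Mean: current state plus the drift multiplied by the (signed) time step
Covariance: squared diffusion coefficient times the step size, here \(g(t_k)^2\,\Delta t\, I = \beta_k I\)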
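To see the consistency concretely, one can expand the DDPM mean to first order in \(\beta_k\). The sketch below assumes the standard noise-prediction mean from the DDPM literature (not derived in this document), rewritten in terms of the score via \(s_\theta = -\varepsilon_\theta / \sqrt{1-\bar{\alpha}_k}\). Index conventions differ between papers, so the subscripts should be read as schematic:

\[ \mu_\theta(x_k, k) = \frac{1}{\sqrt{1-\beta_k}}\left(x_k + \beta_k\,s_\theta(x_k, t_k)\right) = \left(1 + \frac{1}{2}\beta_k + O(\beta_k^2)\right)\left(x_k + \beta_k\,s_\theta(x_k, t_k)\right) = x_k + \frac{1}{2}\beta_k x_k + \beta_k\,s_\theta(x_k, t_k) + O(\beta_k^2) \]

To first order this is the mean of a backward Euler–Maruyama step of the reverse VP-SDE, where the negative time increment turns the drift \(-\frac{1}{2}\beta_k x_k - \beta_k s_\theta\) into \(+\frac{1}{2}\beta_k x_k + \beta_k s_\theta\). The choice \(\Sigma_k = \beta_k I\) likewise matches the squared diffusion coefficient times the step size.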

Step 5: Why Predict Noise Instead of Score?

The remaining question: Why parameterize via noise prediction \(\varepsilon_\theta\) instead of score \(s_\theta\)?

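Answer: Composing the forward steps gives a closed-form marginal. With \(\bar{\alpha}_k := \prod_{j=0}^{k-1} \alpha_j\) and \(\varepsilon \sim \mathcal{N}(0, I)\):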

\[ x_k = \sqrt{\bar{\alpha}_k} x_0 + \sqrt{1 - \bar{\alpha}_k}\,\varepsilon \]

The conditional score has a clean identity:

\[ \nabla_{x_k} \log q(x_k \mid x_0) = -\frac{1}{\sqrt{1 - \bar{\alpha}_k}}\,\varepsilon \]

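Therefore: If a network predicts \(\varepsilon\), it is (up to scaling) predicting the score! More precisely, the marginal score is the conditional score averaged over \(x_0\) given \(x_k\), \(\nabla_{x_k} \log q(x_k) = \mathbb{E}\left[\nabla_{x_k} \log q(x_k \mid x_0) \mid x_k\right]\), so the mean-squared-error-optimal noise predictor \(\mathbb{E}[\varepsilon \mid x_k]\) equals \(-\sqrt{1 - \bar{\alpha}_k}\) times the marginal score.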
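The conditional score identity can be verified numerically by differentiating the Gaussian log-density by hand. In this sketch the schedule `betas`, the step index `k = 500`, and the random `x0` are assumptions for illustration; `alpha_bar[k - 1]` holds the product of the first `k` values of \(\alpha_j\), matching the indexing \(x_{k+1} = \sqrt{\alpha_k}\,x_k + \ldots\) used in Part A.

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.linspace(1e-4, 0.02, 1000)     # assumed discrete schedule
alpha_bar = np.cumprod(1.0 - betas)       # running products of alpha_j = 1 - beta_j

k = 500
abar = alpha_bar[k - 1]                   # product of alpha_0, ..., alpha_{k-1}
x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)
x_k = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

# Score of q(x_k | x_0) = N(sqrt(abar) x_0, (1 - abar) I) is -(x - mean) / variance.
score_from_density = -(x_k - np.sqrt(abar) * x0) / (1.0 - abar)
score_from_noise = -eps / np.sqrt(1.0 - abar)

print(np.allclose(score_from_density, score_from_noise))   # True
```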

The Bridge Between Views

View	What to Learn	Relationship
SDE view	Score \(s_\theta(x, t)\)	Direct
DDPM view	Noise \(\varepsilon_\theta(x_k, k)\)	\(s_\theta(x_k, t_k) = -\frac{1}{\sqrt{1-\bar{\alpha}_k}} \varepsilon_\theta(x_k, k)\)

These are equivalent up to a known scale factor.
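In code, moving between the two parameterizations is a one-line rescaling. The helper names `eps_to_score` and `score_to_eps` and the numeric values below are hypothetical, chosen so the arithmetic is easy to follow (\(\sqrt{1 - 0.36} = 0.8\)).

```python
import numpy as np

def eps_to_score(eps_pred, alpha_bar_k):
    """Convert a noise prediction into a score estimate."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_k)

def score_to_eps(score_pred, alpha_bar_k):
    """Convert a score estimate into a noise prediction."""
    return -np.sqrt(1.0 - alpha_bar_k) * score_pred

eps_pred = np.array([0.5, -1.0, 2.0])     # stand-in for a network output
alpha_bar_k = 0.36                        # assumed cumulative product at step k

score = eps_to_score(eps_pred, alpha_bar_k)      # divides by -0.8
eps_back = score_to_eps(score, alpha_bar_k)      # recovers eps_pred
```

One plausible practical reason for preferring the noise target is scale: \(\varepsilon\) has unit variance at every step, whereas the score grows like \(1/\sqrt{1-\bar{\alpha}_k}\) as the noise level shrinks.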

Summary

Forward DDPM

The forward process is the VP-SDE discretized via Euler–Maruyama, with a variance-preserving square-root coefficient:

\[ x_{k+1} = \sqrt{1-\beta_k}\,x_k + \sqrt{\beta_k}\,\varepsilon_k \]

Key points:

Emerges from continuous-time SDE
\(\sqrt{1-\beta_k}\) preserves variance exactly
Agrees with Euler–Maruyama to first order

Reverse DDPM

The reverse process discretizes the reverse-time SDE with a backward (negative) time step, which flips the sign of the drift term:

\[ x_{k-1} = x_k + \frac{1}{2}\beta_k x_k + \beta_k s_\theta(x_k, t_k) + \sqrt{\beta_k}\,z_k \]