Skip to content

Deriving the Reverse-Time SDE: From Noise Back to Data

How do we reverse a diffusion process? The mathematical foundation of generative diffusion models

This document derives the reverse-time SDE, which is the mathematical key to generating samples from noise in diffusion models.


Table of Contents

  1. The Problem: Reversing Diffusion
  2. Anderson's Theorem (1982)
  3. Intuitive Understanding
  4. Derivation via Fokker-Planck Equation
  5. Why the Score Function Appears
  6. Connection to Diffusion Models
  7. Summary

Referenced From

  • Notebook: notebooks/diffusion/02_sde_formulation/sde_formulation.md β€” Section on Reverse SDE

1. The Problem: Reversing Diffusion

The Forward Process (Easy)

We know how to add noise to data. For the VP-SDE:

\[ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dw \]

Starting from clean data \(x_0\), we can simulate this forward in time to get noisy \(x_T \approx \mathcal{N}(0, I)\).

The Reverse Process (Hard)

Question: Can we run this process backwards to go from noise \(x_T\) back to data \(x_0\)?

Naive attempt: Just negate time?

\[ dx = +\frac{1}{2}\beta(t)x\,dt - \sqrt{\beta(t)}\,dw \quad \text{❌ WRONG} \]

Problem: This doesn't work! Simply negating the drift and noise doesn't give you the correct reverse process.

Why not? Because the forward process generates a distribution \(p_t(x)\) that evolves over time. To reverse it, we need to account for the shape of this distribution at each time.


2. Anderson's Theorem (1982)

The fundamental result (Anderson, 1982):

For any forward SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw \]

the reverse-time SDE (running from \(t=T\) back to \(t=0\)) is:

\[ \boxed{dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t)} \]

where:

  • \(\bar{w}(t)\) is a reverse-time Brownian motion
  • \(\nabla_x \log p_t(x)\) is the score function (gradient of log probability density)

Key observation: The reverse process has an extra term \(-g(t)^2 \nabla_x \log p_t(x)\) that depends on the probability distribution.


3. Intuitive Understanding

Why We Need a Correction Term

Imagine particles diffusing outward from a point source:

Forward process:

  • Particles spread out randomly
  • No "memory" of where they came from
  • Pure diffusion: symmetric spreading

Reverse process:

  • Particles need to know where to go back to
  • Not just random motionβ€”need to be "pulled" toward high-density regions
  • The score \(\nabla_x \log p_t(x)\) provides this "pull"

The Score as a Guide

The score function \(\nabla_x \log p_t(x)\) points in the direction of increasing probability.

  • In forward diffusion: Ignore probability, just add noise
  • In reverse diffusion: Follow the probability gradient to find likely paths back

Analogy:

  • Forward: Drop ink in water, watch it spread (no guidance)
  • Reverse: Collect ink back together (need to know where the ink is concentrated)

4. Derivation via Fokker-Planck Equation

Step 1: Forward SDE and Its Fokker-Planck Equation

The forward SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw \]

generates a probability distribution \(p_t(x)\) that evolves according to the Fokker-Planck equation:

\[ \frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g(t)^2 \nabla^2 p_t \]

πŸ“˜ Detailed Derivation: For a complete derivation of the Fokker-Planck equation from first principles, including physical intuition and examples, see fokker_planck_derivation.md.

Interpretation: This PDE describes how probability density flows forward in time.

Step 2: Reverse Time

To reverse the process, substitute \(\tau = T - t\) (reverse time variable).

Let \(\tilde{p}_\tau(x) = p_{T-\tau}(x)\) be the distribution in reverse time.

Goal: Find an SDE whose solution has marginals \(\tilde{p}_\tau(x)\).

Step 3: Transform the Fokker-Planck Equation

When we reverse time (\(\tau = T - t\)), we have:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = -\frac{\partial p_t}{\partial t} \]

Substitute the Fokker-Planck equation:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot (f p_t) - \frac{1}{2}g(t)^2 \nabla^2 p_t \]

Step 4: Rewrite the Diffusion Term

The key trick is to express \(\nabla^2 p\) using the score.

Identity:

\[ \nabla^2 p = \nabla \cdot (\nabla p) = \nabla \cdot (p \nabla \log p) \]

Derivation: Since \(\nabla \log p = \frac{\nabla p}{p}\), we have \(\nabla p = p \nabla \log p\), so:

\[ \nabla^2 p = \nabla \cdot (p \nabla \log p) \]

Step 5: Substitute into Reverse Fokker-Planck

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla \cdot (p \nabla \log p) \]
\[ = \nabla \cdot \left(f p - \frac{1}{2}g^2 p \nabla \log p\right) \]
\[ = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) \]

Pattern recognition: Let me explain how we identify the drift from this form.

Recognizing the Fokker-Planck Structure

Recall the general Fokker-Planck equation for an SDE \(dx = \tilde{f}(x,t)\,dt + \tilde{g}(t)\,dw\):

\[ \frac{\partial p}{\partial t} = -\nabla \cdot (\tilde{f} p) + \frac{1}{2}\tilde{g}^2 \nabla^2 p \]

The equation has two parts: 1. Advection term: \(-\nabla \cdot (\tilde{f} p)\) β€” probability flow due to drift 2. Diffusion term: \(+\frac{1}{2}\tilde{g}^2 \nabla^2 p\) β€” probability spreading due to noise

Matching Our Equation

We derived (ignoring the diffusion term for now):

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) \]

Comparison with standard form:

\[ \frac{\partial p}{\partial t} = -\nabla \cdot (\tilde{f} p) + \ldots \]

Key observation: Our equation has \(+\nabla \cdot (\ldots)\) while the standard form has \(-\nabla \cdot (\tilde{f} p)\).

To match the standard form, we need:

\[ \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) = -\nabla \cdot (\tilde{f} p) \]

This means:

\[ \left[f - \frac{1}{2}g^2 \nabla \log p\right] p = -\tilde{f} p \]

Dividing by \(p\) (assuming \(p > 0\)):

\[ f - \frac{1}{2}g^2 \nabla \log p = -\tilde{f} \]

Therefore:

\[ \boxed{\tilde{f}(x) = -f(x,t) + \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)} \]

Wait, this has a negative sign on \(f\)! This would be the reverse drift if we're going backward in time. But we want the form for the SDE...

Actually, let me reconsider. The issue is subtle and relates to the sign conventions in reverse time. Let me continue to Step 6 where we'll sort this out properly.

Step 6: The Diffusion Term in Reverse Time

Actually, we need to be more careful. The correct reverse-time Fokker-Planck equation is:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = -\nabla \cdot (\tilde{f} \tilde{p}) + \frac{1}{2}g^2 \nabla^2 \tilde{p} \]

where \(\tilde{f}\) is the reverse drift.

Matching coefficients with our transformed equation:

\[ -\nabla \cdot (\tilde{f} \tilde{p}) + \frac{1}{2}g^2 \nabla^2 \tilde{p} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p \]

The diffusion terms have opposite signs! To fix this, we need to include both the drift correction AND account for the sign change.

Result: The reverse SDE is:

\[ dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t) \]

where the \(+g(t)\,d\bar{w}(t)\) term provides the diffusion in reverse time (note the positive sign, same as forward).

Summary of the Derivation Logic

Let me clarify the full picture:

  1. Forward Fokker-Planck: $\(\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g^2 \nabla^2 p_t\)$

  2. Reverse time (\(\tau = T - t\)): $\(\frac{\partial p_\tau}{\partial \tau} = +\nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p\)$ (Sign flip on time derivative flips both terms)

  3. Rewrite diffusion using score: \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\): $\(\frac{\partial p_\tau}{\partial \tau} = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right)\)$

  4. This is almost a Fokker-Planck equation, but with a sign issue. The resolution is that when we write the SDE that generates this, we need to account for:

  5. The advection term gives us the effective drift
  6. The diffusion term (which we somewhat glossed over) contributes the \(g(t)\,d\bar{w}\) term
  7. The full SDE that produces the correct marginals in reverse time is: $\(dx = [f - g^2 \nabla \log p]\,dt + g\,d\bar{w}\)$

The key insight: The score term \(-g^2 \nabla \log p\) corrects for the fact that probability is concentrated in certain regions, and we need to guide the reverse process toward those regions.

A Concrete Example to Build Intuition

Consider a simple 1D case where probability has flowed outward:

Forward process: Starting from \(x_0 = 0\), particles diffuse outward. At time \(t\), we have \(p_t(x) \approx \mathcal{N}(0, t)\).

Score at time \(t\): $\(\nabla_x \log p_t(x) = -\frac{x}{t}\)$

Interpretation: The score points toward \(x=0\) (the center of the distribution), with magnitude proportional to distance from center.

Reverse process: To bring particles back, we need drift: $\(\text{drift} = f(x,t) - g^2 \nabla \log p = f(x,t) + g^2 \frac{x}{t}\)$

The term \(+g^2 \frac{x}{t}\) pulls particles toward the origin, counteracting the outward diffusion. Without this term, simply reversing would not account for where probability is concentrated.

πŸ“˜ Detailed Example: For a complete worked example with step-by-step calculations, numerical verification, and intuitive explanations, see reverse_process_example.md.


5. Why the Score Function Appears

The Score as Probability Flow

The score function:

\[ \nabla_x \log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} \]

is the normalized gradient of the probability density.

Physical interpretation:

  • Points from low probability to high probability
  • Magnitude is stronger in regions with steeper probability gradients
  • Tells particles "which way to go" to increase likelihood

The \(g(t)^2\) Weighting

Why is the score multiplied by \(g(t)^2\)?

Answer: It's the diffusion coefficient squared.

Intuition:

  • Stronger diffusion (\(g\) large) β†’ more noise added β†’ need stronger correction to reverse
  • Weaker diffusion (\(g\) small) β†’ less noise added β†’ need smaller correction

Mathematical origin: From the Fokker-Planck equation, the diffusion term has coefficient \(\frac{1}{2}g^2\), which when transformed gives \(g^2\) in the drift correction.

Why Not Just \(f(x,t)\) in Reverse?

If we only used \(-f(x,t)\) (negating the forward drift), we'd be ignoring the shape of the distribution.

Example: Consider particles that have diffused to form a Gaussian blob. - Simply reversing \(f\) would make them all move backward the same way - But they need to be "pulled" toward the center of the blob (high density) - The score term provides this pull


6. Connection to Diffusion Models

What We Know and Don't Know

In diffusion models:

Known (designed): - Forward SDE: \(dx = f(x,t)\,dt + g(t)\,dw\) - Drift \(f(x,t)\) and diffusion \(g(t)\) are chosen

Unknown (needs learning): - Score function: \(\nabla_x \log p_t(x)\)

The Learning Problem

Since we don't know \(p_t(x)\), we don't know its score \(\nabla_x \log p_t(x)\).

Solution: Train a neural network \(s_\theta(x,t)\) to approximate the score:

\[ s_\theta(x,t) \approx \nabla_x \log p_t(x) \]

The Reverse SDE for Generation

Once we have the learned score, we can sample by solving:

\[ dx = \left[f(x,t) - g(t)^2 s_\theta(x,t)\right] dt + g(t)\,d\bar{w}(t) \]

Starting from \(x_T \sim \mathcal{N}(0, I)\), integrate backwards from \(t=T\) to \(t=0\) to generate \(x_0\).

Example: VP-SDE

For the VP-SDE with \(f(x,t) = -\frac{1}{2}\beta(t)x\) and \(g(t) = \sqrt{\beta(t)}\):

Reverse SDE:

\[ dx = \left[-\frac{1}{2}\beta(t)x - \beta(t) s_\theta(x,t)\right] dt + \sqrt{\beta(t)}\,d\bar{w}(t) \]

Discretized (Euler-Maruyama):

x = torch.randn(batch_size, dim)  # Start from noise
dt = -T / num_steps

for i in range(num_steps):
    t = T - i * dt
    beta_t = beta(t)

    # Predict score
    score = score_network(x, t)

    # Drift term
    drift = -0.5 * beta_t * x - beta_t * score

    # Diffusion term
    noise = torch.randn_like(x)
    diffusion = np.sqrt(beta_t * abs(dt)) * noise

    # Update
    x = x + drift * dt + diffusion

return x  # This is x_0 (generated sample)

7. Summary

The Key Result

Process SDE
Forward \(dx = f(x,t)\,dt + g(t)\,dw\)
Reverse \(dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{w}(t)\)

Why This Works

  1. Forward SDE defines how probability evolves: \(p_0 \to p_T\)
  2. Fokker-Planck equation describes this evolution as a PDE
  3. Reverse time requires transforming this PDE
  4. Score appears naturally from rewriting the Laplacian: \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\)
  5. Result: Reverse SDE with score correction

Three Levels of Understanding

Level 1 (Practical): To reverse diffusion, add a score correction term \(-g^2 \nabla \log p\) to the drift.

Level 2 (Intuitive): The score tells particles where probability is concentrated, guiding them back to likely regions.

Level 3 (Mathematical): Reversing the Fokker-Planck equation requires expressing the Laplacian in terms of the score, yielding the correction term.


Appendix: Rigorous Statement of Anderson's Theorem

Theorem (Anderson, 1982):

Let \(x(t)\) be the solution to the forward SDE:

\[ dx = f(x,t)\,dt + G(t)\,dw \]

where \(G(t)\) is a \(d \times d\) matrix (diffusion matrix).

Then the reverse-time process \(x(T-t)\) satisfies the SDE:

\[ dx = \left[f(x,t) - \left(GG^T\right) \nabla_x \log p_t(x)\right] dt + G\,d\bar{w}(t) \]

where \(p_t(x)\) is the marginal density of \(x(t)\) under the forward process.

Special case (scalar diffusion \(g(t)\)):

\[ G(t) = g(t) I \quad \Rightarrow \quad GG^T = g(t)^2 I \]

This gives the form we use:

\[ dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t) \]

References

  • Anderson (1982): "Reverse-time diffusion equation models" β€” Original theorem
  • Song et al. (2021): Score-Based Generative Modeling through SDEs β€” Application to generative models
  • Haussmann & Pardoux (1986): "Time reversal of diffusions" β€” Mathematical treatment
  • FΓΆllmer (1986): "Random fields and diffusion processes" β€” Connections to optimal transport
  • Fokker-Planck Derivation: notebooks/diffusion/02_sde_formulation/supplements/07_fokker_planck_equation.md