Deriving the Reverse-Time SDE: From Noise Back to Data¶
How do we reverse a diffusion process? The mathematical foundation of generative diffusion models
This document derives the reverse-time SDE, which is the mathematical key to generating samples from noise in diffusion models.
Table of Contents¶
- The Problem: Reversing Diffusion
- Anderson's Theorem (1982)
- Intuitive Understanding
- Derivation via Fokker-Planck Equation
- Why the Score Function Appears
- Connection to Diffusion Models
- Summary
Referenced From¶
- Notebook:
notebooks/diffusion/02_sde_formulation/sde_formulation.md (Section on Reverse SDE)
1. The Problem: Reversing Diffusion¶
The Forward Process (Easy)¶
We know how to add noise to data. For the VP-SDE:
\[dx = -\frac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw\]
Starting from clean data \(x_0\), we can simulate this forward in time to get a noisy \(x_T\) whose distribution is approximately \(\mathcal{N}(0, I)\).
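As a rough illustration of the forward process, the sketch below simulates the VP-SDE in one dimension with Euler-Maruyama steps. The constant \(\beta(t) = 2\), the horizon \(T = 5\), and the starting point \(x_0 = 3\) are assumptions chosen only for this example; the function name is hypothetical.

```python
import math
import random
import statistics


def simulate_forward_vp(x0, rng, beta=2.0, T=5.0, num_steps=500):
    """Euler-Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dw, from t=0 to t=T."""
    dt = T / num_steps
    x = x0
    for _ in range(num_steps):
        x = x - 0.5 * beta * x * dt + math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
# Hypothetical "clean data": every particle starts at x0 = 3.0.
samples = [simulate_forward_vp(3.0, rng) for _ in range(2000)]
forward_mean = statistics.fmean(samples)
forward_var = statistics.pvariance(samples)
print(f"mean={forward_mean:.3f}, var={forward_var:.3f}")
```

Under these assumptions the empirical mean should end up close to 0 and the variance close to 1, which is the sense in which the starting point is "forgotten" by time \(T\).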
The Reverse Process (Hard)¶
Question: Can we run this process backwards to go from noise \(x_T\) back to data \(x_0\)?
Naive attempt: Just negate time?
\[dx = \frac{1}{2}\beta(t)\,x\,dt - \sqrt{\beta(t)}\,dw\]
Problem: This doesn't work! Simply negating the drift and noise doesn't give you the correct reverse process.
Why not? Because the forward process generates a distribution \(p_t(x)\) that evolves over time. To reverse it, we need to account for the shape of this distribution at each time.
2. Anderson's Theorem (1982)¶
The fundamental result (Anderson, 1982):
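For any forward SDE:
\[dx = f(x,t)\,dt + g(t)\,dw\]
the reverse-time SDE (running from \(t=T\) back to \(t=0\)) is:
\[dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}(t)\]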
where:
- \(\bar{w}(t)\) is a reverse-time Brownian motion
- \(\nabla_x \log p_t(x)\) is the score function (gradient of log probability density)
Key observation: The reverse process has an extra term \(-g(t)^2 \nabla_x \log p_t(x)\) that depends on the probability distribution.
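To make the structure of the reverse SDE concrete, here is a minimal sketch of a single Euler-Maruyama step for a scalar state. The function `reverse_em_step` and its argument names are hypothetical; `f`, `g`, and `score` are passed in as plain callables, and the step size `dt` is assumed to be negative because time runs from \(T\) down to 0.

```python
import math


def reverse_em_step(x, t, dt, f, g, score, z):
    """One Euler-Maruyama step of dx = [f - g^2 * score] dt + g dw_bar, with dt < 0."""
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    return x + drift * dt + g(t) * math.sqrt(abs(dt)) * z


# Illustration with f = 0, g = 1 and the score of N(0, t), which is -x/t.
x_next = reverse_em_step(
    x=2.0,
    t=1.0,
    dt=-0.1,
    f=lambda x, t: 0.0,
    g=lambda t: 1.0,
    score=lambda x, t: -x / t,
    z=0.0,  # noise switched off to isolate the effect of the drift
)
print(x_next)
```

With the noise switched off, the particle at \(x = 2\) moves toward the origin, which is the effect of the score term alone in this example.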
3. Intuitive Understanding¶
Why We Need a Correction Term¶
Imagine particles diffusing outward from a point source:
Forward process:
- Particles spread out randomly
- No "memory" of where they came from
- Pure diffusion: symmetric spreading
Reverse process:
- Particles need to know where to go back to
- Not just random motion: they need to be "pulled" toward high-density regions
- The score \(\nabla_x \log p_t(x)\) provides this "pull"
The Score as a Guide¶
The score function \(\nabla_x \log p_t(x)\) points in the direction of increasing probability.
- In forward diffusion: Ignore probability, just add noise
- In reverse diffusion: Follow the probability gradient to find likely paths back
Analogy:
- Forward: Drop ink in water, watch it spread (no guidance)
- Reverse: Collect ink back together (need to know where the ink is concentrated)
4. Derivation via Fokker-Planck Equation¶
Step 1: Forward SDE and Its Fokker-Planck Equation¶
The forward SDE:
\[dx = f(x,t)\,dt + g(t)\,dw\]
generates a probability distribution \(p_t(x)\) that evolves according to the Fokker-Planck equation:
\[\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g^2 \nabla^2 p_t\]
Detailed Derivation: For a complete derivation of the Fokker-Planck equation from first principles, including physical intuition and examples, see
fokker_planck_derivation.md.
Interpretation: This PDE describes how probability density flows forward in time.
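A quick numerical sanity check can make this PDE less abstract. The sketch below assumes the simplest case \(f = 0\), \(g = 1\) (pure Brownian motion started at 0), where the marginal is \(\mathcal{N}(0, t)\) and the Fokker-Planck equation reduces to \(\partial_t p = \frac{1}{2}\partial_x^2 p\). It estimates both sides with finite differences; the helper names are hypothetical.

```python
import math


def gaussian_density(x, t):
    """Density of N(0, t), the marginal of dx = dw started at x = 0."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def fokker_planck_residual(x, t, h=1e-3):
    """Finite-difference estimate of dp/dt - 0.5 * d2p/dx2 for f = 0, g = 1."""
    dp_dt = (gaussian_density(x, t + h) - gaussian_density(x, t - h)) / (2.0 * h)
    d2p_dx2 = (
        gaussian_density(x + h, t)
        - 2.0 * gaussian_density(x, t)
        + gaussian_density(x - h, t)
    ) / (h * h)
    return dp_dt - 0.5 * d2p_dx2


residuals = [fokker_planck_residual(x, t=1.0) for x in (-2.0, -0.5, 0.0, 1.0, 2.5)]
print(residuals)
```

The residuals should be zero up to finite-difference error, at every test point.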
Step 2: Reverse Time¶
To reverse the process, substitute \(\tau = T - t\) (reverse time variable).
Let \(\tilde{p}_\tau(x) = p_{T-\tau}(x)\) be the distribution in reverse time.
Goal: Find an SDE whose solution has marginals \(\tilde{p}_\tau(x)\).
Step 3: Transform the Fokker-Planck Equation¶
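When we reverse time (\(\tau = T - t\)), we have:
\[\frac{\partial \tilde{p}_\tau}{\partial \tau} = -\frac{\partial p_t}{\partial t}\bigg|_{t = T - \tau}\]
Substitute the Fokker-Planck equation:
\[\frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot (f \tilde{p}_\tau) - \frac{1}{2}g^2 \nabla^2 \tilde{p}_\tau\]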
Step 4: Rewrite the Diffusion Term¶
The key trick is to express \(\nabla^2 p\) using the score.
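Identity:
\[\nabla^2 p = \nabla \cdot (p \nabla \log p)\]
Derivation: Since \(\nabla \log p = \frac{\nabla p}{p}\), we have \(\nabla p = p \nabla \log p\), so:
\[\nabla^2 p = \nabla \cdot (\nabla p) = \nabla \cdot (p \nabla \log p)\]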
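The identity \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\) can also be checked numerically. The sketch below does this in one dimension for a standard normal density, which is an assumption made only for the test; both sides are estimated with central differences and the helper names are hypothetical.

```python
import math


def p(x):
    """Standard normal density, used here only as a test case."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplacian_p(x, h=1e-4):
    """Central-difference second derivative of p."""
    return (p(x + h) - 2.0 * p(x) + p(x - h)) / (h * h)


def score(x, h=1e-4):
    """Central-difference derivative of log p."""
    return (math.log(p(x + h)) - math.log(p(x - h))) / (2.0 * h)


def flux(x):
    """The quantity p * d(log p)/dx whose divergence should equal the Laplacian."""
    return p(x) * score(x)


def div_p_score(x, h=1e-4):
    """Central-difference derivative of the flux."""
    return (flux(x + h) - flux(x - h)) / (2.0 * h)


points = (-1.5, 0.0, 0.7, 2.0)
gaps = [abs(laplacian_p(x) - div_p_score(x)) for x in points]
print(gaps)
```

The two sides should agree up to finite-difference error at each point.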
Step 5: Substitute into Reverse Fokker-Planck¶
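Replacing the Laplacian in the reversed equation with the score identity gives (writing \(p\) for \(\tilde{p}_\tau\)):
\[\frac{\partial p}{\partial \tau} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla \cdot (p \nabla \log p) = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right)\]
Pattern recognition: To identify a drift from this form, compare it with the general Fokker-Planck structure.
Recognizing the Fokker-Planck Structure¶
Recall the general Fokker-Planck equation for an SDE \(dx = \tilde{f}(x,t)\,dt + \tilde{g}(t)\,dw\):
\[\frac{\partial p}{\partial t} = -\nabla \cdot (\tilde{f} p) + \frac{1}{2}\tilde{g}^2 \nabla^2 p\]
The equation has two parts:
1. Advection term: \(-\nabla \cdot (\tilde{f} p)\), the probability flow due to drift
2. Diffusion term: \(+\frac{1}{2}\tilde{g}^2 \nabla^2 p\), the probability spreading due to noise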
Matching Our Equation¶
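Applying the score identity to the entire diffusion term leaves a pure divergence (writing \(p\) for \(\tilde{p}_\tau\)):
\[\frac{\partial p}{\partial \tau} = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right)\]
Comparison with standard form:
\[\frac{\partial p}{\partial \tau} = -\nabla \cdot (\tilde{f} p) + \frac{1}{2}\tilde{g}^2 \nabla^2 p\]
Key observation: Our equation has \(+\nabla \cdot (\ldots)\) while the standard form has \(-\nabla \cdot (\tilde{f} p)\), and ours has no diffusion term left (\(\tilde{g} = 0\)).
To match the standard form, we need:
\[-\nabla \cdot (\tilde{f} p) = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right)\]
This holds if:
\[\tilde{f} p = -\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\]
Dividing by \(p\) (assuming \(p > 0\)):
\[\tilde{f} = -f + \frac{1}{2}g^2 \nabla \log p\]
The negative sign on \(f\) is expected: \(\tilde{f}\) is a drift with respect to the reversed clock \(\tau\), and \(d\tau = -dt\). In the original time variable this reads \(dx = \left[f - \frac{1}{2}g^2 \nabla \log p\right]dt\), a deterministic equation with no noise term. It transports the marginals backward correctly (it is the probability flow ODE of Song et al., 2021), but it is not the reverse SDE, because the whole diffusion term was absorbed into the drift. Step 6 keeps a noise term instead.
Step 6: The Diffusion Term in Reverse Time¶
An SDE that runs forward in \(\tau\) with noise strength \(g\) must have a Fokker-Planck equation whose diffusion term is positive:
\[\frac{\partial p}{\partial \tau} = -\nabla \cdot (\tilde{f} p) + \frac{1}{2}g^2 \nabla^2 p\]
where \(\tilde{f}\) is the reverse drift.
Matching coefficients with our transformed equation:
\[\frac{\partial p}{\partial \tau} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p\]
The diffusion terms have opposite signs! To fix this, split \(-\frac{1}{2}g^2 \nabla^2 p = +\frac{1}{2}g^2 \nabla^2 p - g^2 \nabla^2 p\), keep the first piece as diffusion, and move the second into the drift using \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\):
\[\frac{\partial p}{\partial \tau} = -\nabla \cdot \left(\left[-f + g^2 \nabla \log p\right] p\right) + \frac{1}{2}g^2 \nabla^2 p\]
This is a standard Fokker-Planck equation in \(\tau\) with drift \(\tilde{f} = -f + g^2 \nabla \log p\) (with \(f\), \(g\), and \(p\) evaluated at \(t = T - \tau\)), so \(dx = \tilde{f}\,d\tau + g\,d\bar{w}\). Substituting \(d\tau = -dt\) expresses it in the original time variable.
Result: The reverse SDE is:
\[dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}(t)\]
where the \(+g(t)\,d\bar{w}(t)\) term provides the diffusion in reverse time (note the positive sign, same as forward).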
Summary of the Derivation Logic¶
The full picture in four steps (after step 1, \(p\) denotes the reversed density \(\tilde{p}_\tau\)):
1. Forward Fokker-Planck:
\[\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g^2 \nabla^2 p_t\]
2. Reverse time (\(\tau = T - t\)); the sign flip on the time derivative flips both terms:
\[\frac{\partial p}{\partial \tau} = +\nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p\]
3. Split the diffusion term as \(-\frac{1}{2}g^2 \nabla^2 p = +\frac{1}{2}g^2 \nabla^2 p - g^2 \nabla^2 p\) and rewrite the second piece using the score, \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\):
\[\frac{\partial p}{\partial \tau} = -\nabla \cdot \left(\left[-f + g^2 \nabla \log p\right] p\right) + \frac{1}{2}g^2 \nabla^2 p\]
4. This is a Fokker-Planck equation in \(\tau\) with drift \(-f + g^2 \nabla \log p\) and diffusion coefficient \(g\). Since \(d\tau = -dt\), the SDE that produces the correct marginals in reverse time is:
\[dx = [f - g^2 \nabla \log p]\,dt + g\,d\bar{w}\]
The key insight: The score term \(-g^2 \nabla \log p\) corrects for the fact that probability is concentrated in certain regions, and we need to guide the reverse process toward those regions.
A Concrete Example to Build Intuition¶
Consider a simple 1D case where probability has flowed outward:
Forward process: Take \(f = 0\) and \(g = 1\) (pure Brownian motion). Starting from \(x_0 = 0\), particles diffuse outward, and at time \(t\) we have \(p_t(x) = \mathcal{N}(0, t)\).
Score at time \(t\):
\[\nabla_x \log p_t(x) = -\frac{x}{t}\]
Interpretation: The score points toward \(x=0\) (the center of the distribution), with magnitude proportional to distance from center.
Reverse process: To bring particles back, we need drift:
\[\text{drift} = f(x,t) - g^2 \nabla \log p = 0 + \frac{x}{t} = \frac{x}{t}\]
The coefficient is positive, but the reverse SDE is integrated with \(dt < 0\), so the update \(\frac{x}{t}\,dt\) moves each particle toward the origin, counteracting the outward diffusion. Without this term, simply reversing would not account for where probability is concentrated.
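The effect of the score term in this 1D example can be seen in a small simulation. The sketch below assumes \(f = 0\), \(g = 1\), \(p_t = \mathcal{N}(0, t)\), and stops at a small time \(\varepsilon = 0.01\) rather than 0 to avoid dividing by zero in \(x/t\). It runs the backward integration twice, once with the score correction and once without; the function name and all parameter values are illustrative choices.

```python
import math
import random
import statistics


def run_backward(use_score, T=1.0, eps=0.01, num_steps=500, num_particles=1000, seed=0):
    """Integrate from t=T down to t=eps for f=0, g=1, where p_t = N(0, t).

    use_score=True applies the reverse drift x/t; use_score=False drops it.
    Returns the empirical variance of the particles at t=eps.
    """
    rng = random.Random(seed)
    h = (T - eps) / num_steps  # magnitude of the (negative) time step
    xs = [rng.gauss(0.0, math.sqrt(T)) for _ in range(num_particles)]  # x_T ~ N(0, T)
    for k in range(num_steps):
        t = T - k * h
        pull = h / t if use_score else 0.0  # (x/t) * dt with dt = -h gives -pull * x
        xs = [x - pull * x + math.sqrt(h) * rng.gauss(0.0, 1.0) for x in xs]
    return statistics.pvariance(xs)


var_with_score = run_backward(use_score=True)
var_naive = run_backward(use_score=False)
print(f"with score: {var_with_score:.4f}, naive: {var_naive:.4f}")
```

With the score term, the variance should shrink to roughly \(\varepsilon\), matching \(p_\varepsilon\). Without it, the particles keep spreading and the variance grows instead.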
Detailed Example: For a complete worked example with step-by-step calculations, numerical verification, and intuitive explanations, see
reverse_process_example.md.
5. Why the Score Function Appears¶
The Score as Probability Flow¶
The score function:
\[\nabla_x \log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)}\]
is the gradient of the probability density, normalized by the density itself.
Physical interpretation:
- Points from low probability to high probability
- Magnitude is stronger in regions with steeper probability gradients
- Tells particles "which way to go" to increase likelihood
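The following sketch illustrates these properties on a hypothetical two-mode density, an equal-weight mixture of \(\mathcal{N}(-2, 1)\) and \(\mathcal{N}(2, 1)\). The density and the finite-difference score estimate are assumptions made for the example, not part of the derivation.

```python
import math


def mixture_density(x):
    """Equal-weight mixture of N(-2, 1) and N(2, 1); a hypothetical test density."""
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    return 0.5 * norm * (
        math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2)
    )


def mixture_score(x, h=1e-5):
    """Score d/dx log p(x), estimated with a central difference."""
    return (
        math.log(mixture_density(x + h)) - math.log(mixture_density(x - h))
    ) / (2.0 * h)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  score={mixture_score(x):+.3f}")
```

On either side of a mode the sign of the score points back toward that mode, and at the symmetric midpoint \(x = 0\) between the two modes it vanishes.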
The \(g(t)^2\) Weighting¶
Why is the score multiplied by \(g(t)^2\)?
Answer: It's the diffusion coefficient squared.
Intuition:
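- Stronger diffusion (\(g\) large) → more noise added → need stronger correction to reverse
- Weaker diffusion (\(g\) small) → less noise added → need smaller correction
Mathematical origin: Time reversal turns the diffusion term \(+\frac{1}{2}g^2 \nabla^2 p\) into \(-\frac{1}{2}g^2 \nabla^2 p\), but an SDE with noise strength \(g\) needs \(+\frac{1}{2}g^2 \nabla^2 p\). The difference, \(g^2 \nabla^2 p = g^2 \nabla \cdot (p \nabla \log p)\), moves into the drift, which is why the correction carries \(g^2\) rather than \(\frac{1}{2}g^2\).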
Why Not Just \(f(x,t)\) in Reverse?¶
If we only used \(-f(x,t)\) (negating the forward drift), we'd be ignoring the shape of the distribution.
Example: Consider particles that have diffused to form a Gaussian blob.
- Simply reversing \(f\) would make them all move backward the same way
- But they need to be "pulled" toward the center of the blob (high density)
- The score term provides this pull
6. Connection to Diffusion Models¶
What We Know and Don't Know¶
In diffusion models:
Known (designed):
- Forward SDE: \(dx = f(x,t)\,dt + g(t)\,dw\)
- Drift \(f(x,t)\) and diffusion \(g(t)\) are chosen
Unknown (needs learning):
- Score function: \(\nabla_x \log p_t(x)\)
The Learning Problem¶
Since we don't know \(p_t(x)\), we don't know its score \(\nabla_x \log p_t(x)\).
Solution: Train a neural network \(s_\theta(x,t)\) to approximate the score:
\[s_\theta(x,t) \approx \nabla_x \log p_t(x)\]
The Reverse SDE for Generation¶
Once we have the learned score, we can sample by solving:
\[dx = \left[f(x,t) - g(t)^2 s_\theta(x,t)\right]dt + g(t)\,d\bar{w}(t)\]
Starting from \(x_T \sim \mathcal{N}(0, I)\), integrate backwards from \(t=T\) to \(t=0\) to generate \(x_0\).
Example: VP-SDE¶
For the VP-SDE with \(f(x,t) = -\frac{1}{2}\beta(t)x\) and \(g(t) = \sqrt{\beta(t)}\):
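Reverse SDE:
\[dx = \left[-\frac{1}{2}\beta(t)\,x - \beta(t)\,\nabla_x \log p_t(x)\right]dt + \sqrt{\beta(t)}\,d\bar{w}(t)\]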
Discretized (Euler-Maruyama):
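import numpy as np
import torch

# The function wrapper and its argument names are illustrative.
# beta(t) is assumed to return a Python float.
def sample_vp_sde(score_network, beta, T, num_steps, batch_size, dim):
    x = torch.randn(batch_size, dim)  # Start from noise
    dt = -T / num_steps  # Negative step: integrate from t=T down to t=0
    for i in range(num_steps):
        t = T + i * dt  # dt < 0, so t decreases from T toward 0
        beta_t = beta(t)
        # Predict score
        score = score_network(x, t)
        # Drift term of the reverse SDE
        drift = -0.5 * beta_t * x - beta_t * score
        # Diffusion term: sqrt(beta_t * |dt|) times standard normal noise
        noise = torch.randn_like(x)
        diffusion = np.sqrt(beta_t * abs(dt)) * noise
        # Update
        x = x + drift * dt + diffusion
    return x  # This is x_0 (generated sample)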
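To try the discretized reverse VP-SDE without a trained network, one option is to substitute a score that is known in closed form. The sketch below is a pure-Python, one-dimensional version under stated assumptions: the data distribution is a hypothetical \(\mathcal{N}(\mu, s_0^2)\) with \(\mu = 2\), \(s_0 = 0.5\), and \(\beta(t)\) is a constant. In that case \(p_t = \mathcal{N}(\mu a_t,\; s_0^2 a_t^2 + 1 - a_t^2)\) with \(a_t = e^{-\beta t / 2}\), so the exact score is available. All names are illustrative.

```python
import math
import random
import statistics

MU, S0 = 2.0, 0.5   # hypothetical data distribution: x_0 ~ N(MU, S0^2)
BETA, T = 2.0, 5.0  # constant beta(t) = BETA, chosen for illustration


def analytic_score(x, t):
    """Exact score of p_t for Gaussian data under the VP-SDE with constant beta."""
    a = math.exp(-0.5 * BETA * t)          # scaling applied to x_0 at time t
    var_t = (S0 * a) ** 2 + (1.0 - a * a)  # variance of p_t
    return -(x - MU * a) / var_t


def sample_reverse_vp(num_samples=1000, num_steps=500, seed=0):
    """Euler-Maruyama integration of the reverse VP-SDE from t=T down to t=0."""
    rng = random.Random(seed)
    dt = -T / num_steps  # negative step size
    xs = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]  # x_T ~ N(0, 1)
    noise_scale = math.sqrt(BETA * abs(dt))
    for i in range(num_steps):
        t = T + i * dt  # decreases from T toward 0
        xs = [
            x
            + (-0.5 * BETA * x - BETA * analytic_score(x, t)) * dt
            + noise_scale * rng.gauss(0.0, 1.0)
            for x in xs
        ]
    return xs


generated = sample_reverse_vp()
gen_mean = statistics.fmean(generated)
gen_std = statistics.pstdev(generated)
print(f"mean={gen_mean:.3f}, std={gen_std:.3f}")
```

If the reverse SDE is doing its job, the generated samples should have mean near \(\mu\) and standard deviation near \(s_0\), even though every sample started from standard normal noise.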
7. Summary¶
The Key Result¶
| Process | SDE |
|---|---|
| Forward | \(dx = f(x,t)\,dt + g(t)\,dw\) |
| Reverse | \(dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{w}(t)\) |
Why This Works¶
- Forward SDE defines how probability evolves: \(p_0 \to p_T\)
- Fokker-Planck equation describes this evolution as a PDE
- Reverse time requires transforming this PDE
- Score appears naturally from rewriting the Laplacian: \(\nabla^2 p = \nabla \cdot (p \nabla \log p)\)
- Result: Reverse SDE with score correction
Three Levels of Understanding¶
Level 1 (Practical): To reverse diffusion, add a score correction term \(-g^2 \nabla \log p\) to the drift.
Level 2 (Intuitive): The score tells particles where probability is concentrated, guiding them back to likely regions.
Level 3 (Mathematical): Reversing the Fokker-Planck equation requires expressing the Laplacian in terms of the score, yielding the correction term.
Appendix: Rigorous Statement of Anderson's Theorem¶
Theorem (Anderson, 1982):
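Let \(x(t)\) be the solution to the forward SDE:
\[dx = f(x,t)\,dt + G(t)\,dw\]
where \(G(t)\) is a \(d \times d\) matrix (diffusion matrix).
Then the reverse-time process \(x(T-t)\) satisfies the SDE:
\[dx = \left[f(x,t) - G(t)G(t)^\top \nabla_x \log p_t(x)\right]dt + G(t)\,d\bar{w}(t)\]
where \(p_t(x)\) is the marginal density of \(x(t)\) under the forward process.
Special case (scalar diffusion \(g(t)\)):
\[G(t) = g(t)\,I \quad\Rightarrow\quad G(t)G(t)^\top = g(t)^2 I\]
This gives the form we use:
\[dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}(t)\]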
References¶
- Anderson (1982): "Reverse-time diffusion equation models" (original theorem)
- Song et al. (2021): Score-Based Generative Modeling through SDEs (application to generative models)
- Haussmann & Pardoux (1986): "Time reversal of diffusions" (mathematical treatment)
- Föllmer (1986): "Random fields and diffusion processes" (connections to optimal transport)
- Fokker-Planck Derivation:
notebooks/diffusion/02_sde_formulation/supplements/07_fokker_planck_equation.md
Related Documents¶
- Fokker-Planck Equation:
fokker_planck_derivation.md: Derivation and intuition for the probability evolution equation
- Detailed Worked Example:
reverse_process_example.md: Complete 1D Gaussian example with numerical verification