Deriving the Reverse-Time SDE: From Noise Back to Data¶

How do we reverse a diffusion process? The mathematical foundation of generative diffusion models

This document derives the reverse-time SDE, which is the mathematical key to generating samples from noise in diffusion models.

Table of Contents¶

The Problem: Reversing Diffusion
Anderson's Theorem (1982)
Intuitive Understanding
Derivation via Fokker-Planck Equation
Why the Score Function Appears
Connection to Diffusion Models
Summary

Referenced From¶

Notebook: notebooks/diffusion/02_sde_formulation/sde_formulation.md — Section on Reverse SDE

1. The Problem: Reversing Diffusion¶

The Forward Process (Easy)¶

We know how to add noise to data. For the VP-SDE:

\[ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dw \]

Starting from clean data $x_0$, we can simulate this forward in time to get noisy $x_T \approx \mathcal{N}(0, I)$.

The Reverse Process (Hard)¶

Question: Can we run this process backwards to go from noise $x_T$ back to data $x_0$?

Naive attempt: Just negate time?

\[ dx = +\frac{1}{2}\beta(t)x\,dt - \sqrt{\beta(t)}\,dw \quad \text{❌ WRONG} \]

Problem: This doesn't work! Simply negating the drift and noise doesn't give you the correct reverse process.

Why not? Because the forward process generates a distribution $p_t(x)$ that evolves over time. To reverse it, we need to account for the shape of this distribution at each time.

2. Anderson's Theorem (1982)¶

The fundamental result (Anderson, 1982):

For any forward SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw \]

the reverse-time SDE (running from $t=T$ back to $t=0$) is:

\[ \boxed{dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t)} \]

where:

$\bar{w}(t)$ is a reverse-time Brownian motion
$\nabla_x \log p_t(x)$ is the score function (gradient of log probability density)

Key observation: The reverse process has an extra term $-g(t)^2 \nabla_x \log p_t(x)$ that depends on the probability distribution.

3. Intuitive Understanding¶

Why We Need a Correction Term¶

Imagine particles diffusing outward from a point source:

Forward process:

Particles spread out randomly
No "memory" of where they came from
Pure diffusion: symmetric spreading

Reverse process:

Particles need to know where to go back to
Not just random motion—need to be "pulled" toward high-density regions
The score $\nabla_x \log p_t(x)$ provides this "pull"

The Score as a Guide¶

The score function $\nabla_x \log p_t(x)$ points in the direction of increasing probability.

In forward diffusion: Ignore probability, just add noise
In reverse diffusion: Follow the probability gradient to find likely paths back

Analogy:

Forward: Drop ink in water, watch it spread (no guidance)
Reverse: Collect ink back together (need to know where the ink is concentrated)

4. Derivation via Fokker-Planck Equation¶

Step 1: Forward SDE and Its Fokker-Planck Equation¶

The forward SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw \]

generates a probability distribution $p_t(x)$ that evolves according to the Fokker-Planck equation:

\[ \frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g(t)^2 \nabla^2 p_t \]

📘 Detailed Derivation: For a complete derivation of the Fokker-Planck equation from first principles, including physical intuition and examples, see fokker_planck_derivation.md.

Interpretation: This PDE describes how probability density flows forward in time.

Step 2: Reverse Time¶

To reverse the process, substitute $\tau = T - t$ (reverse time variable).

Let $\tilde{p}_\tau(x) = p_{T-\tau}(x)$ be the distribution in reverse time.

Goal: Find an SDE whose solution has marginals $\tilde{p}_\tau(x)$.

Step 3: Transform the Fokker-Planck Equation¶

When we reverse time ($\tau = T - t$), we have:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = -\frac{\partial p_t}{\partial t} \]

Substitute the Fokker-Planck equation:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot (f p_t) - \frac{1}{2}g(t)^2 \nabla^2 p_t \]

Step 4: Rewrite the Diffusion Term¶

The key trick is to express $\nabla^2 p$ using the score.

Identity:

\[ \nabla^2 p = \nabla \cdot (\nabla p) = \nabla \cdot (p \nabla \log p) \]

Derivation: Since $\nabla \log p = \frac{\nabla p}{p}$, we have $\nabla p = p \nabla \log p$, so:

\[ \nabla^2 p = \nabla \cdot (p \nabla \log p) \]

Step 5: Substitute into Reverse Fokker-Planck¶

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla \cdot (p \nabla \log p) \]

\[ = \nabla \cdot \left(f p - \frac{1}{2}g^2 p \nabla \log p\right) \]

\[ = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) \]

Pattern recognition: Let me explain how we identify the drift from this form.

Recognizing the Fokker-Planck Structure¶

Recall the general Fokker-Planck equation for an SDE $dx = \tilde{f}(x,t)\,dt + \tilde{g}(t)\,dw$:

\[ \frac{\partial p}{\partial t} = -\nabla \cdot (\tilde{f} p) + \frac{1}{2}\tilde{g}^2 \nabla^2 p \]

The equation has two parts: 1. Advection term: $-\nabla \cdot (\tilde{f} p)$ — probability flow due to drift 2. Diffusion term: $+\frac{1}{2}\tilde{g}^2 \nabla^2 p$ — probability spreading due to noise

Matching Our Equation¶

We derived (ignoring the diffusion term for now):

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) \]

Comparison with standard form:

\[ \frac{\partial p}{\partial t} = -\nabla \cdot (\tilde{f} p) + \ldots \]

Key observation: Our equation has $+\nabla \cdot (\ldots)$ while the standard form has $-\nabla \cdot (\tilde{f} p)$.

To match the standard form, we need:

\[ \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right) = -\nabla \cdot (\tilde{f} p) \]

This means:

\[ \left[f - \frac{1}{2}g^2 \nabla \log p\right] p = -\tilde{f} p \]

Dividing by $p$ (assuming $p > 0$):

\[ f - \frac{1}{2}g^2 \nabla \log p = -\tilde{f} \]

Therefore:

\[ \boxed{\tilde{f}(x) = -f(x,t) + \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)} \]

Wait, this has a negative sign on $f$! This would be the reverse drift if we're going backward in time. But we want the form for the SDE...

Actually, let me reconsider. The issue is subtle and relates to the sign conventions in reverse time. Let me continue to Step 6 where we'll sort this out properly.

Step 6: The Diffusion Term in Reverse Time¶

Actually, we need to be more careful. The correct reverse-time Fokker-Planck equation is:

\[ \frac{\partial \tilde{p}_\tau}{\partial \tau} = -\nabla \cdot (\tilde{f} \tilde{p}) + \frac{1}{2}g^2 \nabla^2 \tilde{p} \]

where $\tilde{f}$ is the reverse drift.

Matching coefficients with our transformed equation:

\[ -\nabla \cdot (\tilde{f} \tilde{p}) + \frac{1}{2}g^2 \nabla^2 \tilde{p} = \nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p \]

The diffusion terms have opposite signs! To fix this, we need to include both the drift correction AND account for the sign change.

Result: The reverse SDE is:

\[ dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t) \]

where the $+g(t)\,d\bar{w}(t)$ term provides the diffusion in reverse time (note the positive sign, same as forward).

Summary of the Derivation Logic¶

Let me clarify the full picture:

Forward Fokker-Planck: $$\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2}g^2 \nabla^2 p_t$$
Reverse time ($\tau = T - t$): $$\frac{\partial p_\tau}{\partial \tau} = +\nabla \cdot (f p) - \frac{1}{2}g^2 \nabla^2 p$$ (Sign flip on time derivative flips both terms)
Rewrite diffusion using score: $\nabla^2 p = \nabla \cdot (p \nabla \log p)$: $$\frac{\partial p_\tau}{\partial \tau} = \nabla \cdot \left(\left[f - \frac{1}{2}g^2 \nabla \log p\right] p\right)$$
This is almost a Fokker-Planck equation, but with a sign issue. The resolution is that when we write the SDE that generates this, we need to account for:
The advection term gives us the effective drift
The diffusion term (which we somewhat glossed over) contributes the $g(t)\,d\bar{w}$ term
The full SDE that produces the correct marginals in reverse time is: $$dx = [f - g^2 \nabla \log p]\,dt + g\,d\bar{w}$$

The key insight: The score term $-g^2 \nabla \log p$ corrects for the fact that probability is concentrated in certain regions, and we need to guide the reverse process toward those regions.

A Concrete Example to Build Intuition¶

Consider a simple 1D case where probability has flowed outward:

Forward process: Starting from $x_0 = 0$, particles diffuse outward. At time $t$, we have $p_t(x) \approx \mathcal{N}(0, t)$.

Score at time $t$: $$\nabla_x \log p_t(x) = -\frac{x}{t}$$

Interpretation: The score points toward $x=0$ (the center of the distribution), with magnitude proportional to distance from center.

Reverse process: To bring particles back, we need drift: $$\text{drift} = f(x,t) - g^2 \nabla \log p = f(x,t) + g^2 \frac{x}{t}$$

The term $+g^2 \frac{x}{t}$ pulls particles toward the origin, counteracting the outward diffusion. Without this term, simply reversing would not account for where probability is concentrated.

📘 Detailed Example: For a complete worked example with step-by-step calculations, numerical verification, and intuitive explanations, see reverse_process_example.md.

5. Why the Score Function Appears¶

The Score as Probability Flow¶

The score function:

\[ \nabla_x \log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} \]

is the normalized gradient of the probability density.

Physical interpretation:

Points from low probability to high probability
Magnitude is stronger in regions with steeper probability gradients
Tells particles "which way to go" to increase likelihood

The $g(t)^2$ Weighting¶

Why is the score multiplied by $g(t)^2$?

Answer: It's the diffusion coefficient squared.

Intuition:

Stronger diffusion ($g$ large) → more noise added → need stronger correction to reverse
Weaker diffusion ($g$ small) → less noise added → need smaller correction

Mathematical origin: From the Fokker-Planck equation, the diffusion term has coefficient $\frac{1}{2}g^2$, which when transformed gives $g^2$ in the drift correction.

Why Not Just $f(x,t)$ in Reverse?¶

If we only used $-f(x,t)$ (negating the forward drift), we'd be ignoring the shape of the distribution.

Example: Consider particles that have diffused to form a Gaussian blob. - Simply reversing $f$ would make them all move backward the same way - But they need to be "pulled" toward the center of the blob (high density) - The score term provides this pull

6. Connection to Diffusion Models¶

What We Know and Don't Know¶

In diffusion models:

Known (designed): - Forward SDE: $dx = f(x,t)\,dt + g(t)\,dw$ - Drift $f(x,t)$ and diffusion $g(t)$ are chosen

Unknown (needs learning): - Score function: $\nabla_x \log p_t(x)$

The Learning Problem¶

Since we don't know $p_t(x)$, we don't know its score $\nabla_x \log p_t(x)$.

Solution: Train a neural network $s_\theta(x,t)$ to approximate the score:

\[ s_\theta(x,t) \approx \nabla_x \log p_t(x) \]

The Reverse SDE for Generation¶

Once we have the learned score, we can sample by solving:

\[ dx = \left[f(x,t) - g(t)^2 s_\theta(x,t)\right] dt + g(t)\,d\bar{w}(t) \]

Starting from $x_T \sim \mathcal{N}(0, I)$, integrate backwards from $t=T$ to $t=0$ to generate $x_0$.

Example: VP-SDE¶

For the VP-SDE with $f(x,t) = -\frac{1}{2}\beta(t)x$ and $g(t) = \sqrt{\beta(t)}$:

Reverse SDE:

\[ dx = \left[-\frac{1}{2}\beta(t)x - \beta(t) s_\theta(x,t)\right] dt + \sqrt{\beta(t)}\,d\bar{w}(t) \]

Discretized (Euler-Maruyama):

x = torch.randn(batch_size, dim)  # Start from noise
dt = -T / num_steps

for i in range(num_steps):
    t = T - i * dt
    beta_t = beta(t)

    # Predict score
    score = score_network(x, t)

    # Drift term
    drift = -0.5 * beta_t * x - beta_t * score

    # Diffusion term
    noise = torch.randn_like(x)
    diffusion = np.sqrt(beta_t * abs(dt)) * noise

    # Update
    x = x + drift * dt + diffusion

return x  # This is x_0 (generated sample)

7. Summary¶

The Key Result¶

Process	SDE
Forward	$dx = f(x,t)\,dt + g(t)\,dw$
Reverse	$dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{w}(t)$

Why This Works¶

Forward SDE defines how probability evolves: $p_0 \to p_T$
Fokker-Planck equation describes this evolution as a PDE
Reverse time requires transforming this PDE
Score appears naturally from rewriting the Laplacian: $\nabla^2 p = \nabla \cdot (p \nabla \log p)$
Result: Reverse SDE with score correction

Three Levels of Understanding¶

Level 1 (Practical): To reverse diffusion, add a score correction term $-g^2 \nabla \log p$ to the drift.

Level 2 (Intuitive): The score tells particles where probability is concentrated, guiding them back to likely regions.

Level 3 (Mathematical): Reversing the Fokker-Planck equation requires expressing the Laplacian in terms of the score, yielding the correction term.

Appendix: Rigorous Statement of Anderson's Theorem¶

Theorem (Anderson, 1982):

Let $x(t)$ be the solution to the forward SDE:

\[ dx = f(x,t)\,dt + G(t)\,dw \]

where $G(t)$ is a $d \times d$ matrix (diffusion matrix).

Then the reverse-time process $x(T-t)$ satisfies the SDE:

\[ dx = \left[f(x,t) - \left(GG^T\right) \nabla_x \log p_t(x)\right] dt + G\,d\bar{w}(t) \]

where $p_t(x)$ is the marginal density of $x(t)$ under the forward process.

Special case (scalar diffusion $g(t)$):

\[ G(t) = g(t) I \quad \Rightarrow \quad GG^T = g(t)^2 I \]

This gives the form we use:

\[ dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}(t) \]

References¶

Anderson (1982): "Reverse-time diffusion equation models" — Original theorem
Song et al. (2021): Score-Based Generative Modeling through SDEs — Application to generative models
Haussmann & Pardoux (1986): "Time reversal of diffusions" — Mathematical treatment
Föllmer (1986): "Random fields and diffusion processes" — Connections to optimal transport
Fokker-Planck Derivation: notebooks/diffusion/02_sde_formulation/supplements/07_fokker_planck_equation.md

Fokker-Planck Equation: fokker_planck_derivation.md — Derivation and intuition for the probability evolution equation
Detailed Worked Example: reverse_process_example.md — Complete 1D Gaussian example with numerical verification

Deriving the Reverse-Time SDE: From Noise Back to Data¶

Table of Contents¶

Referenced From¶

1. The Problem: Reversing Diffusion¶

The Forward Process (Easy)¶

The Reverse Process (Hard)¶

2. Anderson's Theorem (1982)¶

3. Intuitive Understanding¶

Why We Need a Correction Term¶

The Score as a Guide¶

4. Derivation via Fokker-Planck Equation¶

Step 1: Forward SDE and Its Fokker-Planck Equation¶

Step 2: Reverse Time¶

Step 3: Transform the Fokker-Planck Equation¶

Step 4: Rewrite the Diffusion Term¶

Step 5: Substitute into Reverse Fokker-Planck¶

Recognizing the Fokker-Planck Structure¶

Matching Our Equation¶

Step 6: The Diffusion Term in Reverse Time¶

Summary of the Derivation Logic¶

A Concrete Example to Build Intuition¶

5. Why the Score Function Appears¶

The Score as Probability Flow¶

The \(g(t)^2\) Weighting¶

Why Not Just \(f(x,t)\) in Reverse?¶

6. Connection to Diffusion Models¶

What We Know and Don't Know¶

The Learning Problem¶

The Reverse SDE for Generation¶

Example: VP-SDE¶

7. Summary¶

The Key Result¶

Why This Works¶

Three Levels of Understanding¶

Appendix: Rigorous Statement of Anderson's Theorem¶

References¶

Process	SDE
Forward	\(dx = f(x,t)\,dt + g(t)\,dw\)
Reverse	\(dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{w}(t)\)

Deriving the Reverse-Time SDE: From Noise Back to Data¶

Table of Contents¶

Referenced From¶

1. The Problem: Reversing Diffusion¶

The Forward Process (Easy)¶

The Reverse Process (Hard)¶

2. Anderson's Theorem (1982)¶

3. Intuitive Understanding¶

Why We Need a Correction Term¶

The Score as a Guide¶

4. Derivation via Fokker-Planck Equation¶

Step 1: Forward SDE and Its Fokker-Planck Equation¶

Step 2: Reverse Time¶

Step 3: Transform the Fokker-Planck Equation¶

Step 4: Rewrite the Diffusion Term¶

Step 5: Substitute into Reverse Fokker-Planck¶

Recognizing the Fokker-Planck Structure¶

Matching Our Equation¶

Step 6: The Diffusion Term in Reverse Time¶

Summary of the Derivation Logic¶

A Concrete Example to Build Intuition¶

5. Why the Score Function Appears¶

The Score as Probability Flow¶

The \(g(t)^2\) Weighting¶

Why Not Just \(f(x,t)\) in Reverse?¶

6. Connection to Diffusion Models¶

What We Know and Don't Know¶

The Learning Problem¶

The Reverse SDE for Generation¶

Example: VP-SDE¶

7. Summary¶

The Key Result¶

Why This Works¶

Three Levels of Understanding¶

Appendix: Rigorous Statement of Anderson's Theorem¶

References¶

Related Documents¶