SDE View: Design Principles and Intuitions¶

This document addresses deeper questions about the SDE formulation of diffusion models: Why are specific drifts chosen? How do we think about the score-noise relationship in high dimensions? What are the design principles?

1. Choosing the Drift: Design Principles¶

The Question¶

When we write the forward SDE:

\[ dx(t) = f(x(t), t)\,dt + g(t)\,dw(t) \]

How do we choose the drift \(f(x, t)\)? For example, the VP-SDE uses:

\[ f(x, t) = -\frac{1}{2}\beta(t) x \]

Should we think of this as an "external force" moving probability mass?

The Answer¶

Yes — it is very helpful to think of the drift as an external force field acting on probability mass.

The main guideline is:

Choose a drift + noise schedule such that the forward process is simple, stable, and ends in a known distribution.

2. What the Drift Actually Does¶

Geometric View¶

In the Fokker–Planck equation, which governs how densities evolve under an SDE:

\[ \frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \left(f(x, t) p_t(x)\right) + \frac{1}{2}\nabla^2 \left(g(t)^2 p_t(x)\right) \]

You can literally read off the roles:

\(f(x, t)\): transports probability mass (advection term)
\(g(t)\): diffuses (spreads) probability mass (diffusion term)

Key insight: Drift is an external force field moving probability mass. Noise spreads mass; drift organizes where it flows.

3. Why the VP Drift Is Linear¶

The VP-SDE uses:

\[ f(x, t) = -\frac{1}{2}\beta(t) x \]

This choice is not arbitrary. It satisfies three extremely strong constraints simultaneously:

(a) Globally Stabilizing¶

The drift points toward the origin everywhere:

No explosion
No runaway trajectories
Well-defined reverse dynamics

This is crucial in very high dimensions (e.g., \(d = 3072\) for a 32×32 RGB image).

(b) Preserves Gaussian Structure¶

If the forward process starts Gaussian, it stays Gaussian.

With this drift:

\[ x(t) \mid x_0 \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I\right) \]

This closed-form marginal is gold — it makes training trivial.

© Lets Noise Dominate Gradually¶

The linear drift slowly shrinks signal while noise grows:

Early times: signal-dominated
Late times: noise-dominated

This creates a smooth continuum of difficulty levels for learning.

4. What Other Drifts Would Mean¶

You could choose:

Nonlinear drift: \(f(x, t) = -\nabla U(x)\) (energy-based)
Data-dependent drift: \(f(x, t) = h(x)\)
Anisotropic drift: Different behavior in different directions

But then:

Marginals are no longer analytic
Score targets become implicit
Training becomes unstable or intractable

Guiding principle:

Choose the simplest drift that gives you a tractable forward process and a learnable reverse process.

VP-SDE hits that sweet spot.

5. Mental Model: Wind + Fog System¶

Think of the forward diffusion as:

A carefully engineered wind + fog system that:

Blows all probability mass toward a simple attractor (the origin)

Blurs everything along the way (Gaussian noise)

Leaves behind a smooth gradient field that can be learned (the score)

6. The Score-Noise Relationship in High Dimensions¶

The Question¶

When input \(x\) is an image (flattened to \(\mathbb{R}^d\)), we have:

\[ \nabla_x \log p(x_t \mid x_0) = -\frac{1}{\sigma_t} \varepsilon \]

Should I think of this score-to-noise relation as being prescribed to each dimension of \(x\) independently? Do we have \(d\) different score-to-noise relationships of the same form?

The Answer¶

Yes mathematically, but no conceptually.

7. The Score Is a Gradient in \(\mathbb{R}^d\)¶

When an image is flattened:

\[ x \in \mathbb{R}^d \]

The score is:

\[ \nabla_x \log p_t(x) = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_d}\right) \log p_t(x) \]

Formally:

Yes, there are \(d\) partial derivatives
One per coordinate

For Gaussian corruption:

\[ \nabla_x \log p(x_t \mid x_0) = -\frac{1}{\sigma_t^2}(x_t - \mu_t) = -\frac{1}{\sigma_t} \varepsilon \]

This relation does hold coordinate-wise.

8. Why This Does NOT Mean "Independent Dimensions"¶

Key conceptual correction:

Even though the formula is coordinate-wise, and:

\[ \varepsilon \sim \mathcal{N}(0, I) \]

the marginal distribution \(p_t(x)\) does not factorize.

That means:

Pixels are not independent
Genes are not independent
Features are not independent

The dependencies live in:

\[ p_t(x) = \int p(x_t \mid x_0) p_{\text{data}}(x_0)\,dx_0 \]

The score reflects global structure.

9. Why the Network Output Must Be Global¶

Although the score is a \(d\)-vector, each component:

\[ \frac{\partial}{\partial x_i} \log p_t(x) \]

depends on all coordinates of \(x\).

That's why:

CNNs use receptive fields
Transformers use attention
U-Nets use multi-scale context

The model is not learning "\(d\) independent scores". It is learning one high-dimensional vector field.

10. A Helpful Analogy¶

Think of a potential energy surface \(U(x)\) in physics:

Force = \(-\nabla U(x)\)
Each force component is a partial derivative
But the potential is global

Same here:

\[ \text{score}(x) = \nabla \log p(x) \]

Local components, global meaning.

11. Summary: The Conceptual Core¶

On Drift¶

The drift is an externally designed force field that shapes how probability mass flows during corruption
VP-SDE uses a linear, stabilizing drift because it preserves Gaussian structure and yields tractable training
Think of it as a "wind system" guiding probability mass toward a simple attractor

On Score-Noise Relationship¶

The score-noise relation is coordinate-wise, but the score itself is globally coupled
High-dimensional diffusion does not mean independent dimensions
It means a vector field whose components depend on the entire object

12. Next Deep Steps¶

To go deeper:

Contrast VP-SDE with VE-SDE in this same force-field language
Discuss nonlinear drifts for structured data like graphs or gene networks
Derive the Fokker-Planck equation to see how drift and diffusion shape \(p_t(x)\)

SDE View: Design Principles and Intuitions¶

1. Choosing the Drift: Design Principles¶

The Question¶

The Answer¶

2. What the Drift Actually Does¶

Geometric View¶

3. Why the VP Drift Is Linear¶

(a) Globally Stabilizing¶

(b) Preserves Gaussian Structure¶

© Lets Noise Dominate Gradually¶

4. What Other Drifts Would Mean¶

5. Mental Model: Wind + Fog System¶

6. The Score-Noise Relationship in High Dimensions¶

The Question¶

The Answer¶

7. The Score Is a Gradient in \(\mathbb{R}^d\)¶

8. Why This Does NOT Mean "Independent Dimensions"¶

9. Why the Network Output Must Be Global¶

10. A Helpful Analogy¶

11. Summary: The Conceptual Core¶

On Drift¶

On Score-Noise Relationship¶

12. Next Deep Steps¶

Related Documents¶