Skip to content

SDE View: Design Principles and Intuitions

This document addresses deeper questions about the SDE formulation of diffusion models: Why are specific drifts chosen? How do we think about the score-noise relationship in high dimensions? What are the design principles?


1. Choosing the Drift: Design Principles

The Question

When we write the forward SDE:

\[ dx(t) = f(x(t), t)\,dt + g(t)\,dw(t) \]

How do we choose the drift \(f(x, t)\)? For example, the VP-SDE uses:

\[ f(x, t) = -\frac{1}{2}\beta(t) x \]

Should we think of this as an "external force" moving probability mass?

The Answer

Yes — it is very helpful to think of the drift as an external force field acting on probability mass.

The main guideline is:

Choose a drift + noise schedule such that the forward process is simple, stable, and ends in a known distribution.


2. What the Drift Actually Does

Geometric View

In the Fokker–Planck equation, which governs how densities evolve under an SDE:

\[ \frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \left(f(x, t) p_t(x)\right) + \frac{1}{2}\nabla^2 \left(g(t)^2 p_t(x)\right) \]

You can literally read off the roles:

  • \(f(x, t)\): transports probability mass (advection term)
  • \(g(t)\): diffuses (spreads) probability mass (diffusion term)

Key insight: Drift is an external force field moving probability mass. Noise spreads mass; drift organizes where it flows.


3. Why the VP Drift Is Linear

The VP-SDE uses:

\[ f(x, t) = -\frac{1}{2}\beta(t) x \]

This choice is not arbitrary. It satisfies three extremely strong constraints simultaneously:

(a) Globally Stabilizing

The drift points toward the origin everywhere:

  • No explosion
  • No runaway trajectories
  • Well-defined reverse dynamics

This is crucial in very high dimensions (e.g., \(d = 3072\) for a 32×32 RGB image).

(b) Preserves Gaussian Structure

If the forward process starts Gaussian, it stays Gaussian.

With this drift:

\[ x(t) \mid x_0 \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I\right) \]

This closed-form marginal is gold — it makes training trivial.

© Lets Noise Dominate Gradually

The linear drift slowly shrinks signal while noise grows:

  • Early times: signal-dominated
  • Late times: noise-dominated

This creates a smooth continuum of difficulty levels for learning.


4. What Other Drifts Would Mean

You could choose:

  • Nonlinear drift: \(f(x, t) = -\nabla U(x)\) (energy-based)
  • Data-dependent drift: \(f(x, t) = h(x)\)
  • Anisotropic drift: Different behavior in different directions

But then:

  • Marginals are no longer analytic
  • Score targets become implicit
  • Training becomes unstable or intractable

Guiding principle:

Choose the simplest drift that gives you a tractable forward process and a learnable reverse process.

VP-SDE hits that sweet spot.


5. Mental Model: Wind + Fog System

Think of the forward diffusion as:

A carefully engineered wind + fog system that:

  1. Blows all probability mass toward a simple attractor (the origin)
  2. Blurs everything along the way (Gaussian noise)
  3. Leaves behind a smooth gradient field that can be learned (the score)

6. The Score-Noise Relationship in High Dimensions

The Question

When input \(x\) is an image (flattened to \(\mathbb{R}^d\)), we have:

\[ \nabla_x \log p(x_t \mid x_0) = -\frac{1}{\sigma_t} \varepsilon \]

Should I think of this score-to-noise relation as being prescribed to each dimension of \(x\) independently? Do we have \(d\) different score-to-noise relationships of the same form?

The Answer

Yes mathematically, but no conceptually.


7. The Score Is a Gradient in \(\mathbb{R}^d\)

When an image is flattened:

\[ x \in \mathbb{R}^d \]

The score is:

\[ \nabla_x \log p_t(x) = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_d}\right) \log p_t(x) \]

Formally:

  • Yes, there are \(d\) partial derivatives
  • One per coordinate

For Gaussian corruption:

\[ \nabla_x \log p(x_t \mid x_0) = -\frac{1}{\sigma_t^2}(x_t - \mu_t) = -\frac{1}{\sigma_t} \varepsilon \]

This relation does hold coordinate-wise.


8. Why This Does NOT Mean "Independent Dimensions"

Key conceptual correction:

Even though the formula is coordinate-wise, and:

\[ \varepsilon \sim \mathcal{N}(0, I) \]

the marginal distribution \(p_t(x)\) does not factorize.

That means:

  • Pixels are not independent
  • Genes are not independent
  • Features are not independent

The dependencies live in:

\[ p_t(x) = \int p(x_t \mid x_0) p_{\text{data}}(x_0)\,dx_0 \]

The score reflects global structure.


9. Why the Network Output Must Be Global

Although the score is a \(d\)-vector, each component:

\[ \frac{\partial}{\partial x_i} \log p_t(x) \]

depends on all coordinates of \(x\).

That's why:

  • CNNs use receptive fields
  • Transformers use attention
  • U-Nets use multi-scale context

The model is not learning "\(d\) independent scores". It is learning one high-dimensional vector field.


10. A Helpful Analogy

Think of a potential energy surface \(U(x)\) in physics:

  • Force = \(-\nabla U(x)\)
  • Each force component is a partial derivative
  • But the potential is global

Same here:

\[ \text{score}(x) = \nabla \log p(x) \]

Local components, global meaning.


11. Summary: The Conceptual Core

On Drift

  • The drift is an externally designed force field that shapes how probability mass flows during corruption
  • VP-SDE uses a linear, stabilizing drift because it preserves Gaussian structure and yields tractable training
  • Think of it as a "wind system" guiding probability mass toward a simple attractor

On Score-Noise Relationship

  • The score-noise relation is coordinate-wise, but the score itself is globally coupled
  • High-dimensional diffusion does not mean independent dimensions
  • It means a vector field whose components depend on the entire object

12. Next Deep Steps

To go deeper:

  1. Contrast VP-SDE with VE-SDE in this same force-field language
  2. Discuss nonlinear drifts for structured data like graphs or gene networks
  3. Derive the Fokker-Planck equation to see how drift and diffusion shape \(p_t(x)\)