SDE View: Design Principles and Intuitions¶
This document addresses deeper questions about the SDE formulation of diffusion models: Why are specific drifts chosen? How do we think about the score-noise relationship in high dimensions? What are the design principles?
1. Choosing the Drift: Design Principles¶
The Question¶
When we write the forward SDE:
How do we choose the drift \(f(x, t)\)? For example, the VP-SDE uses:
Should we think of this as an "external force" moving probability mass?
The Answer¶
Yes — it is very helpful to think of the drift as an external force field acting on probability mass.
The main guideline is:
Choose a drift + noise schedule such that the forward process is simple, stable, and ends in a known distribution.
2. What the Drift Actually Does¶
Geometric View¶
In the Fokker–Planck equation, which governs how densities evolve under an SDE:
You can literally read off the roles:
- \(f(x, t)\): transports probability mass (advection term)
- \(g(t)\): diffuses (spreads) probability mass (diffusion term)
Key insight: Drift is an external force field moving probability mass. Noise spreads mass; drift organizes where it flows.
3. Why the VP Drift Is Linear¶
The VP-SDE uses:
This choice is not arbitrary. It satisfies three extremely strong constraints simultaneously:
(a) Globally Stabilizing¶
The drift points toward the origin everywhere:
- No explosion
- No runaway trajectories
- Well-defined reverse dynamics
This is crucial in very high dimensions (e.g., \(d = 3072\) for a 32×32 RGB image).
(b) Preserves Gaussian Structure¶
If the forward process starts Gaussian, it stays Gaussian.
With this drift:
This closed-form marginal is gold — it makes training trivial.
© Lets Noise Dominate Gradually¶
The linear drift slowly shrinks signal while noise grows:
- Early times: signal-dominated
- Late times: noise-dominated
This creates a smooth continuum of difficulty levels for learning.
4. What Other Drifts Would Mean¶
You could choose:
- Nonlinear drift: \(f(x, t) = -\nabla U(x)\) (energy-based)
- Data-dependent drift: \(f(x, t) = h(x)\)
- Anisotropic drift: Different behavior in different directions
But then:
- Marginals are no longer analytic
- Score targets become implicit
- Training becomes unstable or intractable
Guiding principle:
Choose the simplest drift that gives you a tractable forward process and a learnable reverse process.
VP-SDE hits that sweet spot.
5. Mental Model: Wind + Fog System¶
Think of the forward diffusion as:
A carefully engineered wind + fog system that:
- Blows all probability mass toward a simple attractor (the origin)
- Blurs everything along the way (Gaussian noise)
- Leaves behind a smooth gradient field that can be learned (the score)
6. The Score-Noise Relationship in High Dimensions¶
The Question¶
When input \(x\) is an image (flattened to \(\mathbb{R}^d\)), we have:
Should I think of this score-to-noise relation as being prescribed to each dimension of \(x\) independently? Do we have \(d\) different score-to-noise relationships of the same form?
The Answer¶
Yes mathematically, but no conceptually.
7. The Score Is a Gradient in \(\mathbb{R}^d\)¶
When an image is flattened:
The score is:
Formally:
- Yes, there are \(d\) partial derivatives
- One per coordinate
For Gaussian corruption:
This relation does hold coordinate-wise.
8. Why This Does NOT Mean "Independent Dimensions"¶
Key conceptual correction:
Even though the formula is coordinate-wise, and:
the marginal distribution \(p_t(x)\) does not factorize.
That means:
- Pixels are not independent
- Genes are not independent
- Features are not independent
The dependencies live in:
The score reflects global structure.
9. Why the Network Output Must Be Global¶
Although the score is a \(d\)-vector, each component:
depends on all coordinates of \(x\).
That's why:
- CNNs use receptive fields
- Transformers use attention
- U-Nets use multi-scale context
The model is not learning "\(d\) independent scores". It is learning one high-dimensional vector field.
10. A Helpful Analogy¶
Think of a potential energy surface \(U(x)\) in physics:
- Force = \(-\nabla U(x)\)
- Each force component is a partial derivative
- But the potential is global
Same here:
Local components, global meaning.
11. Summary: The Conceptual Core¶
On Drift¶
- The drift is an externally designed force field that shapes how probability mass flows during corruption
- VP-SDE uses a linear, stabilizing drift because it preserves Gaussian structure and yields tractable training
- Think of it as a "wind system" guiding probability mass toward a simple attractor
On Score-Noise Relationship¶
- The score-noise relation is coordinate-wise, but the score itself is globally coupled
- High-dimensional diffusion does not mean independent dimensions
- It means a vector field whose components depend on the entire object
12. Next Deep Steps¶
To go deeper:
- Contrast VP-SDE with VE-SDE in this same force-field language
- Discuss nonlinear drifts for structured data like graphs or gene networks
- Derive the Fokker-Planck equation to see how drift and diffusion shape \(p_t(x)\)