Equivalence of Score, Noise, and Clean Data Parameterizations¶
The Three Parameterizations¶
A neural network in a diffusion model can predict any of these:
- Score: \(s_\theta(x_t, t) \approx \nabla_x \log p_t(x_t)\)
- Noise: \(\varepsilon_\theta(x_t, t) \approx \varepsilon\)
- Clean data: \(\hat{x}_0(x_t, t) \approx x_0\)
Claim: These are mathematically equivalent—you can convert between them.
The Forward Process (Starting Point)¶
The forward process corrupts clean data \(x_0\) into noisy data \(x_t\):

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I) $$

where:
- \(\bar{\alpha}_t = \exp\left(-\int_0^t \beta(s)\,ds\right)\) (cumulative signal retention)
- \(\sqrt{\bar{\alpha}_t}\) scales the signal
- \(\sqrt{1-\bar{\alpha}_t}\) scales the noise
This is the closed-form marginal of the VP-SDE.
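As a sanity check, this closed-form marginal can be sampled directly without simulating the SDE. A minimal NumPy sketch; the linear \(\beta(s)\) schedule and its endpoints are illustrative assumptions, not part of the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t, beta_0=0.1, beta_1=20.0):
    """Cumulative signal retention exp(-integral of beta) for an
    assumed linear schedule beta(s) = beta_0 + s * (beta_1 - beta_0)."""
    integral = beta_0 * t + 0.5 * (beta_1 - beta_0) * t**2
    return np.exp(-integral)

def forward_sample(x0, t):
    """Draw x_t ~ N(sqrt(abar) * x0, (1 - abar) * I) in one shot."""
    abar = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

x0 = np.array([1.0, -2.0, 0.5])
xt, eps = forward_sample(x0, t=0.5)
```

Note that `alpha_bar(0) = 1` (no corruption) and `alpha_bar(t)` decays toward 0, so `x_t` interpolates from clean data to pure noise.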
Deriving the Conditional Score¶
The conditional distribution is:

$$
p_t(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\right) $$
For a Gaussian \(\mathcal{N}(\mu, \Sigma)\), the score is:

$$
\nabla_x \log \mathcal{N}(x; \mu, \Sigma) = -\Sigma^{-1}(x - \mu) $$
Applying this with \(\mu = \sqrt{\bar{\alpha}_t}\, x_0\) and \(\Sigma = (1-\bar{\alpha}_t)\, I\):

$$
\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1-\bar{\alpha}_t} $$
The Key Relationship¶
From the forward process:
$$
x_t - \sqrt{\bar{\alpha}_t}\, x_0 = \sqrt{1-\bar{\alpha}_t}\, \varepsilon $$
Substitute into the score:

$$
\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\sqrt{1-\bar{\alpha}_t}\, \varepsilon}{1-\bar{\alpha}_t} = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha}_t}} $$
This is the fundamental relationship: Score = scaled noise.
Conversion Formulas¶
Let \(\sigma_t = \sqrt{1-\bar{\alpha}_t}\) (noise standard deviation) and \(\alpha_t = \sqrt{\bar{\alpha}_t}\) (signal scale).
Score ↔ Noise¶

$$
s = -\frac{\varepsilon}{\sigma_t}, \qquad \varepsilon = -\sigma_t\, s $$
Noise ↔ Clean Data¶
From \(x_t = \alpha_t x_0 + \sigma_t \varepsilon\), solve for \(x_0\):

$$
\hat{x}_0 = \frac{x_t - \sigma_t \varepsilon}{\alpha_t}, \qquad \varepsilon = \frac{x_t - \alpha_t \hat{x}_0}{\sigma_t} $$
Score ↔ Clean Data¶
Combine the above:

$$
\hat{x}_0 = \frac{x_t + \sigma_t^2\, s}{\alpha_t}, \qquad s = \frac{\alpha_t \hat{x}_0 - x_t}{\sigma_t^2} $$
Summary Table¶
| If you have... | To get Score | To get Noise | To get Clean Data |
|---|---|---|---|
| Score \(s\) | — | \(\varepsilon = -\sigma_t s\) | \(\hat{x}_0 = \frac{x_t + \sigma_t^2 s}{\alpha_t}\) |
| Noise \(\varepsilon\) | \(s = -\varepsilon/\sigma_t\) | — | \(\hat{x}_0 = \frac{x_t - \sigma_t \varepsilon}{\alpha_t}\) |
| Clean Data \(\hat{x}_0\) | \(s = \frac{\alpha_t \hat{x}_0 - x_t}{\sigma_t^2}\) | \(\varepsilon = \frac{x_t - \alpha_t \hat{x}_0}{\sigma_t}\) | — |
Where: \(\alpha_t = \sqrt{\bar{\alpha}_t}\), \(\sigma_t = \sqrt{1-\bar{\alpha}_t}\)
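The table can be checked numerically. A minimal NumPy sketch of all six conversions; the function names are illustrative, and the round-trip check builds \(x_t\) from arbitrary \(x_0\) and \(\varepsilon\):

```python
import numpy as np

def score_from_eps(eps, sigma_t):
    return -eps / sigma_t

def eps_from_score(s, sigma_t):
    return -sigma_t * s

def x0_from_eps(x_t, eps, alpha_t, sigma_t):
    return (x_t - sigma_t * eps) / alpha_t

def eps_from_x0(x_t, x0, alpha_t, sigma_t):
    return (x_t - alpha_t * x0) / sigma_t

def score_from_x0(x_t, x0, alpha_t, sigma_t):
    return (alpha_t * x0 - x_t) / sigma_t**2

def x0_from_score(x_t, s, alpha_t, sigma_t):
    return (x_t + sigma_t**2 * s) / alpha_t

# Round trip: draw arbitrary x0 and eps, form x_t, recover everything.
rng = np.random.default_rng(1)
alpha_t, sigma_t = np.sqrt(0.5), np.sqrt(0.5)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = alpha_t * x0 + sigma_t * eps

s = score_from_eps(eps, sigma_t)
assert np.allclose(eps_from_score(s, sigma_t), eps)
assert np.allclose(x0_from_eps(x_t, eps, alpha_t, sigma_t), x0)
assert np.allclose(x0_from_score(x_t, s, alpha_t, sigma_t), x0)
assert np.allclose(score_from_x0(x_t, x0, alpha_t, sigma_t), s)
assert np.allclose(eps_from_x0(x_t, x0, alpha_t, sigma_t), eps)
```

Every path through the table recovers the same quantities, which is exactly the equivalence claim.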
Why Different Frameworks Use Different Parameterizations¶
DDPM (Ho et al. 2020): Predicts Noise \(\varepsilon\)¶
Reason: Empirically more stable training. The noise \(\varepsilon \sim \mathcal{N}(0, I)\) has a consistent scale across all timesteps, whereas the score magnitude varies with \(t\).
Loss (the simplified DDPM objective):

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \varepsilon,\, t}\left[\left\|\varepsilon - \varepsilon_\theta(x_t, t)\right\|^2\right] $$
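The objective can be sketched on a toy batch. A minimal NumPy example; `eps_pred` is a hypothetical stand-in for the network output \(\varepsilon_\theta(x_t, t)\), not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.5                        # example schedule value at timestep t
x0 = rng.standard_normal((8, 2))         # toy batch of clean data
eps = rng.standard_normal((8, 2))        # the noise actually added
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Hypothetical network output: the true noise plus a small error.
eps_pred = eps + 0.1 * rng.standard_normal((8, 2))
loss = np.mean((eps - eps_pred) ** 2)    # eps-prediction MSE
```

Because the regression target \(\varepsilon \sim \mathcal{N}(0, I)\) has unit scale at every \(t\), this loss is well-conditioned across timesteps.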
Score-Based Models (Song & Ermon 2019): Predicts Score \(s\)¶
Reason: Directly motivated by score matching theory. The score has a clear interpretation as the gradient of log-density.
Loss (denoising score matching):
$$
\mathcal{L}_{\text{DSM}} = \mathbb{E}\left[\left\|s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0)\right\|^2\right] $$
v-prediction (Salimans & Ho 2022): Predicts a Combination¶
Why? For some noise schedules, predicting a linear combination of noise and data works better:
$$
v_t = \alpha_t \varepsilon - \sigma_t x_0 $$
This balances the learning signal across timesteps.
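When \(\alpha_t^2 + \sigma_t^2 = 1\), the pair \((x_t, v_t)\) is a rotation of \((x_0, \varepsilon)\), so both are recoverable from a \(v\)-prediction. A minimal NumPy check of that inversion:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_t, sigma_t = np.sqrt(0.5), np.sqrt(0.5)   # alpha^2 + sigma^2 = 1
x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)

x_t = alpha_t * x0 + sigma_t * eps              # forward process
v = alpha_t * eps - sigma_t * x0                # v-prediction target

# Inverting the rotation recovers both quantities exactly:
x0_rec = alpha_t * x_t - sigma_t * v
eps_rec = sigma_t * x_t + alpha_t * v
assert np.allclose(x0_rec, x0)
assert np.allclose(eps_rec, eps)
```

At \(t \approx 0\) the target is mostly \(\varepsilon\) and at large \(t\) mostly \(-x_0\), which is the sense in which \(v\) balances the learning signal.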
Intuitive Understanding¶
All three quantities answer the same question from different angles:
| Parameterization | Question Answered |
|---|---|
| Score \(\nabla_x \log p_t(x)\) | "Which direction increases probability?" |
| Noise \(\varepsilon\) | "What random noise was added to the clean data?" |
| Clean data \(x_0\) | "What was the original data before corruption?" |
Given the forward process, knowing any one of these determines the other two.
Practical Example¶
Suppose at timestep \(t\):

- \(\bar{\alpha}_t = 0.5\) (so \(\alpha_t = \sqrt{0.5} \approx 0.707\), \(\sigma_t = \sqrt{0.5} \approx 0.707\))
- Current noisy state: \(x_t = [1.0, 2.0]\)
- Network predicts noise: \(\varepsilon_\theta = [0.5, 1.0]\)

Then:

- Score: \(s = -\varepsilon/\sigma_t = -[0.5, 1.0]/0.707 \approx [-0.707, -1.414]\)
- Clean data: \(\hat{x}_0 = (x_t - \sigma_t \varepsilon)/\alpha_t = ([1.0, 2.0] - 0.707 \cdot [0.5, 1.0])/0.707 \approx [0.914, 1.828]\)
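The arithmetic above can be verified directly in a few lines of NumPy:

```python
import numpy as np

alpha_t = sigma_t = np.sqrt(0.5)        # from alpha_bar_t = 0.5
x_t = np.array([1.0, 2.0])              # current noisy state
eps = np.array([0.5, 1.0])              # predicted noise

score = -eps / sigma_t                  # approx [-0.707, -1.414]
x0_hat = (x_t - sigma_t * eps) / alpha_t  # approx [0.914, 1.828]
```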
All three representations contain the same information, just expressed differently.
Key Takeaway¶
The forward process equation:

$$
x_t = \alpha_t x_0 + \sigma_t \varepsilon $$
is the Rosetta Stone that connects all three parameterizations. Given any two of \((x_t, x_0, \varepsilon)\) and the noise schedule \((\alpha_t, \sigma_t)\), you can compute the third—and from there, derive the score.
References¶
- Ho et al. (2020): DDPM — Uses noise prediction
- Song & Ermon (2019): Score-based models — Uses score prediction
- Salimans & Ho (2022): Progressive Distillation — Introduces v-prediction
- Karras et al. (2022): EDM — Analyzes different parameterizations systematically