Score Matching: The Core Objective
This document explains the score matching objective—a technique for training energy-based models without computing the intractable partition function. Score matching is foundational to modern generative models including diffusion models.
1. What Problem Score Matching Solves
We want to learn a probability density over data \(p_D(x)\), but we only have samples \(x \sim p_D\).
This is the classic density estimation problem.
The difficulty: Many flexible models define densities only up to a normalizing constant, which makes maximum likelihood hard or impossible.
Score matching offers a workaround: Instead of matching the density itself, we match its score (the gradient of the log-density).
2. The Modeling Setup: Energy-Based Models (EBMs)
2.1 Data space and variables
- \(x \in \mathbb{R}^d\) — A data vector (e.g., an image flattened into pixels, a feature vector, etc.)
- \(p_D(x)\) — The true but unknown data-generating distribution
2.2 Model density via an energy function
We model the data using an energy-based model:
\[ p_\theta(x) = \frac{\exp\left(-E_\theta(x)\right)}{Z_\theta} \]
Where:
- \(E_\theta(x) : \mathbb{R}^d \to \mathbb{R}\) — A scalar-valued energy function, typically a neural network
- \(\theta\) — Model parameters
- \(Z_\theta = \int \exp(-E_\theta(x)) \, dx\) — The partition function (normalizing constant)
Key issue: \(Z_\theta\) depends on \(\theta\) and is usually intractable.
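To make the setup concrete, here is a minimal EBM sketch in PyTorch. The architecture and names (`EnergyNet`, the hidden width) are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E_theta(x): R^d -> R (illustrative MLP)."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # shape: (batch,)

energy = EnergyNet(d=2)
x = torch.randn(16, 2)

# log p_theta(x) = -E_theta(x) - log Z_theta.
# The first term is one forward pass; log Z_theta is intractable in general.
unnormalized_log_p = -energy(x)
```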
3. The Score Function: The Central Object
3.1 Definition
The score function of a density \(p(x)\) is:
\[ s(x) = \nabla_x \log p(x) \]
This is a vector in \(\mathbb{R}^d\), pointing in the direction in which the log-density increases fastest.
3.2 Why the score is special
Let's expand the model's log-density:
\[ \log p_\theta(x) = -E_\theta(x) - \log Z_\theta \]
Taking the gradient w.r.t. \(x\) (note that \(\log Z_\theta\) is constant in \(x\)):
\[ s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) \]
Important observation:
- The normalizing constant \(Z_\theta\) disappears
- The score depends only on the energy gradient
This is the loophole score matching exploits.
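Because the score is just \(-\nabla_x E_\theta(x)\), it can be computed with one backward pass through the energy network. A minimal autograd sketch, reusing the illustrative `energy` network from above:

```python
def model_score(energy, x):
    """s_theta(x) = -grad_x E_theta(x); Z_theta never appears."""
    if not x.requires_grad:
        x = x.detach().requires_grad_(True)
    # Summing over the batch keeps per-sample gradients independent.
    e = energy(x).sum()
    (grad_e,) = torch.autograd.grad(e, x, create_graph=True)
    return -grad_e  # shape: (batch, d)

scores = model_score(energy, torch.randn(16, 2))
```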
4. What Does It Mean to "Match Scores"?
If two densities have the same score everywhere (under mild regularity conditions), their log-densities differ only by an additive constant; since both densities integrate to one, that constant is forced to zero and the densities are identical. Matching scores is therefore enough to match densities.
So instead of minimizing a likelihood-based objective, which requires the normalized density:
\[ \mathbb{E}_{x \sim p_D(x)}\left[ -\log p_\theta(x) \right] \]
we try to make the model score match the data score:
\[ s_\theta(x) \approx \nabla_x \log p_D(x) \quad \text{for all } x \]
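Spelled out, the uniqueness argument is one short chain of implications (stated informally; the regularity conditions make each step valid):

\[
\nabla_x \log p(x) = \nabla_x \log q(x) \;\; \forall x
\;\Longrightarrow\; \log p(x) = \log q(x) + c
\;\Longrightarrow\; p(x) = e^{c}\, q(x)
\;\Longrightarrow\; c = 0,
\]

since both densities integrate to one.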
5. The Explicit Score Matching Objective
5.1 The ideal (but infeasible) objective
We start with the explicit score matching (ESM) loss:
\[ J_{\mathrm{ESM}}(\theta) = \frac{1}{2}\, \mathbb{E}_{x \sim p_D(x)}\left[ \left\| s_\theta(x) - \nabla_x \log p_D(x) \right\|^2 \right] \]
Let's unpack every symbol.
5.2 Notation breakdown
- \(\mathbb{E}_{x \sim p_D(x)}[\cdot]\) — Expectation over true data samples
- \(s_\theta(x)\) — Model score \(= \nabla_x \log p_\theta(x)\)
- \(\nabla_x \log p_D(x)\) — True data score (unknown!)
- \(\|\cdot\|\) — Euclidean norm
- Factor \(\frac{1}{2}\) — mathematical convenience (it cancels the 2 produced by differentiating the square)
5.3 Why this objective is impossible to compute
We do not know \(p_D(x)\), so:
- We cannot compute \(\log p_D(x)\)
- We cannot compute \(\nabla_x \log p_D(x)\)
So this loss is conceptually useful but computationally useless.
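One exception makes the objective concrete: in toy problems where \(p_D\) is chosen in closed form, the true score is known and ESM can be evaluated directly. A sketch assuming \(p_D = \mathcal{N}(\mu, \sigma^2 I)\), whose score is \(-(x - \mu)/\sigma^2\), reusing the `model_score` helper from above:

```python
# ESM is computable here only because we *chose* p_D with a known score.
mu, sigma = 1.0, 2.0
x = mu + sigma * torch.randn(1024, 2)      # samples from p_D = N(mu, sigma^2 I)
true_score = -(x - mu) / sigma**2          # closed-form data score
diff = model_score(energy, x) - true_score
esm_loss = 0.5 * (diff ** 2).sum(dim=1).mean()
```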
6. The Key Mathematical Trick: Integration by Parts
Score matching transforms the explicit objective into one that does not involve \(p_D\).
To do this, we introduce differential operators.
7. Differential Operators and Notation
7.1 Gradient operator
For a scalar function \(f : \mathbb{R}^d \to \mathbb{R}\):
\[ \nabla_x f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_d} \right)^\top \]
7.2 Jacobian operator
For a vector-valued function \(f(x) = (f_1(x), \dots, f_d(x))^\top\), the Jacobian matrix is:
\[ J_x f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial x_1} & \cdots & \frac{\partial f_d}{\partial x_d} \end{pmatrix} \]
7.3 Trace operator
For a square matrix \(A \in \mathbb{R}^{d \times d}\):
\[ \mathrm{tr}(A) = \sum_{i=1}^{d} A_{ii} \]
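All three operators map directly onto autograd. A quick illustrative check at a single point, using `torch.autograd.functional.jacobian` together with the `model_score` helper from above (exact, but it scales poorly with \(d\); large-scale methods replace it with stochastic trace estimators):

```python
from torch.autograd.functional import jacobian

x0 = torch.randn(2)                                 # a single unbatched point
J = jacobian(lambda z: model_score(energy, z), x0)  # (d, d) Jacobian of the score
trace_J = torch.trace(J)                            # tr(J_x s_theta(x)) at x0
```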
8. The Tractable Score Matching Objective
Using integration by parts (the derivation is standard; a sketch follows the formula), the explicit objective becomes an equivalent tractable objective, equal to \(J_{\mathrm{ESM}}\) up to an additive constant that does not depend on \(\theta\):
\[ J_{\mathrm{ISM}}(\theta) = \mathbb{E}_{x \sim p_D(x)}\left[ \mathrm{tr}\!\left( J_x\, s_\theta(x) \right) + \frac{1}{2} \left\| s_\theta(x) \right\|^2 \right] \]
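For completeness, here is a sketch of the step, assuming \(p_D\) is smooth and \(p_D(x)\, s_\theta(x) \to 0\) as \(\|x\| \to \infty\). Expanding the ESM square gives three terms: \(\frac{1}{2}\mathbb{E}[\|\nabla_x \log p_D(x)\|^2]\) is constant in \(\theta\) and can be dropped, and the cross term is rewritten so that \(p_D\) appears only as a sampling distribution:

\[
\begin{aligned}
-\,\mathbb{E}_{x \sim p_D}\!\left[ s_\theta(x)^\top \nabla_x \log p_D(x) \right]
&= -\int s_\theta(x)^\top \nabla_x p_D(x) \, dx \\
&= \int \mathrm{tr}\!\left( J_x s_\theta(x) \right) p_D(x) \, dx
= \mathbb{E}_{x \sim p_D}\!\left[ \mathrm{tr}\!\left( J_x s_\theta(x) \right) \right],
\end{aligned}
\]

using \(\nabla_x \log p_D = \nabla_x p_D / p_D\) in the first equality and coordinate-wise integration by parts (with vanishing boundary terms) in the second.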
9. Why This Works
9.1 What disappeared?
- \(\nabla_x \log p_D(x)\) is gone
- Only \(s_\theta(x)\) and its derivatives remain
9.2 What we can compute
Both terms in the tractable objective are computable:
- \(\|s_\theta(x)\|^2 = \|\nabla_x \log p_\theta(x)\|^2\) — squared norm of the model score
- \(\mathrm{tr}(J_x s_\theta(x))\) — trace of the Jacobian of the score, i.e., the sum of the diagonal second derivatives of \(\log p_\theta\)
The expectation is approximated by sampling from the data distribution.
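Putting the pieces together, a minimal sketch of the tractable objective for small \(d\): the trace is computed exactly with one extra backward pass per input dimension (double backprop through the `model_score` helper above). High-dimensional variants typically swap the exact trace for Hutchinson-style stochastic estimates, as in sliced score matching.

```python
def ism_loss(energy, x):
    """E[ tr(J_x s_theta(x)) + 0.5 * ||s_theta(x)||^2 ], estimated on a batch."""
    x = x.detach().requires_grad_(True)
    s = model_score(energy, x)                     # (batch, d), graph kept to x
    sq_norm = 0.5 * (s ** 2).sum(dim=1)            # 0.5 * ||s_theta(x)||^2
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                    # exact trace: O(d) backward passes
        (g,) = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)
        trace = trace + g[:, i]                    # accumulate ds_i/dx_i per sample
    return (trace + sq_norm).mean()

loss = ism_loss(energy, torch.randn(128, 2))
loss.backward()                                    # gradients w.r.t. theta; no Z_theta needed
```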
10. Connection to EBMs and Beyond
Score matching is the foundation for:
- Training EBMs without computing \(Z_\theta\)
- Denoising score matching — a practical variant using noisy data
- Diffusion models — learn scores at multiple noise levels
- Fisher score matching — parameter-space analogue for simulation-based inference
What's Next
See Fisher Score Matching for how these ideas extend to estimating gradients w.r.t. parameters (not data) in likelihood-free settings.