
Score Matching: The Core Objective

This document explains the score matching objective—a technique for training energy-based models without computing the intractable partition function. Score matching is foundational to modern generative models including diffusion models.


1. What Problem Score Matching Solves

We want to learn a probability density over data \(p_D(x)\), but we only have samples \(x \sim p_D\).

This is the classic density estimation problem.

The difficulty: Many flexible models define densities only up to a normalizing constant, which makes maximum likelihood hard or impossible.

Score matching offers a workaround: Instead of matching the density itself, we match its score (the gradient of the log-density).


2. The Modeling Setup: Energy-Based Models (EBMs)

2.1 Data space and variables

  • \(x \in \mathbb{R}^d\) — A data vector (e.g., an image flattened into pixels, a feature vector, etc.)
  • \(p_D(x)\) — The true but unknown data-generating distribution

2.2 Model density via an energy function

We model the data using an energy-based model:

\[ p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta} \]

Where:

  • \(E_\theta : \mathbb{R}^d \to \mathbb{R}\) — A scalar-valued energy function, typically a neural network
  • \(\theta\) — Model parameters
  • \(Z_\theta = \int \exp(-E_\theta(x)) \, dx\) — The partition function (normalizing constant)

Key issue: \(Z_\theta\) depends on \(\theta\) and is usually intractable.
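
To make the setup concrete, here is a minimal sketch of such a model in PyTorch. The architecture below is an illustrative choice, not something prescribed by the text; the only structural requirement is that the network outputs a single scalar energy per input.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Illustrative energy function E_theta: R^d -> R (a small MLP)."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d) -> energies: (batch,)
        return self.net(x).squeeze(-1)

# exp(-E_theta(x)) is only an *unnormalized* density; the partition function
# Z_theta = integral of exp(-E_theta(x)) dx is exactly the intractable quantity above.
```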


3. The Score Function: The Central Object

3.1 Definition

The score function of a density is:

\[ s_\theta(x) := \nabla_x \log p_\theta(x) \]

This is a vector in \(\mathbb{R}^d\).

3.2 Why the score is special

Let's expand it:

\[ \log p_\theta(x) = -E_\theta(x) - \log Z_\theta \]

Taking gradient w.r.t. \(x\):

\[ \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) \]

Important observation:

  • The normalizing constant \(Z_\theta\) disappears
  • The score depends only on the energy gradient

This is the loophole score matching exploits.
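
In code, the score can therefore be obtained from the energy network alone via automatic differentiation. The following is a minimal sketch, assuming `energy_net` is any callable mapping a `(batch, d)` tensor to a `(batch,)` tensor of energies (such as the `EnergyNet` sketch above).

```python
import torch

def model_score(energy_net, x: torch.Tensor) -> torch.Tensor:
    """s_theta(x) = grad_x log p_theta(x) = -grad_x E_theta(x).

    Z_theta never appears: it is constant in x, so its x-gradient is zero.
    """
    x = x.detach().requires_grad_(True)
    energy = energy_net(x).sum()  # summing keeps per-sample gradients separate
    grad_E, = torch.autograd.grad(energy, x, create_graph=True)
    return -grad_E                # shape (batch, d)
```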


4. What Does It Mean to "Match Scores"?

If two densities have the same score everywhere (under mild regularity conditions), then their log-densities differ only by an additive constant; since both densities integrate to one, that constant is forced to zero and the densities are identical, which is exactly what we want.

So instead of minimizing:

\[ \mathrm{KL}(p_D \| p_\theta) \]

we try to make:

\[ \nabla_x \log p_\theta(x) \approx \nabla_x \log p_D(x) \]

5. The Explicit Score Matching Objective

5.1 The ideal (but infeasible) objective

We start with the explicit score matching (ESM) loss:

\[ \mathcal{L}_{\text{ESM}}(\theta) = \mathbb{E}_{x \sim p_D(x)} \left[ \frac{1}{2} \left| s_\theta(x) - \nabla_x \log p_D(x) \right|^2 \right] \]

Let's unpack every symbol.


5.2 Notation breakdown

  • \(\mathbb{E}_{x \sim p_D(x)}[\cdot]\) — Expectation over true data samples
  • \(s_\theta(x)\) — Model score \(= \nabla_x \log p_\theta(x)\)
  • \(\nabla_x \log p_D(x)\) — True data score (unknown!)
  • \(|\cdot|\) — Euclidean norm
  • Factor \(\frac{1}{2}\) — For mathematical convenience

5.3 Why this objective is impossible to compute

We do not know \(p_D(x)\), so:

  • We cannot compute \(\log p_D(x)\)
  • We cannot compute \(\nabla_x \log p_D(x)\)

So this loss is conceptually useful but computationally useless.
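
To make the definition concrete anyway, here is a toy sketch in which \(p_D\) is deliberately chosen to be a distribution we know in closed form: a standard Gaussian, whose score is \(\nabla_x \log p_D(x) = -x\). This is purely illustrative; no such closed form exists for real data.

```python
import torch

def esm_loss_gaussian_toy(model_score_fn, x: torch.Tensor) -> torch.Tensor:
    """ESM loss for the toy case p_D = N(0, I), whose true score is -x.

    `model_score_fn` is any function returning s_theta(x) for a batch,
    e.g. the `model_score` sketch above.
    """
    true_score = -x                               # closed-form data score for N(0, I)
    diff = model_score_fn(x) - true_score         # (batch, d)
    return 0.5 * diff.pow(2).sum(dim=1).mean()    # Monte Carlo estimate of the expectation
```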


6. The Key Mathematical Trick: Integration by Parts

Score matching transforms the explicit objective into one that does not involve \(p_D\).

To do this, we introduce differential operators.


7. Differential Operators and Notation

7.1 Gradient operator

For a scalar function \(f : \mathbb{R}^d \to \mathbb{R}\):

\[ \nabla_x f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_d} \end{pmatrix} \]

7.2 Jacobian operator

For a vector-valued function \(f(x) = (f_1(x), \dots, f_d(x))^\top\), the Jacobian matrix is:

\[ J_x f(x) = \left[ \frac{\partial f_i}{\partial x_j} \right]_{i,j} \in \mathbb{R}^{d \times d} \]

7.3 Trace operator

For a square matrix \(A\):

\[ \mathrm{tr}(A) = \sum_i A_{ii} \]
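
As a quick numerical check of these operators (purely illustrative), take \(f(x) = (x_1^2,\; x_1 x_2)^\top\): its Jacobian has diagonal entries \(2x_1\) and \(x_1\), hence trace \(3x_1\).

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    # f(x) = (x1^2, x1 * x2); the Jacobian diagonal is (2*x1, x1), trace 3*x1
    return torch.stack([x[0] ** 2, x[0] * x[1]])

x = torch.tensor([1.5, -0.5])
J = torch.autograd.functional.jacobian(f, x)   # 2x2 Jacobian matrix
print(J)
print(torch.trace(J))                          # 3 * 1.5 = 4.5
```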

8. The Tractable Score Matching Objective

Using integration by parts (the full argument is standard; a one-dimensional sketch follows the formula below), the explicit objective becomes:

\[ \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{x \sim p_D(x)} \left[ \frac{1}{2}|s_\theta(x)|^2 + \mathrm{tr}(J_x s_\theta(x)) \right] + \text{const} \]
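
For readers who want the omitted step, here is a sketch in one dimension, assuming the boundary term \(p_D(x)\, s_\theta(x)\) vanishes as \(|x| \to \infty\). Expanding the square in \(\mathcal{L}_{\text{ESM}}\) gives the \(\frac{1}{2}|s_\theta(x)|^2\) term, a cross term, and the term \(\frac{1}{2}|\nabla_x \log p_D(x)|^2\), which does not depend on \(\theta\) and becomes the constant. The cross term is rewritten by integration by parts:

\[ \begin{aligned} -\,\mathbb{E}_{p_D}\!\left[ s_\theta(x)\, \partial_x \log p_D(x) \right] &= -\int p_D(x)\, s_\theta(x)\, \frac{\partial_x p_D(x)}{p_D(x)} \, dx \\ &= -\int s_\theta(x)\, \partial_x p_D(x)\, dx \\ &= -\Big[ s_\theta(x)\, p_D(x) \Big]_{-\infty}^{\infty} + \int p_D(x)\, \partial_x s_\theta(x)\, dx \\ &= \mathbb{E}_{p_D}\!\left[ \partial_x s_\theta(x) \right] \end{aligned} \]

In \(d\) dimensions the same identity is applied coordinate-by-coordinate, and the resulting sum \(\sum_i \partial_{x_i} s_{\theta,i}(x)\) is exactly the trace term \(\mathrm{tr}(J_x s_\theta(x))\) above.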

9. Why This Works

9.1 What disappeared?

  • \(\nabla_x \log p_D(x)\) is gone
  • Only \(s_\theta(x)\) and its derivatives remain

9.2 What we can compute

Both terms in the tractable objective are computable:

  • \(|s_\theta(x)|^2 = |\nabla_x \log p_\theta(x)|^2\) — squared norm of the model score
  • \(\mathrm{tr}(J_x s_\theta(x))\) — trace of the Jacobian of the score, i.e. the sum of diagonal second derivatives \(\sum_i \partial^2_{x_i} \log p_\theta(x)\)

The expectation is approximated by a Monte Carlo average over minibatches drawn from the data distribution, as in the sketch below.
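
A minimal sketch of this computation for an EBM, again assuming `energy_net` maps a `(batch, d)` tensor to a `(batch,)` tensor of energies. The Jacobian trace is computed exactly, one coordinate at a time, which costs \(d\) extra backward passes and is therefore only practical for moderate \(d\).

```python
import torch

def score_matching_loss(energy_net, x: torch.Tensor) -> torch.Tensor:
    """Tractable score matching loss for an EBM: s_theta(x) = -grad_x E_theta(x)."""
    x = x.detach().requires_grad_(True)
    energy = energy_net(x).sum()
    score = -torch.autograd.grad(energy, x, create_graph=True)[0]   # (batch, d)

    norm_term = 0.5 * score.pow(2).sum(dim=1)                       # 0.5 * |s_theta(x)|^2

    trace_term = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):
        # diagonal entry (J_x s_theta)_{ii} = d s_i / d x_i, per sample
        grad_i = torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0][:, i]
        trace_term = trace_term + grad_i

    return (norm_term + trace_term).mean()                          # Monte Carlo over the batch
```

In a training loop one would minimize this loss over \(\theta\) with a standard optimizer; `create_graph=True` keeps the trace term differentiable with respect to the parameters.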


10. Connection to EBMs and Beyond

Score matching is the foundation for:

  • Training EBMs without computing \(Z_\theta\)
  • Denoising score matching — a practical variant using noisy data
  • Diffusion models — learn scores at multiple noise levels
  • Fisher score matching — parameter-space analogue for simulation-based inference

What's Next

See Fisher Score Matching for how these ideas extend to estimating gradients w.r.t. parameters (not data) in likelihood-free settings.