
Score Matching Objective: The Integration-by-Parts Derivation

This document proves the classic identity behind (explicit) score matching—the "integration-by-parts trick" that makes score matching usable without knowing \(p_D\). This derivation shows exactly where the unknown data distribution terms drop out.

The key identity:

\[ \mathbb{E}_{p_D}\left[\frac{1}{2}|s_\theta(x) - \nabla_x \log p_D(x)|^2\right] = \mathbb{E}_{p_D}\left[\frac{1}{2}|s_\theta(x)|^2 + \mathrm{tr}(\nabla_x s_\theta(x))\right] + \text{const} \]

The left side requires the unknown \(\nabla_x \log p_D(x)\); the right side doesn't.


Step 0: Notation and Assumptions

Variables and distributions

  • \(x = (x_1, \dots, x_d) \in \mathbb{R}^d\) — data vector
  • \(p_D(x)\) — true (unknown) data density; we can sample from it
  • \(p_\theta(x)\) — model density (e.g., an EBM); \(\theta\) are model parameters

Score functions

  • Model score:
\[ s_\theta(x) := \nabla_x \log p_\theta(x) \in \mathbb{R}^d \]

Component form: \(s_{\theta,i}(x) = \frac{\partial}{\partial x_i} \log p_\theta(x)\)

  • Data score:
\[ s_D(x) := \nabla_x \log p_D(x) \in \mathbb{R}^d \]

which we cannot compute directly because \(p_D\) is unknown.

Differential operators

  • Gradient w.r.t. \(x\): \(\nabla_x\)
  • Jacobian of a vector field \(s_\theta(x)\):
\[ J_x s_\theta(x) \in \mathbb{R}^{d \times d}, \quad (J_x s_\theta)_{ij} = \frac{\partial s_{\theta,i}}{\partial x_j} \]
  • Divergence (a scalar):
\[ \nabla_x \cdot s_\theta(x) := \sum_{i=1}^d \frac{\partial s_{\theta,i}(x)}{\partial x_i} = \mathrm{tr}(J_x s_\theta(x)) \]
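In code, both the score and its divergence come directly from automatic differentiation. Below is a minimal JAX sketch, using a hypothetical standard-normal model log-density as a stand-in for any differentiable \(\log p_\theta\):

```python
import jax
import jax.numpy as jnp

def log_p_theta(x):
    # Hypothetical stand-in for the model log-density; an unnormalized
    # one works too, since the normalizer is constant in x.
    return -0.5 * jnp.sum(x**2)

# Model score: s_theta(x) = grad_x log p_theta(x)
score = jax.grad(log_p_theta)

def divergence(x):
    # div s_theta(x) = tr(J_x s_theta(x)), the trace of the Jacobian
    return jnp.trace(jax.jacfwd(score)(x))

x = jnp.array([1.0, -2.0, 0.5])
print(score(x))       # -x for this toy density
print(divergence(x))  # -3.0, i.e. tr(-I) in d = 3
```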

Boundary condition

We assume the boundary term vanishes. For \(\mathcal{X} = \mathbb{R}^d\):

\[ \lim_{|x| \to \infty} p_D(x) \cdot s_{\theta,i}(x) = 0 \quad \text{for each } i \]

This is the standard regularity condition in score matching proofs; it holds, for example, when \(p_D\) has Gaussian-like tails and \(s_\theta\) grows at most polynomially.

Why we need it: Integration by parts produces a boundary term; we want it to be zero.


Step 1: Start from the Explicit Score Matching Objective

Define:

\[ \mathcal{L}_{\text{ESM}}(\theta) := \mathbb{E}_{x \sim p_D} \left[ \frac{1}{2} |s_\theta(x) - s_D(x)|^2 \right] \]

Explanation: We want the model score \(s_\theta\) to match the true score \(s_D\): matching scores determines \(\log p\) up to an additive constant, hence the density up to normalization.
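Note that \(\mathcal{L}_{\text{ESM}}\) can only be evaluated directly in synthetic settings where \(s_D\) happens to be known. A minimal Monte Carlo sketch in JAX, assuming a standard-normal \(p_D\) (so \(s_D(x) = -x\)) and a hypothetical Gaussian model with a single scale parameter:

```python
import jax
import jax.numpy as jnp

d = 3
xs = jax.random.normal(jax.random.PRNGKey(0), (10_000, d))  # samples from p_D = N(0, I)

def s_D(x):
    return -x  # known only because p_D is synthetic here

def s_theta(x, log_sigma):
    # Hypothetical model N(0, sigma^2 I): s_theta(x) = -x / sigma^2
    return -x * jnp.exp(-2.0 * log_sigma)

def esm_loss(log_sigma):
    diff = jax.vmap(lambda x: s_theta(x, log_sigma) - s_D(x))(xs)
    return 0.5 * jnp.mean(jnp.sum(diff**2, axis=-1))

print(esm_loss(0.3))  # positive: wrong scale
print(esm_loss(0.0))  # exactly 0: sigma = 1 recovers s_D
```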


Step 2: Expand the Squared Norm

\[ \frac{1}{2}|s_\theta - s_D|^2 = \frac{1}{2}|s_\theta|^2 - \langle s_\theta, s_D \rangle + \frac{1}{2}|s_D|^2 \]

Explanation: This is just \(|a-b|^2 = |a|^2 - 2a^\top b + |b|^2\), with the factor \(\frac{1}{2}\) removing the 2.

Taking expectation under \(p_D\):

\[ \mathcal{L}_{\text{ESM}}(\theta) = \mathbb{E}_{p_D}\left[\frac{1}{2}|s_\theta(x)|^2\right] - \mathbb{E}_{p_D}\left[\langle s_\theta(x), s_D(x) \rangle\right] + \mathbb{E}_{p_D}\left[\frac{1}{2}|s_D(x)|^2\right] \]

Explanation: Linearity of expectation.

Now note: the last term does not depend on \(\theta\), because \(s_D\) is fixed by the data distribution.

So:

\[ \mathcal{L}_{\text{ESM}}(\theta) = \mathbb{E}_{p_D}\left[\frac{1}{2}|s_\theta(x)|^2\right] - \mathbb{E}_{p_D}\left[\langle s_\theta(x), s_D(x) \rangle\right] + \text{const} \]

Explanation: When optimizing over \(\theta\), constants can be ignored.

The real enemy is the cross term \(\mathbb{E}_{p_D}[\langle s_\theta, s_D \rangle]\), because it contains \(s_D = \nabla \log p_D\), which we can't compute.


Step 3: Rewrite the Cross Term

Write expectation as an integral:

\[ \mathbb{E}_{p_D}\left[\langle s_\theta(x), s_D(x) \rangle\right] = \int p_D(x) \cdot s_\theta(x)^\top \nabla_x \log p_D(x) \, dx \]

Explanation: By definition, \(\mathbb{E}_{p_D}[f(x)] = \int f(x) p_D(x) \, dx\).

Now use the identity:

\[ \nabla_x \log p_D(x) = \frac{\nabla_x p_D(x)}{p_D(x)} \quad \text{(where } p_D(x) > 0 \text{)} \]

Substitute:

\[ \int p_D(x) \cdot s_\theta(x)^\top \nabla_x \log p_D(x) \, dx = \int p_D(x) \cdot s_\theta(x)^\top \frac{\nabla_x p_D(x)}{p_D(x)} \, dx = \int s_\theta(x)^\top \nabla_x p_D(x) \, dx \]

Explanation: The \(p_D(x)\) cancels. Now we no longer have \(\log p_D\), only \(\nabla p_D\). Still not computable directly, but now integration by parts can help.


Step 4: Apply Integration by Parts (Component-wise)

Write the dot product as a sum over components:

\[ \int s_\theta(x)^\top \nabla_x p_D(x) \, dx = \sum_{i=1}^d \int s_{\theta,i}(x) \cdot \frac{\partial}{\partial x_i} p_D(x) \, dx \]

Explanation: \(a^\top b = \sum_i a_i b_i\).

Now apply 1D integration by parts in the \(x_i\) direction while holding other coordinates fixed:

\[ \int s_{\theta,i}(x) \cdot \frac{\partial}{\partial x_i} p_D(x) \, dx = \Big[s_{\theta,i}(x) \cdot p_D(x)\Big]_{\text{boundary}} - \int p_D(x) \cdot \frac{\partial}{\partial x_i} s_{\theta,i}(x) \, dx \]

Explanation: This is the usual \(\int u \, dv = uv - \int v \, du\), applied in the \(x_i\) coordinate with the other coordinates held fixed: \(u = s_{\theta,i}\), \(dv = \partial_i p_D \, dx_i\), so \(v = p_D\), \(du = \partial_i s_{\theta,i} \, dx_i\).

Under the boundary condition, the boundary term is zero:

\[ \Big[s_{\theta,i}(x) \cdot p_D(x)\Big]_{\text{boundary}} = 0 \]

So:

\[ \int s_{\theta,i}(x) \cdot \frac{\partial}{\partial x_i} p_D(x) \, dx = -\int p_D(x) \cdot \frac{\partial}{\partial x_i} s_{\theta,i}(x) \, dx \]

Explanation: This is where the "magic" happens: the derivative moves from \(p_D\) to \(s_\theta\).

Summing over \(i\):

\[ \int s_\theta(x)^\top \nabla_x p_D(x) \, dx = -\sum_{i=1}^d \int p_D(x) \cdot \frac{\partial}{\partial x_i} s_{\theta,i}(x) \, dx \]

Recognize divergence:

\[ -\sum_{i=1}^d \int p_D(x) \cdot \frac{\partial}{\partial x_i} s_{\theta,i}(x) \, dx = -\int p_D(x) \cdot (\nabla_x \cdot s_\theta(x)) \, dx \]

So the cross term becomes:

\[ \mathbb{E}_{p_D}\left[\langle s_\theta(x), s_D(x) \rangle\right] = -\mathbb{E}_{p_D}\left[\nabla_x \cdot s_\theta(x)\right] \]

Explanation: We have removed the unknown \(s_D\). Everything left involves \(s_\theta\) and its derivatives.
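The whole step can be checked numerically in one dimension. A minimal sketch, assuming a standard-normal \(p_D\) (whose derivative \(p_D'(x) = -x \, p_D(x)\) is available in closed form) and a bounded \(\tanh\) test score, both chosen purely for illustration:

```python
import jax.numpy as jnp

# Dense grid covering essentially all of p_D's mass
x = jnp.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]

p_D = jnp.exp(-0.5 * x**2) / jnp.sqrt(2.0 * jnp.pi)  # N(0, 1) density
dp = -x * p_D                                        # closed form for p_D'(x)
s = jnp.tanh(x)                                      # bounded test score s_theta
ds = 1.0 - jnp.tanh(x)**2                            # its derivative

lhs = jnp.sum(s * dp) * dx     # ∫ s_theta (d/dx p_D) dx  = E[s_theta * s_D]
rhs = -jnp.sum(p_D * ds) * dx  # -∫ p_D (d/dx s_theta) dx = -E[d/dx s_theta]
print(lhs, rhs)  # agree up to grid error; the boundary term vanishes
```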


Step 5: Substitute Back into the Objective

Recall:

\[ \mathcal{L}_{\text{ESM}}(\theta) = \mathbb{E}_{p_D}\left[\frac{1}{2}|s_\theta(x)|^2\right] - \mathbb{E}_{p_D}\left[\langle s_\theta(x), s_D(x) \rangle\right] + \text{const} \]

Substitute the identity we just proved:

\[ -\mathbb{E}_{p_D}\left[\langle s_\theta, s_D \rangle\right] = -\left(-\mathbb{E}_{p_D}[\nabla_x \cdot s_\theta]\right) = \mathbb{E}_{p_D}[\nabla_x \cdot s_\theta] \]

So:

\[ \boxed{\mathcal{L}_{\text{ESM}}(\theta) = \mathbb{E}_{p_D}\left[ \frac{1}{2}|s_\theta(x)|^2 + \nabla_x \cdot s_\theta(x) \right] + \text{const}} \]

Finally, use \(\nabla_x \cdot s_\theta(x) = \mathrm{tr}(J_x s_\theta(x))\), writing \(\mathcal{L}_{\text{SM}}\) for this tractable (implicit) form:

\[ \boxed{\mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p_D}\left[ \frac{1}{2}|s_\theta(x)|^2 + \mathrm{tr}(J_x s_\theta(x)) \right] + \text{const}} \]

Explanation: This is exactly the tractable form stated in the paper's score-matching section.
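The boxed objective translates directly into code. A minimal JAX sketch; `log_p_theta` is a hypothetical stand-in for any differentiable model log-density (for an EBM, \(-E_\theta(x)\) works, since the normalizer is constant in \(x\)):

```python
import jax
import jax.numpy as jnp

def sm_loss(log_p_theta, xs):
    """Monte Carlo estimate of E_pD[ 0.5 |s_theta(x)|^2 + tr(J_x s_theta(x)) ]."""
    score = jax.grad(log_p_theta)

    def per_sample(x):
        s = score(x)
        tr_J = jnp.trace(jax.jacfwd(score)(x))  # full d x d Jacobian
        return 0.5 * jnp.sum(s**2) + tr_J

    return jnp.mean(jax.vmap(per_sample)(xs))
```

Note that `jax.jacfwd` materializes the full \(d \times d\) Jacobian of the score (the Hessian of \(\log p_\theta\)), which is precisely the cost discussed in the last sections below.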


Intuition: What This Proof Achieves

  • The explicit objective wanted to match \(s_\theta\) to \(s_D\), but \(s_D\) is unknown
  • The trick converts the unknown cross term into the divergence of the model score, which is computable (if you can differentiate your model w.r.t. \(x\))
  • The remaining constant is \(\frac{1}{2}\mathbb{E}_{p_D}|s_D|^2\), which doesn't matter for optimization over \(\theta\) (checked numerically in the sketch below)
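Both points can be verified by Monte Carlo in the synthetic Gaussian setting used in the earlier sketches, where \(s_D\) is known; the names below are hypothetical, and the gap between the two objectives should match \(\frac{1}{2}\mathbb{E}_{p_D}|s_D|^2\) for every \(\theta\), up to sampling error:

```python
import jax
import jax.numpy as jnp

d = 3
xs = jax.random.normal(jax.random.PRNGKey(0), (10_000, d))  # p_D = N(0, I)
s_D = lambda x: -x                                          # known data score

def log_p_theta(x, log_sigma):
    # N(0, sigma^2 I), unnormalized: the normalizer drops out of grad_x
    return -0.5 * jnp.sum(x**2) * jnp.exp(-2.0 * log_sigma)

def esm(log_sigma):
    s = jax.vmap(jax.grad(log_p_theta), in_axes=(0, None))(xs, log_sigma)
    return 0.5 * jnp.mean(jnp.sum((s - jax.vmap(s_D)(xs))**2, axis=-1))

def sm(log_sigma):
    score = lambda x: jax.grad(log_p_theta)(x, log_sigma)
    per = lambda x: 0.5 * jnp.sum(score(x)**2) + jnp.trace(jax.jacfwd(score)(x))
    return jnp.mean(jax.vmap(per)(xs))

const = 0.5 * jnp.mean(jnp.sum(jax.vmap(s_D)(xs)**2, axis=-1))  # (1/2) E|s_D|^2
for ls in (-0.5, 0.0, 0.7):
    print(float(esm(ls) - sm(ls)), float(const))  # gap ≈ const for every theta
```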

Why This Still Gets Expensive

That \(\mathrm{tr}(J_x s_\theta(x))\) term requires computing the trace of a Jacobian (often involving second derivatives of the energy). In high dimensions, this is costly—hence:

  • Denoising Score Matching (DSM) — avoids the trace term by adding noise
  • Sliced Score Matching (SSM) — uses random projections to approximate the trace (see the sketch below)
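For intuition on how the trace gets cheaper, here is a minimal sketch (assuming JAX) of the Hutchinson-style estimator that sliced variants build on: for random \(v\) with \(\mathbb{E}[vv^\top] = I\), we have \(\mathbb{E}[v^\top J v] = \mathrm{tr}(J)\), and each probe needs one Jacobian-vector product instead of the full Jacobian.

```python
import jax
import jax.numpy as jnp

def trace_estimate(score, x, key, n_probes=64):
    """Hutchinson estimate of tr(J_x score(x)) from JVPs alone."""
    vs = jax.random.rademacher(key, (n_probes, x.shape[0]), dtype=x.dtype)

    def quad(v):
        _, jv = jax.jvp(score, (x,), (v,))  # one JVP: J @ v
        return v @ jv                       # v^T J v

    return jnp.mean(jax.vmap(quad)(vs))

# Toy check: score(x) = -x has Jacobian -I, so the trace is -d
score = jax.grad(lambda x: -0.5 * jnp.sum(x**2))
print(trace_estimate(score, jnp.ones(1000), jax.random.PRNGKey(0)))  # -1000.0
```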

Connection to Fisher Score Matching

The same integration-by-parts trick applies in parameter space:

| Aspect | Data-Space Score Matching | Parameter-Space (Fisher) Score Matching |
| --- | --- | --- |
| Eliminates | \(\nabla_x \log p_D(x)\) | \(\nabla_\theta \log p(x \mid \theta)\) |
| Replaces with | \(\nabla_x \cdot s_\theta(x)\) | \(\nabla_\theta \log q(\theta \mid \theta_t)\) |

See Fisher Score Matching for the parameter-space analogue used in simulation-based inference.