Score Matching Objective: The Integration-by-Parts Derivation¶
This document proves the classic identity behind (explicit) score matching—the "integration-by-parts trick" that makes score matching usable without knowing \(p_D\). This derivation shows exactly where the unknown data distribution terms drop out.
The key identity:

\[
\mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta(x) - \nabla_x \log p_D(x)\|^2\right]
= \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta(x)\|^2 + \nabla_x \cdot s_\theta(x)\right] + \text{const.}
\]

The left side requires the unknown \(\nabla_x \log p_D(x)\); the right side doesn't.
Step 0: Notation and Assumptions¶
Variables and distributions¶
- \(x = (x_1, \dots, x_d) \in \mathbb{R}^d\) — data vector
- \(p_D(x)\) — true (unknown) data density; we can sample from it
- \(p_\theta(x)\) — model density (e.g., an EBM); \(\theta\) are model parameters
Score functions¶
- Model score:
  \[
  s_\theta(x) = \nabla_x \log p_\theta(x)
  \]
Component form: \(s_{\theta,i}(x) = \frac{\partial}{\partial x_i} \log p_\theta(x)\)
- Data score:
  \[
  s_D(x) = \nabla_x \log p_D(x),
  \]
  which we cannot compute directly because \(p_D\) is unknown.
Differential operators¶
- Gradient w.r.t. \(x\): \(\nabla_x\)
- Jacobian of a vector field \(s_\theta(x)\):
  \[
  \big(J_x s_\theta(x)\big)_{ij} = \frac{\partial s_{\theta,i}(x)}{\partial x_j}
  \]
- Divergence (a scalar):
  \[
  \nabla_x \cdot s_\theta(x) = \sum_{i=1}^d \frac{\partial s_{\theta,i}(x)}{\partial x_i} = \mathrm{tr}\big(J_x s_\theta(x)\big)
  \]
Boundary condition¶
We assume the boundary term vanishes. For \(\mathcal{X} = \mathbb{R}^d\):

\[
p_D(x)\, s_\theta(x) \to 0 \quad \text{as } \|x\| \to \infty.
\]
This is the standard regularity condition used in score matching proofs.
Why we need it: Integration by parts produces a boundary term; we want it to be zero.
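This condition is easy to sanity-check in concrete cases. A minimal SymPy sketch for an illustrative 1D setting (the Gaussian \(p_D\) and the polynomial score here are example choices, not from the derivation itself): the Gaussian tail beats any polynomial growth, so the boundary product vanishes.

```python
import sympy as sp

x = sp.symbols("x", real=True)

# Illustrative 1D case: standard normal data density, polynomial model score
p_D = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)
s_theta = x**3 - 2 * x  # any polynomial score works: the Gaussian tail dominates

# The boundary term p_D(x) * s_theta(x) must vanish as |x| -> infinity
print(sp.limit(p_D * s_theta, x, sp.oo))   # 0
print(sp.limit(p_D * s_theta, x, -sp.oo))  # 0
```

A heavy-tailed \(p_D\) (or a model score growing faster than the density decays) would break this limit, and with it the proof below.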
Step 1: Start from the Explicit Score Matching Objective¶
Define the explicit score matching (ESM) objective:

\[
J_{\mathrm{ESM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{p_D}\!\left[\|s_\theta(x) - s_D(x)\|^2\right]
\]
Explanation: We want the model score \(s_\theta\) to match the true score \(s_D\), because matching scores identifies the density (up to a constant).
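To see why matching scores pins down the density: the score determines \(\log p\) up to an additive constant, and normalization fixes that constant. A small SymPy illustration (the score \(-x\) is an example choice):

```python
import sympy as sp

x = sp.symbols("x", real=True)

# Suppose we only know the score s(x) = d/dx log p(x) = -x
score = -x

# Antidifferentiate to recover log p up to an additive constant
log_p = sp.integrate(score, x)  # -x**2/2

# Exponentiating and normalizing fixes the constant: we recover N(0, 1)
Z = sp.integrate(sp.exp(log_p), (x, -sp.oo, sp.oo))
p = sp.exp(log_p) / Z
print(sp.simplify(p - sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)))  # 0
```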
Step 2: Expand the Squared Norm¶
\[
\tfrac{1}{2}\,\|s_\theta - s_D\|^2 = \tfrac{1}{2}\,\|s_\theta\|^2 - \langle s_\theta, s_D\rangle + \tfrac{1}{2}\,\|s_D\|^2
\]

Explanation: This is just \(\|a-b\|^2 = \|a\|^2 - 2a^\top b + \|b\|^2\), with the factor \(\frac{1}{2}\) absorbing the 2.
Taking the expectation under \(p_D\):

\[
\mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta - s_D\|^2\right]
= \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta\|^2\right]
- \mathbb{E}_{p_D}\!\left[\langle s_\theta, s_D\rangle\right]
+ \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_D\|^2\right]
\]

Explanation: Linearity of expectation.
Now note: the last term does not depend on \(\theta\), because \(s_D\) is fixed by the data distribution.
So:

\[
J_{\mathrm{ESM}}(\theta) = \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta\|^2 - \langle s_\theta, s_D\rangle\right] + \text{const.}
\]
Explanation: When optimizing over \(\theta\), constants can be ignored.
The real enemy is the cross term \(\mathbb{E}_{p_D}[\langle s_\theta, s_D \rangle]\), because it contains \(s_D = \nabla \log p_D\), which we can't compute.
Step 3: Rewrite the Cross Term¶
Write the expectation as an integral:

\[
\mathbb{E}_{p_D}\!\left[\langle s_\theta, s_D\rangle\right]
= \int \big\langle s_\theta(x),\, \nabla_x \log p_D(x) \big\rangle\, p_D(x)\, dx
\]
Explanation: By definition, \(\mathbb{E}_{p_D}[f(x)] = \int f(x) p_D(x) \, dx\).
Now use the identity:

\[
\nabla_x \log p_D(x) = \frac{\nabla_x p_D(x)}{p_D(x)}
\]

Substitute:

\[
\mathbb{E}_{p_D}\!\left[\langle s_\theta, s_D\rangle\right]
= \int \big\langle s_\theta(x),\, \nabla_x p_D(x) \big\rangle\, dx
\]
Explanation: The \(p_D(x)\) cancels. Now we no longer have \(\log p_D\), only \(\nabla p_D\). Still not computable directly, but now integration by parts can help.
Step 4: Apply Integration by Parts (Component-wise)¶
Write the dot product as a sum over components:

\[
\int \big\langle s_\theta(x),\, \nabla_x p_D(x) \big\rangle\, dx
= \sum_{i=1}^d \int s_{\theta,i}(x)\, \frac{\partial p_D(x)}{\partial x_i}\, dx
\]
Explanation: \(a^\top b = \sum_i a_i b_i\).
Now apply 1D integration by parts in the \(x_i\) direction while holding the other coordinates fixed:

\[
\int_{-\infty}^{\infty} s_{\theta,i}(x)\, \frac{\partial p_D(x)}{\partial x_i}\, dx_i
= \Big[\, s_{\theta,i}(x)\, p_D(x) \,\Big]_{x_i=-\infty}^{x_i=\infty}
- \int_{-\infty}^{\infty} p_D(x)\, \frac{\partial s_{\theta,i}(x)}{\partial x_i}\, dx_i
\]

Explanation: This is the multivariate version of \(\int u \, dv = uv - \int v \, du\), with \(u = s_{\theta,i}\) and \(dv = \partial_i p_D \, dx_i\), so that \(v = p_D\) and \(du = \partial_i s_{\theta,i} \, dx_i\).
Under the boundary condition, the boundary term is zero:

\[
\Big[\, s_{\theta,i}(x)\, p_D(x) \,\Big]_{x_i=-\infty}^{x_i=\infty} = 0
\]
So:

\[
\int s_{\theta,i}(x)\, \frac{\partial p_D(x)}{\partial x_i}\, dx
= -\int p_D(x)\, \frac{\partial s_{\theta,i}(x)}{\partial x_i}\, dx
\]
Explanation: This is where the "magic" happens: the derivative moves from \(p_D\) to \(s_\theta\).
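This one-dimensional step can be verified symbolically. A sketch with SymPy, using a standard normal \(p_D\) and an illustrative model score \(s_\theta(x) = -x/2\) (both are example choices):

```python
import sympy as sp

x = sp.symbols("x", real=True)

p_D = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)  # standard normal density
s_theta = -x / 2                              # illustrative model score

# Left side: integral of s_theta * dp_D/dx
lhs = sp.integrate(s_theta * sp.diff(p_D, x), (x, -sp.oo, sp.oo))

# Right side: -integral of p_D * ds_theta/dx (derivative moved onto s_theta)
rhs = -sp.integrate(p_D * sp.diff(s_theta, x), (x, -sp.oo, sp.oo))

print(lhs, rhs)  # both equal 1/2
assert sp.simplify(lhs - rhs) == 0
```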
Summing over \(i\):

\[
\int \big\langle s_\theta(x),\, \nabla_x p_D(x) \big\rangle\, dx
= -\int p_D(x) \sum_{i=1}^d \frac{\partial s_{\theta,i}(x)}{\partial x_i}\, dx
\]
Recognize the divergence:

\[
\sum_{i=1}^d \frac{\partial s_{\theta,i}(x)}{\partial x_i} = \nabla_x \cdot s_\theta(x)
\]
So the cross term becomes:

\[
\mathbb{E}_{p_D}\!\left[\langle s_\theta, s_D\rangle\right]
= -\,\mathbb{E}_{p_D}\!\left[\nabla_x \cdot s_\theta(x)\right]
\]
Explanation: We have removed the unknown \(s_D\). Everything left involves \(s_\theta\) and its derivatives.
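The identity also checks out numerically by Monte Carlo. A NumPy sketch with synthetic data from a standard normal (so \(s_D(x) = -x\) is known here purely to build the check) and an illustrative model score \(s_\theta(x) = -x/2\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # samples from p_D = N(0, 1)

s_D = -x             # data score of N(0,1): known here only for the check
s_theta = -x / 2     # illustrative model score
div_s_theta = np.full_like(x, -0.5)  # d/dx (-x/2) = -1/2

cross_term = np.mean(s_theta * s_D)     # E[<s_theta, s_D>]
neg_divergence = -np.mean(div_s_theta)  # -E[div s_theta]

print(cross_term, neg_divergence)  # both close to 0.5
assert abs(cross_term - neg_divergence) < 1e-2
```

Note that the left side needed \(s_D\), while the right side only touches the model: that is exactly the point of the derivation.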
Step 5: Substitute Back into the Objective¶
Recall:

\[
J_{\mathrm{ESM}}(\theta) = \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta\|^2 - \langle s_\theta, s_D\rangle\right] + \text{const.}
\]

Substitute the identity we just proved:

\[
-\,\mathbb{E}_{p_D}\!\left[\langle s_\theta, s_D\rangle\right] = \mathbb{E}_{p_D}\!\left[\nabla_x \cdot s_\theta(x)\right]
\]

So:

\[
J_{\mathrm{ESM}}(\theta) = \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta(x)\|^2 + \nabla_x \cdot s_\theta(x)\right] + \text{const.}
\]
Finally, use \(\nabla_x \cdot s_\theta(x) = \mathrm{tr}(J_x s_\theta(x))\):

\[
J_{\mathrm{SM}}(\theta) = \mathbb{E}_{p_D}\!\left[\tfrac{1}{2}\,\|s_\theta(x)\|^2 + \mathrm{tr}\big(J_x s_\theta(x)\big)\right] + \text{const.}
\]
Explanation: This is exactly the tractable form stated in the paper's score-matching section.
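As a concrete check that the tractable form behaves like a proper fitting objective, here is a sketch for a 1D Gaussian model score \(s_\theta(x) = -(x-\mu)/\sigma^2\) (an example model; its divergence is the closed-form \(-1/\sigma^2\)) evaluated on standard normal samples. The objective is lowest at the true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # samples from p_D = N(0, 1)

def j_sm(mu, sigma):
    """Tractable objective E[0.5 * s^2 + ds/dx] for the Gaussian
    model score s(x) = -(x - mu) / sigma**2."""
    s = -(x - mu) / sigma**2
    ds_dx = -1.0 / sigma**2  # divergence (in 1D: a scalar derivative)
    return np.mean(0.5 * s**2 + ds_dx)

# Lowest at the true parameters (mu=0, sigma=1), where the value is
# E[x^2]/2 - 1, i.e. close to -0.5
print(j_sm(0.0, 1.0), j_sm(1.0, 1.0), j_sm(0.0, 2.0))
assert j_sm(0.0, 1.0) < j_sm(1.0, 1.0)
assert j_sm(0.0, 1.0) < j_sm(0.0, 2.0)
```

No evaluation of \(p_D\) or its score appears anywhere: only samples and derivatives of the model.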
Intuition: What This Proof Achieves¶
- The explicit objective wanted to match \(s_\theta\) to \(s_D\), but \(s_D\) is unknown
- The trick converts the unknown cross term into the divergence of the model score, which is computable (if you can differentiate your model w.r.t. \(x\))
- The remaining constant is \(\frac{1}{2}\mathbb{E}_{p_D}\|s_D\|^2\), which doesn't matter for optimization over \(\theta\)
Why This Still Gets Expensive¶
That \(\mathrm{tr}(J_x s_\theta(x))\) term requires computing the trace of a Jacobian (often involving second derivatives of the energy). In high dimensions, this is costly—hence:
- Denoising Score Matching (DSM) — avoids the trace term by adding noise
- Sliced Score Matching (SSM) — uses random projections to approximate the trace
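The core trick behind SSM-style estimators is Hutchinson's identity, \(\mathrm{tr}(J) = \mathbb{E}_v[v^\top J v]\) for random \(v\) with \(\mathbb{E}[v v^\top] = I\), which replaces the full Jacobian with cheap Jacobian-vector products. A minimal NumPy sketch on a linear vector field \(s(x) = Ax\) (an example choice, so \(J = A\) and the exact trace is available for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))  # Jacobian of the linear field s(x) = A x

def hutchinson_trace(num_probes=100_000):
    """Estimate tr(A) as the average of v^T A v over Rademacher probes v."""
    v = rng.choice([-1.0, 1.0], size=(num_probes, d))
    # For a score network, each v^T A v would be one Jacobian-vector
    # product obtained by autodiff instead of an explicit matrix
    return np.mean(np.einsum("ki,ij,kj->k", v, A, v))

print(np.trace(A), hutchinson_trace())  # estimate close to the exact trace
```

In practice SSM uses far fewer probes (often one per data point) and trades a bit of variance for a large drop in compute.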
Connection to Fisher Score Matching¶
The same integration-by-parts trick applies in parameter space:
| Aspect | Data-Space Score Matching | Parameter-Space (Fisher) Score Matching |
|---|---|---|
| Eliminates | \(\nabla_x \log p_D(x)\) | \(\nabla_\theta \log p(x\|\theta)\) |
| Replaces with | \(\nabla_x \cdot s_\theta(x)\) | \(\nabla_\theta \log q(\theta\|\theta_t)\) |
See Fisher Score Matching for the parameter-space analogue used in simulation-based inference.