Score Matching Documentation¶

This folder contains documentation on score matching techniques—methods for training models with intractable normalizing constants by matching the gradient of the log-density (the "score") rather than the density itself.

Reading Order¶

Foundations¶

Score Matching: The Core Objective
Explains the fundamental score matching objective for training energy-based models. Covers why the score function bypasses the partition function \(Z_\theta\), the explicit vs. tractable objectives, and the integration-by-parts trick.
Fisher Score Matching for Likelihood-Free Inference
Tutorial walkthrough of "Direct Fisher Score Estimation for Likelihood Maximization" (Khoo et al., 2025). Extends score matching from data-space gradients to parameter-space gradients for simulation-based inference.

Coming Soon¶

Denoising Score Matching — Practical variant using noisy data
Sliced Score Matching — Scalable approximation for high dimensions
Connection to Diffusion Models — How score matching underlies modern diffusion models

Practical Considerations: ESM vs DSM¶

When implementing score matching for real applications (e.g., modeling gene expression data), you have two main options:

Method	Objective	When to Use
Explicit SM (ESM)	Squared norm + trace of Jacobian	Low-dimensional data; need exact objective
Denoising SM (DSM)	Squared error to noise gradient	High-dimensional data; practical default

Why DSM is often preferred:

ESM requires computing \(\mathrm{tr}(\nabla_x s_\theta)\), which costs \(O(d)\) backprop passes (or Hutchinson estimation)
DSM only needs forward passes through the score network
With Gaussian noise \(\tilde{x} = x + \sigma\epsilon\), the target is analytic: \(\nabla_{\tilde{x}} \log p(\tilde{x}|x) = -(\tilde{x} - x)/\sigma^2\)

Both learn the same thing: the Stein score \(\nabla_x \log p(x)\), just with different computational trade-offs.

See Roadmap Stage 5 for implementation milestones.

Key Concepts¶

Concept	Symbol	Description
Stein score	\(s(x)\)	Gradient of log-density w.r.t. data
Fisher score	\(g(\theta)\)	Gradient of log-density w.r.t. parameters
Energy function	\(E_\theta(x)\)	Defines density via \(p_\theta(x) \propto \exp(-E_\theta(x))\)
Partition function	\(Z_\theta\)	Intractable normalizing constant

Two Flavors of Score Matching¶

Method	Estimates	Use Case
Original Score Matching	\(\nabla_x \log p_\theta(x)\)	Training EBMs, diffusion models
Fisher Score Matching	Fisher score \(\nabla_\theta \log p\)	Simulation-based inference, likelihood-free MLE

Both use integration-by-parts to eliminate intractable terms.

Connection to Other Topics¶

EBMs: See ../EBM/ — Score matching is the primary training method for EBMs
VAEs: See ../VAE/ — VAEs avoid the partition function via tractable encoder/decoder
Diffusion Models: Build on denoising score matching across noise levels

References¶

Hyvärinen (2005). Estimation of Non-Normalized Statistical Models by Score Matching
Vincent (2011). A Connection Between Score Matching and Denoising Autoencoders
Khoo et al. (2025). Direct Fisher Score Estimation for Likelihood Maximization