# Fisher Score Matching for Likelihood-Free Inference
This document is a tutorial-style walkthrough of the key ideas from the paper "Direct Fisher Score Estimation for Likelihood Maximization" (Khoo et al., 2025), explaining how score matching ideas extend to parameter-space gradients for simulation-based inference.
## The Problem: Implicit Simulators
Many scientific models (biology, physics, cosmology, neuroscience, etc.) are implicit simulators:
- You can simulate data \(x \sim p(x|\theta)\)
- But you cannot evaluate the likelihood \(p(x|\theta)\) or its gradient
This setting, where you can only simulate from the model, is called Simulation-Based Inference (SBI).
If you want to do maximum likelihood estimation (MLE), you need the Fisher score, the gradient of the log-likelihood with respect to the parameters:

\[
\nabla_\theta \log p(x|\theta)
\]

But this derivative is unavailable because the likelihood itself is unknown.
## Main Idea: Local Fisher Score Matching
The authors propose Direct Fisher Score Estimation via a new method called Local Fisher Score Matching (FSM).
FSM directly estimates the Fisher score using only:
- Samples from a local region around the current parameter \(\theta_t\)
- A simple linear regression model
No likelihoods, no densities, no gradients of the simulator are needed.
This enables a gradient-based MLE method in fully likelihood-free settings.
The method works sequentially:
- At parameter iterate \(\theta_t\), draw nearby parameters from a Gaussian
- Simulate data at those parameters
- Fit a local surrogate model \(S_W(x) \approx \nabla_\theta \log p(x|\theta_t)\)
- Use this surrogate to take a gradient step in \(\theta\)
## Why Score Matching?
Score matching is a classical method for training energy-based models when the normalizing constant is intractable.
The ordinary (explicit) score matching objective matches the model's data-space score to the data distribution's score:

\[
J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[\big\|\nabla_x \log p_\theta(x) - \nabla_x \log p_{\mathrm{data}}(x)\big\|^2\right]
\]

But FSM adapts the idea in a novel way:
| Approach | Differentiate w.r.t. | Estimates |
|---|---|---|
| Original score matching | Data \(x\) | Stein score \(\nabla_x \log p_\theta(x)\) |
| Fisher score matching | Parameters \(\theta\) | Fisher score \(\nabla_\theta \log p_\theta(x)\) |
This is non-standard and the main conceptual innovation of the paper.
## Background: Score Matching (Section 2.1)
Score matching solves density estimation without computing the normalizing constant \(Z_\theta\).
### Energy-Based Model (EBM)

An EBM defines a density through an energy function \(E_\theta\):

\[
p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}, \qquad Z_\theta = \int \exp(-E_\theta(x))\,dx
\]

The score is:

\[
\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)
\]

Note that the intractable \(Z_\theta\) drops out of the score, because it does not depend on \(x\).

### The explicit score matching loss

The explicit loss is the objective shown above: match the model score to the data score. Hyvärinen's integration-by-parts identity turns it into an equivalent form that involves only the model:

\[
J(\theta) = \mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[\operatorname{tr}\big(\nabla_x^2 \log p_\theta(x)\big) + \tfrac{1}{2}\big\|\nabla_x \log p_\theta(x)\big\|^2\right] + \text{const}
\]

No need to compute \(Z_\theta\)! But it still requires computing Jacobians/Hessians, which may be expensive.
This background is crucial because FSM uses a parameter-space analogue of this trick.
## How FSM Modifies Score Matching
FSM wants to estimate the Fisher score at the current iterate,

\[
\nabla_\theta \log p(x|\theta)\big|_{\theta = \theta_t},
\]

but this quantity is unknown.
So the authors define a joint distribution over data and parameters:
- Sample parameters locally: \(\theta \sim q(\theta|\theta_t) = \mathcal{N}(\theta_t, \sigma^2 I)\)
- For each sampled \(\theta\), simulate data: \(x \sim p(x|\theta)\)
This gives a simple joint distribution:

\[
q(x, \theta) = q(\theta|\theta_t)\, p(x|\theta)
\]
Then define the local score-matching objective: fit the surrogate to the true Fisher score, in expectation over this joint distribution,

\[
J(W) = \tfrac{1}{2}\,\mathbb{E}_{q(x,\theta)}\!\left[\big\|S_W(x) - \nabla_\theta \log p(x|\theta)\big\|^2\right] \tag{1}
\]
Here \(S_W(x)\) is a surrogate model for the Fisher score.
But \(\nabla_\theta \log p(x|\theta)\) is unknown! So how do we optimize (1)?
## Trick #1: Integration by Parts Removes the Likelihood Term
(Section 3.1, Theorem 3.1)
Through an integration-by-parts identity very similar to the one in original score matching, the intractable term disappears. Up to an additive constant that does not depend on \(W\), objective (1) equals

\[
\tilde{J}(W) = \mathbb{E}_{q(x,\theta)}\!\left[\tfrac{1}{2}\big\|S_W(x)\big\|^2 + S_W(x)^\top \nabla_\theta \log q(\theta|\theta_t)\right]
\]

This is remarkable:

- The likelihood gradient vanishes entirely
- Only the proposal distribution's gradient remains, and for the Gaussian proposal it is available in closed form:

\[
\nabla_\theta \log q(\theta|\theta_t) = -\frac{\theta - \theta_t}{\sigma^2}
\]
Thus the entire objective is computable by simulation.
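In Monte Carlo form, minimizing this objective over the sampled pairs is just a least-squares regression of the Gaussian proposal's score onto the simulated data. The following is a sketch of that reduction; the index notation (\(N\) sampled parameters \(\theta_j\), \(M\) simulations \(x_{j,k}\) each) is introduced here for illustration and is not taken from the paper:

\[
\hat{J}(W) = \frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M}
\left[\tfrac{1}{2}\big\|S_W(x_{j,k})\big\|^2 - S_W(x_{j,k})^\top \frac{\theta_j - \theta_t}{\sigma^2}\right],
\]

which differs from the regression loss

\[
\frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M}
\tfrac{1}{2}\left\|S_W(x_{j,k}) - \frac{\theta_j - \theta_t}{\sigma^2}\right\|^2
\]

only by a term that does not depend on \(W\). This is why a simple regression model for \(S_W\) suffices.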
## What Is the Optimal Solution of FSM?
(Theorem 3.2)
The minimizer of (1) is the Bayes estimator of the Fisher score under the local posterior \(q(\theta|x) \propto q(\theta|\theta_t)\,p(x|\theta)\):

\[
S^\star(x) = \mathbb{E}_{q(\theta|x)}\!\left[\nabla_\theta \log p(x|\theta)\right]
\]
Intuition:
- You can't estimate the true score at a single point \(\theta_t\) because you never see data exactly at that point
- So FSM estimates a locally smoothed version of the Fisher score
This becomes important in Section 5.
## Trick #2: FSM = Gradient of a Gaussian-Smoothed Likelihood
(Section 5.1, Theorem 5.1)
Define the Gaussian-smoothed likelihood obtained by averaging the likelihood over the proposal:

\[
p_\sigma(x|\theta_t) = \mathbb{E}_{\theta \sim \mathcal{N}(\theta_t, \sigma^2 I)}\!\left[p(x|\theta)\right]
\]

Then the optimal FSM surrogate is exactly the gradient of the log of this smoothed likelihood:

\[
S^\star(x) = \nabla_{\theta_t} \log p_\sigma(x|\theta_t)
\]
Hence the algorithm is performing:
Gradient ascent on a locally smoothed log-likelihood (equivalently, gradient descent on its negative), not on the raw likelihood.
This explains:
- Robustness to non-smooth likelihoods
- Ability to escape flat regions
- Improved stability vs. finite-difference estimators
## Practical Parameterization: Linear Score Model
The authors choose a simple linear model for the surrogate \(S_W(x)\), so the FSM objective becomes an ordinary least-squares problem with a closed-form linear regression solution.

This converts FSM into an extremely efficient method: fitting the surrogate at each iteration costs no more than solving a small least-squares system.
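As a concrete sketch of why the solution is closed form, suppose the surrogate is linear in an augmented data vector, \(S_W(x) = W\tilde{x}\) with \(\tilde{x} = (x^\top, 1)^\top\); this particular feature map is an illustrative assumption, not necessarily the paper's exact choice. Stacking the simulated \(\tilde{x}_{j,k}\) as the rows of a design matrix \(\Phi\) and the regression targets \((\theta_j - \theta_t)/\sigma^2\) as the rows of \(Y\), the FSM regression above becomes ordinary least squares:

\[
\hat{W}^\top = (\Phi^\top \Phi)^{-1} \Phi^\top Y
\]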
## Full FSM-MLE Algorithm
At iteration \(t\):

- Sample parameters: \(\theta_j \sim \mathcal{N}(\theta_t, \sigma^2 I)\)
- Simulate data: \(x_{j,k} \sim p(x|\theta_j)\)
- Fit the linear model \(S_{\hat{W}}(x)\) by solving the FSM least-squares problem
- Estimate the gradient of the smoothed log-likelihood by evaluating the surrogate at the observed data, e.g. \(\hat{g}_t = \frac{1}{n}\sum_{i=1}^{n} S_{\hat{W}}(x_i^{\mathrm{obs}})\) for observations \(x_1^{\mathrm{obs}}, \dots, x_n^{\mathrm{obs}}\)
- Update parameters using SGD or Adam
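Below is a minimal NumPy sketch of one such iteration, using the linear-in-\(x\) surrogate assumed above. The function name `fsm_mle_step`, the simulator interface `simulate(theta) -> x`, and all hyperparameter defaults are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fsm_mle_step(theta_t, simulate, x_obs, sigma=0.1, n_thetas=64, n_sims=1, lr=0.05):
    """One FSM-MLE iteration with a linear Fisher-score surrogate.

    theta_t : (d_theta,) current parameter iterate
    simulate: callable mapping a parameter vector to one simulated data vector
    x_obs   : (n_obs, d_x) observed data
    """
    d_theta = theta_t.shape[0]

    # 1. Sample parameters locally around the current iterate.
    thetas = theta_t + sigma * np.random.randn(n_thetas, d_theta)

    # 2. Simulate data at each sampled parameter; the regression target for
    #    each simulation is the Gaussian proposal score (theta - theta_t) / sigma^2.
    xs, targets = [], []
    for theta in thetas:
        for _ in range(n_sims):
            xs.append(simulate(theta))
            targets.append((theta - theta_t) / sigma**2)
    X = np.asarray(xs)        # (n_thetas * n_sims, d_x)
    Y = np.asarray(targets)   # (n_thetas * n_sims, d_theta)

    # 3. Fit the linear surrogate S_W(x) = W @ [x, 1] by ordinary least squares.
    Phi = np.hstack([X, np.ones((X.shape[0], 1))])
    W_T, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # shape (d_x + 1, d_theta)

    # 4. The surrogate evaluated at the observed data approximates the gradient
    #    of the smoothed log-likelihood; average over observations.
    Phi_obs = np.hstack([x_obs, np.ones((x_obs.shape[0], 1))])
    grad = (Phi_obs @ W_T).mean(axis=0)

    # 5. Gradient-ascent step on the smoothed log-likelihood.
    return theta_t + lr * grad
```

In a full run this step would be iterated, optionally shrinking \(\sigma\) and the step size over time, which connects to the bias–variance discussion below.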
## Why This Works (Intuition)
FSM pulls off something surprising:
- It never evaluates \(p(x|\theta)\)
- It never computes \(\nabla_\theta \log p(x|\theta)\)
- It never estimates likelihoods like KDE-based methods do
Yet it performs a gradient-based maximum likelihood optimization.
The key is the joint sampling over \((x, \theta)\) and the score-matching (integration-by-parts) identity that replaces the intractable term with the gradient of a simple Gaussian proposal.
## Understanding the Bias / Smoothing Effect (Section 5.2)
If \(\sigma\) (local proposal width) is too small:
- Variance explodes (like finite differences)
- Estimator becomes unstable
If \(\sigma\) is too large:
- Bias grows (you oversmooth the likelihood)
Theorem 5.2 bounds the bias introduced by the smoothing as a function of \(\sigma\), formalizing this bias–variance tradeoff.
## Summary of Core Contributions
- Novel local Fisher score matching objective
- Likelihood-free derivation using integration-by-parts
- Closed-form linear surrogate model
- Equivalence to Gaussian smoothing gradient estimator
- Strong theoretical properties
    - Bias bounds
    - Convergence via averaged SGD
    - Asymptotic normality of estimator
- Superior empirical performance
    - Over KDE + SPSA
    - Over Neural Likelihood Estimators
    - In high-dimensional SBI tasks
## Connection to EBMs
Fisher score matching connects directly to the EBM training problem:
- EBM challenge: The MLE gradient requires \(\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]\) (see EBM MLE Gradient Derivation)
- FSM solution: Bypass the likelihood entirely by estimating the Fisher score directly from simulations
Both approaches share the core insight: use integration-by-parts to eliminate intractable terms.
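For reference, the EBM log-likelihood gradient behind this comparison is the standard identity (a well-known result, not specific to this paper):

\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right]
\]

The second term is the intractable expectation referenced above; FSM sidesteps it by never forming the likelihood at all.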
## References
- Khoo et al. (2025). Direct Fisher Score Estimation for Likelihood Maximization
- Hyvärinen (2005). Estimation of Non-Normalized Statistical Models by Score Matching