Stein Score vs Fisher Score: Two Flavors of "Score"¶
This document clarifies the distinction between the Stein score (gradient w.r.t. data) and the Fisher score (gradient w.r.t. parameters)—two different objects that both go by "score" in the literature.
The Two Scores at a Glance¶
| Name | Symbol | Gradient w.r.t. | What it measures |
|---|---|---|---|
| Stein score | \(s(x) = \nabla_x \log p(x)\) | Data \(x\) | Direction of steepest increase in log-density |
| Fisher score | \(g(\theta) = \nabla_\theta \log p(x \| \theta)\) | Parameters \(\theta\) | Sensitivity of log-likelihood to parameters |
Both are gradients of a log-probability, but they differentiate with respect to different variables.
Stein Score: \(\nabla_x \log p(x)\)¶
Stein score definition¶
For a density \(p(x)\) over data \(x \in \mathbb{R}^d\):

\[
s(x) = \nabla_x \log p(x)
\]
This is a vector field over data space—at each point \(x\), it points in the direction where the density increases most rapidly.
Stein score intuition¶
- High-density regions: The score points "inward" toward the mode
- Low-density regions: The score points toward higher-density areas
- At the mode: The score is zero (gradient of log-density vanishes at maximum)
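These three behaviors are easiest to see in a 1D Gaussian, where the Stein score has the closed form \((\mu - x)/\sigma^2\). A minimal sketch (the function name is ours, purely illustrative):

```python
def stein_score_gaussian(x, mu=0.0, sigma=1.0):
    """Stein score of N(mu, sigma^2): d/dx log p(x) = (mu - x) / sigma^2."""
    return (mu - x) / sigma ** 2

# Zero at the mode; elsewhere it points back toward mu.
at_mode = stein_score_gaussian(0.0)        # 0.0
right_of_mode = stein_score_gaussian(2.0)  # -2.0 (points left, toward the mode)
left_of_mode = stein_score_gaussian(-1.0)  # 1.0 (points right, toward the mode)
```

The sign of the score always opposes the displacement from the mode, which is exactly the "points inward" behavior described above.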
Why the Stein score is useful¶
The Stein score is central to:
- Training EBMs: For \(p_\theta(x) = \exp(-E_\theta(x))/Z_\theta\), the score is \(s_\theta(x) = -\nabla_x E_\theta(x)\), which doesn't depend on \(Z_\theta\)
- Diffusion models: Learn \(\nabla_x \log p_t(x)\) at multiple noise levels
- Langevin dynamics: Sample from \(p(x)\) using \(x_{t+1} = x_t + \epsilon \nabla_x \log p(x_t) + \sqrt{2\epsilon}\, z\), with \(z \sim \mathcal{N}(0, I)\)
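The Langevin update above only needs the score, never the density itself. A minimal sketch that samples from \(\mathcal{N}(0, 1)\), whose Stein score is \(s(x) = -x\) (all names here are illustrative, not from any library):

```python
import math
import random

def langevin_sample(score, x0=3.0, eps=0.01, steps=5000, seed=0):
    """Unadjusted Langevin dynamics:
    x <- x + eps * score(x) + sqrt(2 * eps) * z,  z ~ N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x = x + eps * score(x) + math.sqrt(2 * eps) * rng.gauss(0.0, 1.0)
    return x

# Target N(0, 1): score(x) = -x. Even starting far from the mode (x0 = 3),
# independent chains produce samples with roughly zero mean and unit variance.
samples = [langevin_sample(lambda x: -x, seed=s) for s in range(200)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note this is the *unadjusted* variant; for small \(\epsilon\) its stationary distribution is close to, but not exactly, \(p(x)\).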
The score matching objective¶
We want to learn \(s_\theta(x) \approx \nabla_x \log p_D(x)\), but \(p_D\) is unknown. Score matching solves this via integration by parts, yielding the Hyvärinen objective:

\[
J(\theta) = \mathbb{E}_{p_D}\!\left[ \tfrac{1}{2} \lVert s_\theta(x) \rVert^2 + \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big) \right]
\]
See Score Matching Objective Derivation for the full proof. For implementation guidance, see Roadmap Stage 5 and the ESM vs DSM comparison.
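To see the Hyvärinen objective \(J(\theta) = \mathbb{E}[\tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x)]\) in action, here is a 1D sketch with the model score \(s_\theta(x) = -(x - \theta)\) (the score of \(\mathcal{N}(\theta, 1)\), so \(s_\theta'(x) = -1\)); function and variable names are ours:

```python
import random

def hyvarinen_loss(theta, xs):
    """Explicit score matching loss J(theta) = E[0.5 * s(x)^2 + s'(x)]
    for the model score s(x) = -(x - theta), whose derivative is s'(x) = -1.
    Note: no density values, only scores, enter the loss."""
    return sum(0.5 * (x - theta) ** 2 - 1.0 for x in xs) / len(xs)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# The loss is minimized where the model score matches the data score (theta ~ 0).
losses = {theta: hyvarinen_loss(theta, data) for theta in (-1.0, 0.0, 1.0)}
```

For data from \(\mathcal{N}(0,1)\), \(J(\theta) \approx \tfrac{1}{2}(1 + \theta^2) - 1\), so the minimum sits at \(\theta = 0\) even though we never evaluated \(p_D\).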
Fisher Score: \(\nabla_\theta \log p(x|\theta)\)¶
Fisher score definition¶
For a parametric model \(p(x|\theta)\) with parameters \(\theta \in \mathbb{R}^p\):

\[
g(x; \theta) = \nabla_\theta \log p(x|\theta)
\]
This is a vector in parameter space—it tells you how the log-likelihood of observation \(x\) changes as you vary \(\theta\).
Fisher score intuition¶
- Positive component \(g_i > 0\): Increasing \(\theta_i\) would increase the likelihood of \(x\)
- Negative component \(g_i < 0\): Increasing \(\theta_i\) would decrease the likelihood of \(x\)
- At the MLE: the score equations hold, \(\tfrac{1}{n}\sum_i g(x_i; \hat{\theta}) = 0\) (the average score over the data vanishes)
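The score equations are easy to check for a Gaussian mean, where the Fisher score is \(g(x; \mu) = (x - \mu)/\sigma^2\) and the MLE of \(\mu\) is the sample mean. A small sketch (names are illustrative):

```python
import random

def fisher_score_mu(x, mu, sigma=1.0):
    """Fisher score of N(mu, sigma^2) w.r.t. mu: (x - mu) / sigma^2."""
    return (x - mu) / sigma ** 2

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]

mle = sum(data) / len(data)  # the MLE of mu is the sample mean

# Score equations: the average Fisher score is exactly zero at the MLE,
# and nonzero away from it (at mu = 0 it points toward the data, near +2).
avg_at_mle = sum(fisher_score_mu(x, mle) for x in data) / len(data)
avg_at_zero = sum(fisher_score_mu(x, 0.0) for x in data) / len(data)
```

The positive average score at \(\mu = 0\) is the "increasing \(\theta_i\) would increase the likelihood" signal from the bullets above.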
Why the Fisher score is useful¶
The Fisher score is central to:
- Maximum likelihood estimation: The MLE gradient is \(\nabla_\theta \ell(\theta) = \sum_i \nabla_\theta \log p(x_i|\theta)\)
- Fisher information: \(I(\theta) = \mathbb{E}_{p(x|\theta)}[g(x;\theta)\, g(x;\theta)^\top]\) bounds parameter uncertainty via the Cramér–Rao inequality
- Simulation-based inference: When \(p(x|\theta)\) is intractable but simulable
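Since the Fisher information is an expectation of squared scores, it can be estimated by Monte Carlo whenever you can simulate from \(p(x|\theta)\). A sketch for a Gaussian mean, where the closed form is \(I(\mu) = 1/\sigma^2\) (function name is ours):

```python
import random

def fisher_info_mc(mu, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of I(mu) = E[g(x; mu)^2] under p(x|mu),
    where g(x; mu) = (x - mu) / sigma^2. Closed form: 1 / sigma^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        g = (rng.gauss(mu, sigma) - mu) / sigma ** 2
        total += g * g
    return total / n

estimate = fisher_info_mc(mu=0.0, sigma=2.0)  # closed form: 1/4
```

This only-simulate requirement is the same property that makes the Fisher score attractive for simulation-based inference.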
The Fisher score matching objective¶
We want to learn \(S_W(x) \approx \nabla_\theta \log p(x|\theta)\), but the likelihood is intractable. Fisher score matching solves this via integration by parts in parameter space, replacing the intractable likelihood score with a tractable proposal score.
See Fisher Score Matching Derivation for the full proof.
Side-by-Side Comparison¶
| Aspect | Stein Score | Fisher Score |
|---|---|---|
| Symbol | \(\nabla_x \log p(x)\) | \(\nabla_\theta \log p(x\|\theta)\) |
| Lives in | Data space \(\mathbb{R}^d\) | Parameter space \(\mathbb{R}^p\) |
| Input | Data point \(x\) | Data \(x\) and parameters \(\theta\) |
| Output | Vector in \(\mathbb{R}^d\) | Vector in \(\mathbb{R}^p\) |
| Measures | Where density increases in data space | How likelihood changes with parameters |
| Zero at | Mode of \(p(x)\) | MLE (in expectation) |
When to Use Which¶
Use Stein score when¶
- Training generative models (EBMs, diffusion models)
- Sampling via Langevin dynamics or score-based MCMC
- Density estimation where you want to model \(p(x)\) directly
- The partition function \(Z_\theta\) is intractable
Use Fisher score when¶
- Parameter estimation via MLE
- Simulation-based inference where \(p(x|\theta)\) is implicit
- Sensitivity analysis of model parameters
- The likelihood \(p(x|\theta)\) is intractable but simulable
The Same Trick, Different Spaces¶
Both score matching methods use the same mathematical trick:
- Start with a squared error loss against an unknown score
- The cross term contains the unknown score
- Use \(p \nabla \log p = \nabla p\) to eliminate the log
- Integration by parts moves the derivative onto something computable
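In one dimension, the steps above can be sketched as a single chain of identities (assuming the boundary term \(s_\theta(x)\, p(x) \to 0\) vanishes):

\[
\mathbb{E}_{p}\big[s_\theta(x)\, \partial_x \log p(x)\big]
= \int s_\theta(x)\, \partial_x p(x)\, dx
= -\int \partial_x s_\theta(x)\, p(x)\, dx
= -\,\mathbb{E}_{p}\big[\partial_x s_\theta(x)\big]
\]

The left side contains the unknown score \(\partial_x \log p\); the right side involves only the model and samples from \(p\). The same chain, run with \(\partial_\theta\) instead of \(\partial_x\), gives the Fisher variant.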
| Method | IBP in | Eliminates | Replaces with |
|---|---|---|---|
| Stein score matching | Data space | Unknown data score | Trace of Jacobian |
| Fisher score matching | Parameter space | Intractable likelihood score | Proposal score |
Historical Note¶
The terminology can be confusing because:
- "Score function" in classical statistics usually means the Fisher score
- "Score" in the diffusion/EBM literature usually means the Stein score
- Both communities use "score matching" but for different objects
This document uses explicit names (Stein vs Fisher) to avoid ambiguity.
References¶
- Hyvärinen (2005). Estimation of Non-Normalized Statistical Models by Score Matching — Original Stein score matching
- Khoo et al. (2025). Direct Fisher Score Estimation for Likelihood Maximization — Fisher score matching for SBI
- Song & Ermon (2019). Generative Modeling by Estimating Gradients of the Data Distribution — Score-based generative models