Stein Score vs Fisher Score: Two Flavors of "Score"¶
This document clarifies the distinction between the Stein score (gradient w.r.t. data) and the Fisher score (gradient w.r.t. parameters)—two different objects that both go by "score" in the literature.
The Two Scores at a Glance¶
| Name | Symbol | Gradient w.r.t. | What it measures |
|---|---|---|---|
| Stein score | \(s(x) = \nabla_x \log p(x)\) | Data \(x\) | Direction of steepest increase in log-density |
| Fisher score | \(g(\theta) = \nabla_\theta \log p(x \| \theta)\) | Parameters \(\theta\) | Sensitivity of log-likelihood to parameters |
Both are gradients of a log-probability, but they differentiate with respect to different variables.
Stein Score: \(\nabla_x \log p(x)\)¶
Stein score definition¶
For a density \(p(x)\) over data \(x \in \mathbb{R}^d\):

\[
s(x) = \nabla_x \log p(x)
\]
This is a vector field over data space—at each point \(x\), it points in the direction where the density increases most rapidly.
Stein score intuition¶
- High-density regions: The score points "inward" toward the mode
- Low-density regions: The score points toward higher-density areas
- At the mode: The score is zero (gradient of log-density vanishes at maximum)
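These three behaviors are easiest to see in a 1D Gaussian, where the Stein score has the closed form \((\mu - x)/\sigma^2\). A minimal sketch (the function name is ours, purely illustrative):

```python
def stein_score_gaussian(x, mu=0.0, sigma=1.0):
    """Stein score of N(mu, sigma^2): d/dx log p(x) = (mu - x) / sigma^2."""
    return (mu - x) / sigma ** 2

# Zero at the mode; elsewhere it points back toward mu.
at_mode = stein_score_gaussian(0.0)        # 0.0
right_of_mode = stein_score_gaussian(2.0)  # -2.0 (points left, toward the mode)
left_of_mode = stein_score_gaussian(-1.0)  # 1.0 (points right, toward the mode)
```

The sign of the score always opposes the displacement from the mode, which is exactly the "points inward" behavior described above.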
Why the Stein score is useful¶
The Stein score is central to:
- Training EBMs: For \(p_\theta(x) = \exp(-E_\theta(x))/Z_\theta\), the score is \(s_\theta(x) = -\nabla_x E_\theta(x)\), which doesn't depend on \(Z_\theta\)
- Diffusion models: Learn \(\nabla_x \log p_t(x)\) at multiple noise levels
- Langevin dynamics: Sample from \(p(x)\) using \(x_{t+1} = x_t + \epsilon \nabla_x \log p(x_t) + \sqrt{2\epsilon}\, z\), with \(z \sim \mathcal{N}(0, I)\)
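The Langevin update above only needs the score, never the density itself. A minimal sketch that samples from \(\mathcal{N}(0, 1)\), whose Stein score is \(s(x) = -x\) (all names here are illustrative, not from any library):

```python
import math
import random

def langevin_sample(score, x0=3.0, eps=0.01, steps=5000, seed=0):
    """Unadjusted Langevin dynamics:
    x <- x + eps * score(x) + sqrt(2 * eps) * z,  z ~ N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x = x + eps * score(x) + math.sqrt(2 * eps) * rng.gauss(0.0, 1.0)
    return x

# Target N(0, 1): score(x) = -x. Even starting far from the mode (x0 = 3),
# independent chains produce samples with roughly zero mean and unit variance.
samples = [langevin_sample(lambda x: -x, seed=s) for s in range(200)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note this is the *unadjusted* variant; for small \(\epsilon\) its stationary distribution is close to, but not exactly, \(p(x)\).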
The score matching objective¶
We want to learn \(s_\theta(x) \approx \nabla_x \log p_D(x)\), but \(p_D\) is unknown. Score matching solves this via integration by parts, yielding the Hyvärinen objective:

\[
J(\theta) = \mathbb{E}_{p_D}\!\left[ \tfrac{1}{2} \lVert s_\theta(x) \rVert^2 + \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big) \right]
\]
See Score Matching Objective Derivation for the full proof. For implementation guidance, see Roadmap Stage 5 and the ESM vs DSM comparison.
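To see the Hyvärinen objective \(J(\theta) = \mathbb{E}[\tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x)]\) in action, here is a 1D sketch with the model score \(s_\theta(x) = -(x - \theta)\) (the score of \(\mathcal{N}(\theta, 1)\), so \(s_\theta'(x) = -1\)); function and variable names are ours:

```python
import random

def hyvarinen_loss(theta, xs):
    """Explicit score matching loss J(theta) = E[0.5 * s(x)^2 + s'(x)]
    for the model score s(x) = -(x - theta), whose derivative is s'(x) = -1.
    Note: no density values, only scores, enter the loss."""
    return sum(0.5 * (x - theta) ** 2 - 1.0 for x in xs) / len(xs)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# The loss is minimized where the model score matches the data score (theta ~ 0).
losses = {theta: hyvarinen_loss(theta, data) for theta in (-1.0, 0.0, 1.0)}
```

For data from \(\mathcal{N}(0,1)\), \(J(\theta) \approx \tfrac{1}{2}(1 + \theta^2) - 1\), so the minimum sits at \(\theta = 0\) even though we never evaluated \(p_D\).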
Fisher Score: \(\nabla_\theta \log p(x|\theta)\)¶
Fisher score definition¶
For a parametric model \(p(x|\theta)\) with parameters \(\theta \in \mathbb{R}^p\):

\[
g(x; \theta) = \nabla_\theta \log p(x|\theta)
\]
This is a vector in parameter space—it tells you how the log-likelihood of observation \(x\) changes as you vary \(\theta\).
Fisher score intuition¶
- Positive component \(g_i > 0\): Increasing \(\theta_i\) would increase the likelihood of \(x\)
- Negative component \(g_i < 0\): Increasing \(\theta_i\) would decrease the likelihood of \(x\)
- At the MLE: the score equations hold, \(\tfrac{1}{n}\sum_i g(x_i; \hat{\theta}) = 0\) (the average score over the data vanishes)
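The score equations are easy to check for a Gaussian mean, where the Fisher score is \(g(x; \mu) = (x - \mu)/\sigma^2\) and the MLE of \(\mu\) is the sample mean. A small sketch (names are illustrative):

```python
import random

def fisher_score_mu(x, mu, sigma=1.0):
    """Fisher score of N(mu, sigma^2) w.r.t. mu: (x - mu) / sigma^2."""
    return (x - mu) / sigma ** 2

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]

mle = sum(data) / len(data)  # the MLE of mu is the sample mean

# Score equations: the average Fisher score is exactly zero at the MLE,
# and nonzero away from it (at mu = 0 it points toward the data, near +2).
avg_at_mle = sum(fisher_score_mu(x, mle) for x in data) / len(data)
avg_at_zero = sum(fisher_score_mu(x, 0.0) for x in data) / len(data)
```

The positive average score at \(\mu = 0\) is the "increasing \(\theta_i\) would increase the likelihood" signal from the bullets above.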
Why the Fisher score is useful¶
The Fisher score is central to:
- Maximum likelihood estimation: The MLE gradient is \(\nabla_\theta \ell(\theta) = \sum_i \nabla_\theta \log p(x_i|\theta)\)
- Fisher information: \(I(\theta) = \mathbb{E}_{p(x|\theta)}[g(x;\theta)\, g(x;\theta)^\top]\) bounds parameter uncertainty via the Cramér–Rao inequality
- Simulation-based inference: When \(p(x|\theta)\) is intractable but simulable
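Since the Fisher information is an expectation of squared scores, it can be estimated by Monte Carlo whenever you can simulate from \(p(x|\theta)\). A sketch for a Gaussian mean, where the closed form is \(I(\mu) = 1/\sigma^2\) (function name is ours):

```python
import random

def fisher_info_mc(mu, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of I(mu) = E[g(x; mu)^2] under p(x|mu),
    where g(x; mu) = (x - mu) / sigma^2. Closed form: 1 / sigma^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        g = (rng.gauss(mu, sigma) - mu) / sigma ** 2
        total += g * g
    return total / n

estimate = fisher_info_mc(mu=0.0, sigma=2.0)  # closed form: 1/4
```

This only-simulate requirement is the same property that makes the Fisher score attractive for simulation-based inference.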
The Fisher score matching objective¶
We want to learn \(S_W(x) \approx \nabla_\theta \log p(x|\theta)\), but the likelihood is intractable. Fisher score matching solves this via integration by parts in parameter space, replacing the intractable likelihood score with a tractable proposal score.
See Fisher Score Matching Derivation for the full proof.
Side-by-Side Comparison¶
| Aspect | Stein Score | Fisher Score |
|---|---|---|
| Symbol | \(\nabla_x \log p(x)\) | \(\nabla_\theta \log p(x\|\theta)\) |
| Lives in | Data space \(\mathbb{R}^d\) | Parameter space \(\mathbb{R}^p\) |
| Input | Data point \(x\) | Data \(x\) and parameters \(\theta\) |
| Output | Vector in \(\mathbb{R}^d\) | Vector in \(\mathbb{R}^p\) |
| Measures | Where density increases in data space | How likelihood changes with parameters |
| Zero at | Mode of \(p(x)\) | MLE (in expectation) |
When to Use Which¶
Use Stein score when¶
- Training generative models (EBMs, diffusion models)
- Sampling via Langevin dynamics or score-based MCMC
- Density estimation where you want to model \(p(x)\) directly
- The partition function \(Z_\theta\) is intractable
Use Fisher score when¶
- Parameter estimation via MLE
- Simulation-based inference where \(p(x|\theta)\) is implicit
- Sensitivity analysis of model parameters
- The likelihood \(p(x|\theta)\) is intractable but simulable
The Same Trick, Different Spaces¶
Both score matching methods use the same mathematical trick:
- Start with a squared error loss against an unknown score
- The cross term contains the unknown score
- Use \(p \nabla \log p = \nabla p\) to eliminate the log
- Integration by parts moves the derivative onto something computable
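In one dimension, the steps above can be sketched as a single chain of identities (assuming the boundary term \(s_\theta(x)\, p(x) \to 0\) vanishes):

\[
\mathbb{E}_{p}\big[s_\theta(x)\, \partial_x \log p(x)\big]
= \int s_\theta(x)\, \partial_x p(x)\, dx
= -\int \partial_x s_\theta(x)\, p(x)\, dx
= -\,\mathbb{E}_{p}\big[\partial_x s_\theta(x)\big]
\]

The left side contains the unknown score \(\partial_x \log p\); the right side involves only the model and samples from \(p\). The same chain, run with \(\partial_\theta\) instead of \(\partial_x\), gives the Fisher variant.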
| Method | IBP in | Eliminates | Replaces with |
|---|---|---|---|
| Stein score matching | Data space | Unknown data score | Trace of Jacobian |
| Fisher score matching | Parameter space | Intractable likelihood score | Proposal score |
Historical Note¶
The terminology can be confusing because:
- "Score function" in classical statistics usually means the Fisher score
- "Score" in the diffusion/EBM literature usually means the Stein score
- Both communities use "score matching" but for different objects
This document uses explicit names (Stein vs Fisher) to avoid ambiguity.
References¶
- Hyvärinen (2005). Estimation of Non-Normalized Statistical Models by Score Matching — Original Stein score matching
- Khoo et al. (2025). Direct Fisher Score Estimation for Likelihood Maximization — Fisher score matching for SBI
- Song & Ermon (2019). Generative Modeling by Estimating Gradients of the Data Distribution — Score-based generative models