Why MLE for EBMs is Hard: The Villain Origin Story
This document derives the gradient of the log-likelihood for energy-based models, revealing why maximum likelihood estimation (MLE) is computationally challenging. The derivation shows that the gradient contains an intractable expectation under the model distribution—the fundamental obstacle that motivates alternative training methods like score matching and contrastive divergence.
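Goal
For an energy-based model
\[
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(x')}\,dx',
\]
show that the gradient of the log-likelihood has the form
\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}\left[\nabla_\theta E_\theta(x')\right].
\]
That second term is the painful part: it's an expectation under the model distribution \(p_\theta\), which is usually intractable and has to be approximated by sampling (often MCMC).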
Step 0: Notation
- \(x\): a data point
- \(\theta\): model parameters
- \(E_\theta(x)\): energy function
- \(Z_\theta\): partition function (normalizer)
- \(\nabla_\theta\): gradient w.r.t. parameters \(\theta\)
- \(\mathbb{E}_{p_\theta}[\cdot]\): expectation where \(x' \sim p_\theta(x')\)
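Assumption (standard regularity): we can swap gradient and integral:
\[
\nabla_\theta \int e^{-E_\theta(x')}\,dx' = \int \nabla_\theta e^{-E_\theta(x')}\,dx'.
\]
(You can justify this with dominated convergence / the Leibniz rule; most ML papers simply assume it.)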
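To make the notation concrete, here is a minimal sketch on a toy model where \(Z_\theta\) is a finite sum and can be computed exactly. The state space `STATES`, the energy \(E_\theta(x) = \theta x^2\), and the function names are all assumptions chosen for illustration; none of them come from the derivation itself.

```python
import math

# Hypothetical finite state space, so the "integral" in Z_theta is a plain sum.
STATES = [-2, -1, 0, 1, 2]

def energy(theta, x):
    # Assumed toy energy E_theta(x) = theta * x^2 with a scalar parameter theta.
    return theta * x * x

def partition(theta):
    # Z_theta = sum over all states x' of exp(-E_theta(x')).
    return sum(math.exp(-energy(theta, s)) for s in STATES)

def prob(theta, x):
    # p_theta(x) = exp(-E_theta(x)) / Z_theta.
    return math.exp(-energy(theta, x)) / partition(theta)

theta = 0.5
probs = [prob(theta, s) for s in STATES]
print(probs, sum(probs))
```

With five states the normalizer is trivial. The difficulty discussed in this document appears once the state space is too large to enumerate, because `partition` then has no tractable equivalent.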
Step 1: Start with the log density
\[
\log p_\theta(x) = \log \frac{e^{-E_\theta(x)}}{Z_\theta}
\]
Explanation: Just take logs of the EBM definition.
Step 2: Split numerator and denominator
\[
\log p_\theta(x) = \log e^{-E_\theta(x)} - \log Z_\theta
\]
Explanation: \(\log(a/b) = \log a - \log b\).
Step 3: Simplify the first term
\[
\log p_\theta(x) = -E_\theta(x) - \log Z_\theta
\]
Explanation: \(\log(e^{u}) = u\). Here \(u = -E_\theta(x)\).
Step 4: Differentiate w.r.t. \(\theta\)
\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z_\theta
\]
Explanation: Gradient is linear; derivative of \(-E_\theta(x)\) is \(-\nabla_\theta E_\theta(x)\).
So the only remaining job is to compute \(\nabla_\theta \log Z_\theta\).
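Step 5: Differentiate \(\log Z_\theta\) using the chain rule
\[
\nabla_\theta \log Z_\theta = \frac{1}{Z_\theta}\,\nabla_\theta Z_\theta
\]
Explanation: \(\nabla_\theta \log u = (\nabla_\theta u)/u\). Here \(u = Z_\theta\).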
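Step 6: Expand \(Z_\theta\) and move gradient inside the integral
\[
\nabla_\theta Z_\theta = \nabla_\theta \int e^{-E_\theta(x')}\,dx' = \int \nabla_\theta e^{-E_\theta(x')}\,dx'
\]
Explanation: This is the "swap gradient and integral" step from Step 0.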
Step 7: Differentiate the exponential
\[
\nabla_\theta e^{-E_\theta(x')} = -e^{-E_\theta(x')}\,\nabla_\theta E_\theta(x')
\]
Explanation: Chain rule: derivative of \(e^{u}\) is \(e^{u} \nabla u\). Here \(u = -E_\theta(x')\).
Step 8: Substitute back into \(\nabla_\theta Z_\theta\)
\[
\nabla_\theta Z_\theta = -\int e^{-E_\theta(x')}\,\nabla_\theta E_\theta(x')\,dx'
\]
Explanation: Just plug in the expression from Step 7.
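Step 9: Plug into \(\nabla_\theta \log Z_\theta\)
\[
\nabla_\theta \log Z_\theta = -\int \frac{e^{-E_\theta(x')}}{Z_\theta}\,\nabla_\theta E_\theta(x')\,dx'
\]
Explanation: Divide the integral from Step 8 by \(Z_\theta\), as Step 5 requires. Since \(Z_\theta\) does not depend on \(x'\), it can move inside the integral, where the resulting ratio is exactly the definition of \(p_\theta(x')\).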
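Step 10: Recognize \(p_\theta(x')\) and rewrite as an expectation
Since
\[
p_\theta(x') = \frac{e^{-E_\theta(x')}}{Z_\theta},
\]
we have
\[
\nabla_\theta \log Z_\theta = -\int p_\theta(x')\,\nabla_\theta E_\theta(x')\,dx' = -\mathbb{E}_{p_\theta}\left[\nabla_\theta E_\theta(x')\right].
\]
Explanation: An expectation is just an integral weighted by the density.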
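As a sanity check on Steps 5 to 10, here is a worked example under an assumed energy that does not appear in the derivation above: the one-dimensional quadratic \(E_\theta(x) = \tfrac{1}{2}\theta x^2\) with \(\theta > 0\). For this choice \(p_\theta\) is a zero-mean Gaussian with variance \(1/\theta\), so both sides of the Step 10 identity are available in closed form.

Differentiating the known normalizer directly:
\[
Z_\theta = \int_{-\infty}^{\infty} e^{-\theta (x')^2/2}\,dx' = \sqrt{\frac{2\pi}{\theta}}, \qquad \nabla_\theta \log Z_\theta = \nabla_\theta\left(\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log \theta\right) = -\frac{1}{2\theta}.
\]
Using the expectation form instead, with \(\nabla_\theta E_\theta(x') = \tfrac{1}{2}(x')^2\) and \(\mathbb{E}_{p_\theta}[(x')^2] = 1/\theta\):
\[
-\mathbb{E}_{p_\theta}\left[\nabla_\theta E_\theta(x')\right] = -\tfrac{1}{2}\,\mathbb{E}_{p_\theta}\left[(x')^2\right] = -\frac{1}{2\theta}.
\]
The two routes agree. The closed form exists here only because the Gaussian integral is known; for a general \(E_\theta\), the expectation form is all that is available.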
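Step 11: Put it all together (the classic EBM gradient)
Recall Step 4:
\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z_\theta
\]
Substitute Step 10:
\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}\left[\nabla_\theta E_\theta(x')\right]
\]
Explanation (the intuition), for gradient ascent on the log-likelihood: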
- First term (data term): push down energy on observed data \(x\) → "make data likely."
- Second term (model term): push up energy on typical samples \(x' \sim p_\theta\) → "make non-data less likely."
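The final formula can be checked numerically on a toy model. The sketch below assumes a hypothetical five-state space and the energy \(E_\theta(x) = \theta x^2\) (both chosen for illustration only), computes the two-term gradient with the model expectation evaluated exactly, and compares it with a central finite difference of \(\log p_\theta(x)\).

```python
import math

STATES = [-2, -1, 0, 1, 2]  # hypothetical finite state space

def energy(theta, x):
    return theta * x * x  # assumed toy energy E_theta(x) = theta * x^2

def grad_energy(theta, x):
    return x * x  # dE/dtheta for the toy energy

def log_prob(theta, x):
    # log p_theta(x) = -E_theta(x) - log Z_theta (Step 3).
    log_z = math.log(sum(math.exp(-energy(theta, s)) for s in STATES))
    return -energy(theta, x) - log_z

def grad_log_prob(theta, x):
    # Data term minus... plus model term: -dE(x) + E_{p_theta}[dE(x')] (Step 11).
    weights = [math.exp(-energy(theta, s)) for s in STATES]
    z = sum(weights)
    model_term = sum(w / z * grad_energy(theta, s) for w, s in zip(weights, STATES))
    return -grad_energy(theta, x) + model_term

theta, x, eps = 0.5, 1, 1e-6
analytic = grad_log_prob(theta, x)
numeric = (log_prob(theta + eps, x) - log_prob(theta - eps, x)) / (2 * eps)
print(analytic, numeric)
```

The two values should match to many decimal places. The exact `model_term` is only possible because the state space is small enough to enumerate.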
Why This Makes MLE Hard (The Punchline)
That expectation
\[
\mathbb{E}_{p_\theta}\left[\nabla_\theta E_\theta(x')\right]
\]
requires sampling from \(p_\theta\).
But sampling from \(p_\theta\) is hard because:
- \(p_\theta\) is only defined via an energy (unnormalized form)
- You usually need MCMC (Langevin dynamics, HMC, Gibbs, etc.)
- MCMC can be slow, and the resulting gradient estimate is biased when the chain has not mixed
- Doing this inside every gradient step is brutal
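The following is a minimal sketch of what "sampling inside the gradient step" looks like, using a random-walk Metropolis chain as one possible MCMC method. The five-state space, the energy \(E_\theta(x) = \theta x^2\), the chain length, and the function names are assumptions for illustration. Note that the sampler only ever uses energy differences, so it never needs \(Z_\theta\); the exact value is computed separately here purely for comparison, which a realistic model would not allow.

```python
import math
import random

STATES = [-2, -1, 0, 1, 2]  # hypothetical finite state space

def energy(theta, x):
    return theta * x * x  # assumed toy energy

def grad_energy(theta, x):
    return x * x  # dE/dtheta for the toy energy

def exact_model_term(theta):
    # Only feasible because the toy state space can be enumerated.
    weights = [math.exp(-energy(theta, s)) for s in STATES]
    z = sum(weights)
    return sum(w / z * grad_energy(theta, s) for w, s in zip(weights, STATES))

def metropolis_model_term(theta, n_steps, rng):
    # Estimate E_{p_theta}[dE/dtheta] with a random-walk Metropolis chain.
    x = 0
    total = 0.0
    for _ in range(n_steps):
        proposal = x + rng.choice([-1, 1])
        if proposal in STATES:  # proposals outside the state space are rejected
            # Acceptance uses only the energy difference, never Z_theta.
            accept = min(1.0, math.exp(energy(theta, x) - energy(theta, proposal)))
            if rng.random() < accept:
                x = proposal
        total += grad_energy(theta, x)
    return total / n_steps

rng = random.Random(0)
estimate = metropolis_model_term(0.5, 200_000, rng)
exact = exact_model_term(0.5)
print(estimate, exact)
```

Even in this one-dimensional toy, a reasonably accurate estimate takes a long chain, and that cost would be paid again after every parameter update.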
This is why people use:
- Contrastive divergence / persistent CD (approximate MCMC)
- Score matching / denoising score matching (avoid \(Z_\theta\))
- Noise-contrastive estimation (reframing as classification)
- Diffusion/score-based models (learn \(\nabla_x \log p\) directly)
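To illustrate the first item, here is a rough sketch of a contrastive-divergence style gradient estimate: instead of running a chain to convergence, each chain starts at a data point and runs for only `k` steps. The toy state space, the energy \(E_\theta(x) = \theta x^2\), the Metropolis kernel, the dataset, and all names are assumptions for illustration, not a reference implementation of any particular paper's algorithm.

```python
import math
import random

STATES = [-2, -1, 0, 1, 2]  # hypothetical finite state space

def energy(theta, x):
    return theta * x * x  # assumed toy energy

def grad_energy(theta, x):
    return x * x  # dE/dtheta for the toy energy

def metropolis_step(theta, x, rng):
    # One random-walk Metropolis move; uses energy differences only.
    proposal = x + rng.choice([-1, 1])
    if proposal in STATES:
        accept = min(1.0, math.exp(energy(theta, x) - energy(theta, proposal)))
        if rng.random() < accept:
            return proposal
    return x

def cd_gradient(theta, data, k, rng):
    # Negative samples: k MCMC steps started from each data point.
    negatives = []
    for x in data:
        for _ in range(k):
            x = metropolis_step(theta, x, rng)
        negatives.append(x)
    data_term = sum(grad_energy(theta, x) for x in data) / len(data)
    model_term = sum(grad_energy(theta, x) for x in negatives) / len(negatives)
    # Same two-term shape as the exact gradient, with a short-chain model term.
    return -data_term + model_term

rng = random.Random(0)
data = [0, 0, 1, -1, 0, 2, 0, -1]  # made-up observations
grad = cd_gradient(0.5, data, k=1, rng=rng)
print(grad)
```

The short chain makes each update cheap, at the price of a biased estimate of the model term, since the negatives are not true samples from \(p_\theta\).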
What's Next
If you want to go one level deeper, the natural continuation is: derive the score matching objective's "trace(Jacobian)" form via integration by parts, and show exactly where the \(p_D\) terms drop out. That's the other half of the magic.