# Energy-Based Models (EBM) Documentation
This folder contains documentation on Energy-Based Models, covering the mathematical foundations and the computational challenges that motivate modern training techniques.
## Reading Order
The documents are designed to be read in sequence, building from foundational concepts to more advanced topics:
### Foundations
- **Energy Function Normalization** — Proves that the energy-based probability formulation \(p_\theta(x) = \exp(-E_\theta(x))/Z_\theta\) is a valid normalized probability density.
- **MLE Gradient Derivation** — Derives the gradient of the log-likelihood for EBMs, revealing the intractable expectation \(\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]\) that makes MLE computationally challenging. This is the "villain origin story."
- **Stein vs Fisher Score** — Clarifies the distinction between the Stein score (\(\nabla_x \log p\)) and the Fisher score (\(\nabla_\theta \log p\)): two different "scores" used in different contexts.
- **Score Matching Objective Derivation** — Proves the integration-by-parts trick that eliminates the unknown \(p_D\) from the score matching objective, yielding the tractable trace-of-Jacobian form.
- **Fisher Score Matching Derivation** — The parameter-space analogue: proves how integration by parts eliminates the intractable \(\nabla_\theta \log p(x|\theta)\) for simulation-based inference.
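The normalization result above is easy to check numerically on a toy model. The sketch below is illustrative only (the quadratic energy, grid bounds, and step count are assumptions, not taken from the documents): it approximates \(Z_\theta = \int \exp(-E_\theta(x))\,dx\) by quadrature on a grid and confirms that \(\exp(-E_\theta(x))/Z_\theta\) integrates to one.

```python
import numpy as np

# Toy quadratic energy E(x) = x^2 / 2 (an illustrative choice); then
# p(x) = exp(-E(x)) / Z is the standard normal and Z should be sqrt(2*pi).
def energy(x):
    return 0.5 * x ** 2

x, dx = np.linspace(-10.0, 10.0, 200_001, retstep=True)
unnormalized = np.exp(-energy(x))

# Partition function by Riemann sum: Z ≈ ∫ exp(-E(x)) dx
Z = unnormalized.sum() * dx
p = unnormalized / Z

print(Z)              # close to sqrt(2 * pi) ≈ 2.50663
print(p.sum() * dx)   # close to 1.0: p is a valid normalized density
```

In one dimension the quadrature is trivial; the same integral over high-dimensional \(x\) is exactly what makes \(Z_\theta\) intractable for real EBMs.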
## Training Methods
- **Score Matching (detailed)** — Full treatment of the score matching objective.
- **Fisher Score Matching** — Parameter-space analogue for simulation-based inference.
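The integration-by-parts identity at the heart of score matching can be sanity-checked numerically in one dimension. In this sketch (the Gaussian data distribution and the mis-specified Gaussian model are illustrative assumptions), the explicit Fisher-divergence objective, which requires the unknown data score \(s_D\), matches the tractable form \(\mathbb{E}_{p_D}[\tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x)]\) up to a \(\theta\)-independent constant:

```python
import numpy as np

# Data distribution p_D = N(0, 1), whose Stein score is s_D(x) = -x.
# Model: a zero-mean Gaussian with variance sigma2, so the model score
# is s_theta(x) = -x / sigma2 with derivative -1 / sigma2.
sigma2 = 4.0

x, dx = np.linspace(-12.0, 12.0, 400_001, retstep=True)
p_data = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

s_data = -x
s_model = -x / sigma2
ds_model = np.full_like(x, -1.0 / sigma2)  # trace-of-Jacobian term in 1D

# Explicit objective (needs the unknown data score s_data):
explicit = np.sum(p_data * 0.5 * (s_model - s_data) ** 2) * dx
# Tractable form after integration by parts (s_data eliminated):
tractable = np.sum(p_data * (0.5 * s_model ** 2 + ds_model)) * dx
# The two differ only by the theta-independent constant (1/2) E[s_data^2]:
const = np.sum(p_data * 0.5 * s_data ** 2) * dx

print(explicit, tractable + const)  # agree to quadrature precision
```

Because the constant does not depend on \(\theta\), minimizing the tractable form over \(\theta\) is equivalent to minimizing the explicit Fisher divergence.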
## Coming Soon
- Contrastive Divergence — Approximate MCMC for tractable training
- Noise-Contrastive Estimation — Reframing EBM training as classification
- Denoising Score Matching — Practical variant avoiding the trace term
## Key Concepts
| Concept | Symbol | Description |
|---|---|---|
| Energy function | \(E_\theta(x)\) | Maps data to scalar "energy" (lower = more probable) |
| Partition function | \(Z_\theta\) | Normalizing constant \(\int \exp(-E_\theta(x)) dx\) |
| Score function | \(\nabla_x \log p(x)\) | Gradient of log-density w.r.t. data |
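A concrete instance ties the three rows together (the quadratic energy here is an illustrative choice, not one used in the documents). For \(E_\theta(x) = x^2/2\):

\[
Z_\theta = \int \exp(-x^2/2)\,dx = \sqrt{2\pi}, \qquad
p_\theta(x) = \frac{\exp(-x^2/2)}{\sqrt{2\pi}},
\]

which is the standard normal density, and the score

\[
\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) = -x
\]

requires no knowledge of \(Z_\theta\), since \(\log Z_\theta\) does not depend on \(x\). This is why score-based methods sidestep the partition function entirely.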
## Connection to Other Topics
- **VAE**: See `../VAE/` — VAEs avoid the partition function problem by using tractable encoder/decoder distributions.
- **Score Matching**: See `../score_matching/` — Directly estimates the score function without computing \(Z_\theta\).
- **Diffusion Models**: Build on score matching to learn \(\nabla_x \log p_t(x)\) across noise levels.