VAE Q&A: Why Keep the Posterior Close to the Prior?¶
Clarifying the role of the KL divergence term and the prior assumption in VAEs.
The Question¶
In VAEs, the KL divergence term \(\mathrm{KL}(q(z|x) \| p(z))\) encourages the approximate posterior \(q(z|x)\) to stay close to the prior \(p(z)\). One justification given is that this makes the latent space "smooth"—nearby latents produce similar outputs.
But the prior \(p(z) = \mathcal{N}(0, I)\) is our assumption. How do we know it's a good assumption? If the prior doesn't match reality, why is pushing \(q(z|x)\) toward it beneficial? Is this just mathematical intuition, or is there a principled reason?
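For reference, the KL term in question is one half of the standard VAE training objective, the evidence lower bound (ELBO). The subscripts \(\theta\) (decoder parameters) and \(\phi\) (encoder parameters) are notation assumed here; the rest of this page writes the same distributions without them:

\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \;-\; \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)
\]

The first term rewards reconstruction; the second is the penalty this page is about.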
1. The Prior Is an Assumption—and a Strong One¶
Let's be explicit:
The prior \(p(z)\) is not discovered from data. It is imposed by us.
Typically:
\[ p(z) = \mathcal{N}(0, I) \]
There is no guarantee that the true latent causes of your data are Gaussian, isotropic, or even unimodal.
So if anyone claims "the prior matches the true data-generating process"—that's false in general.
The real question is:
If the prior is arbitrary, why force the posterior toward it at all?
2. The Prior Is Not a Belief—It Is a Coordinate System¶
This is the key mental pivot.
In VAEs, the prior is not primarily a probabilistic belief about reality. It is a chosen reference measure—a coordinate system in which we want the latent space to live.
Think of it this way:
- We are not saying "the world is Gaussian"
- We are saying "we choose to represent latent causes in a Gaussian coordinate system"
This is analogous to choosing:
- Euclidean vs. polar coordinates
- A basis in linear algebra
- A gauge in physics
The KL term enforces compatibility with that coordinate system.
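To make "compatibility" concrete, here is a minimal sketch of the KL term under one common assumption: the encoder outputs a diagonal Gaussian \(q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\). This page does not fix the encoder family, so that choice, the function name, and the log-variance parameterization are all illustrative. In that case the KL to \(\mathcal{N}(0, I)\) has the closed form \(\tfrac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2)\):

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal-Gaussian posterior; `mu` and `log_var` are equal-length lists.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# Posterior equal to the prior: no penalty.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Posterior mean pushed far from the origin: penalized.
kl_shifted = kl_diag_gaussian_to_standard_normal([3.0, 0.0], [0.0, 0.0])

# Posterior collapsed to a very narrow spike: also penalized.
kl_narrow = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [-4.0, -4.0])

print(kl_at_prior, kl_shifted, kl_narrow)
```

Both ways of leaving the shared coordinate system cost something: drifting away from the origin and shrinking to a point.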
3. Why Compatibility Matters (The Unavoidable Constraint)¶
The unavoidable fact:
At generation time, we must sample from something.
We do not know the aggregate posterior:
\[ q(z) = \mathbb{E}_{p_{\text{data}}(x)}\big[q(z|x)\big] \]
So we choose a simple distribution we can sample from. This forces a constraint:
The latent space must be arranged so that sampling from a simple distribution lands us in "valid" regions.
The KL term is the enforcement mechanism.
Without it:
- Each datapoint can occupy a disconnected island in latent space
- There is no globally meaningful geometry
- Samples drawn from the prior can land in regions the decoder never saw during training, so its output there is unconstrained
This is not aesthetic—it is operational necessity.
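A toy 1-D check of the "islands" failure mode. Everything here is an illustrative assumption rather than part of any VAE library: the helper name, the two posterior locations, and the crude rule that a prior sample counts as "covered" if it lies within 3 standard deviations of some posterior mean:

```python
import random


def fraction_of_prior_samples_covered(posterior_means, posterior_std,
                                      n_samples=10_000, seed=0):
    """Fraction of z ~ N(0, 1) within 3 std of at least one posterior mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # what we can actually sample at generation time
        if any(abs(z - m) <= 3.0 * posterior_std for m in posterior_means):
            covered += 1
    return covered / n_samples


# No KL pressure: posteriors sit on narrow, far-apart islands.
islands = fraction_of_prior_samples_covered([-10.0, 10.0], posterior_std=0.1)

# With KL pressure: posteriors are wide and overlap near the origin.
regularized = fraction_of_prior_samples_covered([-0.8, 0.8], posterior_std=0.6)

print(islands, regularized)
```

In the first case essentially no prior sample lands anywhere the decoder was trained; in the second, almost all of them do.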
4. Smoothness Is About Learnability, Not Truth¶
The smoothness argument deserves clarification:
Smoothness is not about correctness of the prior. It is about controlling the hypothesis class of the decoder.
The decoder is a continuous function:
\[ x = f_\theta(z) \]
If nearby \(z\)'s map to wildly different \(x\)'s, then \(f_\theta\) must be extremely non-smooth, which makes generalization from a finite training set very difficult.
The KL term forces the decoder to operate in a regime where:
- Small changes in \(z\) matter locally
- Global structure is shared
This is regularization, not epistemology.
5. The Precise Statement¶
Here is the non-hand-wavy statement you can defend:
The KL term does not encode a belief that the prior is true. It enforces that the encoder and decoder agree on a common latent reference distribution so that generation, interpolation, and generalization are possible.
No mysticism required.
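One standard identity makes the "common reference distribution" reading explicit. Averaged over the data, the KL term splits into two non-negative pieces, where \(q(z)\) is the aggregate posterior and \(I_q(x; z)\) is the mutual information under the joint \(p_{\text{data}}(x)\, q(z|x)\) (the symbol \(p_{\text{data}}\) is notation assumed here):

\[
\mathbb{E}_{p_{\text{data}}(x)}\Big[\mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)\Big]
= I_q(x; z) + \mathrm{KL}\big(q(z) \,\|\, p(z)\big),
\qquad q(z) = \mathbb{E}_{p_{\text{data}}(x)}\big[q(z|x)\big]
\]

So the averaged KL term is an upper bound on the mismatch between what the encoder produces in aggregate and what we sample from at generation time. The remaining piece, \(I_q(x; z)\), is the information the latent carries about the input, which is the part the reconstruction term pushes back on.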
6. Why Not Learn the Prior Instead?¶
You can, and people do. Examples:
- VampPrior (Tomczak & Welling, 2018)
- Hierarchical VAEs
- Mixture priors
- Normalizing flow priors
These relax the Gaussian assumption while keeping the same logic:
The posterior must stay close to some tractable reference distribution.
The principle survives:
- The form of the prior can change
- The role of the prior cannot
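A minimal sketch of what changes when the prior is a mixture rather than a single Gaussian. The KL no longer has a closed form, so it is estimated here by Monte Carlo. This is a 1-D toy with a fixed, hand-picked mixture; it is not VampPrior (which learns its components from pseudo-inputs), and all function names are hypothetical:

```python
import math
import random


def log_normal_pdf(z, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at z."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((z - mean) / std) ** 2)


def log_mixture_pdf(z, components):
    """Log density of a Gaussian mixture; components = [(weight, mean, std), ...]."""
    logs = [math.log(w) + log_normal_pdf(z, m, s) for w, m, s in components]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))


def mc_kl_to_mixture(q_mean, q_std, components, n_samples=20_000, seed=0):
    """Monte Carlo estimate of KL( N(q_mean, q_std^2) || mixture )."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, q_std)  # z ~ q
        total += log_normal_pdf(z, q_mean, q_std) - log_mixture_pdf(z, components)
    return total / n_samples


two_modes = [(0.5, -2.0, 0.5), (0.5, 2.0, 0.5)]

# The same posterior, scored against a two-mode prior and against N(0, 1).
kl_on_a_mode = mc_kl_to_mixture(2.0, 0.5, two_modes)
kl_to_standard_normal = mc_kl_to_mixture(2.0, 0.5, [(1.0, 0.0, 1.0)])

print(kl_on_a_mode, kl_to_standard_normal)
```

A posterior sitting on one mode of the mixture pays roughly \(\log 2\), far less than it pays against \(\mathcal{N}(0, 1)\). The form of the reference changed; the mechanism of pulling the posterior toward it did not.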
7. Why Geometry Matters Even with a "Wrong" Prior¶
Even if the prior is "wrong":
- Forcing consistency produces a shared latent manifold
- The decoder learns relative structure, not absolute coordinates
- Given a sufficiently flexible decoder, many different priors yield equivalent expressive power up to reparameterization
In fact:
Any continuous latent model is only identifiable up to smooth transformations.
So the "true" geometry is unobservable anyway. This is a deep but underappreciated fact.
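The simplest special case is easy to check by hand: \(\mathcal{N}(0, I)\) is rotation-invariant, so rotating every latent (and composing the decoder with the inverse rotation) gives a model with the same prior and the same distribution over data. A small 2-D sketch, with hypothetical helper names:

```python
import math


def log_standard_normal_2d(z):
    """Log density of N(0, I) in two dimensions."""
    return -math.log(2.0 * math.pi) - 0.5 * (z[0] ** 2 + z[1] ** 2)


def rotate(z, angle):
    """Rotate a 2-D point counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


z = (1.2, -0.7)
z_rotated = rotate(z, angle=0.9)

# The prior cannot tell the two coordinate systems apart.
density_gap = abs(log_standard_normal_2d(z) - log_standard_normal_2d(z_rotated))
print(density_gap)
```

The gap is zero up to floating-point rounding, so nothing in the prior singles out one orientation of the latent axes as the "true" one.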
8. The Trade-Off VAEs Make¶
Here is the sentence most papers avoid saying explicitly:
VAEs trade representational faithfulness for controllability.
The KL term is the price we pay to:
- Sample from the model
- Interpolate in latent space
- Generalize to new data
- Reason about uncertainty
Diffusion models later drop this constraint—and gain fidelity—but lose explicit latents. This is not an accident. It's a fundamental trade-off.
9. Summary¶
To answer the question precisely:
- No, it is not because the prior is assumed to be true
- No, it is not just vague intuition
- Yes, it is a deliberate inductive bias
- The bias is chosen because it makes learning, inference, and generation possible at all
One sentence you can defend publicly:
We keep \(q(z|x)\) close to \(p(z)\) not because the prior is correct, but because it defines a shared, tractable latent coordinate system that makes sampling, generalization, and learning feasible.
Connection to Other Models¶
This clarifies the evolution of generative models:
| Model | Approach |
|---|---|
| VAE | Impose global latent geometry via KL |
| Diffusion | No global latent; iterative denoising instead |
| EBMs | No normalized distribution; learn energy landscape |
Each makes a different trade-off between tractability and expressiveness.
Follow-Up: VAEs vs Diffusion vs EBMs¶
Contrasting VAEs with diffusion models and EBMs, which refuse to impose a global latent geometry—and pay the computational price instead.
The Fork in the Road: Impose Geometry vs. Refuse Geometry¶
At a high level, generative models must answer one unavoidable question:
Where do we put structure?
There are two fundamentally different answers.
Path A: VAEs — Impose a Latent Coordinate System¶
Core Commitment¶
VAEs say:
"We will represent data through a low-dimensional latent variable \(z\), and we will force that latent space to live in a simple, shared geometry."
That geometry is defined by the prior \(p(z)\).
The KL term enforces:
- Global consistency — all data points share the same latent space
- Smoothness — nearby latents produce similar outputs
- Sampleability — we can generate new data by sampling from the prior
Consequences¶
Benefits:
- Explicit latent representations
- Fast sampling (single forward pass)
- Interpolation and controllable generation
Costs:
- Decoder must explain everything through a constrained latent bottleneck
- Pixel-wise reconstruction likelihood combined with KL pressure on the latent tends to produce blurred outputs
- Mismatch between imposed geometry and true data complexity
This is not a bug—it's the cost of choosing structure up front.
Path B: Diffusion / EBMs — Refuse a Global Latent Geometry¶
Diffusion models and EBMs make a radically different choice:
"We will not assume there exists a simple latent coordinate system at all."
Instead:
- Generation happens in data space
- Uncertainty is modeled directly
- Structure emerges implicitly
No global \(z\) that must look Gaussian. No KL to a prior.
Diffusion Models: Structure via Noise, Not Coordinates¶
What Diffusion Replaces¶
Diffusion models drop:
- Explicit latent variables
- Amortized inference
- KL divergence to a prior
And replace them with:
- A Markov chain of noising and denoising steps
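- Denoising score matching as the training objective (in DDPM-style models this can also be read as a reweighted variational bound) instead of direct likelihood maximization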
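A minimal sketch of the noising half of that Markov chain, assuming a DDPM-style variance-preserving process with a linear \(\beta\) schedule. The schedule endpoints, step count, and function names are assumptions for illustration; this page does not commit to a particular diffusion formulation. Under that assumption the chain has a closed-form marginal \(q(x_t | x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)\) with \(\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)\):

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style defaults)."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one shot."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alpha_bar(linear_beta_schedule())
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]

x_early = noise_sample(x0, alpha_bars[0], rng)   # almost the original data
x_late = noise_sample(x0, alpha_bars[-1], rng)   # almost pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

Note what is fixed and what is learned: the noising process has no trainable parameters, and only the denoiser that reverses it is trained.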
The Key Philosophical Move¶
Instead of asking:
"What latent variable caused this data?"
Diffusion asks:
"How does this data locally deform probability mass?"
Geometry is learned implicitly via gradients of the log density (the score function):
\[ \nabla_x \log p(x) \]
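For intuition, the score of a 1-D Gaussian \(\mathcal{N}(\mu, \sigma^2)\) is \(-(x - \mu)/\sigma^2\): a vector pointing back toward the mode, stronger the further away you are. A small sketch that checks the analytic form against a finite-difference derivative (the specific mean, standard deviation, and helper names are arbitrary choices for illustration):

```python
import math


def log_density(x, mean=1.0, std=2.0):
    """Log density of N(mean, std^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def score_analytic(x, mean=1.0, std=2.0):
    """d/dx log N(x; mean, std^2) = -(x - mean) / std^2."""
    return -(x - mean) / std ** 2


def score_finite_difference(x, h=1e-5):
    """Central-difference approximation to the derivative of log_density."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)


x = 3.0
gap = abs(score_analytic(x) - score_finite_difference(x))
print(score_analytic(x), gap)
```

A diffusion model learns a neural approximation to this quantity for noised versions of the data distribution, where no analytic form is available.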
Consequences¶
Benefits:
- Extremely high sample quality
- No pressure to compress information into a bottleneck
- No need for a "correct" prior
Costs:
- Sampling is slow (many iterative steps)
- No explicit, compact latent representation
- Control is indirect (classifier guidance, conditioning tricks)
Energy-Based Models (EBMs): No Normalization, No Geometry¶
EBMs go even further. They say:
"We won't even define a normalized probability distribution."
They learn an energy function:
\[ E_\theta(x) : \mathcal{X} \to \mathbb{R} \]
This defines an energy landscape over data:
- Low energy = plausible data
- High energy = implausible data
The (unnormalized) probability is:
\[ p_\theta(x) \propto \exp\big(-E_\theta(x)\big) \]
No latent space. No decoder likelihood. No tractable partition function.
Consequences¶
Benefits:
- Maximum flexibility
- No imposed geometry
- Very expressive
Costs:
- Training is hard: the partition function is intractable, so methods such as contrastive divergence or noise contrastive estimation are needed
- Sampling requires MCMC or Langevin dynamics
- Inference is expensive
EBMs trade everything for expressiveness.
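To show what "sampling requires Langevin dynamics" means in practice, here is a minimal sketch of unadjusted Langevin sampling on a toy energy. The energy \(E(x) = x^2/2\) is hand-written rather than learned (so the target is a standard normal and the result can be checked), and the step size, step count, and function names are illustrative assumptions:

```python
import math
import random


def energy_gradient(x):
    """Gradient of the toy energy E(x) = x^2 / 2, whose exp(-E) is an unnormalized N(0, 1)."""
    return x


def langevin_samples(num_steps=20_000, step_size=0.1, burn_in=1_000, seed=0):
    """Unadjusted Langevin: x <- x - step * grad E(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x = 5.0  # deliberately start far from the low-energy region
    samples = []
    for step in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step_size * energy_gradient(x) + math.sqrt(2.0 * step_size) * noise
        if step >= burn_in:
            samples.append(x)
    return samples


samples = langevin_samples()
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```

The sample mean and variance come out close to 0 and 1, with a small bias from the finite step size. The partition function never appears: only the gradient of the energy is needed, which is why this works for unnormalized models, and why every sample costs many gradient evaluations.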
Why Diffusion Beat VAEs in Vision¶
Here's the uncomfortable truth:
Natural images do not live on a globally smooth, low-dimensional manifold.
They have:
- Sharp edges
- Multi-scale structure
- Combinatorial variation
Forcing all of that through:
\[ z \sim \mathcal{N}(0, I), \qquad z \in \mathbb{R}^d \ \text{with} \ d \ll \dim(x) \]
is an extreme compression.
Diffusion avoids that compression entirely. That's why it wins on fidelity.
Why VAEs Are Still Essential¶
For applications like world models, JEPA, and structured reasoning, VAEs provide critical capabilities:
What VAEs give you:
- Explicit latent variables
- Amortized inference (fast encoding)
- Compact representations
- Fast rollout for planning
What these enable:
- World models and dynamics learning
- Planning and reasoning
- Controllable generation
- Uncertainty quantification
Diffusion is a weaker fit for these use cases: sampling takes many iterative steps, and there is no compact latent to plan in. That's why modern systems often combine both approaches.
Modern Hybrids: The Synthesis Phase¶
Current research is converging on hybrids:
- VAE-style latents for structure and fast inference
- Diffusion-style decoders for high-fidelity generation
- Learned priors instead of fixed Gaussians
- Latent diffusion (run diffusion in latent space, not pixel space)
This is not regression—it's reconciliation.
Summary: The Core Trade-Off¶
VAEs assume a simple latent geometry and pay with fidelity. Diffusion refuses a global latent geometry and pays with computation.
This captures the fundamental design choice in generative modeling.
The Meta-Lesson¶
The question asked earlier:
"Is the prior just intuition?"
turns out to be the central design question of generative modeling.
Every modern model answers it differently:
| Model | Answer to "Where is structure?" |
|---|---|
| VAE | In an explicit latent space with imposed geometry |
| Diffusion | In the score function, learned implicitly |
| EBM | In the energy landscape, no normalization |
| Latent Diffusion | Hybrid: latent structure + diffusion fidelity |
And that's why this field is still very much alive.
Where to Go Next¶
The natural continuation:
- Latent Diffusion — the explicit bridge between VAEs and diffusion
- World Models / JEPA — where the latent is learned by prediction, not reconstruction
These build directly on the concepts covered here.
References¶
- VAE-01-overview.md — Main VAE theory
- VAE-02-elbo.md — ELBO derivation
- VAE-03-inference.md — Why we introduce \(q(z|x)\)
- reparameterization-trick.md — The reparameterization trick