VAE Q&A: Why Keep the Posterior Close to the Prior?

Clarifying the role of the KL divergence term and the prior assumption in VAEs.


The Question

In VAEs, the KL divergence term \(\mathrm{KL}(q(z|x) \| p(z))\) encourages the approximate posterior \(q(z|x)\) to stay close to the prior \(p(z)\). One justification given is that this makes the latent space "smooth"—nearby latents produce similar outputs.

But the prior \(p(z) = \mathcal{N}(0, I)\) is our assumption. How do we know it's a good assumption? If the prior doesn't match reality, why is pushing \(q(z|x)\) toward it beneficial? Is this just mathematical intuition, or is there a principled reason?


1. The Prior Is an Assumption—and a Strong One

Let's be explicit:

The prior \(p(z)\) is not discovered from data. It is imposed by us.

Typically:

\[ p(z) = \mathcal{N}(0, I) \]

There is no guarantee that the true latent causes of your data are Gaussian, isotropic, or even unimodal.

So if anyone claims "the prior matches the true data-generating process"—that's false in general.

The real question is:

If the prior is arbitrary, why force the posterior toward it at all?


2. The Prior Is Not a Belief—It Is a Coordinate System

This is the key mental pivot.

In VAEs, the prior is not primarily a probabilistic belief about reality. It is a chosen reference measure—a coordinate system in which we want the latent space to live.

Think of it this way:

  • We are not saying "the world is Gaussian"
  • We are saying "we choose to represent latent causes in a Gaussian coordinate system"

This is analogous to choosing:

  • Euclidean vs. polar coordinates
  • A basis in linear algebra
  • A gauge in physics

The KL term enforces compatibility with that coordinate system.


3. Why Compatibility Matters (The Unavoidable Constraint)

The unavoidable fact:

At generation time, we must sample from something.

We do not have tractable access to the aggregate posterior:

\[ q_{\text{agg}}(z) = \int q(z|x) \, p_{\text{data}}(x) \, dx \]

So we choose a simple distribution we can sample from. This forces a constraint:

The latent space must be arranged so that sampling from a simple distribution lands us in "valid" regions.

The KL term is the enforcement mechanism.

Without it:

  • Each datapoint can occupy a disconnected island in latent space
  • There is no globally meaningful geometry
  • Sampling becomes undefined behavior

This is not aesthetic—it is operational necessity.
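
To make this concrete, here is a minimal sketch (assuming PyTorch and toy linear networks; illustrative only, not a tuned model) of how the closed-form Gaussian KL term and prior sampling appear in a standard VAE training loop:

```python
# Minimal sketch (assuming PyTorch; toy linear networks and shapes, not a tuned
# model) of the two pieces discussed above: the closed-form KL that pulls
# q(z|x) = N(mu, sigma^2 I) toward p(z) = N(0, I) during training, and
# generation by sampling z from that same prior.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16
encoder = nn.Linear(x_dim, 2 * z_dim)   # outputs [mu, log_var]
decoder = nn.Linear(z_dim, x_dim)

def vae_loss(x):
    mu, log_var = encoder(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
    x_hat = decoder(z)
    recon = ((x_hat - x) ** 2).sum(dim=-1)                    # reconstruction term (up to constants)
    # KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)
    return (recon + kl).mean()

# Generation time: we must sample from *something*, and the prior is the
# distribution the training objective arranged the latent space around.
with torch.no_grad():
    z = torch.randn(8, z_dim)   # z ~ p(z) = N(0, I)
    samples = decoder(z)        # meaningful only because q(z|x) was kept near p(z)
```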


4. Smoothness Is About Learnability, Not Truth

The smoothness argument deserves clarification:

Smoothness is not about correctness of the prior. It is about controlling the hypothesis class of the decoder.

The decoder is a continuous function:

\[ x = f_\theta(z) \]

If nearby \(z\)'s map to wildly different \(x\)'s, then \(f_\theta\) must be extremely non-smooth, which makes generalization impossible.

The KL term forces the decoder to operate in a regime where:

  • Small changes in \(z\) matter locally
  • Global structure is shared

This is regularization, not epistemology.
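
A small sketch of what this buys in practice, with the same toy encoder and decoder as above (redefined here so the snippet stands alone): interpolating between two posterior means only yields coherent intermediate outputs because the KL term keeps every posterior inside one shared region that the decoder has learned to cover.

```python
# Sketch of latent interpolation (assuming PyTorch; toy modules as above).
# The straight line between two posterior means stays in territory the decoder
# has actually seen only because the KL term enforces a shared latent geometry.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16
encoder = nn.Linear(x_dim, 2 * z_dim)
decoder = nn.Linear(z_dim, x_dim)

def interpolate(x_a, x_b, steps=8):
    mu_a, _ = encoder(x_a).chunk(2, dim=-1)
    mu_b, _ = encoder(x_b).chunk(2, dim=-1)
    path = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * mu_a + t * mu_b    # straight line between posterior means
        path.append(decoder(z))          # smooth decoder => gradual change along the path
    return torch.stack(path)

frames = interpolate(torch.randn(1, x_dim), torch.randn(1, x_dim))
```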


5. The Precise Statement

Here is the non-hand-wavy statement you can defend:

The KL term does not encode a belief that the prior is true. It enforces that the encoder and decoder agree on a common latent reference distribution so that generation, interpolation, and generalization are possible.

No mysticism required.


6. Why Not Learn the Prior Instead?

Indeed—and people do. Examples:

  • VampPrior (Tomczak & Welling, 2018)
  • Hierarchical VAEs
  • Mixture priors
  • Normalizing flow priors

These relax the Gaussian assumption while keeping the same logic:

The posterior must stay close to some tractable reference distribution.

The principle survives:

  • The form of the prior can change
  • The role of the prior cannot
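
As a rough illustration of how the logic survives with a learned prior (a sketch only, assuming PyTorch distributions; this shows the general mixture-prior idea, not the exact VampPrior construction): the KL term loses its closed form, so it is estimated by Monte Carlo with samples from \(q(z|x)\), but it still pulls the posterior toward a tractable reference distribution.

```python
# Sketch of the same KL logic with a learned mixture-of-Gaussians prior
# (assuming PyTorch distributions; illustrative, not the exact VampPrior setup).
# Monte Carlo estimate:  KL(q || p) ~= E_{z~q}[ log q(z|x) - log p(z) ].
import torch
import torch.distributions as D

z_dim, n_components = 16, 10
mix_logits = torch.zeros(n_components, requires_grad=True)
means      = torch.randn(n_components, z_dim, requires_grad=True)
log_scales = torch.zeros(n_components, z_dim, requires_grad=True)

def learned_prior():
    components = D.Independent(D.Normal(means, log_scales.exp()), 1)
    return D.MixtureSameFamily(D.Categorical(logits=mix_logits), components)

def kl_to_learned_prior(mu, log_var, n_samples=8):
    q = D.Independent(D.Normal(mu, (0.5 * log_var).exp()), 1)
    z = q.rsample((n_samples,))   # reparameterized samples from the posterior
    return (q.log_prob(z) - learned_prior().log_prob(z)).mean(dim=0)

kl = kl_to_learned_prior(torch.zeros(4, z_dim), torch.zeros(4, z_dim))
```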

7. Why Geometry Matters Even with a "Wrong" Prior

Even if the prior is "wrong":

  • Forcing consistency produces a shared latent manifold
  • The decoder learns relative structure, not absolute coordinates
  • Many different priors yield equivalent expressive power up to reparameterization

In fact:

Without additional assumptions, a continuous latent-variable model is identifiable only up to smooth transformations of the latent space.

So the "true" geometry is unobservable anyway. This is a deep but underappreciated fact.
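
One way to see this is a standard change-of-variables argument. If \(T\) is any smooth bijection of the latent space, then replacing the decoder by \(f_\theta \circ T^{-1}\) and the prior by the pushforward of \(p(z)\) under \(T\) leaves the distribution over data unchanged:

\[ x = f_\theta(z), \; z \sim p(z) \quad\Longleftrightarrow\quad x = \left(f_\theta \circ T^{-1}\right)(z'), \; z' = T(z) \sim T_{\#}\, p(z) \]

The data alone cannot tell these models apart; what the KL term fixes is which representative of this equivalence class the encoder and decoder agree to work in.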


8. The Trade-Off VAEs Make

Here is the sentence most papers avoid saying explicitly:

VAEs trade representational faithfulness for controllability.

The KL term is the price we pay to:

  • Sample from the model
  • Interpolate in latent space
  • Generalize to new data
  • Reason about uncertainty

Diffusion models later drop this constraint—and gain fidelity—but lose explicit latents. This is not an accident. It's a fundamental trade-off.


9. Summary

To answer the question precisely:

  • No, it is not because the prior is assumed to be true
  • No, it is not just vague intuition
  • Yes, it is a deliberate inductive bias
  • The bias is chosen because it makes learning, inference, and generation possible at all

One sentence you can defend publicly:

We keep \(q(z|x)\) close to \(p(z)\) not because the prior is correct, but because it defines a shared, tractable latent coordinate system that makes sampling, generalization, and learning feasible.


Connection to Other Models

This clarifies the evolution of generative models:

  Model       Approach
  VAE         Impose global latent geometry via KL
  Diffusion   No global latent; iterative denoising instead
  EBMs        No normalized distribution; learn energy landscape

Each makes a different trade-off between tractability and expressiveness.


Follow-Up: VAEs vs Diffusion vs EBMs

Contrasting VAEs with diffusion models and EBMs, which refuse to impose a global latent geometry—and pay the computational price instead.


The Fork in the Road: Impose Geometry vs. Refuse Geometry

At a high level, generative models must answer one unavoidable question:

Where do we put structure?

There are two fundamentally different answers.


Path A: VAEs — Impose a Latent Coordinate System

Core Commitment

VAEs say:

"We will represent data through a low-dimensional latent variable \(z\), and we will force that latent space to live in a simple, shared geometry."

That geometry is defined by the prior \(p(z)\).

The KL term enforces:

  • Global consistency — all data points share the same latent space
  • Smoothness — nearby latents produce similar outputs
  • Sampleability — we can generate new data by sampling from the prior

Consequences

Benefits:

  • Explicit latent representations
  • Fast sampling (single forward pass)
  • Interpolation and controllable generation

Costs:

  • Decoder must explain everything through a constrained latent bottleneck
  • Likelihood pressure + KL pressure → blurred outputs
  • Mismatch between imposed geometry and true data complexity

This is not a bug—it's the cost of choosing structure up front.


Path B: Diffusion / EBMs — Refuse a Global Latent Geometry

Diffusion models and EBMs make a radically different choice:

"We will not assume there exists a simple latent coordinate system at all."

Instead:

  • Generation happens in data space
  • Uncertainty is modeled directly
  • Structure emerges implicitly

No global \(z\) that must look Gaussian. No KL to a prior.


Diffusion Models: Structure via Noise, Not Coordinates

What Diffusion Replaces

Diffusion models drop:

  • Explicit latent variables
  • Amortized inference
  • KL divergence to a prior

And replace them with:

  • A Markov chain of noising and denoising steps
  • Score matching instead of likelihood maximization
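
A minimal sketch of one training step (assuming PyTorch and a toy DDPM-style noise-prediction objective with a placeholder network; not a complete implementation) makes the contrast with the VAE sketches above visible: there is no encoder, no latent \(z\), and no KL to a prior.

```python
# Minimal sketch of one diffusion training step (assuming PyTorch; toy
# DDPM-style noise-prediction objective with a placeholder MLP).
# The structure lives in the denoiser and the noise schedule, not in a latent prior.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention schedule

x_dim = 784
denoiser = nn.Sequential(nn.Linear(x_dim + 1, 256), nn.ReLU(), nn.Linear(256, x_dim))

def diffusion_loss(x0):
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps             # forward noising, directly in data space
    t_feat = (t.float() / T).unsqueeze(-1)                  # crude timestep conditioning
    eps_hat = denoiser(torch.cat([x_t, t_feat], dim=-1))    # predict the noise that was added
    return ((eps_hat - eps) ** 2).mean()

loss = diffusion_loss(torch.randn(8, x_dim))
```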

The Key Philosophical Move

Instead of asking:

"What latent variable caused this data?"

Diffusion asks:

"How does this data locally deform probability mass?"

Geometry is learned implicitly via gradients of the log density (the score function):

\[ \nabla_x \log p(x) \]
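
In the denoising formulation sketched above, this score is not evaluated directly. For the Gaussian noising kernel used in DDPM-style models, it is recovered (up to the noise schedule) from the predicted noise:

\[ \nabla_{x_t} \log p_t(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}} \]

So the denoiser is, in effect, a learned estimate of the local geometry of the data distribution at every noise level.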

Consequences

Benefits:

  • Extremely high sample quality
  • No pressure to compress information into a bottleneck
  • No need for a "correct" prior

Costs:

  • Sampling is slow (many iterative steps)
  • No explicit, compact latent representation
  • Control is indirect (classifier guidance, conditioning tricks)

Energy-Based Models (EBMs): No Normalization, No Geometry

EBMs go even further. They say:

"We won't even define a normalized probability distribution."

They learn an energy function:

\[ E_\theta(x) \]

This defines an energy landscape over data:

  • Low energy = plausible data
  • High energy = implausible data

The (unnormalized) probability is:

\[ p_\theta(x) \propto \exp(-E_\theta(x)) \]

No latent space. No decoder likelihood. No tractable partition function.
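
Since there is no decoder to push samples through, generation has to descend the energy landscape directly. A sketch of unadjusted Langevin dynamics (assuming PyTorch; the quadratic energy below is a stand-in for a learned \(E_\theta\)) makes both the procedure and its cost concrete:

```python
# Sketch of unadjusted Langevin dynamics for sampling from an EBM (assuming
# PyTorch; the quadratic energy is a placeholder for a learned neural E_theta).
# Each step follows the gradient of log p(x) = -E(x) + const and adds noise:
#   x <- x - (step / 2) * grad E(x) + sqrt(step) * N(0, I)
import torch

def energy(x):
    return 0.5 * (x ** 2).sum(dim=-1)   # placeholder energy: N(0, I) up to a constant

def langevin_sample(n_chains=16, dim=2, n_steps=200, step=1e-2):
    x = torch.randn(n_chains, dim)
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - 0.5 * step * grad + step ** 0.5 * torch.randn_like(x)
    return x.detach()

samples = langevin_sample()   # hundreds of gradient evaluations for one batch of samples
```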

Consequences

Benefits:

  • Maximum flexibility
  • No imposed geometry
  • Very expressive

Costs:

  • Training is hard (contrastive divergence, noise contrastive estimation)
  • Sampling requires MCMC or Langevin dynamics
  • Inference is expensive

EBMs trade everything for expressiveness.


Why Diffusion Beat VAEs in Vision

Here's the uncomfortable truth:

Natural images do not live on a globally smooth, low-dimensional manifold.

They have:

  • Sharp edges
  • Multi-scale structure
  • Combinatorial variation

Forcing all of that through:

\[ z \sim \mathcal{N}(0, I) \]

is an extreme compression.

Diffusion avoids that compression entirely. That's why it wins on fidelity.


Why VAEs Are Still Essential

For applications like world models, JEPA, and structured reasoning, VAEs provide critical capabilities:

What VAEs give you:

  • Explicit latent variables
  • Amortized inference (fast encoding)
  • Compact representations
  • Fast rollout for planning

What these enable:

  • World models and dynamics learning
  • Planning and reasoning
  • Controllable generation
  • Uncertainty quantification

Diffusion struggles with these use cases. That's why modern systems often combine both approaches.


Modern Hybrids: The Synthesis Phase

Current research is converging on hybrids:

  • VAE-style latents for structure and fast inference
  • Diffusion-style decoders for high-fidelity generation
  • Learned priors instead of fixed Gaussians
  • Latent diffusion (run diffusion in latent space, not pixel space)
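
A toy sketch of that last arrangement (assuming PyTorch and placeholder linear modules; the point is the composition, not any particular published architecture): a VAE-style autoencoder supplies the compact latent space, and the diffusion objective is applied to those latents rather than to pixels.

```python
# Toy sketch of the latent-diffusion composition (assuming PyTorch; placeholder
# linear modules). An autoencoder supplies the compact latent space; the
# diffusion noise-prediction objective runs on latents instead of pixels.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16
encoder  = nn.Linear(x_dim, z_dim)       # VAE-style encoder (deterministic here for brevity)
decoder  = nn.Linear(z_dim, x_dim)
denoiser = nn.Linear(z_dim + 1, z_dim)   # diffusion model operating on latents

def latent_diffusion_loss(x):
    z0 = encoder(x)
    t = torch.rand(x.shape[0], 1)                   # continuous noise level in [0, 1]
    eps = torch.randn_like(z0)
    z_t = (1 - t).sqrt() * z0 + t.sqrt() * eps      # noised latent (toy variance-preserving schedule)
    eps_hat = denoiser(torch.cat([z_t, t], dim=-1))
    return ((eps_hat - eps) ** 2).mean()

loss = latent_diffusion_loss(torch.randn(8, x_dim))

# Sampling (conceptually): run the reverse diffusion process in latent space to
# obtain z, then call decoder(z) once; the expensive iteration stays in the
# small latent space rather than in pixel space.
```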

This is not regression—it's reconciliation.


Summary: The Core Trade-Off

VAEs assume a simple latent geometry and pay with fidelity. Diffusion refuses a global latent geometry and pays with computation.

This captures the fundamental design choice in generative modeling.


The Meta-Lesson

The question asked earlier:

"Is the prior just intuition?"

turns out to be the central design question of generative modeling.

Every modern model answers it differently:

  Model              Answer to "Where is structure?"
  VAE                In an explicit latent space with imposed geometry
  Diffusion          In the score function, learned implicitly
  EBM                In the energy landscape, no normalization
  Latent Diffusion   Hybrid: latent structure + diffusion fidelity

And that's why this field is still very much alive.


Where to Go Next

The natural continuation:

  1. Latent Diffusion — the explicit bridge between VAEs and diffusion
  2. World Models / JEPA — where the latent is learned by prediction, not reconstruction

These build directly on the concepts covered here.


References