Skip to content

VAE: Why We Introduce q(z|x)

Addressing the fundamental question: "Why are we allowed to just introduce a distribution over latent variables?"

1. The Natural Objection

We introduced the posterior \(q(z|x)\) in order to derive the ELBO.

And that immediately raises the natural objection:

"If our goal is to learn latent variables \(z\), why are we allowed to just introduce a distribution over them?"

This is the right question.

2. The Key Clarification

We do NOT introduce \(q(z|x)\) because we want it

We introduce it because we cannot compute the true posterior.

In words: The true object of interest is the posterior \(p_\theta(z|x)\).

In math:

\[ p_\theta(z|x) = \frac{p_\theta(x|z) \cdot p(z)}{p_\theta(x)} \]

The problem: The denominator requires:

\[ p_\theta(x) = \int p_\theta(x|z) \, p(z) \, dz \]

which is intractable.

So the logic is not:

"Let's introduce \(q\) because it's convenient."

It is:

"Exact inference is impossible, so we must approximate it."

This is classical variational inference, not a neural-network trick.

3. What Is Actually Being Optimized

Let's separate variables, distributions, and parameters.

Latent variable \(z\)

  • \(z\) is a random variable
  • It is not a parameter we optimize directly

True posterior (unavailable)

  • \(p_\theta(z|x)\) — what Bayes' rule would give us
  • Depends on unknown normalization \(p_\theta(x)\)
  • Cannot be evaluated or sampled from

Variational posterior (introduced)

  • \(q_\phi(z|x)\) — our approximation
  • Tractable, learnable
  • Parameterized by a neural network (the encoder)

The crucial distinction: We introduce \(q(z|x)\) not to replace \(z\), but to replace inference about \(z\).

4. What "Learning z" Really Means

This sentence usually causes confusion:

"VAEs learn latent variables."

What they actually do is:

VAEs learn a conditional distribution over latent variables given data.

In math:

\[ z \sim q_\phi(z|x) \]

So instead of learning:

  • A single latent code \(z_i\) per datapoint

We learn:

  • A function that maps \(x \to\) distribution over plausible \(z\)'s

This is Bayesian inference, amortized across the dataset.

5. Why Introducing \(q(z|x)\) Is Mathematically Legitimate

Here's the clean justification:

"We introduce \(q(z|x)\) as an auxiliary distribution. This does not change the likelihood. It only allows us to rewrite it in a form we can optimize."

Nothing is assumed about the data at this step — only about tractability.

In math: The ELBO derivation starts with an exact identity:

\[ \log p(x) = \log \mathbb{E}_{q(z|x)} \left[ \frac{p(x, z)}{q(z|x)} \right] \]

This identity is exact, before Jensen's inequality.

The approximation only enters when we lower-bound this expression, not when we introduce \(q\).

6. What Assumptions Are Actually Being Made

Assumptions about the model

  • Data is generated from latent variables:
\[ z \sim p(z), \quad x \sim p_\theta(x|z) \]
  • The prior \(p(z)\) is simple (e.g., Gaussian)

Assumptions about inference

  • Exact posterior inference is intractable
  • A parametric family \(q_\phi(z|x)\) is expressive enough to approximate it

What is NOT assumed

  • That there is a single "true" latent code
  • That the posterior is Gaussian in reality
  • That \(q(z|x)\) is correct — only that it is optimizable

7. Known vs Unknown (Final Summary)

Known / Fixed

What Value
Observed data \(x\)
Prior \(p(z) = \mathcal{N}(0, I)\)
Network architectures Encoder and decoder structure

Unknown / Learned

What Learned by
Decoder parameters \(\theta\) Maximizing ELBO
Encoder parameters \(\phi\) Maximizing ELBO
Latent geometry Emerges from training

Random (Sampled, Not Learned)

What Role
\(z\) Latent variable, sampled from $q_\phi(z
\(\epsilon\) Noise for reparameterization, sampled from \(\mathcal{N}(0, I)\)

8. The One Sentence That Resolves Everything

If you remember only one sentence, make it this:

We are not learning latent variables directly — we are learning how to perform inference over latent variables.

That sentence dissolves the apparent contradiction.

9. Why This Matters for What Comes Next

Once this clicks, the evolution of generative models becomes obvious:

Model Approach to Inference
VAE Explicit approximate posterior $q_\phi(z
Diffusion Implicit posterior via score matching
EBMs / JEPA No normalized posterior at all
World models Latent inference + dynamics

But they all inherit this core move:

Replace intractable inference with learnable inference.

References