Deriving the Pathwise Gradient Estimator

This document builds the pathwise (reparameterization) gradient estimator step by step, then explains why the simple derivative \(\partial z / \partial \mu = 1\) is the key insight.


Step 0: The Object We Want to Differentiate

We want gradients of an expectation where the distribution depends on parameters:

\[ J(\phi) = \mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\right] \]

For VAEs: \(f(z) = \log p_\theta(x \mid z)\), and \(q_\phi\) is \(q_\phi(z \mid x)\).


Step 1: Reparameterize the Random Variable

Assume we can write samples as a deterministic transform of parameter-free noise:

\[ \epsilon \sim p(\epsilon) \quad \text{(does NOT depend on } \phi \text{)} \]
\[ z = g_\phi(\epsilon) \]

Example (Gaussian): \(g_\phi(\epsilon) = \mu_\phi + \sigma_\phi \odot \epsilon\), with \(\epsilon \sim \mathcal{N}(0, I)\).

Then:

\[ J(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon))\right] \]
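
For concreteness, here is a minimal sketch of the Gaussian reparameterization in PyTorch; the shapes and variable names are illustrative, not from the original text.

```python
import torch

# Variational parameters phi = (mu, log_sigma); requires_grad so autodiff tracks them.
mu = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)

def sample_z(mu, log_sigma):
    """z = g_phi(eps) = mu + sigma * eps, with eps ~ N(0, I) independent of phi."""
    eps = torch.randn_like(mu)            # parameter-free noise
    return mu + torch.exp(log_sigma) * eps

z = sample_z(mu, log_sigma)               # a differentiable function of (mu, log_sigma)
```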

Step 2: Differentiate Under the Expectation (The "Pathwise" Move)

Because \(p(\epsilon)\) does not depend on \(\phi\), we can (under mild regularity conditions) move the gradient inside the expectation:

\[ \nabla_\phi J(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi f(g_\phi(\epsilon))\right] \]

Now apply the chain rule:

\[ \nabla_\phi f(g_\phi(\epsilon)) = \underbrace{\nabla_z f(z)}_{\text{gradient of downstream loss}} \bigg|_{z=g_\phi(\epsilon)} \cdot \underbrace{\nabla_\phi g_\phi(\epsilon)}_{\text{how sample moves w.r.t. } \phi} \]

So the pathwise gradient estimator is:

\[ \nabla_\phi J(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_z f(g_\phi(\epsilon)) \cdot \nabla_\phi g_\phi(\epsilon)\right] \]

And the Monte Carlo estimator with \(K\) samples is:

\[ \nabla_\phi J(\phi) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_z f(g_\phi(\epsilon_k)) \cdot \nabla_\phi g_\phi(\epsilon_k), \quad \epsilon_k \sim p(\epsilon) \]

That's the whole method: differentiate through the sampling path.
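
A minimal Monte Carlo sketch of this estimator for the Gaussian case, using a toy downstream loss \(f(z) = \lVert z \rVert^2\) and letting autodiff apply the chain rule (the loss and the sample count \(K\) are illustrative assumptions):

```python
import torch

K = 10_000
mu = torch.zeros(3, requires_grad=True)
sigma = torch.ones(3, requires_grad=True)

eps = torch.randn(K, 3)          # eps_k ~ p(eps), independent of the parameters
z = mu + sigma * eps             # z_k = g_phi(eps_k), shape (K, 3)
f = (z ** 2).sum(dim=-1)         # toy loss f(z_k) = ||z_k||^2

f.mean().backward()              # 1/K sum_k grad_z f(z_k) * grad_phi g_phi(eps_k)

print(mu.grad)      # estimates grad_mu J;    exact value is 2*mu    = [0, 0, 0]
print(sigma.grad)   # estimates grad_sigma J; exact value is 2*sigma = [2, 2, 2]
```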


Exercise: The Gaussian Case

Take the Gaussian case:

\[ z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \]

Assume \(\phi\) parameterizes \(\mu\) and \(\sigma\).

Question: What is \(\frac{\partial z}{\partial \mu}\)?

Answer: It's 1.

And that simple answer is actually the entire reason the reparameterization trick works.


1. Why \(\partial z / \partial \mu = 1\) Matters

You have:

\[ z = \mu + \sigma \epsilon \]

Treat \(\epsilon\) as a fixed input (a sampled number) during backprop.

Then:

\[ \frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon \]

These derivatives are:

  • Well-defined
  • Finite
  • Independent of probability theory

They are just ordinary calculus.

This is the key:

Once the randomness is made explicit, the gradient is completely classical.
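
A quick numerical check (a sketch; any reverse-mode autodiff framework gives the same result):

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)
eps = torch.randn(())            # one sampled number, held fixed during backprop

z = mu + sigma * eps             # plain arithmetic, nothing probabilistic left
z.backward()

print(mu.grad)                   # tensor(1.)   -> dz/dmu = 1
print(sigma.grad)                # equals eps   -> dz/dsigma = eps
```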


2. Contrast with Naïve Sampling

If instead you wrote:

\[ z \sim \mathcal{N}(\mu, \sigma^2) \]

there is no expression for:

\[ \frac{\partial z}{\partial \mu} \]

because:

  • "Sampling" is not a mathematical function
  • It hides randomness inside an opaque operation

Autodiff has nothing to differentiate.

So it's not that the derivative "should" be 1 mathematically — it's that you never gave the system a function to differentiate.
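
In autodiff terms this is the difference between a detached sample and a reparameterized one. A small illustration with torch.distributions (the particular numbers are arbitrary):

```python
import torch
from torch.distributions import Normal

mu = torch.tensor(0.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)
q = Normal(mu, sigma)

z_opaque = q.sample()    # drawn under no_grad: no graph back to mu, nothing to differentiate
z_path = q.rsample()     # computed as mu + sigma * eps: the graph back to mu exists

print(z_opaque.requires_grad)   # False
print(z_path.requires_grad)     # True -> gradients can flow through the sample
```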


3. What Gradient Actually Flows in a VAE

Putting it together, the gradient w.r.t. \(\mu\) becomes:

\[ \nabla_\mu \mathbb{E}[f(z)] \approx \frac{1}{K} \sum_{k=1}^{K} \underbrace{\nabla_z f(z_k)}_{\text{decoder gradient}} \cdot \underbrace{\frac{\partial z_k}{\partial \mu}}_{= 1} \]

So gradients flow straight through the sample.

  • No likelihood ratios
  • No REINFORCE
  • No variance explosion
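
As a worked check with the toy choice \(f(z) = z^2\) (not part of the VAE itself), the estimator recovers the exact gradient. Analytically,

\[ \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}\left[z^2\right] = \mu^2 + \sigma^2, \qquad \nabla_\mu \mathbb{E}\left[z^2\right] = 2\mu \]

while the pathwise estimate with \(z_k = \mu + \sigma \epsilon_k\) gives

\[ \frac{1}{K} \sum_{k=1}^{K} \nabla_z f(z_k) \cdot 1 = \frac{1}{K} \sum_{k=1}^{K} 2(\mu + \sigma \epsilon_k) \longrightarrow 2\mu \quad \text{as } K \to \infty \]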

4. Why It's Called a "Pathwise" Derivative

Each sampled \(\epsilon_k\) defines a path:

\[ \mu \longrightarrow z_k \longrightarrow f(z_k) \]

You differentiate along that path.

That's why the name "pathwise gradient estimator" is more descriptive than "reparameterization trick".


5. A Subtle but Crucial Point

During backprop:

  • \(\epsilon\) is treated as a constant
  • We are not differentiating randomness
  • We are differentiating a deterministic computation conditioned on noise

This is why the intuition about "determinism" is right, but needs refinement: pathwise determinism, not absolute determinism.


6. Where This Fails

This only works when:

  • You can write \(z = g_\phi(\epsilon)\)
  • With \(\epsilon\) independent of \(\phi\)

That's why:

  • Discrete sampling breaks it — no continuous path to differentiate
  • Gumbel-Softmax is an approximation — a continuous relaxation of the discrete sample (sketched after this list)
  • Score-function estimators exist as a fallback — REINFORCE for non-reparameterizable cases
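
For the discrete case, here is a minimal sketch of the Gumbel-Softmax relaxation mentioned above (the temperature, number of categories, and toy loss are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(5, requires_grad=True)     # unnormalized log-probs over 5 categories
tau = 0.5                                       # temperature; smaller -> closer to one-hot

# Gumbel noise plays the role of the external epsilon.
u = torch.rand_like(logits).clamp_min(1e-9)
gumbel = -torch.log(-torch.log(u))              # g ~ Gumbel(0, 1)
y = F.softmax((logits + gumbel) / tau, dim=-1)  # differentiable relaxation of the argmax
# (torch.nn.functional.gumbel_softmax packages the same construction.)

loss = (y * torch.arange(5.0)).sum()            # toy downstream loss
loss.backward()                                 # gradients reach the logits through y
print(logits.grad)
```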

7. Differentiating Sample Paths in SDEs (Clarification)

There's potential confusion between "pathwise derivatives" in ML and "sample paths" in SDEs. Let's clarify.

In ML: "Pathwise" = Along a Fixed Noise Realization

Given a fixed \(\epsilon\), we have a deterministic path:

\[ \theta \xrightarrow{g_\theta} z \xrightarrow{f} \text{loss} \]

We differentiate this path using the standard chain rule. The word "path" refers to the computational graph path.

In SDEs: "Sample Path" = A Realization of a Stochastic Process

A sample path \(\{X_t(\omega)\}_{t \geq 0}\) is one realization of the process, indexed by outcome \(\omega\).

Key differences:

| Aspect | ML Pathwise | SDE Sample Path |
| --- | --- | --- |
| What varies | Parameters \(\theta\) | Time \(t\) |
| Noise | Fixed \(\epsilon\) | Continuous \(W_t(\omega)\) |
| Differentiability | Standard calculus | Nowhere differentiable in \(t\) |
| Goal | \(\nabla_\theta \mathbb{E}[f(z)]\) | Describe evolution of \(X_t\) |

Why SDEs Need Special Calculus

For an SDE sample path \(X_t(\omega)\):

\[ X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s \]

  • The path \(t \mapsto X_t(\omega)\) is continuous but nowhere differentiable
  • You cannot write \(dX_t/dt\) in the classical sense
  • Itô calculus provides rules for manipulating these objects
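
To make "sample path" concrete, here is a sketch that simulates one realization with the Euler–Maruyama discretization (the drift, diffusion, and step count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 1000
dt = T / n

x = np.empty(n + 1)
x[0] = 1.0
for i in range(n):
    dW = rng.normal(0.0, np.sqrt(dt))             # Brownian increment over [t, t + dt]
    x[i + 1] = x[i] + (-x[i]) * dt + 0.5 * dW     # b(x) = -x, sigma(x) = 0.5
# x holds one sample path X_t(omega): it looks continuous, but its increments scale
# like sqrt(dt), so the difference quotient (x[i+1] - x[i]) / dt blows up as dt -> 0.
```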

The Confusion Resolved

When ML papers say "pathwise gradient," they mean:

Differentiate w.r.t. parameters along a fixed noise realization

When SDE papers say "sample path," they mean:

One realization of a continuous-time stochastic process

These are different uses of the word "path":

  • ML: path through the computational graph
  • SDEs: path through state space over time

Can We Differentiate SDE Sample Paths w.r.t. Parameters?

Yes! This is called sensitivity analysis or pathwise sensitivity:

\[ \frac{\partial X_t}{\partial \theta} \]

where \(\theta\) parameterizes the drift \(b_\theta\) or diffusion \(\sigma_\theta\).

This requires solving an auxiliary SDE (the sensitivity equation):

\[ d\left(\frac{\partial X_t}{\partial \theta}\right) = \frac{\partial b}{\partial x}\frac{\partial X_t}{\partial \theta}\,dt + \frac{\partial b}{\partial \theta}\,dt + \text{(diffusion terms)} \]
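
A minimal sketch of that sensitivity equation for the linear drift \(b_\theta(x) = \theta x\) with constant diffusion \(\sigma\) (an illustrative model, not from the original text; here \(\partial b/\partial x = \theta\), \(\partial b/\partial \theta = x\), and the diffusion terms vanish):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sig = -1.0, 0.3
T, n = 1.0, 2000
dt = T / n

x, s = 1.0, 0.0                        # X_0 and S_0 = dX_0/dtheta
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(dt))
    s = s + (theta * s + x) * dt       # sensitivity: dS = (db/dx * S + db/dtheta) dt
    x = x + theta * x * dt + sig * dW  # state:       dX = theta * X dt + sig dW
# s approximates dX_T/dtheta along this one noise realization (one extra state per parameter).
```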

This is computationally expensive and rarely used in ML. Instead, diffusion models use:

  • Denoising score matching — avoids differentiating through the SDE
  • Reparameterization at each timestep: \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) (sketched below)
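
A sketch of that per-timestep reparameterization as used when training diffusion models; the variance-preserving schedule below is an illustrative choice, not from the original text:

```python
import torch

x0 = torch.randn(8, 3)                    # batch of "clean" data (illustrative shape)
t = torch.rand(8)                         # timesteps sampled uniformly in [0, 1]
alpha_t = torch.cos(0.5 * torch.pi * t)   # illustrative schedule with alpha^2 + sigma^2 = 1
sigma_t = torch.sin(0.5 * torch.pi * t)

eps = torch.randn_like(x0)                # external noise, exactly as in the VAE case
x_t = alpha_t[:, None] * x0 + sigma_t[:, None] * eps   # x_t = alpha_t * x_0 + sigma_t * eps
# No SDE is integrated and no sensitivity equation is solved: x_t is a closed-form
# reparameterized sample, so ordinary backprop flows through it.
```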

8. Summary

The reparameterization trick works because it turns sampling into a deterministic computation that is differentiable with respect to the model parameters.

The key insight: \(\partial z / \partial \mu = 1\) is just ordinary calculus once you externalize the noise.

