# Deriving the Pathwise Gradient Estimator
This document builds the pathwise (reparameterization) gradient estimator step by step, then explains why the simple derivative \(\partial z / \partial \mu = 1\) is the key insight.
## Step 0: The Object We Want to Differentiate
We want gradients of an expectation where the distribution depends on parameters:

\[
\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)]
\]
For VAEs: \(f(z) = \log p_\theta(x \mid z)\), and the sampling distribution is the encoder \(q_\phi(z \mid x)\).
## Step 1: Reparameterize the Random Variable
Assume we can write samples as a deterministic transform of parameter-free noise:

\[
z = g_\phi(\epsilon), \qquad \epsilon \sim p(\epsilon)
\]
Example (Gaussian): \(g_\phi(\epsilon) = \mu_\phi + \sigma_\phi \odot \epsilon\), with \(\epsilon \sim \mathcal{N}(0, I)\).
Then:

\[
\mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[f(g_\phi(\epsilon))\big]
\]
## Step 2: Differentiate Under the Expectation (The "Pathwise" Move)
Because \(p(\epsilon)\) doesn't depend on \(\phi\), we can move the gradient inside:

\[
\nabla_\phi \, \mathbb{E}_{p(\epsilon)}\big[f(g_\phi(\epsilon))\big] = \mathbb{E}_{p(\epsilon)}\big[\nabla_\phi f(g_\phi(\epsilon))\big]
\]
Now apply the chain rule:

\[
\nabla_\phi f(g_\phi(\epsilon)) = \nabla_z f(z)\big|_{z = g_\phi(\epsilon)} \, \frac{\partial g_\phi(\epsilon)}{\partial \phi}
\]
So the pathwise gradient estimator is:

\[
\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[\nabla_z f(z)\big|_{z = g_\phi(\epsilon)} \, \frac{\partial g_\phi(\epsilon)}{\partial \phi}\right]
\]
And the Monte Carlo estimator with \(K\) samples is:

\[
\widehat{\nabla}_\phi = \frac{1}{K} \sum_{k=1}^{K} \nabla_z f(z)\big|_{z = g_\phi(\epsilon_k)} \, \frac{\partial g_\phi(\epsilon_k)}{\partial \phi}, \qquad \epsilon_k \sim p(\epsilon)
\]
That's the whole method: differentiate through the sampling path.
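As a quick numerical check, here is a minimal numpy sketch. It assumes the toy choice \(f(z) = z^2\) with a scalar Gaussian \(z\), so the true gradients of \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\) are known in closed form (\(2\mu\) and \(2\sigma\)); the parameter values are hypothetical.

```python
import numpy as np

# Toy problem: f(z) = z**2 with z = mu + sigma * eps, eps ~ N(0, 1).
# Closed form: E[z^2] = mu^2 + sigma^2, so d/dmu = 2*mu, d/dsigma = 2*sigma.
rng = np.random.default_rng(0)
mu, sigma, K = 1.5, 0.7, 100_000

eps = rng.standard_normal(K)       # parameter-free noise
z = mu + sigma * eps               # reparameterized samples

df_dz = 2.0 * z                    # gradient of f at each sample
grad_mu = np.mean(df_dz * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)  # dz/dsigma = eps

print(grad_mu, grad_sigma)         # close to 2*mu = 3.0 and 2*sigma = 1.4
```

Note that the estimator uses only ordinary derivatives of \(f\) and of the sampling path; no log-probabilities appear anywhere.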
## Exercise: The Gaussian Case
Take the Gaussian case:

\[
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\]
Assume \(\phi\) parameterizes \(\mu\) and \(\sigma\).
Question: What is \(\frac{\partial z}{\partial \mu}\)?
Answer: It's 1.
And that simple answer is actually the entire reason the reparameterization trick works.
## 1. Why \(\partial z / \partial \mu = 1\) Matters
You have:

\[
z = \mu + \sigma \cdot \epsilon
\]
Treat \(\epsilon\) as a fixed input (a sampled number) during backprop.
Then:

\[
\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon
\]
These derivatives are:
- Well-defined
- Finite
- Independent of probability theory
They are just ordinary calculus.
This is the key:
Once the randomness is made explicit, the gradient is completely classical.
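To make this concrete, a minimal sketch (hypothetical values for \(\mu\), \(\sigma\), and the fixed sample \(\epsilon\)): once the noise is externalized, plain finite differences recover the ordinary-calculus answers.

```python
import numpy as np

mu, sigma = 0.3, 1.2   # hypothetical parameter values
eps = 0.85             # one sampled number, held FIXED during differentiation
h = 1e-6

def z(mu, sigma, eps):
    # The reparameterized sample: a plain deterministic function.
    return mu + sigma * eps

# Ordinary finite differences along the fixed-noise path.
dz_dmu = (z(mu + h, sigma, eps) - z(mu, sigma, eps)) / h
dz_dsigma = (z(mu, sigma + h, eps) - z(mu, sigma, eps)) / h

print(dz_dmu, dz_dsigma)   # ~1.0 and ~eps = 0.85
```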
## 2. Contrast with Naïve Sampling
If instead you wrote:

\[
z \sim \mathcal{N}(\mu, \sigma^2)
\]
there is no expression for:

\[
\frac{\partial z}{\partial \mu}
\]
because:
- "Sampling" is not a mathematical function
- It hides randomness inside an opaque operation
Autodiff has nothing to differentiate.
So it's not that the derivative "should" be 1 mathematically — it's that you never gave the system a function to differentiate.
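A sketch of the contrast (numpy, hypothetical values): if you treat the sampler as opaque and force a finite difference across two independent draws, you get noise of order \(\sigma / h\) that diverges as \(h \to 0\); along the reparameterized path with the same \(\epsilon\) reused, the quotient is the true derivative.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, h = 0.0, 1.0, 1e-3

# Opaque sampler: the two draws are independent, so they differ by O(sigma)
# and this "difference quotient" is O(sigma / h) noise, diverging as h -> 0.
naive = (rng.normal(mu + h, sigma) - rng.normal(mu, sigma)) / h

# Pathwise: reuse the SAME eps, and the quotient is the true dz/dmu = 1.
eps = rng.standard_normal()
pathwise = ((mu + h) + sigma * eps - (mu + sigma * eps)) / h

print(naive, pathwise)
```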
## 3. What Gradient Actually Flows in a VAE
Putting it together, the gradient w.r.t. \(\mu\) becomes:

\[
\frac{\partial \mathcal{L}}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z}
\]
So gradients flow straight through the sample.
- No likelihood ratios
- No REINFORCE
- No variance explosion
## 4. Why It's Called a "Pathwise" Derivative
Each sampled \(\epsilon_k\) defines a path:

\[
\phi \mapsto f(g_\phi(\epsilon_k))
\]
You differentiate along that path.
That's why the name "pathwise gradient estimator" is more descriptive than "reparameterization trick".
## 5. A Subtle but Crucial Point
During backprop:
- \(\epsilon\) is treated as a constant
- We are not differentiating randomness
- We are differentiating a deterministic computation conditioned on noise
This is why the intuition about "determinism" is right, but needs refinement: pathwise determinism, not absolute determinism.
## 6. Where This Fails
This only works when:
- You can write \(z = g_\phi(\epsilon)\)
- With \(\epsilon\) independent of \(\phi\)
That's why:
- Discrete sampling breaks it — no continuous path to differentiate
- Gumbel-Softmax is an approximation — continuous relaxation of discrete
- Score-function estimators exist as a fallback — REINFORCE for non-reparameterizable cases
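As an illustration of the relaxation, a minimal Gumbel-Softmax sample in numpy. The temperature value is a hypothetical choice, and this is the standard softmax-of-perturbed-logits construction sketched from scratch, not a drop-in for any particular library API.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Continuous relaxation of categorical sampling (Gumbel-Softmax)."""
    # Gumbel(0, 1) noise makes argmax(logits + g) an exact categorical sample;
    # replacing argmax with a temperature-tau softmax keeps it differentiable.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())        # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])
sample = gumbel_softmax(logits, tau=0.5, rng=rng)   # tau = 0.5 is a hypothetical choice

print(sample)   # a point on the probability simplex, near one-hot for small tau
```

As \(\tau \to 0\) the samples approach one-hot vectors but the gradients blow up; in practice \(\tau\) is annealed or tuned.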
## 7. Differentiating Sample Paths in SDEs (Clarification)
There's potential confusion between "pathwise derivatives" in ML and "sample paths" in SDEs. Let's clarify.
### In ML: "Pathwise" = Along a Fixed Noise Realization
Given a fixed \(\epsilon\), we have a deterministic path:

\[
\theta \mapsto f(g_\theta(\epsilon))
\]
We differentiate this path using the standard chain rule. The word "path" refers to the computational graph path.
### In SDEs: "Sample Path" = A Realization of a Stochastic Process
A sample path \(\{X_t(\omega)\}_{t \geq 0}\) is one realization of the process, indexed by outcome \(\omega\).
Key differences:
| Aspect | ML Pathwise | SDE Sample Path |
|---|---|---|
| What varies | Parameters \(\theta\) | Time \(t\) |
| Noise | Fixed \(\epsilon\) | Continuous \(W_t(\omega)\) |
| Differentiability | Standard calculus | Nowhere differentiable in \(t\) |
| Goal | \(\nabla_\theta \mathbb{E}[f(z)]\) | Describe evolution \(X_t\) |
### Why SDEs Need Special Calculus
For an SDE sample path \(X_t(\omega)\):
- The path \(t \mapsto X_t(\omega)\) is continuous but nowhere differentiable
- You cannot write \(dX_t/dt\) in the classical sense
- Itô calculus provides rules for manipulating these objects
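A small numerical illustration of the nowhere-differentiability: Brownian increments have standard deviation \(\sqrt{\Delta t}\), so difference quotients \(|\Delta W| / \Delta t\) grow like \(\Delta t^{-1/2}\) rather than converging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Brownian increments have std sqrt(dt), so the mean difference quotient
# |dW| / dt scales like dt**-0.5 and diverges instead of converging.
quotients = []
for dt in [1e-2, 1e-4, 1e-6]:
    dW = rng.standard_normal(10_000) * np.sqrt(dt)   # increments over step dt
    quotients.append(np.mean(np.abs(dW)) / dt)

print(quotients)   # each entry roughly 10x the previous one
```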
### The Confusion Resolved
When ML papers say "pathwise gradient," they mean:
Differentiate w.r.t. parameters along a fixed noise realization
When SDE papers say "sample path," they mean:
One realization of a continuous-time stochastic process
These are different uses of the word "path":
- ML: path through the computational graph
- SDEs: path through state space over time
### Can We Differentiate SDE Sample Paths w.r.t. Parameters?
Yes! This is called sensitivity analysis or pathwise sensitivity: computing

\[
\frac{\partial X_t}{\partial \theta}, \qquad dX_t = b_\theta(X_t)\, dt + \sigma_\theta(X_t)\, dW_t
\]
where \(\theta\) parameterizes the drift \(b_\theta\) or diffusion \(\sigma_\theta\).
This requires solving an auxiliary SDE (the sensitivity equation) for \(S_t = \partial X_t / \partial \theta\):

\[
dS_t = \big(\partial_x b_\theta(X_t)\, S_t + \partial_\theta b_\theta(X_t)\big)\, dt + \big(\partial_x \sigma_\theta(X_t)\, S_t + \partial_\theta \sigma_\theta(X_t)\big)\, dW_t, \qquad S_0 = 0
\]
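A minimal sketch of this idea, assuming the toy SDE \(dX_t = \theta X_t\, dt + \sigma\, dW_t\) with constant diffusion (my choice for illustration), discretized by Euler–Maruyama. For this choice \(\partial_x b_\theta = \theta\) and \(\partial_\theta b_\theta = X_t\), and the noise term of the sensitivity equation vanishes.

```python
import numpy as np

# Toy SDE (hypothetical choice): dX = theta*X dt + sigma dW, X_0 = 1.
# Here d_x b = theta, d_theta b = X, and sigma has no theta or x dependence,
# so the sensitivity S = dX/dtheta satisfies dS = (X + theta*S) dt, S_0 = 0.
rng = np.random.default_rng(0)
theta, sigma, T, n = 0.8, 0.3, 1.0, 2_000
dt = T / n
dW = rng.standard_normal(n) * np.sqrt(dt)   # ONE fixed noise realization

def simulate(theta):
    x = 1.0
    for dw in dW:
        x += theta * x * dt + sigma * dw    # Euler-Maruyama step
    return x

# Solve state and sensitivity jointly along the same path.
x, s = 1.0, 0.0
for dw in dW:
    s += (x + theta * s) * dt               # sensitivity equation step
    x += theta * x * dt + sigma * dw

# Compare with a finite difference over theta using the SAME noise.
h = 1e-5
fd = (simulate(theta + h) - simulate(theta)) / h
print(s, fd)   # agree closely
```

Because the noise realization is held fixed, this is the same "differentiate along the path" move as in the Gaussian case, just applied step by step through the integrator.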
This is computationally expensive and rarely used in ML. Instead, diffusion models use:
- Denoising score matching — avoids differentiating through the SDE
- Reparameterization at each timestep — \(x_t = \alpha_t x_0 + \sigma_t \epsilon\)
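That per-timestep reparameterization is the same Gaussian trick. A short sketch, where the schedule values \(\alpha_t, \sigma_t\) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward noising x_t = alpha_t * x0 + sigma_t * eps is an explicit function
# of x0, so d x_t / d x0 = alpha_t by ordinary calculus.
x0 = np.array([0.5, -1.0, 2.0])
alpha_t, sigma_t = 0.9, 0.435     # hypothetical schedule values at some t

eps = rng.standard_normal(x0.shape)
x_t = alpha_t * x0 + sigma_t * eps

# Fixed-noise finite difference recovers alpha_t in every coordinate.
h = 1e-6
dxt_dx0 = (alpha_t * (x0 + h) + sigma_t * eps - x_t) / h
print(dxt_dx0)   # ~[0.9, 0.9, 0.9]
```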
## 8. Summary
The reparameterization trick works because it turns sampling into a differentiable computation graph with respect to model parameters.
The key insight: \(\partial z / \partial \mu = 1\) is just ordinary calculus once you externalize the noise.
## References
- VAE-05-followup-1.md — Pathwise derivatives vs. stochastic calculus
- VAE-04-reparameterization.md — The reparameterization trick
- Kingma & Welling (2014) — "Auto-Encoding Variational Bayes"
- Mohamed et al. (2020) — "Monte Carlo Gradient Estimation in Machine Learning"