Pathwise Derivatives, Stochastic Calculus, and Autodiff
This document clarifies:
- Determinism vs. randomness — what backprop actually requires
- Pathwise derivatives vs. distributional derivatives — two different mathematical operations
- Why SDEs are not a counterexample — stochastic calculus and autodiff serve different purposes
- What diffusion models actually do — and how they relate to reparameterization
1. The Precise Requirement for Backpropagation
The Question
"For \(\partial z / \partial \mu\) to be computable, does \(\mu\) need to be deterministic?"
The Correct Answer
Not quite. The precise statement is:
For \(\partial z / \partial \mu\) to be computable by backprop, \(z\) must be a deterministic function of \(\mu\) given the source of randomness.
This is called pathwise determinism.
What This Means
- Randomness is allowed
- But it must be externalized — treated as an input, not generated inside the computation
- Given a fixed noise sample \(\epsilon\), the mapping \(\mu \to z\) must be deterministic
2. What Actually Breaks Gradients in Naïve Sampling
The Problematic Code
The naïve pattern draws \(z\) inside the model by calling a sampler directly (e.g. `Normal(mu, sigma).sample()` in PyTorch). Mathematically, this means:
\(z \sim \mathcal{N}(\mu, \sigma^2)\), with no function connecting \(\mu\) to \(z\) ever recorded.
Why Autodiff Fails
The sampling operator is opaque to the computational graph:
- Autograd treats `sample()` as a black box
- There is no explicit functional relationship \(z = g(\mu, \sigma, \text{noise})\)
- The graph sees no edge connecting \(\mu\) to \(z\)
Computational graph (naïve sampling):
μ ──────?──────> z ──────> f(z) ──────> loss
↑
no defined path
(sampling is opaque)
Therefore, autograd reports no gradient from the loss back to \(\mu\). Not because math forbids it, but because the path is not represented in the graph.
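To make the contrast concrete, here is a small numpy sketch (the function names and the value of `eps` are illustrative, not from any library); finite differences stand in for the gradient that backprop would compute:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_naive(mu, sigma):
    # Fresh noise is drawn *inside* the call: across calls, z is not
    # a deterministic function of (mu, sigma), so there is no usable path.
    return mu + sigma * rng.normal()

def sample_reparam(mu, sigma, eps):
    # Noise is an explicit input: given eps, z is deterministic in (mu, sigma).
    return mu + sigma * eps

h, eps = 1e-6, 0.37  # one fixed noise sample

fd_reparam = (sample_reparam(1.0 + h, 2.0, eps) - sample_reparam(1.0, 2.0, eps)) / h
fd_naive = (sample_naive(1.0 + h, 2.0) - sample_naive(1.0, 2.0)) / h

print(fd_reparam)  # dz/dmu = 1, up to floating point
print(fd_naive)    # dominated by the resampled noise: useless as a gradient
```

The reparameterized path recovers the exact slope \(\partial z/\partial \mu = 1\); the naïve path gives noise amplified by \(1/h\).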
3. What the Reparameterization Trick Actually Does
Rewrite the same random variable as:
\(z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)\)
Key Properties
- Randomness is now explicit and external
- \(z\) is a deterministic function of \((\mu, \sigma, \epsilon)\)
- \(\epsilon\) is treated as an input, not a parameter to differentiate through
Computational graph (reparameterized):
ε (external input)
↓
μ ──────> z = μ + σ·ε ──────> f(z) ──────> loss
↑
σ ───┘
Clear paths: ∂z/∂μ = 1, ∂z/∂σ = ε
Now the gradient is well-defined:
\(\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon\)
The Critical Distinction
We are not differentiating through randomness. We are differentiating through a deterministic function that uses randomness as input.
This distinction is everything.
4. Two Different Notions of "Stochastic Derivative"
This is where confusion often arises. There are two fundamentally different operations that both involve "derivatives" and "randomness":
(A) Pathwise Derivatives — What ML Uses
Goal: Compute \(\nabla_\theta \mathbb{E}_{z \sim p_\theta}[f(z)]\)
Method:
- Reparameterize: \(z = g_\theta(\epsilon)\) where \(\epsilon \sim p(\epsilon)\) is fixed
- Differentiate the deterministic function \(g_\theta\)
- Average over noise samples
Key property: Given \(\epsilon\), everything is deterministic. Standard chain rule applies.
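The three steps above can be sketched in numpy for a Gaussian and a quadratic \(f\) (the function name, constants, and sample count here are illustrative):

```python
import numpy as np

def pathwise_grad_mu(mu, sigma, f_grad, n=100_000, seed=0):
    """Monte Carlo pathwise estimate of d/dmu E[f(mu + sigma * eps)]."""
    eps = np.random.default_rng(seed).standard_normal(n)  # 1. sample noise
    z = mu + sigma * eps               # 2. deterministic map given eps
    return np.mean(f_grad(z))          # 3. chain rule (dz/dmu = 1), averaged

# For f(z) = z^2 we have E[f(z)] = mu^2 + sigma^2, so the true gradient is 2*mu.
est = pathwise_grad_mu(1.5, 0.8, f_grad=lambda z: 2.0 * z)
print(est)  # close to 2 * 1.5 = 3.0, up to Monte Carlo error
```

Because \(\epsilon\) is fixed per sample, each term in the average is an ordinary derivative of a deterministic function.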
(B) Stochastic Calculus — What SDEs Use
Goal: Define dynamics driven by continuous noise (Brownian motion)
Method: Itô or Stratonovich calculus — special rules for integrating against nowhere-differentiable processes
Key property: \(W_t\) is nowhere differentiable. "Derivatives" are defined in an integral/distributional sense.
Comparison Table
| Aspect | Pathwise (ML) | Stochastic Calculus (SDEs) |
|---|---|---|
| Noise | Fixed sample \(\epsilon\) | Continuous process \(W_t\) |
| Derivative | Standard chain rule | Itô's lemma |
| Result | Deterministic gradient | Distribution over paths |
| Use case | Backprop, VAEs, RL | Physics, finance, diffusion |
5. How SDEs Actually Handle Randomness
Consider a stochastic differential equation:
\(dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t\)
Key Facts About SDEs
- \(W_t\) (Brownian motion) is nowhere differentiable — you cannot write \(dW_t/dt\)
- Individual sample paths are not classically differentiable
- SDEs are defined in an integral sense, not as pointwise derivatives
What the Notation Actually Means
When we write \(dX_t\), we are not taking a derivative. We are defining:
\(X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s\)
The second integral is a stochastic integral (Itô integral), which requires special rules because \(W_t\) has infinite variation.
Itô's Lemma — The Chain Rule for SDEs
For a function \(f(X_t)\) where \(X_t\) follows an SDE:
\(df(X_t) = \left(b(X_t)\,f'(X_t) + \tfrac{1}{2}\sigma^2(X_t)\,f''(X_t)\right)dt + \sigma(X_t)\,f'(X_t)\,dW_t\)
The extra \(\frac{1}{2}\sigma^2 f''\) term is the Itô correction — it arises because \(dW_t \cdot dW_t = dt\) (quadratic variation).
This is fundamentally different from the standard chain rule.
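A quick numpy simulation makes the correction term visible (a toy setup with illustrative constants, not from the references below):

```python
import numpy as np

# Simulate dX = dW (drift b = 0, sigma = 1) and track f(X) = X^2.
# Naively applying the classical chain rule, d(X^2) = 2X dX, would predict
# E[X_T^2] = X_0^2 = 0. The Ito correction (1/2)*sigma^2*f'' = 1 predicts
# E[X_T^2] = T instead, driven by the quadratic variation dW * dW = dt.
rng = np.random.default_rng(0)
T, n_steps, n_paths = 1.0, 200, 10_000
dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(T / n_steps)
W_T = dW.sum(axis=1)      # Brownian motion at time T, one value per path
m = np.mean(W_T**2)
print(m)                  # close to T = 1.0, matching Ito's prediction
```

The simulation needs no derivative of \(W_t\) at all, only sums of increments, which is exactly the integral-sense definition above.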
6. Why SDE Theory Doesn't Help Naïve Backprop
What Autodiff Requires
- Explicit functional dependencies in a computational graph
- Pathwise gradients via chain rule
- Deterministic mappings given all inputs
What SDE Theory Provides
- Weak/distributional derivatives
- Distributions over paths
- Expectation-level results (e.g., Fokker-Planck equations)
These live in different mathematical worlds.
SDE machinery does not automatically give you gradients for:
\(\nabla_\theta\, \mathbb{E}\big[f(X_T^{(\theta)})\big]\), where \(X_t^{(\theta)}\) solves a parameter-dependent SDE.
You still need to reparameterize to get pathwise gradients.
7. How Diffusion Models Actually Work
Diffusion models use SDE language but train with pathwise derivatives.
The Forward Process (Adding Noise)
\(dx = f(x, t)\,dt + g(t)\,dW_t\)
This is an SDE — but we don't backprop through it. We just use it to generate noisy training data.
The Reverse Process (Denoising)
\(dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W}_t\)
The score \(\nabla_x \log p_t(x)\) is approximated by a neural network \(s_\theta(x, t)\).
How Training Works
Key insight: Training does NOT backprop through the SDE.
Instead, it uses a denoising score matching objective (here in its equivalent noise-prediction form):
\(\mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\left\|\epsilon_\theta(x_t, t) - \epsilon\right\|^2\right]\)
where \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) is a reparameterized noisy sample.
The Connection to Reparameterization
# Diffusion training (simplified; model, x0, alpha_t, sigma_t, t assumed in scope)
import torch
import torch.nn.functional as F

epsilon = torch.randn_like(x0)            # External noise
x_t = alpha_t * x0 + sigma_t * epsilon    # Reparameterized!
predicted_noise = model(x_t, t)
loss = F.mse_loss(predicted_noise, epsilon)
loss.backward()                           # Pathwise gradient!
This is the reparameterization trick at scale:
- Noise \(\epsilon\) is sampled externally
- \(x_t\) is a deterministic function of \((x_0, \epsilon, t)\)
- Gradients flow through the deterministic path
Sampling (Inference)
At inference, we do solve an SDE/ODE, with the learned score \(s_\theta\) plugged into the reverse dynamics:
\(dx = \left[f(x, t) - g(t)^2\, s_\theta(x, t)\right]dt + g(t)\,d\bar{W}_t\)
But we don't need gradients here — just forward simulation.
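For instance, a generic Euler–Maruyama loop only needs forward evaluations of the drift and diffusion. A sketch (the drift and diffusion here are a toy Ornstein–Uhlenbeck process, not a trained model):

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, T, n_steps, seed=0):
    """Forward-simulate dX = drift(X) dt + diffusion * dW. No gradients needed."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + drift(x) * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Toy check: dX = -X dt + sqrt(2) dW has stationary distribution N(0, 1).
x = euler_maruyama(np.zeros(20_000), lambda x: -x, np.sqrt(2.0), T=5.0, n_steps=100)
print(x.var())  # close to 1.0, up to discretization and Monte Carlo error
```

Nothing in the loop is differentiated; the solver only pushes samples forward in time.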
8. Reparameterization in Other Domains
Reinforcement Learning (SAC, TD3)
Exactly the same trick — externalize noise, differentiate through the deterministic path.
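A minimal numpy sketch of a SAC-style squashed Gaussian policy (shapes and constants are illustrative):

```python
import numpy as np

def sample_action(mu, log_std, eps):
    # SAC-style squashed Gaussian policy: given eps, the action is a
    # deterministic function of the policy parameters (mu, log_std).
    pre_tanh = mu + np.exp(log_std) * eps   # reparameterized Gaussian sample
    return np.tanh(pre_tanh)                # squash into the action bounds (-1, 1)

eps = np.random.default_rng(0).standard_normal(4)        # external noise
action = sample_action(np.zeros(4), np.full(4, -1.0), eps)
print(action)  # four bounded actions; gradients can flow to mu and log_std
```

The actor loss can then be backpropagated through `sample_action` into `mu` and `log_std`, exactly as in the VAE case.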
Bayesian Neural Networks
Weight uncertainty becomes differentiable.
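The same pattern applies per weight ("Bayes by Backprop"): sample the weights through an external noise draw so the layer output is deterministic in the variational parameters. A sketch with illustrative names and shapes:

```python
import numpy as np

def bayesian_linear(x, w_mu, w_log_sigma, eps):
    # Sample weights via w = mu + sigma * eps, so the output is a
    # deterministic function of (w_mu, w_log_sigma) given eps.
    w = w_mu + np.exp(w_log_sigma) * eps
    return x @ w

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))     # a small batch of inputs
eps = rng.standard_normal(3)        # one external noise draw per weight
y = bayesian_linear(x, np.ones(3), np.full(3, -2.0), eps)
print(y.shape)  # (5,)
```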
Normalizing Flows
Pure reparameterization — all expressivity in the transformation.
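The simplest instance is a single affine flow, whose exact log-density follows from the change-of-variables formula. A minimal numpy sketch (function name illustrative):

```python
import numpy as np

def affine_flow_logprob(z, mu, log_sigma):
    # One-layer affine flow z = mu + exp(log_sigma) * eps with eps ~ N(0, 1).
    # Change of variables: log p(z) = log N(eps; 0, 1) - log|dz/d(eps)|.
    eps = (z - mu) * np.exp(-log_sigma)              # invert the transformation
    log_base = -0.5 * (eps**2 + np.log(2.0 * np.pi)) # standard normal log-density
    return log_base - log_sigma                      # subtract the log-Jacobian

lp = affine_flow_logprob(0.3, 0.1, np.log(0.5))
print(lp)  # matches the closed-form log N(z=0.3; mean=0.1, std=0.5)
```

Deeper flows stack such invertible maps; the noise stays external and all expressivity lives in the transformation.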
9. Summary: Two Worlds, One Bridge
The Core Insight
Backpropagation requires deterministic paths in the computational graph. The reparameterization trick works by making randomness an explicit input, so that the mapping from parameters to samples is deterministic conditioned on that noise.
The Companion Insight
Stochastic calculus defines derivatives in distribution or expectation, not in the pathwise sense required by autodiff.
No contradiction — just different tools for different jobs.
Visual Summary
┌─────────────────────────────────────────────────────────────────┐
│ PATHWISE DERIVATIVES (what ML uses) │
│ │
│ 1. Sample noise ε ~ N(0,I) once │
│ 2. Compute z = g_θ(ε) deterministically │
│ 3. Backprop through g_θ using standard chain rule │
│ 4. Average gradients over many ε samples │
│ │
│ Result: ∇_θ E[f(z)] ≈ (1/N) Σ ∇_θ f(g_θ(εᵢ)) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STOCHASTIC CALCULUS (what SDEs use) │
│ │
│ 1. Define dynamics: dX = b(X)dt + σ(X)dW │
│ 2. W_t is nowhere differentiable │
│ 3. Use Itô's lemma (modified chain rule) │
│ 4. Results are distributions over paths │
│ │
│ Result: Fokker-Planck equations, path distributions │
└─────────────────────────────────────────────────────────────────┘
10. Intuition Check
Think of randomness like an input image:
- You don't differentiate with respect to the pixels
- You differentiate with respect to the parameters that process the image
Reparameterization treats noise the same way:
- \(\epsilon\) is like input data — fixed during the forward/backward pass
- \(\theta\) is what we optimize — gradients flow through the deterministic transformation
References
- VAE-04-reparameterization.md — The reparameterization trick
- VAE-QA.md — Why the prior matters
- Kingma & Welling (2014) — "Auto-Encoding Variational Bayes"
- Song et al. (2021) — "Score-Based Generative Modeling through SDEs"