
Pathwise Derivatives, Stochastic Calculus, and Autodiff

This document clarifies:

  1. Determinism vs. randomness — what backprop actually requires
  2. Pathwise derivatives vs. distributional derivatives — two different mathematical operations
  3. Why SDEs are not a counterexample — stochastic calculus and autodiff serve different purposes
  4. What diffusion models actually do — and how they relate to reparameterization

1. The Precise Requirement for Backpropagation

The Question

"For \(\partial z / \partial \mu\) to be computable, does \(\mu\) need to be deterministic?"

The Correct Answer

Not quite. The precise statement is:

For \(\partial z / \partial \mu\) to be computable by backprop, \(z\) must be a deterministic function of \(\mu\) given the source of randomness.

This is called pathwise determinism.

What This Means

  • Randomness is allowed
  • But it must be externalized — treated as an input, not generated inside the computation
  • Given a fixed noise sample \(\epsilon\), the mapping \(\mu \to z\) must be deterministic

2. What Actually Breaks Gradients in Naïve Sampling

The Problematic Code

from torch.distributions import Normal

z = Normal(mu, sigma).sample()  # opaque to autograd: no graph edge from mu to z
loss = f(z)

Mathematically, this means:

\[ z \sim \mathcal{N}(\mu, \sigma^2) \]

Why Autodiff Fails

The sampling operator is opaque to the computational graph:

  • Autograd treats sample() as a black box
  • There is no explicit functional relationship \(z = g(\mu, \sigma, \text{noise})\)
  • The graph sees no edge connecting \(\mu\) to \(z\)

Computational graph (naïve sampling):

    μ ──────?──────> z ──────> f(z) ──────> loss
         no defined path
         (sampling is opaque)

Therefore:

\[ \frac{\partial z}{\partial \mu} \quad \text{is undefined in autodiff} \]

Not because math forbids it — but because the path is not represented in the graph.
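
A minimal PyTorch sketch of the failure, using the toy loss \(f(z) = z^2\) (the exact error text varies across versions):

import torch
from torch.distributions import Normal

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

z = Normal(mu, sigma).sample()  # z.requires_grad is False: detached from the graph
loss = z ** 2                   # toy f(z)
loss.backward()                 # RuntimeError: loss does not require grad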


3. What the Reparameterization Trick Actually Does

Rewrite the same random variable as:

\[ \epsilon \sim \mathcal{N}(0, I) \quad \text{(sampled once, treated as input)} \]
\[ z = \mu + \sigma \cdot \epsilon \quad \text{(deterministic given } \epsilon \text{)} \]

Key Properties

  • Randomness is now explicit and external
  • \(z\) is a deterministic function of \((\mu, \sigma, \epsilon)\)
  • \(\epsilon\) is treated as an input, not a parameter to differentiate through

Computational graph (reparameterized):

    ε (external input)
    μ ──────> z = μ + σ·ε ──────> f(z) ──────> loss
    σ ───┘

    Clear paths: ∂z/∂μ = 1, ∂z/∂σ = ε

Now the gradient is well-defined:

\[ \frac{\partial z}{\partial \mu} = 1, \quad \frac{\partial z}{\partial \sigma} = \epsilon \]

The Critical Distinction

We are not differentiating through randomness. We are differentiating through a deterministic function that uses randomness as input.

This distinction is everything.
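
A minimal sketch verifying the gradients above; \(\epsilon\) is drawn once and then held fixed:

import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

epsilon = torch.randn(())   # external input, sampled once
z = mu + sigma * epsilon    # deterministic given epsilon
z.backward()

print(mu.grad)              # tensor(1.)       -> dz/dmu = 1
print(sigma.grad)           # equal to epsilon -> dz/dsigma = eps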


4. Two Different Notions of "Stochastic Derivative"

This is where confusion often arises. There are two fundamentally different operations that both involve "derivatives" and "randomness":

(A) Pathwise Derivatives — What ML Uses

Goal: Compute \(\nabla_\theta \mathbb{E}_{z \sim p_\theta}[f(z)]\)

Method:

  1. Reparameterize: \(z = g_\theta(\epsilon)\) where \(\epsilon \sim p(\epsilon)\) is fixed
  2. Differentiate the deterministic function \(g_\theta\)
  3. Average over noise samples

\[ \nabla_\theta \mathbb{E}_{z \sim p_\theta}[f(z)] = \mathbb{E}_{\epsilon}\left[\nabla_\theta f(g_\theta(\epsilon))\right] \]

Key property: Given \(\epsilon\), everything is deterministic. Standard chain rule applies.
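
A minimal sketch of the resulting Monte Carlo estimator, again with \(f(z) = z^2\) so the true gradients are known in closed form (\(\partial_\mu \mathbb{E}[z^2] = 2\mu\), \(\partial_\sigma \mathbb{E}[z^2] = 2\sigma\)):

import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

eps = torch.randn(100_000)   # many fixed noise samples
z = mu + sigma * eps         # g_theta(eps), applied elementwise
(z ** 2).mean().backward()   # Monte Carlo estimate of grad E[f(z)]

print(mu.grad)     # approximately 1.0 (= 2 * mu)
print(sigma.grad)  # approximately 2.0 (= 2 * sigma)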

(B) Stochastic Calculus — What SDEs Use

Goal: Define dynamics driven by continuous noise (Brownian motion)

Method: Itô or Stratonovich calculus — special rules for integrating against nowhere-differentiable processes

\[ dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t \]

Key property: \(W_t\) is nowhere differentiable. "Derivatives" are defined in an integral/distributional sense.

Comparison Table

Aspect       Pathwise (ML)             Stochastic Calculus (SDEs)
------       -------------             --------------------------
Noise        Fixed sample ε            Continuous process W_t
Derivative   Standard chain rule       Itô's lemma
Result       Deterministic gradient    Distribution over paths
Use case     Backprop, VAEs, RL        Physics, finance, diffusion

5. How SDEs Actually Handle Randomness

Consider a stochastic differential equation:

\[ dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t \]

Key Facts About SDEs

  1. \(W_t\) (Brownian motion) is nowhere differentiable — you cannot write \(dW_t/dt\)
  2. Individual sample paths are not classically differentiable
  3. SDEs are defined in an integral sense, not as pointwise derivatives

What the Notation Actually Means

When we write \(dX_t\), we are not taking a derivative. We are defining:

\[ X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s \]

The second integral is a stochastic integral (Itô integral), which requires special rules because \(W_t\) has unbounded total variation (though finite quadratic variation).

Itô's Lemma — The Chain Rule for SDEs

For a function \(f(X_t)\) where \(X_t\) follows an SDE:

\[ df = \frac{\partial f}{\partial x}\,dX + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}\sigma^2\,dt \]

The extra \(\frac{1}{2}\sigma^2 f''\) term is the Itô correction — it arises because \(dW_t \cdot dW_t = dt\) (quadratic variation).

This is fundamentally different from the standard chain rule.
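
A quick worked example: for \(f(x) = x^2\) with \(dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t\), Itô's lemma gives

\[ d(X_t^2) = \left[2X_t\,b(X_t) + \sigma^2(X_t)\right]dt + 2X_t\,\sigma(X_t)\,dW_t \]

The ordinary chain rule would predict only \(2X_t\,dX_t\); the extra \(\sigma^2\,dt\) is the Itô correction (here \(f'' = 2\), so \(\frac{1}{2}f''\sigma^2 = \sigma^2\)).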


6. Why SDE Theory Doesn't Help Naïve Backprop

What Autodiff Requires

  • Explicit functional dependencies in a computational graph
  • Pathwise gradients via chain rule
  • Deterministic mappings given all inputs

What SDE Theory Provides

  • Weak/distributional derivatives
  • Distributions over paths
  • Expectation-level results (e.g., Fokker-Planck equations)

These live in different mathematical worlds.

SDE machinery does not automatically give you gradients for:

z = sample(mu)  # Still opaque to autodiff

You still need to reparameterize to get pathwise gradients.
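
In PyTorch this is exactly the sample() vs. rsample() distinction (a minimal sketch):

import torch
from torch.distributions import Normal

mu = torch.tensor(0.5, requires_grad=True)
dist = Normal(mu, 1.0)

z_opaque = dist.sample()     # no graph edge from mu; gradients cannot flow
z_pathwise = dist.rsample()  # implemented as mu + sigma * eps; gradients flow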


7. How Diffusion Models Actually Work

Diffusion models use SDE language but train with pathwise derivatives.

The Forward Process (Adding Noise)

\[ dx_t = -\frac{1}{2}\beta(t)x_t\,dt + \sqrt{\beta(t)}\,dW_t \]

This is an SDE — but we don't backprop through it. We just use it to generate noisy training data.

The Reverse Process (Denoising)

\[ dx_t = \left[-\frac{1}{2}\beta(t)x_t - \beta(t)\nabla_x \log p_t(x_t)\right]dt + \sqrt{\beta(t)}\,d\bar{W}_t \]

The score \(\nabla_x \log p_t(x)\) is approximated by a neural network \(s_\theta(x, t)\).

How Training Works

Key insight: Training does NOT backprop through the SDE.

Instead, it uses a denoising score matching objective:

\[ \mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[\|s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t | x_0)\|^2\right] \]

where \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) is a reparameterized noisy sample.

The Connection to Reparameterization

# Diffusion training (simplified); x0, t, model, alpha_t, sigma_t defined elsewhere
import torch
import torch.nn.functional as F

epsilon = torch.randn_like(x0)           # External noise, sampled once
x_t = alpha_t * x0 + sigma_t * epsilon   # Reparameterized: deterministic given epsilon
predicted_noise = model(x_t, t)          # Network predicts the injected noise
loss = F.mse_loss(predicted_noise, epsilon)
loss.backward()                          # Pathwise gradient through the deterministic map

This is the reparameterization trick at scale:

  • Noise \(\epsilon\) is sampled externally
  • \(x_t\) is a deterministic function of \((x_0, \epsilon, t)\)
  • Gradients flow through the deterministic path

Sampling (Inference)

At inference, we do solve an SDE — or its deterministic probability flow ODE:

\[ dx_t = \left[-\frac{1}{2}\beta(t)x_t - \frac{1}{2}\beta(t)s_\theta(x_t, t)\right]dt \]

But we don't need gradients here — just forward simulation.
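
A minimal Euler sketch of that forward simulation (score_model and beta are hypothetical placeholders for a trained score network \(s_\theta\) and a noise schedule, not a specific library API):

import torch

@torch.no_grad()  # inference only: no gradients needed
def pf_ode_sample(score_model, beta, shape, n_steps=1000, T=1.0):
    x = torch.randn(shape)    # start from the prior at t = T
    dt = -T / n_steps         # negative step: integrate from T down to 0
    for i in range(n_steps):
        t = T + i * dt
        drift = -0.5 * beta(t) * (x + score_model(x, t))
        x = x + drift * dt    # Euler step on the probability flow ODE
    return x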


8. Reparameterization in Other Domains

Reinforcement Learning (SAC, TD3)

\[ a = \mu_\theta(s) + \sigma_\theta(s) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]

Exactly the same trick — externalize noise, differentiate through the deterministic path.
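
A minimal sketch of why this matters in RL: the gradient flows through the sampled action into the critic (actor and critic here are hypothetical toy modules; real SAC also squashes actions with tanh):

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)               # state -> action mean (fixed sigma below)
critic = nn.Linear(4 + 2, 1)          # (state, action) -> Q-value

s = torch.randn(32, 4)                # batch of states
mu = actor(s)
a = mu + 0.1 * torch.randn_like(mu)   # reparameterized action, sigma = 0.1
q = critic(torch.cat([s, a], dim=-1))
(-q.mean()).backward()                # gradients reach the actor through the action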

Bayesian Neural Networks

\[ w = \mu + \sigma \cdot \epsilon \]

Weight uncertainty becomes differentiable.

Normalizing Flows

\[ z = f_\theta(\epsilon), \quad \epsilon \sim \mathcal{N}(0, I) \]

Pure reparameterization — all expressivity in the transformation.


9. Summary: Two Worlds, One Bridge

The Core Insight

Backpropagation requires deterministic paths in the computational graph. The reparameterization trick works by making randomness an explicit input, so that the mapping from parameters to samples is deterministic conditioned on that noise.

The Companion Insight

Stochastic calculus defines derivatives in distribution or expectation, not in the pathwise sense required by autodiff.

No contradiction — just different tools for different jobs.

Visual Summary

┌─────────────────────────────────────────────────────────────────┐
│  PATHWISE DERIVATIVES (what ML uses)                            │
│                                                                  │
│  1. Sample noise ε ~ N(0,I) once                                │
│  2. Compute z = g_θ(ε) deterministically                        │
│  3. Backprop through g_θ using standard chain rule              │
│  4. Average gradients over many ε samples                       │
│                                                                  │
│  Result: ∇_θ E[f(z)] ≈ (1/N) Σ ∇_θ f(g_θ(εᵢ))                  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  STOCHASTIC CALCULUS (what SDEs use)                            │
│                                                                  │
│  1. Define dynamics: dX = b(X)dt + σ(X)dW                       │
│  2. W_t is nowhere differentiable                               │
│  3. Use Itô's lemma (modified chain rule)                       │
│  4. Results are distributions over paths                        │
│                                                                  │
│  Result: Fokker-Planck equations, path distributions            │
└─────────────────────────────────────────────────────────────────┘

10. Intuition Check

Think of randomness like an input image:

  • You don't differentiate with respect to the pixels
  • You differentiate with respect to the parameters that process the image

Reparameterization treats noise the same way:

  • \(\epsilon\) is like input data — fixed during the forward/backward pass
  • \(\theta\) is what we optimize — gradients flow through the deterministic transformation

References

  • VAE-04-reparameterization.md — The reparameterization trick
  • VAE-QA.md — Why the prior matters
  • Kingma & Welling (2014) — "Auto-Encoding Variational Bayes"
  • Song et al. (2021) — "Score-Based Generative Modeling through SDEs"