Understanding Brownian Motion: From Random Walks to Diffusion Models¶

A ground-up introduction to the stochastic process that powers modern diffusion models

Brownian motion is the mathematical heartbeat of diffusion models. It's the reason why adding noise "just works," why the $\sqrt{dt}$ scaling appears everywhere, and why continuous-time SDEs can describe discrete-time algorithms like DDPM.

This tutorial builds intuition from physical observations, connects to rigorous mathematics, and shows exactly how Brownian motion enables generative modeling.

Table of Contents¶

The Physical Origin Story
Random Walk: The Discrete Cousin
Brownian Motion: The Continuous Limit
The Four Defining Properties (And What They Really Mean)
The Mysterious $\sqrt{dt}$ Scaling
From Random Walk to Brownian Motion: The Scaling Limit
Why This Matters for Diffusion Models

1. The Physical Origin Story¶

The Discovery (1827)¶

Robert Brown, a botanist, looked through a microscope at pollen grains suspended in water and noticed something uncanny:

They jittered forever.

No trend. No settling. No rest. The motion was real, persistent, and structureless.

At first, he suspected life. Then he tried dust particles. Same thing. The motion was universal, relentless, and seemingly random.

The Explanation (1905)¶

Decades later, Albert Einstein explained it: invisible molecular collisions from the surrounding fluid.

The key insights: - Water molecules are in constant thermal motion - They bombard the visible particle from all directions - Each collision is tiny, but there are billions per second - The cumulative effect is visible, persistent random motion

This wasn't just a curiosity—Einstein used it to prove atoms exist and to measure Avogadro's number.

The Mathematical Skeleton (1920s)¶

Norbert Wiener gave the phenomenon its clean mathematical structure, now called the Wiener process or Brownian motion.

This mathematical object became the foundation for: - Stochastic calculus (Itô, Stratonovich) - Financial mathematics (Black-Scholes) - Statistical physics (Langevin dynamics) - Modern machine learning (diffusion models, score matching)

2. Random Walk: The Discrete Cousin¶

Before understanding Brownian motion, let's start with something simpler: random walks.

Definition¶

A random walk is defined step by step in discrete time.

Time: $t = 0, 1, 2, \ldots$

At each step, the position changes by a random increment:

\[ X_{n+1} = X_n + \varepsilon_{n+1} \]

where $\varepsilon_n$ are independent and identically distributed (i.i.d.) random variables.

Classic Example: Simple Symmetric Random Walk¶

\[ \varepsilon_n = \begin{cases} +1 & \text{with probability } \tfrac{1}{2} \\ -1 & \text{with probability } \tfrac{1}{2} \end{cases} \]

Interpretation:

Flip a coin
Heads → step right
Tails → step left
Repeat forever

Key Properties¶

Discrete time and usually discrete space
Independent increments: Each step doesn't depend on previous steps
Paths look jagged, stair-like
Easy to simulate, easy to analyze

Where Random Walks Appear¶

Gambling (gambler's ruin)
Population genetics (genetic drift)
Markov chains
Graph theory (random walks on graphs)
Toy models for diffusion

3. Brownian Motion: The Continuous Limit¶

Brownian motion is what you get when a random walk is pushed to the limit of infinite refinement.

The Transition¶

Think of it this way:

Random walk is flipping coins while walking on tiles.

Brownian motion is what the walk looks like after you: - Shrink the tiles to dust - Speed up time - Step infinitely often

The path never smooths out. Reality just stops showing you the individual steps.

Continuous Time¶

Brownian motion is defined in continuous time:

\[ t \in [0,\infty) \]

Unlike random walks where you count steps $n = 0, 1, 2, \ldots$, Brownian motion evolves continuously. You can ask "where is the particle at time $t = \pi$?" and get a meaningful answer.

4. The Four Defining Properties¶

A process $B_t$ is Brownian motion (or a Wiener process) if it satisfies four properties. Let's unpack what each one really means.

Conceptual Clarification: What is $B_t$?¶

Before diving into the properties, let's clarify what we mean by "process" and how to think about $B_t$.

Is $B_t$ a Random Variable?¶

Yes, but more precisely:

For each fixed time $t$: $B_t$ is a random variable representing the position/displacement at time $t$
For the collection $\{B_t : t \geq 0\}$: This is a stochastic process—a family of random variables indexed by time

Think of it this way:

At time $t = 0.5$, $B_{0.5}$ is a random variable (could be any real number, with probabilities given by a Gaussian)
At time $t = 1.0$, $B_{1.0}$ is a different random variable
The process $B_t$ is the entire collection: $\{B_{0.5}, B_{1.0}, B_{2.3}, \ldots\}$

In diffusion models: $B_t$ represents the noise/displacement at time $t$. For a d-dimensional process, $B_t \in \mathbb{R}^d$ is a random vector.

Is "Process" the Same as in Gaussian Process, Dirichlet Process, etc.?¶

Yes! The word "process" here means stochastic process in the same sense as:

Gaussian Process: A collection of random variables $\{f(x) : x \in \mathcal{X}\}$ where any finite subset is jointly Gaussian
Dirichlet Process: A stochastic process whose realizations are probability distributions
Markov Process: A stochastic process with the Markov property
Brownian Motion: A stochastic process with specific properties (the four we're about to define)

Common structure: All are collections of random variables indexed by some parameter (time, space, etc.)

Key difference: What properties they satisfy: - Brownian motion: Independent increments, Gaussian, variance = time - Gaussian process: Any finite subset is jointly Gaussian - Dirichlet process: Realizations are probability distributions

In summary:

$B_t$ (for fixed $t$) = random variable (position at time $t$)
$\{B_t : t \geq 0\}$ = stochastic process (the entire trajectory)
"Process" = same mathematical concept as GP, DP, etc., but with different properties

Property 1: $B_0 = 0$¶

"Start at zero."

Mathematical meaning: The process begins at the origin.

Physical meaning: We measure displacement, not absolute position. The particle's initial location is irrelevant; only changes matter.

Interpretation: Wherever the particle starts, call that position zero. This is a choice of reference frame, not a physical constraint.

Why it matters: Origins are conventions, not truths. This mirrors how physics works—we care about relative motion.

Property 2: Independent Increments¶

\[ B_t - B_s \perp\!\!\!\perp B_u - B_v \quad \text{for disjoint intervals} \]

Mathematical meaning: Changes over non-overlapping time intervals are independent random variables.

Physical meaning: The future does not remember the past. Molecular collisions are so rapid and chaotic that past kicks don't influence future kicks.

Why this is reasonable:

Billions of collisions per second
Correlations wash out almost instantly
The surrounding fluid has no memory at the scale of observation

Interpretation: Yesterday's shove tells you nothing about tomorrow's shove.

This is the Markov assumption, but earned, not assumed. It's why Brownian motion became the backbone of Markov processes, diffusion, and SDEs.

Property 3: Stationary Gaussian Increments¶

\[ B_t - B_s \sim \mathcal{N}(0, t-s) \]

This is the most information-dense property. Let's break it down.

Mean = 0¶

No drift. No preferred direction.

Physical meaning: The fluid is isotropic and at equilibrium. If there were a mean, you'd have wind, flow, or external force.

Variance = $t-s$¶

Spread grows linearly with time.

Physical meaning:

Each collision adds a tiny displacement
Variances add, not displacements
This is diffusion, not ballistic motion

Key insight: Distance $\propto \sqrt{\text{time}}$, not $\propto \text{time}$.

This single fact distinguishes diffusion from motion with velocity.

Gaussian Shape¶

Why normal distribution?

Because the displacement over time is the sum of enormously many tiny, independent kicks, and the Central Limit Theorem kicks in mercilessly.

Gaussianity is not sacred—it's statistical inevitability under these conditions.

Property 4: Continuous Paths (But Nowhere Differentiable)¶

Mathematical meaning: $B_t$ is continuous, but has no well-defined velocity anywhere.

Physical meaning:

The particle never teleports
But at every scale, motion is jittery

The strange reality:

Zoom in: Still jagged.
Zoom in again: Still jagged.
Zoom in infinitely: Still jagged.

There is no smallest wiggle.

Consequences: This forced mathematics to abandon classical calculus and invent: - Itô calculus - Stratonovich calculus - Stochastic integration

Nature demanded new math.

5. The Mysterious $\sqrt{dt}$ Scaling¶

One of the most confusing aspects of Brownian motion is why noise scales as $\sqrt{dt}$, not $dt$. Let's build intuition step by step.

The Question¶

In SDEs, we write:

\[ dw(t) = \sqrt{dt} \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,1) \]

Why the square root? Why not just $dt$ or some other power?

Intuition 1: Diffusive Scaling¶

Imagine a particle getting random kicks over time.

Over 1 second, it wanders a lot
Over 0.01 seconds, it wanders a little
Over 0 seconds, it doesn't move at all

Key empirical fact: If you wait 4× longer, the typical distance only doubles—not quadruples.

\[ \text{typical distance} \propto \sqrt{\text{time}} \]

This is called diffusive scaling.

Intuition 2: Variance Adds, Not Displacements¶

Consider a simple random walk with $n$ steps:

\[ X_n = \sum_{k=1}^n \varepsilon_k, \quad \varepsilon_k \sim \mathcal{N}(0, 1) \]

Variance:

\[ \text{Var}(X_n) = \text{Var}\left(\sum_{k=1}^n \varepsilon_k\right) = n \]

Why this formula? The variance of a sum of independent random variables equals the sum of their variances:

\[ \text{Var}\left(\sum_{k=1}^n \varepsilon_k\right) = \sum_{k=1}^n \text{Var}(\varepsilon_k) = \sum_{k=1}^n 1 = n \]

General formula for variance of sum:

For independent random variables $X_1, X_2, \ldots, X_n$:

$$

\text{Var}\left(\sum_{k=1}^n X_k\right) = \sum_{k=1}^n \text{Var}(X_k) $$

For dependent random variables (with covariance):

$$

\text{Var}\left(\sum_{k=1}^n X_k\right) = \sum_{k=1}^n \text{Var}(X_k) + 2\sum_{i < j} \text{Cov}(X_i, X_j) $$

In our case, the $\varepsilon_k$ are independent (each step is independent), so the covariance terms are zero, and we get the simple sum.

Standard deviation:

\[ \text{SD}(X_n) = \sqrt{n} \]

So: - Variance grows linearly with number of steps - Standard deviation (typical distance) grows like $\sqrt{n}$

From Steps to Time¶

Now connect steps to continuous time:

Total time: $t$
Step size: $\Delta t$
Number of steps: $n = t / \Delta t$

If each step had noise of fixed size, variance would blow up as $\Delta t \to 0$. That's bad.

Solution: Scale down the noise per step:

\[ \Delta x_k = \sqrt{\Delta t} \cdot \varepsilon_k \]

Variance over time $t$:

\[ \text{Var}(x_t) = \sum_{k=1}^n \text{Var}(\sqrt{\Delta t} \cdot \varepsilon_k) = \sum_{k=1}^n \Delta t = n \cdot \Delta t = t \]

Perfect! We get: - Finite variance - Linear growth in time - Well-defined continuous limit

This is why noise per step must scale as $\sqrt{\Delta t}$.

The Mathematical Statement¶

Brownian motion is defined to satisfy:

\[ w(t+\Delta t) - w(t) \sim \mathcal{N}(0, \Delta t) \]

Since any $\mathcal{N}(0, \Delta t)$ variable can be written as $\sqrt{\Delta t}$ times a standard normal:

\[ dw(t) = \sqrt{dt} \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,1) \]

This is not an approximation or modeling choice—it's baked into the definition.

Why Not Other Scalings?¶

Let's check alternatives:

If noise scaled as $dt$¶

\[ dw \sim dt \]

Then: - Variance per step $\sim dt^2$ - Total variance over time $\sim dt \to 0$

Result: Noise disappears. You just get an ODE.

If noise scaled as constant¶

\[ dw \sim 1 \]

Then: - Variance per unit time $\to \infty$

Result: Process explodes. No limit exists.

Only $\sqrt{dt}$ works¶

It's the unique scaling that gives: - Nontrivial randomness - Finite variance - A meaningful continuous-time limit

This is why Brownian motion is so universal.

The Key Insight for SDEs¶

In an SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw \]

You can now clearly see: - Deterministic motion scales like $dt$ - Stochastic motion scales like $\sqrt{dt}$

They live on different scales. This difference is crucial—it's why noise doesn't vanish or explode.

6. From Random Walk to Brownian Motion¶

Now let's make the connection rigorous. How exactly does a random walk become Brownian motion?

The Scaling Limit¶

Setup: Take a random walk with: - Step size: $\Delta x$ - Time step: $\Delta t$ - Steps: $\varepsilon_1, \varepsilon_2, \ldots$ with $\mathbb{E}[\varepsilon_i] = 0$, $\text{Var}(\varepsilon_i) = 1$

Discrete random walk:

\[ S_n = \sum_{i=1}^n \varepsilon_i, \quad S_0 = 0 \]

Rescaling: Define a continuous-time process:

\[ W_n(t) = \frac{1}{\sqrt{n}} S_{\lfloor nt \rfloor} \]

Interpretation:

Speed up time by factor $n$ (so $nt$ steps happen by time $t$)
Shrink space by $\sqrt{n}$

Variance check:

\[ \text{Var}(W_n(t)) = \text{Var}\left(\frac{1}{\sqrt{n}} S_{\lfloor nt \rfloor}\right) = \frac{1}{n} \cdot \lfloor nt \rfloor \approx t \]

Perfect!

Donsker's Invariance Principle¶

Theorem (Donsker, 1951): As $n \to \infty$,

\[ W_n(\cdot) \Rightarrow B(\cdot) \]

as processes (convergence in distribution in the space of continuous functions).

What this means:

The rescaled random walk converges to Brownian motion
Not just at individual times, but as entire random curves
This is a functional Central Limit Theorem

Interpretation:

Random walk = microscopic description
Brownian motion = macroscopic, continuum description

Just like: - Individual molecules → thermodynamics - Coin flips → Gaussian noise - Discrete Markov chains → SDEs

7. Why This Matters for Diffusion Models¶

Now we can connect everything to diffusion models.

The Forward Process¶

In diffusion models, the forward noising process gradually corrupts clean data into noise.

Discrete view (DDPM):

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \varepsilon \]

This is a discrete random walk in noise space.

Continuous view (SDE):

\[ dx = f(x,t)\,dt + g(t)\,dw(t) \]

This treats the same process as Brownian motion with drift.

Key insight: DDPM's discrete noise exactly matches the $\sqrt{dt}$ scaling of Brownian motion.

Why Brownian Motion is Perfect for Diffusion¶

Continuous paths: Data never jumps discontinuously
Gaussian increments: Noise is well-behaved, tractable
Independent increments: Markov property enables efficient training
Stationary increments: Noise schedule is time-translation invariant
Universal limit: Any reasonable discrete noise schedule converges to it

The SDE Formulation¶

The forward SDE:

\[ dx = f(x,t)\,dt + g(t)\,dw(t) \]

Components:

$f(x,t)$: Drift (deterministic shrinkage toward origin)
$g(t)$: Diffusion coefficient (noise strength)
$dw(t) = \sqrt{dt} \cdot \varepsilon$: Brownian motion

Why this works:

Drift scales as $dt$ (slow, deterministic)
Diffusion scales as $\sqrt{dt}$ (fast, stochastic)
Together they balance to create smooth corruption

The Reverse Process¶

The reverse SDE:

\[ dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}(t) \]

Key point: The reverse process also uses Brownian motion (with time running backwards).

Why this is remarkable:

The same stochastic process that corrupts data
Can be reversed to generate data
All thanks to Anderson's theorem (1982)

Connection to Score Matching¶

The score function $\nabla_x \log p_t(x)$ tells you: - Which direction increases probability - How to "undo" the Brownian noise

Training: Learn to predict the score at all noise levels.

Sampling: Use the learned score in the reverse SDE to denoise.

The bridge: Brownian motion is what makes this connection clean and tractable.

Summary: The Big Picture¶

Let's synthesize everything:

What Brownian Motion Is¶

The universal macroscopic description of motion driven by countless microscopic, memoryless, unbiased fluctuations.

Why It Appears Everywhere¶

Pollen grains in water
Heat diffusion
Stock prices (with caveats)
Noise in neural models
Forward processes in diffusion models
SDE limits of random walks

It is not "the truth of motion"—it is the inevitable limit when you refuse to track microscopic detail.

The Key Properties (Recap)¶

$B_0 = 0$: Start at origin (choice of reference frame)
Independent increments: No memory (Markov property)
Gaussian increments with variance $\propto$ time: Diffusive scaling
Continuous but nowhere differentiable: Infinitely rough

The $\sqrt{dt}$ Scaling (One-Line Takeaway)¶

Noise scales as $\sqrt{dt}$ because variance must grow linearly in time for a continuous-time random process to exist; this is the only scaling that yields finite, nontrivial stochastic behavior.

For Diffusion Models¶

DDPM: Discrete random walk in noise space
SDE formulation: Continuous Brownian motion with drift
Same process: Different descriptions, unified by Donsker's theorem
Why it works: Brownian motion provides the mathematical foundation for adding and removing noise in a principled way

Next Steps¶

Now that you understand Brownian motion:

See it in action: Run the SDE tutorial notebook to visualize Brownian paths
Study the forward SDE: Learn how $f(x,t)$ and $g(t)$ are chosen
Understand score matching: See how the score function reverses diffusion
Implement DDPM: Connect discrete and continuous views

The mathematical machinery is now in place. Time to build diffusion models!

Understanding Brownian Motion: From Random Walks to Diffusion Models¶

Table of Contents¶

1. The Physical Origin Story¶

The Discovery (1827)¶

The Explanation (1905)¶

The Mathematical Skeleton (1920s)¶

2. Random Walk: The Discrete Cousin¶

Definition¶

Classic Example: Simple Symmetric Random Walk¶

Key Properties¶

Where Random Walks Appear¶

3. Brownian Motion: The Continuous Limit¶

The Transition¶

Continuous Time¶

4. The Four Defining Properties¶

Conceptual Clarification: What is \(B_t\)?¶

Is \(B_t\) a Random Variable?¶

Is "Process" the Same as in Gaussian Process, Dirichlet Process, etc.?¶

Property 1: \(B_0 = 0\)¶

Property 2: Independent Increments¶

Property 3: Stationary Gaussian Increments¶

Mean = 0¶

Variance = \(t-s\)¶

Gaussian Shape¶

Property 4: Continuous Paths (But Nowhere Differentiable)¶

5. The Mysterious \(\sqrt{dt}\) Scaling¶

The Question¶

Intuition 1: Diffusive Scaling¶

Intuition 2: Variance Adds, Not Displacements¶

From Steps to Time¶

The Mathematical Statement¶

Why Not Other Scalings?¶

If noise scaled as \(dt\)¶

If noise scaled as constant¶

Only \(\sqrt{dt}\) works¶

The Key Insight for SDEs¶

6. From Random Walk to Brownian Motion¶

The Scaling Limit¶

Donsker's Invariance Principle¶

7. Why This Matters for Diffusion Models¶

The Forward Process¶

Why Brownian Motion is Perfect for Diffusion¶

The SDE Formulation¶

The Reverse Process¶

Connection to Score Matching¶

Summary: The Big Picture¶

What Brownian Motion Is¶

Why It Appears Everywhere¶

The Key Properties (Recap)¶

The \(\sqrt{dt}\) Scaling (One-Line Takeaway)¶

For Diffusion Models¶

Further Reading¶

Primary Papers¶

Textbooks¶

Related Topics¶

Next Steps¶