
Diffusion Transformers (DiT): Overview

This document provides a high-level introduction to Diffusion Transformers (DiT) — the architectural shift from convolutional U-Nets to Transformers for generative modeling, particularly when combined with rectified flow.


What is DiT?

Diffusion Transformer (DiT) is an architectural choice, not a new diffusion theory.

DiT uses a Transformer to parameterize the function learned in diffusion or flow-based models:

\[ v_\theta(x, t, c) \quad \text{(velocity for flow matching)} \]
\[ \epsilon_\theta(x, t, c) \quad \text{(noise for DDPM)} \]
\[ s_\theta(x, t) \quad \text{(score for score matching)} \]

Key insight: The objective (what to learn) and the architecture (how to learn it) are orthogonal design choices.


Why DiT Matters

The Architectural Evolution

Historical progression:

  1. U-Net era (2020-2022): Convolutional architectures dominated
  2. DiT era (2023+): Transformers became the standard
  3. Modern stack: DiT + Rectified Flow

What Changed

Aspect         | U-Net                        | DiT
---------------|------------------------------|---------------------------
Inductive bias | Spatial locality             | Global attention
Input format   | Fixed grids                  | Flexible tokens
Conditioning   | Architectural changes needed | First-class via modulation
Scaling        | Limited                      | Excellent
Flexibility    | Images only                  | Any modality

The Core Idea: Grids → Tokens

U-Net thinking: Process images as spatial grids with local convolutions

DiT thinking: Represent inputs as sequences of tokens, process with global attention

For images (see the patch-embedding sketch below):

  1. Split image into patches (e.g., 16×16 pixels)
  2. Flatten each patch into a vector
  3. Embed into token space
  4. Process with Transformer
  5. Project back to image space

For other domains:

  • Genes, cells, regions → tokens
  • Time series → temporal tokens
  • Latent representations → abstract tokens
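A minimal patch-embedding sketch in PyTorch; the class name, layer choices, and sizes below are illustrative assumptions rather than any particular DiT release:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed each patch as a token."""

    def __init__(self, img_size=32, patch_size=16, in_channels=3, dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution flattens and embeds each patch in one step.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim) token sequence

tokens = PatchEmbed()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 4, 384])

Using a strided convolution to patchify is a common implementation shortcut; an equivalent reshape-then-linear formulation gives the same token sequence.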

DiT + Rectified Flow: The Modern Stack

Why This Combination Works

Rectified Flow provides:

  • Simple regression target: \(v = x_1 - x_0\)
  • Straight paths in data space
  • Fast ODE sampling
  • No density assumptions

DiT provides:

  • Global context via self-attention
  • Flexible conditioning via modulation
  • Scalability to large models
  • Modality-agnostic architecture

Together:

Simple objective + Powerful architecture = State-of-the-art generation

Key Components

1. Tokenization: Convert input to sequence

Image → Patches → Tokens

2. Time Conditioning: Adaptive LayerNorm (AdaLN)

t → TimeEmbed → (γ, β) → Modulate features

3. Self-Attention: Global dependencies

Tokens → Attention → Updated tokens

4. Output Projection: Map back to target space

Tokens → Linear → Velocity field
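The sketch below ties components 2-4 together: a stripped-down DiT block in PyTorch with adaptive LayerNorm (AdaLN) conditioning. Module names and dimensions are illustrative assumptions, and the gated AdaLN-Zero variant from the DiT paper is omitted for brevity:

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One simplified DiT block: AdaLN modulation + global self-attention + MLP."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)  # scale/shift come from t
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Time embedding -> (gamma, beta) pairs for the two norms (AdaLN).
        self.adaln = nn.Linear(dim, 4 * dim)

    def forward(self, x, t_emb):                  # x: (B, N, dim), t_emb: (B, dim)
        g1, b1, g2, b2 = self.adaln(t_emb).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + b1.unsqueeze(1)   # modulate by time
        x = x + self.attn(h, h, h, need_weights=False)[0]             # global self-attention
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + self.mlp(h)                                        # position-wise MLP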


Training Objective

Rectified Flow Loss

Standard form:

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] \]

where:

  • \(x_0 \sim p_{\text{data}}\) (real data)
  • \(x_1 \sim \mathcal{N}(0, I)\) (noise)
  • \(x_t = t x_1 + (1-t) x_0\) (linear interpolation)
  • \(v_\theta\) is the DiT network

With conditioning:

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t, c} \left[ \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 \right] \]
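A minimal training-step sketch of this loss, assuming a model(x_t, t, c) callable that returns the predicted velocity; names and shapes are illustrative:

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond):
    """x0: batch of real data (B, ...); cond: conditioning passed straight to the model."""
    x1 = torch.randn_like(x0)                          # noise endpoint
    t = torch.rand(x0.shape[0], device=x0.device)      # t ~ U(0, 1), one per sample
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))          # broadcast t over data dimensions
    x_t = t_b * x1 + (1 - t_b) * x0                    # linear interpolation
    target = x1 - x0                                   # constant velocity along the path
    return F.mse_loss(model(x_t, t, cond), target)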

Why This is Simple

Compared to DDPM:

  • No noise schedules to tune
  • No variance parameterization
  • Direct regression target

Compared to score matching:

  • No score function computation
  • No Langevin dynamics
  • Deterministic sampling via ODE

Sampling Process

ODE Integration

Sampling ODE (with the convention above, \(t = 0\) is data and \(t = 1\) is noise, so sampling integrates in reverse time from \(t = 1\) to \(t = 0\)):

\[ \frac{dx}{dt} = v_\theta(x, t, c) \]

Discretization (Euler method):

import torch

@torch.no_grad()
def sample(model, shape, condition, num_steps=50):
    """Euler integration of dx/dt = v_theta(x, t, c) from t = 1 (noise) to t = 0 (data)."""
    x = torch.randn(shape)              # start from noise, x_1 ~ N(0, I)
    dt = 1.0 / num_steps

    for k in range(num_steps):
        t = 1.0 - k * dt                # reverse time: 1 → 0
        v = model(x, t, condition)      # predicted velocity v_theta(x_t, t, c)
        x = x - v * dt                  # Euler step toward the data end

    return x                            # generated sample

Properties:

  • Deterministic (same noise → same output)
  • Fast (20-50 steps typical)
  • Straight paths (rectified flow)

Why DiT Scales Better

1. Global Context is Native

U-Net: Needs deep pyramids to propagate information

  • Downsample → process → upsample
  • Limited receptive field at each layer
  • Information bottleneck

DiT: Self-attention is global by default

  • Every token attends to every other token
  • No information bottleneck
  • Direct long-range dependencies

2. Flexible Input Shapes

U-Net: Fixed grid sizes

  • Must pad/crop to specific resolutions
  • Awkward for variable-size inputs
  • Hard to batch different sizes

DiT: Variable-length sequences

  • Different number of tokens per sample
  • Batch with masking/packing
  • Natural for heterogeneous data

3. First-Class Conditioning

U-Net: Conditioning requires architectural changes

  • Concatenate channels
  • Add FiLM layers
  • Modify skip connections

DiT: Conditioning is built-in

  • Add condition tokens (cross-attention)
  • Modulate with AdaLN
  • No architectural surgery
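As a hedged sketch of the condition-token route, the module below lets data tokens attend to condition tokens via cross-attention; names and shapes are illustrative assumptions:

import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Inject conditioning by letting data tokens attend to condition tokens."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, cond_tokens):          # x: (B, N, dim), cond_tokens: (B, M, dim)
        h = self.norm(x)
        # Queries come from the data tokens, keys/values from the condition tokens.
        out, _ = self.cross_attn(h, cond_tokens, cond_tokens)
        return x + out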


Applications

Images

  • Stable Diffusion 3: DiT backbone
  • DALL-E 3: Transformer-based
  • PixArt-α: Efficient text-to-image DiT

Videos

  • Sora: DiT for video generation
  • Goku: Efficient video DiT

Beyond Vision

  • Audio: music and sound generation
  • Molecules: Protein structure generation
  • Robotics: Trajectory generation
  • Biology: Gene expression, cell states

Key Advantages

Theoretical

  1. Unified framework: Same architecture for all modalities
  2. Scalability: Proven to scale to billions of parameters
  3. Interpretability: Attention maps show what model focuses on

Practical

  1. Training stability: Rectified flow is well-behaved
  2. Fast sampling: 20-50 steps vs 1000 for DDPM
  3. Easy conditioning: Add tokens or modulation
  4. Transfer learning: Pretrained transformers can be adapted

Engineering

  1. Infrastructure: Leverage existing Transformer tools
  2. Optimization: Well-understood training dynamics
  3. Debugging: Attention visualization helps
  4. Deployment: Standard Transformer serving

Comparison: U-Net vs DiT

Aspect          | U-Net              | DiT
----------------|--------------------|----------------------------
Architecture    | Convolutional      | Transformer
Receptive field | Local → Global     | Global from start
Input format    | Fixed grids        | Flexible tokens
Conditioning    | Concatenation/FiLM | AdaLN/Cross-attention
Scaling         | Limited            | Excellent
Speed           | Fast convolutions  | Slower attention
Memory          | Moderate           | Higher
Best for        | Images, fixed size | Any modality, variable size

Modern trend: DiT is becoming the default for new models.


DiT for Computational Biology

Why DiT is Promising for Biology

Traditional challenges:

  • Gene expression: High-dimensional, unordered
  • Cell states: Continuous, compositional
  • Perturbations: Need flexible conditioning
  • Time series: Variable-length trajectories

DiT solutions:

  • Tokens: Genes, cells, regions, timepoints
  • Attention: Capture gene-gene interactions
  • Conditioning: Perturbations, cell types, experimental conditions
  • Flexibility: Handle variable numbers of cells/genes
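As one hypothetical illustration (not an established recipe), gene expression could be tokenized by summing a learned gene-identity embedding with a projection of the measured expression value; all names and sizes here are assumptions for illustration:

import torch
import torch.nn as nn

class GeneTokenizer(nn.Module):
    """Turn a cell's expression vector into a set of gene tokens."""

    def __init__(self, num_genes=2000, dim=256):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)    # "which gene" (identity)
        self.value_proj = nn.Linear(1, dim)               # "how much" (expression level)

    def forward(self, expr):                               # expr: (B, num_genes)
        gene_ids = torch.arange(expr.shape[1], device=expr.device)
        identity = self.gene_embed(gene_ids)               # (num_genes, dim)
        value = self.value_proj(expr.unsqueeze(-1))        # (B, num_genes, dim)
        return identity.unsqueeze(0) + value               # (B, num_genes, dim) gene tokens

Because genes have no natural ordering, no positional encoding is added here; attention then treats the genes as a set, which connects to the open questions below.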

Potential Applications

  1. Perturb-seq modeling: Predict perturbation effects
  2. Cell state generation: Sample from cell type distributions
  3. Trajectory inference: Model developmental paths
  4. Counterfactual generation: "What if" scenarios

Open Questions

  1. Tokenization: How to represent gene expression as tokens?
  2. Ordering: Genes have no natural sequence — use set-based attention?
  3. Sparsity: Many genes have zero expression — special handling?
  4. Latent space: Better to work in latent space than raw expression?

See: Advanced topics in supplementary documents for deeper exploration.


Document Organization

This DiT documentation is organized as follows:

Core Series (Practical Workflow)

  1. 00_dit_overview.md (this document) — High-level introduction
  2. 01_dit_foundations.md — Architecture details, components
  3. 02_dit_training.md — How to train DiT + rectified flow
  4. 03_dit_sampling.md — How to generate samples

Supplementary Documents (Deep Dives)

Located in docs/diffusion/DiT/:

  • diffusion_transformer.md — Comprehensive tutorial with biology focus
  • time_embeddings_explained.md — Deep dive on time conditioning
  • Additional topics as needed


Learning Path

For Beginners

  1. Start here (00_dit_overview.md) — Understand the big picture
  2. Flow matching basics (docs/flow_matching/) — Learn rectified flow
  3. DiT foundations (01_dit_foundations.md) — Architecture details
  4. Training guide (02_dit_training.md) — Practical implementation

For Implementers

  1. Training guide (02_dit_training.md) — Complete training pipeline
  2. Sampling guide (03_dit_sampling.md) — Generation strategies
  3. Supplementary docs — Advanced techniques
  4. Code examples — See examples/ and notebooks/

For Theorists

  1. Flow matching theory (docs/flow_matching/) — Mathematical foundations
  2. DiT paper (Peebles & Xie 2023) — Original architecture
  3. Supplementary docs — Deep dives on specific topics
  4. SDE view (docs/SDE/) — Continuous-time perspective

Key Takeaways

Conceptual

  1. DiT is an architecture, not a new diffusion method
  2. Transformers replace U-Nets for better scaling and flexibility
  3. Rectified flow + DiT is the modern generative stack
  4. Tokenization enables modality-agnostic generation

Practical

  1. Training is simple: Regression on velocity field
  2. Sampling is fast: 20-50 ODE steps
  3. Conditioning is easy: Add tokens or modulation
  4. Scales well: Proven to billions of parameters

For Biology

  1. Flexible representation: Genes, cells, perturbations as tokens
  2. Global interactions: Attention captures dependencies
  3. Conditional generation: Model perturbation effects
  4. Open research: Best tokenization strategies still being explored

Next Steps

Continue to: 01_dit_foundations.md for architecture details and components.

Related documentation:

  • Flow Matching — Rectified flow theory
  • DDPM — Discrete diffusion models
  • SDE — Continuous-time perspective

Supplementary deep dives:

  • diffusion_transformer.md — Comprehensive tutorial with biology focus
  • time_embeddings_explained.md — Deep dive on time conditioning


References

Key Papers

  • Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)
  • Liu et al. (2022): "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
  • Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT)
  • Perez et al. (2018): "FiLM: Visual Reasoning with a General Conditioning Layer"

Modern Implementations

  • Stable Diffusion 3: DiT-based text-to-image
  • Sora: DiT for video generation
  • Hugging Face Diffusers: DiT implementations

Summary

Diffusion Transformers (DiT) represent the modern approach to generative modeling:

  • Replace U-Nets with Transformers for better scaling
  • Combine with rectified flow for simple, fast training
  • Enable flexible conditioning via tokens and modulation
  • Scale to any modality through tokenization

The modern generative stack:

Rectified Flow (objective) + DiT (architecture) + AdaLN (conditioning)

This combination has become the foundation for state-of-the-art generative models across images, video, audio, and emerging applications in biology.