
Diffusion Transformers (DiT) + Rectified Flow

This directory contains comprehensive documentation on Diffusion Transformers (DiT) combined with rectified flow — the modern architecture for scalable, flexible generative modeling.

DiT represents the shift from convolutional U-Nets to Transformers, enabling better scaling, flexible conditioning, and modality-agnostic generation.


Core Documentation Series

This series follows the same structure as the DDPM, SDE, and flow matching documentation.

| Document | Description |
|---|---|
| 00_dit_overview.md | Overview: Why DiT matters, key concepts, modern stack |
| 01_dit_foundations.md | Foundations: Architecture details, components, design choices |
| 02_dit_training.md | Training: How to train DiT + rectified flow models |
| 03_dit_sampling.md | Sampling: How to generate samples efficiently |

Supplementary Documents

Deep dives on specific topics (located in docs/diffusion/DiT/):

| Document | Description |
|---|---|
| diffusion_transformer.md | Comprehensive tutorial with biology applications |
| time_embeddings_explained.md | Deep dive on time conditioning mechanisms |

Quick Navigation

For Beginners

Start with the overview to understand the big picture, then move through foundations and training.

Path: Overview → Foundations → Training

For Implementation

Focus on the practical training and sampling guides with code examples.

Path: Training → Sampling → Supplementary docs

For Theory Deep Dive

Understand the architectural choices and mathematical foundations.

Path: Foundations → Supplementary docs → Flow matching theory


Key Concepts

The Modern Generative Stack

Rectified Flow (objective) + DiT (architecture) + AdaLN (conditioning)

Rectified Flow: Simple regression target

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] \]

DiT: Transformer-based architecture

  • Tokenization: Input → patches → tokens
  • Self-attention: Global dependencies
  • AdaLN: Time/condition modulation

Result: Fast, scalable, flexible generation

DiT vs U-Net

| Aspect | U-Net | DiT |
|---|---|---|
| Architecture | Convolutional | Transformer |
| Receptive field | Local → Global | Global from start |
| Input format | Fixed grids | Flexible tokens |
| Conditioning | Architectural changes | Built-in (AdaLN) |
| Scaling | Limited | Excellent |
| Best for | Images, fixed size | Any modality |

Core Components

1. Tokenization

Image/Data → Patches → Flatten → Embed → Tokens
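
A minimal sketch of this step, assuming square image inputs and a hypothetical `PatchEmbed` module (names and sizes are illustrative, not taken from the accompanying docs):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Image → non-overlapping patches → linear embedding → tokens."""
    def __init__(self, patch_size=8, in_channels=3, embed_dim=256):
        super().__init__()
        # A strided convolution extracts and embeds each patch in one step
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, D)

tokens = PatchEmbed()(torch.randn(4, 3, 32, 32))   # → (4, 16, 256)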

2. Time Conditioning (AdaLN)

t → TimeEmbed(t) → MLP → (γ, β) → Modulate features
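
A hedged sketch of this pipeline (module and dimension names are mine, not from the docs): a sinusoidal embedding of \(t\) is passed through an MLP that emits per-channel scale and shift parameters used to modulate layer-normalized features.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=256):
    """t: (B,) in [0, 1] → (B, dim) sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class AdaLNModulation(nn.Module):
    """TimeEmbed(t) → MLP → (γ, β) → modulate normalized token features."""
    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, tokens, t):              # tokens: (B, N, D), t: (B,)
        t_emb = sinusoidal_embedding(t, tokens.shape[-1])
        gamma, beta = self.mlp(t_emb).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + gamma[:, None]) + beta[:, None]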

3. Transformer Blocks

Tokens → Self-Attention → MLP → Updated Tokens

4. Output Projection

Tokens → Linear → Velocity Field
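
A compact, hypothetical block combining these pieces (all names and dimensions are assumptions; real DiT blocks also gate the residual branches): the time embedding supplies AdaLN scale/shift for both the attention and MLP sub-layers, and a final linear head maps each token to a per-patch velocity.

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Self-attention + MLP, each preceded by AdaLN-style time modulation."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # One MLP on the time embedding emits scale/shift for both sub-layers
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 4 * dim))

    def forward(self, tokens, t_emb):          # tokens: (B, N, D), t_emb: (B, D)
        g1, b1, g2, b2 = self.ada(t_emb)[:, None].chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + g1) + b1
        tokens = tokens + self.attn(h, h, h)[0]        # global self-attention
        h = self.norm2(tokens) * (1 + g2) + b2
        tokens = tokens + self.mlp(h)
        return tokens

# Output projection: each token predicts the velocity for its own patch,
# which is then un-patchified back to the original input shape
output_head = nn.Linear(256, 8 * 8 * 3)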


Training Overview

Rectified Flow Loss

Simple regression:

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] \]

where:

  • \(x_0 \sim p_{\text{data}}\) (real data)
  • \(x_1 \sim \mathcal{N}(0, I)\) (noise)
  • \(x_t = t x_1 + (1-t) x_0\) (linear interpolation)
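
The regression target is exactly the velocity of the interpolation path: differentiating the definition of \(x_t\) with respect to \(t\) gives

\[ \frac{dx_t}{dt} = \frac{d}{dt}\left[ t x_1 + (1 - t) x_0 \right] = x_1 - x_0, \]

so the network learns to predict the constant velocity along each straight line from data to noise.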

Key advantages:

  • No noise schedules
  • No variance parameterization
  • Direct regression target
  • Stable training

Training Algorithm

import torch
import torch.nn.functional as F

for batch in dataloader:
    x_0 = batch                                       # Real data
    x_1 = torch.randn_like(x_0)                       # Noise
    t = torch.rand(x_0.shape[0], device=x_0.device)   # Random time in [0, 1]

    # Linear interpolation (reshape t to broadcast over non-batch dims)
    t_ = t.view(-1, *([1] * (x_0.dim() - 1)))
    x_t = t_ * x_1 + (1 - t_) * x_0

    # Predict velocity
    v_pred = model(x_t, t)

    # Compute loss
    target = x_1 - x_0
    loss = F.mse_loss(v_pred, target)

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Sampling Overview

ODE Integration

Generation ODE (noise at \(t = 1\) → data at \(t = 0\)):

\[ \frac{dx}{dt} = v_\theta(x, t) \]

Euler discretization:

def sample(model, shape, num_steps=50):
    x = torch.randn(shape)          # Start from noise (t = 1)
    dt = 1.0 / num_steps

    for k in range(num_steps):
        t = 1.0 - k * dt            # Integrate from t = 1 (noise) down to t = 0 (data)
        v = model(x, torch.full((shape[0],), t))
        x = x - v * dt              # Step against the data-to-noise velocity

    return x  # Generated sample

Properties:

  • Deterministic (same noise → same output)
  • Fast (20-50 steps)
  • Straight paths (rectified flow)
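
Beyond Euler, a second-order solver trades two model calls per step for better accuracy at a given step count. A hedged sketch of a Heun step under the same conventions as the sampler above (noise at \(t = 1\), data at \(t = 0\); `model` and `shape` are the same placeholders):

import torch

def sample_heun(model, shape, num_steps=25):
    x = torch.randn(shape)                     # Noise at t = 1
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        t_next = t - dt
        v1 = model(x, torch.full((shape[0],), t))
        x_euler = x - v1 * dt                  # Euler predictor
        v2 = model(x_euler, torch.full((shape[0],), t_next))
        x = x - 0.5 * (v1 + v2) * dt           # Trapezoidal corrector
    return x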

Applications

Vision

  • Images: Stable Diffusion 3, DALL-E 3
  • Videos: Sora, Goku
  • 3D: Point clouds, meshes

Audio

  • Music: MusicGen
  • Speech: AudioLDM
  • Sound effects: Foley generation

Biology

  • Gene expression: Cell state generation
  • Perturbations: Predict intervention effects
  • Trajectories: Developmental paths
  • Molecules: Protein structure

Other

  • Robotics: Trajectory planning
  • Physics: Simulation
  • Design: CAD, architecture

Why DiT for Biology?

Challenges with Traditional Approaches

Gene expression data:

  • High-dimensional (10K-30K genes)
  • Unordered (no natural sequence)
  • Sparse (many zeros)
  • Compositional (relative values matter)

U-Net limitations:

  • Assumes spatial structure
  • Fixed input sizes
  • Hard to condition on perturbations

DiT Advantages

Flexibility:

  • Genes/cells/regions as tokens
  • Variable-length sequences
  • Natural conditioning on perturbations (see the sketch below)

Global interactions:

  • Self-attention captures gene-gene dependencies
  • No locality bias
  • Learn regulatory networks

Scalability:

  • Handle large gene panels
  • Batch different experiments
  • Scale to billions of parameters
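
As a purely illustrative sketch of the flexibility described above (all module names, sizes, and the perturbation-token scheme are assumptions, not an established recipe), an expression vector can be mapped to one token per gene, with a prepended condition token for the perturbation:

import torch
import torch.nn as nn

class GeneTokenizer(nn.Module):
    """Genes as tokens: learned gene-identity embedding + projected expression value."""
    def __init__(self, num_genes=2000, dim=256, num_perturbations=100):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)          # which gene
        self.value_proj = nn.Linear(1, dim)                     # how much expression
        self.pert_embed = nn.Embedding(num_perturbations, dim)  # condition token

    def forward(self, expression, perturbation):
        # expression: (B, G) continuous values, perturbation: (B,) integer ids
        B, G = expression.shape
        gene_ids = torch.arange(G, device=expression.device).expand(B, G)
        tokens = self.gene_embed(gene_ids) + self.value_proj(expression[..., None])
        cond = self.pert_embed(perturbation)[:, None]           # (B, 1, D)
        return torch.cat([cond, tokens], dim=1)                 # (B, G + 1, D)

tok = GeneTokenizer()(torch.randn(4, 2000), torch.randint(0, 100, (4,)))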

Open Questions

  1. Tokenization: How to represent genes as tokens?
     • Rank by expression? (Geneformer approach)
     • Gene embeddings? (learned representations)
     • Set-based? (permutation invariant)

  2. Latent space: Better to work in latent space?
     • Encode expression → latent → diffusion
     • Avoids sparsity issues
     • More stable training

  3. Architecture: DiT vs alternatives?
     • State-space models (Mamba, S4)
     • Hyena (long convolutions)
     • Hybrid approaches

See the supplementary documents for deeper exploration.


Learning Path

Conceptual Understanding

  1. DiT Overview — Why DiT matters
     • Architectural shift from U-Net
     • Modern generative stack
     • Key advantages

  2. Flow Matching Basics — Rectified flow theory
     • Velocity fields
     • Linear interpolation
     • ODE sampling

  3. DiT Foundations — Architecture details
     • Tokenization strategies
     • Transformer blocks
     • Time conditioning

Practical Implementation

  1. DiT Training — Training pipeline
     • Data preparation
     • Model architecture
     • Training loop
     • Hyperparameters

  2. DiT Sampling — Generation strategies
     • ODE solvers
     • Conditional generation
     • Quality vs speed

Advanced Topics

  1. Comprehensive Tutorial — Deep dive
     • Alternative backbones
     • Biology applications
     • State-space models

  2. Time Embeddings — Conditioning mechanisms
     • Sinusoidal embeddings
     • AdaLN details
     • FiLM modulation

Comparison with Other Methods

DiT vs DDPM

| Aspect | DDPM | DiT + Rectified Flow |
|---|---|---|
| Architecture | U-Net | Transformer |
| Objective | Noise prediction | Velocity prediction |
| Training | Noise schedule needed | Simple regression |
| Sampling | 1000 steps (SDE) | 20-50 steps (ODE) |
| Conditioning | Concatenation/FiLM | AdaLN/Cross-attention |
| Flexibility | Images mainly | Any modality |

DiT vs Flow Matching (U-Net)

| Aspect | Flow Matching + U-Net | DiT + Rectified Flow |
|---|---|---|
| Objective | Same (velocity) | Same (velocity) |
| Architecture | Convolutional | Transformer |
| Scaling | Limited | Excellent |
| Conditioning | Moderate | Excellent |
| Speed | Fast convolutions | Slower attention |

Key insight: DiT is an architectural choice, orthogonal to the training objective.


Key Takeaways

Conceptual

  1. DiT = Transformer architecture for diffusion/flow models
  2. Rectified flow = simple objective (velocity regression)
  3. Together = modern stack for state-of-the-art generation
  4. Tokenization enables modality-agnostic modeling

Practical

  1. Training is simple: Regression on \(v = x_1 - x_0\)
  2. Sampling is fast: 20-50 ODE steps
  3. Conditioning is easy: Tokens or AdaLN
  4. Scales well: Proven to billions of parameters

For Biology

  1. Flexible representation: Genes, cells, perturbations
  2. Global interactions: Attention captures dependencies
  3. Conditional generation: Model interventions
  4. Active research: Best practices still emerging

Prerequisites

  • Flow Matching — Rectified flow theory
  • DDPM — Discrete diffusion models
  • SDE — Continuous-time perspective

Advanced Topics

Code Examples

  • notebooks/diffusion/ — Interactive tutorials
  • examples/ — Production scripts

References

Key Papers

DiT:

  • Peebles & Xie (2023): "Scalable Diffusion Models with Transformers"

Rectified Flow:

  • Liu et al. (2022): "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
  • Liu et al. (2023): "InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation"

Transformers:

  • Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
  • Vaswani et al. (2017): "Attention is All You Need"

Conditioning:

  • Perez et al. (2018): "FiLM: Visual Reasoning with a General Conditioning Layer"

Modern Implementations

  • Stable Diffusion 3: DiT-based text-to-image
  • Sora: DiT for video generation
  • Hugging Face Diffusers: DiT implementations
  • OpenAI: DALL-E 3

Summary

Diffusion Transformers (DiT) combined with rectified flow represent the modern approach to generative modeling:

Architecture: Transformers replace U-Nets

  • Global attention from the start
  • Flexible tokenization
  • First-class conditioning

Objective: Rectified flow simplifies training

  • Direct velocity regression
  • No noise schedules
  • Fast ODE sampling

Result: State-of-the-art generation

  • Images, video, audio
  • Scalable to billions of parameters
  • Emerging applications in biology

The modern stack:

Rectified Flow + DiT + AdaLN = Powerful, flexible generation

This combination has become the foundation for cutting-edge generative models and is particularly promising for computational biology applications.