
Diffusion Transformers (DiT) + Rectified Flow

This directory contains comprehensive documentation on Diffusion Transformers (DiT) combined with rectified flow — the modern architecture for scalable, flexible generative modeling.

DiT represents the shift from convolutional U-Nets to Transformers, enabling better scaling, flexible conditioning, and modality-agnostic generation.


Core Documentation Series

This series follows the same structure as the DDPM, SDE, and flow matching documentation.

| Document | Description |
|---|---|
| 00_dit_overview.md | Overview: Why DiT matters, key concepts, modern stack |
| 01_dit_foundations.md | Foundations: Architecture details, components, design choices |
| 02_dit_training.md | Training: How to train DiT + rectified flow models |
| 03_dit_sampling.md | Sampling: How to generate samples efficiently |

Supplementary Documents

Deep dives on specific topics (located in docs/diffusion/DiT/):

| Document | Description |
|---|---|
| diffusion_transformer.md | Comprehensive tutorial with biology applications |
| time_embeddings_explained.md | Deep dive on time conditioning mechanisms |

Quick Navigation

For Beginners

Start with the overview to understand the big picture, then move through foundations and training.

Path: Overview → Foundations → Training

For Implementation

Focus on the practical training and sampling guides with code examples.

Path: Training → Sampling → Supplementary docs

For Theory Deep Dive

Understand the architectural choices and mathematical foundations.

Path: Foundations → Supplementary docs → Flow matching theory


Key Concepts

The Modern Generative Stack

Rectified Flow (objective) + DiT (architecture) + AdaLN (conditioning)

Rectified Flow: Simple regression target

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] \]

DiT: Transformer-based architecture

  • Tokenization: Input → patches → tokens
  • Self-attention: Global dependencies
  • AdaLN: Time/condition modulation

Result: Fast, scalable, flexible generation

DiT vs U-Net

| Aspect | U-Net | DiT |
|---|---|---|
| Architecture | Convolutional | Transformer |
| Receptive field | Local → Global | Global from start |
| Input format | Fixed grids | Flexible tokens |
| Conditioning | Architectural changes | Built-in (AdaLN) |
| Scaling | Limited | Excellent |
| Best for | Images, fixed size | Any modality |

Core Components

1. Tokenization

Image/Data → Patches → Flatten → Embed → Tokens
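
A minimal sketch of this step, assuming square image inputs and a hypothetical `PatchEmbed` module (names and sizes are illustrative, not taken from the accompanying docs):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Image → non-overlapping patches → linear embedding → tokens."""
    def __init__(self, patch_size=8, in_channels=3, embed_dim=256):
        super().__init__()
        # A strided convolution extracts and embeds each patch in one step
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, D)

tokens = PatchEmbed()(torch.randn(4, 3, 32, 32))   # → (4, 16, 256)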

2. Time Conditioning (AdaLN)

t → TimeEmbed(t) → MLP → (γ, β) → Modulate features
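
A hedged sketch of this pipeline (module and dimension names are mine, not from the docs): a sinusoidal embedding of \(t\) is passed through an MLP that emits per-channel scale and shift parameters used to modulate layer-normalized features.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=256):
    """t: (B,) in [0, 1] → (B, dim) sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class AdaLNModulation(nn.Module):
    """TimeEmbed(t) → MLP → (γ, β) → modulate normalized token features."""
    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, tokens, t):              # tokens: (B, N, D), t: (B,)
        t_emb = sinusoidal_embedding(t, tokens.shape[-1])
        gamma, beta = self.mlp(t_emb).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + gamma[:, None]) + beta[:, None]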

3. Transformer Blocks

Tokens → Self-Attention → MLP → Updated Tokens

4. Output Projection

Tokens → Linear → Velocity Field
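
A compact, hypothetical block combining these pieces (all names and dimensions are assumptions; real DiT blocks also gate the residual branches): the time embedding supplies AdaLN scale/shift for both the attention and MLP sub-layers, and a final linear head maps each token to a per-patch velocity.

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Self-attention + MLP, each preceded by AdaLN-style time modulation."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # One MLP on the time embedding emits scale/shift for both sub-layers
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 4 * dim))

    def forward(self, tokens, t_emb):          # tokens: (B, N, D), t_emb: (B, D)
        g1, b1, g2, b2 = self.ada(t_emb)[:, None].chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + g1) + b1
        tokens = tokens + self.attn(h, h, h)[0]        # global self-attention
        h = self.norm2(tokens) * (1 + g2) + b2
        tokens = tokens + self.mlp(h)
        return tokens

# Output projection: each token predicts the velocity for its own patch,
# which is then un-patchified back to the original input shape
output_head = nn.Linear(256, 8 * 8 * 3)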


Training Overview

Rectified Flow Loss

Simple regression:

\[ \mathcal{L} = \mathbb{E}_{x_0, x_1, t} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right] \]

where:

  • \(x_0 \sim p_{\text{data}}\) (real data)
  • \(x_1 \sim \mathcal{N}(0, I)\) (noise)
  • \(x_t = t x_1 + (1-t) x_0\) (linear interpolation)
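
The regression target is exactly the velocity of the interpolation path: differentiating the definition of \(x_t\) with respect to \(t\) gives

\[ \frac{dx_t}{dt} = \frac{d}{dt}\left[ t x_1 + (1 - t) x_0 \right] = x_1 - x_0, \]

so the network learns to predict the constant velocity along each straight line from data to noise.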

Key advantages:

  • No noise schedules
  • No variance parameterization
  • Direct regression target
  • Stable training

Training Algorithm

import torch
import torch.nn.functional as F

for batch in dataloader:
    x_0 = batch                                       # Real data
    x_1 = torch.randn_like(x_0)                       # Noise
    t = torch.rand(x_0.shape[0], device=x_0.device)   # Random time in [0, 1]

    # Linear interpolation (reshape t to broadcast over non-batch dims)
    t_ = t.view(-1, *([1] * (x_0.dim() - 1)))
    x_t = t_ * x_1 + (1 - t_) * x_0

    # Predict velocity
    v_pred = model(x_t, t)

    # Compute loss
    target = x_1 - x_0
    loss = F.mse_loss(v_pred, target)

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Sampling Overview

ODE Integration

Generation ODE (noise at \(t = 1\) → data at \(t = 0\)):

\[ \frac{dx}{dt} = v_\theta(x, t) \]

Euler discretization:

def sample(model, shape, num_steps=50):
    x = torch.randn(shape)          # Start from noise (t = 1)
    dt = 1.0 / num_steps

    for k in range(num_steps):
        t = 1.0 - k * dt            # Integrate from t = 1 (noise) down to t = 0 (data)
        v = model(x, torch.full((shape[0],), t))
        x = x - v * dt              # Step against the data-to-noise velocity

    return x  # Generated sample

Properties:

  • Deterministic (same noise → same output)
  • Fast (20-50 steps)
  • Straight paths (rectified flow)
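
Beyond Euler, a second-order solver trades two model calls per step for better accuracy at a given step count. A hedged sketch of a Heun step under the same conventions as the sampler above (noise at \(t = 1\), data at \(t = 0\); `model` and `shape` are the same placeholders):

import torch

def sample_heun(model, shape, num_steps=25):
    x = torch.randn(shape)                     # Noise at t = 1
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        t_next = t - dt
        v1 = model(x, torch.full((shape[0],), t))
        x_euler = x - v1 * dt                  # Euler predictor
        v2 = model(x_euler, torch.full((shape[0],), t_next))
        x = x - 0.5 * (v1 + v2) * dt           # Trapezoidal corrector
    return x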

Applications

Vision

  • Images: Stable Diffusion 3, DALL-E 3
  • Videos: Sora, Goku
  • 3D: Point clouds, meshes

Audio

  • Music: MusicGen
  • Speech: AudioLDM
  • Sound effects: Foley generation

Biology

  • Gene expression: Cell state generation
  • Perturbations: Predict intervention effects
  • Trajectories: Developmental paths
  • Molecules: Protein structure

Other

  • Robotics: Trajectory planning
  • Physics: Simulation
  • Design: CAD, architecture

Why DiT for Biology?

Challenges with Traditional Approaches

Gene expression data:

  • High-dimensional (10K-30K genes)
  • Unordered (no natural sequence)
  • Sparse (many zeros)
  • Compositional (relative values matter)

U-Net limitations:

  • Assumes spatial structure
  • Fixed input sizes
  • Hard to condition on perturbations

DiT Advantages

Flexibility:

  • Genes/cells/regions as tokens
  • Variable-length sequences
  • Natural conditioning on perturbations (see the sketch below)

Global interactions:

  • Self-attention captures gene-gene dependencies
  • No locality bias
  • Learn regulatory networks

Scalability:

  • Handle large gene panels
  • Batch different experiments
  • Scale to billions of parameters
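
As a purely illustrative sketch of the flexibility described above (all module names, sizes, and the perturbation-token scheme are assumptions, not an established recipe), an expression vector can be mapped to one token per gene, with a prepended condition token for the perturbation:

import torch
import torch.nn as nn

class GeneTokenizer(nn.Module):
    """Genes as tokens: learned gene-identity embedding + projected expression value."""
    def __init__(self, num_genes=2000, dim=256, num_perturbations=100):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)          # which gene
        self.value_proj = nn.Linear(1, dim)                     # how much expression
        self.pert_embed = nn.Embedding(num_perturbations, dim)  # condition token

    def forward(self, expression, perturbation):
        # expression: (B, G) continuous values, perturbation: (B,) integer ids
        B, G = expression.shape
        gene_ids = torch.arange(G, device=expression.device).expand(B, G)
        tokens = self.gene_embed(gene_ids) + self.value_proj(expression[..., None])
        cond = self.pert_embed(perturbation)[:, None]           # (B, 1, D)
        return torch.cat([cond, tokens], dim=1)                 # (B, G + 1, D)

tok = GeneTokenizer()(torch.randn(4, 2000), torch.randint(0, 100, (4,)))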

Open Questions

  1. Tokenization: How to represent genes as tokens?
     • Rank by expression? (Geneformer approach)
     • Gene embeddings? (learned representations)
     • Set-based? (permutation invariant)

  2. Latent space: Better to work in latent space?
     • Encode expression → latent → diffusion
     • Avoids sparsity issues
     • More stable training

  3. Architecture: DiT vs alternatives?
     • State-space models (Mamba, S4)
     • Hyena (long convolutions)
     • Hybrid approaches

See the supplementary documents for deeper exploration.


Learning Path

Conceptual Understanding

  1. DiT Overview — Why DiT matters
     • Architectural shift from U-Net
     • Modern generative stack
     • Key advantages

  2. Flow Matching Basics — Rectified flow theory
     • Velocity fields
     • Linear interpolation
     • ODE sampling

  3. DiT Foundations — Architecture details
     • Tokenization strategies
     • Transformer blocks
     • Time conditioning

Practical Implementation

  1. DiT Training — Training pipeline
     • Data preparation
     • Model architecture
     • Training loop
     • Hyperparameters

  2. DiT Sampling — Generation strategies
     • ODE solvers
     • Conditional generation
     • Quality vs speed

Advanced Topics

  1. Comprehensive Tutorial — Deep dive
     • Alternative backbones
     • Biology applications
     • State-space models

  2. Time Embeddings — Conditioning mechanisms
     • Sinusoidal embeddings
     • AdaLN details
     • FiLM modulation

Comparison with Other Methods

DiT vs DDPM

| Aspect | DDPM | DiT + Rectified Flow |
|---|---|---|
| Architecture | U-Net | Transformer |
| Objective | Noise prediction | Velocity prediction |
| Training | Noise schedule needed | Simple regression |
| Sampling | 1000 steps (SDE) | 20-50 steps (ODE) |
| Conditioning | Concatenation/FiLM | AdaLN/Cross-attention |
| Flexibility | Images mainly | Any modality |

DiT vs Flow Matching (U-Net)

| Aspect | Flow Matching + U-Net | DiT + Rectified Flow |
|---|---|---|
| Objective | Same (velocity) | Same (velocity) |
| Architecture | Convolutional | Transformer |
| Scaling | Limited | Excellent |
| Conditioning | Moderate | Excellent |
| Speed | Fast convolutions | Slower attention |

Key insight: DiT is an architectural choice, orthogonal to the training objective.


Key Takeaways

Conceptual

  1. DiT = Transformer architecture for diffusion/flow models
  2. Rectified flow = simple objective (velocity regression)
  3. Together = modern stack for state-of-the-art generation
  4. Tokenization enables modality-agnostic modeling

Practical

  1. Training is simple: Regression on \(v = x_1 - x_0\)
  2. Sampling is fast: 20-50 ODE steps
  3. Conditioning is easy: Tokens or AdaLN
  4. Scales well: Proven to billions of parameters

For Biology

  1. Flexible representation: Genes, cells, perturbations
  2. Global interactions: Attention captures dependencies
  3. Conditional generation: Model interventions
  4. Active research: Best practices still emerging

Prerequisites

  • Flow Matching — Rectified flow theory
  • DDPM — Discrete diffusion models
  • SDE — Continuous-time perspective

Advanced Topics

Code Examples

  • notebooks/diffusion/ — Interactive tutorials
  • examples/ — Production scripts

References

Key Papers

DiT:

  • Peebles & Xie (2023): "Scalable Diffusion Models with Transformers"

Rectified Flow:

  • Liu et al. (2022): "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
  • Liu et al. (2023): "InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation"

Transformers:

  • Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
  • Vaswani et al. (2017): "Attention is All You Need"

Conditioning:

  • Perez et al. (2018): "FiLM: Visual Reasoning with a General Conditioning Layer"

Modern Implementations

  • Stable Diffusion 3: DiT-based text-to-image
  • Sora: DiT for video generation
  • Hugging Face Diffusers: DiT implementations
  • OpenAI: DALL-E 3

Summary

Diffusion Transformers (DiT) combined with rectified flow represent the modern approach to generative modeling:

Architecture: Transformers replace U-Nets

  • Global attention from the start
  • Flexible tokenization
  • First-class conditioning

Objective: Rectified flow simplifies training

  • Direct velocity regression
  • No noise schedules
  • Fast ODE sampling

Result: State-of-the-art generation

  • Images, video, audio
  • Scalable to billions of parameters
  • Emerging applications in biology

The modern stack:

Rectified Flow + DiT + AdaLN = Powerful, flexible generation

This combination has become the foundation for cutting-edge generative models and is particularly promising for computational biology applications.