Open Research: Tokenization in Diffusion Transformers¶
Status: Active research area (as of January 2026)
Core Question: What is the "right" way to tokenize complex objects (images, gene expression, molecules) for transformer-based generative models?
The Problem¶
Transformers operate on sequences of tokens. For natural language and DNA/RNA, tokenization is natural — these are inherently sequential. But for other modalities, tokenization feels contrived and arbitrary.
The Uncomfortable Truth¶
Patch-based tokenization (16×16, 8×8, etc.) is a pragmatic hack that works, but lacks principled justification.
This document explores why current approaches feel unsatisfying and outlines open research directions.
1. Images: The Patch Problem¶
Current Approach¶
Standard practice (ViT, DiT, Stable Diffusion 3):
# Split the image (C, H, W) into non-overlapping 16×16 patches along both spatial dims
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (C, H/16, W/16, 16, 16)
tokens = embed(patches)        # flatten each patch and project to the model dimension
output = transformer(tokens)   # one token per patch
Why This Feels Wrong¶
Q1: Why 16×16?
- Not "right" — just empirically tuned
- Different models use different sizes (2×2, 4×4, 8×8, 14×14, 16×16)
- No principled way to choose
Q2: Should patch size depend on content?
- Medical images (smooth gradients): Large patches OK
- Text images (fine details): Small patches needed
- Satellite images: Depends on scale of features
- Current approach: One size for all!
Q3: Do patches respect semantic boundaries?
- A 16×16 patch might contain:
- Half a face, half background
- Part of an object, part of another
- Arbitrary image regions
- Our visual cortex doesn't work this way
Trade-offs¶
| Patch Size | Pros | Cons |
|---|---|---|
| Small (2×2, 4×4) | Fine details, local structure | More tokens, O(n²) attention cost |
| Large (16×16, 32×32) | Fewer tokens, faster | Loss of detail, coarse representation |
The problem: This is a hyperparameter, not a principled design choice.
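To make the trade-off concrete, here is a quick back-of-the-envelope calculation (a minimal sketch, assuming a square 256×256 input and non-overlapping square patches):

def patch_stats(image_size=256, patch_size=16):
    # Number of non-overlapping patches and size of the self-attention matrix
    n_tokens = (image_size // patch_size) ** 2
    return n_tokens, n_tokens ** 2

for p in (2, 4, 8, 16, 32):
    n_tokens, attn_entries = patch_stats(patch_size=p)
    print(f"patch {p:>2}x{p:<2}: {n_tokens:>5} tokens, {attn_entries:>13,} attention entries")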
2. Gene Expression: Even Less Obvious¶
Gene expression vectors: \(x \in \mathbb{R}^{20000}\) (20K genes)
Properties:
- Unordered: No natural sequence (unlike DNA)
- Dense: Most genes have non-zero expression
- Compositional: Relative values matter
- High-dimensional: 10K-30K genes typical
Current Approaches (2023-2026)¶
Approach 1: Rank by Expression (Geneformer)
# Order genes by expression level (rank-value encoding)
genes_sorted = sort_by_expression(gene_expression)   # highest-expressed gene first
tokens = embed(genes_sorted)                         # one token per gene, ~20,000 total
output = transformer(tokens)
Problems:
- Ranking is arbitrary — not biological
- 20K tokens = huge sequences (O(n²) = 400M operations)
- What about genes with same expression?
- Loses biological structure
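For concreteness, here is a minimal NumPy sketch of the rank-ordering step (the actual Geneformer pipeline also normalizes expression before ranking and truncates the sequence; top_k below is an illustrative cutoff):

import numpy as np

def rank_tokens(expression, top_k=2048):
    # Order gene indices by descending expression; ties are broken arbitrarily
    order = np.argsort(-expression)
    order = order[expression[order] > 0]    # drop unexpressed genes
    return order[:top_k]                    # truncate to keep attention tractable

expression = np.random.lognormal(size=20_000)
tokens = rank_tokens(expression)            # sequence of gene IDs fed to the transformer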
Approach 2: Gene Modules/Pathways
# Group genes by function into predefined modules/pathways
modules = {
    "glycolysis": [gene_1, gene_5, ...],
    "cell_cycle": [gene_2, gene_15, ...],
}
tokens = embed(modules, gene_expression)  # one token per module (~500 pathways)
Problems:
- How to define modules? (Also arbitrary!)
- Loses individual gene information
- Ignores within-module correlations
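A minimal sketch of the module-aggregation idea, assuming some predefined module → gene-index mapping is available (the module definitions below are toy placeholders):

import numpy as np

def module_tokens(expression, modules):
    # One value per module: here simply the mean expression of its member genes
    return np.array([expression[idx].mean() for idx in modules.values()])

modules = {"glycolysis": [0, 4, 17], "cell_cycle": [1, 14, 22]}   # toy index sets
expression = np.random.lognormal(size=20_000)
tokens = module_tokens(expression, modules)   # shape (n_modules,)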
Approach 3: No Explicit Tokenization
# Direct embedding to latent space
z = encoder(gene_expression) # (20000,) → (512,)
output = model(z) # No tokens!
Problems:
- Less interpretable
- Loses biological structure
- Black box
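A minimal PyTorch sketch of this token-free route (layer sizes are illustrative; the latent z is what a downstream diffusion model would operate on):

import torch.nn as nn

encoder = nn.Sequential(      # (batch, 20000) → (batch, 512), no tokens anywhere
    nn.Linear(20_000, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)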
Approach 4: Graph-Structured (GRN-aware)
# Use gene regulatory network
grn = load_gene_regulatory_network()
output = graph_transformer(gene_expression, grn)
Problems:
- GRN knowledge incomplete
- Still 20K nodes to handle
- Which GRN to use?
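One common way to inject the GRN is to restrict attention to known regulatory edges. A minimal sketch of that masking step (per-gene queries/keys and the boolean adjacency matrix are assumed to come from elsewhere):

import torch

def grn_masked_attention(q, k, adjacency):
    # q, k: (n_genes, d); adjacency: boolean (n_genes, n_genes) from the GRN,
    # assumed to include self-edges so no row is fully masked
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~adjacency, float("-inf"))   # attend only along regulatory edges
    return scores.softmax(dim=-1)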
The Core Issue¶
There is no natural "tokenization" for gene expression.
Unlike images (spatial structure) or language (sequential structure), gene expression is:
- A set (unordered)
- A vector (continuous)
- A network (interconnected)
Forcing it into a sequence feels wrong because it is wrong.
3. Why Do Patches Work Despite Being Arbitrary?¶
Pragmatic Reasons¶
1. Computational Efficiency
256×256 image = 65,536 pixels
With 16×16 patches = 256 tokens
Attention: 65,536² → 256² entries (a 65,536× reduction!)
2. Transfer from NLP
- Transformers proven for sequences
- Patches make images "sequence-like"
- Can reuse architectures
3. Good Enough in Practice
- ImageNet SOTA achieved
- Stable Diffusion works
- Empirical success
4. Implementation Simplicity
- Easy to code
- GPU-efficient
- Standard operations
But This Doesn't Make It "Right"¶
Engineering success ≠ Principled design
The field has optimized for what works, not what makes sense.
4. Alternative Approaches (Research Frontiers)¶
4.1 Hierarchical Tokenization¶
Idea: Learn local semantics first, then group into "super tokens"
Swin Transformer (2021): start from small 4×4 patches, compute attention within local (shifted) windows, and merge neighboring patches at each stage so tokens become progressively coarser.
Status: Works well, but still uses fixed patch sizes at each level.
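A minimal sketch of the patch-merging step that builds this hierarchy (the real Swin block also applies LayerNorm before the reduction and attends within shifted local windows; both are omitted here):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Concatenate each 2×2 neighborhood of tokens and project:
    # spatial resolution halves, channel dimension doubles.
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)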
4.2 Learned Tokenization¶
Idea: Don't fix patch size — learn how to tokenize!
BEiT, VQGAN, MaskGIT:
# Instead of fixed patches
tokens = split_into_patches(image, size=16) # Fixed
# Learn tokenization
tokens = learned_tokenizer(image) # Adaptive!
Advantages:
- Content-aware
- Can adapt to different regions
- Potentially more semantic
Challenges:
- How to train the tokenizer?
- Discrete vs continuous tokens?
- Computational cost
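As one concrete answer to "how to train the tokenizer?", VQGAN/MaskGIT-style models quantize encoder features against a learned codebook. A minimal sketch of the lookup step (codebook size is illustrative; the straight-through gradient and commitment loss needed for training are omitted):

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                         # z: (B, N, dim) encoder features
        w = self.codebook.weight                  # (n_codes, dim)
        # Squared distance from every feature vector to every codebook entry
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.T + w.pow(2).sum(-1)
        ids = d.argmin(dim=-1)                    # discrete token IDs, (B, N)
        return self.codebook(ids), ids            # quantized features + token sequence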
4.3 Convolutional Stem¶
Idea: Use CNNs for local features, Transformers for global
class HybridModel(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN stem extracts local semantics
        self.conv_stem = ResNet(...)
        # Transformer models global structure over the CNN features
        self.transformer = Transformer(...)

    def forward(self, x):
        feats = self.conv_stem(x)                   # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'·W', C)
        return self.transformer(tokens)
Status: Used in some models, but not standard for DiT.
4.4 No Tokenization (Continuous)¶
Idea: Work directly in continuous space
For images:
- Latent diffusion (VAE → continuous latent → diffusion; see the sketch at the end of this subsection)
- No explicit tokens
For gene expression:
- Direct MLP/attention on expression vector
- Treat as continuous state, not sequence
Advantage: No arbitrary discretization
Disadvantage: May lose interpretability
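A minimal sketch of the latent-diffusion route for images: the generative model only ever sees a continuous VAE latent, so no patch or token decision is made at the diffusion stage (the vae, denoiser, and noise schedule are assumed; shapes are illustrative):

import torch

def latent_diffusion_loss(vae, denoiser, x, t, alpha_bar):
    # alpha_bar: cumulative noise-schedule term for timestep t, shape (B, 1, 1, 1)
    z = vae.encode(x)                              # e.g. (B, 4, 32, 32) continuous latent
    eps = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps   # forward noising
    return ((denoiser(z_t, t) - eps) ** 2).mean()  # standard epsilon-prediction objective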
5. Biological Inspiration: How Should We Think About This?¶
How Visual Cortex Works¶
Key properties:
1. Hierarchical: Simple → complex features
2. Local receptive fields that grow
3. Specialization: Different areas for different features
4. Sparse coding: Neurons fire selectively
5. Feedback: Top-down and bottom-up
Current Models vs Biology¶
| Aspect | Biology | Patch-based DiT |
|---|---|---|
| Hierarchy | Yes (V1→V2→V3→V4) | Flat (all patches equal) |
| Local first | Yes (small receptive fields) | No (global attention) |
| Adaptive | Yes (attention, feedback) | No (fixed patches) |
| Sparse | Yes (selective firing) | No (dense attention) |
Conclusion: Current approaches are not biologically inspired.
Should We Care?¶
Two perspectives:
Pragmatic: "Biology is slow, backprop works, patches work — who cares?" - Valid for engineering - Gets SOTA results
Principled: "Understanding biology might lead to better architectures" - Valid for research - May unlock new capabilities
Reality: Field is mostly pragmatic (for now).
6. Open Research Questions¶
For Images¶
Q1: What is the optimal tokenization strategy?
- Fixed patches? Learned? Hierarchical?
- Content-adaptive?
- Task-specific?

Q2: Can we learn tokenization end-to-end?
- Jointly with the generative model?
- Discrete vs continuous?

Q3: How important is biological plausibility?
- Should we model V1→V2→V3→V4?
- Or is attention enough?
For Gene Expression¶
Q4: What is the "right" representation? - Tokens (if so, what kind)? - Continuous embeddings? - Graph structure?
Q5: Should tokenization respect biological structure? - Gene modules/pathways? - Regulatory networks? - Or learn from data?
Q6: How to handle high dimensionality? - 20K genes → how many tokens? - Latent space diffusion? - Hierarchical representation?
General Questions¶
Q7: Is tokenization necessary at all?
- Can we do generative modeling without tokens?
- Continuous-space alternatives?

Q8: Should tokenization be modality-specific?
- Images: Patches
- Audio: Time patches
- Gene expression: ???
- Or unified approach?

Q9: How to evaluate tokenization quality?
- Reconstruction error?
- Downstream task performance?
- Interpretability?
7. Current State of the Field (January 2026)¶
What's Working¶
For images:
- Fixed patches (8×8, 16×16) are standard
- Empirically tuned per model
- Stable Diffusion 3, Sora use patch-based approaches
For gene expression:
- Multiple approaches being explored
- No clear winner yet
- Geneformer (ranking), scPPDM (tabular), others
What's Being Researched¶
Active areas:
1. Learned tokenization (VQ-VAE, MaskGIT)
2. Hierarchical models (Swin, PVT)
3. Hybrid CNN-Transformer
4. Graph-structured attention
5. Continuous-space alternatives
What's Still Unknown¶
Open problems:
- Principled way to choose patch size
- Optimal tokenization for non-image modalities
- Whether biological inspiration helps
- Unified tokenization across modalities
8. Recommendations for Practitioners¶
For Image Generation (DiT)¶
Current best practice:
# Use empirically tuned patch sizes; DiT applies them to the 32×32 VAE latent of a 256×256 image
patch_size = 8   # cheapest variant (DiT-XL/8)
# or
patch_size = 2   # highest quality, most compute (DiT-XL/2)
Experiment with:
- Different patch sizes for your data
- Hierarchical approaches if quality matters
- Latent diffusion (VAE + diffusion) to avoid tokenization
For Gene Expression¶
Recommended approach (as of 2026):
# Option 1: No explicit tokenization
z = encoder(gene_expression) # (20000,) → (512,)
output = diffusion_model(z)
# Option 2: Biologically-structured
grn = load_gene_regulatory_network()
output = graph_diffusion(gene_expression, grn)
# Option 3: Learned modules
modules = learn_gene_modules(data) # Data-driven
tokens = embed_by_modules(gene_expression, modules)
output = transformer(tokens)
Then:
- Compare approaches empirically
- Publish ablation study
- Let performance guide you
General Advice¶
Start simple:
1. Use standard approaches (patches for images, embeddings for other modalities)
2. Get a baseline working
3. Then experiment with alternatives
Don't overthink:
- If patches work for your task, use them
- Principled design is nice, but results matter
But do explore:
- This is an open research area
- Novel tokenization strategies could be publishable
- Especially for non-image modalities
9. Future Directions¶
Near-term (2026-2027)¶
Likely developments:
1. More learned tokenization methods
2. Better hierarchical models
3. Modality-specific tokenization strategies
4. Improved understanding of why patches work
Medium-term (2027-2029)¶
Possible breakthroughs:
1. Unified tokenization framework
2. Biologically-inspired alternatives that match SOTA
3. Continuous-space generative models (no tokens)
4. Neural architecture search for tokenization
Long-term (2029+)¶
Speculative:
1. Fundamental rethinking of tokenization
2. New architectures that don't need tokens
3. True biological plausibility
4. Modality-agnostic generative models
10. The Bigger Picture¶
The Tension¶
Engineering Pragmatism vs Principled Design
"Does it work?" vs "Is it right?"
Empirical tuning vs Theory-driven
Fast iteration vs Deep understanding
Current state: Pragmatism dominates
- Patch sizes: Empirically tuned
- Architecture choices: What works on benchmarks
- Limited theoretical understanding

Future direction: More principled approaches
- Understanding WHY things work
- Biologically-inspired designs
- Learned, adaptive strategies
Why This Matters¶
For science:
- Understanding principles leads to better models
- Biological inspiration may unlock new capabilities
- Theory guides experimentation
For engineering:
- Principled designs generalize better
- Less hyperparameter tuning
- More robust to distribution shift
For biology applications:
- Gene expression needs better representations
- Biological structure should inform design
- Interpretability matters
11. Conclusion¶
The honest assessment:
Patch-based tokenization is arbitrary and unnatural.
- 16×16 is not "right" — it's empirically tuned
- Should vary by resolution, task, data
- Doesn't respect semantic boundaries
- Not biologically inspired
But it works.
- Achieves SOTA on many tasks
- Computationally efficient
- Easy to implement
For gene expression, it's even worse.
- No natural tokenization exists
- Current approaches are hacks
- Open research problem
The field is still figuring this out.
- Active research area
- No consensus
- Your skepticism is warranted
Recommendations:
1. Use standard approaches to get started
2. Experiment with alternatives
3. Let performance guide you
4. Contribute to the research!
References¶
Tokenization Approaches¶
Patch-based:
- Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
- Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)
Hierarchical:
- Liu et al. (2021): "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
- Wang et al. (2021): "Pyramid Vision Transformer"
Learned Tokenization:
- Bao et al. (2021): "BEiT: BERT Pre-Training of Image Transformers"
- Esser et al. (2021): "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
- Chang et al. (2022): "MaskGIT: Masked Generative Image Transformer"
Gene Expression:
- Theodoris et al. (2023): "Transfer learning enables predictions in network biology" (Geneformer)
- Cui et al. (2024): "scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics"
Biological Inspiration¶
- Sabour et al. (2017): "Dynamic Routing Between Capsules" (Capsule Networks)
- Rao & Ballard (1999): "Predictive coding in the visual cortex"
Discussion Questions¶
For researchers:
1. Can we develop a principled theory of tokenization?
2. Should tokenization be learned end-to-end with the model?
3. How important is biological plausibility?
4. Can we unify tokenization across modalities?

For practitioners:
1. How to choose patch size for my data?
2. When should I use hierarchical models?
3. Is learned tokenization worth the complexity?
4. How to tokenize gene expression data?

For the field:
1. Are we over-engineering tokenization?
2. Should we move beyond tokens entirely?
3. What can biology teach us?
4. How to balance pragmatism and principles?
Status: Open research area — contribute your ideas!
Last updated: January 13, 2026