Open Research: Tokenization in Diffusion Transformers¶
Status: Active research area (as of January 2026)
Core Question: What is the "right" way to tokenize complex objects (images, gene expression, molecules) for transformer-based generative models?
The Problem¶
Transformers operate on sequences of tokens. For natural language and DNA/RNA, tokenization is natural — these are inherently sequential. But for other modalities, tokenization feels contrived and arbitrary.
The Uncomfortable Truth¶
Patch-based tokenization (16×16, 8×8, etc.) is a pragmatic hack that works, but lacks principled justification.
This document explores why current approaches feel unsatisfying and outlines open research directions.
1. Images: The Patch Problem¶
Current Approach¶
Standard practice (ViT, DiT, Stable Diffusion 3):
# Split the image (C, H, W) into non-overlapping 16×16 patches along both spatial dims
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (C, H/16, W/16, 16, 16)
tokens = embed(patches)        # flatten each patch and project to the model dimension
output = transformer(tokens)   # one token per patch
Why This Feels Wrong¶
Q1: Why 16×16?
- Not "right" — just empirically tuned
- Different models use different sizes (2×2, 4×4, 8×8, 14×14, 16×16)
- No principled way to choose
Q2: Should patch size depend on content?
- Medical images (smooth gradients): Large patches OK
- Text images (fine details): Small patches needed
- Satellite images: Depends on scale of features
- Current approach: One size for all!
Q3: Do patches respect semantic boundaries?
- A 16×16 patch might contain:
- Half a face, half background
- Part of an object, part of another
- Arbitrary image regions
- Our visual cortex doesn't work this way
Trade-offs¶
| Patch Size | Pros | Cons |
|---|---|---|
| Small (2×2, 4×4) | Fine details, local structure | More tokens, O(n²) attention cost |
| Large (16×16, 32×32) | Fewer tokens, faster | Loss of detail, coarse representation |
The problem: This is a hyperparameter, not a principled design choice.
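To make the trade-off concrete, here is a quick back-of-the-envelope calculation (a minimal sketch, assuming a square 256×256 input and non-overlapping square patches):

def patch_stats(image_size=256, patch_size=16):
    # Number of non-overlapping patches and size of the self-attention matrix
    n_tokens = (image_size // patch_size) ** 2
    return n_tokens, n_tokens ** 2

for p in (2, 4, 8, 16, 32):
    n_tokens, attn_entries = patch_stats(patch_size=p)
    print(f"patch {p:>2}x{p:<2}: {n_tokens:>5} tokens, {attn_entries:>13,} attention entries")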
2. Gene Expression: Even Less Obvious¶
Gene expression vectors: \(x \in \mathbb{R}^{20000}\) (20K genes)
Properties:
- Unordered: No natural sequence (unlike DNA)
- Dense: Most genes have non-zero expression
- Compositional: Relative values matter
- High-dimensional: 10K-30K genes typical
Current Approaches (2023-2026)¶
Approach 1: Rank by Expression (Geneformer)
# Order genes by expression level (rank-value encoding)
genes_sorted = sort_by_expression(gene_expression)   # highest-expressed gene first
tokens = embed(genes_sorted)                         # one token per gene, ~20,000 total
output = transformer(tokens)
Problems:
- Ranking is arbitrary — not biological
- 20K tokens = huge sequences (O(n²) = 400M operations)
- What about genes with same expression?
- Loses biological structure
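For concreteness, here is a minimal NumPy sketch of the rank-ordering step (the actual Geneformer pipeline also normalizes expression before ranking and truncates the sequence; top_k below is an illustrative cutoff):

import numpy as np

def rank_tokens(expression, top_k=2048):
    # Order gene indices by descending expression; ties are broken arbitrarily
    order = np.argsort(-expression)
    order = order[expression[order] > 0]    # drop unexpressed genes
    return order[:top_k]                    # truncate to keep attention tractable

expression = np.random.lognormal(size=20_000)
tokens = rank_tokens(expression)            # sequence of gene IDs fed to the transformer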
Approach 2: Gene Modules/Pathways
# Group genes by function into predefined modules/pathways
modules = {
    "glycolysis": [gene_1, gene_5, ...],
    "cell_cycle": [gene_2, gene_15, ...],
}
tokens = embed(modules, gene_expression)  # one token per module (~500 pathways)
Problems:
- How to define modules? (Also arbitrary!)
- Loses individual gene information
- Ignores within-module correlations
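A minimal sketch of the module-aggregation idea, assuming some predefined module → gene-index mapping is available (the module definitions below are toy placeholders):

import numpy as np

def module_tokens(expression, modules):
    # One value per module: here simply the mean expression of its member genes
    return np.array([expression[idx].mean() for idx in modules.values()])

modules = {"glycolysis": [0, 4, 17], "cell_cycle": [1, 14, 22]}   # toy index sets
expression = np.random.lognormal(size=20_000)
tokens = module_tokens(expression, modules)   # shape (n_modules,)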
Approach 3: No Explicit Tokenization
# Direct embedding to latent space
z = encoder(gene_expression) # (20000,) → (512,)
output = model(z) # No tokens!
Problems:
- Less interpretable
- Loses biological structure
- Black box
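A minimal PyTorch sketch of this token-free route (layer sizes are illustrative; the latent z is what a downstream diffusion model would operate on):

import torch.nn as nn

encoder = nn.Sequential(      # (batch, 20000) → (batch, 512), no tokens anywhere
    nn.Linear(20_000, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)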
Approach 4: Graph-Structured (GRN-aware)
# Use gene regulatory network
grn = load_gene_regulatory_network()
output = graph_transformer(gene_expression, grn)
Problems:
- GRN knowledge incomplete
- Still 20K nodes to handle
- Which GRN to use?
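One common way to inject the GRN is to restrict attention to known regulatory edges. A minimal sketch of that masking step (per-gene queries/keys and the boolean adjacency matrix are assumed to come from elsewhere):

import torch

def grn_masked_attention(q, k, adjacency):
    # q, k: (n_genes, d); adjacency: boolean (n_genes, n_genes) from the GRN,
    # assumed to include self-edges so no row is fully masked
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~adjacency, float("-inf"))   # attend only along regulatory edges
    return scores.softmax(dim=-1)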
The Core Issue¶
There is no natural "tokenization" for gene expression.
Unlike images (spatial structure) or language (sequential structure), gene expression is:
- A set (unordered)
- A vector (continuous)
- A network (interconnected)
Forcing it into a sequence feels wrong because it is wrong.
3. Why Do Patches Work Despite Being Arbitrary?¶
Pragmatic Reasons¶
1. Computational Efficiency
256×256 image = 65,536 pixels
With 16×16 patches = 256 tokens
Attention: 65,536² → 256² entries (a 65,536× reduction!)
2. Transfer from NLP
- Transformers proven for sequences
- Patches make images "sequence-like"
- Can reuse architectures
3. Good Enough in Practice
- ImageNet SOTA achieved
- Stable Diffusion works
- Empirical success
4. Implementation Simplicity
- Easy to code
- GPU-efficient
- Standard operations
But This Doesn't Make It "Right"¶
Engineering success ≠ Principled design
The field has optimized for what works, not what makes sense.
4. Alternative Approaches (Research Frontiers)¶
4.1 Hierarchical Tokenization¶
Idea: Learn local semantics first, then group into "super tokens"
Swin Transformer (2021): start from small 4×4 patches, compute attention within local (shifted) windows, and merge neighboring patches at each stage so tokens become progressively coarser.
Status: Works well, but still uses fixed patch sizes at each level.
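A minimal sketch of the patch-merging step that builds this hierarchy (the real Swin block also applies LayerNorm before the reduction and attends within shifted local windows; both are omitted here):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Concatenate each 2×2 neighborhood of tokens and project:
    # spatial resolution halves, channel dimension doubles.
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)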
4.2 Learned Tokenization¶
Idea: Don't fix patch size — learn how to tokenize!
BEiT, VQGAN, MaskGIT:
# Instead of fixed patches
tokens = split_into_patches(image, size=16) # Fixed
# Learn tokenization
tokens = learned_tokenizer(image) # Adaptive!
Advantages:
- Content-aware
- Can adapt to different regions
- Potentially more semantic
Challenges:
- How to train the tokenizer?
- Discrete vs continuous tokens?
- Computational cost
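As one concrete answer to "how to train the tokenizer?", VQGAN/MaskGIT-style models quantize encoder features against a learned codebook. A minimal sketch of the lookup step (codebook size is illustrative; the straight-through gradient and commitment loss needed for training are omitted):

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                         # z: (B, N, dim) encoder features
        w = self.codebook.weight                  # (n_codes, dim)
        # Squared distance from every feature vector to every codebook entry
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.T + w.pow(2).sum(-1)
        ids = d.argmin(dim=-1)                    # discrete token IDs, (B, N)
        return self.codebook(ids), ids            # quantized features + token sequence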
4.3 Convolutional Stem¶
Idea: Use CNNs for local features, Transformers for global
class HybridModel(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN stem extracts local semantics
        self.conv_stem = ResNet(...)
        # Transformer models global structure over the CNN features
        self.transformer = Transformer(...)

    def forward(self, x):
        feats = self.conv_stem(x)                   # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'·W', C)
        return self.transformer(tokens)
Status: Used in some models, but not standard for DiT.
4.4 No Tokenization (Continuous)¶
Idea: Work directly in continuous space
For images:
- Latent diffusion (VAE → continuous latent → diffusion; see the sketch at the end of this subsection)
- No explicit tokens
For gene expression:
- Direct MLP/attention on expression vector
- Treat as continuous state, not sequence
Advantage: No arbitrary discretization
Disadvantage: May lose interpretability
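A minimal sketch of the latent-diffusion route for images: the generative model only ever sees a continuous VAE latent, so no patch or token decision is made at the diffusion stage (the vae, denoiser, and noise schedule are assumed; shapes are illustrative):

import torch

def latent_diffusion_loss(vae, denoiser, x, t, alpha_bar):
    # alpha_bar: cumulative noise-schedule term for timestep t, shape (B, 1, 1, 1)
    z = vae.encode(x)                              # e.g. (B, 4, 32, 32) continuous latent
    eps = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps   # forward noising
    return ((denoiser(z_t, t) - eps) ** 2).mean()  # standard epsilon-prediction objective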
5. Biological Inspiration: How Should We Think About This?¶
How Visual Cortex Works¶
Key properties:
1. Hierarchical: Simple → complex features
2. Local receptive fields that grow
3. Specialization: Different areas for different features
4. Sparse coding: Neurons fire selectively
5. Feedback: Top-down and bottom-up
Current Models vs Biology¶
| Aspect | Biology | Patch-based DiT |
|---|---|---|
| Hierarchy | Yes (V1→V2→V3→V4) | Flat (all patches equal) |
| Local first | Yes (small receptive fields) | No (global attention) |
| Adaptive | Yes (attention, feedback) | No (fixed patches) |
| Sparse | Yes (selective firing) | No (dense attention) |
Conclusion: Current approaches are not biologically inspired.
Should We Care?¶
Two perspectives:
Pragmatic: "Biology is slow, backprop works, patches work — who cares?" - Valid for engineering - Gets SOTA results
Principled: "Understanding biology might lead to better architectures" - Valid for research - May unlock new capabilities
Reality: Field is mostly pragmatic (for now).
6. Open Research Questions¶
For Images¶
Q1: What is the optimal tokenization strategy?
- Fixed patches? Learned? Hierarchical?
- Content-adaptive?
- Task-specific?

Q2: Can we learn tokenization end-to-end?
- Jointly with the generative model?
- Discrete vs continuous?

Q3: How important is biological plausibility?
- Should we model V1→V2→V3→V4?
- Or is attention enough?
For Gene Expression¶
Q4: What is the "right" representation? - Tokens (if so, what kind)? - Continuous embeddings? - Graph structure?
Q5: Should tokenization respect biological structure? - Gene modules/pathways? - Regulatory networks? - Or learn from data?
Q6: How to handle high dimensionality? - 20K genes → how many tokens? - Latent space diffusion? - Hierarchical representation?
General Questions¶
Q7: Is tokenization necessary at all?
- Can we do generative modeling without tokens?
- Continuous-space alternatives?

Q8: Should tokenization be modality-specific?
- Images: Patches
- Audio: Time patches
- Gene expression: ???
- Or unified approach?

Q9: How to evaluate tokenization quality?
- Reconstruction error?
- Downstream task performance?
- Interpretability?
7. Current State of the Field (January 2026)¶
What's Working¶
For images:
- Fixed patches (8×8, 16×16) are standard
- Empirically tuned per model
- Stable Diffusion 3, Sora use patch-based approaches
For gene expression:
- Multiple approaches being explored
- No clear winner yet
- Geneformer (ranking), scPPDM (tabular), others
What's Being Researched¶
Active areas:
1. Learned tokenization (VQ-VAE, MaskGIT)
2. Hierarchical models (Swin, PVT)
3. Hybrid CNN-Transformer
4. Graph-structured attention
5. Continuous-space alternatives
What's Still Unknown¶
Open problems:
- Principled way to choose patch size
- Optimal tokenization for non-image modalities
- Whether biological inspiration helps
- Unified tokenization across modalities
8. Recommendations for Practitioners¶
For Image Generation (DiT)¶
Current best practice:
# Use empirically tuned patch sizes; DiT applies them to the 32×32 VAE latent of a 256×256 image
patch_size = 8   # cheapest variant (DiT-XL/8)
# or
patch_size = 2   # highest quality, most compute (DiT-XL/2)
Experiment with:
- Different patch sizes for your data
- Hierarchical approaches if quality matters
- Latent diffusion (VAE + diffusion) to avoid tokenization
For Gene Expression¶
Recommended approach (as of 2026):
# Option 1: No explicit tokenization
z = encoder(gene_expression) # (20000,) → (512,)
output = diffusion_model(z)
# Option 2: Biologically-structured
grn = load_gene_regulatory_network()
output = graph_diffusion(gene_expression, grn)
# Option 3: Learned modules
modules = learn_gene_modules(data) # Data-driven
tokens = embed_by_modules(gene_expression, modules)
output = transformer(tokens)
Then:
- Compare approaches empirically
- Publish ablation study
- Let performance guide you
General Advice¶
Start simple:
1. Use standard approaches (patches for images, embeddings for other modalities)
2. Get a baseline working
3. Then experiment with alternatives
Don't overthink:
- If patches work for your task, use them
- Principled design is nice, but results matter
But do explore:
- This is an open research area
- Novel tokenization strategies could be publishable
- Especially for non-image modalities
9. Future Directions¶
Near-term (2026-2027)¶
Likely developments:
1. More learned tokenization methods
2. Better hierarchical models
3. Modality-specific tokenization strategies
4. Improved understanding of why patches work
Medium-term (2027-2029)¶
Possible breakthroughs:
1. Unified tokenization framework
2. Biologically-inspired alternatives that match SOTA
3. Continuous-space generative models (no tokens)
4. Neural architecture search for tokenization
Long-term (2029+)¶
Speculative:
1. Fundamental rethinking of tokenization
2. New architectures that don't need tokens
3. True biological plausibility
4. Modality-agnostic generative models
10. The Bigger Picture¶
The Tension¶
Engineering Pragmatism vs Principled Design
"Does it work?" vs "Is it right?"
Empirical tuning vs Theory-driven
Fast iteration vs Deep understanding
Current state: Pragmatism dominates
- Patch sizes: Empirically tuned
- Architecture choices: What works on benchmarks
- Limited theoretical understanding

Future direction: More principled approaches
- Understanding WHY things work
- Biologically-inspired designs
- Learned, adaptive strategies
Why This Matters¶
For science:
- Understanding principles leads to better models
- Biological inspiration may unlock new capabilities
- Theory guides experimentation
For engineering:
- Principled designs generalize better
- Less hyperparameter tuning
- More robust to distribution shift
For biology applications:
- Gene expression needs better representations
- Biological structure should inform design
- Interpretability matters
11. Conclusion¶
The honest assessment:
Patch-based tokenization is arbitrary and unnatural.
- 16×16 is not "right" — it's empirically tuned
- Should vary by resolution, task, data
- Doesn't respect semantic boundaries
- Not biologically inspired
But it works.
- Achieves SOTA on many tasks
- Computationally efficient
- Easy to implement
For gene expression, it's even worse.
- No natural tokenization exists
- Current approaches are hacks
- Open research problem
The field is still figuring this out.
- Active research area
- No consensus
- Your skepticism is warranted
Recommendations:
1. Use standard approaches to get started
2. Experiment with alternatives
3. Let performance guide you
4. Contribute to the research!
References¶
Tokenization Approaches¶
Patch-based:
- Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT)
- Peebles & Xie (2023): "Scalable Diffusion Models with Transformers" (DiT)
Hierarchical:
- Liu et al. (2021): "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
- Wang et al. (2021): "Pyramid Vision Transformer"
Learned Tokenization:
- Bao et al. (2021): "BEiT: BERT Pre-Training of Image Transformers"
- Esser et al. (2021): "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
- Chang et al. (2022): "MaskGIT: Masked Generative Image Transformer"
Gene Expression:
- Theodoris et al. (2023): "Transfer learning enables predictions in network biology" (Geneformer)
- Cui et al. (2024): "scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics"
Biological Inspiration¶
- Sabour et al. (2017): "Dynamic Routing Between Capsules" (Capsule Networks)
- Rao & Ballard (1999): "Predictive coding in the visual cortex"
Discussion Questions¶
For researchers:
1. Can we develop a principled theory of tokenization?
2. Should tokenization be learned end-to-end with the model?
3. How important is biological plausibility?
4. Can we unify tokenization across modalities?

For practitioners:
1. How to choose patch size for my data?
2. When should I use hierarchical models?
3. Is learned tokenization worth the complexity?
4. How to tokenize gene expression data?

For the field:
1. Are we over-engineering tokenization?
2. Should we move beyond tokens entirely?
3. What can biology teach us?
4. How to balance pragmatism and principles?
Status: Open research area — contribute your ideas!
Last updated: January 13, 2026