Sharding and I/O Efficiency for Genomic Deep Learning¶
A practical guide to why and how training data is packed into shards, with concrete examples from the agentic-spliceai codebase.
See also: I/O Bottlenecks and DataLoader Tuning for the complementary guide on diagnosing and fixing data pipeline stalls.
1. The Problem: Per-Gene Files Don't Scale¶
During meta-layer training, we need data from ~12,000 genes. The naive
storage format is one .npz file per gene:
gene_cache/train/
gene-BRCA1.npz (850 KB)
gene-TP53.npz (120 KB)
gene-MYBPC3.npz (340 KB)
...
gene-TTN.npz (12 MB) ← largest human gene
Each file contains the gene's sequence, base scores, multimodal features, and labels. Reading one file is fast. But training samples a different random gene for every single window, which means:
- 12,500 batches/epoch (100K samples / batch 8)
- 100,000 file opens/epoch (one gene file per sample)
- Each
np.load()call: ~0.5-2ms file open + seek + read + decompress
On a local NVMe SSD this is tolerable (~100-200ms overhead per epoch). On a network volume (RunPod, cloud NFS) each file open adds 5-30ms of latency, turning 100K file opens into 8-50 minutes of pure I/O wait per epoch — while the GPU sits idle.
The numbers¶
| Storage | Per-file latency | 100K files/epoch | GPU utilization |
|---|---|---|---|
| Local NVMe | 0.5 ms | ~50s | ~40% |
| Network volume | 10 ms | ~17 min | ~7% |
| Object storage (S3) | 30 ms | ~50 min | ~2% |
The GPU can process a batch in ~10ms. At 10ms I/O per file, data loading is 100x slower than compute.
2. The Solution: Pack Genes into Shards¶
Instead of one file per gene, pack many genes into a small number of large files. Each shard is a single file containing hundreds of genes, organized for efficient random access.
Before (per-gene NPZ)¶
100K samples/epoch = 100K file opens.
After (per-chromosome HDF5 shards)¶
gene_cache/train_shards/
shard_chr2.h5 ← contains ~900 genes
shard_chr4.h5 ← contains ~750 genes
shard_chr6.h5
... (12 shard files for training chromosomes)
12 file handles opened once, then kept open. 100K samples/epoch = 100K HDF5 dataset reads within already-open files. Each read is a seek + slice — no file-open overhead.
3. HDF5 Shard Structure¶
Each shard is an HDF5 file with one group per gene:
shard_chr2.h5
├── gene-BRCA2/
│ ├── sequence [gene_len] uint8 (DNA as bytes)
│ ├── base_scores [gene_len, 3] float32 (donor, acceptor, neither)
│ ├── mm_features [gene_len, 9] float32 (multimodal channels)
│ └── labels [gene_len] int64 (0=donor, 1=acceptor, 2=neither)
├── gene-MSH2/
│ ├── ...
└── gene-EPCAM/
├── ...
HDF5 supports chunked storage and random slicing, so reading a 5001-bp window from a 100K-bp gene is fast — HDF5 only reads the relevant chunks, not the whole dataset.
Building shards (in the training pipeline)¶
# From 07_train_sequence_model.py
python -u examples/meta_layer/07_train_sequence_model.py \
--use-shards \ # enable shard packing
--cache-dir /path/to/gene_cache \
...
The training script first builds the per-gene NPZ cache (if not present), then packs genes into per-chromosome HDF5 shards. Subsequent runs skip building and use the shards directly.
The packing code¶
# From shard_packing.py (simplified)
def repack_genes_into_shards(gene_entries, output_dir):
# Group genes by chromosome
by_chrom = defaultdict(list)
for entry in gene_entries:
by_chrom[entry.chrom].append(entry)
for chrom, entries in by_chrom.items():
shard_path = output_dir / f"shard_{chrom}.h5"
with h5py.File(shard_path, "w") as f:
for entry in entries:
gene = np.load(entry.npz_path)
grp = f.create_group(entry.gene_id)
for key in ["sequence", "base_scores", "mm_features", "labels"]:
grp.create_dataset(key, data=gene[key])
4. Dataset Access Patterns¶
Without shards: SequenceLevelDataset¶
def __getitem__(self, idx):
gene_idx = self._rng.choice(self._valid_genes)
entry = self.gene_index[gene_idx]
gene = np.load(entry.npz_path) # ← file open every call
# ... extract window from gene ...
Every __getitem__ call opens a new file.
With shards: ShardedSequenceLevelDataset¶
def __getitem__(self, idx):
gene_idx = self._rng.choice(self._valid_genes)
entry = self.gene_index[gene_idx]
handle = self._get_handle(entry.shard_path) # ← cached, opened once
grp = handle[entry.gene_id]
# Slice directly from HDF5 — no full gene load
base_scores = grp["base_scores"][out_start:out_end]
# ...
File handles are cached per-process (fork-safe, keyed by process ID). A single HDF5 handle stays open for the life of the DataLoader worker.
Fork-safety note¶
HDF5 file handles cannot survive a fork(). The sharded dataset uses
lazy handle initialization (opened on first access per worker process)
and the DataLoader must use multiprocessing_context="spawn":
DataLoader(
dataset,
num_workers=4,
multiprocessing_context="spawn", # required for HDF5
persistent_workers=True,
)
See I/O Bottlenecks for the full explanation.
5. Measured Impact¶
From the M1-S training run on an A40 pod:
| Configuration | GPU utilization | Time per epoch |
|---|---|---|
Per-gene NPZ, num_workers=0 |
~7% | ~45 min |
Per-gene NPZ, num_workers=4 |
~15% | ~25 min |
HDF5 shards, num_workers=4 |
~40% | ~8 min |
The jump from 7% to 40% GPU utilization came from two changes: sharding (eliminated file-open overhead) and multi-worker prefetch (overlapped I/O with GPU compute).
The remaining 60% gap to full GPU utilization is due to the small model size (367K params) — the forward/backward pass is so fast that even optimized I/O can't keep up perfectly. Larger models would see higher utilization with the same I/O pipeline.
6. Alternative Shard Formats¶
HDF5 is not the only option. Here's how common formats compare for genomic training data:
| Format | Random access | Compression | Multi-worker safe | Best for |
|---|---|---|---|---|
| HDF5 | Yes (chunked) | Optional (gzip, lzf) | With spawn |
Structured arrays, sliceable |
| Parquet | Column-level | Snappy/Zstd | Yes | Tabular data, Polars/Pandas |
| WebDataset (.tar) | No (sequential) | Optional | Yes | Images, streaming-only |
| TFRecord | No (sequential) | Optional | Yes | TensorFlow pipelines |
| Zarr | Yes (chunked) | Blosc/Zstd | Yes | Large arrays, cloud-native |
| LMDB | Yes (key-value) | No | Yes | Small records, fast random access |
| SQLite | Yes (indexed) | No | Read-only concurrent | Metadata + small blobs |
For genomic sequence-level data with variable-length genes and multi-channel features, HDF5 and Zarr are the strongest fits because they support efficient slicing of contiguous array regions.
7. When to Shard¶
Shard when: - Training accesses many individual files per epoch (>1000) - GPU utilization is below 30% during training - Data lives on network storage (not local NVMe)
Don't bother sharding when: - Dataset fits in memory (just load it all upfront) - Sequential access only (streaming formats like WebDataset work better) - Files are already large (>100 MB each) — the per-file overhead is amortized
8. Summary¶
The core principle: minimize the number of file-open operations during training. Every file open pays a latency tax (0.5ms local, 10-30ms network). Sharding amortizes this cost across thousands of samples by keeping a small number of large files open.
Individual files Shards
──────────────── ──────
Epoch start: open 0 files open 12 shard files
Per sample: open 1 file (0.5-30ms) seek within open file (~0.01ms)
100K samples: 100K file opens 100K seeks (1000x faster)
The training pipeline in 07_train_sequence_model.py supports both modes
via --use-shards. On local NVMe, per-gene NPZ files are fast enough.
On GPU pods with network volumes, shards are essential.