Genomic Data Preparation
Goals served: cross-cutting (foundation for all other applications)
Tier: Active
Last updated: 2026-04
Problem
Every downstream application depends on consistent, validated genomic data: reference FASTA, gene annotations (GTF), splice-site ground truth, chromosome splits, and resource resolution. Inconsistent coordinate systems (0-based vs 1-based, chr prefix vs bare), mismatched genome builds (GRCh37 vs GRCh38), or silent annotation-source drift are common failure modes that cascade into every model.
This application provides a stable data-preparation layer with explicit resource manifests, coordinate-system guardrails, and validated MANE / Ensembl annotation sources.
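As an illustration of the coordinate-system guardrails mentioned above, here is a minimal sketch. The helper names (`normalize_chrom`, `bed_to_gtf`, `gtf_to_bed`) are hypothetical and not part of the package. It relies on two well-known conventions: BED intervals are 0-based and half-open, while GTF intervals are 1-based and inclusive. The mitochondrial mapping (`chrM` vs `MT`) is an assumption for the example.

```python
def normalize_chrom(name: str, use_chr_prefix: bool) -> str:
    """Return a chromosome name with or without the 'chr' prefix (hypothetical helper)."""
    bare = name[3:] if name.startswith("chr") else name
    # Assumption: the mitochondrial contig is 'chrM' with a prefix and 'MT' without.
    if bare in ("M", "MT"):
        return "chrM" if use_chr_prefix else "MT"
    return f"chr{bare}" if use_chr_prefix else bare


def bed_to_gtf(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open [start, end) interval to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based half-open interval: ({start}, {end})")
    return start + 1, end


def gtf_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based inclusive [start, end] interval to 0-based half-open."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based inclusive interval: ({start}, {end})")
    return start - 1, end
```

Keeping the conversion in one place, and raising on impossible intervals, is one way to stop an off-by-one error from propagating silently into downstream labels.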
User-facing functionality
- Extract per-gene sequence and annotations from a reference genome
- Produce splice-site ground truth TSVs (donor/acceptor/neither) aligned to MANE or Ensembl
- Validate MANE vs Ensembl metadata consistency for a chosen gene set
- Generate balanced chromosome splits matching the SpliceAI convention
- Resolve model/genome/annotation resources via a central registry
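To make the splice-site ground truth concrete, the sketch below derives donor and acceptor positions from the exons of a single transcript. It is an illustration only, not the package's extraction code: `SpliceSite` and `splice_sites_from_exons` are hypothetical names, exons are assumed to be 1-based inclusive (GTF style), and each site is recorded at the exon boundary base, which may differ from the offset used in the project's TSVs. Every position not emitted here would fall into the "neither" class.

```python
from typing import NamedTuple


class SpliceSite(NamedTuple):
    chrom: str
    position: int      # 1-based exon boundary base (assumed convention)
    strand: str
    splice_type: str   # "donor" or "acceptor"


def splice_sites_from_exons(chrom: str, strand: str,
                            exons: list[tuple[int, int]]) -> list[SpliceSite]:
    """Derive splice sites from one transcript's exons (1-based inclusive)."""
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    ordered = sorted(exons)  # ascending genomic order
    sites = []
    for i, (start, end) in enumerate(ordered):
        has_left = i > 0                      # an intron lies to the genomic left
        has_right = i < len(ordered) - 1      # an intron lies to the genomic right
        if strand == "+":
            # Transcript runs left to right: acceptor at exon start, donor at exon end.
            if has_left:
                sites.append(SpliceSite(chrom, start, strand, "acceptor"))
            if has_right:
                sites.append(SpliceSite(chrom, end, strand, "donor"))
        else:
            # Transcript runs right to left: roles of start and end are swapped.
            if has_right:
                sites.append(SpliceSite(chrom, end, strand, "acceptor"))
            if has_left:
                sites.append(SpliceSite(chrom, start, strand, "donor"))
    return sites
```

A transcript with n exons yields n - 1 donors and n - 1 acceptors; a single-exon transcript yields none.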
Driving examples
- `examples/data_preparation/01_prepare_gene_data.py`: gene data extraction
- `examples/data_preparation/02_prepare_splice_sites.py`: splice-site annotation generation
- `examples/data_preparation/03_full_data_pipeline.py`: end-to-end preparation pipeline
- `examples/data_preparation/04_generate_ground_truth.py`: ground truth TSV generation
- `examples/data_preparation/validate_mane_metadata.py`: MANE validation utility
src/ surface
Library (stable):
- `agentic_spliceai.splice_engine.base_layer.data.preparation`: gene and splice-site extraction
- `agentic_spliceai.splice_engine.resources.schema.ensure_chrom_column`: boundary guardrail (canonical `chrom` column)
- `agentic_spliceai.splice_engine.resources.registry`: model/genome/annotation resource resolver
- `agentic_spliceai.splice_engine.resources.model_resources.get_model_resources`: per-model config lookup
- `agentic_spliceai.splice_engine.eval.splitting`: balanced chromosome split helpers
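The idea behind a boundary guardrail for the canonical `chrom` column can be sketched as follows. This is not the real `ensure_chrom_column`, whose signature is not shown here. It is a hypothetical stand-in (`ensure_chrom_key`) that works on plain dict rows and assumes `seqname` is the only legacy alias.

```python
def ensure_chrom_key(rows, legacy_names=("seqname",)):
    """Return copies of `rows` that all carry a canonical 'chrom' key.

    Hypothetical illustration: the legacy key is kept alongside 'chrom'
    so that older consumers continue to work.
    """
    normalized = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is not mutated
        if "chrom" not in row:
            for legacy in legacy_names:
                if legacy in row:
                    row["chrom"] = row[legacy]
                    break
            else:
                raise KeyError("row has neither 'chrom' nor a known legacy alias")
        normalized.append(row)
    return normalized
```

Applying such a check at every I/O boundary means internal code can rely on `chrom` being present, while failing loudly on inputs that carry no chromosome field at all.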
Application package (src/agentic_spliceai/applications/data_preparation/):
- `manifest.py`: versioned `IngestManifest` (inputs + artifacts + SHA-256 hashes + timestamps)
- `pipeline.py`: `prepare_build()` orchestrator + `resolve_canonical_output_dir()` helper
- `status.py`: `DataPrepStatus` readiness query (what's done / missing / stale)
- `steps.py`: thin wrappers over library calls (gene_features, splice_sites, chromosome_split, validate)
- `cli.py`: unified subcommand CLI
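A manifest that records inputs, artifacts, SHA-256 hashes, and timestamps could look roughly like the sketch below. The field names and the `build_manifest` helper are assumptions made for illustration; the actual `IngestManifest` layout is not shown in this document.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(inputs: dict[str, str], artifact_paths: list[Path],
                   schema_version: int = 1) -> dict:
    """Assemble a JSON-serializable manifest (hypothetical field names)."""
    return {
        "schema_version": schema_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "inputs": dict(inputs),
        "artifacts": {
            path.name: {
                "path": str(path),
                "bytes": path.stat().st_size,
                "sha256": sha256_of_file(path),
            }
            for path in artifact_paths
        },
    }
```

Comparing a stored hash with a freshly computed one is one way a readiness query could distinguish a stale artifact from a present one.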
CLI entry points:
- `agentic-spliceai-prepare`: infrastructure CLI (pre-existing)
- `agentic-spliceai-ingest`: application CLI (ingestion layer). Example invocations:

    agentic-spliceai-ingest list-builds
    agentic-spliceai-ingest status --canonical --build GRCh38 --annotation-source mane
    agentic-spliceai-ingest prepare --inplace --build GRCh38 --annotation-source mane --dry-run
    agentic-spliceai-ingest prepare --build GRCh38 --annotation-source mane \
        --output-dir output/ingest/my_run
    agentic-spliceai-ingest validate --build GRCh38 --annotation-source mane \
        --output-dir data/mane/GRCh38

Production-safety guards:
- `--output-dir` is required for throwaway runs
- `--inplace` is the opt-in flag for writing to the resource-manager-resolved canonical dir (`data/<source>/<build>/`)
- Existing artifacts are preserved by default; regeneration requires `--force`
- `status --canonical` and `validate` are fully read-only
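The guards above can be sketched with `argparse` as follows. This is a hypothetical illustration rather than the contents of `cli.py`: the function names are invented, and treating `--inplace` and `--output-dir` as mutually exclusive is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest-sketch")
    parser.add_argument("--output-dir")
    parser.add_argument("--inplace", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser


def resolve_output_dir(args: argparse.Namespace, canonical_dir: Path) -> Path:
    """Pick the write target; the canonical dir is used only on explicit opt-in."""
    if args.inplace and args.output_dir:
        # Assumption: the two flags are mutually exclusive.
        raise ValueError("use either --inplace or --output-dir, not both")
    if args.inplace:
        return canonical_dir
    if args.output_dir is None:
        raise ValueError("--output-dir is required unless --inplace is given")
    return Path(args.output_dir)


def should_write(artifact: Path, force: bool) -> bool:
    """Existing artifacts are preserved unless regeneration is forced."""
    return force or not artifact.exists()
```

Making the canonical directory reachable only through an explicit flag means that a forgotten argument fails with an error instead of overwriting production data.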
Artifacts produced (per output dir):
| Name | File | Step | Purpose |
|---|---|---|---|
| gene_features | gene_features.parquet or .tsv | gene_features | Gene metadata table |
| splice_sites | splice_sites_enhanced.tsv | splice_sites | Donor/acceptor ground truth |
| chromosome_split | chromosome_split.json | chromosome_split | SpliceAI-convention train/test split |
| ingest_manifest | ingest_manifest.json | (automatic) | Versioned record: inputs + artifacts + hashes + timestamps |
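For the `chromosome_split` artifact, the sketch below shows one way to express a SpliceAI-style split. It assumes the convention from the SpliceAI publication (chromosomes 1, 3, 5, 7, and 9 held out for testing, the remaining autosomes plus X and Y used for training). The JSON layout and the function name are hypothetical, and any additional balancing done by the project's split helpers is not modelled.

```python
import json

# Assumption: SpliceAI-style hold-out of the odd chromosomes 1 through 9.
TEST_CHROMS = ["1", "3", "5", "7", "9"]
ALL_CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def spliceai_style_split(use_chr_prefix: bool) -> dict[str, list[str]]:
    """Return train/test chromosome lists in the naming style of the chosen build."""
    def fmt(chrom: str) -> str:
        return f"chr{chrom}" if use_chr_prefix else chrom

    return {
        "train": [fmt(c) for c in ALL_CHROMS if c not in TEST_CHROMS],
        "test": [fmt(c) for c in TEST_CHROMS],
    }


def dump_split(split: dict[str, list[str]]) -> str:
    """Serialize the split as it might be written to chromosome_split.json."""
    return json.dumps(split, indent=2)
```

Splitting by whole chromosome, rather than by gene, keeps neighbouring sequence from the same locus out of both sets at once.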
Optional experiment tracking (silent fallback when wandb is not installed or no API key is set):
agentic-spliceai-ingest prepare --inplace \
--build GRCh38 --annotation-source mane \
--track --tracking-project agentic-spliceai-data-preparation
Logs per-step durations, rows, success flags, and the resulting `ingest_manifest.json` as a W&B artifact. Shared with `base_layer` and `multimodal_features` via `applications._common.tracking`.
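The silent-fallback behaviour can be sketched as below. This is a hypothetical illustration, not the code in `applications._common.tracking`: `NullTracker` and `make_tracker` are invented names, and checking only the `WANDB_API_KEY` environment variable is a simplifying assumption (W&B can also be configured through a prior login).

```python
import importlib.util
import os


class NullTracker:
    """No-op stand-in used when tracking is disabled or W&B is unavailable."""

    def log(self, metrics: dict) -> None:
        pass

    def finish(self) -> None:
        pass


def make_tracker(enabled: bool, project: str, env=None):
    """Return a W&B run when possible, otherwise a NullTracker."""
    env = os.environ if env is None else env
    if not enabled:
        return NullTracker()
    if importlib.util.find_spec("wandb") is None:
        return NullTracker()  # wandb is not installed
    if not env.get("WANDB_API_KEY"):
        return NullTracker()  # assumption: the key is supplied via the environment
    import wandb  # imported lazily so the dependency stays optional

    return wandb.init(project=project)
```

Because both return types expose `log()` and `finish()`, pipeline code can call them unconditionally without checking whether tracking is active.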
Evaluation
- Genome builds: GRCh37 (no chr prefix, MT only), GRCh38 (chr prefix, M+MT)
- Annotation sources: MANE (~19K genes, OpenSpliceAI-compatible), Ensembl (~57K genes including pseudogenes)
- Cross-model genes: 17,571 shared protein-coding genes across SpliceAI + OpenSpliceAI
- Aliases: `hg38 → GRCh38`, `hg19 → GRCh37`, resolved by the registry
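Alias resolution of this kind can be sketched in a few lines. The names below (`BUILD_ALIASES`, `USES_CHR_PREFIX`, `resolve_build`) are hypothetical and not the registry's API; the alias pairs and prefix conventions are taken from the list above, while case-insensitive alias matching is an assumption.

```python
BUILD_ALIASES = {"hg38": "GRCh38", "hg19": "GRCh37"}

# Prefix convention per build, as listed above.
USES_CHR_PREFIX = {"GRCh37": False, "GRCh38": True}


def resolve_build(name: str) -> str:
    """Map a user-supplied build name or alias to its canonical build name."""
    key = name.strip()
    # Assumption: aliases are matched case-insensitively.
    canonical = BUILD_ALIASES.get(key.lower(), key)
    if canonical not in USES_CHR_PREFIX:
        raise ValueError(f"unknown genome build: {name!r}")
    return canonical
```

Resolving aliases once at the entry point, and rejecting unknown names, helps avoid mixing GRCh37 and GRCh38 resources under different spellings.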
Maturity tier and signals
Current tier: Active (moving toward Mature)
Signals supporting the tier:
- 4 numbered scripts with stable args
- Shared by every downstream application
- Canonical schema columns (`chrom`, `splice_type`) enforced at I/O boundaries
- Genome-wide annotation extraction (no `force_extract` required anymore)
- MANE metadata validation utility exists
- Packaged application at `src/agentic_spliceai/applications/data_preparation/` exposing a versioned manifest, readiness API, and dedicated CLI (`agentic-spliceai-ingest`)
- Production-path completeness check: `status --canonical` detects which artifacts are present in `data/<source>/<build>/` without modifying them. Verified on `data/mane/GRCh38/` (gene_features + splice_sites present, chromosome_split gap-filled non-destructively)
Graduation signals
To advance to Mature, the application needs:
- Per-chromosome sequence caching (`gene_sequence_{chrom}.parquet`) surfaced as a `step_gene_sequences` step (currently lazy-loaded by base predictors)
- Inference-path tests for the pipeline + status API
- Modality-side equivalent for the meta-layer (planned: `src/agentic_spliceai/applications/multimodal_features/`)
- Declared coordinate-system invariants in the API
- Integration hook from base-layer CLI (call `get_status()` before running, surface warnings when incomplete)
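The integration hook in the last item could be built on a read-only readiness check like the sketch below. It does not reproduce `get_status()` or `DataPrepStatus`, whose signatures are not shown here. `ReadinessReport` and `check_readiness` are hypothetical names, the expected file names come from the artifact table, and staleness detection is left out.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Expected files per artifact, taken from the artifact table.
EXPECTED_ARTIFACTS = {
    "gene_features": ("gene_features.parquet", "gene_features.tsv"),
    "splice_sites": ("splice_sites_enhanced.tsv",),
    "chromosome_split": ("chromosome_split.json",),
}


@dataclass
class ReadinessReport:
    present: dict = field(default_factory=dict)
    missing: list = field(default_factory=list)

    @property
    def ready(self) -> bool:
        return not self.missing


def check_readiness(output_dir: Path) -> ReadinessReport:
    """Inspect an output dir without modifying it (read-only)."""
    report = ReadinessReport()
    for name, candidates in EXPECTED_ARTIFACTS.items():
        found = next(
            (output_dir / c for c in candidates if (output_dir / c).is_file()),
            None,
        )
        if found is None:
            report.missing.append(name)
        else:
            report.present[name] = found
    return report
```

A base-layer CLI could call such a check before running and print `report.missing` as warnings instead of failing later on an absent file.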
Known limitations
- `DYLD_LIBRARY_PATH` in `.zshrc` can cause torch import failures during prep (documented in `dev/errors/`)
- GTF parsing is memory-bound for very large annotations; use streaming mode
- MANE and Ensembl use different transcript IDs; cross-reference requires explicit mapping
- Legacy `seqname` alias preserved for backward compatibility; do not remove without a deprecation cycle
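To illustrate why streaming helps with the GTF memory limitation, here is a generic line-by-line parser sketch. It is not the project's streaming mode: `iter_gtf_records` is a hypothetical name, and the attribute handling assumes values contain no embedded semicolons.

```python
from typing import Iterable, Iterator


def iter_gtf_records(lines: Iterable[str], feature: str = "exon") -> Iterator[dict]:
    """Yield one dict per matching GTF line without loading the whole file."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature:
            continue
        attributes = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition(" ")
            attributes.setdefault(key, value.strip('"'))
        yield {
            "chrom": fields[0],          # GTF column 1 (seqname)
            "start": int(fields[3]),     # 1-based inclusive
            "end": int(fields[4]),
            "strand": fields[6],
            "gene_id": attributes.get("gene_id"),
            "transcript_id": attributes.get("transcript_id"),
        }
```

Because the function accepts any iterable of lines, it can be fed an open file handle (or a text-mode `gzip.open` handle), so memory use stays proportional to one line rather than the full annotation.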
Related
- Canonical Splice Prediction — primary consumer
- Multimodal Feature Engineering — consumes annotations and resources
- Variant Effect Analysis — consumes reference FASTA and annotations
- Splice Prediction Guide
- Roadmap: Phase 2