
Adaptive Splice Prediction (M1/M2)

Goals served: adaptive splice prediction

Tier: Active

Last updated: 2026-04


Problem

Base-layer predictions are canonical: they capture the splice sites seen during training but miss context-dependent variation (tissue specificity, alternative sites, variant-induced events). The meta layer refines base predictions with multimodal evidence (conservation, junction support, RBP binding, chromatin accessibility, etc.) and specializes for two task regimes. M1 (canonical) sharpens what the base model already sees. M2 (alternative) recovers sites the base model misses on annotation sources richer than MANE.

User-facing functionality

  • Train an M1-S sequence-level adaptive model on MANE canonical labels
  • Train an M2-S sequence-level adaptive model on Ensembl/GENCODE alternative labels
  • Produce context-aware per-nucleotide predictions that combine base model scores with multimodal features via a logit-space blend
  • Evaluate calibration, modality importance (SHAP / gain), and OOD generalization to alternative sites
  • Ablate individual modalities to quantify contribution
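
For the calibration evaluation listed above, one common summary is the expected calibration error (ECE). The sketch below is a minimal, dependency-free illustration and not the project's evaluation code: the function name, the equal-width binning, and the toy scores and labels are all assumptions made for this example.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean gap between predicted probability and
    observed positive rate, over equal-width probability bins.

    Hypothetical helper for illustration; not part of the package API.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - frac_pos)
    return ece


# Toy per-nucleotide scores for one class (e.g. "donor") and 0/1 labels.
# These values are made up for the example.
toy_probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

toy_ece = expected_calibration_error(toy_probs, toy_labels)
print(f"ECE on toy data: {toy_ece:.3f}")
```

A lower value means predicted probabilities track observed frequencies more closely. Because splice sites are rare relative to non-sites, a real evaluation would likely need per-class scoring and a binning scheme that accounts for the class imbalance.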

Driving examples

src/ surface

  • agentic_spliceai.splice_engine.meta_layer.core.feature_schema — canonical 116-column schema
  • agentic_spliceai.splice_engine.meta_layer.models.sequence_model — 2-stream dilated CNN with logit-space blend
  • agentic_spliceai.splice_engine.meta_layer.training.* — training loops, checkpointing, loss functions
  • agentic_spliceai.splice_engine.meta_layer.inference.* — inference utilities
  • agentic_spliceai.splice_engine.eval.splitting — balanced chromosome split following the SpliceAI convention
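
To illustrate the idea behind a chromosome-level split, the sketch below keeps every chromosome entirely on one side of the train/test boundary and greedily fills the test set up to a target share of genes. It is an assumption about what "balanced" could mean here, not the logic of the eval.splitting module: the function name, the greedy rule, and the gene counts are hypothetical.

```python
def balanced_chromosome_split(gene_counts, test_fraction=0.2):
    """Assign whole chromosomes to train or test.

    Hypothetical helper: walks chromosomes from largest to smallest and
    adds one to the test set whenever doing so keeps the test share of
    genes at or below `test_fraction`.
    """
    target = sum(gene_counts.values()) * test_fraction
    test, test_total = [], 0
    ordered = sorted(gene_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for chrom, n_genes in ordered:
        if test_total + n_genes <= target:
            test.append(chrom)
            test_total += n_genes
    train = sorted(c for c in gene_counts if c not in test)
    return train, sorted(test)


# Made-up gene counts per chromosome, for illustration only.
toy_counts = {"chr1": 40, "chr2": 30, "chr3": 20, "chr4": 10}
train_chroms, test_chroms = balanced_chromosome_split(toy_counts, test_fraction=0.3)
print("train:", train_chroms)
print("test:", test_chroms)
```

Splitting by chromosome rather than by individual site keeps neighbouring, highly correlated positions from appearing on both sides of the split.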

Evaluation

Key finding: the logit-space residual blend (v2) exceeds the base model on both canonical sites (PR-AUC 0.9954 vs. 0.99) and alternative sites (0.775 vs. 0.749). The v1 probability-space blend hurt performance on alternative sites because it overcommitted to the meta-CNN when the meta-CNN was uncertain.
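
The difference between the two blends can be seen in a small numeric sketch. It assumes a three-class per-nucleotide output (neither, acceptor, donor), a fixed mixing weight w for the probability-space blend, and a meta-CNN that emits an additive logit correction for the residual blend. These forms, the function names, and the numbers are assumptions for illustration, not the exact formulation in sequence_model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prob_space_blend(p_base, p_meta, w=0.5):
    """Assumed v1-style blend: convex mix of the two probability vectors."""
    return [(1 - w) * b + w * m for b, m in zip(p_base, p_meta)]


def logit_residual_blend(p_base, meta_residual, alpha=1.0, eps=1e-7):
    """Assumed v2-style blend: add a meta correction to the base log-probs."""
    base_logits = [math.log(max(p, eps)) for p in p_base]
    return softmax([b + alpha * r for b, r in zip(base_logits, meta_residual)])


# Base model is confident this position is an acceptor.
p_base = [0.05, 0.90, 0.05]  # (neither, acceptor, donor)

# An uncertain meta model: uniform probabilities, or a zero residual.
v1 = prob_space_blend(p_base, [1 / 3, 1 / 3, 1 / 3])
v2 = logit_residual_blend(p_base, [0.0, 0.0, 0.0])

print("v1 acceptor prob:", round(v1[1], 4))
print("v2 acceptor prob:", round(v2[1], 4))
```

Under these assumptions, an uninformative meta model still pulls the v1 output toward uniform because the weight w is spent regardless of its confidence, while a zero residual in v2 leaves the base prediction unchanged. The meta model only moves the v2 output when it emits a non-zero correction.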

Maturity tier and signals

Current tier: Active

Signals supporting the tier:

  • 11 example scripts spanning baseline → training → evaluation → diagnostics
  • M1-S v2 and M2-S trained with reproducible results
  • Ablation and calibration analyses committed
  • Stable args for training scripts since logit-blend refactor (April 2026)
  • Depended on by Variant Effect Analysis and (planned) Novel Isoform Discovery
  • Pod-based training ops scripts codify reproducibility

Graduation signals

To advance to Mature, the application needs:

  • Canonical driver script for production inference (currently evaluation scripts serve this role)
  • Inference-path test coverage in tests/
  • M3-S (novel site discovery) trained and evaluated, demonstrating the full M1-M4 framework
  • Documented stable CLI wrapping 07_train_sequence_model.py or a dedicated inference entry point

Known limitations

  • M2-S OOD generalization degrades on unseen genes (see ood_generalization.md)
  • Junction modality coverage uneven across tissues (see junction_coverage_findings.md)
  • shap package broken with numpy 2.4 — use XGBoost pred_contribs=True instead
  • Inference requires pod for large-scale work; local inference only feasible per-chromosome