Novel Isoform Discovery (M3)¶
Goals served: isoform discovery (primary project goal)
Tier: Incubating
Last updated: 2026-04
Problem¶
Canonical annotations (MANE, RefSeq) capture only ~10% of biologically active splice sites. The remaining ~90% include disease-specific, tissue-specific, and variant-induced isoforms — the targets of greatest scientific and therapeutic interest. The M3 task is to identify these novel splice sites that are absent from canonical annotations but supported by orthogonal evidence (cross-species conservation, RNA-seq junction reads, chromatin features).
Critically, the M3 model must be trained with junction reads held out as a label, not a feature — otherwise it leaks information about the very sites it should discover.
User-facing functionality (planned)¶
- Predict high-confidence novel splice sites genome-wide, ranked by multimodal evidence
- Filter by tissue context (brain, immune, cancer) via GTEx junction overlap
- Produce per-site evidence cards (conservation, RBP binding, chromatin, foundation-model embeddings)
- Feed ranked candidates into downstream isoform assembly and agentic validation
Driving examples (planned)¶
- (none yet — M3 training not started)
The existing meta_m3_novel.yaml feature config (junction excluded)
provides the feature-pipeline side; the training and inference scripts
are pending.
src/ surface (existing, ready for M3 training)¶
agentic_spliceai.splice_engine.features.*— already supports junction-as-label via themeta_m3_novelconfig profileagentic_spliceai.splice_engine.meta_layer.core.feature_schema— M3 variant already scoped- Training / inference modules — to be added, likely extending the
existing
sequence_modelandtraining/*infrastructure
Evaluation (planned)¶
- Dataset: MANE-negative, annotation-rich positions with junction support held out as label
- Baselines: base model (canonical only), M1-S v2, random forest on conservation + chromatin
- Key metrics: recall of GTEx-supported non-canonical junctions, conservation enrichment of top-K candidates, literature-confirmed rate
- Validation: cross-tissue RNA-seq junction verification (GTEx v8
already available at
data/mane/GRCh38/junction_data/)
Maturity tier and signals¶
Current tier: Incubating
Signals supporting the tier:
- Feature config (
meta_m3_novel.yaml) exists with junction excluded - GTEx v8 junction data aggregated (353K junctions, 54 tissues)
- M3-S training architecture scoped in meta-layer design docs
- No scripts yet, no results yet — pre-implementation
Graduation signals¶
To advance to Active, the application needs:
- First training script working end-to-end (even at small scale)
- Baseline evaluation against random forest or M1-S-trained-on-negatives
- Reproducible benchmark on held-out chromosome(s)
- Initial results committed to
examples/meta_layer/results/
Known limitations (anticipated)¶
- Label definition is non-trivial: "novel" is a negative space (MANE-negative but evidence-positive). Label noise will dominate.
- GTEx junction support varies by tissue and depth — low-expression novel sites will be missed
- Adjacent application (M4 induced sites) overlaps conceptually but has a different training signal
Related¶
- Adaptive Splice Prediction — prerequisite for M3 training (shares infrastructure)
- Multimodal Feature Engineering —
meta_m3_novel.yamlconfig - Agentic Validation — downstream consumer of M3 candidates
- Meta Layer Methods — M1-M4 framework including M3 design
- Use cases: Neurology, Oncology — intended clinical applications
- Roadmap: Phase 9