Agentic-SpliceAI — Development Roadmap¶

North Star: Enable novel isoform discovery for drug target identification by building a multi-layer pipeline that goes beyond canonical splice annotations to uncover disease-specific, tissue-specific, and variant-induced RNA isoforms with therapeutic potential.

Phase Overview¶

Phase	Description	Status
1	Base Layer	Done
2	Data Preparation	Done
2.5	Bioinformatics Lab UI	Done
3	Workflow Orchestration	Done
4	Feature Engineering & Multimodal Evidence	Done
5	Foundation Models	Experimental
6	Meta Layer Training	Active Research
7	Agentic Validation Layer	Planned
8	Variant Analysis	Planned
9	Isoform Discovery	Ultimate Goal
10+	Drug Target Validation & Deployment	Future

Phase Details¶

Phase 1: Base Layer — COMPLETE¶

Port SpliceAI and OpenSpliceAI prediction engines
Set up genomic resources (GTF, FASTA, annotations)
Build BaseModelRunner with data preparation
Deliverable: Canonical splice site predictions (MANE baseline)

Phase 2: Data Preparation — COMPLETE¶

Data preparation module with CLI (agentic-spliceai-prepare)
MANE annotation support for OpenSpliceAI consistency
Deliverable: Production-ready data pipeline

Phase 2.5: Bioinformatics Lab UI — COMPLETE¶

Gene Browser (browse, search, filter ~19K genes)
Metrics Dashboard (evaluation results, model comparison)
Genome View (on-demand prediction, 3-track Plotly visualization)
LRU prediction cache + peak-preserving downsampling
Deliverable: FastAPI + Jinja2 + Plotly.js web service at server/bio/ (port 8005)

Phase 3: Workflow Orchestration — COMPLETE¶

Chunking and checkpointing for genome-scale processing (PredictionWorkflow)
Artifact management (ArtifactManager with atomic writes)
Mode-aware output paths, evaluation metrics
Deliverable: Production base layer with full workflows
Verified: chr22 -- 423 genes, 17.6M positions, 5 chunks, 12.4 min

Phase 4: Feature Engineering & Multimodal Evidence — COMPLETE¶

Modality protocol with auto-registration (FeaturePipeline)
9 modalities with 100 feature columns:
base_scores (43), annotation (3), sequence (3), genomic (4)
conservation (9), epigenetic (12), junction (12), rbp_eclip (8), chrom_access (6)
Genome-scale FeatureWorkflow with --augment for incremental modality addition
YAML-driven config system with 4 profiles
Position alignment verification (features/verification.py)
Deliverable: 9-modality feature pipeline -- 100 feature columns
Verified: Full-genome feature generation across 17 chromosomes
See: examples/features/docs/ for per-modality tutorials

Phase 5: Foundation Models — EXPERIMENTAL¶

Evo2-based exon classifier, HDF5 embedding cache
Device-aware quantization routing, 4 classifier architectures
SkyPilot + RunPod cloud workflows
Deliverable: Independent sub-project at foundation_models/

Phase 6: Meta Layer Training — ACTIVE RESEARCH¶

Hierarchical multi-task prediction framework — shared 9-modality feature infrastructure with specialized model heads for progressively harder tasks:

Variant	Purpose	Status
M1	Canonical Classification	XGBoost baseline 99.78% acc, PR-AUC 0.999/0.998
M2a	Ensembl vs MANE evaluation — predict alternative splice sites in (Ensembl MANE) that the base model never saw	Next priority
M2	Alternative Splice Sites (tissue-specific, isoform-specific)	Planned
M3	Novel Site Discovery (junction as held-out target)	Planned
M4	Perturbation-Induced (variant/disease/treatment effects)	Planned

M2a is the strongest justification for multimodality: M1 already achieves 99.7% on canonical sites (little room to prove multimodality's value). Ensembl-only sites are where the base model is weakest — the delta (meta - base) at these positions directly measures the value of multimodal evidence fusion. Requires: Ensembl/GRCh38 annotations, set difference computation, base score evaluation, then full meta-layer rescue + ablation.

Key Insight: Junction support is the #2 feature by SHAP (31.3%), reducing false negatives by 60-70%.

See: examples/meta_layer/docs/meta_model_variants_m1_m4.md for the full M1-M4 framework and M2a evaluation design

Phase 7: Agentic Validation Layer — PLANNED¶

Literature Agent (PubMed, arXiv, splice databases)
Expression Agent (GTEx, TCGA, ENCODE)
Clinical Agent (ClinVar, COSMIC, disease associations)
Conservation Agent (cross-species PhyloP)
Nexus Research Agent orchestration
Self-improvement feedback loop (validation results refine meta layer)
Deliverable: AI-validated predictions with biological context

Phase 8: Variant Analysis — PLANNED¶

VCF processing and variant-induced splicing analysis
Pathogenicity scoring for splice-affecting variants
Clinical interpretation and reporting
ClinVar integration and VUS reclassification

Phase 9: Isoform Discovery — ULTIMATE GOAL¶

Novel splice site detector (high-delta-score sites beyond MANE)
Isoform reconstruction (virtual transcripts from predicted splice sites)
RNA-seq junction validation across GTEx tissues
Confidence scoring with multi-source evidence
Drug target pipeline: isoform -> druggability assessment -> lead candidates

Phase 10+: Drug Target Validation & Deployment — FUTURE¶

Druggability assessment for novel isoform targets
Biomarker development (liquid biopsy, companion diagnostics)
Production platform deployment
Cloud-native scaling and API services

Success Metrics¶

Discovery Metrics (Phase 9)¶

100+ novel isoforms discovered with high confidence
70% RNA-seq junction validation rate across GTEx tissues
50% literature confirmation for top candidates

Clinical Metrics (Phases 8-9)¶

30% of VUS variants reclassified through splice impact analysis
90% diagnostic accuracy for splice-affecting variants

Foundation Model Metrics (Phase 5)¶

0.9 AUROC for exon boundary classification
10K+ base pairs per second inference on A40 GPU

Last Updated: March 2026