Alternative Splice Site Prediction: Methodology Roadmap¶

Created: December 2025
Status: Active Development
Last Updated: December 2025

Overview¶

This document tracks the progressive development of methods for detecting and predicting alternative splice sites induced by genetic variants.

Goal¶

Predict whether and how a genetic variant affects splicing patterns, going beyond what current base models (SpliceAI, OpenSpliceAI) can detect.

Challenge¶

Base models are trained on canonical splice sites and fail to capture many variant-induced alternative splice sites documented in SpliceVarDB.

Method Taxonomy¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ALTERNATIVE SPLICE SITE PREDICTION                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ PAIRED DELTA PREDICTION (Siamese)                                      │  │
│  │                                                                        │  │
│  │ • Input: ref_seq + alt_seq (BOTH needed)                              │  │
│  │ • Target: base_model(alt) - base_model(ref)                           │  │
│  │ • Output: [L, 2] per-position deltas                                  │  │
│  │                                                                        │  │
│  │ Status: Tested, r=0.38 correlation (not sufficient)                   │  │
│  │ Limitation: Learning from potentially inaccurate base model deltas    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ VALIDATED DELTA PREDICTION (Single-Pass with Ground Truth)             │  │
│  │                                                                        │  │
│  │ • Input: alt_seq + variant_info (ref_base, alt_base)                  │  │
│  │ • Target: SpliceVarDB-validated delta (ground truth filtering)        │  │
│  │ • Output: Δ directly (single forward pass)                            │  │
│  │                                                                        │  │
│  │ Status: ✅ IMPLEMENTED & TESTED - r=0.41 (BEST SO FAR!)               │  │
│  │ Advantage: Uses ground truth labels, efficient inference              │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ MULTI-STEP FRAMEWORK (Decomposed Problem)                              │  │
│  │                                                                        │  │
│  │ Step 1: Binary Classification                                         │  │
│  │   • "Is this variant splice-altering?"                                │  │
│  │   • Status: Tested, AUC=0.61, F1=0.53 (needs >0.7)                   │  │
│  │                                                                        │  │
│  │ Step 2: Effect Type Classification                                    │  │
│  │   • "What type of effect?" (gain/loss, donor/acceptor)               │  │
│  │   • Status: NOT YET IMPLEMENTED                                       │  │
│  │                                                                        │  │
│  │ Step 3: Position Localization                                         │  │
│  │   • "Where in the window is the effect?"                              │  │
│  │   • Status: NOT YET IMPLEMENTED                                       │  │
│  │                                                                        │  │
│  │ Step 4: Delta Magnitude                                               │  │
│  │   • "How strong is the effect at that position?"                      │  │
│  │   • Status: NOT YET IMPLEMENTED                                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Development Timeline¶

Canonical Classification (Completed ❌)¶

Goal: Improve splice site classification using meta-learning on base model artifacts

Approach: - Multimodal model: sequence (CNN) + tabular features (base model scores) - Labels: GTF canonical splice sites (donor, acceptor, neither) - Sample weights from SpliceVarDB

Results: - Classification accuracy: 99% - Variant detection: 17% (FAILED)

Conclusion: High accuracy on canonical sites does NOT transfer to variant detection.

Documentation: docs/experiments/001_canonical_classification/

Paired Delta Prediction (Completed ⚠️)¶

Goal: Predict delta scores directly using Siamese architecture

Approach: - Input: ref_seq + alt_seq (paired) - Target: base_model(alt) - base_model(ref) - Architecture: Gated CNN with dilated convolutions

Variations Tested:

Variation	Correlation	Notes
V2 Original	r=-0.04	No learning
V2 + 10x data	r=0.002	Still no correlation
Gated CNN	r=0.36	Architecture matters
+ Quantile loss	r=0.38	Best for this approach
+ Scaling	r=0.22	Overfitting
+ Temperature	r=-0.03	No improvement
+ Multi-task	r=-0.07	Task interference

Conclusion: Moderate correlation achieved (r=0.38) but: 1. Targets (base model deltas) may be inaccurate 2. Not sufficient for practical use

Documentation: docs/experiments/002_delta_prediction/

Validated Delta Prediction (COMPLETED ✅)¶

Status: Implemented and tested
Result: r=0.41 correlation (best so far!)
Goal: Use SpliceVarDB classifications to derive ground-truth delta targets

Key Difference from Paired: - Paired: Target = base_model(alt) - base_model(ref) (possibly inaccurate) - Validated: Target = SpliceVarDB-validated delta (ground truth filtering)

Approach:

Input: alt_sequence + variant_info (ref_base, alt_base)
Target: Δ derived from SpliceVarDB classification
Output: Δ directly (single forward pass)

Final score = base_scores + Δ

Documentation: docs/experiments/004_validated_delta/

Multi-Step Framework (IN PROGRESS)¶

Status: Step 1 tested (needs improvement)
Goal: Decompose the problem into manageable sub-tasks

Step 1: Binary Classification - Question: "Is this variant splice-altering?" - Results: AUC=0.61, F1=0.53 (needs F1 > 0.7) - Status: Needs improvement

Step 2-4: Not yet implemented

Documentation: docs/MULTI_STEP_FRAMEWORK.md

Current Implementation Status¶

Models Implemented¶

Model	File	Purpose	Status
`DeltaPredictorV2`	`delta_predictor_v2.py`	Paired prediction	Tested
`SimpleCNNDeltaPredictor`	`hyenadna_delta_predictor.py`	Gated CNN encoder	Tested
`ValidatedDeltaPredictor`	`validated_delta_predictor.py`	Single-pass validated	BEST
`SpliceInducingClassifier`	`splice_classifier.py`	Binary classification	Tested
`EffectTypeClassifier`	`splice_classifier.py`	Multi-class effects	Implemented
`UnifiedSpliceClassifier`	`splice_classifier.py`	Multi-task	Implemented

Next Steps (Prioritized)¶

Immediate (M1 Mac)¶

Scale Validated Delta with more training data
Improve Multi-Step Step 1 (Binary Classification)
Try larger context (1001nt vs 501nt)
Add position-aware features
Data augmentation (reverse complement)
Target: F1 > 0.7
Document methodology choices

With GPU (RunPods)¶

HyenaDNA encoder for all approaches
Full SpliceVarDB (~50K variants)
Cross-validation on larger scale

Compute Resources¶

Available¶

Environment	Specs	Suitable For
MacBook M1	16GB RAM, MPS	Quick iterations, small models

Needed (RunPods)¶

Environment	Specs	Suitable For
RTX 4090	24GB VRAM	HyenaDNA-small, larger batches
A40	48GB VRAM	HyenaDNA-medium
A100	80GB VRAM	HyenaDNA-large, fine-tuning

References¶

SpliceVarDB: Source of ground-truth variant effect labels
LABELING_STRATEGY.md: Detailed approach descriptions
ARCHITECTURE.md: Model architectures

This roadmap will be updated as methodology development progresses.