Experiment 001: Canonical Classification
Date: December 2025
Status: Completed
Outcome: Partial Success (High accuracy, poor variant detection)
Hypothesis
Training a multimodal meta-layer on canonical splice sites (from GTF annotations) will improve splice site classification and generalize to variant-induced splice changes.
Setup
Data
- Training: Base layer artifacts (`analysis_sequences_*.tsv`)
- Labels: GTF canonical splice sites (donor, acceptor, neither)
- Evaluation: SpliceVarDB variants
Model
- Architecture: `MetaSpliceModel` (multimodal)
- Sequence Encoder: CNN (lightweight)
- Score Encoder: MLP
- Fusion: Concatenation
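The architecture listed above can be illustrated with a minimal PyTorch sketch. This is not the code in `models/meta_splice_model.py`: the class name `MetaSpliceModelSketch`, the 101-bp window, the 8 score features, the hidden size, and the kernel sizes are all assumptions chosen only to show how a CNN sequence branch and an MLP score branch can be fused by concatenation.

```python
import torch
import torch.nn as nn


class MetaSpliceModelSketch(nn.Module):
    """Hypothetical stand-in: CNN sequence encoder + MLP score encoder, fused by concatenation."""

    def __init__(self, n_score_features: int = 8, hidden: int = 32, n_classes: int = 3):
        super().__init__()
        # Lightweight sequence encoder: one-hot DNA (4 channels) -> pooled feature vector.
        self.seq_encoder = nn.Sequential(
            nn.Conv1d(4, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse the length axis
            nn.Flatten(),             # -> (batch, hidden)
        )
        # Score encoder: base-model scores (feature count is assumed) -> hidden vector.
        self.score_encoder = nn.Sequential(
            nn.Linear(n_score_features, hidden),
            nn.ReLU(),
        )
        # Classification head over the concatenated representations.
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, seq_onehot: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.seq_encoder(seq_onehot), self.score_encoder(scores)], dim=1
        )
        return self.head(fused)  # logits for donor / acceptor / neither


model = MetaSpliceModelSketch()
# Random inputs with an assumed shape: (batch, 4, window) and (batch, n_score_features).
logits = model(torch.randn(64, 4, 101), torch.randn(64, 8))
```

Concatenation keeps the two modalities independent until the final linear layer, which is the simplest fusion choice; it also means the head can lean almost entirely on the score branch if that branch is already highly predictive.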
Training
- Epochs: 30
- Batch Size: 64
- Optimizer: AdamW (lr=1e-4)
- Loss: CrossEntropyLoss with class weights
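The write-up does not say how the class weights were derived. The sketch below assumes inverse-frequency weighting, one common choice when the "neither" class dominates, and computes the weighted loss by hand in plain Python. The helper names and the class order (0 = donor, 1 = acceptor, 2 = neither) are hypothetical. The normalization by the summed target weights mirrors how `torch.nn.CrossEntropyLoss` behaves with `weight=` and the default `reduction="mean"`.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, n_classes):
    """Weight each class by N / (n_classes * count), so rare classes count more."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted cross-entropy, averaged over the summed weights of the targets."""
    num = 0.0
    den = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        num += weights[y] * (log_z - row[y])
        den += weights[y]
    return num / den


# Toy, imbalanced labels (assumed order: 0 = donor, 1 = acceptor, 2 = neither).
labels = [0, 1, 2, 2, 2, 2, 2, 2]
weights = inverse_frequency_weights(labels, n_classes=3)
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]] + [[0.1, 0.1, 2.0]] * 6
loss = weighted_cross_entropy(logits, labels, weights)
```

With these toy labels the donor and acceptor classes each receive six times the weight of the "neither" class, so errors on the rare splice-site classes dominate the gradient.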
Results
Classification Performance
| Metric | Value |
|---|---|
| Accuracy | 99.11% |
| PR-AUC (macro) | 0.987 |
| Per-class AP | D: 0.98, A: 0.98, N: 0.99 |
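The source does not state which estimator produced the PR-AUC figures. The sketch below assumes the common step-wise average precision (AP), computed one-vs-rest per class and then averaged without weighting to get the macro value. Function names and the toy data are hypothetical, and tied scores are ranked arbitrarily here, whereas library implementations usually group them by threshold.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / n_pos if n_pos else float("nan")


def macro_pr_auc(labels, probs, n_classes=3):
    """One-vs-rest AP per class, then the unweighted mean across classes."""
    per_class = []
    for c in range(n_classes):
        y_true = [1 if y == c else 0 for y in labels]
        y_score = [p[c] for p in probs]
        per_class.append(average_precision(y_true, y_score))
    return sum(per_class) / n_classes, per_class


# Toy predictions (assumed class order: donor, acceptor, neither).
labels = [0, 1, 2, 2, 0, 1]
probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.7, 0.1, 0.2],  # a "neither" site that scores high as donor
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
macro, per_class = macro_pr_auc(labels, probs)
```

Because the macro average treats all three classes equally, a drop in donor or acceptor AP is not hidden by the much larger "neither" class, which is one reason to report it alongside accuracy.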
Variant Detection
| Metric | Base Model | Meta-Layer |
|---|---|---|
| Detection Rate | 67% | 17% |
| Mean \|Δ\| (SA) | | |
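The detection criterion is not defined in this write-up. As a hedged illustration, the sketch below assumes a variant counts as detected when the largest absolute change between reference and alternate splice scores in its window reaches a threshold. The threshold of 0.1, the function names, and the toy score vectors are all assumptions.

```python
def max_abs_delta(ref_scores, alt_scores):
    """Largest absolute change in splice score across the window."""
    return max(abs(a - r) for r, a in zip(ref_scores, alt_scores))


def detection_rate(variants, threshold=0.1):
    """Fraction of known splice-altering variants whose max |delta| reaches the threshold."""
    detected = sum(
        1 for ref, alt in variants if max_abs_delta(ref, alt) >= threshold
    )
    return detected / len(variants)


# Each entry is (reference scores, alternate scores) over a short toy window.
variants = [
    ([0.90, 0.05, 0.02], [0.30, 0.05, 0.02]),  # strong loss at position 0
    ([0.90, 0.05, 0.02], [0.88, 0.06, 0.02]),  # scores barely move
    ([0.10, 0.80, 0.03], [0.10, 0.79, 0.20]),  # gain at position 2
]
rate = detection_rate(variants)
```

Under this definition, a model whose outputs saturate near 0 or 1 for canonical sites produces small deltas and therefore a low detection rate, even when its classification accuracy is high.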
Analysis
Why Classification Succeeded
- Canonical sites are well-defined (clear GT/AG patterns)
- Base model scores are highly predictive
- Task is relatively easy (random-guessing baseline for three classes: ≈33%)
Why Variant Detection Failed
- Training-evaluation mismatch: Trained on canonical sites, evaluated on variants
- No variant-specific learning: Model never sees variant effects during training
- Confidence suppression: Model becomes overly confident in canonical labels
Key Insight
High classification accuracy on canonical sites does NOT translate to variant effect detection.
The meta-layer essentially learns to replicate base model behavior, which is already good at canonical sites but blind to variant-induced changes.
Lessons Learned
- Train for the evaluation task: If the goal is variant detection, the model must be trained on variant data
- SpliceVarDB is essential: It is the only source of ground-truth variant effects
- A different approach is needed: Move from classification to delta prediction
Next Steps
This experiment motivated the pivot to:
1. Paired Delta Prediction (Experiment 002)
2. Validated Delta Prediction (Experiment 004) - BEST
Files
| File | Description |
|---|---|
| `models/meta_splice_model.py` | Model implementation |
| `workflows/canonical_training.py` | Training workflow |