Experiment 004: Validated Delta Prediction (Single-Pass)

Date: December 2025
Status: ✅ Completed
Outcome: Pearson r = 0.41 on splice-altering samples (best so far)

Key Innovation

This experiment addresses a fundamental limitation of paired prediction: the base model's deltas may be inaccurate for variants that do not alter splicing.

The Problem with Paired Prediction

Paired Prediction (Previous Approach):
  Target = base_model(alt) - base_model(ref)

  Issue: If the variant is NOT splice-altering but the base model predicts
         a delta anyway, we train on wrong labels!

Our Solution: Validated Delta Targets

Validated Delta Prediction:
  If SpliceVarDB says "Splice-altering":
    Target = base_model(alt) - base_model(ref)  # Trust base model

  If SpliceVarDB says "Normal":
    Target = [0, 0, 0]  # Override base model - no effect!

  If SpliceVarDB says "Low-frequency" or "Conflicting":
    SKIP  # Uncertain, don't train on it
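
The labeling rule above can be sketched as a small function. This is a minimal illustration, not the project's implementation: the function name, the label constants, and the choice to raise on unknown labels are assumptions, and the exact label spelling in a real SpliceVarDB export may differ.

```python
from typing import List, Optional, Sequence

# Label strings follow the categories named above; their exact spelling in
# a real SpliceVarDB export is an assumption.
SPLICE_ALTERING = "Splice-altering"
NORMAL = "Normal"
SKIPPED = {"Low-frequency", "Conflicting"}


def validated_delta_target(
    label: str,
    ref_scores: Sequence[float],
    alt_scores: Sequence[float],
) -> Optional[List[float]]:
    """Return the training target for one variant, or None to skip it."""
    if label == SPLICE_ALTERING:
        # Trust the base model: target is alt minus ref, per output channel.
        return [a - r for a, r in zip(alt_scores, ref_scores)]
    if label == NORMAL:
        # Override the base model: a validated-normal variant has no effect.
        return [0.0] * len(ref_scores)
    if label in SKIPPED:
        # Uncertain label: exclude the variant from training.
        return None
    raise ValueError(f"unexpected label: {label!r}")


# Toy base-model outputs for one variant (three output channels).
ref = [0.25, 0.5, 0.25]
alt = [0.75, 0.125, 0.125]
print(validated_delta_target("Splice-altering", ref, alt))  # [0.5, -0.375, -0.125]
print(validated_delta_target("Normal", ref, alt))           # [0.0, 0.0, 0.0]
print(validated_delta_target("Conflicting", ref, alt))      # None
```

Returning None for skipped variants lets the dataset builder filter them out in one place rather than masking them in the loss.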

Results

Correlation (Splice-altering samples only)

Model	Pearson r	p-value
Paired (Siamese)	0.38	-
Validated (Single-Pass)	0.41	1.4e-07

Improvement: +0.03 in Pearson r (about 8% relative to the paired baseline)
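
For reference, the arithmetic behind the comparison can be reproduced in a few lines. The sketch below is illustrative only: `pearson_r` is a plain textbook implementation, and how predicted and target deltas are flattened into two sequences before correlating is an assumption not stated in this report.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def relative_improvement(baseline: float, new: float) -> float:
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Reported values: paired r = 0.38, validated r = 0.41.
rel = relative_improvement(0.38, 0.41)
print(f"{rel:.1%}")  # 7.9%
```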

Binary Discrimination: Splice-Altering (SA) vs Normal

Metric	Value
ROC-AUC	0.58
PR-AUC	0.62
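
As an aid to reading the ROC-AUC value: it equals the probability that a randomly chosen SA variant receives a higher score than a randomly chosen Normal variant, with ties counted as half. The sketch below computes it directly from that definition. The per-variant score itself (for example, the largest absolute predicted delta) is an assumption, since this report does not state how a single score is derived from the three-channel output.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores: 3 of the 4 (SA, Normal) pairs are ordered correctly.
print(roc_auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine at the scale of this experiment; a rank-based implementation would be preferable for much larger sets.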

Detection at Threshold = 0.1

Metric	Value
SA detected	18.7%
False positives	6.0%
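
One way such rates could be computed is sketched below. Two assumptions are made that this report does not confirm: a variant counts as "detected" when its largest absolute predicted delta exceeds the threshold, and "False positives" is the fraction of Normal variants flagged this way. Function names are hypothetical.

```python
from typing import Sequence, Tuple


def variant_score(delta: Sequence[float]) -> float:
    """Collapse a per-channel delta into one score (assumed: max |delta|)."""
    return max(abs(d) for d in delta)


def detection_rates(
    sa_deltas: Sequence[Sequence[float]],
    normal_deltas: Sequence[Sequence[float]],
    threshold: float = 0.1,
) -> Tuple[float, float]:
    """Return (fraction of SA variants flagged, fraction of Normal variants flagged)."""
    sa_hits = sum(variant_score(d) > threshold for d in sa_deltas)
    false_pos = sum(variant_score(d) > threshold for d in normal_deltas)
    return sa_hits / len(sa_deltas), false_pos / len(normal_deltas)


# Toy predictions: 1 of 2 SA variants and 1 of 4 Normal variants exceed 0.1.
sa = [[0.5, 0.0, 0.0], [0.05, 0.02, 0.0]]
normal = [[0.0, 0.01, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.05]]
print(detection_rates(sa, normal))  # (0.5, 0.25)
```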

Why It Works Better

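Ground-truth filtering: SpliceVarDB labels decide which base-model deltas are trusted as targets
No false learning: the model never trains on nonzero base-model deltas for variants validated as Normal
Cleaner signal: Normal variants always have a zero delta target
Single-pass efficiency: inference needs only the alternate sequence plus the ref/alt bases, with no forward pass over the reference sequence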

Architecture

                    ValidatedDeltaPredictor
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  alt_seq [B, 4, 501] ──→ [Gated CNN (6 layers)] ──→ [B, 128]   │
│                                                      │          │
│  ref_base [B, 4] ──┬──→ [MLP Embed] ──→ [B, 128]    │          │
│  alt_base [B, 4] ──┘                       │         │          │
│                                            └────┬────┘          │
│                                                 │               │
│                                         concat [B, 256]         │
│                                                 │               │
│                                         [Delta Head]            │
│                                                 │               │
│                                          Δ [B, 3]               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
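
The diagram can be turned into a rough PyTorch sketch to make the tensor shapes concrete. This is not the code in `models/validated_delta_predictor.py`: the gated block design (GLU-style gating with a residual connection), kernel size, 1x1 stem, global average pooling, and MLP sizes are all assumptions, so the parameter count will not match the reported figure. Only the input and output shapes, the six encoder layers, the hidden size of 128, and the dropout of 0.1 come from this document.

```python
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """Residual gated 1-D convolution (GLU-style). The block design is assumed."""

    def __init__(self, channels: int, kernel_size: int = 9, dropout: float = 0.1):
        super().__init__()
        # Emit 2 * channels so one half can gate the other half.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.conv(x).chunk(2, dim=1)
        return x + self.dropout(value * torch.sigmoid(gate))


class ValidatedDeltaPredictorSketch(nn.Module):
    """Shape-faithful sketch of the diagram above; internals are assumptions."""

    def __init__(self, hidden_dim: int = 128, n_layers: int = 6, dropout: float = 0.1):
        super().__init__()
        self.stem = nn.Conv1d(4, hidden_dim, kernel_size=1)
        self.encoder = nn.Sequential(
            *[GatedConvBlock(hidden_dim, dropout=dropout) for _ in range(n_layers)]
        )
        # ref_base and alt_base (one-hot, 4 each) are concatenated before the MLP.
        self.var_embed = nn.Sequential(
            nn.Linear(8, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.delta_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden_dim, 3),
        )

    def forward(self, alt_seq, ref_base, alt_base):
        h = self.encoder(self.stem(alt_seq))            # [B, 128, 501]
        seq_feat = h.mean(dim=2)                        # [B, 128], average pool (assumed)
        var_feat = self.var_embed(torch.cat([ref_base, alt_base], dim=1))  # [B, 128]
        return self.delta_head(torch.cat([seq_feat, var_feat], dim=1))     # [B, 3]


model = ValidatedDeltaPredictorSketch().eval()
with torch.no_grad():
    delta = model(torch.zeros(2, 4, 501), torch.zeros(2, 4), torch.zeros(2, 4))
print(delta.shape)  # torch.Size([2, 3])
```

Note that only `alt_seq` passes through the convolutional encoder, which is what makes inference single-pass; the reference allele enters only as a one-hot base.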

Training Configuration

# Data
samples = 2000 (balanced: 1000 SA, 1000 Normal)
context_size = 501 nt
test_chromosomes = ['21', '22']

# Model
hidden_dim = 128
n_layers = 6
dropout = 0.1
parameters = 3,011,843

# Training
epochs = 40
batch_size = 32
optimizer = AdamW(lr=5e-5, weight_decay=0.02)
scheduler = OneCycleLR(max_lr=5e-4)
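
A hedged sketch of how the optimizer and scheduler settings might be wired up in PyTorch follows. The stand-in model, the placeholder loss, and `STEPS_PER_EPOCH` are hypothetical; the real step count depends on the training-set size after holding out the test chromosomes. One detail worth knowing: `OneCycleLR` sets the learning rate itself, starting at `max_lr / div_factor` (default `div_factor=25`), so the `lr=5e-5` passed to `AdamW` is overwritten as soon as the scheduler is constructed.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

EPOCHS = 40
BATCH_SIZE = 32
STEPS_PER_EPOCH = 50  # hypothetical; use len(train_loader) in practice

model = nn.Linear(8, 3)  # stand-in for the real model
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.02)
scheduler = OneCycleLR(
    optimizer, max_lr=5e-4, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH
)

# The scheduler has already replaced the optimizer's lr with max_lr / 25.
start_lr = optimizer.param_groups[0]["lr"]

# One illustrative epoch; OneCycleLR is stepped after every batch, not every epoch.
for _ in range(STEPS_PER_EPOCH):
    optimizer.zero_grad()
    loss = model(torch.randn(BATCH_SIZE, 8)).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()

warm_lr = optimizer.param_groups[0]["lr"]  # still rising toward max_lr
```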

Comparison Summary

Aspect	Paired Prediction	Validated Prediction
Input	ref_seq + alt_seq	alt_seq + var_info (ref_base, alt_base)
Target source	Base model (may be wrong)	Base model delta, gated by SpliceVarDB labels
Forward passes	2	1
Correlation	r=0.38	r=0.41
Inference speed	Slower (2 passes)	Faster (1 pass)

Recommendations

Use validated targets for any delta prediction task
Increase training data: 2000 samples is a small training set
Try longer context: 501 nt → 1001 nt
Add position attention for interpretability
Scale with HyenaDNA on GPU

Files

File	Description
`models/validated_delta_predictor.py`	Model implementation
`methods/VALIDATED_DELTA_PREDICTION.md`	Design rationale

This approach is now the recommended method for delta prediction.