Alternative Splice Site Prediction — Deep Analysis¶

Created: March 2026

Context: Review of meta-layer approaches for predicting alternative splice sites induced by genetic variants and other perturbations

Table of Contents¶

[[#Prerequisite — SpliceVarDB and Trainable Signals]]
[[#Approach 1 — Canonical Classification (Experiment 001)]]
[[#Approach 2 — Paired Delta Prediction (Experiment 002)]]
[[#Approach 3 — Binary Classification (Experiment 003)]]
[[#Approach 4 — Validated Delta Prediction (Experiment 004)]]
[[#What SpliceVarDB Actually Tells Us (and What It Doesn't)]]
[[#Why This Is Not a Standard Supervised Learning Problem]]
[[#Systematic ML Formulations]]
[[#Recommendations and Roadmap]]

Prerequisite — SpliceVarDB and Trainable Signals¶

Before dissecting the models, it's worth framing what "alternative splice site" means as a learning problem and what SpliceVarDB gives us.

What SpliceVarDB provides¶

SpliceVarDB is a curated database of genetic variants with experimentally validated effects on splicing. Each record carries:

- Genomic coordinates (hg19 and hg38), parsed from strings like "1-100107682-T-C"
- Classification, one of three labels:
    - "Splice-altering": experimentally confirmed to change splicing
    - "Non-splice-altering": confirmed NOT to change splicing (called "Normal" in the experiment sections below)
    - "Low-frequency": uncertain / insufficient evidence
- Location: exonic or intronic
- Method: the experimental validation technique used

[!info] Implementation Reference See meta_layer/data/splicevardb_loader.py — VariantRecord dataclass (lines 27–58) and SpliceVarDBLoader class (lines 61–308)
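
As a rough illustration of the coordinate format above, the sketch below parses a string such as "1-100107682-T-C" into typed fields. `ParsedVariant` and `parse_variant_id` are hypothetical names used only for this example; they are not taken from splicevardb_loader.py, and the project's VariantRecord may carry different fields. The position is assumed to be 1-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedVariant:
    """Hypothetical container; not the project's VariantRecord."""
    chrom: str
    pos: int   # position as written in the string (assumed 1-based)
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> ParsedVariant:
    """Split a 'chrom-pos-ref-alt' string into typed fields."""
    parts = variant_id.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 'chrom-pos-ref-alt', got {variant_id!r}")
    chrom, pos, ref, alt = parts
    return ParsedVariant(chrom=chrom, pos=int(pos), ref=ref.upper(), alt=alt.upper())


variant = parse_variant_id("1-100107682-T-C")
```

Rejecting malformed strings up front keeps bad coordinates out of the reference/alternate sequence extraction that follows.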

The labeling gap¶

The fundamental challenge is that base models (SpliceAI, OpenSpliceAI) are trained on canonical splice sites from GTF annotations — the known GT-AG sites at exon-intron boundaries. They have no training signal for:

Whether a single nucleotide variant activates or destroys a splice site
Whether a cryptic splice site emerges from a mutation
The magnitude of the effect (how much does donor/acceptor probability shift?)

SpliceVarDB fills this gap, but it provides a categorical label (altering vs. not), while what we actually want to predict is a continuous delta (how much do the scores change?). This mismatch between available labels and desired output shapes the entire experimental trajectory.

Approach 1 — Canonical Classification (Experiment 001)¶

[!summary] Outcome 99.11% classification accuracy — but only 17% variant detection rate. Failed for the actual goal.

Label Preparation¶

Labels come directly from GTF annotations loaded via ArtifactLoader:

# dataset.py:333-337
def _encode_label(self, splice_type: str) -> int:
    return LABEL_ENCODING.get(splice_type.lower(), LABEL_ENCODING['neither'])

where LABEL_ENCODING = {'donor': 0, 'acceptor': 1, 'neither': 2, '': 2} (defined in feature_schema.py:196-201).

Each sample is a 501nt window centered on a genomic position. If that position is an annotated donor site, label = 0; if annotated acceptor, label = 1; otherwise label = 2. Classes are balanced via undersampling (dataset.py:533-567).
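
The balancing step can be sketched as follows. This is a stand-alone illustration of undersampling every class to the size of the rarest one, not the code in dataset.py; the function name `undersample_balanced` and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def undersample_balanced(labels, seed=0):
    """Return sorted sample indices with every class cut to the rarest class size."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Toy labels: 0 = donor, 1 = acceptor, 2 = neither (the majority class)
labels = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
kept = undersample_balanced(labels)
kept_labels = [labels[i] for i in kept]
```

Here every class ends up with two samples, because the donor class has only two.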

The 50+ numeric features are z-score normalized (dataset.py:230-242), and known leakage columns (splice_type, pred_type, is_correct, error_type, etc.) are explicitly excluded via FeatureSchema.LEAKAGE_COLS.

Optimization Objective¶

Cross-entropy with label smoothing (0.1) and inverse-frequency class weights:

# trainer.py:403-428
loss = F.cross_entropy(logits, labels, weight=self.class_weights, label_smoothing=0.1)

Class weights are computed as total / (3 * count_per_class), standard inverse-frequency balancing.
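
A small worked example of the weighting formula, assuming every class appears at least once. The helper name `inverse_frequency_weights` is illustrative and not from trainer.py.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes=3):
    """weight_c = total / (num_classes * count_c), as described in the text."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


imbalanced = [0] * 10 + [1] * 10 + [2] * 80
balanced = [0] * 10 + [1] * 10 + [2] * 10

w_imbalanced = inverse_frequency_weights(imbalanced)
w_balanced = inverse_frequency_weights(balanced)
```

On a perfectly balanced set all three weights equal 1, so after undersampling these weights only matter to the extent that the classes are still uneven.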

Architecture¶

MetaSpliceModel (meta_splice_model.py:21-157) is a two-stream multimodal network:

Sequence [B, 4, 501] ──→ CNN (kernels 3/5/7/11/15) ──→ GlobalAvgPool ──→ [B, 256]
                                                                                    ├── Fusion ──→ Classifier ──→ [B, 3]
Scores [B, ~50]       ──→ MLP (ScoreEncoder)         ──→                   [B, 256]

Components:

1. Sequence stream: CNN encoder processes [B, 4, 501] one-hot DNA → [B, 256] embedding via multi-scale convolutions with global average pooling
2. Score stream: MLP (ScoreEncoder) processes [B, ~50] numeric features → [B, 256]
3. Fusion: Concatenation (or optionally CrossAttentionFusion, which performs bidirectional attention between the two streams with residual connections)
4. Head: Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→3), producing class logits

Training: AdamW (lr=1e-4, weight_decay=0.01), cosine annealing LR, gradient clipping at 1.0, early stopping on val_pr_auc with patience=10.

Assessment¶

[!warning] Logically flawed for the stated goal This approach is logically flawed for variant detection, and the results confirm it.

The reason is a distribution mismatch: the model learns P(donor | sequence, scores) for canonical positions, a near-trivial task since the base model scores already encode this information with high accuracy. When asked to detect changes caused by variants, it has never seen variant-perturbed inputs during training. Worse, the cross-entropy objective rewards confident canonical predictions (label smoothing at 0.1 only caps that confidence slightly), so nothing in training rewards sensitivity to the subtle score perturbations that would signal variant effects.

However, this experiment is not useless. It established that:

- The multimodal fusion architecture works mechanically
- Leakage detection infrastructure is sound
- The feature schema is validated

It serves as an upper bound on "how well can we classify known sites" and a proof that classification of known sites is the wrong proxy task for variant detection.

Approach 2 — Paired Delta Prediction (Experiment 002)¶

[!summary] Outcome Best correlation: r=0.38 (Gated CNN + Quantile loss). Moderate but insufficient.

Label Preparation¶

Labels are now continuous deltas derived by running the base model on both reference and alternate sequences:

# variant_dataset.py:227-364
delta = alt_scores - ref_scores  # [L, 3]
# Scan ±50bp window around variant for max absolute effect
max_donor_delta = donor_deltas[np.abs(donor_deltas).argmax()]
max_acceptor_delta = acceptor_deltas[np.abs(acceptor_deltas).argmax()]

Each sample is a VariantSample containing:

- ref_sequence and alt_sequence (501nt each, centered on variant)
- delta_donor and delta_acceptor (scalar: the signed delta with the largest absolute value within ±50bp of the variant)
- weight: 2.0 for splice-altering, 1.0 for normal
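
The window scan can be sketched in plain Python. This mirrors the idea of the NumPy snippet above (keep the sign, pick by magnitude) but is not the code from variant_dataset.py; `max_abs_delta` and the toy values are illustrative.

```python
def max_abs_delta(deltas, center, radius=50):
    """Return the signed delta with the largest magnitude within center ± radius."""
    lo = max(0, center - radius)
    hi = min(len(deltas), center + radius + 1)
    window = deltas[lo:hi]
    return max(window, key=abs)


# Toy per-position donor deltas for a 501-position window; the variant sits at index 250
donor_deltas = [0.0] * 501
donor_deltas[245] = -0.62   # strong loss near the variant
donor_deltas[260] = 0.15    # weak gain near the variant
donor_deltas[400] = 0.90    # outside the ±50 window, so it is ignored

label = max_abs_delta(donor_deltas, center=250)
```

The example also shows what the scalar target discards: the position of the effect, and any effect outside the window.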

[!important] Critical limitation The target is the base model's own prediction difference, not ground truth.

Optimization Objective¶

Multiple losses were tested (delta_predictor_calibrated.py):

| Loss | Mechanism | Result |
| --- | --- | --- |
| Weighted MSE | `per_sample_MSE * class_weight`, averaged | Baseline |
| Quantile/Pinball (τ=0.9) | Underpredictions penalized 9x more than overpredictions | Best (r=0.38) |
| Output scaling | Learnable multiplicative constant `delta * scale` | r=0.22 |
| Temperature scaling | `delta / exp(log_temperature)` | r=-0.03 |
| Hybrid classification + regression | Multi-task BCE + MSE | r=-0.07 |

The quantile loss at τ=0.9 is noteworthy:

# delta_predictor_calibrated.py:226-233
loss = torch.where(
    error >= 0,
    tau * error,        # underprediction: penalized by τ=0.9
    (tau - 1) * error   # overprediction: penalized by 0.1
)

This forces the model to focus on capturing large deltas rather than regressing to the mean — important because most variants have near-zero delta.
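
To make the 9:1 asymmetry concrete, here is a minimal pure-Python pinball loss. It assumes `error = target - prediction`, which the excerpt above does not show explicitly, and averages over samples; the function name is illustrative.

```python
def pinball_loss(targets, preds, tau=0.9):
    """Mean quantile (pinball) loss with error = target - prediction."""
    total = 0.0
    for target, pred in zip(targets, preds):
        error = target - pred
        total += tau * error if error >= 0 else (tau - 1) * error
    return total / len(targets)


under = pinball_loss([1.0], [0.0])   # prediction too low by 1
over = pinball_loss([0.0], [1.0])    # prediction too high by 1
```

With τ=0.9, missing low by one unit costs 0.9 while missing high by one unit costs 0.1, so the optimum sits at the 90th percentile of the target distribution rather than its mean.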

Architecture¶

Siamese network (delta_predictor.py:35-169):

ref_seq [B,4,501] ──→ SharedEncoder ──→ ref_emb [B,256]
                                                          diff = alt - ref ──→ DeltaHead ──→ [B,2]
alt_seq [B,4,501] ──→ SharedEncoder ──→ alt_emb [B,256]

SharedEncoder (delta_predictor.py:172-248): 4 conv layers with exponentially dilated convolutions (dilation 1, 2, 4, 8), BatchNorm + ReLU, global average pooling, linear projection. Kaiming initialization on conv weights, Xavier on linear.
DeltaHead: Linear(256→256) → ReLU → Dropout → Linear(256→128) → ReLU → Dropout → Linear(128→2)

The best variant uses SimpleCNNDeltaPredictor (a GatedCNN) wrapped in QuantileDeltaPredictor.

Assessment¶

[!note] Right intuition, poisoned target The Siamese architecture is the right geometric intuition — but the target is poisoned.

By learning base_model(alt) - base_model(ref), you can never exceed the base model's accuracy. For non-splice-altering variants where the base model falsely predicts a delta, you're training on noise. This ceiling is visible in the r=0.38 plateau.

What did help:

- Gated CNN with dilated convolutions (r=0.36 vs. r=-0.04 for simple CNN)
- Quantile loss at τ=0.9 (r=0.38 vs. r=0.36): biasing toward the upper quantile forces the model to preserve signal for rare large-effect variants rather than predicting zero for everything

What failed:

- Multi-task (classification + regression hybrid, r=-0.07): the two tasks compete for representation capacity, and the classification head's gradient overwhelms the regression signal
- Temperature scaling (r=-0.03): adjusting a global scalar can't fix position-dependent errors

[!tip] Key lesson Target quality is the bottleneck, not architecture or loss function.

Approach 3 — Binary Classification (Experiment 003)¶

[!summary] Outcome AUC=0.61, F1=0.53. Better than random but not practically useful. Needs F1 > 0.7.

Label Preparation¶

Labels are SpliceVarDB's categorical classifications used directly:

Input: alt_seq [B, 4, 501] + ref_base [B, 4] + alt_base [B, 4] (one-hot encoded)
Label: binary 0/1 from variant.classification == "Splice-altering"
Balanced: 50/50 splice-altering vs. normal

This is the first approach to use ground truth labels from SpliceVarDB for training, not just evaluation.

Optimization Objective¶

Binary cross-entropy with sigmoid output:

# splice_classifier.py:253-255
return torch.sigmoid(logits)  # P(splice-altering) ∈ [0,1]

For the UnifiedSpliceClassifier multi-task variant:

total_loss = binary_weight * BCE(p_splice_altering, label)
           + effect_weight * CE(effect_logits, effect_type)

with binary_weight=1.0 and effect_weight=0.5.
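
A per-example sketch of how the two terms combine. The helper names and the choice of four effect classes are assumptions for illustration; the source does not list the effect types or show the implementation.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def unified_loss(p_sa, label, effect_logits, effect_type,
                 binary_weight=1.0, effect_weight=0.5):
    """Weighted sum of the binary term and the effect-type term."""
    return (binary_weight * bce(p_sa, label)
            + effect_weight * cross_entropy(effect_logits, effect_type))


# Uninformative predictions: p = 0.5 and uniform logits over 4 assumed effect classes
loss = unified_loss(p_sa=0.5, label=1, effect_logits=[0.0, 0.0, 0.0, 0.0], effect_type=2)
```

For this uninformative prediction the binary term is ln 2 and the effect term is 0.5 × ln 4, which is also ln 2, so with these weights the two heads contribute equally at initialization in this toy setting.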

Architecture¶

SpliceInducingClassifier (splice_classifier.py:165-275):

alt_seq [B, 4, 501]
  ↓
GatedCNNEncoder (6 layers, 128 hidden)
  ├── Conv1d(4 → 128, k=1) initial projection
  ├── 6x GatedResidualBlock:
  │     Conv1d(128 → 256, k=15, dilation=2^(i%4))
  │     Split → [content, gate]
  │     out = content * sigmoid(gate)
  │     LayerNorm → Dropout → residual add
  ├── AdaptiveAvgPool1d(1) → [B, 128]
  ↓
seq_features [B, 128]

ref_base [B, 4] ┐
                 cat → [B, 8] → Linear(8→128) → ReLU → Dropout → Linear(128→128)
alt_base [B, 4] ┘
  ↓
var_features [B, 128]

[seq_features, var_features] → cat → [B, 256]
  ↓
Linear(256→128) → ReLU → Dropout → Linear(128→1) → sigmoid
  ↓
P(splice-altering) [B, 1]

The GatedResidualBlock is the workhorse: dilations cycle through 1, 2, 4, 8, 1, 2, which with kernel size 15 over 6 layers gives a receptive field of 253 positions (1 + 14 × (1+2+4+8+1+2)), about half of the 501nt window; the remaining context is mixed in only by the final average pooling. The gating mechanism (content * sigmoid(gate)) allows the network to selectively pass or suppress information at each position, similar to LSTM gating but applied spatially.
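
The receptive field implied by this configuration can be checked with the standard formula for stacked stride-1 dilated convolutions, 1 + Σ (k − 1)·d. The sketch assumes the dilation schedule 2^(i%4) shown in the diagram; the initial k=1 projection does not widen the field.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilations = [2 ** (i % 4) for i in range(6)]   # 1, 2, 4, 8, 1, 2
rf = receptive_field(kernel_size=15, dilations=dilations)
```

For kernel size 15 and these six dilations the result is 253 positions per output location.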

Assessment¶

[!note] Right question, insufficient input This is the right question to ask, but the sequence-only input may be insufficient to answer it.

AUC=0.61 (vs. 0.5 random) confirms there is a learnable signal — the 501nt context around a variant carries some information about whether it will affect splicing. But F1=0.53 means the model is barely better than a coin flip in practice.

Why is the signal so weak?

501nt is too narrow. Many splice-altering variants work by disrupting branch point sequences, exonic splicing enhancers/silencers, or regulatory elements that can be hundreds of nucleotides away. A 501nt window may simply not contain the relevant context.
No base model scores as input. Unlike Approach 1 which had 50+ precomputed features, this model only sees raw sequence + a 2-base variant description. It must independently learn the splice code from scratch.
The variant embedding is minimal. Concatenating two 4-dimensional one-hot vectors (ref + alt base) and projecting to 128 dimensions throws away positional context. The model knows what the mutation is but not where it sits relative to splice site motifs.

The multi-step framework (Steps 2–4) was never tested because the prerequisite F1 > 0.7 for Step 1 was not met. This was a sound gating decision.

Approach 4 — Validated Delta Prediction (Experiment 004)¶

[!summary] Outcome r=0.41 (p=1.4e-07), the best so far: about an 8% relative improvement over paired prediction (r=0.38).

Label Preparation¶

This is where the experimental design gets most interesting. The hybrid labeling strategy:

If SpliceVarDB says "Splice-altering":
    target = base_model(alt) - base_model(ref)   # Trust the delta direction

If SpliceVarDB says "Normal":
    target = [0, 0, 0]                           # Override: no effect, period

If SpliceVarDB says "Low-frequency" or "Conflicting":
    SKIP                                          # Don't train on uncertainty

[!tip] Key insight This is a hybrid labeling strategy: it uses the base model's continuous delta for confirmed splice-altering variants (where the base model is likely directionally correct, even if not calibrated), but overrides the base model with ground truth zeros for confirmed non-altering variants (where the base model may produce spurious deltas).

Data: ~2000 samples (1000 SA, 1000 Normal), balanced, train/test split by chromosome (1-20 train, 21-22 test).
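
A chromosome-level split like the one described can be sketched as below. The function and constant names are hypothetical, and the treatment of X, Y, and MT is an assumption: the source does not say how those variants are handled, so this sketch drops them.

```python
TEST_CHROMS = {"21", "22"}
TRAIN_CHROMS = {str(c) for c in range(1, 21)}


def split_by_chromosome(variants):
    """Split (chrom, payload) pairs so no chromosome appears in both sets."""
    train, test = [], []
    for chrom, payload in variants:
        name = chrom[3:] if chrom.lower().startswith("chr") else chrom
        if name in TEST_CHROMS:
            test.append((chrom, payload))
        elif name in TRAIN_CHROMS:
            train.append((chrom, payload))
        # Anything else (X, Y, MT) is dropped in this sketch
    return train, test


variants = [("1", "v1"), ("chr21", "v2"), ("20", "v3"), ("22", "v4"), ("X", "v5")]
train, test = split_by_chromosome(variants)
```

Splitting by chromosome rather than by random sample keeps nearby variants, which share most of their 501nt context, from landing on both sides of the split.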

Optimization Objective¶

MSE against the validated delta targets, trained with AdamW (lr=5e-5, weight_decay=0.02) and OneCycleLR scheduler (max_lr=5e-4). 40 epochs, batch_size=32.

The delta head outputs raw values (no activation function on the final layer):

# validated_delta_predictor.py:206-213
self.delta_head = nn.Sequential(
    nn.Linear(hidden_dim * 2, hidden_dim),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(hidden_dim, hidden_dim // 2),
    nn.ReLU(),
    nn.Linear(hidden_dim // 2, 3)  # [Δ_donor, Δ_acceptor, Δ_neither]
)

Architecture¶

ValidatedDeltaPredictor (validated_delta_predictor.py:144-256, ~3M parameters):

alt_seq [B, 4, 501]
  ↓
GatedCNNEncoder (6 layers, 128 hidden)
  ├── Conv1d(4 → 128, k=1)
  ├── 6x GatedResidualBlock(k=15, dilation=1,2,4,8,1,2)
  ├── AdaptiveAvgPool1d(1)
  ↓
seq_features [B, 128]

ref_base [B, 4] ┐
                 cat → [B, 8] → Linear(8→128) → ReLU → Dropout → Linear(128→128)
alt_base [B, 4] ┘
  ↓
var_features [B, 128]

[seq_features, var_features] → cat → [B, 256]
  ↓
Linear(256→128) → ReLU → Dropout → Linear(128→64) → ReLU → Linear(64→3)
  ↓
Δ = [Δ_donor, Δ_acceptor, Δ_neither]

Optionally includes base model scores as additional input (include_base_scores=True adds 3 dims to the variant input, enabling residual learning on top of the base model).

An attention variant (ValidatedDeltaPredictorWithAttention) replaces global average pooling with learned position attention:

attn_logits = self.position_attention(per_pos_features).squeeze(-1)  # [B, L]
attention = F.softmax(attn_logits, dim=-1)
seq_features = torch.einsum('bl,blh->bh', attention, per_pos_features)  # weighted pool
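
For a single sequence, the pooling above reduces to a softmax-weighted sum over positions. The dependency-free sketch below uses illustrative names and toy values, and shows that uniform attention logits recover plain average pooling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(attn_logits, per_pos_features):
    """out[h] = sum over positions l of softmax(logits)[l] * features[l][h]."""
    weights = softmax(attn_logits)
    hidden = len(per_pos_features[0])
    return [sum(w * feat[h] for w, feat in zip(weights, per_pos_features))
            for h in range(hidden)]


features = [[1.0, 0.0], [9.0, 5.0], [2.0, 4.0]]        # L=3 positions, H=2 channels
uniform = attention_pool([0.0, 0.0, 0.0], features)    # same as average pooling
peaked = attention_pool([0.0, 50.0, 0.0], features)    # nearly selects position 1
```

A peaked attention distribution lets the pooled vector be dominated by one position, which average pooling cannot do.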

Assessment¶

[!note] Advantage is in label engineering, not architecture The model architecture is nearly identical to Approach 3's SpliceInducingClassifier. The improvement comes almost entirely from cleaning up the training targets.

Why validated targets work — asymmetric trust:

For splice-altering variants, the base model's delta is a reasonable (if noisy) target — the direction is probably right even if the magnitude is off. For non-splice-altering variants, the base model's delta is unreliable noise — any predicted delta is a false positive by definition. By forcing zeros for the non-SA class, you remove approximately half the noise from the training signal.

[!warning] Concerns

1. r=0.41 is measured on SA samples only. The model's ability to discriminate SA from non-SA (ROC-AUC=0.58) is barely above chance and no better than Approach 3 (AUC=0.61).
2. False negative blindness. For SA variants where the base model predicts near-zero delta (false negatives), the validated target still uses base_model(alt) - base_model(ref) ≈ 0. The meta-layer has no mechanism to "discover" effects the base model completely misses.
3. 2000 samples is very small. The full SpliceVarDB has ~50K variants. The r=0.41 should be taken with appropriate uncertainty bounds.
4. At inference, final = base_scores + Δ. This means the meta-layer is a residual correction: it can sharpen or attenuate base model predictions but fundamentally cannot detect novel splice sites the base model scores as zero.

What SpliceVarDB Actually Tells Us (and What It Doesn't)¶

What it provides per record¶

| Field | Example | Available? |
| --- | --- | --- |
| Genomic locus + variant | `chr1:100107682 T→C` | Yes |
| "Does it affect splicing?" | `Splice-altering` | Yes |
| Location (exonic/intronic) | `Intronic` | Yes |
| Experimental method | varies | Yes |
| Where the affected splice site is | unknown | No |
| Gain or loss? | unknown | No |
| Donor or acceptor? | unknown | No |
| Magnitude of the effect | unknown | No |
| Resulting transcript | unknown | No |

What the variant might actually do¶

Knowing that chr1:100107682 T→C is "splice-altering" tells us something happens to splicing but not what, where, or how much. The variant might:

1. Directly destroy a canonical GT/AG dinucleotide (trivial case)
2. Weaken a splice site by disrupting the surrounding consensus motif
3. Create a new cryptic splice site nearby
4. Disrupt an exonic splicing enhancer (ESE), causing exon skipping at a site hundreds of bp away
5. Alter a branch point sequence, changing acceptor site selection
6. Modify RNA secondary structure, exposing or hiding a splice site

Cases 1-2 are local and detectable within 501nt. Cases 3-6 may require much longer range context and are the truly "alternative" splice events that current approaches struggle with.

Why This Is Not a Standard Supervised Learning Problem¶

The label hierarchy¶

Level 0: Genome-wide       "Which positions are splice sites?"
                            → Labels: GTF annotations (complete for canonical sites)
                            → Supervision: Strong, but STATIC (reference genome only)

Level 1: Variant-level      "Does this variant affect splicing?"
                            → Labels: SpliceVarDB (binary, ~50K variants)
                            → Supervision: Weak — tells us IF, not WHERE/HOW

Level 2: Effect type        "Gain or loss? Donor or acceptor?"
                            → Labels: NOT in SpliceVarDB directly
                            → Can be partially INFERRED from base model deltas
                            → Supervision: Inferred, noisy

Level 3: Position-level     "At which position does the new/lost splice site occur?"
                            → Labels: Almost completely ABSENT
                            → Supervision: None (this is what we want to predict)

Level 4: Magnitude          "How strong is the effect (PSI change)?"
                            → Labels: Available in RNA-seq (not currently used)
                            → Supervision: Available but requires different data source

[!important] The fundamental gap We want to predict at Level 3 (positional) but our best labels exist at Level 1 (binary variant-level). The current approaches try to bridge this gap through the base model's delta scores, which provide a noisy Level 3 signal — but only for effects the base model already partially detects.

Systematic ML Formulations¶

Formulation A: Conditional Splice Landscape Prediction¶

The most natural formulation. Instead of predicting "is this variant splice-altering?", predict the full per-position splice site probability map conditioned on a perturbation:

f(sequence, perturbation) → P(donor_i), P(acceptor_i)  for all positions i

where perturbation can be:

- A variant (SNV, indel)
- An epigenetic mark at position j
- A splicing factor binding event
- A condition label (stress, disease state)

The delta then falls out naturally:

Δ_i = f(sequence, perturbation)_i - f(sequence, null)_i

Why this is better than the current approach: The current system predicts deltas directly, which means it can only learn corrections to the base model. Formulation A predicts the full landscape, meaning it can discover splice sites the base model never scored. The delta is derived, not directly predicted.

Training signal:

- For perturbation = null (reference genome): strong supervision from GTF annotations at Level 0
- For perturbation = variant: weak supervision from SpliceVarDB at Level 1. We know the landscape should change but not where

Bridging the label gap (Levels 1→3):

1. Consistency constraints: For non-SA variants, the predicted landscape should be identical to reference: f(seq, variant) ≈ f(seq, null). This is a self-supervised signal.
2. Sparsity priors: For SA variants, the difference Δ should be sparse (only a few positions change). Enforce with L1 regularization on the delta map.
3. Motif constraints: New splice sites should conform to known motifs (GT for donor, AG for acceptor, with a polypyrimidine tract upstream of the acceptor). This acts as a domain-knowledge prior.
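
The first two constraints can be sketched as a penalty on a per-position delta map. This is a minimal sketch under stated assumptions: the function name, the squared-error form of the consistency term, and the 0.01 sparsity weight are all illustrative choices, and the motif constraint is not covered.

```python
def landscape_regularizer(delta_map, is_splice_altering, sparsity_weight=0.01):
    """Penalty on a per-position delta map (list of [d_donor, d_acceptor] rows).

    Non-SA variants: consistency term, mean squared delta (pushes the map to zero).
    SA variants: sparsity term, weighted mean absolute delta (L1).
    """
    values = [v for row in delta_map for v in row]
    if not is_splice_altering:
        return sum(v * v for v in values) / len(values)
    return sparsity_weight * sum(abs(v) for v in values) / len(values)


flat = [[0.0, 0.0] for _ in range(4)]
one_site = [[0.0, 0.0], [0.8, 0.0], [0.0, 0.0], [0.0, 0.0]]

consistency_ok = landscape_regularizer(flat, is_splice_altering=False)
consistency_violated = landscape_regularizer(one_site, is_splice_altering=False)
sparsity_cost = landscape_regularizer(one_site, is_splice_altering=True)
```

On its own the L1 term is minimized by an all-zero map, so for SA variants it has to be paired with a term that rewards some change; the source does not specify that term.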

Formulation B: Latent Splice State Model¶

Think of each position as having a latent "splice competence" that is modulated by context:

z_i = encoder(sequence_context_around_i)           # latent splice competence
s_i = modulator(z_i, perturbation)                  # perturbed state
P(donor_i), P(acceptor_i) = decoder(s_i)           # observed probabilities

[!tip] Key insight The encoder learns position-level representations from canonical sites (Level 0, abundant labels), and the modulator learns how perturbations shift these representations (Level 1, scarce labels). This separates "understanding splice sites" from "understanding perturbation effects" — two very different learning problems with very different amounts of supervision.

Training:

- Phase 1: Train encoder + decoder on canonical splice sites (GTF labels, millions of examples)
- Phase 2: Freeze encoder, train modulator on SpliceVarDB variants

The modulator can be small because it only needs to learn a low-dimensional perturbation, while the heavy lifting of sequence understanding is done by the encoder with abundant labels.

Formulation C: Contrastive Multi-Resolution Learning¶

Address the label gap directly by using contrastive objectives at multiple resolutions:

Resolution 1 (position-level):
  Known donor sites should have similar representations
  Known acceptor sites should have similar representations
  Donor ≠ Acceptor ≠ Neither
  → Supervision: GTF annotations (abundant)

Resolution 2 (variant-level):
  SA variants should produce representation SHIFTS
  Non-SA variants should produce NO representation shift
  → Supervision: SpliceVarDB (moderate)

Resolution 3 (effect-level):
  Similar effect types should cluster
  → Supervision: Inferred from base model deltas (noisy)

This formulation doesn't require Level 3 positional labels because it learns representations rather than predictions directly. The position-level prediction emerges from the learned representation space.
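
One way to express the Resolution 2 objective is a margin-based contrastive loss on the distance between reference and alternate representations. This is an assumed concrete form, not something the source specifies; the function name, the Euclidean distance, and the margin of 1.0 are illustrative.

```python
import math


def shift_contrastive_loss(z_ref, z_alt, is_splice_altering, margin=1.0):
    """Margin-based contrastive loss on the ref-to-alt representation shift."""
    shift = math.sqrt(sum((a - r) ** 2 for a, r in zip(z_alt, z_ref)))
    if is_splice_altering:
        # SA: penalize shifts smaller than the margin
        return max(0.0, margin - shift) ** 2
    # Non-SA: penalize any shift
    return shift ** 2


z_ref = [0.0, 0.0]
small_shift = [0.3, 0.4]   # distance 0.5 from z_ref
large_shift = [3.0, 4.0]   # distance 5.0 from z_ref

sa_small = shift_contrastive_loss(z_ref, small_shift, is_splice_altering=True)
sa_large = shift_contrastive_loss(z_ref, large_shift, is_splice_altering=True)
non_sa_small = shift_contrastive_loss(z_ref, small_shift, is_splice_altering=False)
```

An SA variant is penalized only while its shift stays under the margin, whereas a non-SA variant is penalized for any shift at all, which matches the two bullet points of Resolution 2.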

Formulation D: Generalized Perturbation Framework¶

The observation that "external factors can be anything including variants, diseases, stress, epigenetic marks" points toward a more general formulation:

SpliceLandscape(position) = BaseModel(sequence) + Σ_k  δ_k(position, perturbation_k)

where each perturbation type k has its own delta function, but they share the same underlying sequence representation. This is a multi-task perturbation model:

| Task | Train On | Perturbation Type |
| --- | --- | --- |
| Variant effects | SpliceVarDB | Genetic variants |
| Epigenetic effects | ENCODE methylation + splice junction data | Methylation, histone mods |
| Splicing factor effects | eCLIP + splice junction data | Protein binding |
| Disease/tissue state effects | GTEx tissue-specific splicing | Condition labels |

Each task has different supervision, but they all answer the same question: how does this perturbation change the splice landscape? The shared sequence encoder benefits from all tasks simultaneously (multi-task transfer).
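
The additive form of this formulation can be illustrated with toy numbers. The clipping to [0, 1] is an assumption added here so the result stays a valid probability; the source formula does not mention it, and all names and values are illustrative.

```python
def perturbed_landscape(base_scores, perturbation_deltas):
    """base[i] + sum_k delta_k[i], clipped to [0, 1] to stay a probability."""
    out = []
    for i, base in enumerate(base_scores):
        total = base + sum(delta[i] for delta in perturbation_deltas)
        out.append(min(1.0, max(0.0, total)))
    return out


base = [0.0, 0.75, 0.0, 0.0]             # one strong site at position 1
variant_delta = [0.0, -0.5, 0.5, 0.0]    # variant weakens it and activates position 2
methylation_delta = [0.0, -0.5, 0.0, 0.0]

landscape = perturbed_landscape(base, [variant_delta, methylation_delta])
```

The two deltas on position 1 sum past zero and get clipped, which hints that a purely additive combination may need a more careful link function once several perturbation types interact.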

Comparison of Formulations¶

| Aspect | A: Conditional Landscape | B: Latent State | C: Contrastive | D: Perturbation Framework |
| --- | --- | --- | --- | --- |
| Can discover novel sites | Yes | Yes | Indirectly | Yes |
| Handles label gap | Consistency + sparsity | Two-phase training | Multi-resolution | Multi-task |
| Non-variant perturbations | Natural | Natural | Requires redesign | Primary design goal |
| Data efficiency | Moderate | High (phase separation) | Moderate | Requires multiple data sources |
| Implementation complexity | Moderate | Moderate | High | High |

Recommendations and Roadmap¶

Where the current system stands¶

The current meta-layer is closest to a simplified version of Formulation A, but with critical limitations:

| Aspect | Current System | Ideal (Formulation A) |
| --- | --- | --- |
| Predicts | Δ directly (residual) | Full landscape, Δ derived |
| Can discover novel sites | No (constrained by base model) | Yes |
| Context | 501nt | 5000nt+ |
| Variant supervision | SpliceVarDB binary → cleaned deltas | SpliceVarDB + consistency + sparsity constraints |
| Non-variant perturbations | Not supported | Naturally supported |

Short-term (with current infrastructure)¶

The Validated Delta Predictor (Approach 4) is the right starting point
But reframe it: instead of predicting the scalar max-delta, predict a per-position delta profile [L, 3] — this naturally provides Level 3 information
Add the consistency constraint: for non-SA variants, the loss penalizes any non-zero delta at any position. This is free supervision
Add the sparsity prior: for SA variants, L1-regularize the delta profile. Real splice effects are sparse (1-3 positions change significantly)

Medium-term (with GPU access)¶

Move toward Formulation B: pre-train a position-level encoder on canonical sites genome-wide, then fine-tune a lightweight perturbation modulator on SpliceVarDB
Extend context to 2000-5000nt to capture branch points and regulatory elements
Use HyenaDNA or equivalent as the pre-trained encoder backbone

Long-term (full Formulation D)¶

Integrate RNA-seq junction data (e.g., GTEx) as continuous Level 4 supervision
Multi-task across perturbation types
This is where "predicting alternative splicing from any external factor" becomes tractable

Cross-Cutting Insights¶

Experimental progression validates the design thinking¶

| Transition | What was learned |
| --- | --- |
| 001 → 002 | Classification of known sites is the wrong proxy task for variant detection |
| 002 → 004 | Target quality (not model capacity) is the bottleneck |
| 003 (parallel) | There IS a learnable signal from sequence alone (AUC > 0.5), but 501nt is insufficient |
| All → Future | The meta-layer is a calibration layer, not a discovery engine |

The core limitation shared by all approaches¶

All four approaches can only refine what the base model already partially detects. True alternative splice site prediction — identifying cryptic splice sites, deep intronic activations, or complex multi-exon effects — requires:

Longer context (thousands of bp) to capture branch points, enhancers/silencers, and RNA secondary structure
Direct sequence-to-function learning rather than refining an existing model's outputs
More diverse training signal — RNA-seq data showing actual isoform usage, not just binary SA/non-SA labels
Addressing the false negative problem — variants that create entirely new splice sites the base model has never scored

[!abstract] Bottom line The recognition that this is fundamentally a weak-supervision problem with a label hierarchy (not a standard input→label mapping) is the key insight that should drive the system design. The current experiments have established what doesn't work (canonical classification, raw base model targets) and what the signal looks like (validated targets, r=0.41). The next step is formalizing the problem structure itself and designing training objectives that bridge the gap between Level 1 labels (SpliceVarDB) and Level 3 predictions (positional splice landscapes).

Related Documents¶

[[experiments/001_canonical_classification/README|Experiment 001 — Canonical Classification]]
[[experiments/002_delta_prediction/README|Experiment 002 — Paired Delta Prediction]]
[[experiments/003_binary_classification/README|Experiment 003 — Binary Classification]]
[[experiments/004_validated_delta/README|Experiment 004 — Validated Delta Prediction]]
[[methods/VALIDATED_DELTA_PREDICTION|Validated Delta Prediction Method]]
[[methods/PAIRED_DELTA_PREDICTION|Paired Delta Prediction Method]]
[[methods/ROADMAP|Methodology Roadmap]]
