Data Sources, Derivation Strategies, and Landscape Analysis¶

Created: March 2026 Prerequisite: 01_alternative_splice_prediction_analysis.md, 02_virtual_transcripts.md

Motivation¶

Docs 01 and 02 define what we want to predict (alternative splice sites induced by external factors, at junction level) and how to frame the labels (Strategies 1-3, Levels 0-4). This document addresses two practical questions:

Where do we get junction-level labeled data tied to specific external factors?
What already exists in the ML landscape, so we don't reinvent the wheel?

Part I: Data Sources for Junction-Level Labels¶

The fundamental bottleneck for predicting induced splice sites is not algorithm design — it's assembling junction-level labeled datasets at sufficient scale. Below are concrete sources, ordered by effort required.

Tier 1: Available Now (No New Data Generation)¶

GTEx Splice QTLs (sQTLs)¶

The single most actionable data source. GTEx v8 publishes pre-computed associations between genetic variants and differential junction usage across 49 tissues.

Attribute	Detail
What it provides	Variant → junction usage change (intron excision ratio) per tissue
Scale	~50K significant sQTLs across tissues; many are tissue-specific
Label format	(variant, gene, junction_cluster, effect_size, tissue, p-value)
How to derive splice-site labels	Each sQTL's intron cluster defines which junctions are gained/lost/shifted; the variant is the external factor
Key advantage	Pre-computed — no need to re-process BAMs. Population-level (common variants)
Limitation	Biased toward common variants (MAF > 1%); rare pathogenic variants underrepresented
Access	GTEx Portal → sQTL summary statistics
Processing tools	`leafcutter` (intron clustering), `tensorQTL` (association testing)

Integration path: Download sQTL summary statistics → filter to significant associations → map intron clusters to donor-acceptor coordinates → format as (variant, gene, junction_gained/lost, tissue, effect_size).

SpliceVault¶

A lookup database aggregating 335K+ junction-level observations from clinical RNA-seq, cross-referenced with known pathogenic variants (Dawes et al., Nature Genetics 2023).

Attribute	Detail
What it provides	Novel/cryptic junction coordinates (donor-acceptor pairs) with read counts
Scale	335K junction observations from ~300K RNA-seq samples (via recount3)
Key insight	Cryptic splice sites activated by pathogenic variants also appear as rare stochastic events in healthy population data
Performance	Top-4 unannotated events predict mis-splicing outcome with 92% sensitivity
Label format	Junction coordinates + read counts + whether annotated or novel
Key advantage	Bridges "variant X causes aberrant splicing" → "variant X creates junction (d', a') with N reads"
Limitation	Lookup-based (not a trained model); requires the junction to have been observed at least once
Access	github.com/kidsneuro-lab/SpliceVault
Paper	Nature Genetics 2023

Integration path: Query SpliceVault for junctions near known splice-altering variants → provides ground-truth junction coordinates and read support → format matches our splice_sites_enhanced.tsv schema with additional junction-pair columns.

ClinVar + SpliceVarDB (Variant-Level, Needs Junction Mapping)¶

These provide variant-level labels ("this variant affects splicing") but not junction-level coordinates. Useful as a starting point when cross-referenced with SpliceVault or GTEx.

Source	Scale	Label Level	Limitation
SpliceVarDB	~50K variants	Binary (splice-altering or not)	No junction positions
ClinVar (splice subset)	~15K splice variants	Review status + disease	Inconsistent annotation depth

Tier 2: Moderate Effort¶

TCGA Junction Quantification¶

For cancer/disease-specific alternative splicing. Matched somatic variant calls + tumor RNA-seq across 33 cancer types.

Attribute	Detail
What it provides	Somatic variant → aberrant junction usage in same tumor
Scale	~11K patients across 33 cancer types
Key advantage	Somatic mutations near splice sites directly linked to tumor-specific junction usage
Processing	SplAdder reprocessing provides alternative splicing event catalogs (exon skip, intron retention, alt 3'/5')
Related resource	PCAWG (Pan-Cancer Analysis of Whole Genomes) has systematically cataloged splicing-associated somatic mutations
Effort	Need to download and process BAMs or use pre-computed junction tables

Integration path: Use TCGA junction tables (or SplAdder output) → filter to junctions near somatic mutations → label as cancer-specific alternative splicing events.

ENCODE Perturbation RNA-seq¶

Knockdown of splicing regulators (SRSF1, hnRNPA1, U2AF1, SF3B1, etc.) with matched RNA-seq. Junction changes after knockdown directly reveal which junctions depend on that factor.

Attribute	Detail
What it provides	Splicing factor knockdown → differential junction usage
Experiment types	shRNA knockdown, CRISPRi, in K562 and HepG2 cell lines
Key advantage	Clean causal signal — if junction X disappears when factor Y is knocked down, X depends on Y
Scale	~200+ splicing-related knockdown experiments with RNA-seq
Access	ENCODE Portal → search "RNA-seq" + "shRNA" + splicing factor

Integration path: Download perturbation vs. control RNA-seq → run STAR junction quantification → compute differential junction usage → label as factor-dependent junctions.

Tier 3: Heavy Lift, Highest Value¶

Reprocess GTEx/TCGA BAMs with Uniform Junction Calling¶

The most comprehensive approach — run all ~17K GTEx + ~11K TCGA samples through a uniform STAR 2-pass pipeline to get per-sample junction counts.

Attribute	Detail
What it enables	Discovery of rare-variant sQTLs not in the published catalog
Scale	~28K samples, ~billions of junction observations
Compute cost	~$5-10K in cloud compute (or months on institutional cluster)
Key advantage	Enables junction-level labels for rare pathogenic variants
Alternative	Use recount3 (pre-processed junction tables for ~750K public RNA-seq samples)

Note: recount3 provides pre-computed junction tables that may be sufficient without reprocessing BAMs. See recount3.org.

Drug Perturbation and ASO Studies¶

For the hardest case — condition-induced alternative splicing:

Source	What It Provides	Example
Splicing modulator studies	Drug → junction changes	Risdiplam (SMN2 exon 7 inclusion)
ASO (antisense oligonucleotide) studies	Targeted splice site blocking → clean junction changes	Nusinersen (SMN2), Eteplirsen (DMD exon 51 skip)
Stress response RNA-seq	Cellular stress → differential splicing	Heat shock, hypoxia, DNA damage

These produce the cleanest training signal for perturbation-aware models but require study-by-study curation.

Part II: Derivation Strategy — Building the Dataset Incrementally¶

Phase 1 (weeks): Curate existing resources
  ├── GTEx sQTLs → ~50K variant-junction associations (common variants)
  ├── SpliceVault → ~335K junction observations (pathogenic variants)
  └── Format: (variant, gene, junction_gained/lost, tissue, read_count)
      ≈ 100K unique variant-junction associations

Phase 2 (months): Expand with processed data
  ├── TCGA junctions → somatic variant-junction (cancer-specific)
  ├── ENCODE knockdowns → factor-junction dependencies
  └── Format: same + external_factor column (variant / factor / condition)
      ≈ 500K associations

Phase 3 (ongoing): Deep integration
  ├── recount3 junction tables → rare variant sQTLs
  ├── Drug/ASO perturbation studies → condition-specific junctions
  └── ≈ 1M+ associations across variant, factor, and condition types

The key insight: all external factors produce the same output format — a set of junctions that are gained or lost relative to a reference. The meta-layer doesn't need to know whether the cause is a genetic variant, a splicing factor knockdown, or a drug — it learns the mapping from (sequence + context features + perturbation signal) to (junction usage change).

Part III: Landscape Analysis — What Already Exists (as of March 2026)¶

Per-Position Splice Site Prediction (Mature)¶

Tool	Architecture	Context	Tissue-Aware	Training Data	Key Limitation
SpliceAI (2019)	ResNet, 3-channel output	10kb	No	GENCODE annotations	Per-position only; argmax discards information
CI-SpliceAI (2022)	Same as SpliceAI	10kb	No	Curated alt sites	Marginal improvement (~1%)
OpenSpliceAI (2025)	Same architecture, retrained	10kb	No	GENCODE v46	Per-position only
Pangolin (2022)	SpliceAI architecture	5kb	4 tissues	RNA-seq junction counts (4 tissues x 4 species)	Bug affected ~28% of hg38 scores in 2024
SpTransformer (2024)	Transformer	Variable	56 tissues	GTEx + GENCODE	Per-position; claims 83% cross-tissue inference

Our base layer: SpliceAI + OpenSpliceAI. Feature pipeline extracts 43+ derived features.

Variant Effect on Splicing (Binary/Categorical)¶

Tool	Prediction Level	Tissue-Aware	Data Sources	Key Advance
AbSplice (2023)	Binary (aberrant or not)	50 tissues	GTEx rare variants + FRASER	First tissue-specific aberrant splicing classifier; 60% precision
AbSplice2 (2025)	Binary + continuous usage	Developmental stages	GTEx + FRASER2 + Pangolin	Adds developmental stage predictions
MMSplice/MTSplice (2019/2021)	Delta-logit-PSI (exon-level)	56 tissues (MTSplice)	VEX-seq + GTEx	Limited to cassette exon events
SpliceVault (2023)	Outcome type lookup	No	335K RNA-seq samples	92% sensitivity for outcome prediction; lookup, not learned

Gap: AbSplice predicts whether aberrant splicing occurs but not which junction. SpliceVault predicts outcomes but only via lookup of previously observed events.

Junction-Level Prediction (Emerging)¶

Tool	What It Predicts	Architecture	Open Source	Key Limitation
Splam (2024)	P(valid junction \| donor, acceptor)	Deep ResNet, 800nt input	Yes	Requires enumeration of candidate pairs; classifies, doesn't discover
AlphaGenome (2025)	2D junction strength (donor x acceptor)	1Mb U-Net + Transformer	API only	Non-commercial; cannot train/extend; doesn't model perturbations

AlphaGenome is the state-of-the-art for junction-level prediction from sequence — its 2D pairwise branch explicitly scores donor-acceptor pairs. However, it is API-only, non-commercial, and does not model external perturbation factors.

Splam is the closest open-source junction-level model, but it classifies given candidate pairs rather than discovering novel junctions from genomic regions.

Perturbation-Aware Splicing (Very Early)¶

Tool	External Factor	Architecture	Status
ChemSplice (2024)	Drug perturbations	BERT (AllSplice) + chemical embeddings	Preprint only; not peer-reviewed
KATMAP (2025)	Splicing factor knockdowns	Linear regression (interpretable)	Published; predicts factor targets, not junction outcomes
TrASPr+BOS (2025)	Experimental conditions	Generative VAE + Bayesian optimization	Predicts PSI/dPSI; can design sequences

This space is wide open. ChemSplice is the only model attempting drug-perturbed junction prediction, and it's a preprint. No tool combines variant effects + perturbation context + junction-level prediction in a single framework.

Foundation Models for Splicing¶

Model	Splice Capability	Junction-Level	Zero-Shot
Evo 2 (2025)	Variant effect via log-likelihood ratio	No	Yes (best zero-shot performance)
Borzoi (2025)	RNA-seq coverage prediction → indirect splice scoring	Indirect (coverage-based)	No (trained on specific tracks)
SpliceBERT (2024)	Splice site classification after fine-tuning	No	No
AlphaGenome (2025)	Full junction-level scoring	Yes (2D branch)	No (trained on functional genomics tracks)

Our approach: Use Evo2 embeddings as features in the meta-layer, not as a standalone predictor. The sparse exon classifier demonstrates that Evo2 embeddings capture splicing-relevant structure (see examples/foundation_models/05_sparse_exon_classifier.py).

Part IV: What's Novel About Our Approach¶

The Gap in the Landscape¶

No existing open-source tool combines all three of:

Junction-level prediction (donor-acceptor pairing, not just per-position)
External factor conditioning (variants, disease, perturbations)
Tissue specificity (different junctions active in different contexts)

AlphaGenome achieves (1) but not (2) or (3), and is API-only. AbSplice achieves (3) partially but not (1). ChemSplice attempts (2) but is a preprint with narrow scope.

Our Position¶

                        Per-Position    Junction-Level
                        ────────────    ──────────────
Canonical only          SpliceAI ✓      Splam ✓
                        OpenSpliceAI ✓  AlphaGenome ✓ (API)

Variant-induced         AbSplice ✓      ← GAP: no open tool
                        SpliceVault ✓     predicts which junctions
                        (lookup)          a variant creates/destroys

Factor/condition-       ChemSplice       ← GAP: no tool predicts
induced                 (preprint)        junction outcome under
                        KATMAP (linear)   perturbation

The meta-layer we're building aims to fill the bottom-right quadrant: junction-level predictions conditioned on external factors, using a multimodal feature fusion approach (base scores + conservation + epigenetic + junction evidence + perturbation signal).

Why Multimodal Fusion (Not End-to-End)¶

AlphaGenome shows that end-to-end 2D junction prediction is possible, but requires massive compute (1Mb input, thousands of output tracks, DeepMind-scale training).

Our approach is complementary and more practical: - Base models (SpliceAI/OpenSpliceAI) handle the "splicing grammar" from sequence - Conservation tells us which sites are under selective constraint (and which are evolvable) - Epigenetic context tells us which sites are active in a given cell type - Junction reads provide ground truth for donor-acceptor pairing - Meta-layer learns to combine these signals, conditioned on external factors

This is more data-efficient than training a single massive model, and allows each modality to be updated independently as better data becomes available.

Part V: Recommended Reading Order¶

For someone new to this problem:

01_alternative_splice_prediction_analysis.md — Problem formulation, label hierarchy, initial experiments showing the difficulty (best: r=0.41)
02_virtual_transcripts.md — Why junction-level prediction is needed, three label strategies, the pairing problem
This document — Concrete data sources, what exists in the landscape, what's novel
docs/foundation_models/evo2/junction_support_labels.md — How to build soft labels from junction read support (implementation-level detail)
docs/isoform_discovery/README.md — The end goal: context-aware isoform discovery

References¶

Data Sources¶

GTEx sQTLs: gtexportal.org
SpliceVault: Nature Genetics 2023, github.com/kidsneuro-lab/SpliceVault
recount3: recount.bio
ENCODE: encodeproject.org
TCGA SplAdder: bioinformatics.mdanderson.org/TCGA_Splicing

Tools and Models¶

SpliceAI: Cell 2019
OpenSpliceAI: eLife 2025
Pangolin: Genome Biology 2022
AbSplice: Nature Genetics 2023, github.com/gagneurlab/absplice
AbSplice2: bioRxiv 2025
SpliceMap: github.com/gagneurlab/splicemap
MMSplice/MTSplice: Nature Methods 2019, Genome Biology 2021
Splam: Genome Biology 2024
AlphaGenome: Nature 2025
Evo 2: bioRxiv 2025
Borzoi: Nature Genetics 2025
ChemSplice/AllSplice: bioRxiv 2024
KATMAP: Nature Biotechnology 2025
SpTransformer: Nature Communications 2024
SpliceBERT: Briefings in Bioinformatics 2024
TrASPr+BOS: eLife 2025

Within Codebase¶

Junction support labels: docs/foundation_models/evo2/junction_support_labels.md
Isoform discovery vision: docs/isoform_discovery/README.md
Feature pipeline: src/agentic_spliceai/splice_engine/features/
Multimodal exploration: examples/features/05_multimodal_exploration.py
Sparse exon classifier: examples/foundation_models/05_sparse_exon_classifier.py