Meta-Layer Documentation¶

Last Updated: April 2026 Status: Active Development (Phase 6 — M1-S training, M2 series design)

Overview¶

The meta-layer is a multimodal refinement system that takes frozen base model predictions (OpenSpliceAI, SpliceAI) and improves them using conservation, epigenetic, chromatin, junction, and RBP evidence. It follows the universal [L, 3] protocol: given an L-nucleotide sequence, output per-position (donor, acceptor, neither) scores.

Four model variants (M1-M4) address progressively harder prediction tasks, from canonical site recalibration to perturbation-induced splice site discovery.

Methods (read in order)¶

#	Document	Topic
00	Model Variants M1-M4	Variant definitions, architecture, results, status
01	Label Hierarchy & Weak Supervision	Label levels (L0-L4), weak-supervision framing, SpliceVarDB
02	Annotation-Driven Splice Prediction	Annotation as latent variable, GENCODE MANE, tier-based confidence
03	Virtual Transcripts & Junction Pairing	Representation gap, donor-acceptor pairing, Level 3.5
04	Data Sources & Landscape	GTEx, SpliceVault, ENCODE data + ML landscape analysis
05	M2 Variant Formulations	M2-S model + evaluation protocols
—	Naming Convention	Model vs eval protocol definitions

Reading guide¶

Start here: 00 (model overview) — defines M1-M4 and current results
Naming: naming_convention.md — models (M1-S, M2-S) vs eval protocols (Eval-MANE, etc.)
Understand the problem: 01 (label hierarchy) — why this isn't standard supervised learning
Understand M2: 02 (annotation strategy) + 05 (M2 protocols) — the current research frontier
Deeper context: 03 (junction pairing) + 04 (data landscape) — for M3/M4 planning

Architecture¶

Document	Description
ARCHITECTURE.md	System architecture and design principles

Meta-SpliceAI Archive¶

The predecessor project (Meta-SpliceAI) ran four experiments on variant-level splice prediction using SpliceVarDB:

ID	Experiment	Result
001	Canonical Classification	99.1% acc, 17% variant detection
002	Paired Delta Prediction	r=0.38
003	Binary Classification	AUC=0.61, F1=0.53
004	Validated Delta Prediction	r=0.41 (best)

Key finding: the bottleneck is label quality (binary variant-level labels), not model capacity. This motivated the shift to annotation-tier and junction-weighted labeling in the M2 series.

Full archive: meta-spliceai-archive/

Component	Path
MetaSpliceModel	`src/.../meta_layer/models/meta_splice_model_v3.py`
DenseFeatureExtractor	`src/.../features/dense_feature_extractor.py`
SequenceLevelDataset	`src/.../meta_layer/data/sequence_level_dataset.py`
Shard packing	`src/.../meta_layer/data/shard_packing.py`
Training script	`examples/meta_layer/07_train_sequence_model.py`
XGBoost baseline (M1-P)	`examples/meta_layer/01_xgboost_baseline.py`
M1-P full-genome results	`examples/meta_layer/docs/m1_fullgenome_results.md`
Feature configs	`examples/features/configs/`