Multimodal Feature Catalog¶

Complete reference for all 10 feature modalities in the agentic-spliceai meta-layer feature engineering pipeline. Each modality is a registered Modality subclass in src/agentic_spliceai/splice_engine/features/modalities/.

Summary¶

Modality	Columns	Source Type	Data Source	Description
`base_scores`	43	Model output	Base model predictions (SpliceAI/OpenSpliceAI)	Derived features from raw splice site probabilities
`annotation`	3	Genomic annotation	GTF/GFF gene annotations	Ground truth splice site labels and transcript info
`sequence`	3	Genomic reference	Reference FASTA (hg38/hg19)	Contextual DNA sequence windows around each position
`genomic`	4	Genomic annotation	Gene boundary coordinates + sequence	Positional and compositional features within genes
`conservation`	9	External data	UCSC PhyloP/PhastCons bigWig tracks	Evolutionary constraint from multi-species alignment
`epigenetic`	12	External data	ENCODE ChIP-seq bigWig tracks	Histone modification signals across cell types
`junction`	12	External data	STAR SJ.out.tab / GTEx / recount3	RNA-seq splice junction read evidence
`rbp_eclip`	8	External data	ENCODE eCLIP narrowPeak (K562, HepG2)	RNA-binding protein occupancy at splice sites
`chrom_access`	12	External data	ENCODE ATAC-seq (5 cell lines) + DNase-seq (5 primary tissues)	Chromatin accessibility (open vs closed chromatin)
`fm_embeddings`	8	Foundation model	Pre-extracted embeddings (Evo2, SpliceBERT, etc.)	Label-agnostic scalar features from foundation model representations

Total: 114 feature columns (full-stack with fm_embeddings enabled). Default full-stack: 106 columns (fm_embeddings commented out by default — requires GPU-extracted embeddings).

Modalities by Source Type¶

Group 1: Model Output¶

base_scores (43 columns)¶

Derives ~43 engineered features from the three raw per-nucleotide probabilities (donor_prob, acceptor_prob, neither_prob) output by the base model. All features are derived via vectorized Polars expressions. Context-aware features use .over('gene_id') to prevent cross-gene leakage.

Source file: modalities/base_scores.py

Score aliases (3 columns)

Column	Description
`donor_score`	Alias of `donor_prob` for FeatureSchema compatibility
`acceptor_score`	Alias of `acceptor_prob` for FeatureSchema compatibility
`neither_score`	Alias of `neither_prob` for FeatureSchema compatibility

Context scores (4 columns, default window=2)

Raw predicted splice probabilities at neighboring positions, extracted via shift() within gene groups.

Column	Description
`context_score_m2`	Max(donor, acceptor) probability 2 positions upstream
`context_score_m1`	Max(donor, acceptor) probability 1 position upstream
`context_score_p1`	Max(donor, acceptor) probability 1 position downstream
`context_score_p2`	Max(donor, acceptor) probability 2 positions downstream

Derived probability features (7 columns)

Column	Type	Description
`relative_donor_probability`	Derived	donor / (donor + acceptor); donor fraction of splice signal
`splice_probability`	Derived	(donor + acceptor) / total; overall splice confidence
`donor_acceptor_diff`	Derived	(donor - acceptor) / max(donor, acceptor); normalized type difference
`splice_neither_diff`	Derived	(max_splice - neither) / max_all; splice vs background contrast
`donor_acceptor_logodds`	Derived	log(donor) - log(acceptor); log-odds of donor vs acceptor
`splice_neither_logodds`	Derived	log(donor + acceptor) - log(neither); log-odds of splice vs background
`probability_entropy`	Derived	Shannon entropy of the 3-class probability distribution

Context pattern features (3 columns)

Column	Type	Description
`context_neighbor_mean`	Derived	Mean of all context scores in the window
`context_asymmetry`	Derived	Sum of upstream context - sum of downstream context
`context_max`	Derived	Maximum context score across all neighbor positions

Donor gradient features (11 columns)

Column	Type	Description
`donor_diff_m1`	Derived	donor_prob minus context score at -1 position
`donor_diff_m2`	Derived	donor_prob minus context score at -2 position
`donor_diff_p1`	Derived	donor_prob minus context score at +1 position
`donor_diff_p2`	Derived	donor_prob minus context score at +2 position
`donor_surge_ratio`	Derived	donor_prob / (neighbor_m1 + neighbor_p1); sharpness of peak
`donor_is_local_peak`	Derived	Binary: 1 if donor_prob > both immediate neighbors and > 0.001
`donor_weighted_context`	Derived	Gaussian-weighted sum of donor and context scores
`donor_peak_height_ratio`	Derived	donor_prob / mean(context); how much position stands out
`donor_second_derivative`	Derived	2*donor_prob - m1 - p1; curvature of score profile
`donor_signal_strength`	Derived	donor_prob - mean(context); absolute signal above background
`donor_context_diff_ratio`	Derived	donor_prob / max(context); ratio to strongest neighbor

Acceptor gradient features (11 columns)

Same structure as donor gradient features, computed from acceptor_prob:

Column	Type	Description
`acceptor_diff_m1`	Derived	acceptor_prob minus context score at -1 position
`acceptor_diff_m2`	Derived	acceptor_prob minus context score at -2 position
`acceptor_diff_p1`	Derived	acceptor_prob minus context score at +1 position
`acceptor_diff_p2`	Derived	acceptor_prob minus context score at +2 position
`acceptor_surge_ratio`	Derived	acceptor_prob / (neighbor_m1 + neighbor_p1)
`acceptor_is_local_peak`	Derived	Binary: 1 if acceptor_prob > both immediate neighbors and > 0.001
`acceptor_weighted_context`	Derived	Gaussian-weighted sum of acceptor and context scores
`acceptor_peak_height_ratio`	Derived	acceptor_prob / mean(context)
`acceptor_second_derivative`	Derived	2*acceptor_prob - m1 - p1
`acceptor_signal_strength`	Derived	acceptor_prob - mean(context)
`acceptor_context_diff_ratio`	Derived	acceptor_prob / max(context)

Cross-type comparative features (4 columns)

Column	Type	Description
`donor_acceptor_peak_ratio`	Derived	donor_peak_height_ratio / acceptor_peak_height_ratio
`type_signal_difference`	Derived	donor_signal_strength - acceptor_signal_strength
`score_difference_ratio`	Derived	(donor - acceptor) / (donor + acceptor); normalized difference
`signal_strength_ratio`	Derived	donor_signal_strength / acceptor_signal_strength

Group 2: Genomic Annotation¶

annotation (3 columns)¶

Joins known donor/acceptor positions from pre-extracted GTF splice site annotations onto the prediction DataFrame. Matches by (chrom, position, strand).

Source file: modalities/annotation.py

Column	Type	Description
`splice_type`	Raw (from GTF)	Ground truth label: `'donor'`, `'acceptor'`, or `''` (neither)
`transcript_id`	Raw (from GTF)	Ensembl transcript ID of the first matching transcript
`transcript_count`	Derived	Number of distinct transcripts containing this splice site

Data leakage warning: splice_type is the training label itself and must NEVER be used as a feature. It is listed in FeatureSchema.LEAKAGE_COLS along with pred_type, true_position, predicted_position, is_correct, and error_type. The transcript_id and transcript_count columns are metadata (FeatureSchema.METADATA_COLS) and should also be excluded from training features due to high cardinality and poor generalization.

genomic (4 columns)¶

Lightweight positional and compositional features derived from gene boundary coordinates and optionally from the DNA sequence column (requires the sequence modality to run first).

Source file: modalities/genomic.py

Column	Type	Description
`relative_gene_position`	Derived	Transcriptomic-aware position within gene (0.0 = 5'/TSS, 1.0 = 3'/TES); strand-corrected
`distance_to_gene_start`	Derived	Absolute distance (bp) from position to genomic gene start
`distance_to_gene_end`	Derived	Absolute distance (bp) from position to genomic gene end
`gc_content`	Derived	GC fraction in a central window of the DNA sequence (default 100bp window)

Optional (when include_dinucleotides=True): cpg_density (CpG dinucleotide frequency).

Group 3: External Data¶

sequence (3 columns)¶

Extracts fixed-length DNA windows from the reference FASTA using pyfaidx for efficient random access. The sequence column is consumed by downstream modalities (genomic context for GC content) and by the meta-layer model as a raw input.

Source file: modalities/sequence.py

Column	Type	Description
`sequence`	Raw (from FASTA)	DNA sequence string of length `2 * window_size + 1` (default: 1001nt) centered on position
`window_start`	Derived	Genomic start coordinate of the extracted window
`window_end`	Derived	Genomic end coordinate of the extracted window

Note: window_start and window_end are metadata columns, not training features. The sequence column itself is a raw string consumed by the meta-layer's sequence encoder, not a numeric feature.

conservation (9 columns)¶

Evolutionary constraint scores from UCSC multi-species alignment bigWig tracks. Build-matched: GRCh38 uses 100-way vertebrate alignment, GRCh37 uses 46-way. Requires pyBigWig. Supports local files and remote UCSC HTTP streaming.

Source file: modalities/conservation.py

Column	Type	Description
`phylop_score`	Raw (from bigWig)	PhyloP score at the exact position; positive = conserved, negative = fast-evolving
`phylop_context_mean`	Derived	Mean PhyloP score in a window around position (default ±10bp)
`phylop_context_max`	Derived	Maximum PhyloP score in the context window
`phylop_context_std`	Derived	Standard deviation of PhyloP scores in the context window
`phastcons_score`	Raw (from bigWig)	PhastCons probability of being in a conserved element (0-1)
`phastcons_context_mean`	Derived	Mean PhastCons score in the context window
`phastcons_context_max`	Derived	Maximum PhastCons score in the context window
`phastcons_context_std`	Derived	Standard deviation of PhastCons scores in the context window
`conservation_contrast`	Derived	`phylop_score - phylop_context_mean`; how much this position stands out from local context

epigenetic (12 columns)¶

Histone modification signals from ENCODE ChIP-seq fold-change-over-control bigWig tracks. Default mode is summarized (Strategy B): cross-tissue summary statistics from 8 ENCODE cell lines (K562, GM12878, H1, HepG2, A549, keratinocyte, MCF-7, SK-N-SH).

GRCh38 only. For GRCh37 builds, all columns are filled with NaN (graceful degradation). Requires pyBigWig.

Source file: modalities/epigenetic.py

H3K36me3 (exon body mark) -- 6 columns

Column	Type	Description
`h3k36me3_max_across_tissues`	Derived	Maximum H3K36me3 signal across all cell lines at this position
`h3k36me3_mean_across_tissues`	Derived	Mean H3K36me3 signal across cell lines
`h3k36me3_tissue_breadth`	Derived	Number of cell lines with signal above threshold (default > 1.5 fold-change)
`h3k36me3_variance`	Derived	Variance of H3K36me3 signal across cell lines; high = tissue-specific
`h3k36me3_context_mean`	Derived	Mean signal in a 200bp window around position (averaged across cell lines)
`h3k36me3_exon_intron_ratio`	Derived	Log2-ratio of upstream (exonic) to downstream (intronic) signal; positive = exon enrichment

H3K4me3 (promoter mark) -- 6 columns

Column	Type	Description
`h3k4me3_max_across_tissues`	Derived	Maximum H3K4me3 signal across all cell lines at this position
`h3k4me3_mean_across_tissues`	Derived	Mean H3K4me3 signal across cell lines
`h3k4me3_tissue_breadth`	Derived	Number of cell lines with signal above threshold
`h3k4me3_variance`	Derived	Variance of H3K4me3 signal across cell lines
`h3k4me3_context_mean`	Derived	Mean signal in a 200bp window around position
`h3k4me3_exon_intron_ratio`	Derived	Log2-ratio of upstream to downstream signal

junction (12 columns)¶

RNA-seq splice junction read evidence. Features are sparse -- they are attributed to splice site boundary positions (donor and acceptor) via a left-join. Most genomic positions receive zero values. Supports STAR SJ.out.tab (single sample) and pre-aggregated multi-tissue tables (GTEx/recount3).

This modality is label-agnostic: it produces the same columns regardless of downstream usage. The meta-layer config determines whether junction columns are used as features (M2: alternative site detector) or as held-out targets (M3: novel junction predictor).

Source file: modalities/junction.py

Column	Type	Description
`junction_log1p`	Derived	`log1p(total_reads)` across all junctions at this boundary position
`junction_has_support`	Derived	Binary: 1.0 if any junction evidence exists at this position, 0.0 otherwise
`junction_n_partners`	Derived	Number of distinct partner positions (competing donors or acceptors sharing this boundary)
`junction_max_reads`	Raw/Derived	Maximum read count from any single junction anchored at this position
`junction_entropy`	Derived	Shannon entropy (log2) of read distribution across partner junctions; high = many competing junctions
`junction_is_annotated`	Raw (from STAR/GTF)	Binary: 1.0 if any junction at this position is annotated in the reference GTF
`junction_tissue_breadth`	Derived	Number of tissues with reads >= breadth_threshold (multi-tissue data only; 0 for single-sample)
`junction_tissue_max`	Derived	Maximum read count across tissues
`junction_tissue_mean`	Derived	Mean read count across tissues
`junction_tissue_variance`	Derived	Variance of read counts across tissues; high = tissue-specific junction usage
`junction_psi`	Derived	Percent Spliced In: max_reads / total_reads at position; measures dominance of strongest junction
`junction_psi_variance`	Derived	Variance of PSI across tissues (multi-tissue only; NaN for single-sample)

rbp_eclip (8 columns)¶

RNA-binding protein (RBP) occupancy from ENCODE eCLIP experiments. Features are sparse — most positions have zero values (no overlapping peaks). Uses pre-aggregated parquet from scripts/aggregate_eclip_peaks.py which queries the ENCODE REST API for IDR-filtered replicate-merged narrowPeak files.

Default cell lines: K562, HepG2. GRCh38 only; GRCh37 returns zero-filled columns.

Source file: modalities/rbp_eclip.py

See: examples/features/docs/rbp-eclip-tutorial.md for biology background and interpretation guide.

Column	Type	Description
`rbp_n_bound`	Derived	Count of unique RBPs with binding peaks overlapping this position
`rbp_max_signal`	Derived	Maximum fold-enrichment across all overlapping RBP peaks
`rbp_max_neg_log10_pvalue`	Derived	Maximum significance (-log10 p-value) among overlapping peaks
`rbp_has_splice_regulator`	Derived	Binary: 1.0 if any known splice regulator (SR protein, hnRNP, or core factor) is bound
`rbp_n_sr_proteins`	Derived	Count of SR proteins (SRSF1, SRSF3, etc.) with peaks at this position
`rbp_n_hnrnps`	Derived	Count of hnRNP proteins (HNRNPA1, HNRNPC, etc.) with peaks
`rbp_cell_line_breadth`	Derived	Number of cell lines (0-2) with binding evidence at this position
`rbp_mean_signal`	Derived	Mean fold-enrichment across all overlapping peaks

chrom_access (12 columns)¶

Chromatin accessibility from two complementary ENCODE data sources: - ATAC-seq (fold-change-over-control): 5 cancer cell lines (K562, GM12878, HepG2, A549, IMR-90) - DNase-seq (read-depth normalized): 5 primary tissues (brain cortex, heart, lung, muscle, liver)

Both assays measure nucleosome-free DNA but use different signal normalization (different scales), so they are kept as separate column groups (atac_* and dnase_*). The meta-layer model learns their individual contributions.

GRCh38 only. For GRCh37 builds, all columns are filled with NaN (graceful degradation). Requires pyBigWig.

Source file: modalities/chrom_access.py

See: examples/features/docs/chromatin-accessibility-tutorial.md for biology background, ENCODE data sources, and why ATAC/DNase use separate registries.

ATAC-seq (fold-change, cancer cell lines) — 6 columns

Column	Type	Description
`atac_max_across_tissues`	Derived	Maximum ATAC-seq fold-change signal across all cell lines at this position
`atac_mean_across_tissues`	Derived	Mean ATAC-seq signal across cell lines
`atac_tissue_breadth`	Derived	Number of cell lines with signal above threshold (default > 2.0 fold-change)
`atac_variance`	Derived	Variance of ATAC-seq signal across cell lines; high = tissue-specific accessibility
`atac_context_mean`	Derived	Mean signal in a 150bp window around position (averaged across cell lines)
`atac_has_peak`	Derived	Binary: 1.0 if maximum signal > peak threshold (default > 3.0 fold-change), 0.0 otherwise

DNase-seq (read-depth normalized, primary tissues) — 6 columns

Column	Type	Description
`dnase_max_across_tissues`	Derived	Maximum DNase-seq read-depth signal across primary tissues
`dnase_mean_across_tissues`	Derived	Mean DNase-seq signal across tissues
`dnase_tissue_breadth`	Derived	Number of tissues with signal above threshold (default > 5.0 read-depth)
`dnase_variance`	Derived	Variance of DNase-seq signal across tissues; high = tissue-specific accessibility
`dnase_context_mean`	Derived	Mean signal in a 150bp window around position (averaged across tissues)
`dnase_has_peak`	Derived	Binary: 1.0 if maximum signal > peak threshold (default > 10.0 read-depth), 0.0 otherwise

fm_embeddings (10 columns)¶

Label-agnostic scalar features derived from pre-computed foundation model per-position embeddings. All features are computed without using splice site annotations, avoiding any risk of label leakage. This modality is a reader — embeddings must be pre-extracted on a GPU pod using the foundation_models sub-project, then PCA-projected into scalar features. Foundation-model-agnostic: all columns use the fm_ prefix regardless of the underlying model (Evo2, SpliceBERT, etc.).

Requires pre-extracted per-chromosome embedding parquets and PCA artifacts (.npz file fit on training chromosomes only). If unavailable, all columns are filled with NaN (graceful degradation).

Source file: modalities/fm_embeddings.py

See: examples/features/docs/fm-embeddings-tutorial.md for the extraction workflow, PCA fitting, and feature interpretation.

PCA components (6 columns, default)

Column	Type	Description
`fm_pca_1`	Derived	1^st principal component of the embedding vector (captures dominant variation)
`fm_pca_2`	Derived	2^nd principal component
`fm_pca_3`	Derived	3^rd principal component
`fm_pca_4`	Derived	4^th principal component
`fm_pca_5`	Derived	5^th principal component
`fm_pca_6`	Derived	6^th principal component

Summary statistics (2 columns)

Column	Type	Description
`fm_embedding_norm`	Derived	L2 magnitude of the embedding vector; correlates with model confidence and sequence complexity
`fm_local_gradient`	Derived	L2 norm of difference between this position's embedding and the mean of its neighbors within the same gene; detects splice boundary transitions

Optional centroid features (disabled by default, include_cosine_centroids=True)

Column	Type	Description
`fm_donor_cosine_sim`	Derived	Cosine similarity to the mean donor site embedding centroid (fit on training chromosomes). Disabled by default: uses ground truth labels for centroid computation, redundant with base model scores
`fm_acceptor_cosine_sim`	Derived	Cosine similarity to the mean acceptor site embedding centroid. Same caveat as above

Data Leakage Reference¶

The FeatureSchema (in meta_layer/core/feature_schema.py) explicitly tracks columns that must never be used as training features:

Leakage columns (LEAKAGE_COLS) -- directly encode or correlate with the label:

splice_type -- the target label itself (from annotation modality)
pred_type -- base model prediction type
true_position -- exact coordinate of real splice site
predicted_position -- tightly correlated with label
is_correct -- whether base model was correct (TP/TN)
error_type -- FP/FN/TP/TN classification

Metadata columns (METADATA_COLS) -- high cardinality, do not generalize:

gene_id, transcript_id, gene_name, gene_type
chrom, strand, position, absolute_position
window_start, window_end, transcript_count

Use FeatureSchema.is_leaky_column(col) and FeatureSchema.get_excluded_cols() to programmatically enforce these exclusions.

Feature Type Summary¶

Category	Count	Examples
Raw (direct from data source)	~10	`donor_score`, `phylop_score`, `phastcons_score`, `sequence`, `junction_is_annotated`
Derived (engineered from raw)	~100	`probability_entropy`, `donor_surge_ratio`, `conservation_contrast`, `h3k36me3_tissue_breadth`, `rbp_n_bound`, `atac_has_peak`, `fm_pca_1`, `fm_embedding_norm`
Labels (never use as features)	1	`splice_type`
Metadata (not for training)	~11	`gene_id`, `chrom`, `position`, `window_start`, `transcript_id`

Per-Modality Tutorials¶

For detailed biology background, data source descriptions, and interpretation guidance:

Epigenetic Marks Tutorial — H3K36me3/H3K4me3 ChIP-seq
RBP eCLIP Tutorial — ENCODE RBP binding
Chromatin Accessibility Tutorial — ENCODE ATAC-seq
Foundation Model Embeddings Tutorial — Evo2/SpliceBERT scalar features

Last Updated: March 27, 2026