Skip to content

Junction-Supported Exon Labels for Evo2 Training

Purpose: Explain why and how to use RNA-seq-backed splice junction support counts to produce higher-confidence per-nucleotide exon/intron labels for training the Evo2 exon classifier.

Audience: Developers and researchers building the Evo2 exon prediction pipeline.

Status: Design — ready to implement


Background

What are "exon labels"?

When training the exon classifier on top of Evo2 embeddings, we need a binary label for each nucleotide in a gene sequence: 1 = exon, 0 = intron.

The simplest approach derives these labels directly from the GTF annotation:

gene_sequence_*.parquet  +  splice_sites_enhanced.tsv
build_exon_labels(gene_id, ...)   ← implemented in utils/chunking.py
per-nucleotide [0, 1, 1, 1, 0, 0, ...] array

This works, but it has a subtle weakness: GTF annotations include all annotated exons across all transcripts of a gene, many of which are low- confidence or tissue-specific. For a classifier that we want to be useful for adaptive splice site prediction (i.e. detecting which exons are actually active), annotation-only labels are too permissive.

What is junction support?

A splice junction is a (donor_position, acceptor_position) pair representing one intron-removal event. In RNA-seq workflows, junctions are detected by split-read aligners (e.g. STAR, HISAT2) and each junction receives a read support count — the number of RNA-seq reads that span it.

A junction with high read support is strong evidence that the corresponding exon boundaries are real and expressed in the tissue of interest.

Why use junction support for labels?

Label source Pros Cons
GTF annotation only Always available; no RNA-seq needed Includes unannotated / theoretical exons; no tissue specificity
Junction support Reflects actual splicing activity; tissue-specific; filters noise Requires RNA-seq data; coverage-dependent
Combined (this doc) Best of both: GTF as skeleton, junction support as confidence Slightly more complex pipeline

The combined strategy is also what we need downstream: the goal is to predict novel isoforms not yet in the GTF. A classifier trained purely on GTF labels will never generalise to unannotated splice events.


Data Sources Already Available

From the existing data/ infrastructure:

File Location Contents
splice_sites_enhanced.tsv data/{source}/{build}/ Annotated donor/acceptor positions with exon_rank, transcript_id, gene_id
junctions.tsv data/ensembl/GRCh38/ Junction-level annotations (when RNA-seq data has been processed)
gene_sequence_*.parquet data/ensembl/GRCh37/ Per-gene full sequences, one row per gene

The junctions.tsv format (from meta-spliceai processing) typically contains:

chrom  start  end  strand  support_count  gene_id  transcript_id  ...

Where start is the donor position (last base of exon) and end is the acceptor position (first base of next exon), with support_count being the number of RNA-seq reads supporting this junction.


Implementation

Step 1 — Load junction support counts

import pandas as pd
from pathlib import Path

def load_junctions(junctions_path: str | Path) -> pd.DataFrame:
    """Load junction support file and return a clean DataFrame.

    Expected columns:
        chrom, start (donor pos), end (acceptor pos), strand,
        support_count, gene_id
    """
    df = pd.read_csv(junctions_path, sep="\t")

    # Normalise column names (different tools use different names)
    rename_map = {
        "reads": "support_count",
        "score": "support_count",
        "count": "support_count",
        "block_count": "support_count",
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    required = {"chrom", "start", "end", "strand", "gene_id"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"junctions file missing columns: {missing}")

    return df

Step 2 — Build a confidence-weighted label

Instead of a hard binary label, we build a float label in [0, 1] representing the probability that each nucleotide is part of a supported exon. This can be used as a soft target during training.

import numpy as np

def build_junction_weighted_labels(
    gene_id: str,
    gene_start: int,
    gene_sequence_length: int,
    splice_sites_df: pd.DataFrame,
    junctions_df: pd.DataFrame,
    min_support: int = 3,
    soft: bool = True,
) -> np.ndarray:
    """Per-nucleotide exon label weighted by junction read support.

    Parameters
    ----------
    gene_id:
        Ensembl / MANE gene ID.
    gene_start:
        Genomic start of the gene (0-based, from gene_sequence parquet).
    gene_sequence_length:
        Length of the gene sequence string.
    splice_sites_df:
        Loaded splice_sites_enhanced.tsv DataFrame.
    junctions_df:
        Loaded junctions.tsv DataFrame filtered to this chromosome.
    min_support:
        Minimum junction read count to count a junction as supported.
        Junctions below this threshold are treated as absent.
    soft:
        If True, return float weights normalised to [0, 1].
        If False, return a hard binary label (1 if any supported junction
        uses this nucleotide as part of its exon span, 0 otherwise).

    Returns
    -------
    np.ndarray of shape [gene_sequence_length], dtype float32 if soft,
    uint8 if hard.
    """
    labels = np.zeros(gene_sequence_length, dtype=np.float32)

    # Filter junctions for this gene
    gene_juncs = junctions_df[
        (junctions_df["gene_id"] == gene_id) &
        (junctions_df["support_count"] >= min_support)
    ].copy()

    # Filter annotation sites for this gene
    gene_sites = splice_sites_df[splice_sites_df["gene_id"] == gene_id].copy()

    if gene_sites.empty:
        return labels if soft else labels.astype(np.uint8)

    rank_col = "exon_rank" if "exon_rank" in gene_sites.columns else "exon_number"

    for tx_id, tx_sites in gene_sites.groupby("transcript_id"):
        tx_sites = tx_sites.sort_values(rank_col)
        donors = sorted(tx_sites[tx_sites["site_type"] == "donor"]["position"].tolist())
        acceptors = sorted(tx_sites[tx_sites["site_type"] == "acceptor"]["position"].tolist())
        strand = tx_sites["strand"].iloc[0]

        # Build annotated exon spans
        spans: list[tuple[int, int, float]] = []  # (start, end, weight)

        pairs = (
            zip(acceptors, donors) if strand == "+"
            else zip(donors, acceptors)
        )

        for acc, don in pairs:
            if acc > don:
                continue

            # Look up junction support for this intron
            # The intron is (don + 1, acc - 1) in genomic coords
            match = gene_juncs[
                (gene_juncs["start"] == don) &
                (gene_juncs["end"] == acc)
            ]

            if not gene_juncs.empty and len(match) == 0:
                # Junction annotation exists but no read support → down-weight
                weight = 0.1 if soft else 0
            elif len(match) > 0:
                # Supported junction → weight proportional to log(support)
                support = match["support_count"].iloc[0]
                weight = float(np.log1p(support) / np.log1p(gene_juncs["support_count"].max()))
            else:
                # No junction data at all → fall back to annotation weight
                weight = 0.5

            spans.append((acc, don, weight))

        for gstart, gend, weight in spans:
            rel_s = max(0, gstart - gene_start)
            rel_e = min(gene_sequence_length, gend - gene_start + 1)
            if rel_s < rel_e:
                labels[rel_s:rel_e] = np.maximum(labels[rel_s:rel_e], weight)

    if not soft:
        return (labels >= 0.5).astype(np.uint8)
    return labels

Step 3 — Integrate into the training data pipeline

from foundation_models.utils.chunking import (
    load_gene_sequences_parquet,
    make_chunks_for_gene,
    generate_labeled_windows,
)
from pathlib import Path
import pandas as pd

DATA_ROOT = Path("data/ensembl/GRCh37")

def build_training_dataset(
    chromosomes: list[str] = ["1", "2", "3"],
    window_size: int = 4096,
    step_size: int = 2048,
    min_support: int = 3,
    use_junction_weights: bool = True,
):
    """Build a list of LabeledWindow objects for Evo2 exon classifier training.

    Yields windows across all genes on the specified chromosomes.
    """
    splice_sites = pd.read_csv(DATA_ROOT / "splice_sites_enhanced.tsv", sep="\t")
    junctions_df = None
    if use_junction_weights and (DATA_ROOT / "junctions.tsv").exists():
        junctions_df = pd.read_csv(DATA_ROOT / "junctions.tsv", sep="\t")

    for chrom in chromosomes:
        parquet_path = DATA_ROOT / f"gene_sequence_{chrom}.parquet"
        if not parquet_path.exists():
            print(f"  Skipping chr{chrom}: parquet not found")
            continue

        genes_df = load_gene_sequences_parquet(parquet_path)
        print(f"  chr{chrom}: {len(genes_df)} genes")

        for _, row in genes_df.iterrows():
            gene_id = row["gene_id"]
            gene_start = int(row["start"])
            seq_len = len(row["sequence"])

            if use_junction_weights and junctions_df is not None:
                labels = build_junction_weighted_labels(
                    gene_id=gene_id,
                    gene_start=gene_start,
                    gene_sequence_length=seq_len,
                    splice_sites_df=splice_sites,
                    junctions_df=junctions_df,
                    min_support=min_support,
                    soft=True,
                )
            else:
                from foundation_models.utils.chunking import build_exon_labels
                labels = build_exon_labels(
                    gene_id=gene_id,
                    gene_start=gene_start,
                    gene_sequence_length=seq_len,
                    splice_sites_df=splice_sites,
                ).astype(np.float32)

            yield from generate_labeled_windows(
                gene_sequence=row["sequence"],
                exon_labels=labels,
                gene_id=gene_id,
                chrom=str(chrom),
                gene_genomic_start=gene_start,
                window_size=window_size,
                step_size=step_size,
            )

Soft Labels vs Hard Labels — When to Use Which

Scenario Recommendation
Reproducing Evo2 paper exon classifier Hard labels from GTF annotation (build_exon_labels)
Training on well-covered RNA-seq data Soft labels (soft=True, min_support=5)
Transfer learning to a new tissue Soft labels from that tissue's RNA-seq
Low-coverage or noisy RNA-seq Hard labels with junction filtering (soft=False, min_support=10)
No RNA-seq available Hard labels from annotation only

Training with Soft Labels

The ExonClassifier.fit() method uses BCEWithLogitsLoss. To train with soft float targets, no change to the classifier is needed — BCEWithLogitsLoss accepts targets in [0, 1]:

from foundation_models.evo2 import Evo2Embedder, ExonClassifier
import torch

embedder = Evo2Embedder(model_size="7b", quantize=True)

# Collect embeddings and labels for a small gene set
embeddings_list, labels_list = [], []

for window in build_training_dataset(chromosomes=["22"], window_size=4096):
    emb = embedder.encode(window.sequence)          # [seq_len, 2560]
    embeddings_list.append(emb)
    labels_list.append(torch.tensor(window.labels, dtype=torch.float32))

train_embeddings = torch.stack(embeddings_list)     # [N, seq_len, 2560]
train_labels = torch.stack(labels_list)             # [N, seq_len]

classifier = ExonClassifier(input_dim=2560, architecture="mlp")
classifier.fit(
    train_embeddings=train_embeddings,
    train_labels=train_labels,
    epochs=50,
)

Junction Confidence Tiers

A practical way to think about label quality:

Tier 1 — High confidence  (support ≥ 10 reads)
    → weight = log(support) / log(max_support)  ≈ 0.7–1.0
    → safe for direct training

Tier 2 — Medium confidence  (3 ≤ support < 10)
    → weight ≈ 0.3–0.7
    → useful if balanced against Tier 1 samples

Tier 3 — GTF-only, no reads  (support = 0)
    → weight = 0.1 (soft) or 0 (hard)
    → exclude from training OR use as negative-confidence examples

Tier 4 — Novel junction (not in GTF, reads present)
    → weight = log(support) / log(max_support)
    → GOLD for novel isoform discovery — label as exon even if GTF says intron

Tier 4 is the most scientifically interesting — a novel junction with strong read support is a candidate novel splice event. To detect these, we would need to cross-reference the junction positions against the GTF and flag any junction whose donor or acceptor falls outside annotated exon boundaries.


Connection to Adaptive Splice Site Prediction

The exon classifier trained here feeds into the meta-layer:

Gene sequence (full gene or ±10 kb window)
        ↓ Evo2 encoder (frozen)
        ↓ Evo2 embeddings [seq_len, 2560]
        ↓ ExonClassifier
Exon probability per nucleotide  [seq_len]
        ↓  (used as additional feature)
Meta-layer (OpenSpliceAI scores + Evo2 exon probs + conservation + ...)
Adaptive splice site predictions
Novel isoform candidates

Key insight: splice site prediction models like OpenSpliceAI only see a short window (501 bp) around each candidate site. Evo2's exon probability provides global gene-level context — if the exon classifier says position X is unlikely to be exonic, the meta-layer can suppress a false-positive donor prediction there.


Next Steps

  • Obtain/generate junctions.tsv for GRCh38 MANE data (STAR alignment of GTEx or ENCODE RNA-seq)
  • Run build_training_dataset on chr21/chr22 as a fast prototype
  • Train and evaluate ExonClassifier (MLP vs CNN) — target AUROC ≥ 0.85
  • Integrate exon probability as a meta-layer feature

Last Updated: February 22, 2026 Status: Design complete — awaiting RNA-seq junction data for MANE/GRCh38