
Architecture: Multi-Layer Pipeline to Novel Isoforms

The architecture has three layers that progress from canonical splice prediction to novel isoform discovery:

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'15px','fontFamily':'ui-sans-serif, system-ui, sans-serif'}}}%%

graph TB
    subgraph AGENTIC["<b>🤖 AGENTIC LAYER</b> - Clinical Translation & Validation"]
        direction TB
        LIT["<b>📚 Literature Mining</b><br/>PubMed • arXiv<br/>Splice Databases"]:::agent
        EXP["<b>🧬 Expression Evidence</b><br/>GTEx • TCGA<br/>RNA-seq Junctions"]:::agent
        CLIN["<b>🏥 Clinical Integration</b><br/>ClinVar • COSMIC<br/>Disease Associations"]:::agent

        NEXUS["<b>🎯 Nexus Research Agent</b><br/>(Orchestrator)<br/>━━━━━━━━━━━━━━━<br/>• Evidence Aggregation<br/>• Validation Workflows<br/>• Drug Target Assessment<br/>• Report Generation"]:::orchestrator

        LIT --> NEXUS
        EXP --> NEXUS
        CLIN --> NEXUS

        OUTPUT1["<b>✅ OUTPUT</b><br/>Validated Novel Isoforms<br/>Drug Target Reports"]:::output
        NEXUS --> OUTPUT1
    end

    subgraph META["<b>🧠 META LAYER</b> - Adaptive Context-Aware Prediction"]
        direction TB

        MULTIMODAL["<b>🎨 Multimodal Evidence Fusion</b><br/>9 Modalities • 100 Features"]:::metalayer

        BASE["<b>📊 Base Scores</b><br/>Foundation Model<br/>Predictions (43)"]:::input
        SEQ["<b>🧬 Sequence + Genomic</b><br/>DNA Context • GC<br/>Conservation (19)"]:::input
        EPI["<b>🧪 Epigenetic + Chromatin</b><br/>H3K36me3 • H3K4me3<br/>ATAC-seq (18)"]:::input
        RNA["<b>🔬 RNA Evidence</b><br/>Junction Reads • RBP<br/>eCLIP Binding (20)"]:::input

        BASE --> MULTIMODAL
        SEQ --> MULTIMODAL
        EPI --> MULTIMODAL
        RNA --> MULTIMODAL

        FUSION["<b>⚡ Fusion Predictor</b><br/>+ Delta Scorer<br/>━━━━━━━━━━━━━━━<br/>Δ = Meta - Base<br/>High Δ → Novel Site!"]:::fusion

        MULTIMODAL --> FUSION

        DETECTOR["<b>🔍 Novel Site Detector</b><br/>━━━━━━━━━━━━━━━<br/>• High-confidence Filtering<br/>• Context Clustering<br/>• Multi-factor Scoring"]:::discovery

        RECON["<b>🧩 Isoform Reconstruction</b><br/>━━━━━━━━━━━━━━━<br/>• Transcript Assembly<br/>• ORF Validation<br/>• Functional Annotation"]:::discovery

        FUSION --> DETECTOR
        DETECTOR --> RECON

        OUTPUT2["<b>✅ OUTPUT</b><br/>Novel Splice Sites<br/>Reconstructed Isoforms"]:::output
        RECON --> OUTPUT2
    end

    subgraph BASE_LAYER["<b>🔬 BASE LAYER</b> - Foundation Models (Extensible)"]
        direction TB

        RUNNER["<b>⚙️ Base Model Runner</b><br/>Standardized I/O Protocol"]:::baselayer

        SA["<b>SpliceAI</b><br/>GRCh37<br/>Pre-trained"]:::foundation
        OSA["<b>OpenSpliceAI</b><br/>GRCh38/MANE<br/>Pre-trained"]:::foundation
        EXT["<b>Extensible</b><br/>Evo • GPT-based<br/>Any New Model"]:::foundation

        RUNNER --> SA
        RUNNER --> OSA
        RUNNER --> EXT

        RESOURCES["<b>📂 Genomic Resources</b><br/>━━━━━━━━━━━━━━━<br/>• GTF/FASTA Loading<br/>• Sequence Extraction<br/>• Splice Annotation<br/>• Resource Registry"]:::resources

        RESOURCES --> RUNNER

        OUTPUT3["<b>✅ OUTPUT</b><br/>Per-Nucleotide Scores<br/>Canonical Baseline (~10%)"]:::output
        SA --> OUTPUT3
        OSA --> OUTPUT3
        EXT --> OUTPUT3
    end

    FINAL["<b>🎉 NOVEL ISOFORM CATALOG</b><br/>━━━━━━━━━━━━━━━━━━━━━<br/>✓ Disease-Specific Isoforms<br/>✓ Variant-Induced Splicing<br/>✓ Tissue-Specific Transcripts<br/>✓ Druggable Targets + Evidence<br/>✓ Biomarker Candidates"]:::final

    OUTPUT3 --> META
    OUTPUT2 --> AGENTIC
    OUTPUT1 --> FINAL

    classDef agent fill:#0891b2,stroke:#0e7490,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef orchestrator fill:#7c3aed,stroke:#6d28d9,stroke-width:4px,color:#ffffff,font-weight:bold
    classDef metalayer fill:#8b5cf6,stroke:#7c3aed,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef input fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#ffffff
    classDef fusion fill:#d946ef,stroke:#c026d3,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef discovery fill:#059669,stroke:#047857,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef baselayer fill:#1e40af,stroke:#1e3a8a,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef foundation fill:#3b82f6,stroke:#2563eb,stroke-width:2px,color:#ffffff
    classDef resources fill:#475569,stroke:#334155,stroke-width:2px,color:#ffffff
    classDef output fill:#ea580c,stroke:#c2410c,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef final fill:#d97706,stroke:#b45309,stroke-width:4px,color:#ffffff,font-weight:bold,font-size:16px

Layer Responsibilities

| Layer | Purpose | Output | Status |
|---|---|---|---|
| Base Layer | Canonical splice prediction (MANE) | Baseline scores for ~10% of sites | Done |
| Feature Engineering | Multimodal evidence fusion | 9-modality, 100-column enriched features | Done |
| Foundation Models | Evo2/SpliceBERT splice classification | Per-nucleotide embeddings + classifiers | Experimental |
| Meta Layer | Context-aware adaptive prediction (M1-M4) | Novel sites (90% beyond MANE) | Active |
| Agentic Layer | Multi-source validation + reports | Validated isoforms + drug targets | Planned |
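The Base Layer's "Standardized I/O Protocol" can be pictured as a small interface that every foundation model implements, so that SpliceAI, OpenSpliceAI, or a newly added model can be run interchangeably. The sketch below is illustrative only: `BaseModel`, `SpliceScores`, `ConstantModel`, and `run_models` are hypothetical names chosen for this example and are not the project's actual `BaseModelRunner` API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SpliceScores:
    """Per-nucleotide scores for one sequence (hypothetical container)."""
    donor: list[float]
    acceptor: list[float]


class BaseModel(Protocol):
    """Hypothetical interface: any model mapping a DNA string to per-base scores."""
    name: str

    def predict(self, sequence: str) -> SpliceScores: ...


class ConstantModel:
    """Toy stand-in for a real predictor; returns the same score at every base."""

    def __init__(self, name: str, score: float) -> None:
        self.name = name
        self.score = score

    def predict(self, sequence: str) -> SpliceScores:
        n = len(sequence)
        return SpliceScores(donor=[self.score] * n, acceptor=[self.score] * n)


def run_models(models: list[BaseModel], sequence: str) -> dict[str, SpliceScores]:
    """Run every registered model on the same input and key results by model name."""
    results: dict[str, SpliceScores] = {}
    for model in models:
        scores = model.predict(sequence)
        # Enforce the shared contract: one donor and one acceptor score per nucleotide.
        if len(scores.donor) != len(sequence) or len(scores.acceptor) != len(sequence):
            raise ValueError(f"{model.name}: expected {len(sequence)} scores per channel")
        results[model.name] = scores
    return results


outputs = run_models([ConstantModel("toy_a", 0.1), ConstantModel("toy_b", 0.9)], "ACGTAG")
```

Using a structural `Protocol` rather than a base class means a new model only has to expose a `name` and a `predict` method to plug in, which is one plausible way to keep the layer extensible.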

Feature Engineering

The multimodal pipeline fuses 9 data modalities into 100 feature columns per genomic position via a YAML-driven workflow:

| Modality | Columns | Source |
|---|---|---|
| base_scores | 43 | Foundation model predictions (SpliceAI/OpenSpliceAI) |
| annotation | 3 | Ground truth splice labels |
| sequence | 3 | DNA context via pyfaidx |
| genomic | 4 | GC content, CpG density, dinucleotides |
| conservation | 9 | PhyloP/PhastCons (UCSC bigWig) |
| epigenetic | 12 | H3K36me3/H3K4me3 ChIP-seq (ENCODE) |
| junction | 12 | GTEx RNA-seq junction evidence |
| rbp_eclip | 8 | ENCODE RBP eCLIP binding peaks |
| chrom_access | 6 | ENCODE ATAC-seq chromatin accessibility |

See docs/multimodal_feature_engineering/feature_catalog.md for the complete feature reference and examples/features/docs/ for per-modality tutorials.
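To make the column arithmetic concrete, the following sketch records the per-modality column counts from the table above and merges per-modality features for a single genomic position into one row. It is a minimal illustration under stated assumptions: `fuse_position`, the `modality.column` naming scheme, and the placeholder feature names `f0, f1, ...` are hypothetical and do not reflect the real `FeaturePipeline` or its YAML configuration.

```python
# Column counts per modality, copied from the table above.
MODALITY_COLUMNS = {
    "base_scores": 43, "annotation": 3, "sequence": 3, "genomic": 4,
    "conservation": 9, "epigenetic": 12, "junction": 12, "rbp_eclip": 8, "chrom_access": 6,
}


def fuse_position(features_by_modality: dict[str, dict[str, float]]) -> dict[str, float]:
    """Merge per-modality feature dicts for one genomic position into a single row.

    Hypothetical helper: column names are prefixed with the modality name so that
    two modalities can never collide on the same key.
    """
    row: dict[str, float] = {}
    for modality, expected in MODALITY_COLUMNS.items():
        features = features_by_modality[modality]
        if len(features) != expected:
            raise ValueError(f"{modality}: expected {expected} columns, got {len(features)}")
        for column, value in features.items():
            row[f"{modality}.{column}"] = value
    return row


# Placeholder input: every feature is 0.0 and columns are named f0, f1, ...
dummy = {
    modality: {f"f{i}": 0.0 for i in range(n)}
    for modality, n in MODALITY_COLUMNS.items()
}
fused = fuse_position(dummy)
total_columns = sum(MODALITY_COLUMNS.values())
```

The nine counts sum to the 100 feature columns quoted in the text, and the length check gives an early failure if a modality produces an unexpected number of columns.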


Delta Score Analysis

The key innovation for novel isoform discovery is the delta score, defined as the meta-layer prediction minus the base-layer prediction:

delta_score = meta_prediction - base_prediction

# Sites above the high-confidence threshold are context-dependent:
#   -> novel isoform candidate
#   -> not in the MANE canonical set
#   -> validate with RNA-seq, literature, conservation
is_novel_candidate = delta_score > 0.3

  • Base layer (SpliceAI/OpenSpliceAI): trained on canonical annotations; detects ~10% of sites
  • Meta layer (context-aware): learns from variant, disease, and tissue context; detects the remaining ~90%
  • Delta score: a large positive value marks a site the meta layer supports but the base layer misses, which is treated as evidence for a real novel isoform rather than noise
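
Applied across many sites, the same rule becomes a filter-and-rank step. The sketch below assumes each site carries one base and one meta score; `SiteScore`, `novel_candidates`, and the example numbers are hypothetical and made up for illustration, and only the 0.3 threshold comes from the text above.

```python
from dataclasses import dataclass

DELTA_THRESHOLD = 0.3  # cutoff quoted in the text above


@dataclass(frozen=True)
class SiteScore:
    """Scores for one candidate splice site (hypothetical record layout)."""
    position: int
    base_prediction: float
    meta_prediction: float

    @property
    def delta(self) -> float:
        return self.meta_prediction - self.base_prediction


def novel_candidates(sites: list[SiteScore], threshold: float = DELTA_THRESHOLD) -> list[SiteScore]:
    """Return sites whose delta exceeds the threshold, largest delta first."""
    hits = [site for site in sites if site.delta > threshold]
    return sorted(hits, key=lambda site: site.delta, reverse=True)


# Made-up scores for illustration only.
sites = [
    SiteScore(position=1_000, base_prediction=0.90, meta_prediction=0.95),  # canonical: small delta
    SiteScore(position=2_500, base_prediction=0.10, meta_prediction=0.70),  # large positive delta
    SiteScore(position=4_200, base_prediction=0.25, meta_prediction=0.75),  # large positive delta
    SiteScore(position=7_300, base_prediction=0.40, meta_prediction=0.20),  # negative delta
]
candidates = novel_candidates(sites)
```

Sites that both layers already agree on (small delta) and sites the meta layer scores lower (negative delta) drop out, leaving only positions where added context raised the prediction substantially.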

Project Structure

agentic-spliceai/
├── src/
│   ├── agentic_spliceai/
│   │   │
│   │   ├── splice_engine/           # Core splice prediction engine
│   │   │   │
│   │   │   ├── config/              # Configuration management
│   │   │   │   ├── genomic_config.py    # Config dataclass & loader
│   │   │   │   └── settings.yaml        # Default settings
│   │   │   │
│   │   │   ├── resources/           # Genomic resource management
│   │   │   │   ├── registry.py          # Path resolution for GTF/FASTA/models
│   │   │   │   └── schema.py            # Column standardization (splice_type, chrom)
│   │   │   │
│   │   │   ├── utils/               # Shared utilities
│   │   │   │   ├── dataframe.py         # DataFrame operations
│   │   │   │   ├── display.py           # Printing & formatting
│   │   │   │   └── filesystem.py        # File I/O helpers
│   │   │   │
│   │   │   ├── base_layer/          # Base model predictions
│   │   │   │   ├── models/              # Model configs + runner
│   │   │   │   │   ├── config.py            # BaseModelConfig, WorkflowConfig
│   │   │   │   │   └── runner.py            # BaseModelRunner
│   │   │   │   ├── prediction/          # Core prediction logic
│   │   │   │   ├── workflows/           # Chunked prediction pipeline
│   │   │   │   │   └── prediction.py        # PredictionWorkflow (checkpointing, resume)
│   │   │   │   ├── io/                  # Artifact management
│   │   │   │   │   └── artifacts.py         # ArtifactManager (atomic writes, mode-aware)
│   │   │   │   └── data/                # Data types & preparation
│   │   │   │
│   │   │   ├── features/            # Multimodal feature engineering
│   │   │   │   ├── pipeline.py          # FeaturePipeline (dependency resolution)
│   │   │   │   ├── workflow.py          # FeatureWorkflow (genome-scale)
│   │   │   │   ├── modality.py          # Modality protocol (ABC)
│   │   │   │   ├── verification.py      # Position alignment verification
│   │   │   │   └── modalities/          # 9 modalities:
│   │   │   │       ├── base_scores.py       # 43 engineered features
│   │   │   │       ├── annotation.py        # Ground truth labels (3)
│   │   │   │       ├── sequence.py          # DNA context via pyfaidx (3)
│   │   │   │       ├── genomic.py           # GC content, CpG, dinucs (4)
│   │   │   │       ├── conservation.py      # PhyloP/PhastCons bigWig (9)
│   │   │   │       ├── epigenetic.py        # H3K36me3/H3K4me3 ChIP-seq (12)
│   │   │   │       ├── junction.py          # GTEx RNA-seq junctions (12)
│   │   │   │       ├── rbp_eclip.py         # ENCODE RBP eCLIP binding (8)
│   │   │   │       └── chrom_access.py      # ENCODE ATAC-seq accessibility (6)
│   │   │   │
│   │   │   ├── eval/                # Cross-layer evaluation
│   │   │   │   ├── metrics.py           # TP/FP/FN, sensitivity, specificity
│   │   │   │   ├── output.py            # EvaluationOutputWriter
│   │   │   │   └── display.py           # Result visualization
│   │   │   │
│   │   │   ├── data/                # Cross-layer data utilities
│   │   │   │   └── sampling.py          # Balanced train/test sampling
│   │   │   │
│   │   │   ├── meta_layer/          # Meta-learning layer
│   │   │   │   ├── core/                # Configuration & schema
│   │   │   │   │   ├── config.py            # MetaLayerConfig
│   │   │   │   │   └── feature_schema.py    # Feature definitions (8 column groups)
│   │   │   │   ├── models/              # Neural network models
│   │   │   │   ├── training/            # Training pipeline
│   │   │   │   └── workflows/           # Meta-layer workflows
│   │   │   │
│   │   │   └── cli/                 # CLI entry points
│   │   │       ├── predict.py           # agentic-spliceai-predict
│   │   │       └── prepare.py           # agentic-spliceai-prepare
│   │   │
│   │   ├── agents/                  # Agentic workflows (WIP)
│   │   ├── server/                  # FastAPI splice service
│   │   └── analysis/                # Analysis tools & templates
│   │
│   └── nexus/                       # Research agent package
│       ├── agents/                      # Multi-agent pipeline
│       │   ├── research/                    # Research orchestrator
│       │   ├── planner/                     # Research planning
│       │   ├── researcher/                  # Information gathering
│       │   ├── writer/                      # Report writing
│       │   └── editor/                      # Report refinement
│       ├── core/                        # Core utilities
│       ├── cli/                         # CLI interface
│       └── templates/                   # Report templates
│
├── foundation_models/               # Experimental sub-project (own pyproject.toml)
│   ├── foundation_models/
│   │   ├── evo2/                        # Evo2-based exon classifier
│   │   │   ├── config.py                    # Evo2Config (device auto-detect)
│   │   │   ├── model.py                     # HuggingFace wrapper
│   │   │   ├── embedder.py                  # Chunked extraction + HDF5 cache
│   │   │   └── classifier.py                # ExonClassifier (linear/MLP/CNN/LSTM)
│   │   └── utils/                       # Quantization, chunking
│   ├── configs/skypilot/                # SkyPilot cloud deployment (RunPod)
│   ├── examples/                        # Learning path (01-05)
│   └── docs/                            # Sub-project documentation
│
├── server/                          # Standalone FastAPI services
│   ├── bio/                             # Bioinformatics Lab UI (port 8005)
│   │   ├── app.py                           # FastAPI + Jinja2 entry point
│   │   ├── bio_service.py                   # Core service (LRU cache, predictions)
│   │   └── templates/                       # HTML templates (Gene Browser, etc.)
│   ├── splice_service/                  # Splice prediction API (port 8004)
│   └── chart_service/                   # Chart/viz API (port 8003)
│
├── examples/                        # Learning path examples
│   ├── base_layer/                      # 5 scripts: prediction -> precomputation
│   ├── features/                        # 4 scripts: base scores -> genome-scale
│   ├── foundation_models/               # 5 scripts: resource check -> orchestrate
│   └── data_preparation/                # Data prep & ground truth generation
│
├── data/                            # Data directory (symlinked)
│   ├── ensembl/GRCh37/                  # Ensembl annotations
│   ├── mane/GRCh38/                     # MANE annotations
│   └── models/                          # Pre-trained model weights
│
├── notebooks/                       # Jupyter analysis & demos
├── docs/                            # Public documentation (MkDocs)
├── scripts/                         # Utility scripts
├── tests/                           # Unit tests
└── pyproject.toml                   # Package configuration
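
The `eval/metrics.py` entry in the tree above mentions TP/FP/FN counts, sensitivity, and specificity. As a reminder of how those quantities relate for splice-site calls, here is a minimal sketch over sets of positions. The function name, its signature, and the example positions are hypothetical and are not taken from the module itself.

```python
def splice_site_metrics(predicted: set[int], truth: set[int], n_positions: int) -> dict[str, float]:
    """Compare predicted splice-site positions against annotated ones.

    n_positions is the number of positions evaluated, needed to count true negatives.
    """
    tp = len(predicted & truth)   # called and annotated
    fp = len(predicted - truth)   # called but not annotated
    fn = len(truth - predicted)   # annotated but missed
    tn = n_positions - tp - fp - fn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Guard against empty denominators so the sketch never divides by zero.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


metrics = splice_site_metrics(predicted={10, 20, 30, 40}, truth={10, 20, 50}, n_positions=100)
```

Because true splice sites are rare relative to all nucleotides, specificity tends to sit close to 1 in this setting, so sensitivity and the raw FP count are usually the more informative numbers to inspect.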


Last Updated: March 2026