Base Layer: Processing Architecture¶
This document explains the core processing architecture of the base layer for splice site prediction. The base layer supports multiple foundational models (SpliceAI, OpenSpliceAI, and custom models) that predict splice sites from genomic sequences. Understanding this architecture is essential for:
- Configuring prediction workflows
- Optimizing memory usage for large-scale analyses
- Troubleshooting and debugging
- Extending the system with custom features
Overview¶
The base layer processes genomic data through a carefully designed pipeline that balances accuracy, memory efficiency, and fault tolerance. The architecture handles the challenge of processing ~20,000 genes across 24 chromosomes while keeping memory usage manageable.
Key Design Principles¶
| Principle | Implementation |
|---|---|
| Data Locality | Process one chromosome at a time |
| Fault Tolerance | Checkpoint after every 500 genes |
| Memory Efficiency | Process in mini-batches of 50 genes |
| Streaming | Use lazy loading to avoid loading entire genome |
High-Level Flow¶
Input Files            Processing                           Output Files
───────────            ──────────                           ────────────
GTF annotations ──┐
                  ├──► Data Preparation ──► Nested Loop ──► Splice predictions
FASTA genome ─────┤                         (3 levels)      Error analysis
                  │                                         Position features
Base models* ─────┘                                         Sequence contexts

* Base models: SpliceAI, OpenSpliceAI, or any model implementing the standard I/O protocol
Base Model Protocol¶
The base layer is designed to support any splice prediction model that follows a standard input/output protocol. This enables extensibility and integration of new models as they become available.
Supported Base Models¶
Currently supported models:
- SpliceAI (Keras/TensorFlow)
  - Original splice prediction model
  - Ensemble of 5 models
  - Context windows: 80, 400, 2000, 10000 bp
- OpenSpliceAI (PyTorch)
  - Reimplementation of SpliceAI in PyTorch
  - Improved training and inference
  - Compatible with the SpliceAI architecture
Standard I/O Protocol¶
Any base model can be integrated if it follows this protocol:
Input Requirements¶
Format: One-hot encoded DNA sequence
- Shape: (batch_size, sequence_length, 4)
- Channels: [A, C, G, T] in order
- Data type: float32
- Sequence structure: [flanking_context] + [5000bp core] + [flanking_context]
Example:
# Input: Gene sequence
sequence = "ATCGATCG..." # DNA string
# Convert to one-hot encoding
encoded = one_hot_encode(sequence) # Shape: (length, 4)
# Add context and create blocks
blocks = prepare_input_sequence(sequence, context=10000)
# Output shape: (num_blocks, 20000, 4)
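The one_hot_encode() and prepare_input_sequence() helpers above belong to the pipeline. As a reference for the expected encoding, a minimal stand-in encoder could look like the sketch below, assuming the [A, C, G, T] channel order from the protocol and leaving unknown bases (e.g., N) as all zeros:

import numpy as np

def one_hot_encode(sequence):
    """Minimal sketch: map A/C/G/T to channels 0-3; other bases stay all-zero."""
    mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in mapping:
            encoded[i, mapping[base]] = 1.0
    return encoded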
Output Requirements¶
Format: Splice site probabilities
- Shape: (batch_size, sequence_length, 3)
- Channels: [donor_prob, acceptor_prob, neither_prob]
- Data type: float32
- Range: [0, 1] (probabilities)
- Constraint: donor_prob + acceptor_prob + neither_prob ≈ 1.0 (per position)
Example:
# Model prediction
predictions = model.predict(blocks)
# Shape: (num_blocks, 5000, 3)
# Extract probabilities
donor_scores = predictions[:, :, 0] # Donor splice site probability
acceptor_scores = predictions[:, :, 1] # Acceptor splice site probability
neither_scores = predictions[:, :, 2] # Neither (no splice site)
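To verify a model's raw output against this protocol, a small check along these lines can be useful (a sketch; the tolerance value is an arbitrary choice):

import numpy as np

def check_output_protocol(predictions):
    """Raise if predictions violate the output protocol."""
    assert predictions.ndim == 3 and predictions.shape[-1] == 3, "expect (batch, seq_len, 3)"
    assert predictions.dtype == np.float32, "expect float32"
    assert (predictions >= 0).all() and (predictions <= 1).all(), "expect probabilities in [0, 1]"
    # Per-position probabilities should sum to ~1.0
    assert np.allclose(predictions.sum(axis=-1), 1.0, atol=1e-3), "channels should sum to ~1"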
Integration Requirements¶
To add a new base model:
- Model Loading: Implement a loader function.
- Prediction Interface: The model must work with predict_with_model():

def predict_with_model(model, x):
    """Universal prediction function."""
    if hasattr(model, 'predict'):
        # Keras-style model
        return model.predict(x, verbose=0)
    elif hasattr(model, 'forward'):
        # PyTorch-style model
        import torch
        with torch.no_grad():
            device = next(model.parameters()).device
            x_tensor = torch.FloatTensor(x).to(device)
            return model(x_tensor).cpu().numpy()
    else:
        # Add custom inference logic here
        raise NotImplementedError(f"Unsupported model type: {type(model)}")

- Update Model Registry: Add to load_spliceai_models():

def load_spliceai_models(model_dir=None, model_type='spliceai', verbosity=1):
    if model_type.lower() == 'openspliceai':
        return _load_openspliceai_models(model_dir, verbosity)
    elif model_type.lower() == 'custom_model':
        return _load_custom_model(model_dir, verbosity)
    else:
        return _load_spliceai_models(model_dir, verbosity)
Model Compatibility Matrix¶
| Model | Framework | Ensemble | Context | Status |
|---|---|---|---|---|
| SpliceAI | Keras | 5 models | 10k bp | ✅ Supported |
| OpenSpliceAI | PyTorch | 5 models | 10k bp | ✅ Supported |
| Your Model | Any | Optional | Flexible | 🔧 Easy to add |
Adding Your Own Model¶
Step 1: Ensure your model follows the I/O protocol
- Input: (batch, seq_len, 4) one-hot encoded DNA
- Output: (batch, seq_len, 3) splice probabilities
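Before wiring in a real model, the plumbing can be exercised with a trivial stand-in that satisfies the protocol (a hypothetical class, not part of the codebase):

import numpy as np

class DummySpliceModel:
    """Hypothetical stand-in: uniform donor/acceptor/neither probabilities."""
    def predict(self, x, verbose=0):
        batch, seq_len, _ = x.shape
        # (batch, seq_len, 3) with the three channels summing to 1.0
        return np.full((batch, seq_len, 3), 1.0 / 3.0, dtype=np.float32)

Because it exposes a Keras-style predict(), it flows through the first branch of predict_with_model() shown earlier.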
Step 2: Create a model loader
# In base_layer/prediction/core.py
def _load_my_model(model_dir, verbosity):
    # Load your model
    model = MyModelClass.from_pretrained(model_dir)
    return [model]  # Return as list for consistency
Step 3: Update the model type switch
# In load_spliceai_models()
if model_type.lower() == 'mymodel':
    return _load_my_model(model_dir, verbosity)
Step 4: Test with a gene
from agentic_spliceai.splice_engine.base_layer import BaseModelRunner
runner = BaseModelRunner()
result = runner.run_single_model(
    model_name='mymodel',
    target_genes=['BRCA1']
)
Data Preparation Pipeline¶
Before prediction begins, six preparation functions load and validate all required genomic resources.
Preparation Steps¶
Step 1: prepare_gene_annotations()
└── Extract transcript annotations from GTF file
└── Output: annotations_all_transcripts.tsv
Step 2: prepare_splice_site_annotations()
└── Extract known splice sites (donors and acceptors)
└── Output: splice_sites_enhanced.tsv
Step 3: prepare_genomic_sequences()
└── Extract gene sequences from FASTA
└── Output: gene_sequences_chr{N}.parquet (per chromosome)
Step 4: handle_overlapping_genes()
└── Identify genes with overlapping regions
└── Output: overlapping_gene_counts.tsv
Step 5: determine_target_chromosomes()
└── Determine which chromosomes to process
└── Can infer from target genes if specified
Step 6: load_spliceai_models()
└── Load pre-trained models (SpliceAI, OpenSpliceAI, or custom)
└── Models predict donor/acceptor/neither probabilities
└── Supports both Keras and PyTorch frameworks
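End to end, a preparation pass might look like the sketch below. The keyword arguments mirror the parameter lists in the next section; the arguments to handle_overlapping_genes() and determine_target_chromosomes() are assumptions for illustration, and full signatures are abridged:

# Sketch: the six preparation steps in order (signatures abridged)
annotations_df = prepare_gene_annotations(
    local_dir="./data/grch38",
    gtf_file="./annotations.gtf",
    do_extract=True,
)
ss_annotations_df = prepare_splice_site_annotations(
    gene_annotations_df=annotations_df,
    consensus_window=2,
)
prepare_genomic_sequences(mode='gene', seq_type='full', seq_format='parquet')
overlap_df = handle_overlapping_genes(annotations_df)          # argument assumed
chromosomes = determine_target_chromosomes(target_genes=None)  # argument assumed
models = load_spliceai_models(model_type='spliceai', verbosity=1)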
Function Details¶
1. prepare_gene_annotations()¶
Extracts transcript-level annotations from GTF/GFF3 files.
Key Parameters:
- local_dir: Directory for output files (build-specific)
- gtf_file: Path to GTF annotation file
- do_extract: Whether to extract from GTF (vs. load existing)
- target_chromosomes: Optional filter for specific chromosomes
Output Schema:

| Column | Type | Description |
|---|---|---|
| chrom | str | Chromosome (1-22, X, Y) |
| start | int | Feature start position |
| end | int | Feature end position |
| strand | str | Strand (+ or -) |
| feature | str | Feature type (exon, CDS, UTR) |
| gene_id | str | Ensembl gene ID |
| transcript_id | str | Ensembl transcript ID |
2. prepare_splice_site_annotations()¶
Extracts known splice site positions from gene annotations.
Key Parameters:
- gene_annotations_df: Pre-loaded annotations for filtering
- consensus_window: Window size for consensus calling (default: 2)
- fasta_file: Required for OpenSpliceAI fallback mode
Output Schema:

| Column | Type | Description |
|---|---|---|
| chrom | str | Chromosome |
| position | int | Splice site position |
| strand | str | Strand |
| splice_type | str | 'donor' or 'acceptor' |
| gene_id | str | Gene identifier |
3. prepare_genomic_sequences()¶
Extracts nucleotide sequences for each gene from the reference genome.
Key Parameters:
- mode: 'gene' (full gene) or 'transcript' (per-transcript)
- seq_type: 'full' (start to end) or 'minmax' (exon boundaries)
- seq_format: Output format ('parquet' recommended)
- single_sequence_file: Combine all chromosomes into one file
Memory Consideration: Sequences are stored per-chromosome to enable streaming during prediction.
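This per-chromosome layout is what enables the scan_chromosome_sequence() helper used in the processing loop below; a minimal sketch with Polars (the directory layout is an assumption based on the file naming in Step 3):

import polars as pl

def scan_chromosome_sequence(chromosome, local_dir="./data/grch38"):
    """Lazily scan one chromosome's sequences; nothing is read until .collect()."""
    return pl.scan_parquet(f"{local_dir}/gene_sequences_chr{chromosome}.parquet")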
4. handle_overlapping_genes()¶
Identifies genes with overlapping genomic regions, which can cause ambiguous splice site assignments.
Output: DataFrame with overlap counts per gene, useful for filtering or flagging ambiguous predictions.
5. determine_target_chromosomes()¶
Determines which chromosomes to process, in priority order:

1. Explicitly specified chromosomes (highest priority)
2. Chromosomes containing target genes (if specified)
3. All standard chromosomes (default fallback)
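A minimal sketch of that priority order (the gene_to_chrom mapping and the exact signature are assumptions):

def determine_target_chromosomes(target_chromosomes=None, target_genes=None,
                                 gene_to_chrom=None):
    """Resolve chromosomes by priority: explicit list > target genes > all standard."""
    if target_chromosomes:
        return target_chromosomes                       # 1. explicit list wins
    if target_genes and gene_to_chrom:
        return sorted({gene_to_chrom[g] for g in target_genes
                       if g in gene_to_chrom})          # 2. inferred from genes
    return [str(i) for i in range(1, 23)] + ['X', 'Y']  # 3. default fallback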
6. load_spliceai_models()¶
Loads the pre-trained base models. With the default model_type, this is the SpliceAI ensemble (5 Keras models); each model independently predicts splice site probabilities, and the predictions are averaged for robustness. The model_type parameter selects the OpenSpliceAI or custom loaders instead (see Base Model Protocol above).
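A sketch of the averaging step, reusing predict_with_model() from the protocol section above:

import numpy as np

def ensemble_predict(models, x):
    """Average per-model probabilities across the ensemble."""
    per_model = [predict_with_model(m, x) for m in models]
    return np.mean(per_model, axis=0)  # shape unchanged: (batch, seq_len, 3)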
Three-Level Processing Loop¶
The core prediction workflow uses a nested loop structure to efficiently process the genome while managing memory.
Loop Structure¶
┌─────────────────────────────────────────────────────────────────────┐
│ LEVEL 1: CHROMOSOME LOOP │
│ for chromosome in ['1', '2', ..., '22', 'X', 'Y']: │
│ • Load sequences for this chromosome only │
│ • Stream from disk (lazy loading) │
├─────────────────────────────────────────────────────────────────────┤
│ LEVEL 2: CHUNK LOOP (500 genes per chunk) │
│ for chunk in range(0, n_genes, 500): │
│ • Check if chunk already processed (checkpoint) │
│ • Materialize chunk from lazy frame │
│ • Save results to disk after processing │
├─────────────────────────────────────────────────────────────────────┤
│ LEVEL 3: MINI-BATCH LOOP (50 genes per batch) │
│ for mini_batch in range(0, chunk_size, 50): │
│ • Run SpliceAI prediction │
│ • Evaluate against known splice sites │
│ • Extract sequence contexts │
│ • Free memory immediately │
└─────────────────────────────────────────────────────────────────────┘
What Happens at Each Level¶
Level 1: Chromosome (Data Locality)¶
Purpose: Load only one chromosome's data at a time to avoid memory overflow.
for chr_ in chromosomes:
    # Load sequences for this chromosome only
    lazy_seq_df = scan_chromosome_sequence(chromosome=chr_)

    # Count genes on this chromosome
    n_genes = lazy_seq_df.select(pl.col("gene_id").n_unique()).collect().item()
Key Operations:
- Lazy loading via Polars LazyFrame
- Schema standardization
- Gene counting
Level 2: Chunk (Checkpointing)¶
Purpose: Enable resume capability and manage disk I/O.
chunk_size = 500  # genes per chunk

for chunk_start in range(0, n_genes, chunk_size):
    # Check if already processed
    if os.path.exists(chunk_artifact_file):
        continue  # Skip - already done

    # Materialize this chunk
    seq_chunk = lazy_seq_df.slice(chunk_start, chunk_size).collect()

    # ... process mini-batches ...

    # Save chunk results to disk
    data_handler.save_analysis_sequences(df_seq, chunk_start, chunk_end)
    data_handler.save_splice_positions(positions_df, chunk_start, chunk_end)
Key Operations:
- Checkpoint checking (resume support)
- Chunk materialization
- Result consolidation
- Disk I/O (save artifacts)
Level 3: Mini-Batch (Memory Optimization)¶
Purpose: Keep peak memory usage low by processing small batches.
MINI_BATCH_SIZE = 50  # genes per mini-batch

for mini_batch_idx in range(n_mini_batches):
    # Extract mini-batch
    seq_mini_batch = seq_chunk[start:end]

    # 1. PREDICT: Run SpliceAI models
    predictions = predict_splice_sites_for_genes(
        seq_mini_batch,
        models=models,
        context=10_000  # ±10kb context window
    )

    # 2. EVALUATE: Compare to known splice sites
    error_df, positions_df = enhanced_process_predictions_with_all_scores(
        predictions=predictions,
        ss_annotations_df=ss_annotations_df,
        add_derived_features=True  # Generate ~58 features
    )

    # 3. EXTRACT: Get sequence contexts
    df_seq = extract_analysis_sequences(
        seq_mini_batch,
        positions_df,
        window_size=250  # ±250bp around splice site
    )

    # 4. FREE MEMORY
    del seq_mini_batch, predictions
    gc.collect()
Key Operations:
- SpliceAI prediction (GPU-intensive)
- Evaluation against ground truth
- Sequence extraction
- Aggressive memory cleanup
Memory Management Strategy¶
The architecture employs several strategies to keep memory usage bounded:
Strategy 1: Streaming (Chromosome Level)¶
Only one chromosome's sequences are resident at a time: Level 1 scans the per-chromosome Parquet file and moves on once that chromosome's chunks are saved.
Strategy 2: Lazy Loading (Chunk Level)¶
Chunks are sliced from a Polars LazyFrame and only materialized with .collect() immediately before processing.
Strategy 3: Immediate Cleanup (Mini-Batch Level)¶
# After each mini-batch:
del seq_mini_batch, predictions_mini
del error_df_mini, positions_df_mini, df_seq_mini
gc.collect() # Force garbage collection
Memory Usage Pattern¶
Memory
  ▲
  │    ┌───┐     ┌───┐     ┌───┐
  │    │MB1│     │MB2│     │MB3│      Mini-batches
  │    │   │     │   │     │   │
  │ ───┴───┴─────┴───┴─────┴───┴───
  │    ↑   ↓     ↑   ↓     ↑   ↓
  │  load free  load free  load free
  │
  │  ══════════════════════════════
  │  │     Chunk accumulation     │
  │  ══════════════════════════════
  │                ↓
  │           Save & Free
  └──────────────────────────────────► Time
Configuring Memory Usage¶
| Parameter | Default | Effect |
|---|---|---|
| chunk_size | 500 | Larger = fewer disk writes, more memory |
| mini_batch_size | 50 | Larger = faster GPU utilization, more memory |
| MAX_GENES_FOR_PRELOAD | 1000 | Threshold for pre-loading vs. streaming |
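For example, a memory-constrained run might halve both defaults (illustrative values, traded against the effects in the table above; other required SpliceAIConfig parameters omitted):

from agentic_spliceai import SpliceAIConfig

config = SpliceAIConfig(
    chunk_size=250,      # More frequent disk writes, lower per-chunk memory
    mini_batch_size=25,  # Lower peak memory, at some cost in GPU throughput
)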
Checkpoint and Resume¶
The workflow automatically saves progress and can resume from interruptions.
How Checkpointing Works¶
- Before processing each chunk, check if the output file exists:

  if os.path.exists(chunk_artifact_file):
      continue  # Skip - already done

- After processing each chunk, save results:

  data_handler.save_analysis_sequences(df_seq, chunk_start, chunk_end)
  data_handler.save_splice_positions(positions_df, chunk_start, chunk_end)
Resume Behavior¶
If the workflow is interrupted (crash, timeout, manual stop):
- Restart the workflow with the same parameters
- Completed chunks are automatically skipped
- Processing resumes from the first incomplete chunk
Example Resume Scenario¶
First run (interrupted at chr3):
chr1: ✓ Complete (all chunks saved)
chr2: ✓ Complete (all chunks saved)
chr3: ✗ Interrupted at chunk 501-1000
Second run (automatic resume):
chr1: SKIP (all chunks exist)
chr2: SKIP (all chunks exist)
chr3: SKIP chunks 1-500 (exists)
      PROCESS chunks 501+ (resume here)
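To inspect where a run will resume, you can list the chunk artifacts that already exist (a sketch; the meta/ layout and file naming follow the Output Files section below):

import glob
import os

def completed_chunks(output_dir, chromosome):
    """Return (start, end) gene ranges of chunks already saved for a chromosome."""
    pattern = os.path.join(output_dir, "meta",
                           f"splice_positions_{chromosome}_chunk_*.tsv")
    done = []
    for path in glob.glob(pattern):
        # e.g. splice_positions_1_chunk_501_1000.tsv -> (501, 1000)
        start, end = os.path.basename(path)[:-len(".tsv")].split("_")[-2:]
        done.append((int(start), int(end)))
    return sorted(done)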
Output Files¶
The workflow generates several types of output files:
Per-Chunk Files (Intermediate)¶
output_dir/
├── meta/
│   ├── analysis_sequences_1_chunk_1_500.tsv
│   ├── analysis_sequences_1_chunk_501_1000.tsv
│   ├── error_analysis_1_chunk_1_500.tsv
│   ├── splice_positions_1_chunk_1_500.tsv
│   └── ...
Aggregated Files (Final)¶
output_dir/
├── splice_positions_enhanced_aggregated.tsv # All positions
├── error_analysis_aggregated.tsv # All error metrics
└── analysis_sequences_aggregated.tsv # All sequences
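The aggregated TSVs load directly for downstream analysis; for example, with Polars:

import polars as pl

positions_df = pl.read_csv("output_dir/splice_positions_enhanced_aggregated.tsv",
                           separator="\t")
errors_df = pl.read_csv("output_dir/error_analysis_aggregated.tsv", separator="\t")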
File Contents¶
| File | Contents | Use Case |
|---|---|---|
| splice_positions_*.tsv | Position-level features (~58 columns) | Meta-model training |
| error_analysis_*.tsv | Per-gene error metrics (TP, FP, FN) | Quality assessment |
| analysis_sequences_*.tsv | Nucleotide sequences around splice sites | Sequence analysis |
Configuration Options¶
SpliceAIConfig Parameters¶
from agentic_spliceai import SpliceAIConfig

config = SpliceAIConfig(
    # Core parameters
    threshold=0.5,                 # Detection threshold
    consensus_window=2,            # Splice site consensus window
    error_window=10,               # Error analysis window

    # Memory optimization
    mini_batch_size=50,            # Genes per mini-batch
    chunk_size=500,                # Genes per chunk

    # Output control
    save_nucleotide_scores=False,  # Save per-nucleotide scores (large!)

    # Paths
    local_dir="./data/grch38",     # Build-specific directory
    gtf_file="./annotations.gtf",
    fasta_file="./genome.fa"
)
Workflow Parameters¶
results = run_enhanced_splice_prediction_workflow(
    config=config,

    # Filtering
    target_genes=['BRCA1', 'BRCA2'],  # Specific genes (optional)
    target_chromosomes=['17', '13'],  # Specific chromosomes (optional)

    # Behavior
    verbosity=1,                # 0=quiet, 1=normal, 2=debug
    no_final_aggregate=False,   # If True, skip final aggregation
    no_tn_sampling=False,       # If True, keep all true negatives (no sampling)

    # Position ID format
    position_id_mode='genomic'  # 'genomic' or 'relative'
)
Summary¶
The Meta-SpliceAI base layer processing architecture provides:
| Feature | Benefit |
|---|---|
| Three-level nesting | Balances memory, speed, and fault tolerance |
| Lazy loading | Handles genome-scale data without memory overflow |
| Checkpointing | Resume from interruptions without data loss |
| Mini-batching | Efficient GPU utilization with bounded memory |
| Streaming output | Results available as processing progresses |
This architecture enables processing of the entire human genome (~20,000 genes) on machines with as little as 16GB RAM, while providing robust fault tolerance for long-running analyses.