Multimodal Feature Engineering¶
Goals served: cross-cutting (feeds all downstream prediction applications)
Tier: Active
Last updated: 2026-04
Problem¶
Splice prediction benefits from evidence beyond the DNA sequence: cross-species conservation, chromatin accessibility, histone marks, RNA-seq junction reads, RBP binding, and foundation-model embeddings. Each modality contributes orthogonal signal, but combining them requires a consistent schema, shared coordinate alignment, and memory-bounded streaming for genome-scale operation.
This application provides a YAML-configurable, 10-modality feature pipeline that produces a single aligned parquet per chromosome — consumable by any downstream predictor.
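The shared-coordinate requirement can be pictured as a keyed merge: every modality emits rows for the same positions, and the merge refuses to proceed when the key sets differ. The sketch below is illustrative only and uses plain dictionaries; the `(chrom, position, strand)` key layout, the function `merge_modalities`, and the column name `phylop` are assumptions, not the pipeline's actual schema.

```python
from typing import Dict, Tuple

Key = Tuple[str, int, str]            # (chrom, position, strand): assumed key layout
Table = Dict[Key, Dict[str, float]]   # one row of named features per key


def merge_modalities(tables: Dict[str, Table]) -> Table:
    """Merge per-modality tables into one row per key.

    Raises ValueError if modalities disagree on the key set
    or if two modalities emit the same column name.
    """
    if not tables:
        return {}
    names = list(tables)
    reference_keys = set(tables[names[0]])
    for name in names[1:]:
        if set(tables[name]) != reference_keys:
            raise ValueError(f"modality {name!r} is not aligned with {names[0]!r}")

    merged: Table = {key: {} for key in sorted(reference_keys)}
    for name in names:
        for key, row in tables[name].items():
            overlap = merged[key].keys() & row.keys()
            if overlap:
                raise ValueError(f"duplicate columns {sorted(overlap)} from {name!r}")
            merged[key].update(row)
    return merged


conservation = {
    ("chr22", 100, "+"): {"phylop": 1.2},
    ("chr22", 101, "+"): {"phylop": 0.3},
}
junction = {
    ("chr22", 100, "+"): {"junction_has_support": 1.0},
    ("chr22", 101, "+"): {"junction_has_support": 0.0},
}
merged = merge_modalities({"conservation": conservation, "junction": junction})
```

Failing loudly on a key mismatch, rather than silently outer-joining, keeps misaligned rows out of the per-chromosome output.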
User-facing functionality¶
- Generate per-position features from 10 modalities (116 columns total): base scores, annotation, sequence, genomic, conservation, epigenetic, junction, RBP eCLIP, chromatin accessibility, foundation-model embeddings (optional)
- Swap modality sets via YAML config profiles (default, full_stack, isoform_discovery, meta_m3_novel)
- Run at genome scale with memory monitoring, per-chromosome parquet outputs, and --augment for incremental modality addition
- Verify feature alignment and modality completeness post hoc
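To make the memory-monitoring item concrete, here is a minimal sketch of a per-chromosome loop that checks peak RSS after each chromosome and stops cleanly once a budget is exceeded. It uses only the standard library (`resource`, Unix only) and is not the project's `memory_monitor`; `run_per_chromosome`, `MemoryBudgetExceeded`, and the budget parameter are hypothetical names.

```python
import resource
import sys
from typing import Callable, List, Sequence


class MemoryBudgetExceeded(RuntimeError):
    """Raised between chromosomes when peak RSS passes the budget."""

    def __init__(self, completed: List[str], peak_gb: float) -> None:
        super().__init__(f"peak RSS {peak_gb:.2f} GB exceeded budget after {completed}")
        self.completed = completed


def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1024 ** 3


def run_per_chromosome(
    chroms: Sequence[str],
    process_one: Callable[[str], None],
    max_rss_gb: float,
) -> List[str]:
    """Process chromosomes one at a time, aborting between chromosomes if over budget."""
    completed: List[str] = []
    for chrom in chroms:
        process_one(chrom)  # expected to write this chromosome's output before returning
        completed.append(chrom)
        peak = peak_rss_gb()
        if peak > max_rss_gb:
            # Abort only at a chromosome boundary so finished outputs stay intact.
            raise MemoryBudgetExceeded(completed, peak)
    return completed


done = run_per_chromosome(["chr21", "chr22"], lambda chrom: None, max_rss_gb=1e6)
```

Checking at chromosome boundaries means an abort never leaves a half-written output, which is what makes a later resume safe.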
Driving examples¶
- examples/features/01_base_score_features.py: base scores (43 cols)
- examples/features/02_annotation_and_genomic.py: annotation + genomic features
- examples/features/03_configurable_modalities.py: YAML-driven modality selection
- examples/features/04_genome_scale_workflow.py: genome-scale feature generation
- examples/features/05_multimodal_exploration.py: exploratory multimodal analysis
- examples/features/06_multimodal_genome_workflow.py: full-stack genome workflow (canonical driver)
- examples/features/06a_ephemeral_genome_workflow.py: ephemeral (predict → feature → delete) variant for bounded disk
- examples/features/07_streaming_fm_scalars.py: foundation model scalar streaming
- examples/features/config_loader.py: YAML config loader
- examples/features/verify_feature_alignment.py: alignment verification
- examples/features/check_modality_completeness.py: completeness audit
Configuration profiles:
- examples/features/configs/ (4 YAML profiles)
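As a rough illustration of how a profile could select modalities, the sketch below resolves the enabled set from an already-parsed profile. The dictionary layout (`modalities` mapping to per-modality options with an `enabled` flag), the modality identifiers other than `fm_embeddings`, and the function `enabled_modalities` are all assumptions; the real YAML schema may differ.

```python
from typing import Any, Dict, List

# Hypothetical identifiers for the 10 modalities named in this document.
KNOWN_MODALITIES = {
    "base_scores", "annotation", "sequence", "genomic", "conservation",
    "epigenetic", "junction", "rbp_eclip", "chrom_access", "fm_embeddings",
}


def enabled_modalities(profile: Dict[str, Any]) -> List[str]:
    """Return enabled modality names in the order the profile lists them."""
    modalities = profile.get("modalities", {})
    unknown = set(modalities) - KNOWN_MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities in profile: {sorted(unknown)}")
    # A modality with no options block counts as enabled by default.
    return [
        name for name, options in modalities.items()
        if (options or {}).get("enabled", True)
    ]


# Stand-in for the result of parsing one YAML profile.
example_profile = {
    "name": "default",
    "modalities": {
        "base_scores": None,
        "conservation": {"enabled": True},
        "fm_embeddings": {"enabled": False},
    },
}
selected = enabled_modalities(example_profile)
```

Rejecting unknown names up front catches typos in a profile before any genome-scale work starts.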
Per-modality tutorials:
- examples/features/docs/ (modality-specific guides: junction reads, RBP eCLIP, conservation, etc.)
src/ surface¶
Library (stable):
- agentic_spliceai.splice_engine.features.FeaturePipeline: modality auto-registration, protocol-based dispatch
- agentic_spliceai.splice_engine.features.FeatureWorkflow: genome-scale orchestration
- agentic_spliceai.splice_engine.features.modalities.*: 10 modality implementations
- agentic_spliceai.splice_engine.features.verification: position alignment checks
- agentic_spliceai.splice_engine.features.dense_feature_extractor: memory-bounded extraction
- agentic_spliceai.splice_engine.utils.memory_monitor: RSS monitoring, graceful abort
Application package (src/agentic_spliceai/applications/multimodal_features/):
- profiles.py: profile catalog (reads examples/features/configs/*.yaml)
- tracks.py: external-track catalog (conservation, ENCODE epigenetic) + fetch_conservation_tracks()
- manifest.py: versioned FeatureManifest (profile + inputs + per-chrom artifacts + hashes + tracks used)
- pipeline.py: prepare_features() orchestrator + resolve_canonical_features_dir() helper
- status.py: FeaturePrepStatus readiness query (per-chromosome)
- steps.py: thin wrappers over FeatureWorkflow + conservation fetch + validate
- cli.py: unified subcommand CLI
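To illustrate what a versioned manifest with per-chromosome artifacts and hashes might contain, here is a small standard-library sketch. It is not the actual `FeatureManifest`; the class `ManifestSketch`, its field names, and the `schema_version` value are assumptions chosen for the example.

```python
import hashlib
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class ManifestSketch:
    schema_version: str
    profile: str
    build: str
    tracks: List[str] = field(default_factory=list)
    artifacts: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_artifact(self, chrom: str, path: Path) -> None:
        """Record one per-chromosome output together with its content hash."""
        self.artifacts[chrom] = {"path": str(path), "sha256": sha256_of(path)}

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "chr22.parquet"
    artifact.write_bytes(b"placeholder")  # stand-in bytes, not a real parquet file
    manifest = ManifestSketch(
        schema_version="1",
        profile="full_stack",
        build="GRCh38",
        tracks=["phylop", "phastcons"],
    )
    manifest.add_artifact("22", artifact)
    manifest_json = manifest.to_json()
```

Storing a content hash per artifact lets a later readiness check detect files that were modified or truncated after the manifest was written.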
CLI entry point: agentic-spliceai-features
agentic-spliceai-features list-profiles
agentic-spliceai-features list-tracks --build GRCh38
agentic-spliceai-features fetch-tracks --build GRCh38 # downloads PhyloP, PhastCons
agentic-spliceai-features status --canonical --build GRCh38 # read-only production check
agentic-spliceai-features status --output-dir output/features/my_run # throwaway dir check
agentic-spliceai-features prepare --profile full_stack --build GRCh38 \
--chromosomes 22 \
--input-dir data/mane/GRCh38/openspliceai_eval/precomputed \
--output-dir output/features/chr22_full_stack
agentic-spliceai-features validate --output-dir data/mane/GRCh38/openspliceai_eval/analysis_sequences
Production-safety guards (same pattern as data_preparation):
- prepare requires either --output-dir (throwaway) or --inplace (canonical)
- Library-level resume=True is the default → per-chromosome parquets are preserved; --no-resume to override
- status, list-profiles, list-tracks, and validate are read-only
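The guard pattern above can be sketched with `argparse`: the write target must be stated explicitly, and resume is on unless the caller opts out. This is an illustration rather than the project's `cli.py`; treating the two target flags as mutually exclusive, the `chr{N}.parquet` file naming, and the helper `chromosomes_to_run` are assumptions.

```python
import argparse
import tempfile
from pathlib import Path
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="prepare-sketch")
    # Exactly one write target must be given, so nothing lands in the
    # canonical directory by accident.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--output-dir", type=Path, help="throwaway directory")
    target.add_argument("--inplace", action="store_true", help="write to the canonical directory")
    # store_false makes resume default to True; --no-resume turns it off.
    parser.add_argument("--no-resume", dest="resume", action="store_false",
                        help="recompute chromosomes even if their output exists")
    parser.add_argument("--chromosomes", nargs="+", required=True)
    return parser


def chromosomes_to_run(chroms: Sequence[str], out_dir: Path, resume: bool) -> List[str]:
    """With resume on, skip chromosomes whose output file already exists."""
    if not resume:
        return list(chroms)
    return [c for c in chroms if not (out_dir / f"chr{c}.parquet").is_file()]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chr21.parquet").touch()  # pretend chr21 finished in an earlier run
    args = build_parser().parse_args(["--output-dir", tmp, "--chromosomes", "21", "22"])
    pending = chromosomes_to_run(args.chromosomes, args.output_dir, args.resume)
    pending_no_resume = chromosomes_to_run(args.chromosomes, args.output_dir, resume=False)
```

Making the target group required means a bare invocation fails at argument parsing, before any file is touched.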
External-track fetching:
- fetch-tracks --modality conservation --build GRCh38: downloads PhyloP + PhastCons bigWigs from UCSC to the configured cache
- fetch-tracks --modality epigenetic --build GRCh38 [--cell-lines K562 HepG2] [--marks h3k4me3]: downloads ENCODE ChIP-seq fold-change bigWigs by accession. URL pattern: https://www.encodeproject.org/files/<ENCFF...>/@@download/<ENCFF...>.bigWig. Filters by cell line / mark when supplied; otherwise fetches the full configured panel.
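The URL pattern and the optional filters can be sketched in a few lines. The accessions below are deliberately invalid placeholders rather than real ENCODE files, and `EpigeneticTrack`, `download_url`, and `select_tracks` are hypothetical names; only the URL template comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

ENCODE_DOWNLOAD = "https://www.encodeproject.org/files/{acc}/@@download/{acc}.bigWig"


@dataclass(frozen=True)
class EpigeneticTrack:
    accession: str  # ENCODE file accession (ENCFF...)
    cell_line: str
    mark: str


def download_url(accession: str) -> str:
    """Fill the ENCODE download template with one file accession."""
    if not accession.startswith("ENCFF"):
        raise ValueError(f"not an ENCODE file accession: {accession!r}")
    return ENCODE_DOWNLOAD.format(acc=accession)


def select_tracks(
    panel: Sequence[EpigeneticTrack],
    cell_lines: Optional[Sequence[str]] = None,
    marks: Optional[Sequence[str]] = None,
) -> List[EpigeneticTrack]:
    """Apply optional filters; with no filters the full panel is returned."""
    wanted_cells = {c.lower() for c in cell_lines} if cell_lines else None
    wanted_marks = {m.lower() for m in marks} if marks else None
    return [
        track for track in panel
        if (wanted_cells is None or track.cell_line.lower() in wanted_cells)
        and (wanted_marks is None or track.mark.lower() in wanted_marks)
    ]


# Placeholder panel: these accessions are NOT real ENCODE files.
PANEL = [
    EpigeneticTrack("ENCFF_PLACEHOLDER_1", "K562", "H3K4me3"),
    EpigeneticTrack("ENCFF_PLACEHOLDER_2", "HepG2", "H3K4me3"),
    EpigeneticTrack("ENCFF_PLACEHOLDER_3", "K562", "H3K27ac"),
]
selected = select_tracks(PANEL, cell_lines=["K562"], marks=["h3k4me3"])
urls = [download_url(track.accession) for track in selected]
```

Matching is case-insensitive here so that a lowercase CLI value such as h3k4me3 still selects a panel entry written as H3K4me3.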
Optional experiment tracking (silent fallback when wandb is not installed or no API key is set):
agentic-spliceai-features prepare --profile full_stack --build GRCh38 \
--input-dir data/mane/GRCh38/openspliceai_eval/precomputed \
--output-dir output/features/chr22_full_stack \
--chromosomes 22 \
--track --tracking-project agentic-spliceai-multimodal-features
When tracking is enabled, the run logs per-step durations, row counts, and success flags, and uploads the resulting feature_manifest.json as a W&B artifact. The tracking helper is shared with base_layer and data_preparation via applications._common.tracking.
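The silent-fallback behavior can be approximated with a no-op tracker that stands in whenever W&B is unavailable. This is a sketch, not the contents of `applications._common.tracking`: `NullTracker`, `WandbTracker`, and `make_tracker` are hypothetical names, and checking only the `WANDB_API_KEY` environment variable is a simplifying assumption (W&B can also read credentials from other places).

```python
import os
from typing import Any, Dict


class NullTracker:
    """No-op stand-in used when tracking is off or W&B is unavailable."""

    def log(self, metrics: Dict[str, Any]) -> None:
        pass

    def log_artifact(self, path: str, name: str) -> None:
        pass

    def finish(self) -> None:
        pass


class WandbTracker:
    """Thin wrapper over a W&B run."""

    def __init__(self, wandb_module: Any, project: str) -> None:
        self._wandb = wandb_module
        self._run = wandb_module.init(project=project)

    def log(self, metrics: Dict[str, Any]) -> None:
        self._run.log(metrics)

    def log_artifact(self, path: str, name: str) -> None:
        artifact = self._wandb.Artifact(name, type="manifest")
        artifact.add_file(path)
        self._run.log_artifact(artifact)

    def finish(self) -> None:
        self._run.finish()


def make_tracker(enabled: bool, project: str):
    """Return a real tracker only when tracking is requested and usable."""
    if not enabled or not os.environ.get("WANDB_API_KEY"):
        return NullTracker()
    try:
        import wandb  # optional dependency
    except ImportError:
        return NullTracker()
    return WandbTracker(wandb, project)


tracker = make_tracker(enabled=False, project="agentic-spliceai-multimodal-features")
tracker.log({"step": "conservation", "duration_s": 12.5, "rows": 96467, "success": True})
tracker.finish()
```

Because both trackers expose the same three methods, pipeline code can call them unconditionally and never needs to branch on whether tracking is active.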
Evaluation¶
- Scale verified: 24/24 chromosomes complete (2.88 GB total, 116 cols each)
- Location: data/mane/GRCh38/openspliceai_eval/analysis_sequences/
- Alignment: position-level verification via verification.py
- Downstream impact: junction_has_support = #2 feature by SHAP (31.3%), FN reduction -60/-70% (donor/acceptor) on M1-P XGBoost
- Feature catalog: docs/multimodal_feature_engineering/feature_catalog.md
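A position-level alignment check can be as simple as the invariants sketched below: key columns present, every column the same length, no duplicate coordinates, and positions strictly increasing within a chromosome. This is a generic illustration, not the logic in verification.py; the key column names `chrom` and `position` and the function `check_alignment` are assumptions.

```python
from typing import Dict, List, Sequence


def check_alignment(table: Dict[str, Sequence]) -> List[str]:
    """Return a list of alignment problems for a column-oriented table (empty means OK)."""
    missing = [name for name in ("chrom", "position") if name not in table]
    if missing:
        return [f"missing key columns: {missing}"]

    problems: List[str] = []
    n_rows = len(table["position"])
    for name, values in table.items():
        if len(values) != n_rows:
            problems.append(f"column {name!r} has {len(values)} rows, expected {n_rows}")

    keys = list(zip(table["chrom"], table["position"]))
    if len(set(keys)) != len(keys):
        problems.append("duplicate (chrom, position) rows")
    for (chrom_a, pos_a), (chrom_b, pos_b) in zip(keys, keys[1:]):
        if chrom_a == chrom_b and pos_b <= pos_a:
            problems.append("positions are not strictly increasing within a chromosome")
            break
    return problems


good = {
    "chrom": ["chr22", "chr22", "chr22"],
    "position": [100, 101, 105],
    "junction_has_support": [1.0, 0.0, 1.0],
}
bad = {
    "chrom": ["chr22", "chr22", "chr22"],
    "position": [100, 105, 101],
    "junction_has_support": [1.0, 0.0],
}
good_report = check_alignment(good)
bad_report = check_alignment(bad)
```

Returning a list of problems instead of raising on the first one lets a post hoc audit report every violation in a single pass.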
Maturity tier and signals¶
Current tier: Active (moving toward Mature)
Signals supporting the tier:
- 10 modalities (9 active + 1 fm_embeddings commented out) with 116 columns
- Full genome completed (24/24 chromosomes)
- 4 YAML profiles covering distinct modeling objectives
- Depended on by Adaptive Splice Prediction and Variant Effect Analysis
- Memory-bounded streaming verified on 16 GB MacBook (peak 2.51 GB chr1 full-stack)
- Per-modality tutorials in examples/features/docs/
- Packaged application at src/agentic_spliceai/applications/multimodal_features/ with versioned manifest, readiness API (per-chromosome), external-track catalog, and dedicated CLI (agentic-spliceai-features)
- Production-path completeness check verified: status --canonical --build GRCh38 correctly enumerates all 24 existing per-chromosome parquets read-only (chr22 validated: 96,467 rows, alignment columns present)
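A read-only, per-chromosome readiness query of the kind described in the last signal could look like the sketch below. It only tests for file existence and never writes. It is not `FeaturePrepStatus`; the chromosome set (1 to 22 plus X and Y), the `chr{chrom}.parquet` naming pattern, and the functions `readiness` and `summarize` are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence

# Assumed set of 24 chromosomes: autosomes 1..22 plus X and Y.
CHROMOSOMES: List[str] = [str(i) for i in range(1, 23)] + ["X", "Y"]


def readiness(
    features_dir: Path,
    chromosomes: Sequence[str] = CHROMOSOMES,
    pattern: str = "chr{chrom}.parquet",
) -> Dict[str, bool]:
    """Map each chromosome to whether its output file exists (no writes)."""
    return {
        chrom: (features_dir / pattern.format(chrom=chrom)).is_file()
        for chrom in chromosomes
    }


def summarize(status: Dict[str, bool]) -> str:
    """One-line summary of ready and missing chromosomes."""
    ready = [chrom for chrom, ok in status.items() if ok]
    missing = [chrom for chrom, ok in status.items() if not ok]
    missing_text = ", ".join(missing) if missing else "none"
    return f"{len(ready)}/{len(status)} chromosomes ready; missing: {missing_text}"


with tempfile.TemporaryDirectory() as tmp:
    for chrom in ("21", "22"):
        (Path(tmp) / f"chr{chrom}.parquet").touch()  # empty stand-in files
    status = readiness(Path(tmp))
    summary = summarize(status)
```

A fuller check would also open each file to confirm row counts and key columns, as the chr22 validation above does; existence alone only shows that a run wrote something.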
Graduation signals¶
To advance to Mature, the application needs:
- FM embeddings modality wired back in with Evo2 full-genome extraction
- Versioned feature schema (116 columns is a frozen surface)
- Test coverage for modality registration and alignment invariants
- ENCODE-accession-driven downloads for epigenetic tracks (currently remote_fallback at run-time only; the fetch-tracks CLI currently covers conservation only)
- Full meta-layer pre-flight integration (parallel to the base-layer pre-flight)
Known limitations¶
- Conservation bigWig streaming dominates runtime (~2 hrs for 562K positions)
- pyBigWig connection timeouts on laptop standby (workaround: kill the process and rerun with --resume)
- Foundation-model embedding modality disabled pending Evo2 extraction completion
- Junction modality coverage uneven across tissues (see junction coverage audit)
- RBP eCLIP data limited to ENCODE set — new RBPs require peak re-aggregation
Related¶
- Adaptive Splice Prediction — primary consumer
- Variant Effect Analysis — locus-level features used at ref and alt
- Canonical Splice Prediction — base-score modality source
- Foundation Model Predictors — FM embedding modality source (experimental)
- Feature Catalog
- Roadmap: Phase 4