Output & Artifact Management System Design¶

Purpose: Systematic organization of experimental outputs, model artifacts, and predictions
Component: splice_engine/resources/artifacts/
Status: 📝 Design complete, implementation in progress

Problem Statement¶

What We're Solving¶

Research and ML projects generate many artifacts: - Base layer outputs: Splice site predictions, analysis sequences, error metrics - Meta layer outputs: Meta-model predictions, training checkpoints - Model artifacts: Trained weights, architectures, hyperparameters - Experimental results: Metrics, plots, logs, evaluation results - Training artifacts: Checkpoints, validation curves, intermediate states

Challenges Without Output Management¶

❌ Output Chaos:

# Scattered outputs
model.save("final_model.pkl")  # Where did this go?
results.to_csv("results.csv")  # Which results?
plt.savefig("plot.png")        # Which experiment?

❌ Non-Reproducible Experiments: - "Where are the results from experiment X?" - "Which model weights correspond to these metrics?" - "What hyperparameters produced this plot?"

❌ Overwrite Disasters:

# Production artifacts overwritten by mistake
predictions.save("predictions.tsv")  # Oops, overwrote last week's run!

❌ No Experiment Tracking: - Can't compare experiments - Can't reproduce results - Can't find "that good model from last month"

Design Goals¶

1. Systematic Organization¶

Clear, predictable structure for all outputs:

data/ensembl/GRCh38/spliceai_eval/
├── base_layer/                    # Base model predictions
│   ├── analysis_sequences_chr*.tsv
│   ├── splice_errors_chr*.tsv
│   ├── nucleotide_scores_chr*.parquet
│   └── gene_manifest.tsv
└── meta_layer/                    # Meta model outputs
    ├── checkpoints/
    │   ├── epoch_001.pkl
    │   ├── epoch_002.pkl
    │   └── latest.pkl
    ├── predictions/
    │   └── meta_predictions.tsv
    └── metrics/
        ├── training_metrics.json
        └── evaluation_metrics.json

2. Reproducibility¶

Track what, when, how: - What: Artifact type, content, format - When: Timestamp, experiment ID - How: Configuration, code version, parameters

3. Mode-Based Isolation¶

Different modes for different use cases:

Mode	Purpose	Overwrite?	Location
Production	Immutable results	❌ No	`base_layer/`
Development	Iterative work	✅ Yes	`base_layer/dev/timestamp/`
Test	Isolated testing	✅ Yes	`base_layer/tests/test_name/`

4. Type-Safe Artifacts¶

Enum-based artifact types prevent typos:

class ArtifactType(Enum):
    ANALYSIS_SEQUENCES = "analysis_sequences"
    SPLICE_ERRORS = "splice_errors"
    NUCLEOTIDE_SCORES = "nucleotide_scores"

5. Format Consistency¶

Standardized formats: - Tabular data: TSV for readability, Parquet for performance - Models: Pickle (Python) or framework-specific (PyTorch, TF) - Metrics: JSON for structure, human-readable - Plots: PNG (high DPI) or PDF (vector)

Architecture¶

Component Structure¶

splice_engine/resources/
├── artifacts/
│   ├── __init__.py          # Public API
│   ├── manager.py           # ArtifactManager class
│   ├── types.py             # ArtifactType enum
│   └── paths.py             # Path utilities
└── genomic/
    └── (resource management)

Key Components¶

1. ArtifactType Enum¶

class ArtifactType(Enum):
    """Types of artifacts with standard names."""

    # Base layer outputs
    ANALYSIS_SEQUENCES = "analysis_sequences"
    SPLICE_ERRORS = "splice_errors"
    NUCLEOTIDE_SCORES = "nucleotide_scores"
    GENE_MANIFEST = "gene_manifest"

    # Meta layer outputs
    META_PREDICTIONS = "meta_predictions"
    META_CHECKPOINTS = "meta_checkpoints"
    META_METRICS = "meta_metrics"

    # Training artifacts
    TRAINING_DATA = "training_data"
    VALIDATION_DATA = "validation_data"

Why enum: Prevents typos, IDE autocomplete, explicit inventory

2. ArtifactManager Class¶

class ArtifactManager:
    """
    Unified artifact management for all pipeline stages.

    Responsibilities:
    - Path resolution for artifacts
    - Save/load with format handling
    - Overwrite policy enforcement
    - Mode-based directory isolation

    Supports:
    - Production mode (immutable)
    - Development mode (timestamped)
    - Test mode (isolated)
    """

    def __init__(
        self,
        base_model: str,                    # e.g., "spliceai"
        mode: str = 'production',           # 'production', 'development', 'test'
        test_name: Optional[str] = None     # Required if mode='test'
    ):
        self.base_model = base_model
        self.mode = mode
        self.test_name = test_name
        self._setup_paths()

3. Path Resolution¶

def get_artifact_path(
    self,
    artifact_type: ArtifactType,
    chromosome: Optional[str] = None,
    layer: str = 'base'
) -> Path:
    """
    Get path for specific artifact.

    Examples:
        >>> am.get_artifact_path(ArtifactType.ANALYSIS_SEQUENCES, "chr1")
        Path("data/ensembl/GRCh38/spliceai_eval/base_layer/analysis_sequences_chr1.tsv")

        >>> am.get_artifact_path(ArtifactType.META_CHECKPOINTS, layer="meta")
        Path("data/ensembl/GRCh38/spliceai_eval/meta_layer/checkpoints/meta_checkpoints.pkl")
    """
    base_dir = self.base_layer_dir if layer == 'base' else self.meta_layer_dir

    # Build filename
    filename = artifact_type.value
    if chromosome:
        filename += f"_chr{chromosome}"
    filename += self._get_extension(artifact_type)

    return base_dir / filename

4. Save with Overwrite Policy¶

def save_artifact(
    self,
    data: pl.DataFrame,
    artifact_type: ArtifactType,
    chromosome: Optional[str] = None,
    layer: str = 'base'
) -> Path:
    """
    Save artifact with overwrite policy enforcement.

    Raises:
        FileExistsError: If production artifact exists
    """
    path = self.get_artifact_path(artifact_type, chromosome, layer)

    # Check overwrite policy
    if path.exists() and not self.should_overwrite(artifact_type):
        raise FileExistsError(
            f"Artifact exists and overwrite not allowed: {path}\n"
            f"Mode: {self.mode} (production artifacts are immutable)"
        )

    # Save based on format
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == '.tsv':
        data.write_csv(path, separator='\t')
    else:
        data.write_parquet(path)

    return path

5. Load with Validation¶

def load_artifact(
    self,
    artifact_type: ArtifactType,
    chromosome: Optional[str] = None,
    layer: str = 'base'
) -> pl.DataFrame:
    """
    Load artifact with existence checking.

    Raises:
        FileNotFoundError: If artifact doesn't exist
    """
    path = self.get_artifact_path(artifact_type, chromosome, layer)

    if not path.exists():
        raise FileNotFoundError(
            f"Artifact not found: {path}\n"
            f"Expected in {layer} layer for {self.base_model}"
        )

    if path.suffix == '.tsv':
        return pl.read_csv(path, separator='\t')
    else:
        return pl.read_parquet(path)

Mode-Based Isolation¶

Production Mode¶

Purpose: Immutable, authoritative results

Behavior: - ❌ Cannot overwrite existing artifacts - ✅ Standard directory structure - ✅ Used for published results, deployed models

Location:

data/ensembl/GRCh38/spliceai_eval/base_layer/

Example:

am = ArtifactManager("spliceai", mode="production")
am.save_artifact(predictions, ArtifactType.ANALYSIS_SEQUENCES, "chr1")
# Saves to: data/.../base_layer/analysis_sequences_chr1.tsv

# Second save fails
am.save_artifact(predictions2, ArtifactType.ANALYSIS_SEQUENCES, "chr1")
# Raises: FileExistsError (production artifacts immutable)

Development Mode¶

Purpose: Iterative experimentation

Behavior: - ✅ Can overwrite within same session - ✅ Timestamped directories prevent cross-session conflicts - ✅ Easy to compare different runs

Location:

data/ensembl/GRCh38/spliceai_eval/base_layer/dev/20260215_143022/
data/ensembl/GRCh38/spliceai_eval/base_layer/dev/20260215_151035/

Example:

am = ArtifactManager("spliceai", mode="development")
am.save_artifact(predictions, ArtifactType.ANALYSIS_SEQUENCES, "chr1")
# Saves to: data/.../dev/20260215_143022/analysis_sequences_chr1.tsv

# Can overwrite within this session
am.save_artifact(predictions2, ArtifactType.ANALYSIS_SEQUENCES, "chr1")
# Overwrites same file (same timestamp directory)

Test Mode¶

Purpose: Isolated unit/integration testing

Behavior: - ✅ Isolated from production/dev - ✅ Named test directories - ✅ Easy to clean up

Location:

data/ensembl/GRCh38/spliceai_eval/base_layer/tests/test_prediction/
data/ensembl/GRCh38/spliceai_eval/base_layer/tests/test_integration/

Example:

am = ArtifactManager("spliceai", mode="test", test_name="test_prediction")
am.save_artifact(predictions, ArtifactType.ANALYSIS_SEQUENCES, "chr1")
# Saves to: data/.../tests/test_prediction/analysis_sequences_chr1.tsv

# Tests can clean up easily
shutil.rmtree(am.base_layer_dir)

Format Standards¶

TSV (Tab-Separated Values)¶

Use for: Human-readable, line-oriented data

Artifacts: - Analysis sequences - Splice errors - Gene manifests

Advantages: - Human-readable - Line-oriented (easy diffs) - Standard tool support (grep, awk, etc.)

Example:

gene_id chromosome  strand  start   end transcript_count
ENSG00000000003 chrX    -   100627108   100639991   3
ENSG00000000005 chrX    +   100584936   100599885   2

Parquet¶

Use for: Large, high-performance data

Artifacts: - Nucleotide scores (millions of rows) - Training data (large features)

Advantages: - Fast read/write - Columnar compression - Schema preservation - Efficient filtering

Example:

# Efficient reading
scores = pl.scan_parquet("nucleotide_scores_chr1.parquet") \
    .filter(pl.col("position") > 1000000) \
    .collect()

JSON¶

Use for: Structured metadata, metrics

Artifacts: - Training metrics - Evaluation results - Experiment configuration

Advantages: - Human-readable - Nested structure - Language-agnostic - Schema-less flexibility

Example:

{
  "experiment_id": "exp_20260215_143022",
  "model": "meta_spliceai_v1",
  "metrics": {
    "train_loss": 0.023,
    "val_loss": 0.031,
    "val_accuracy": 0.947
  },
  "config": {
    "learning_rate": 0.001,
    "batch_size": 32
  }
}

Pickle¶

Use for: Python objects, model weights

Artifacts: - Model checkpoints - Complex Python objects - Trained models

Advantages: - Preserves Python objects - Fast serialize/deserialize - Handles complex types

⚠️ Warning: Not portable across Python versions, security risk for untrusted data

Usage Patterns¶

Basic Save/Load¶

from agentic_spliceai.splice_engine.resources.artifacts import (
    get_artifact_manager,
    ArtifactType
)

# Get manager
am = get_artifact_manager("spliceai", mode="production")

# Save artifact
predictions = pl.DataFrame(...)
path = am.save_artifact(
    predictions,
    ArtifactType.ANALYSIS_SEQUENCES,
    chromosome="chr1"
)
print(f"Saved to: {path}")

# Load artifact
loaded = am.load_artifact(
    ArtifactType.ANALYSIS_SEQUENCES,
    chromosome="chr1"
)

Per-Chromosome Processing¶

# Save per-chromosome artifacts
for chrom in ["chr1", "chr2", ..., "chrX"]:
    predictions = process_chromosome(chrom)
    am.save_artifact(
        predictions,
        ArtifactType.ANALYSIS_SEQUENCES,
        chromosome=chrom
    )

# Load specific chromosome
chr1_data = am.load_artifact(
    ArtifactType.ANALYSIS_SEQUENCES,
    chromosome="chr1"
)

Meta Layer Artifacts¶

# Save training checkpoint
checkpoint = {
    "epoch": 42,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "metrics": {"val_loss": 0.031}
}
am.save_artifact(
    checkpoint,
    ArtifactType.META_CHECKPOINTS,
    layer="meta"
)

# Save metrics
metrics = {"train_loss": 0.023, "val_accuracy": 0.947}
am.save_artifact(
    metrics,
    ArtifactType.META_METRICS,
    layer="meta"
)

Development Workflow¶

# Development: iterate freely
am_dev = get_artifact_manager("spliceai", mode="development")

for version in range(10):
    predictions = experiment(version)
    am_dev.save_artifact(predictions, ArtifactType.ANALYSIS_SEQUENCES)
    # Each save overwrites in the timestamped directory

# Promote to production when satisfied
am_prod = get_artifact_manager("spliceai", mode="production")
final_predictions = best_experiment()
am_prod.save_artifact(final_predictions, ArtifactType.ANALYSIS_SEQUENCES)

Integration with Experiments¶

OutputManager Pattern¶

For complex experiments, extend ArtifactManager:

class OutputManager(ArtifactManager):
    """Enhanced artifact manager with experiment tracking."""

    def __init__(self, experiment_name: str, base_model: str):
        super().__init__(base_model, mode="development")
        self.experiment_name = experiment_name
        self.created_at = datetime.now()

    def save_experiment_metadata(self, config: dict, metrics: dict):
        """Save experiment metadata."""
        metadata = {
            "experiment_name": self.experiment_name,
            "created_at": self.created_at.isoformat(),
            "config": config,
            "metrics": metrics
        }
        metadata_path = self.base_layer_dir / "metadata.json"
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)

    def save_plot(self, fig, name: str):
        """Save matplotlib figure."""
        plots_dir = self.base_layer_dir / "plots"
        plots_dir.mkdir(exist_ok=True)
        fig.savefig(plots_dir / f"{name}.png", dpi=300, bbox_inches='tight')

Usage:

om = OutputManager("splice_analysis_v3", "spliceai")

# Run experiment
results = run_experiment(config)

# Save everything
om.save_artifact(results, ArtifactType.ANALYSIS_SEQUENCES)
om.save_experiment_metadata(config, metrics)
om.save_plot(confusion_matrix_fig, "confusion_matrix")

Best Practices¶

DO ✅¶

Use artifact types:

am.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)  # Type-safe

Check mode before production saves:

if am.mode == "production":
    # Extra validation before immutable save
    validate_predictions(data)
am.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)

Add metadata:

metadata = {
    "created_at": datetime.now().isoformat(),
    "config": config_dict,
    "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"])
}
# Save alongside artifact

Use appropriate formats:

# Small, human-readable → TSV
gene_manifest.write_csv("gene_manifest.tsv", separator='\t')

# Large, performance-critical → Parquet
nucleotide_scores.write_parquet("scores.parquet")

DON'T ❌¶

Don't hardcode artifact paths:

# BAD
predictions.save("data/spliceai_eval/predictions.tsv")

# GOOD
am.save_artifact(predictions, ArtifactType.ANALYSIS_SEQUENCES)

Don't use arbitrary names:

# BAD
am.save_artifact(data, "my_predictions")  # Not a valid ArtifactType

# GOOD
am.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)

Don't mix modes accidentally:

# BAD - Easy to overwrite production
am = ArtifactManager("spliceai", mode="production")  # Think you're in dev!
am.save_artifact(data, ...)  # Fails if exists (good!)

Testing Strategy¶

Unit Tests¶

def test_save_load_artifact():
    """Should save and load artifacts correctly."""
    am = ArtifactManager("spliceai", mode="test", test_name="test_save_load")

    data = pl.DataFrame({"col1": [1, 2, 3]})
    path = am.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)

    loaded = am.load_artifact(ArtifactType.ANALYSIS_SEQUENCES)
    assert loaded.equals(data)

    # Cleanup
    shutil.rmtree(am.base_layer_dir)

def test_overwrite_policy():
    """Production should not overwrite, development should."""
    # Production: no overwrite
    am_prod = ArtifactManager("spliceai", mode="production")
    data = pl.DataFrame({"col1": [1]})
    am_prod.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)

    with pytest.raises(FileExistsError):
        am_prod.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)

    # Development: can overwrite
    am_dev = ArtifactManager("spliceai", mode="development")
    am_dev.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)
    am_dev.save_artifact(data, ArtifactType.ANALYSIS_SEQUENCES)  # OK

Future Extensions¶

Version Control Integration¶

class VersionedArtifactManager(ArtifactManager):
    """Artifact manager with automatic versioning."""

    def save_artifact(self, data, artifact_type, ...):
        # Add git commit to metadata
        git_commit = get_git_commit()
        metadata = {"git_commit": git_commit, ...}
        # Save metadata alongside artifact

Cloud Storage Support¶

class CloudArtifactManager(ArtifactManager):
    """Artifact manager with cloud storage backend."""

    def save_artifact(self, data, artifact_type, ...):
        # Save locally
        local_path = super().save_artifact(...)
        # Upload to S3/GCS
        self.upload_to_cloud(local_path)

Artifact Catalog¶

class CatalogedArtifactManager(ArtifactManager):
    """Maintains searchable catalog of artifacts."""

    def save_artifact(self, data, artifact_type, ...):
        path = super().save_artifact(...)
        # Register in catalog
        self.catalog.register(artifact_type, path, metadata)

Summary¶

Key Benefits¶

✅ Organization: Predictable structure for all outputs
✅ Reproducibility: Track what, when, how
✅ Safety: Immutable production artifacts
✅ Flexibility: Development mode for iteration
✅ Isolation: Test mode for testing
✅ Type Safety: Enum-based artifact types
✅ Format Consistency: Standard formats for each type

Design Principles Applied¶

✅ Single Source of Truth: ArtifactManager for all artifacts
✅ Separation of Concerns: Artifact management isolated
✅ Fail Fast: Overwrite policy enforced
✅ Explicit Over Implicit: Typed artifact types
✅ Testability: Mode-based isolation for tests

Related: Resource Management, Configuration System
Implementation: src/agentic_spliceai/splice_engine/config/
Last Updated: February 15, 2026