# Remote Training Options: Modal, SkyPilot, and RunPods

**Status:** Reference
**Audience:** Developers needing GPU compute beyond their local machine
**Last Updated:** March 2026
## Problem
Training genomic foundation models (Evo2 7B/40B) and their downstream classifiers often requires dedicated GPU compute beyond what's available on a typical development machine. The manual RunPods workflow works but involves SSH configuration, pod environment setup, GitHub SSH keys, and rsync-based data/code transfer for every new pod instance.
**Goal:** Minimize the friction between "code works locally on small data" and "train on a real GPU" while keeping costs predictable.
## Development Principle: Local-First
Always develop and validate the full training workflow locally before offloading to remote GPU:
- Build locally (small/synthetic data, MPS/CPU) until the pipeline runs end-to-end
- Test locally with a real data subset to confirm correctness
- Only then dispatch to remote GPU for production-scale training
Remote compute should be used only when the training pipeline is proven to work. This minimizes wasted GPU hours spent debugging serialization issues, data loading bugs, or training loop errors over SSH.
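As a concrete illustration of this pattern, a small device-selection helper (a sketch, not part of this repo) lets the same training script fall back from CUDA to Apple MPS to CPU, so the code validated locally is the code dispatched remotely:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then CPU for local validation runs."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

# The same training entrypoint then runs unchanged on a laptop or a GPU pod.
device = pick_device()
```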
## Option 1: Modal (Recommended for Programmatic Remote Training)
Website: modal.com
### How It Works
Modal lets you decorate a Python function with a GPU requirement and call it from your local machine. The function runs in a container on Modal's infrastructure with the specified GPU. No SSH, no pod setup, no environment scripts.
```python
import modal

app = modal.App("evo2-training")
image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "torch", "transformers", "einops", "h5py", "peft", "scikit-learn"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train_classifier(config: dict, embeddings_path: str) -> dict:
    """Runs on an A100 in Modal's cloud."""
    import torch
    # ... training code identical to the local version, producing `model` ...
    return model.state_dict()

# Called locally (e.g. via `modal run`); the weights are returned automatically
@app.local_entrypoint()
def main():
    import torch
    weights = train_classifier.remote(config, "/data/embeddings.h5")
    torch.save(weights, "output/classifier.pt")
```
### Setup (One-Time)
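A sketch of the one-time steps, matching the "pip install + browser auth" note under Strengths below (verify the exact commands against modal.com/docs):

```bash
pip install modal
modal setup   # one-time browser-based authentication
```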
### Data Transfer
- Small data (< 100 MB): Pass as function arguments (serialized automatically)
- Large data (genomic sequences, HDF5 embeddings): Use Modal Volumes — persistent cloud storage mounted into containers
```python
volume = modal.Volume.from_name("genomic-data", create_if_missing=True)

@app.function(gpu="A100", volumes={"/data": volume})
def train_with_volume(config):
    # /data/ contains your uploaded HDF5 files, parquet, etc.
    embeddings = load_embeddings("/data/embeddings.h5")
    ...
```
Upload the data once from your local machine (a sketch using Modal's CLI; the file name is illustrative):
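```bash
# Copies a local file into the "genomic-data" volume defined above
modal volume put genomic-data data/embeddings.h5 /embeddings.h5
```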
### Pricing (as of March 2026)
Billed per-second of compute. No idle costs.
| GPU | Approx. $/hr | VRAM | Use Case |
|---|---|---|---|
| T4 | ~$0.60 | 16 GB | Quick tests, small classifiers |
| A10G | ~$1.10 | 24 GB | Medium models, hyperparameter search |
| A100 40GB | ~$2.10 | 40 GB | Evo2 7B embeddings, classifier training |
| A100 80GB | ~$2.50 | 80 GB | Evo2 40B (INT8), large-batch training |
| H100 | ~$3.95 | 80 GB | Fastest inference/training |
Pricing changes frequently. See modal.com/pricing for current rates.
### Strengths
- Near-zero setup friction (pip install + browser auth)
- Same Python code locally and remotely (only the decorator/dispatch differs)
- Per-second billing, scales to zero (no idle pod costs)
- Container images are cached — cold start ~30-60s first time, warm after
- Volumes for persistent data (model weights, embeddings, training artifacts)
- Native Python SDK (no YAML, no shell scripts)
- Supports custom conda environments via `Image.from_conda`
### Limitations
- Cold-start latency (~30-60s if container hasn't been used recently)
- Slightly higher $/GPU-hour vs raw RunPod on-demand pods
- Data must be uploaded to Modal Volumes (can't mount local filesystem)
- Debugging remote failures requires reading Modal logs (no SSH into container)
- Modal is a managed service — less control than a raw VM/pod
## Option 2: SkyPilot (Automated Multi-Cloud Orchestration)

Website: skypilot.co | GitHub: github.com/skypilot-org/skypilot
### How It Works
SkyPilot is an open-source orchestrator that provisions GPU VMs/pods across cloud providers (RunPod, Lambda, AWS, GCP, Azure, and more). It automates the manual steps: finding cheapest GPU, provisioning, environment setup, running your script, and teardown.
```yaml
# train.yaml
resources:
  accelerators: A100:1
  cloud: runpod   # or: lambda, aws, gcp; omit the field to let SkyPilot choose

setup: |
  pip install -e .
  pip install -e ./foundation_models

run: |
  python foundation_models/examples/03_train_classifier.py \
    --config configs/evo2_7b_classifier.yaml
```

```bash
sky launch -c train train.yaml              # Provisions, sets up, runs
sky down train                              # Tears down the "train" cluster when done
sky launch -c train train.yaml --use-spot   # Spot/preemptible for ~3x savings
```
### Setup (One-Time Per Provider)
pip install "skypilot[runpod]" # or: skypilot[aws], skypilot[all]
sky check # Verifies cloud credentials
For RunPod: set RUNPOD_API_KEY environment variable.
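```bash
export RUNPOD_API_KEY="..."   # key from your RunPod account (value elided)
```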
### Data Transfer
SkyPilot uses rsync under the hood; your local `data/` directory can be synced to the VM automatically via `file_mounts` in the YAML (paths in the sketch below are illustrative):
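```yaml
# In train.yaml: remote path on the VM (left) maps to a local path (right)
file_mounts:
  /data: ./data
```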
Or use cloud storage (S3, GCS) for larger datasets.
### Pricing
SkyPilot itself is free (open-source). You pay the underlying cloud provider's rates:
| Provider | A40 48GB ($/hr) | A100 80GB ($/hr) | Spot Discount |
|---|---|---|---|
| RunPod | ~$0.69 | ~$1.64 | N/A (on-demand only) |
| Lambda | ~$0.80 | ~$1.29 | N/A |
| AWS (p4d) | — | ~$3.10 | ~60-70% off |
| GCP (a2) | — | ~$3.67 | ~60-70% off |
SkyPilot can auto-select the cheapest available provider when the `cloud` field is omitted.
### Strengths
- Uses your existing RunPod account (and any other cloud accounts)
- Automatic price-shopping across providers
- Spot/preemptible instance support for significant cost savings
- SSH access to the VM (full debugging capability)
- File mounts sync local directories automatically
- Open-source, no vendor lock-in
- YAML-based — good for reproducible experiment configs
- Official RunPod integration
### Limitations
- Still VM/pod-based (boot + setup time: 2-5 minutes per launch)
- Requires cloud credentials configured per provider
- More infrastructure to manage than Modal (YAML configs, credential setup)
- No "call a Python function remotely" pattern (batch job model instead)
- Each launch is a full VM lifecycle (provision → setup → run → teardown)
- Less elegant for rapid iteration vs Modal's function-call pattern
## Option 3: RunPods (Current Approach — Manual Pods)
Website: runpod.io
### How It Works
Manually provision a GPU pod via the RunPod web UI or CLI, SSH in, configure the environment, rsync code and data, run training, rsync results back.
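A sketch of one such session (the `runpod-pod` SSH alias, paths, and conda environment name are placeholders; the project's actual scripts are linked below):

```bash
# After provisioning a pod in the RunPod UI and configuring SSH:
rsync -avz --exclude '.git' ./ runpod-pod:~/project/   # push code + data
ssh runpod-pod 'cd ~/project && conda activate evo2 && \
  python foundation_models/examples/03_train_classifier.py'
rsync -avz runpod-pod:~/project/output/ ./output/      # pull results back
```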
### Setup References
- First-time setup: `docs/getting_started/RUNPODS_SETUP.md`
- SSH manager scripts: `runpods.example/scripts/`
- Pod environment setup: `runpods.example/docs/POD_ENVIRONMENT_SETUP.md`
- Data transfer: `runpods.example/docs/RSYNC_QUICK_REFERENCE.md`
- GitHub SSH on pod: `runpods.example/docs/GITHUB_SSH_SETUP.md`
### Pricing
Lowest raw $/GPU-hour of the three options:
| GPU | On-Demand ($/hr) | Community Cloud ($/hr) |
|---|---|---|
| A40 48GB | ~$0.69 | ~$0.39 |
| A100 80GB | ~$1.64 | ~$1.04 |
| H100 80GB | ~$3.29 | ~$2.49 |
See runpod.io/pricing for current rates.
### Strengths
- Cheapest per-GPU-hour
- Full SSH access and control
- Persistent pods (keep running between sessions)
- Community cloud pricing for additional savings
- Already set up in this project (scripts, docs, environment files)
### Limitations
- Manual SSH config per pod instance
- GitHub SSH keys must be configured on each new pod
- Environment setup required per pod (conda create, pip install)
- Rsync for code/data transfer each session
- Idle pods cost money (must remember to stop/delete)
- No programmatic dispatch from local code
## Comparison Summary
| Feature | Modal | SkyPilot | RunPods (Manual) |
|---|---|---|---|
| Setup friction | Minimal (pip + auth) | Medium (credentials) | High (SSH, env, keys) |
| Code pattern | Python function call | YAML batch job | SSH + rsync |
| Same code local/remote | Yes (decorator only) | Mostly (YAML wraps script) | Separate workflow |
| Data transfer | Volumes (cloud) | rsync / cloud storage | rsync |
| Debugging | Logs only | SSH into VM | Full SSH |
| Idle cost | None (per-second) | None (teardown) | Yes (must stop pod) |
| Raw GPU cost | Higher (~$2-4/hr) | Varies (cheapest cloud) | Lowest (~$0.7-1.6/hr) |
| Spot/preemptible | No | Yes | No |
| Vendor lock-in | Modal only | Multi-cloud | RunPod only |
## Recommendation for This Project
### Near-Term (Current)
Continue using RunPods manual workflow for the initial Evo2 exon classifier experiments. The infrastructure is already set up, and the first priority is proving the training pipeline works end-to-end locally.
### Medium-Term (When Training Becomes Frequent)
Adopt Modal for the programmatic dispatch pattern. This maps directly to the small/medium/large tier strategy:
| Tier | Hardware | Method | When |
|---|---|---|---|
| Small | Local (MPS/CPU/CUDA) | Direct local call | Developing training logic |
| Medium | Modal T4/A10G | `train.remote(...)` | Validated pipeline, real data subset |
| Large | Modal A100/H100 | `train.remote(...)` | Production training, full genome |
### Integration with AdaptiveTrainer

The planned `AdaptiveTrainer` class (see `foundation_models/docs/training/deepspeed_training.md`) could be extended to support a `backend` parameter:
```python
trainer = AdaptiveTrainer(
    model=classifier,
    backend="auto",  # "local" | "modal" | "skypilot" | "auto"
)
# "auto" selects: local if small data + MPS/CUDA available,
# modal if large data or an explicit GPU request
trainer.fit(train_data, val_data)
```
This keeps the training code identical regardless of where it runs.
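A hypothetical sketch of the `"auto"` selection rule (the threshold, function name, and backend strings are illustrative; `AdaptiveTrainer` does not exist yet):

```python
import torch

def select_backend(requested: str, n_train_examples: int,
                   small_data_threshold: int = 50_000) -> str:
    """Illustrative dispatch rule for a future AdaptiveTrainer(backend="auto")."""
    if requested != "auto":
        return requested  # honor an explicit "local" | "modal" | "skypilot"
    has_local_accel = torch.cuda.is_available() or torch.backends.mps.is_available()
    if has_local_accel and n_train_examples < small_data_threshold:
        return "local"
    return "modal"  # large data or no local accelerator: dispatch remotely
```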
### Long-Term (If Multi-Cloud Cost Optimization Matters)
Add SkyPilot as an option for spot-instance training jobs that can tolerate preemption and longer setup times, especially for large-scale sweeps across the cheapest available provider.
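For instance, a spot-tolerant task could set `use_spot` in the resources stanza (a sketch; it assumes the training script checkpoints to durable storage, and the `--resume-from-checkpoint` flag is illustrative):

```yaml
# sweep.yaml
resources:
  accelerators: A100:1
  use_spot: true   # preemptible instance; job must tolerate restarts
run: |
  python foundation_models/examples/03_train_classifier.py \
    --config configs/evo2_7b_classifier.yaml \
    --resume-from-checkpoint /checkpoints/latest
```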