Skip to content

scPerturb: Harmonized Perturbation Datasets

scPerturb is a harmonized collection of single-cell perturbation datasets, providing a unified resource for training and benchmarking perturbation prediction models.


Overview

Property Value
Source scperturb.org
Paper Peidli et al. (2024) "scPerturb: Harmonized Single-Cell Perturbation Data"
Format AnnData (h5ad)
License Various (check individual datasets)

Available Datasets

Dataset Perturbations Cells Cell Type Modality Notes
Replogle 2022 (K562) >5,000 genes ~2.5M K562 CRISPRi Largest, recommended for training
Norman 2019 ~300 (combinatorial) ~100k K562 CRISPRa Combinatorial perturbations
Adamson 2016 ~100 (UPR) ~10k K562 CRISPRi Focused on UPR pathway
Dixit 2016 ~24 ~10k Dendritic CRISPRi Original Perturb-seq paper

Additional Datasets

Dataset Focus Size
Frangieh 2021 Cancer immunotherapy ~200k cells
Papalexi 2021 ECCITE-seq (protein + RNA) ~40k cells
Gasperini 2019 Enhancer perturbations ~200k cells

Data Access

Option 1: scPerturb Portal

Download pre-processed h5ad files from scperturb.org.

Option 2: Python API

# Install scperturb
pip install scperturb

# Download dataset
import scperturb
adata = scperturb.load_dataset("replogle_2022_k562")

Option 3: Direct Download

# Replogle 2022 K562 (large, ~5GB)
wget https://scperturb.org/data/replogle_2022_k562.h5ad

# Norman 2019 (smaller, good for testing)
wget https://scperturb.org/data/norman_2019.h5ad

Data Structure

All scPerturb datasets follow a consistent schema:

import scanpy as sc

adata = sc.read_h5ad("replogle_2022_k562.h5ad")

# Key fields in adata.obs:
# - perturbation: Target gene name (e.g., "TP53", "non-targeting")
# - is_control: Boolean, True for non-targeting controls
# - cell_type: Cell type annotation
# - perturbation_type: "CRISPRi", "CRISPRa", "CRISPRko"

# Expression matrix
X = adata.X  # (n_cells, n_genes)

# Get control cells
controls = adata[adata.obs['is_control']]

# Get perturbed cells for specific gene
tp53_ko = adata[adata.obs['perturbation'] == 'TP53']

Preprocessing for scPPDM

Standard Pipeline

import scanpy as sc
import numpy as np

# 1. Load data
adata = sc.read_h5ad("replogle_2022_k562.h5ad")

# 2. Quality control
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# 3. Normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# 4. Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Split by perturbation
train_perts = [...]  # List of perturbations for training
test_perts = [...]   # Held-out perturbations

train_adata = adata[adata.obs['perturbation'].isin(train_perts + ['non-targeting'])]
test_adata = adata[adata.obs['perturbation'].isin(test_perts + ['non-targeting'])]

Creating Control-Perturbed Pairs

def create_pairs(adata, perturbation):
    """Create (control, perturbed) pairs for training."""
    controls = adata[adata.obs['is_control']].X
    perturbed = adata[adata.obs['perturbation'] == perturbation].X

    # Random pairing (or use more sophisticated matching)
    n_pairs = min(len(controls), len(perturbed))
    ctrl_idx = np.random.choice(len(controls), n_pairs, replace=False)
    pert_idx = np.random.choice(len(perturbed), n_pairs, replace=False)

    return controls[ctrl_idx], perturbed[pert_idx]

Evaluation Metrics

Standard metrics for perturbation prediction:

Metric Description
MSE Mean squared error on held-out perturbations
Pearson r Correlation between predicted and true expression
DEG overlap Overlap of differentially expressed genes
Pathway enrichment Biological pathway consistency

Use in genai-lab

For scPPDM (Diffusion)

# Train diffusion model to predict perturbation effects
# Input: control expression + perturbation embedding
# Output: perturbed expression distribution

For JEPA

# Train predictor in latent space
# Input: control latent + perturbation token
# Output: predicted perturbed latent

For initial experiments, use Norman 2019:

  • Smaller size (~100k cells)
  • Well-characterized perturbations
  • Combinatorial effects (interesting for modeling)
  • Widely used benchmark

For production training, use Replogle 2022:

  • Largest dataset (>5,000 perturbations)
  • Comprehensive coverage
  • High quality

References

  • Peidli et al. (2024) "scPerturb: Harmonized Single-Cell Perturbation Data"
  • Replogle et al. (2022) "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq"
  • Norman et al. (2019) "Exploring genetic interaction manifolds constructed from rich single-cell phenotypes"
  • Dixit et al. (2016) "Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens"