Perturbation Datasets¶
Datasets for training perturbation prediction models (scPPDM, JEPA) on CRISPR screening and Perturb-seq data.
Overview¶
Perturbation datasets capture cellular responses to genetic interventions (CRISPR knockouts, knockdowns, overexpression). These are essential for:
- scPPDM: Single-cell Perturbation Prediction via Diffusion Models
- JEPA: Joint Embedding Predictive Architecture for perturbation response
- Counterfactual generation: Predicting "what if" scenarios
Key Datasets¶
| Dataset | Cells | Perturbations | Cell Type | Document |
|---|---|---|---|---|
| scPerturb | Multiple | Multiple | Various | scperturb.md |
| Replogle 2022 | ~2.5M | >5,000 | K562 | scperturb.md |
| Norman 2019 | ~100k | ~300 (combinatorial) | K562 | scperturb.md |
| Adamson 2016 | ~10k | ~100 | K562 | scperturb.md |
Data Structure¶
Perturb-seq data typically includes:
AnnData object:
├── X: Expression matrix (cells × genes)
├── obs: Cell metadata
│ ├── perturbation: Gene targeted
│ ├── is_control: Boolean (NT control)
│ └── cell_type, batch, etc.
└── var: Gene metadata
└── gene_name, highly_variable, etc.
Key Fields¶
- Control cells: Non-targeting (NT) guide, baseline expression
- Perturbed cells: Expression after intervention
- Perturbation label: Which gene was targeted
Use Cases in genai-lab¶
1. scPPDM (Diffusion-based)¶
Train diffusion model to predict post-perturbation expression:
2. JEPA (Embedding-based)¶
Predict perturbation effects in latent space:
3. Counterfactual Generation¶
Generate "what if" scenarios:
- What if gene X was knocked out in this cell?
- What is the distribution of possible outcomes?
Documents¶
- scperturb.md — scPerturb harmonized collection
- perturb_seq_guide.md — General Perturb-seq data handling