Chest X-Ray Datasets¶
This document covers chest X-ray datasets available in genai-lab for training diffusion models and other generative architectures.
1. Synthetic X-Ray Dataset¶
Purpose: Testing and development without requiring real medical data.
Description¶
SyntheticXRayDataset generates diverse synthetic chest X-ray images with randomized anatomical structures:
- Lungs: Elliptical shapes with variable position, size, darkness
- Heart: Circular structure between lungs
- Ribs: Curved horizontal lines with variable spacing
- Spine: Optional vertical structure
- Clavicles: Optional horizontal structures at top
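As an illustration of how structures like these can be composed, here is a minimal NumPy sketch. It is not the SyntheticXRayDataset implementation: the function name make_synthetic_xray and every position, radius, and intensity below are assumptions chosen for the example, and the optional spine and clavicles are left out.

```python
import numpy as np


def make_synthetic_xray(img_size=128, seed=None):
    """Draw one toy chest X-ray; returns float32 array (1, H, W) in [-1, 1]."""
    rng = np.random.default_rng(seed)
    # Pixel coordinate grids scaled to [0, 1] so shapes are size-independent.
    y, x = np.mgrid[0:img_size, 0:img_size] / (img_size - 1)

    # Mid-gray body background (intensities live in [0, 1] until the end).
    img = np.full((img_size, img_size), 0.6, dtype=np.float64)

    # Lungs: two dark ellipses with jittered centre, radii, and darkness.
    for cx in (0.32, 0.68):
        cx_j = cx + rng.uniform(-0.03, 0.03)
        cy_j = 0.5 + rng.uniform(-0.03, 0.03)
        rx = rng.uniform(0.12, 0.16)
        ry = rng.uniform(0.25, 0.32)
        inside = ((x - cx_j) / rx) ** 2 + ((y - cy_j) / ry) ** 2 <= 1.0
        img[inside] = rng.uniform(0.15, 0.3)

    # Heart: a brighter disc between the lungs, shifted slightly to one side.
    hx, hy, hr = 0.54, 0.6, rng.uniform(0.08, 0.11)
    img[(x - hx) ** 2 + (y - hy) ** 2 <= hr ** 2] = 0.75

    # Ribs: thin bright parabolic arcs with variable vertical spacing.
    spacing = rng.uniform(0.07, 0.1)
    for k in range(6):
        centre = 0.25 + k * spacing + 0.15 * (x - 0.5) ** 2
        img[np.abs(y - centre) < 0.008] += 0.15

    img = np.clip(img, 0.0, 1.0)
    # Map [0, 1] to [-1, 1] and add the leading channel axis.
    return (img * 2.0 - 1.0).astype(np.float32)[None, :, :]
```

Passing the same seed reproduces the same image, which mirrors the role of the seed argument described under Usage.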
Usage¶
from genailab.diffusion.datasets import SyntheticXRayDataset
# Create dataset
dataset = SyntheticXRayDataset(
    n_samples=1000,  # Number of images to generate
    img_size=128,    # Image dimensions (square)
    seed=42,         # Reproducibility
)
# Access images
img = dataset[0] # Returns tensor of shape (1, 128, 128) in [-1, 1]
Characteristics¶
| Property | Value |
|---|---|
| Output shape | (1, H, W) grayscale |
| Value range | [-1, 1] (normalized) |
| Generation | At initialization (all images are pre-generated) |
| Randomization | Per-image anatomical variation |
When to Use¶
- Development: Fast iteration without data download
- Testing: Verify pipeline correctness
- Debugging: Reproducible synthetic data
- Demos: Quick demonstrations
2. Real Chest X-Ray Dataset (Kaggle)¶
Purpose: Training on real medical images for realistic diffusion models.
Source¶
Kaggle Chest X-Ray Images (Pneumonia)
- URL: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
- License: CC BY 4.0
- Size: ~1.2 GB
Dataset Structure¶
chest_xray/
├── train/
│   ├── NORMAL/      (~1,300 images)
│   └── PNEUMONIA/   (~3,900 images)
├── val/
│   ├── NORMAL/
│   └── PNEUMONIA/
└── test/
    ├── NORMAL/
    └── PNEUMONIA/
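After extraction, it can be useful to confirm that the tree matches this layout. The following standard-library sketch counts image files per split and class. The helper name count_images and the set of accepted file suffixes are assumptions for illustration and are not part of genai-lab.

```python
from pathlib import Path

# Suffixes treated as images; an assumption, adjust to the files on disk.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(root):
    """Return {(split, label): n_images} for a chest_xray/-style tree."""
    root = Path(root)
    counts = {}
    for split in ("train", "val", "test"):
        for label in ("NORMAL", "PNEUMONIA"):
            folder = root / split / label
            if not folder.is_dir():
                continue  # tolerate missing or partially extracted folders
            counts[(split, label)] = sum(
                1 for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    for (split, label), n in count_images("data/chest_xray").items():
        print(f"{split}/{label}: {n}")
```

Folders that do not exist are skipped and do not raise, so an empty result indicates that the root path is wrong or the archive was not extracted.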
Usage¶
from genailab.diffusion.datasets import ChestXRayDataset
# Load real X-rays (normal cases)
dataset = ChestXRayDataset(
    root_dir="data/chest_xray/train/NORMAL",
    img_size=128,
)
# Or use factory function
from genailab.diffusion.datasets import get_dataset
dataset = get_dataset(
    name='chest_xray',
    root_dir="data/chest_xray/train/NORMAL",
    img_size=128,
)
Download Instructions¶
- Create a Kaggle account and API token
- Install the kaggle CLI:
pip install kaggle
- Download and extract:
kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
unzip chest-xray-pneumonia.zip -d data/
Characteristics¶
| Property | Value |
|---|---|
| Original resolution | Variable (typically 1000-2000 px) |
| Output shape | (1, H, W) grayscale (resized) |
| Value range | [-1, 1] (normalized) |
| Classes | NORMAL, PNEUMONIA |
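The [-1, 1] value range in the table can be reached from 8-bit grayscale pixels with a simple affine map. The sketch below shows one common convention together with its inverse for viewing samples. The function names are hypothetical, the resize step is omitted, and the source does not confirm that ChestXRayDataset uses exactly this mapping.

```python
import numpy as np


def to_model_range(pixels):
    """uint8 grayscale (H, W) in [0, 255] -> float32 (1, H, W) in [-1, 1]."""
    arr = np.asarray(pixels, dtype=np.float32)
    # 0 maps to -1 and 255 maps to +1; the new leading axis is the channel.
    return (arr / 127.5 - 1.0)[None, :, :]


def to_uint8(sample):
    """Inverse mapping for viewing: (1, H, W) in [-1, 1] -> uint8 (H, W)."""
    arr = (np.asarray(sample, dtype=np.float32)[0] + 1.0) * 127.5
    # Clip so that slightly out-of-range model outputs stay displayable.
    return np.clip(np.rint(arr), 0, 255).astype(np.uint8)
```

The inverse is useful when inspecting generated images, since diffusion samples come back in the same [-1, 1] range as the training data.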
3. Factory Function¶
Use get_dataset() for unified access:
from genailab.diffusion.datasets import get_dataset
# Synthetic
synthetic_ds = get_dataset('synthetic', img_size=128, n_samples=1000)
# Real
real_ds = get_dataset('chest_xray', root_dir='data/chest_xray/train/NORMAL', img_size=128)
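For readers unfamiliar with the pattern, a name-based factory can be sketched with a small registry. This is a generic illustration and not the get_dataset() source: the names register, make_dataset, and ToyDataset are all hypothetical.

```python
from typing import Callable, Dict

# Maps a dataset name to the callable that constructs it.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str):
    """Decorator that records a dataset constructor under a string key."""
    def wrap(ctor):
        _REGISTRY[name] = ctor
        return ctor
    return wrap


def make_dataset(name: str, **kwargs):
    """Look up a constructor by name and forward keyword arguments to it."""
    try:
        ctor = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise ValueError(f"Unknown dataset {name!r}; known: {known}") from None
    return ctor(**kwargs)


@register("toy")
class ToyDataset:
    """Stand-in dataset so the sketch runs without genailab installed."""

    def __init__(self, n_samples: int = 4, img_size: int = 8):
        self.n_samples, self.img_size = n_samples, img_size

    def __len__(self):
        return self.n_samples
```

With this shape, calling code depends only on a string key and keyword arguments, which is what lets the synthetic and real datasets be swapped without changing the training script.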
4. Integration with Training¶
With Diffusion Training¶
import torch
from torch.utils.data import DataLoader
from genailab.diffusion.datasets import get_dataset
from genailab.diffusion.training import train_image_diffusion
# Create dataset and loader
dataset = get_dataset('synthetic', img_size=64, n_samples=500)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Train diffusion model
model, losses = train_image_diffusion(
    dataloader=dataloader,
    img_size=64,
    n_epochs=50,
    device=device,
)
See Also¶
- notebooks/diffusion/03_medical_imaging_diffusion/: Full tutorial
5. Extending with New Datasets¶
To add a new medical imaging dataset:
- Create a new Dataset class in src/genailab/diffusion/datasets.py
- Implement __len__ and __getitem__
- Ensure the output is a (C, H, W) tensor in [-1, 1]
- Add it to the get_dataset() factory function
- Document it here
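The first three steps can be sketched as a minimal map-style dataset. The class name ArrayImageDataset is hypothetical, and to stay dependency-light the sketch returns NumPy arrays, not tensors. In a real genai-lab dataset you would likely subclass torch.utils.data.Dataset and wrap the result with torch.from_numpy.

```python
import numpy as np


class ArrayImageDataset:
    """Map-style dataset over in-memory grayscale images (hypothetical name)."""

    def __init__(self, images):
        # images: sequence of 2-D uint8 arrays, all with the same (H, W).
        self.images = [np.asarray(im, dtype=np.uint8) for im in images]
        if any(im.ndim != 2 for im in self.images):
            raise ValueError("expected 2-D grayscale arrays")

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        im = self.images[idx].astype(np.float32)
        # Scale [0, 255] to [-1, 1] and add the channel axis -> (1, H, W).
        return (im / 127.5 - 1.0)[None, :, :]
```

Keeping the same (C, H, W) shape and [-1, 1] range as the existing datasets means the new class can be passed to the same DataLoader and training code without further changes.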
Related Code¶
| File | Contents |
|---|---|
| src/genailab/diffusion/datasets.py | SyntheticXRayDataset, ChestXRayDataset, get_dataset |
| src/genailab/diffusion/training.py | train_image_diffusion |
| src/genailab/diffusion/architectures.py | UNet2D for image diffusion |