Generating Synthea Data for EHR Sequence Modeling

Overview

Deep learning models for EHR sequences require substantial training data to learn meaningful patterns. While small datasets (100-200 patients) are useful for rapid prototyping and development, production models typically need:

Minimum: 1,000+ patients for basic performance
Recommended: 5,000-10,000 patients for robust models
Optimal: 50,000+ patients for state-of-the-art results

This guide shows how to generate synthetic EHR data at scale using Synthea, an open-source synthetic patient generator that creates complete medical histories with encounters, conditions, procedures, medications, and observations.

Option 1: Generate New Synthea Data (Recommended)

Install Synthea

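# Download the prebuilt Synthea JAR (requires a Java JDK; see the Synthea README for the supported version)
mkdir -p ~/work/data
cd ~/work/data
wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar

# Or clone and build from source
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test
# When running from a source checkout, use ./run_synthea in place of "java -jar synthea-with-dependencies.jar"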

Generate 1000+ Patients

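# Generate 1000 patients. CSV export is disabled by default, so enable it explicitly.
java -jar synthea-with-dependencies.jar -p 1000 --exporter.csv.export=true

# Output will be in ./output/csv/
# Move it to your data directory
mkdir -p ~/work/loinc-predictor/data/synthea
mv output/csv ~/work/loinc-predictor/data/synthea/large_cohort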
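Before training, it can help to confirm that the export contains the number of patients you expect. The sketch below is a minimal check, assuming the standard Synthea CSV layout in which patients.csv has an Id column and encounters.csv has a PATIENT column; the helper name count_patients is illustrative and not part of Synthea or this project. The total may exceed the value passed to -p, because Synthea can also export patients who died during the simulation.

```python
import csv
from pathlib import Path


def count_patients(csv_dir):
    """Return (total patients, patients with at least one encounter)."""
    csv_dir = Path(csv_dir)
    with open(csv_dir / "patients.csv", newline="") as f:
        patient_ids = {row["Id"] for row in csv.DictReader(f)}
    with open(csv_dir / "encounters.csv", newline="") as f:
        seen = {row["PATIENT"] for row in csv.DictReader(f)}
    return len(patient_ids), len(patient_ids & seen)


if __name__ == "__main__":
    # Path taken from the commands above; adjust to your layout.
    cohort = Path.home() / "work" / "loinc-predictor" / "data" / "synthea" / "large_cohort"
    if cohort.exists():
        total, with_enc = count_patients(cohort)
        print(f"{total} patients, {with_enc} with at least one encounter")
```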

Generate with Specific Conditions

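For disease progression modeling, generate cohorts focused on specific conditions. The -m flag restricts the run to modules whose file names match the given pattern; it does not by itself guarantee that every patient develops the condition, and the module names below should be checked against the modules directory of your Synthea version.

# CKD cohort, written to its own directory so the next run does not overwrite it
java -jar synthea-with-dependencies.jar \
  -p 500 \
  -m chronic_kidney_disease \
  --exporter.csv.export=true \
  --exporter.baseDirectory=./output_ckd/

# Diabetes cohort
java -jar synthea-with-dependencies.jar \
  -p 500 \
  -m diabetes \
  --exporter.csv.export=true \
  --exporter.baseDirectory=./output_diabetes/

# Combine multiple cohorts: write the header once, then append the data rows of each cohort
out=~/work/loinc-predictor/data/synthea/combined_1000
mkdir -p "$out"
for f in patients.csv encounters.csv; do
  head -n 1 "output_ckd/csv/$f" > "$out/$f"
  tail -q -n +2 "output_ckd/csv/$f" "output_diabetes/csv/$f" >> "$out/$f"
done
# ... add the other CSV files to the loop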
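When combining cohorts, each merged CSV should contain exactly one header row, and rows with a repeated identifier should not be written twice. The following is a minimal Python sketch of that merge, assuming every cohort was exported by the same Synthea version so the column sets match. The function name merge_csv and its key parameter are hypothetical; pass key=None for tables without an Id column.

```python
import csv
from pathlib import Path


def merge_csv(cohort_dirs, filename, out_dir, key="Id"):
    """Concatenate `filename` from each cohort directory into out_dir.

    Writes a single header row. If `key` is given, rows whose key value
    was already written are skipped. Returns the number of rows written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen, writer, written = set(), None, 0
    with open(out_dir / filename, "w", newline="") as out:
        for cohort in cohort_dirs:
            with open(Path(cohort) / filename, newline="") as f:
                reader = csv.DictReader(f)
                if writer is None:
                    # Take the column order from the first cohort.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    if key is not None:
                        if row[key] in seen:
                            continue
                        seen.add(row[key])
                    writer.writerow(row)
                    written += 1
    return written
```

For example, merge_csv(["output_ckd/csv", "output_diabetes/csv"], "patients.csv", "combined_1000") would produce one patients.csv covering both cohorts.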

Configuration Options

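Edit src/main/resources/synthea.properties when running from a source checkout. With the prebuilt JAR, put overrides in a local properties file passed with -c, or pass them as --key=value arguments.

# Limit exported records to the last 10 years of each patient's history (this is the default; 0 keeps the full history)
exporter.years_of_history = 10

# Include deceased patients in the export
generate.only_alive_patients = false
generate.append_numbers_to_person_names = true

# Increase prevalence of chronic conditions
# NOTE: these two keys may not exist in the default synthea.properties; confirm that your Synthea version honors them before relying on them
generate.chronic_kidney_disease.prevalence = 0.15
generate.diabetes.prevalence = 0.20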
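Settings can also be supplied per run instead of editing the properties file, since Synthea accepts --key=value overrides on the command line. The sketch below builds such a command from a dictionary so that runs are reproducible from a script. The helper synthea_command is hypothetical, and the JAR name assumes the prebuilt download from the install step.

```python
def synthea_command(population, overrides=None, jar="synthea-with-dependencies.jar"):
    """Build the argument list for a Synthea run with property overrides."""
    cmd = ["java", "-jar", jar, "-p", str(population)]
    for key, value in (overrides or {}).items():
        # Java properties files use lowercase true/false.
        if isinstance(value, bool):
            value = str(value).lower()
        cmd.append(f"--{key}={value}")
    return cmd


cmd = synthea_command(1000, {
    "exporter.csv.export": True,
    "exporter.years_of_history": 0,  # 0 keeps the full history
    "generate.only_alive_patients": False,
})
print(" ".join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the overrides in a dictionary makes it straightforward to log the exact configuration alongside each generated cohort.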

Option 2: Use Public Synthea Datasets

SyntheticMass Dataset

Pre-generated Synthea sample dataset:

# Download the April 2020 CSV sample (roughly 1,000 patients, not the full 1M-patient SyntheticMass set)
wget https://synthea.mitre.org/downloads/synthea_sample_data_csv_apr2020.zip

# Extract a specific subset
unzip synthea_sample_data_csv_apr2020.zip
mkdir -p subset_1000
# Header row plus the first 1000 patients
head -n 1001 csv/patients.csv > subset_1000/patients.csv
# Filter other files by patient IDs
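Filtering the other files by patient ID can be sketched as follows. This assumes the standard Synthea CSV layout, where patients.csv identifies each patient in an Id column and the event tables (encounters.csv, conditions.csv, and so on) reference it in a PATIENT column. The function name subset_cohort and its parameters are illustrative only.

```python
import csv
from itertools import islice
from pathlib import Path


def subset_cohort(src_dir, dst_dir, n_patients, tables=("encounters.csv",)):
    """Copy the first n_patients patients and their rows in `tables`."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: copy the first n_patients rows of patients.csv and remember their IDs.
    keep_ids = set()
    with open(src_dir / "patients.csv", newline="") as f, \
         open(dst_dir / "patients.csv", "w", newline="") as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in islice(reader, n_patients):
            keep_ids.add(row["Id"])
            writer.writerow(row)

    # Step 2: keep only the rows of each table that reference a kept patient.
    for name in tables:
        with open(src_dir / name, newline="") as f, \
             open(dst_dir / name, "w", newline="") as out:
            reader = csv.DictReader(f)
            writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
            writer.writeheader()
            writer.writerows(r for r in reader if r["PATIENT"] in keep_ids)
    return keep_ids
```

A call such as subset_cohort("csv", "subset_1000", 1000, tables=("encounters.csv", "conditions.csv", "observations.csv")) would then keep the subset consistent across files.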

MITRE Synthea Downloads

https://synthea.mitre.org/downloads
Pre-generated datasets available
Various sizes and configurations

Option 3: Use Real De-identified EHR Data

If available, use real de-identified data:

MIMIC-III/IV (ICU data)
eICU Collaborative Research Database
UK Biobank
All of Us Research Program

Advantages:
Real clinical patterns
Better generalization
Meaningful outcomes

Requirements:
IRB approval
Data use agreements
Privacy compliance

Updating the Training Pipeline

Once you have more data, update the data path:

# In notebook or training script
from pathlib import Path

# SyntheaAdapter is this project's loader; import it from wherever it lives in your codebase

synthea_root = Path.home() / 'work' / 'loinc-predictor' / 'data' / 'synthea'
data_dir = synthea_root / 'large_cohort'

# Or combine multiple cohorts
data_dirs = [synthea_root / name for name in ('cohort1', 'cohort2', 'cohort3')]

all_events = []
for cohort_dir in data_dirs:
    adapter = SyntheaAdapter(cohort_dir)
    events = adapter.load_events()
    all_events.extend(events)

Expected Performance by Dataset Size

Rough expectations for how dataset size affects survival-analysis performance (a C-index of 0.5 corresponds to random ranking):

Dataset Size	Expected C-index	Training Time	Use Case
100-200 patients	0.45-0.55	5-10 min	Development/debugging
500 patients	0.55-0.65	20-30 min	Initial experiments
1,000 patients	0.60-0.70	45-60 min	Baseline models
5,000 patients	0.65-0.75	3-4 hours	Production models
10,000+ patients	0.70-0.80	8-10 hours	State-of-the-art

Notes:
Performance estimates assume well-defined outcomes and sufficient event rates
With pretrained embeddings, expect +0.05-0.10 improvement in C-index
Training time varies by hardware (estimates for single GPU)
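For reference, the C-index reported above measures how often the model ranks a patient with an earlier observed event as higher risk than a patient who survived longer. The sketch below is a plain O(n^2) version of Harrell's concordance index, written for clarity rather than speed; it is not necessarily the implementation used in this project's training pipeline, and the function and argument names are illustrative.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times:  observed time for each patient (event or censoring time)
    events: 1/True if the event was observed, 0/False if censored
    risks:  predicted risk scores (higher means an earlier expected event)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A model that always ranks pairs correctly scores 1.0, one that gives every patient the same risk scores 0.5, and small cohorts with few observed events yield few comparable pairs, which is one reason the estimates for 100-200 patients are so unstable.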

Next Steps

Generate/download more data (this guide)
Implement pretrained embeddings (Phase 2)
Retrain with larger dataset
Evaluate performance improvement

Resources

Synthea GitHub: https://github.com/synthetichealth/synthea
Synthea Wiki: https://github.com/synthetichealth/synthea/wiki
SyntheticMass: https://synthea.mitre.org/downloads
Synthea Module Builder: https://synthetichealth.github.io/module-builder/