Data Generation Documentation

This directory contains comprehensive guides for generating synthetic patient data using Synthea for EHR sequence modeling and survival analysis.

Documents

1. Data Generation Guide

Purpose: Main guide for generating synthetic patient data with Synthea

Topics covered:

- Why synthetic data is needed for EHR deep learning
- Dataset size recommendations for different use cases
- Installing and setting up Synthea
- Generating different patient cohorts
- Configuring Synthea for specific conditions
- Integrating generated data into training pipelines
- Expected model performance by dataset size

When to use: Start here for overview and general guidance on data generation.

2. Synthea CSV Export Troubleshooting

Purpose: Detailed troubleshooting guide for CSV export issues

Topics covered:

- Why CSV files may not be generated despite configuration
- Root cause analysis of configuration precedence
- Reliable solutions using command-line flags
- Verification steps and best practices
- Common pitfalls and how to avoid them
- Lessons learned from trial-and-error debugging

When to use: Reference this when CSV files are not being generated, or to understand the correct way to ensure CSV export.

Quick Start

Generate 1000 Patients (CSV Format)

cd ~/work/synthea

# Clean output directory
rm -rf output/csv/*

# Generate with explicit CSV export
java -jar synthea-with-dependencies.jar \
  --exporter.csv.export=true \
  --exporter.fhir.export=false \
  -p 1000

# Verify success
wc -l output/csv/patients.csv  # At least 1001 (header + 1000 living patients; deceased patients add rows)
du -sh output/csv/              # Should be ~30-50 MB

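# Copy to project (create the target directory if it does not exist yet)
mkdir -p ~/work/loinc-predictor/data/synthea/large_cohort_1000/
cp -r output/csv/* ~/work/loinc-predictor/data/synthea/large_cohort_1000/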

Key Lessons

Configuration Precedence

Command-line arguments > Local properties file > Embedded JAR defaults

Always use command-line flags to ensure settings are applied.
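Before relying on the flags, it can help to see which export toggles a local properties file sets, since those are the values the command line overrides. The following is a minimal sketch: `show_export_settings` is a hypothetical helper, not part of Synthea, and the path to your properties file is an assumption you will need to adjust.

```shell
# Hypothetical helper: prints the CSV and FHIR export toggles found in a
# properties file, so you can see what the command-line flags will override.
# Usage: show_export_settings <properties_file>
show_export_settings() {
  grep -E '^[[:space:]]*exporter\.(csv|fhir)\.export[[:space:]]*=' "$1"
}
```

If the function prints nothing, the file does not set either toggle and the embedded JAR defaults apply unless you pass flags.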

Verification is Critical

Never assume data was generated in the expected format. Always verify:

- CSV directory exists
- Files have reasonable sizes
- Patient count matches expectation
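These checks can be scripted so they run after every generation. The sketch below assumes the layout used in the Quick Start (a `patients.csv` with one header row inside the CSV directory); `verify_synthea_csv` is a hypothetical helper name. It checks for a minimum row count rather than an exact one, on the assumption that the file may contain more rows than the requested population.

```shell
# Hypothetical helper (not part of Synthea): checks that the CSV directory
# exists, that patients.csv is non-empty, and that it has enough rows.
# Usage: verify_synthea_csv <csv_dir> <min_patients>
verify_synthea_csv() {
  csv_dir=$1
  min_patients=$2
  patients_file="$csv_dir/patients.csv"

  # 1. The CSV directory exists.
  if [ ! -d "$csv_dir" ]; then
    echo "FAIL: directory not found: $csv_dir"
    return 1
  fi

  # 2. The patients file exists and is not empty.
  if [ ! -s "$patients_file" ]; then
    echo "FAIL: missing or empty file: $patients_file"
    return 1
  fi

  # 3. The row count, excluding the header line, meets the expectation.
  total_lines=$(wc -l < "$patients_file")
  patient_rows=$((total_lines - 1))
  if [ "$patient_rows" -lt "$min_patients" ]; then
    echo "FAIL: expected at least $min_patients patients, found $patient_rows"
    return 1
  fi

  echo "OK: $patient_rows patients in $patients_file"
}
```

For the Quick Start run this would be called as `verify_synthea_csv output/csv 1000`.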

Document Your Process

Save the exact commands used for reproducibility and debugging.
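One lightweight way to do this is to route generation commands through a small wrapper that records them before running them. This is only a sketch; `log_and_run` and the log file name are hypothetical.

```shell
# Hypothetical wrapper: appends a timestamped copy of the command to a log
# file, then runs the command unchanged.
# Usage: log_and_run <log_file> <command> [args...]
log_and_run() {
  log_file=$1
  shift
  printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
  "$@"
}

# Example (commented out so the snippet does not start a generation run):
# log_and_run generation.log java -jar synthea-with-dependencies.jar \
#   --exporter.csv.export=true --exporter.fhir.export=false -p 1000
```

The log records arguments joined by spaces, so arguments that themselves contain spaces will not be quoted in the log.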

Common Use Cases

Small Dataset for Testing (100 patients)

java -jar synthea-with-dependencies.jar \
  --exporter.csv.export=true \
  --exporter.fhir.export=false \
  -p 100

Medium Dataset for Development (1000 patients)

java -jar synthea-with-dependencies.jar \
  --exporter.csv.export=true \
  --exporter.fhir.export=false \
  -p 1000

Large Dataset for Training (10000 patients)

java -jar synthea-with-dependencies.jar \
  --exporter.csv.export=true \
  --exporter.fhir.export=false \
  -p 10000

Disease-Specific Cohort (CKD patients)

java -jar synthea-with-dependencies.jar \
  --exporter.csv.export=true \
  --exporter.fhir.export=false \
  -p 1000 \
  -m chronic_kidney_disease
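The four use cases above differ only in population size and an optional module name, so the command can be built from those two inputs. The sketch below uses a hypothetical helper, `synthea_cmd`, that only prints the command so it can be reviewed or logged before it is run.

```shell
# Hypothetical helper: prints the Synthea command for a cohort of a given
# size, optionally restricted to one module. It does not run anything.
# Usage: synthea_cmd <population> [module]
synthea_cmd() {
  population=$1
  module=${2:-}

  # Reject anything that is not a non-empty string of digits.
  case $population in
    ''|*[!0-9]*)
      echo "population must be a positive integer" >&2
      return 1
      ;;
  esac

  cmd="java -jar synthea-with-dependencies.jar --exporter.csv.export=true --exporter.fhir.export=false -p $population"
  if [ -n "$module" ]; then
    cmd="$cmd -m $module"
  fi
  echo "$cmd"
}
```

For example, `synthea_cmd 1000 chronic_kidney_disease` prints the disease-specific command shown above on a single line.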

Related Documentation

Pretrained Embeddings Guide - Using pretrained medical code embeddings
Survival Analysis Methods - Causal survival analysis theory
Synthea Official Wiki - Comprehensive Synthea documentation

Troubleshooting Decision Tree

CSV files not generated?
│
├─ Are FHIR files being created instead?
│  └─ YES → See "Synthea CSV Export Troubleshooting" guide
│
├─ Is Synthea running without errors?
│  ├─ NO → Check Java installation and JAR file
│  └─ YES → Use command-line flags instead of properties file
│
└─ Are you in the correct directory?
   └─ Run: cd ~/work/synthea
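The filesystem checks in the tree above can be roughly automated. This is a sketch under stated assumptions: `diagnose_csv_export` and `has_files` are hypothetical helpers, the JAR is assumed to sit in the Synthea directory, and output is assumed to land in `output/csv` and `output/fhir` beneath it. It does not run Synthea or check Java itself.

```shell
# Hypothetical helpers (not part of Synthea) mirroring the decision tree.
# Usage: diagnose_csv_export <synthea_dir>

# True when the directory exists and contains at least one entry.
has_files() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

diagnose_csv_export() {
  synthea_dir=$1

  # Are you in the correct directory? The JAR should be here.
  if [ ! -f "$synthea_dir/synthea-with-dependencies.jar" ]; then
    echo "JAR not found: check the directory and the JAR file"
    return 1
  fi

  # CSV output is present, so there is nothing to fix.
  if has_files "$synthea_dir/output/csv"; then
    echo "CSV files present"
    return 0
  fi

  # Were FHIR files created instead of CSV?
  if has_files "$synthea_dir/output/fhir"; then
    echo "FHIR created instead of CSV: see the CSV export troubleshooting guide"
    return 1
  fi

  # No output at all: Synthea probably did not run to completion.
  echo "No output found: check the Java installation and rerun with command-line flags"
  return 1
}
```

Calling `diagnose_csv_export ~/work/synthea` prints one line naming the branch that applies.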

Best Practices Summary

Use explicit command-line flags for export format control
Clean output directory before each generation
Verify output immediately after generation
Document commands in scripts or logs
Test with small datasets before scaling up
Check file sizes to ensure data was actually generated

Support

If you encounter issues not covered in these guides:

Check the Synthea GitHub Issues
Review the Synthea Wiki
Verify your Java version: java -version (requires Java 11+)
Check Synthea version and update if needed
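The Java version check can be scripted by parsing the first line of the `java -version` output, which is written to stderr. The helper below, `java_major_version`, is a hypothetical sketch; it assumes the usual `version "..."` format and maps legacy `1.x` strings (Java 8 and earlier) to `x`.

```shell
# Hypothetical helper: extracts the major Java version from `java -version`
# output passed in as a string.
# Usage: java_major_version "$(java -version 2>&1)"
java_major_version() {
  # Pull the quoted version string from the first line that has one.
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)

  # Legacy scheme: "1.8.0_291" means Java 8, so drop the leading "1.".
  case $version in
    1.*) version=${version#1.} ;;
  esac

  # Keep only the leading digits (the major version).
  echo "${version%%[!0-9]*}"
}

# Example (commented out so the snippet does not require Java to be installed):
# major=$(java_major_version "$(java -version 2>&1)")
# [ "$major" -ge 11 ] || echo "Java 11+ required, found: $major"
```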

Contributing

When adding new data generation documentation:

Place files in this directory
Update this README with links and descriptions
Include practical examples and code snippets
Document lessons learned from troubleshooting
Add to the "Common Use Cases" section if applicable