EHR Sequencing¶
Research Framework for Longitudinal EHR Sequence Modeling
A comprehensive toolkit for exploring temporal representations, learning objectives, and model architectures for disease progression, survival analysis, and temporal phenotyping under censoring and irregular follow-up.
Overview¶
EHR Sequencing applies sequence modeling techniques from genomics and NLP to Electronic Health Records, treating medical codes as "words" and patient histories as "documents" to enable:
- Disease Progression Modeling - Predict future diagnoses and outcomes
- Survival Analysis - Time-to-event modeling with proper censoring handling
- Temporal Phenotyping - Discover disease subtypes from patient trajectories
- Patient Segmentation - Cluster patients by clinical similarity
- Clinical Trajectory Analysis - Understand disease evolution patterns
The Analogy¶
DNA Sequences (ATCG...) → Genomic Language Models
↓ ↓
Medical Code Sequences → EHR Sequencing Models
(LOINC, SNOMED, ICD...) (This Project)
Key Features¶
🏥 Comprehensive Data Pipeline¶
- Multi-source adapters: Synthea, MIMIC-III support
- Visit grouping: Semantic code ordering (diagnoses → procedures → medications)
- Flexible tokenization: Visit-based, flat, or hierarchical sequences
- PyTorch integration: Ready-to-use datasets and dataloaders
🧬 Survival Analysis¶
- Discrete-time survival models: LSTM-based hazard prediction
- Synthetic outcome generation: Validated correlation (r = -0.5)
- Proper censoring handling: Negative log-likelihood loss
- C-index evaluation: Fixed-horizon risk scores to avoid length bias
🤖 Model Architectures¶
- LSTM baseline: Visit-level sequence encoding
- Discrete-time survival LSTM: Hazard prediction at each visit
- Extensible framework: Easy to add Transformers, BEHRT, etc.
📊 Evaluation & Validation¶
- Concordance index (C-index): Survival model ranking quality
- Synthetic data validation: Fast iteration with pre-validated outcomes
- Memory estimation: Plan GPU requirements before training
Quick Start¶
Installation¶
# 1. Clone repository
git clone https://github.com/pleiadian53/ehr-sequencing.git
cd ehr-sequencing
# 2. Create conda environment (choose your platform)
# macOS (M1/M2/M3):
mamba env create -f environment-macos.yml
# Linux/Windows with NVIDIA GPU:
mamba env create -f environment-cuda.yml
# CPU-only:
mamba env create -f environment-cpu.yml
# 3. Activate environment
mamba activate ehrsequencing
# 4. Install package with poetry
poetry install
# 5. Verify installation
python -c "import ehrsequencing; print(f'✅ EHR Sequencing v{ehrsequencing.__version__} ready!')"
See Installation Guide for detailed instructions.
Basic Usage¶
from ehrsequencing.data import SyntheaAdapter, VisitGrouper, PatientSequenceBuilder
# 1. Load EHR data
adapter = SyntheaAdapter('data/synthea/')
patients = adapter.load_patients(limit=100)
events = adapter.load_events(patient_ids=[p.patient_id for p in patients])
# 2. Group events into visits (with semantic code ordering)
grouper = VisitGrouper(strategy='hybrid', preserve_code_types=True)
patient_visits = grouper.group_by_patient(events)
# 3. Build patient sequences
builder = PatientSequenceBuilder(max_visits=50, max_codes_per_visit=100)
vocab = builder.build_vocabulary(patient_visits, min_frequency=5)
sequences = builder.build_sequences(patient_visits, min_visits=2)
# 4. Create PyTorch dataset
dataset = builder.create_dataset(sequences)
print(f"Created dataset with {len(dataset)} patients")
print(f"Vocabulary size: {len(vocab)}")
Survival Analysis Example¶
from ehrsequencing.models import DiscreteTimeSurvivalLSTM
from ehrsequencing.synthetic.survival import DiscreteTimeSurvivalGenerator
# Generate synthetic outcomes
generator = DiscreteTimeSurvivalGenerator(censoring_rate=0.3)
outcome = generator.generate(sequences)
# Train survival model
model = DiscreteTimeSurvivalLSTM(
vocab_size=len(vocab),
embedding_dim=128,
hidden_dim=256
)
# Evaluate with C-index
# See notebooks/02_survival_analysis/ for complete workflow
Documentation¶
📚 Tutorials¶
📓 Notebooks¶
📖 Methods¶
🛠️ Guides¶
Project Status¶
Phase: 1.5 - Survival Analysis (80% Complete)
Version: 0.1.0 (Alpha)
Status: Active Development
Recent Updates (January 2026)¶
- ✅ Implemented discrete-time survival LSTM model
- ✅ Created synthetic outcome generator with validated correlation
- ✅ Resolved C-index calculation issues (achieved 0.65-0.70)
- ✅ Comprehensive survival analysis tutorials
- 🔄 Next: Code embeddings (Med2Vec, BEHRT)
See the project repository for detailed development plan.
Research Focus¶
This project explores multiple dimensions of EHR sequence modeling:
Temporal Representations¶
- Visit-based sequences
- Flat event streams
- Hierarchical code structures
- Time-aware embeddings
Learning Objectives¶
- Supervised prediction (disease onset, mortality)
- Self-supervised pre-training (masked language modeling)
- Survival analysis (time-to-event with censoring)
- Representation learning (patient embeddings)
Model Architectures¶
- LSTMs (baseline and survival variants)
- Transformers (BEHRT-style)
- Graph neural networks (code relationships)
- Hybrid architectures
Real-World Challenges¶
- Censoring (patients lost to follow-up)
- Irregular sampling (variable visit frequencies)
- Missing data (incomplete records)
- Length bias (variable sequence lengths)
Citation¶
If you use this framework in your research, please cite:
@software{ehr_sequencing_2026,
title = {EHR Sequencing: Research Framework for Longitudinal EHR Modeling},
author = {EHR Sequencing Research Team},
year = {2026},
url = {https://github.com/pleiadian53/ehr-sequencing}
}
License¶
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments¶
- Synthea: Synthetic patient data generation
- PyHealth: Reference implementations for EHR modeling
- Material for MkDocs: Documentation framework with LaTeX support
Contact¶
- GitHub: pleiadian53/ehr-sequencing
- Issues: Report bugs or request features
Built with ❤️ for advancing healthcare AI research