EHR Sequencing

Research Framework for Longitudinal EHR Sequence Modeling

A comprehensive toolkit for exploring temporal representations, learning objectives, and model architectures for disease progression, survival analysis, and temporal phenotyping under censoring and irregular follow-up.


Overview

EHR Sequencing applies sequence modeling techniques from genomics and NLP to Electronic Health Records, treating medical codes as "words" and patient histories as "documents" to enable:

  • Disease Progression Modeling - Predict future diagnoses and outcomes
  • Survival Analysis - Time-to-event modeling with proper censoring handling
  • Temporal Phenotyping - Discover disease subtypes from patient trajectories
  • Patient Segmentation - Cluster patients by clinical similarity
  • Clinical Trajectory Analysis - Understand disease evolution patterns

The Analogy

DNA Sequences (ATCG...)  →  Genomic Language Models
    ↓                              ↓
Medical Code Sequences   →  EHR Sequencing Models
(LOINC, SNOMED, ICD...)      (This Project)

Key Features

🏥 Comprehensive Data Pipeline

  • Multi-source adapters: Synthea, MIMIC-III support
  • Visit grouping: Semantic code ordering (diagnoses → procedures → medications)
  • Flexible tokenization: Visit-based, flat, or hierarchical sequences
  • PyTorch integration: Ready-to-use datasets and dataloaders
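To illustrate the difference between visit-based and flat tokenization, here is a minimal sketch of flattening a visit-grouped history into one token stream. The helper name and the `[SEP]` separator are illustrative, not the package's actual API:

```python
def flatten_visits(visits, sep_token="[SEP]"):
    """Flatten a visit-grouped history into one flat token stream.

    Each visit is a list of medical codes; a separator token marks
    visit boundaries so the model can still recover visit structure.
    """
    tokens = []
    for visit in visits:
        tokens.extend(visit)       # codes keep their within-visit order
        tokens.append(sep_token)   # mark the end of the visit
    return tokens
```

Visit-based sequences instead keep the nested `list[list[code]]` structure and encode each visit before encoding the sequence of visit vectors.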

🧬 Survival Analysis

  • Discrete-time survival models: LSTM-based hazard prediction
  • Synthetic outcome generation: validated risk-to-survival-time correlation (r = -0.5)
  • Proper censoring handling: Negative log-likelihood loss
  • C-index evaluation: Fixed-horizon risk scores to avoid length bias
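The censoring-aware negative log-likelihood listed above can be sketched in a few lines of pure Python. The function name and interface are illustrative (in practice this is computed over batches with masked tensors); the key point is that censored patients contribute only survival terms:

```python
import math

def discrete_time_nll(hazards, event_idx, observed):
    """NLL for one patient under discrete-time survival.

    hazards:   per-interval hazard probabilities h_t = P(event in t | survived to t)
    event_idx: interval index of the event or censoring time
    observed:  True if the event occurred, False if censored
    """
    eps = 1e-7  # numerical guard against log(0)
    # Patient survived every interval before event_idx
    log_surv = sum(math.log(1.0 - h + eps) for h in hazards[:event_idx])
    if observed:
        # Observed event: add the log-hazard at the event interval
        return -(log_surv + math.log(hazards[event_idx] + eps))
    # Censored: patient is only known to survive through this interval
    return -(log_surv + math.log(1.0 - hazards[event_idx] + eps))
```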

🤖 Model Architectures

  • LSTM baseline: Visit-level sequence encoding
  • Discrete-time survival LSTM: Hazard prediction at each visit
  • Extensible framework: Easy to add Transformers, BEHRT, etc.

📊 Evaluation & Validation

  • Concordance index (C-index): Survival model ranking quality
  • Synthetic data validation: Fast iteration with pre-validated outcomes
  • Memory estimation: Plan GPU requirements before training
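For reference, the concordance index is the fraction of comparable patient pairs that the model ranks correctly: a pair is comparable when the patient with the earlier time had an observed event. A minimal O(n²) sketch (illustrative, not the package's implementation):

```python
def concordance_index(risk, time, event):
    """C-index: fraction of comparable pairs correctly ranked by risk.

    risk:  predicted risk scores (higher = earlier event expected)
    time:  observed event or censoring times
    event: 1 if the event was observed, 0 if censored
    """
    concordant, comparable = 0.0, 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            # Pair is comparable only if i had an observed event before j's time
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties count as half-concordant
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the 0.65-0.70 reported below sits in the typical range for EHR survival baselines.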

Quick Start

Installation

# 1. Clone repository
git clone https://github.com/pleiadian53/ehr-sequencing.git
cd ehr-sequencing

# 2. Create conda environment (choose your platform)
# macOS (M1/M2/M3):
mamba env create -f environment-macos.yml
# Linux/Windows with NVIDIA GPU:
mamba env create -f environment-cuda.yml
# CPU-only:
mamba env create -f environment-cpu.yml

# 3. Activate environment
mamba activate ehrsequencing

# 4. Install package with poetry
poetry install

# 5. Verify installation
python -c "import ehrsequencing; print(f'✅ EHR Sequencing v{ehrsequencing.__version__} ready!')"

See Installation Guide for detailed instructions.

Basic Usage

from ehrsequencing.data import SyntheaAdapter, VisitGrouper, PatientSequenceBuilder

# 1. Load EHR data
adapter = SyntheaAdapter('data/synthea/')
patients = adapter.load_patients(limit=100)
events = adapter.load_events(patient_ids=[p.patient_id for p in patients])

# 2. Group events into visits (with semantic code ordering)
grouper = VisitGrouper(strategy='hybrid', preserve_code_types=True)
patient_visits = grouper.group_by_patient(events)

# 3. Build patient sequences
builder = PatientSequenceBuilder(max_visits=50, max_codes_per_visit=100)
vocab = builder.build_vocabulary(patient_visits, min_frequency=5)
sequences = builder.build_sequences(patient_visits, min_visits=2)

# 4. Create PyTorch dataset
dataset = builder.create_dataset(sequences)
print(f"Created dataset with {len(dataset)} patients")
print(f"Vocabulary size: {len(vocab)}")

Survival Analysis Example

from ehrsequencing.models import DiscreteTimeSurvivalLSTM
from ehrsequencing.synthetic.survival import DiscreteTimeSurvivalGenerator

# Generate synthetic outcomes
generator = DiscreteTimeSurvivalGenerator(censoring_rate=0.3)
outcome = generator.generate(sequences)

# Train survival model
model = DiscreteTimeSurvivalLSTM(
    vocab_size=len(vocab),
    embedding_dim=128,
    hidden_dim=256
)

# Evaluate with C-index
# See notebooks/02_survival_analysis/ for the complete training and evaluation workflow
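The fixed-horizon risk scores mentioned under Key Features can be derived from per-interval hazards as the cumulative event probability by a chosen horizon, which makes risks comparable across patients with different sequence lengths. A minimal sketch (the helper name is illustrative):

```python
def fixed_horizon_risk(hazards, horizon):
    """Cumulative event probability by a fixed horizon.

    P(event by horizon) = 1 - prod_{t < horizon} (1 - h_t),
    i.e. one minus the probability of surviving every interval.
    """
    surv = 1.0
    for h in hazards[:horizon]:
        surv *= (1.0 - h)
    return 1.0 - surv
```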

Documentation

📚 Tutorials

📓 Notebooks

📖 Methods

🛠️ Guides


Project Status

Phase: 1.5 - Survival Analysis (80% Complete)
Version: 0.1.0 (Alpha)
Status: Active Development

Recent Updates (January 2026)

  • ✅ Implemented discrete-time survival LSTM model
  • ✅ Created synthetic outcome generator with validated correlation
  • ✅ Resolved C-index calculation issues (achieved 0.65-0.70)
  • ✅ Comprehensive survival analysis tutorials
  • 🔄 Next: Code embeddings (Med2Vec, BEHRT)

See the project repository for detailed development plan.


Research Focus

This project explores multiple dimensions of EHR sequence modeling:

Temporal Representations

  • Visit-based sequences
  • Flat event streams
  • Hierarchical code structures
  • Time-aware embeddings

Learning Objectives

  • Supervised prediction (disease onset, mortality)
  • Self-supervised pre-training (masked language modeling)
  • Survival analysis (time-to-event with censoring)
  • Representation learning (patient embeddings)
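The time-to-event objective above corresponds to the standard discrete-time survival likelihood (under the usual assumption of non-informative right censoring):

```latex
% h_{i,t}: hazard for patient i in interval t; T_i: event/censoring interval;
% d_i = 1 if the event was observed for patient i, 0 if censored.
\mathcal{L} = \prod_{i} \; h_{i,T_i}^{\,d_i}\,(1 - h_{i,T_i})^{\,1-d_i}
              \prod_{t=1}^{T_i - 1} (1 - h_{i,t})
```

Maximizing this likelihood is equivalent to minimizing the censoring-aware negative log-likelihood loss used by the discrete-time survival LSTM.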

Model Architectures

  • LSTMs (baseline and survival variants)
  • Transformers (BEHRT-style)
  • Graph neural networks (code relationships)
  • Hybrid architectures

Real-World Challenges

  • Censoring (patients lost to follow-up)
  • Irregular sampling (variable visit frequencies)
  • Missing data (incomplete records)
  • Length bias (variable sequence lengths)
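Variable sequence lengths are typically handled by right-padding to a common length plus a mask that keeps padded positions out of the loss. A minimal sketch (illustrative; not the package's actual collate function):

```python
def pad_sequences(seqs, pad_id=0):
    """Right-pad variable-length code sequences and build a matching mask.

    Returns (padded, mask) where mask is 1 for real tokens, 0 for padding,
    so downstream losses and metrics can ignore padded positions.
    """
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask
```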

Citation

If you use this framework in your research, please cite:

@software{ehr_sequencing_2026,
  title = {EHR Sequencing: Research Framework for Longitudinal EHR Modeling},
  author = {EHR Sequencing Research Team},
  year = {2026},
  url = {https://github.com/pleiadian53/ehr-sequencing}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Synthea: Synthetic patient data generation
  • PyHealth: Reference implementations for EHR modeling
  • Material for MkDocs: Documentation framework with LaTeX support

Built with ❤️ for advancing healthcare AI research