Discrete-Time Survival Analysis: Framework and Foundations

Topics: Survival math, hazard function, censoring, label preparation, synthetic data

Table of Contents

Why Discrete-Time Survival?
Core Quantities
Censoring
Label Preparation
Synthetic Data Generation
Evaluation Metrics

Why Discrete-Time Survival?

Survival analysis models the time until an event occurs, accounting for patients who have not experienced the event by the end of observation (censored observations). In EHR data, typical outcomes include:

Hospital readmission (30-day, 90-day)
Mortality (in-hospital, 30-day, 1-year)
Disease onset (diabetes, heart failure, stroke)
Treatment failure (time to efficacy loss or adverse event)

EHR data arrives in discrete time intervals (visits), making discrete-time survival a natural fit:

Visits are discrete events — patients are observed at specific time points
Events occur between visits (interval censoring is the norm)
Computational simplicity — no need for continuous-time hazard estimation
Flexible hazard modeling — non-parametric per-visit hazard captures complex patterns

Core Quantities

Hazard Function

At each visit \(t\), the model predicts a hazard \(h_t\):

\[h_t = P(T = t \mid T \geq t,\, X_t)\]

where:

\(T\) is the event time (visit index when event occurs)
\(X_t\) is the patient history up to visit \(t\)
\(h_t \in [0, 1]\) is the probability of event at visit \(t\), given survival to \(t\)

Interpretation: \(h_t = 0.2\) means a 20% chance the event occurs at visit \(t\), given the patient survived to visit \(t\).

Survival Function

\[S(t) = P(T > t) = \prod_{i=1}^{t} (1 - h_i)\]

Example:

Visit 1: h₁ = 0.10  →  S(1) = 0.90
Visit 2: h₂ = 0.15  →  S(2) = 0.90 × 0.85 = 0.765
Visit 3: h₃ = 0.20  →  S(3) = 0.765 × 0.80 = 0.612
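The running product in the example above can be reproduced in a few lines. This is a minimal sketch using only the standard library; `survival_curve` is a hypothetical helper name, not part of any package referenced in this document.

```python
from itertools import accumulate
from operator import mul

def survival_curve(hazards):
    """Return [S(1), ..., S(t)] where S(t) = prod_{i<=t} (1 - h_i)."""
    # accumulate keeps the running product of the per-visit survival factors
    return list(accumulate((1.0 - h for h in hazards), mul))

hazards = [0.10, 0.15, 0.20]          # h_1, h_2, h_3 from the example
survival = survival_curve(hazards)
print([round(s, 3) for s in survival])  # [0.9, 0.765, 0.612]
```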

Cumulative Incidence

\[F(t) = 1 - S(t) = 1 - \prod_{i=1}^{t} (1 - h_i)\]
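One way to sanity-check this identity is to compute the probability of the event at each visit, \(P(T = t) = h_t \, S(t-1)\) with \(S(0) = 1\), and confirm that these probabilities sum to \(F(t)\). The sketch below does that for the hazards from the survival example; `event_time_pmf` is a hypothetical helper name.

```python
def event_time_pmf(hazards):
    """Return [P(T=1), ..., P(T=t)] with P(T=t) = h_t * S(t-1) and S(0) = 1."""
    pmf, surviving = [], 1.0
    for h in hazards:
        pmf.append(h * surviving)  # event at this visit, after surviving all earlier ones
        surviving *= 1.0 - h       # update the survival probability to S(t)
    return pmf

hazards = [0.10, 0.15, 0.20]
pmf = event_time_pmf(hazards)
cumulative_incidence = sum(pmf)    # F(3), accumulated visit by visit

# The same quantity from the closed form F(t) = 1 - prod(1 - h_i)
closed_form = 1.0 - (1 - 0.10) * (1 - 0.15) * (1 - 0.20)
```

Both routes give approximately 0.388, which is 1 minus the S(3) = 0.612 from the survival example.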

Censoring

A patient is censored at time \(c\) if the event has not occurred by the end of observation. The patient's true event time \(T\) is unknown, but we know \(T > c\).

Right censoring is the standard case in EHR:

Patient leaves the health system
Study ends before event occurs
Patient dies from an unrelated cause

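Key assumption: Censoring is non-informative, meaning the reason for censoring is independent of the event risk. This cannot be confirmed from the observed data alone, so assess its plausibility for each censoring mechanism in real data (for example, patients who leave the health system because they are getting sicker would violate it).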
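To see how the fact \(T > c\) is used in practice, the sketch below computes a simple life-table estimate of the hazard: events at visit \(t\) divided by patients still at risk at \(t\). A censored patient counts toward the at-risk set up to and including their censoring time but never contributes an event. This is a standard nonparametric estimator shown for illustration, not a method taken from this document; `life_table_hazard` is a hypothetical name and visit indices are assumed to be 0-based.

```python
def life_table_hazard(event_times, event_indicators):
    """Estimate h_t = (events at t) / (patients at risk at t) for each visit index t."""
    hazards = {}
    for t in range(max(event_times) + 1):
        # At risk: everyone whose event or censoring time is t or later
        at_risk = sum(1 for time in event_times if time >= t)
        # Events: only observed events count; censored patients add nothing here
        events = sum(
            1
            for time, observed in zip(event_times, event_indicators)
            if time == t and observed == 1
        )
        hazards[t] = events / at_risk
    return hazards

event_times = [2, 3, 1, 5, 4]        # visit index of event or censoring
event_indicators = [1, 0, 1, 0, 1]   # 1=event, 0=censored
hazards = life_table_hazard(event_times, event_indicators)
print(hazards)  # {0: 0.0, 1: 0.2, 2: 0.25, 3: 0.0, 4: 0.5, 5: 0.0}
```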

Label Preparation

For each patient, survival labels consist of:

Visit sequence — medical codes at each visit
Event time — visit index when event occurred (or last observed visit if censored)
Event indicator — 1 if event occurred, 0 if censored

Example: 30-Day Readmission

Patient A (event occurred):

{
    'patient_id': 'A001',
    'visits': [
        {'date': '2020-01-01', 'codes': [250, 401, 500], 'age': 65},  # Visit 0
        {'date': '2020-01-05', 'codes': [500, 501], 'age': 65},       # Visit 1
        {'date': '2020-01-20', 'codes': [250, 428], 'age': 65},       # Visit 2: readmission
    ],
    'event_time': 2,        # Event at visit 2
    'event_indicator': 1,   # Event occurred
}

Patient B (censored):

{
    'patient_id': 'B002',
    'visits': [
        {'date': '2020-02-01', 'codes': [493, 500], 'age': 45},  # Visit 0
        {'date': '2020-02-03', 'codes': [500, 501], 'age': 45},  # Visit 1
        {'date': '2020-03-15', 'codes': [500], 'age': 45},       # Visit 2: no event
    ],
    'event_time': 2,        # Last observed visit (index 2)
    'event_indicator': 0,   # Censored
}

Label Generation

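import torch

def get_event_time(visits):
    """Return (event_time, event_indicator) for one patient.

    event_time is the index of the first visit where the event occurs,
    or the index of the last observed visit if the patient is censored.
    """
    # is_readmission(visit, prior_visits) is assumed to be defined elsewhere.
    for t, visit in enumerate(visits):
        if is_readmission(visit, visits[:t]):
            return t, 1  # event observed at visit t
    return len(visits) - 1, 0  # censored at the last observed visit

# Batch labels
event_times      = torch.tensor([2, 3, 1, 5, 4])      # visit index
event_indicators = torch.tensor([1, 0, 1, 0, 1])      # 1=event, 0=censored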
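A per-visit hazard model typically needs these two vectors expanded into one binary target per visit, plus a mask marking which visits are observed. The sketch below shows one common way to do that with plain Python lists. It assumes 0-based visit indices as in the examples above; `expand_labels` and `max_len` are hypothetical names, and the actual training code may use a different layout.

```python
def expand_labels(event_times, event_indicators, max_len):
    """Expand (time, indicator) pairs into per-visit targets and masks.

    targets[i][s] is 1 only at the visit where patient i's event is observed.
    masks[i][s] is 1 for visits up to and including the event or censoring visit.
    """
    targets, masks = [], []
    for time, observed in zip(event_times, event_indicators):
        row_target = [0] * max_len
        row_mask = [1 if s <= time else 0 for s in range(max_len)]
        if observed == 1:
            row_target[time] = 1  # censored patients keep an all-zero target row
        targets.append(row_target)
        masks.append(row_mask)
    return targets, masks

event_times = [2, 3, 1, 5, 4]
event_indicators = [1, 0, 1, 0, 1]
targets, masks = expand_labels(event_times, event_indicators, max_len=6)
```

For the first patient (event at visit 2) this gives targets `[0, 0, 1, 0, 0, 0]` and mask `[1, 1, 1, 0, 0, 0]`; for the second (censored at visit 3) the targets are all zero and the mask covers visits 0 through 3.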

Synthetic Data Generation

For development and testing, DiscreteTimeSurvivalGenerator creates outcomes with realistic risk-time correlation.

Risk factors used:

Comorbidity burden (number of unique diagnosis codes)
Visit frequency (healthcare utilization)
Code diversity (condition complexity)

from ehrsequencing.synthetic.survival import DiscreteTimeSurvivalGenerator

generator = DiscreteTimeSurvivalGenerator(
    censoring_rate=0.3,   # 30% of patients censored
    time_scale=0.3,       # Controls average event timing
    seed=42
)

outcome = generator.generate(patient_sequences)
# outcome.event_times:      (N,) visit indices
# outcome.event_indicators: (N,) 1=event, 0=censored

Important: Synthetic outcomes should have negative risk-time correlation (higher-risk patients experience events earlier). Verify with:

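from scipy.stats import spearmanr

# risk_scores: per-patient risk values aligned with event_times (assumed to be computed beforehand)
r, p = spearmanr(risk_scores, event_times)
# Expect r ≈ -0.4 to -0.6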
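To illustrate why a negative risk-time relationship is the expected behaviour, the sketch below simulates event times from a per-visit hazard that grows with a risk score. It is not the implementation of DiscreteTimeSurvivalGenerator; `simulate_outcomes` and all of its parameters are hypothetical, and the only censoring modelled is reaching the final visit without an event.

```python
import random

def simulate_outcomes(risk_scores, max_visits=20, base_hazard=0.05,
                      risk_weight=0.4, seed=42):
    """Draw (event_times, event_indicators) from a risk-dependent constant hazard."""
    rng = random.Random(seed)
    event_times, event_indicators = [], []
    for risk in risk_scores:
        hazard = base_hazard + risk_weight * risk  # per-visit hazard, 0.05 to 0.45 for risk in [0, 1]
        time, observed = max_visits - 1, 0         # default: censored at the last visit
        for t in range(max_visits):
            if rng.random() < hazard:              # Bernoulli draw: does the event occur at visit t?
                time, observed = t, 1
                break
        event_times.append(time)
        event_indicators.append(observed)
    return event_times, event_indicators

score_rng = random.Random(0)
risk_scores = [score_rng.random() for _ in range(2000)]
event_times, event_indicators = simulate_outcomes(risk_scores)

# Compare average event time in the higher-risk and lower-risk halves
high = [t for r, t in zip(risk_scores, event_times) if r >= 0.5]
low = [t for r, t in zip(risk_scores, event_times) if r < 0.5]
mean_high = sum(high) / len(high)
mean_low = sum(low) / len(low)
```

Higher-risk patients should show a smaller mean event time than lower-risk patients, which is the same pattern the Spearman check above looks for.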

Evaluation Metrics

C-index (Concordance Index)

Measures discrimination — how well the model ranks patients by risk:

\[C = \frac{\text{concordant pairs}}{\text{comparable pairs}}\]

A pair \((i, j)\) is comparable if patient \(i\) had an event before patient \(j\) (and \(i\)'s event is observed). It is concordant if the model assigns higher risk to \(i\).

\(C = 1.0\): perfect discrimination
\(C = 0.5\): random (no discrimination)
\(C < 0.5\): worse than random

Practical note: Use a fixed-horizon C-index to avoid length bias in variable-length sequences.
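The pair-counting definition above can be written directly as a double loop. This is a minimal, quadratic-time sketch for illustration; `concordance_index` is a hypothetical name, ties in predicted risk are counted as half-concordant (a common convention that the definition above does not specify), and pairs with equal times are treated as not comparable.

```python
def concordance_index(event_times, event_indicators, risk_scores):
    """Fraction of comparable pairs in which the earlier-event patient has higher risk."""
    concordant, comparable = 0.0, 0
    n = len(event_times)
    for i in range(n):
        if event_indicators[i] != 1:
            continue  # i must have an observed event to anchor a comparable pair
        for j in range(n):
            if event_times[i] < event_times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tie in predicted risk
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

event_times = [2, 3, 1, 5, 4]
event_indicators = [1, 0, 1, 0, 1]

perfect = concordance_index(event_times, event_indicators, [0.6, 0.4, 0.9, 0.1, 0.3])
one_error = concordance_index(event_times, event_indicators, [0.6, 0.7, 0.9, 0.1, 0.3])
print(perfect, one_error)  # 1.0 0.875
```

In the second call, the patient censored at visit 3 is ranked above the patient with an event at visit 2, so one of the eight comparable pairs is discordant.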

Brier Score

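Measures the squared error between predicted survival probabilities and observed outcomes at a horizon \(t\). It is used here as a calibration measure, although it also reflects discrimination:

\[\text{BS}(t) = \frac{1}{N} \sum_{i=1}^{N} \left( S_i(t) - \mathbb{1}[T_i > t] \right)^2\]

where \(S_i(t)\) is the predicted survival probability for patient \(i\). This form assumes each patient's status at \(t\) is known; patients censored before \(t\) require additional handling, such as inverse-probability-of-censoring weighting.

\(\text{BS} = 0\): perfect predictions
\(\text{BS} = 0.25\): uninformative model (predicts 0.5 everywhere)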
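The formula can be evaluated at a single horizon as follows. This sketch makes a simplifying assumption that is not in the text above: patients censored before the horizon have an unknown status and are simply dropped, which can bias the score when censoring is common. `brier_score` is a hypothetical name.

```python
def brier_score(surv_probs, event_times, event_indicators, horizon):
    """Mean squared error between predicted S_i(horizon) and the indicator 1[T_i > horizon]."""
    squared_errors = []
    for s, time, observed in zip(surv_probs, event_times, event_indicators):
        if observed == 1:
            survived = 1.0 if time > horizon else 0.0  # event time is known exactly
        elif time >= horizon:
            survived = 1.0   # censored at or after the horizon, so T > horizon
        else:
            continue         # censored before the horizon: status unknown, skipped
        squared_errors.append((s - survived) ** 2)
    return sum(squared_errors) / len(squared_errors)

event_times = [2, 3, 1, 5, 4]
event_indicators = [1, 0, 1, 0, 1]

predicted = [0.2, 0.8, 0.1, 0.9, 0.6]   # predicted S_i(3) for each patient
score = brier_score(predicted, event_times, event_indicators, horizon=3)
baseline = brier_score([0.5] * 5, event_times, event_indicators, horizon=3)
```

Here `score` is approximately 0.052, and `baseline` is 0.25, matching the uninformative reference value.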

Relationship Between Metrics

Metric	Measures	Optimized by
C-index	Discrimination (ranking)	Ranking loss
Brier Score	Calibration (probability accuracy)	NLL loss

A model can be well-calibrated but poorly discriminative, or vice versa. See 06_loss_functions.md for how to optimize for each.

Next: 02_ehr_to_tokens.md — how patient visit sequences become BEHRT-ready tensors.