
Q: Do you think segment ID and the notion of segments have their potential use cases for this BEHRT based discrete time survival prediction?

This is a very good question. Not “should I use this API field,” but: does the concept of segments actually encode something meaningful for EHR survival?

Let’s slow down and separate three things:

  1. What “segment IDs” mean in vanilla BERT
  2. What BEHRT already encodes
  3. Whether an additional segment notion adds signal or just redundancy

1. What segment IDs mean in BERT

In original BERT, each token embedding is:

[ \mathbf{e}_t = \mathbf{e}^{token}_{c_t} + \mathbf{e}^{position}_t + \mathbf{e}^{segment}_{s_t} ]

where:

[ s_t \in \{0, 1\} ]

Segment IDs distinguish sentence A from sentence B in next-sentence prediction tasks.

It is a binary “this token belongs to block A vs block B.”

There is no deeper time meaning.

It’s just a learned bias vector added to all tokens in one group.
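Concretely, a segment embedding is nothing more than a per-segment lookup vector added to every token in the block. A minimal pure-Python sketch (toy dimensions and hand-picked values standing in for trained weights, purely for illustration):

```python
d_model = 4

# Toy "learned" tables; real values come from training.
token_emb = {"[CLS]": [0.1] * d_model, "hello": [0.2] * d_model}
segment_emb = {0: [0.01] * d_model, 1: [-0.01] * d_model}

def embed(token, position, segment):
    # Position embedding sketched as a simple per-position offset.
    pos_vec = [0.001 * position] * d_model
    return [t + p + s for t, p, s in
            zip(token_emb[token], pos_vec, segment_emb[segment])]

# The same token at the same position in segment 0 vs segment 1 differs
# only by a constant offset: segment_emb[1] - segment_emb[0].
a = embed("hello", 3, 0)
b = embed("hello", 3, 1)
diff = [y - x for x, y in zip(a, b)]
```

The `diff` vector is constant across all tokens in the block, which is exactly the "learned bias" claim above.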


2. What BEHRT already does

Your BEHRT embedding already includes:

[ \mathbf{e}_t = \mathbf{e}^{code}_{c_t} + \mathbf{e}^{age}_{a_t} + \mathbf{e}^{visit}_{v_t} + \mathbf{e}^{position}_t ]

So you already encode:

  • which visit a token belongs to
  • temporal progression via position
  • biological progression via age

So “visit structure” is already explicitly encoded.

That is important.
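To make the redundancy concrete, here is a sketch of BEHRT-style input construction (hypothetical codes): the per-token visit ID already carries exactly the block membership a segment ID would duplicate.

```python
# Hypothetical patient: three visits, each a list of clinical codes.
visits = [["DX_I10", "RX_METF"], ["DX_E11"], ["DX_I50", "RX_FUR", "LAB_BNP"]]

tokens, visit_ids, positions = [], [], []
for v_idx, visit in enumerate(visits):
    for code in visit:
        tokens.append(code)
        visit_ids.append(v_idx)           # which visit the token belongs to
        positions.append(len(positions))  # global temporal order

# visit_ids = [0, 0, 1, 2, 2, 2] — any two-block segmentation along visit
# boundaries is already a deterministic function of this sequence.
```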


3. So… does segment ID add anything?

Only if it encodes something not already represented.

Segment embeddings are useful if you want to distinguish:

[ \text{Block A tokens} \neq \text{Block B tokens} ]

in a way that position and visit IDs cannot capture.

Now we examine potential use cases.


Case 1: Pre-event vs post-event segmentation

For discrete survival modeling, you only use visits up to event time.

There is no post-event region in training.

So segments for “before vs after event” are meaningless.

Verdict: no value.


Case 2: Baseline period vs follow-up period

Suppose you define:

  • Segment 0 = baseline observation window
  • Segment 1 = risk prediction window

Then segment embedding lets the model know:

“Everything before index K is context, everything after is prediction horizon.”

But in your setup, the model is predicting hazard per visit sequentially.

You’re not splitting into two logical sequences.

Verdict: maybe useful, but only if you restructure the task.


Case 3: Treatment exposure regimes

Now this gets interesting.

Suppose:

[ s_t = \begin{cases} 0 & \text{before treatment initiation} \\ 1 & \text{after treatment initiation} \end{cases} ]

Now segment embedding encodes:

“biological regime shift.”

That could help the transformer separate two dynamics.

However:

You already have treatment tokens in your code vocabulary.

So treatment exposure can already be learned as part of sequence context.

Segment embedding would only provide a coarse binary marker.

Verdict: possibly useful if treatment initiation marks a structural shift.
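If you did want this coarse regime marker, it is cheap to derive from the sequence itself. A sketch (hypothetical codes; `RX_` as an assumed treatment-token prefix):

```python
tokens = ["DX_E11", "LAB_A1C", "RX_METF", "DX_E11", "LAB_A1C"]

def regime_segments(tokens, treatment_prefix="RX_"):
    """Segment 0 before the first treatment token, 1 from it onward."""
    seg, started = [], False
    for tok in tokens:
        if tok.startswith(treatment_prefix):
            started = True
        seg.append(1 if started else 0)
    return seg

segments = regime_segments(tokens)
```

Note that this is exactly the information the model could also recover from the treatment token itself plus position, which is why the gain is marginal unless initiation truly marks a structural shift.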


Case 4: Multimodal fusion

If you later combine:

  • structured EHR codes
  • free text embeddings
  • imaging tokens
  • genomic tokens

Segments become powerful.

You could define:

[ s_t = \begin{cases} 0 & \text{structured codes} \\ 1 & \text{clinical notes} \\ 2 & \text{omics tokens} \end{cases} ]

Now segment embedding acts like a modality identifier.

That is clean and elegant.

Verdict: strong use case.
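A sketch of the modality-identifier idea (hypothetical tags and token names), where the segment slot tells the transformer which modality produced each token:

```python
# Assumed modality-to-segment mapping for a future multimodal EHR-seq.
MODALITY_SEGMENT = {"code": 0, "note": 1, "omics": 2}

# Each entry: (token, modality it came from).
stream = [
    ("DX_I50", "code"),
    ("note_chunk_17", "note"),
    ("SNP_rs429358", "omics"),
]

segment_ids = [MODALITY_SEGMENT[mod] for _, mod in stream]
```

Unlike the pre/post-treatment case, this information is not recoverable from token content alone once embeddings are projected into a shared space, which is why the modality use case is the strong one.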


Case 5: Competing risk modeling

Suppose you model multiple hazard types:

  • death
  • ICU admission
  • readmission

You could mark segments corresponding to different event heads or prediction horizons.

But this is better handled via multi-head output rather than segment embedding.

Verdict: not ideal.


What Does It Mean Mechanistically?

Segment embedding is just:

[ \mathbf{e}^{segment}_{s_t} ]

a learned vector added to token embedding.

It shifts all tokens in that segment in embedding space.

So it creates a linear bias.

It does NOT:

  • alter attention structure directly
  • create time gating
  • impose ordering constraints

It’s simply a feature tag.


Philosophical Question: Do You Need It?

Ask yourself:

Does the sequence contain qualitatively distinct blocks that the model cannot infer from content + visit ID?

If yes → segment embeddings help.

If no → redundant complexity.


For Your Current Discrete-Time Survival Model

Right now:

  • You flatten tokens across visits
  • Visit IDs already encode grouping
  • Hazard is predicted at visit level

Segment IDs would likely be redundant.

Worse: they might encourage the model to rely on a shortcut signal rather than learning temporal dynamics.

My honest assessment:

For your current architecture, segment IDs do not meaningfully enhance discrete-time survival modeling.


When I Would Add Them

If you evolve EHR-seq into:

  1. Causal modeling where regime changes matter
  2. Pre/post intervention modeling
  3. Multimodal fusion
  4. Counterfactual generation tasks
  5. Contrastive sequence blocks

Then segments become structurally meaningful.


A Deeper Insight

In survival modeling, what truly matters is not segmentation.

It is risk set structure.

The model must implicitly learn:

[ P(T = t \mid T \ge t, \text{history}) ]

Segment embeddings do not encode risk-set information.

Masks and hazard construction do.

That’s where the mathematical structure lives.
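To ground this: once the model emits a per-visit hazard ( h_t = P(T = t \mid T \ge t, \text{history}) ), survival and event probabilities follow from products over the risk set, with no segment structure anywhere. A sketch with hypothetical hazard values:

```python
def survival_curve(hazards):
    """S(t) = prod_{k <= t} (1 - h_k) for discrete-time hazards h_k."""
    surv, s = [], 1.0
    for h in hazards:
        s *= (1.0 - h)   # probability of surviving past visit t
        surv.append(s)
    return surv

hazards = [0.1, 0.2, 0.3]       # hypothetical per-visit hazards
surv = survival_curve(hazards)
# P(event at visit t) = h_t * S(t-1)
event_prob = [h * (surv[i - 1] if i else 1.0) for i, h in enumerate(hazards)]
```

Everything survival-specific lives in this multiplicative risk-set structure, which is enforced by masks and the hazard loss, not by input embeddings.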


If you want, we can now explore something much more interesting:

Instead of segment embeddings, should BEHRT use:

  • visit-level positional encoding
  • time-delta encoding
  • continuous time embeddings
  • learned risk-set embeddings

Those will affect survival modeling far more profoundly than segments ever could.
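As a taste of the time-delta option: instead of a raw position index, each visit gets the gap since the previous visit, bucketed so a per-bucket embedding can be learned. A sketch with assumed bucket boundaries:

```python
import bisect

BUCKET_EDGES = [1, 7, 30, 90, 365]   # days; assumed boundaries, not canonical

def delta_bucket(days_since_prev):
    """Map a day gap to a bucket index for an embedding lookup."""
    return bisect.bisect_right(BUCKET_EDGES, days_since_prev)

visit_days = [0, 3, 40, 400]         # hypothetical visit dates (day offsets)
deltas = [0] + [b - a for a, b in zip(visit_days, visit_days[1:])]
buckets = [delta_bucket(d) for d in deltas]
```

A 3-day gap and a 360-day gap land in different buckets, so the model sees the irregular spacing that a plain position index throws away.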

And that’s where the real modeling decisions live.