
Within-Visit Structure and Multi-Level Embeddings

Date: January 20, 2026
Focus: How to structure codes within a visit when timestamps are unavailable


The Problem

Challenge: A visit contains multiple codes (ICD, LOINC, RxNorm, etc.) that may not have individual timestamps.

Visit 1 (2023-01-15):
  - E11.9 (Type 2 diabetes)
  - 4548-4 (HbA1c)
  - 38341003 (Hypertension)
  - RxNorm:860975 (Metformin)

# Question: What order should these codes be in?
# They're all "from the same visit" but conceptually different

Key Questions:
1. How do we impose structure within each visit?
2. Does order matter if codes lack individual timestamps?
3. What does the LSTM learn at each level?


Part 1: Within-Visit Structure Approaches

Approach 1: Semantic Grouping (Canonical Ordering)

Idea: Group codes by their semantic type, impose a canonical ordering.

# Canonical ordering within a visit
VISIT_CODE_ORDER = [
    'diagnosis',      # ICD codes
    'procedure',      # CPT, SNOMED procedures
    'lab',           # LOINC codes
    'medication',    # RxNorm
    'vital',         # Vital signs
]

# Example visit after ordering
Visit 1:
  Diagnosis:  E11.9, 38341003
  Lab:        4548-4
  Medication: RxNorm:860975

Advantages:
- ✅ Consistent structure across all visits
- ✅ Reflects clinical workflow (diagnose → test → treat)
- ✅ Enables the model to learn semantic patterns
- ✅ No arbitrary ordering

Implementation:

from typing import List, Optional
from dataclasses import dataclass

@dataclass
class CodeWithType:
    code: str
    code_type: str  # 'diagnosis', 'lab', 'medication', etc.
    value: Optional[float] = None  # For lab values

class VisitStructurer:
    """Structure codes within a visit by semantic type."""

    # Canonical ordering
    TYPE_ORDER = {
        'diagnosis': 0,
        'procedure': 1,
        'lab': 2,
        'medication': 3,
        'vital': 4,
        'other': 5
    }

    def structure_visit(
        self, 
        codes: List[CodeWithType]
    ) -> List[CodeWithType]:
        """
        Order codes within a visit by semantic type.

        Within each type, codes are sorted alphabetically for consistency.
        """
        # Sort by type, then alphabetically within type
        structured = sorted(
            codes,
            key=lambda c: (self.TYPE_ORDER.get(c.code_type, 999), c.code)
        )
        return structured

# Example usage
visit_codes = [
    CodeWithType('RxNorm:860975', 'medication'),
    CodeWithType('E11.9', 'diagnosis'),
    CodeWithType('4548-4', 'lab', value=7.2),
    CodeWithType('38341003', 'diagnosis'),
]

structurer = VisitStructurer()
ordered_codes = structurer.structure_visit(visit_codes)

# Result:
# [38341003, E11.9, 4548-4, RxNorm:860975]
# diagnosis → diagnosis → lab → medication
# (within the diagnosis group, '38341003' sorts before 'E11.9' alphabetically)
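The same canonical ordering can be checked without the dataclass, using a bare sorted call with the type-then-alphabetical key (a standalone sketch of the logic above):

```python
# Standalone check of the type-then-alphabetical sort key
TYPE_ORDER = {'diagnosis': 0, 'procedure': 1, 'lab': 2, 'medication': 3}

codes = [('RxNorm:860975', 'medication'),
         ('E11.9', 'diagnosis'),
         ('4548-4', 'lab'),
         ('38341003', 'diagnosis')]

# Unknown types sort last via the 999 fallback
ordered = sorted(codes, key=lambda c: (TYPE_ORDER.get(c[1], 999), c[0]))

print([c[0] for c in ordered])
# → ['38341003', 'E11.9', '4548-4', 'RxNorm:860975']
```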

Approach 2: Set-Based Representation (Order-Invariant)

Idea: Treat visit as an unordered set, use pooling to aggregate.

import torch
import torch.nn as nn

class SetBasedVisitEncoder(nn.Module):
    """
    Encode visit as a set of codes (order-invariant).

    Uses pooling (mean, max, or attention) to aggregate.
    """

    def __init__(self, vocab_size: int, code_embed_dim: int, visit_embed_dim: int):
        super().__init__()
        self.code_embeddings = nn.Embedding(vocab_size, code_embed_dim)
        self.projection = nn.Linear(code_embed_dim, visit_embed_dim)

    def forward(self, visit_codes, visit_mask):
        """
        Args:
            visit_codes: [batch, max_codes_per_visit]
            visit_mask: [batch, max_codes_per_visit]

        Returns:
            visit_embedding: [batch, visit_embed_dim]
        """
        # Embed codes
        code_embeds = self.code_embeddings(visit_codes)  # [batch, codes, dim]

        # Zero out padding positions
        code_embeds = code_embeds * visit_mask.unsqueeze(-1).float()

        # Aggregate (masked mean pooling - order invariant);
        # clamp guards against division by zero for empty visits
        denom = visit_mask.sum(dim=1, keepdim=True).clamp(min=1)
        visit_repr = code_embeds.sum(dim=1) / denom

        # Project to visit space
        return self.projection(visit_repr)

Advantages:
- ✅ No ordering assumptions
- ✅ Mathematically clean (permutation invariant)
- ✅ Simple implementation

Disadvantages:
- ❌ Loses potential semantic ordering information
- ❌ Treats all codes equally (no clinical workflow)
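The permutation-invariance claim is easy to verify empirically: under a correspondingly permuted mask, masked mean pooling produces the identical visit vector for any ordering of the codes. A minimal standalone check (PyTorch assumed; all sizes are illustrative):

```python
import torch

torch.manual_seed(0)

# Fake visit: 3 real codes + 1 padding slot, 8-dim embeddings
embeds = torch.randn(1, 4, 8)
mask = torch.tensor([[1., 1., 1., 0.]])

def masked_mean(e, m):
    # Zero out padding, then average over real codes only
    e = e * m.unsqueeze(-1)
    return e.sum(dim=1) / m.sum(dim=1, keepdim=True).clamp(min=1)

# Permute the codes; the pooled representation is unchanged
perm = torch.tensor([2, 0, 1, 3])
v1 = masked_mean(embeds, mask)
v2 = masked_mean(embeds[:, perm, :], mask[:, perm])

assert torch.allclose(v1, v2, atol=1e-6)
```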


Approach 3: Attention-Based Aggregation (Best of Both Worlds)

Idea: Use self-attention within each visit to learn importance weights.

class AttentionVisitEncoder(nn.Module):
    """
    Use self-attention to aggregate codes within a visit.

    Learns which codes are most important for the visit representation.
    """

    def __init__(
        self,
        vocab_size: int,
        code_embed_dim: int,
        visit_embed_dim: int,
        num_heads: int = 4
    ):
        super().__init__()
        self.code_embeddings = nn.Embedding(vocab_size, code_embed_dim)

        # Self-attention within visit
        self.self_attention = nn.MultiheadAttention(
            embed_dim=code_embed_dim,
            num_heads=num_heads,
            batch_first=True
        )

        # Project to visit embedding
        self.projection = nn.Linear(code_embed_dim, visit_embed_dim)

    def forward(self, visit_codes, visit_mask):
        """
        Args:
            visit_codes: [batch, max_codes_per_visit]
            visit_mask: [batch, max_codes_per_visit]

        Returns:
            visit_embedding: [batch, visit_embed_dim]
            attention_weights: [batch, codes, codes] (averaged over heads by default)
        """
        # Embed codes
        code_embeds = self.code_embeddings(visit_codes)  # [batch, codes, dim]

        # Self-attention (codes attend to each other)
        attn_output, attn_weights = self.self_attention(
            code_embeds, code_embeds, code_embeds,
            key_padding_mask=~visit_mask.bool()
        )

        # Aggregate (mean over codes; padded positions are excluded as keys
        # but still produce query outputs, so a masked mean is stricter)
        visit_repr = attn_output.mean(dim=1)  # [batch, code_embed_dim]

        # Project to visit space
        visit_embedding = self.projection(visit_repr)

        return visit_embedding, attn_weights

Advantages:
- ✅ Learns importance of each code
- ✅ Order-invariant but captures relationships
- ✅ Interpretable (attention weights show which codes matter)
- ✅ Flexible (adapts to different visit types)

Disadvantages:
- ❌ More complex
- ❌ Requires more computation
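The core of this approach can be exercised standalone with nn.MultiheadAttention and random inputs, to confirm the shapes and the padding behavior (a sketch; dimensions and batch contents are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_dim, num_heads = 32, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Batch of 2 visits, up to 5 codes each; last 2 slots of visit 2 are padding
code_embeds = torch.randn(2, 5, embed_dim)
visit_mask = torch.tensor([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])

# Codes attend to each other; padding is excluded as keys via key_padding_mask
attn_out, attn_weights = attn(
    code_embeds, code_embeds, code_embeds,
    key_padding_mask=~visit_mask.bool()
)

visit_repr = attn_out.mean(dim=1)  # [2, 32]
# attn_weights: [2, 5, 5] (averaged over heads by default)
```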


Approach 4: Hierarchical Type Embeddings

Idea: Add type embeddings to indicate code category.

class HierarchicalVisitEncoder(nn.Module):
    """
    Add type embeddings to distinguish code categories.

    Similar to position embeddings in Transformers.
    """

    def __init__(self, vocab_size: int, code_embed_dim: int, visit_embed_dim: int):
        super().__init__()

        # Code embeddings
        self.code_embeddings = nn.Embedding(vocab_size, code_embed_dim)

        # Type embeddings (diagnosis, lab, medication, etc.)
        self.type_embeddings = nn.Embedding(
            num_embeddings=6,  # diagnosis, procedure, lab, med, vital, other
            embedding_dim=code_embed_dim
        )

        # LSTM over codes
        self.code_lstm = nn.LSTM(
            input_size=code_embed_dim,
            hidden_size=code_embed_dim,
            batch_first=True
        )

        self.projection = nn.Linear(code_embed_dim, visit_embed_dim)

    def forward(self, visit_codes, code_types, visit_mask):
        """
        Args:
            visit_codes: [batch, max_codes_per_visit]
            code_types: [batch, max_codes_per_visit] (0=diagnosis, 1=procedure, etc.)
            visit_mask: [batch, max_codes_per_visit]

        Returns:
            visit_embedding: [batch, visit_embed_dim]
        """
        # Embed codes and types
        code_embeds = self.code_embeddings(visit_codes)
        type_embeds = self.type_embeddings(code_types)

        # Combine (additive, like position embeddings)
        combined_embeds = code_embeds + type_embeds

        # LSTM over codes (now with type information)
        lstm_out, (hidden, _) = self.code_lstm(combined_embeds)

        # Use final hidden state
        visit_repr = hidden[-1]

        return self.projection(visit_repr)

Advantages:
- ✅ Explicitly encodes code type
- ✅ Works with sequential models (LSTM)
- ✅ Learns type-specific patterns
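The additive combination mirrors position embeddings in Transformers: each code vector is shifted by a learned offset for its category, so two codes of the same type share the same offset. A minimal standalone sketch (PyTorch assumed; the vocabulary size, dimensions, and IDs are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

code_emb = nn.Embedding(100, 16, padding_idx=0)
type_emb = nn.Embedding(6, 16)  # diagnosis, procedure, lab, med, vital, other

# One visit: [diabetes dx, HbA1c lab, metformin rx] as integer IDs
codes = torch.tensor([[42, 17, 88]])
types = torch.tensor([[0, 2, 3]])  # diagnosis=0, lab=2, medication=3

# Additive combination, exactly like position embeddings
combined = code_emb(codes) + type_emb(types)  # [1, 3, 16]
```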


Part 2: What Does LSTM Learn at Each Level?

Multi-Level Embedding Learning

The LSTM architecture learns three levels of embeddings:

Level 1: Code Embeddings (learned)
Level 2: Visit Embeddings (learned)
Level 3: Patient Embeddings (learned)

Let me clarify exactly what is learned at each level:


Level 1: Code Embeddings

What: Individual medical code representations (E11.9, 4548-4, etc.)

Learned by: nn.Embedding layer (or pre-trained from CEHR-BERT)

self.code_embeddings = nn.Embedding(
    num_embeddings=vocab_size,  # e.g., 10,000 codes
    embedding_dim=code_embed_dim  # e.g., 128
)

# Input: code ID (as a tensor, not a bare int)
code_id = torch.tensor(42)  # e.g., E11.9 mapped to ID 42

# Output: code embedding (vector)
code_embedding = self.code_embeddings(code_id)  # [128]

What is learned:
- Semantic meaning of each code
- Relationships between codes (e.g., diabetes codes cluster together)
- Clinical context (e.g., HbA1c associated with diabetes)

Example:

import torch.nn.functional as F

# After training, similar codes have similar embeddings
# (index into the weight matrix to get one row)
embedding_E11_9 = code_embeddings.weight[42]    # Type 2 diabetes (E11.9)
embedding_E10_9 = code_embeddings.weight[43]    # Type 1 diabetes (E10.9)
F.cosine_similarity(embedding_E11_9, embedding_E10_9, dim=0)  # High similarity

embedding_J45_9 = code_embeddings.weight[100]   # Asthma (J45.9)
F.cosine_similarity(embedding_E11_9, embedding_J45_9, dim=0)  # Low similarity

Can use pre-trained:

# Option 1: Learn from scratch
self.code_embeddings = nn.Embedding(vocab_size, code_embed_dim)

# Option 2: Use pre-trained (CEHR-BERT)
pretrained_embeds = load_cehrbert_embeddings()
self.code_embeddings = nn.Embedding.from_pretrained(pretrained_embeds, freeze=False)


Level 2: Visit Embeddings

What: Representation of an entire visit (aggregation of codes)

Learned by: LSTM or pooling over codes within a visit

# Within-visit LSTM
self.visit_lstm = nn.LSTM(
    input_size=code_embed_dim,  # 128
    hidden_size=code_embed_dim,  # 128
    num_layers=1,
    batch_first=True
)

# Input: code embeddings for one visit, stacked into a tensor
visit_codes = torch.stack([code_emb_1, code_emb_2, code_emb_3])  # [num_codes, 128]

# Output: visit embedding (the LSTM expects a batch dim)
lstm_out, (hidden, _) = self.visit_lstm(visit_codes.unsqueeze(0))
visit_embedding = hidden[-1].squeeze(0)  # [128]

What is learned:
- How to combine multiple codes into a visit representation
- Importance of code order (if using LSTM) or relationships (if using attention)
- Visit-level patterns (e.g., "diabetes visit" vs "routine checkup")

Example:

# Visit 1: Diabetes-related
visit_1_codes = ['E11.9', '4548-4', 'RxNorm:860975']  # Diabetes, HbA1c, Metformin
visit_1_embedding = encode_visit(visit_1_codes)  # [128]

# Visit 2: Hypertension-related
visit_2_codes = ['I10', '8480-6', 'RxNorm:197361']  # Hypertension, BP, Lisinopril
visit_2_embedding = encode_visit(visit_2_codes)  # [128]

# Different visit types have different embeddings
cosine_similarity(visit_1_embedding, visit_2_embedding)  # Moderate similarity

Projection to visit space:

# Often project to a different dimension
self.visit_projection = nn.Linear(code_embed_dim, visit_embed_dim)
visit_embedding = self.visit_projection(visit_repr)  # [128] → [256]


Level 3: Patient Embeddings

What: Representation of entire patient trajectory (sequence of visits)

Learned by: LSTM over visit sequence

# Visit sequence LSTM
self.sequence_lstm = nn.LSTM(
    input_size=visit_embed_dim + 2,  # 256 + 2 time features
    hidden_size=hidden_dim,  # 512
    num_layers=2,
    batch_first=True
)

# Input: sequence of visit embeddings (with the 2 time features
# concatenated, matching input_size above)
patient_visits = [visit_emb_1, visit_emb_2, ..., visit_emb_N]  # each [256 + 2]

# Output: patient embedding (stack and add a batch dim for the LSTM)
sequence_out, (final_hidden, _) = self.sequence_lstm(
    torch.stack(patient_visits).unsqueeze(0)
)
patient_embedding = final_hidden[-1]  # [512]

What is learned:
- Temporal patterns across visits
- Disease progression trajectories
- Long-term dependencies (e.g., chronic conditions)
- Patient-level risk factors

Example:

# Patient A: Stable diabetes
patient_A_visits = [diabetes_visit_1, diabetes_visit_2, diabetes_visit_3]
patient_A_embedding = encode_patient(patient_A_visits)  # [512]

# Patient B: Progressing CKD
patient_B_visits = [ckd_stage2_visit, ckd_stage3_visit, ckd_stage4_visit]
patient_B_embedding = encode_patient(patient_B_visits)  # [512]

# Different trajectories have different embeddings
cosine_similarity(patient_A_embedding, patient_B_embedding)  # Low similarity


Part 3: Complete Architecture with All Three Levels

import torch
import torch.nn as nn

class ThreeLevelLSTMEncoder(nn.Module):
    """
    Three-level LSTM encoder for EHR sequences.

    Level 1: Code embeddings (learned or pre-trained)
    Level 2: Visit embeddings (learned via LSTM/attention)
    Level 3: Patient embeddings (learned via LSTM over visits)
    """

    def __init__(
        self,
        vocab_size: int,
        code_embed_dim: int = 128,
        visit_embed_dim: int = 256,
        patient_embed_dim: int = 512,
        num_layers: int = 2,
        use_attention: bool = False,
        use_type_embeddings: bool = True
    ):
        super().__init__()

        # ====================================================================
        # LEVEL 1: Code Embeddings
        # ====================================================================
        self.code_embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=code_embed_dim,
            padding_idx=0
        )

        # Optional: Type embeddings (diagnosis, lab, medication, etc.)
        if use_type_embeddings:
            self.type_embeddings = nn.Embedding(
                num_embeddings=6,  # 6 code types
                embedding_dim=code_embed_dim
            )
        else:
            self.type_embeddings = None

        # ====================================================================
        # LEVEL 2: Visit Embeddings
        # ====================================================================
        if use_attention:
            # Attention-based visit encoder
            self.visit_encoder = nn.MultiheadAttention(
                embed_dim=code_embed_dim,
                num_heads=4,
                batch_first=True
            )
        else:
            # LSTM-based visit encoder
            self.visit_encoder = nn.LSTM(
                input_size=code_embed_dim,
                hidden_size=code_embed_dim,
                num_layers=1,
                batch_first=True
            )

        # Project to visit embedding space
        self.visit_projection = nn.Linear(code_embed_dim, visit_embed_dim)

        # ====================================================================
        # LEVEL 3: Patient Embeddings
        # ====================================================================
        self.patient_encoder = nn.LSTM(
            input_size=visit_embed_dim + 2,  # +2 for time features
            hidden_size=patient_embed_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.1 if num_layers > 1 else 0
        )

        self.use_attention = use_attention

    def encode_codes(self, codes, code_types=None):
        """
        LEVEL 1: Encode individual codes.

        Args:
            codes: [batch, num_codes]
            code_types: [batch, num_codes] (optional)

        Returns:
            code_embeddings: [batch, num_codes, code_embed_dim]
        """
        code_embeds = self.code_embeddings(codes)

        if self.type_embeddings is not None and code_types is not None:
            type_embeds = self.type_embeddings(code_types)
            code_embeds = code_embeds + type_embeds

        return code_embeds

    def encode_visit(self, visit_codes, code_types=None, visit_mask=None):
        """
        LEVEL 2: Encode a visit (aggregate codes).

        Args:
            visit_codes: [batch, max_codes_per_visit]
            code_types: [batch, max_codes_per_visit]
            visit_mask: [batch, max_codes_per_visit]

        Returns:
            visit_embedding: [batch, visit_embed_dim]
        """
        # Get code embeddings
        code_embeds = self.encode_codes(visit_codes, code_types)

        if self.use_attention:
            # Attention-based aggregation
            attn_out, _ = self.visit_encoder(
                code_embeds, code_embeds, code_embeds,
                key_padding_mask=~visit_mask.bool() if visit_mask is not None else None
            )
            visit_repr = attn_out.mean(dim=1)
        else:
            # LSTM-based aggregation
            lstm_out, (hidden, _) = self.visit_encoder(code_embeds)
            visit_repr = hidden[-1]

        # Project to visit space
        return self.visit_projection(visit_repr)

    def encode_patient(self, patient_visits, time_features, visit_mask):
        """
        LEVEL 3: Encode patient trajectory (sequence of visits).

        Args:
            patient_visits: [batch, num_visits, max_codes_per_visit]
            time_features: [batch, num_visits, 2]
            visit_mask: [batch, num_visits, max_codes_per_visit]

        Returns:
            patient_embedding: [batch, patient_embed_dim]
            visit_embeddings: [batch, num_visits, visit_embed_dim]
        """
        batch_size, num_visits, max_codes = patient_visits.shape

        # Encode each visit
        visit_embeds = []
        for i in range(num_visits):
            visit_emb = self.encode_visit(
                patient_visits[:, i, :],
                visit_mask=visit_mask[:, i, :] if visit_mask is not None else None
            )
            visit_embeds.append(visit_emb)

        visit_embeds = torch.stack(visit_embeds, dim=1)  # [batch, visits, visit_dim]

        # Add time features
        visit_embeds_with_time = torch.cat([visit_embeds, time_features], dim=-1)

        # LSTM over visits
        sequence_out, (final_hidden, _) = self.patient_encoder(visit_embeds_with_time)

        patient_embedding = final_hidden[-1]  # [batch, patient_embed_dim]

        return patient_embedding, visit_embeds
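The full three-level flow can be exercised end to end with a tiny standalone sketch: embed codes (Level 1), pool each visit and project (Level 2), then run an LSTM over the visit sequence with time features appended (Level 3). PyTorch assumed; all sizes are illustrative, and mean pooling stands in for the visit encoder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab, code_dim, visit_dim, patient_dim = 50, 16, 24, 32
code_emb = nn.Embedding(vocab, code_dim, padding_idx=0)
visit_proj = nn.Linear(code_dim, visit_dim)
patient_lstm = nn.LSTM(visit_dim + 2, patient_dim, batch_first=True)

# 1 patient, 3 visits, up to 4 codes per visit (0 = padding)
codes = torch.randint(1, vocab, (1, 3, 4))
time_feats = torch.randn(1, 3, 2)  # e.g., days-since-last-visit features

# Level 1 -> 2: embed, mean-pool codes per visit, project to visit space
visit_embeds = visit_proj(code_emb(codes).mean(dim=2))  # [1, 3, 24]

# Level 2 -> 3: append time features, LSTM over the visit sequence
seq_in = torch.cat([visit_embeds, time_feats], dim=-1)  # [1, 3, 26]
_, (h, _) = patient_lstm(seq_in)
patient_embedding = h[-1]                               # [1, 32]
```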

Part 4: Summary and Recommendations

Recommended primary approach:

Semantic Grouping + Type Embeddings (Approach 1 + 4)

# 1. Order codes by semantic type
visit_codes = structure_by_type(raw_codes)  # diagnosis → lab → medication

# 2. Add type embeddings
code_embeds = code_embeddings(codes) + type_embeddings(types)

# 3. LSTM over ordered codes
visit_embedding = lstm(code_embeds)

Why:
- ✅ Consistent, interpretable structure
- ✅ Reflects clinical workflow
- ✅ Works well with LSTM
- ✅ Can still use pre-trained code embeddings

Alternative: Attention-based (Approach 3)
- Use when order-invariance is desired
- More flexible but more complex
- Better for Transformer models


What LSTM Learns: Summary Table

Level    What                Learned By                 Dimension          Example
Level 1  Code embeddings     nn.Embedding               [vocab_size, 128]  E11.9 → [0.1, -0.3, ...]
Level 2  Visit embeddings    LSTM/attention over codes  [256]              diabetes visit → [0.5, 0.2, ...]
Level 3  Patient embeddings  LSTM over visits           [512]              patient trajectory → [0.3, -0.1, ...]

Key Points:
1. All three levels are learned (or Level 1 can be pre-trained)
2. Each level captures different granularity
3. Gradients flow through all levels during training
4. Can freeze Level 1 if using pre-trained embeddings
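Freezing Level 1 amounts to turning off gradient updates for the embedding table; nn.Embedding.from_pretrained does this by default. A sketch (a random matrix stands in for real pre-trained CEHR-BERT weights):

```python
import torch
import torch.nn as nn

# Random matrix standing in for pre-trained code embeddings (e.g., CEHR-BERT)
pretrained = torch.randn(100, 16)

# freeze=True (the default) stops gradient updates to Level 1;
# pass freeze=False to fine-tune the embeddings end-to-end
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

assert not emb.weight.requires_grad
```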


Implementation Priority

Phase 1: Simple Baseline
- Semantic grouping (canonical order)
- LSTM over codes (Level 2)
- LSTM over visits (Level 3)
- Learn all embeddings from scratch

Phase 2: Enhanced
- Add type embeddings
- Use pre-trained code embeddings (CEHR-BERT)
- Experiment with attention

Phase 3: Advanced
- Attention-based visit encoder
- Hierarchical attention (code → visit → patient)
- Multi-task learning


Next: Ready to implement the data pipeline with visit structuring?