# Failure Mode: Misunderstanding Transductive Learning in CF-Ensemble

**Category**: Algorithmic Misuse
**Severity**: Critical (causes complete failure)
**Date Identified**: 2026-01-25

## TL;DR

**Problem**: Treating CF-Ensemble like a traditional classifier with separate train/test splits breaks the transductive learning assumption.

**Solution**: Train on ALL data (train + test) with test labels masked, then use the learned latent factors for prediction (not cold-start computation).
## The Problem: Wrong Train-Test Split

### What We Did (WRONG)

```python
# Traditional ML approach - DOES NOT WORK for CF-Ensemble
R_train = R[:, labeled_idx]
y_train = labels[labeled_idx]

# Train on the labeled training data only
ensemble_data = EnsembleData(R_train, y_train)
trainer.fit(ensemble_data)

# Predict on a separate test set (cold start)
y_pred = trainer.predict(R_new=R_test)  # ❌ WRONG!
```
**Result**:

- PR-AUC: 0.02-0.11 (worse than random!)
- Performance worse than simple averaging
- No convergence after 100-200 iterations
### Why This Fails
CF-Ensemble is fundamentally transductive, not inductive:
- Base classifiers have ALREADY made predictions on all data (train + test)
- The probability matrix R includes test instances
- We want to learn how to aggregate these existing predictions
- Cold-start prediction throws away this information
This is like asking a recommender system to predict ratings for users it has never seen, when it has in fact seen their rating behavior; we just masked some of the ratings!
## The Recommender System Analogy

### Standard Recommender System
| Concept | Recommender System | CF-Ensemble |
|---|---|---|
| "Users" | Actual users | Base classifiers |
| "Items" | Products/movies | Data instances |
| "Ratings" | User preferences (1-5 stars) | Predicted probabilities [0,1] |
| Goal | Predict missing ratings | Predict masked labels |
| Metric | RMSE on held-out ratings | PR-AUC on held-out labels |
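
To make the mapping concrete, here is a minimal sketch of how the probability matrix \(R\) is typically assembled (`base_classifiers` and `X_all` are hypothetical stand-ins for your fitted models and feature matrix, not library names):

```python
import numpy as np

# Hypothetical setup: m base classifiers ("users") x n instances ("items").
# Each row holds one classifier's positive-class probabilities for ALL
# instances, train and test alike - these are the "ratings" in [0, 1].
R = np.vstack([clf.predict_proba(X_all)[:, 1] for clf in base_classifiers])
# R.shape == (m, n); labels are known for only a subset of the n columns.
```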
### Key Insight from Recommender Systems

In a recommender system with matrix factorization:

**Scenario 1: Warm Start** (item seen during training)

```python
# Item i was in the training set (some users rated it).
# Use its LEARNED latent factor:
y_i = Y[:, i]  # already optimized during training
rating_prediction = X.T @ y_i
```
**Scenario 2: Cold Start** (new item, never seen)

```python
import numpy as np

# Item i_new is completely new: no learned factor exists.
# Its factor must be computed from scratch via a ridge solve,
#   y_i_new = (X X^T + lambda*I)^(-1) X r_i_new
# (consistent with R ≈ X^T Y; `lam` and the latent dim `k` assumed defined):
y_i_new = np.linalg.solve(X @ X.T + lam * np.eye(k), X @ r_i_new)
rating_prediction = X.T @ y_i_new
```
### CF-Ensemble is (Usually) Warm Start!

The critical realization:

- Test instances in CF-Ensemble are NOT new items
- Base classifiers have already predicted on them
- We have their probability vectors in R
- They were present during training (just with masked labels)
- This is warm start, not cold start!

Using cold-start prediction is like:

- A recommender system that has seen 1000 users rate a movie
- But then throws away all those learned patterns
- And recomputes the movie's factor from scratch for each prediction
- Obviously wasteful and suboptimal!
## The Correct Solution: Transductive Learning

### What We Should Do (CORRECT)

```python
import numpy as np

# Combine ALL data (train + test), mask the test labels
R_combined = np.hstack([R_train, R_test])
labels_combined = np.concatenate([
    y_train,
    np.full(len(y_test), np.nan)  # masked, not missing!
])

# Train on ALL data - learn factors for everything
ensemble_data = EnsembleData(R_combined, labels_combined)
trainer.fit(ensemble_data)

# Use LEARNED factors for test predictions (warm start)
all_predictions = trainer.predict()  # ✅ CORRECT
y_pred_test = all_predictions[len(y_train):]
```
**Result**:

- Actually uses the information available
- Test instances get optimized latent factors
- The aggregator learns from the full probability structure
### Mathematical Justification

During training, we optimize an objective of the form:

\[
\min_{X,\,Y}\;\sum_{i \in \mathcal{L} \cup \mathcal{U}} \big\lVert r_i - X^\top y_i \big\rVert^2
\;+\; \rho \sum_{i \in \mathcal{L}} \ell\big(t_i, \hat{t}_i\big)
\;+\; \lambda \big(\lVert X \rVert_F^2 + \lVert Y \rVert_F^2\big)
\]

where:

- \(\mathcal{L}\) = labeled instances (train)
- \(\mathcal{U}\) = unlabeled instances (test)
- \(t_i\), \(\hat{t}_i\) = ground-truth label and aggregated prediction for instance \(i\), compared by the supervised loss \(\ell\)
- \(\rho\), \(\lambda\) = supervision weight and regularization strength

Crucially:

- Both \(\mathcal{L}\) and \(\mathcal{U}\) contribute to the first term (reconstruction)
- Only \(\mathcal{L}\) contributes to the second term (supervision)

This means:

- Test instance factors \(Y_{\mathcal{U}}\) are learned from reconstruction plus similarity to train instances
- They benefit from the low-rank structure and the learned classifier factors
- Cold-start computation ignores this rich information!
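
To see why warm-start factors differ from a one-shot cold-start solve, consider the per-column update under alternating least squares (a sketch under the convention \(R \approx X^\top Y\) from above; the actual trainer alternates ALS with gradient steps on the supervised term). Holding \(X\) fixed, each unlabeled column \(i \in \mathcal{U}\) is updated as:

\[
y_i \leftarrow \big(X X^\top + \lambda I\big)^{-1} X\, r_i
\]

Formally this matches the cold-start fold-in, but with two crucial differences: here \(X\) itself was optimized with \(r_i\) present in the matrix, and \(y_i\) is re-refined on every sweep as \(X\) moves. A cold-start solve applies the formula once, against an \(X\) that never saw \(r_i\).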
## Performance Comparison

### On Imbalanced Data (10% Positive)
| Method | Approach | PR-AUC | Converged? |
|---|---|---|---|
| Simple Average | N/A | 0.285 | N/A |
| Stacking | Inductive | 0.522 | ✓ |
| CF-Ensemble (Wrong) | Cold-start inductive | 0.136 | ✗ |
| CF-Ensemble (Fixed) | Transductive | TBD | TBD |
The wrong approach is:

- ❌ Worse than simple averaging
- ❌ Near or below the random baseline (PR-AUC ≈ 0.10 for 10% positives)
- ❌ Never converges
## When to Use Cold-Start vs. Warm-Start

### Use Warm-Start (Transductive) When:

✅ Test instances are known at training time:

- You have their feature vectors
- Base models have made predictions on them
- You're aggregating existing predictions
- This is the standard CF-Ensemble setting

**Example**: Biomedical prediction: you have predictions from 10 algorithms on 1000 patients, and the goal is to aggregate them well, using the subset with known outcomes for training.
### Use Cold-Start (Inductive) When:

✅ Test instances arrive AFTER training:

- Truly new data that base models haven't seen
- Need to make predictions on the fly
- Can't retrain for each new instance

**Example**: Real-time system: a new patient arrives and needs an immediate prediction; base models produce predictions that must be aggregated instantly. Here you must fold the instance in via \(y_{\text{new}} = (X X^\top + \lambda I)^{-1} X\, r_{\text{new}}\), as in the sketch below.
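
A minimal numpy sketch of that fold-in, assuming the convention \(R \approx X^\top Y\) with learned classifier factors `X` of shape `(k, m)` (the function name and the `lam` default are illustrative, not part of the library API):

```python
import numpy as np

def cold_start_fold_in(X, r_new, lam=0.1):
    """Compute a latent factor for a brand-new instance via the ridge solve
    y_new = (X X^T + lam * I)^{-1} X r_new.

    X     : (k, m) learned classifier factors
    r_new : (m,)   base-model probabilities for the new instance
    """
    k = X.shape[0]
    y_new = np.linalg.solve(X @ X.T + lam * np.eye(k), X @ r_new)
    return y_new  # feed into the aggregation step, e.g. X.T @ y_new
```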
### CF-Ensemble is Mostly Transductive

In most practical scenarios:

- Batch prediction setting (not online)
- All base model predictions are available upfront
- Test instances can be present during training
- Use transductive learning (warm start)

Only use inductive (cold start) when:

- You truly cannot include test data in training
- Real-time constraints require immediate prediction
- Distribution shift makes transductive learning unreliable
## The Amazon Recommender System Perspective

### How would an Amazon ML scientist approach this?

**Standard recommender problem:**

> "Given a user-item rating matrix with missing entries, predict the missing ratings."

**Solution**: Matrix factorization learns user factors \(X\) and item factors \(Y\) such that \(R \approx X^\top Y\).
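
Concretely, each predicted "rating" in this factorization is the inner product of the two learned factors:

\[
\hat{r}_{ui} = x_u^\top y_i
\]

so in CF-Ensemble, a base classifier's reconstructed probability for an instance is simply the classifier factor dotted with the instance factor.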
For new predictions:

- Seen user, seen item: use learned factors (warm start)
- New user, seen items: compute the user factor from their ratings
- Seen user, new item: compute the item factor from its ratings
- New user, new item: use cold-start methods (content features, etc.)

**CF-Ensemble adds supervision:**

> "Not only predict ratings well, but ensure aggregated predictions match ground-truth labels."

This is like Amazon also caring that:

- High predicted ratings correlate with actual purchases
- Aggregated ratings predict customer satisfaction
- The tradeoff between reconstruction quality and predictive accuracy is managed explicitly

Amazon would:

1. Use transductive learning for batch settings (warm start)
2. Use content features or deep learning for cold start
3. Never throw away learned factors when they're available!
4. Optimize for both reconstruction AND the business metric (in our case, label accuracy)
## Implementation Notes

### Detecting the Issue

Red flags that you're doing it wrong:

- Performance worse than simple averaging (see the sanity check sketched below)
- No convergence after many iterations
- Test error ≫ train error (not a generalization gap, but a cold-start penalty)
- Predictions seem random or biased
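
A quick way to test that first red flag (a sketch; `R_all`, `eval_idx`, `y_eval`, and `ensemble_scores` are illustrative names for the full probability matrix, a labeled evaluation subset, and your trained model's scores on it):

```python
from sklearn.metrics import average_precision_score

# Baseline: simple averaging of base-model probabilities (rows = classifiers).
avg_scores = R_all[:, eval_idx].mean(axis=0)
baseline_prauc = average_precision_score(y_eval, avg_scores)

# The trained ensemble's scores on the same evaluation instances.
ensemble_prauc = average_precision_score(y_eval, ensemble_scores)

# If the learned aggregator cannot beat a plain mean, suspect a broken
# (cold-start) train/test setup before blaming hyperparameters.
if ensemble_prauc < baseline_prauc:
    print(f"WARNING: ensemble PR-AUC {ensemble_prauc:.3f} is below the "
          f"simple-average baseline {baseline_prauc:.3f}")
```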
**Code smells:**

```python
# BAD: Training on a subset of columns only
R_train = R[:, train_idx]
trainer.fit(EnsembleData(R_train, y_train))

# BAD: Using R_new for test predictions (forces cold start)
y_pred = trainer.predict(R_new=R_test)

# BAD: Separate train/test matrices
ensemble_data_train = EnsembleData(R_train, y_train)
ensemble_data_test = EnsembleData(R_test, None)  # Wrong!
```
### Correct Implementation

```python
import numpy as np

# GOOD: Single matrix with masked labels
n_train = len(y_train)
n_test = len(y_test)
R_all = np.hstack([R_train, R_test])
labels_all = np.concatenate([y_train, np.full(n_test, np.nan)])

# GOOD: Train on everything
ensemble_data = EnsembleData(R_all, labels_all)
trainer.fit(ensemble_data)

# GOOD: Use the learned factors
all_preds = trainer.predict()  # no R_new argument
y_pred_test = all_preds[n_train:]  # extract the test portion

# GOOD: Alternatively, for truly new data that arrives after training
if new_data_arrives:  # placeholder condition for an online setting
    y_pred_new = trainer.predict(R_new=R_new_data)  # cold start is OK here
```
## Lessons Learned

### 1. Not All ML is Inductive

- **Inductive**: Learn from train, generalize to unseen test
- **Transductive**: Have access to test inputs (not labels) during training
- CF-Ensemble is transductive by design

### 2. Base Model Predictions are Data

- In traditional ML: test features are inputs
- In CF-Ensemble: test predictions are inputs
- We already have them at training time!
- Use them!

### 3. Recommender System Intuition Helps

- Think: "How would Netflix predict ratings for a movie in their database?"
- Answer: use its learned latent factors (warm start)
- Not: recompute factors from scratch each time (cold start)
- CF-Ensemble follows the same principle

### 4. Read Your Own Documentation

Section 6 of the theory document (`docs/methods/cf_ensemble_optimization_objective_tutorial.md`) clearly states:

> "Crucially, both sets are used during training, but labels are only available for \(\mathcal{L}\). This is transductive or semi-supervised learning."

We just didn't implement it correctly! 🤦
## Related Failure Modes

See also:

- `optimization_instability.md` - Why alternating ALS + GD doesn't converge
- `confidence_weights.md` - Importance of label-aware confidence
- `hyperparameter_sensitivity.md` - Tuning `latent_dim`, `lambda_reg`, `rho`
## References

- Transductive learning:
  - Vapnik, V. (1998). *Statistical Learning Theory*. Chapter on transduction.
  - Zhou, D., et al. (2004). "Learning with Local and Global Consistency." NeurIPS.
- Matrix factorization for recommender systems:
  - Koren, Y., et al. (2009). "Matrix Factorization Techniques for Recommender Systems." IEEE Computer.
  - Hu, Y., et al. (2008). "Collaborative Filtering for Implicit Feedback Datasets." ICDM.
- Cold-start problem:
  - Schein, A., et al. (2002). "Methods and Metrics for Cold-Start Recommendations." SIGIR.
  - Sedhain, S., et al. (2014). "Social Collaborative Filtering for Cold-start Recommendations." RecSys.
**Key Takeaway**: CF-Ensemble is a transductive method; pretending it is inductive will fail catastrophically. Always train on all data with masked test labels!