# CF-Ensemble Failure Modes
This directory documents common failure modes, pitfalls, and how to avoid them when implementing and using CF-Ensemble.
## Purpose

CF-Ensemble is a sophisticated method combining collaborative filtering and ensemble learning. While powerful, it has several subtle failure modes that can cause:

- Complete performance breakdown (worse than simple averaging)
- Non-convergence of optimization
- Misleading results

These documents help you:

1. Recognize when something is wrong
2. Diagnose the root cause
3. Fix the issue with proven solutions
## Failure Modes

### 1. Transductive vs. Inductive Learning
Problem: Using traditional train/test split breaks CF-Ensemble
Symptom: Performance worse than simple averaging, worse than random
Cause: Treating test instances as "new" when they should be "seen"
Fix: Train on ALL data with masked test labels (transductive learning)
Critical: This is the #1 most common mistake. If your CF-Ensemble performs terribly, check this first!
Key insight from recommender systems:

- Test instances are like "movies in your database with some ratings hidden"
- NOT like "movies you've never heard of"
- Use warm-start (learned factors), not cold-start (recompute factors)
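The masking idea can be sketched in a few lines of NumPy: test labels are present in the label vector but hidden as `np.nan`, and any supervised loss is computed only over the observed entries. This is a minimal illustration; the variable names are not part of the library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 6 labeled (train) instances and 4 masked (test) instances.
y_train = rng.integers(0, 2, size=6).astype(float)
labels_all = np.concatenate([y_train, np.full(4, np.nan)])  # test labels hidden

# Predictions for ALL instances, as a transductive model would produce.
p_all = rng.random(10)

# The supervised loss only ever sees the labeled entries.
observed = ~np.isnan(labels_all)
eps = 1e-12
log_loss = -np.mean(
    labels_all[observed] * np.log(p_all[observed] + eps)
    + (1 - labels_all[observed]) * np.log(1 - p_all[observed] + eps)
)
```

The key property: the masked instances still participate in reconstruction (their columns of the prediction matrix are observed), but never leak labels into training.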
### 2. Optimization Instability
Problem: Alternating ALS + gradient descent doesn't converge
Symptom: Flat supervised loss, no improvement over iterations
Cause: ALS and gradient descent optimize different objectives that conflict
Fix: Use joint gradient descent via PyTorch/JAX
When this happens:

- Reconstruction loss decreases
- Supervised loss stays flat (~0.5)
- PR-AUC is stuck near random
- Training never converges, even after 200+ iterations
Solutions:
1. Recommended: Joint PyTorch optimization (CFEnsemblePyTorchTrainer)
2. Quick fix: Two-stage training (pure reconstruction → train aggregator)
3. Workaround: Damped alternating updates (slow aggregator learning)
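To make the recommended fix concrete, here is a minimal sketch of what "joint gradient descent" means: latent factors and the aggregator are updated by one optimizer on one combined loss. The factor shapes, the linear aggregator, and all names here are illustrative assumptions, not the internals of `CFEnsemblePyTorchTrainer`.

```python
import torch

torch.manual_seed(0)
m, n, k = 5, 40, 4                    # classifiers, instances, latent dim (toy sizes)
R = torch.rand(m, n)                  # base-model probability matrix
labels = torch.full((n,), float('nan'))
labels[:20] = (torch.rand(20) > 0.5).float()   # only train labels are observed
observed = ~torch.isnan(labels)

# Latent factors and a linear aggregator, all updated by ONE optimizer on
# ONE combined loss -- this is what "joint" means here.
U = torch.randn(m, k, requires_grad=True)
V = torch.randn(n, k, requires_grad=True)
w = torch.zeros(m, requires_grad=True)
rho = 0.5
opt = torch.optim.Adam([U, V, w], lr=0.05)

losses = []
for _ in range(200):
    opt.zero_grad()
    R_hat = U @ V.T                                   # reconstructed score matrix
    recon = ((R_hat - R) ** 2).mean()                 # reconstruction loss
    logits = w @ R_hat                                # aggregator over smoothed scores
    sup = torch.nn.functional.binary_cross_entropy_with_logits(
        logits[observed], labels[observed])           # supervised loss, labeled only
    loss = rho * recon + (1 - rho) * sup              # single joint objective
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because both terms share one gradient, the supervised loss cannot be silently undone by a separate reconstruction step, which is exactly the failure the alternating scheme exhibits.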
## Diagnostic Checklist
If CF-Ensemble isn't working, check these in order:
### 1. Data Split ✓
- Are you training on ALL data (train + test)?
- Are test labels masked with `np.nan`?
- Are you using transductive prediction (`predict()`, not `predict(R_new=...)`)?
If NO to any: See transductive_vs_inductive.md
### 2. Convergence ✓
- Does training converge within 100-200 iterations?
- Is supervised loss decreasing?
- Is performance improving over iterations?
If NO to any: See optimization_instability.md
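One way to make the "is supervised loss decreasing?" check mechanical is a small helper that flags a flat loss curve. The window and tolerance below are assumptions, not tuned values, and `loss_history` is simply whatever per-iteration losses your trainer records.

```python
def supervised_loss_is_flat(loss_history, window=20, rel_tol=0.01):
    """Return True if the loss barely moved over the last `window` iterations."""
    if len(loss_history) < window + 1:
        return False  # too early to tell
    start = loss_history[-window - 1]
    drop = start - loss_history[-1]
    return drop < rel_tol * abs(start)

# A curve stuck near 0.5 triggers the check; a steadily decreasing one does not.
flat = [0.5 - 1e-4 * i for i in range(100)]
decreasing = [0.5 * (0.97 ** i) for i in range(100)]
```

If this fires while the reconstruction loss keeps dropping, you are likely seeing the optimization-instability failure mode above.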
### 3. Performance ✓
- Is CF-Ensemble better than simple averaging?
- Is it competitive with stacking?
- Does it improve with more labeled data?
If NO, check:

- Hyperparameters (`latent_dim`, `lambda_reg`, `rho`)
- Confidence weights (label-aware vs. certainty-based)
- Data quality (are the base model predictions reasonable?)
### 4. Hyperparameters ✓

- Is `latent_dim` appropriate for your data? (10-50, or ~√m)
- Is `lambda_reg` not too strong? (try 0.001-0.1)
- Is `rho` in a reasonable range? (0.3-0.7)
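The rules of thumb above can be encoded as a small starting-point helper. The specific defaults are just the midpoints of this document's suggested ranges, not tuned values, and the function name is hypothetical.

```python
import math

def default_hyperparams(m):
    """Rule-of-thumb starting points (m = number of base classifiers)."""
    return {
        "latent_dim": min(50, max(10, round(math.sqrt(m)))),  # 10-50, or ~sqrt(m)
        "lambda_reg": 0.01,   # middle of the suggested 0.001-0.1 range
        "rho": 0.5,           # middle of the suggested 0.3-0.7 range
    }
```

Treat these as the first point of a small grid search, not as final settings.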
## Quick Reference: Symptoms → Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| PR-AUC < Simple Average | Wrong train/test split | Use transductive learning |
| PR-AUC ≈ Random | Wrong train/test split OR no convergence | Check data split AND convergence |
| Never converges | Optimization instability | Use PyTorch trainer |
| Supervised loss flat | Optimization instability | Use PyTorch trainer |
| Works on easy data, fails on hard | Hyperparameters | Tune latent_dim, lambda_reg |
| Predictions all similar | Over-regularization | Decrease lambda_reg |
| Overfitting to train | Under-regularization | Increase lambda_reg |
## Best Practices
### 1. Always Use Transductive Learning (Unless You Can't)

```python
# ✓ CORRECT: Transductive
R_all = np.hstack([R_train, R_test])
labels_all = np.concatenate([y_train, np.full(len(y_test), np.nan)])
trainer.fit(EnsembleData(R_all, labels_all))
y_pred = trainer.predict()[len(y_train):]  # Use learned factors

# ✗ WRONG: Inductive (unless truly necessary)
trainer.fit(EnsembleData(R_train, y_train))
y_pred = trainer.predict(R_new=R_test)  # Cold-start, loses information
```
### 2. Use PyTorch for Production

```python
# Recommended for production
from cfensemble.optimization import CFEnsemblePyTorchTrainer

trainer = CFEnsemblePyTorchTrainer(
    n_classifiers=m,
    latent_dim=20,
    rho=0.5,
    max_epochs=200,
    optimizer='adam',
    patience=20,
)
```

Why:

- Guaranteed convergence
- Modern optimizers (Adam, learning-rate scheduling)
- GPU acceleration
- Easier to extend
### 3. Start Simple, Then Improve

**Stage 1: Validate the approach**

```python
# Two-stage training (simple, fast)
trainer_recon = CFEnsembleTrainer(rho=1.0)  # Pure reconstruction
trainer_recon.fit(data)
# Then train the aggregator separately
```

**Stage 2: Optimize performance**

```python
# Joint PyTorch optimization (better results)
trainer = CFEnsemblePyTorchTrainer(...)
trainer.fit(data)
```
### 4. Always Check Baselines First

Before trusting CF-Ensemble results:

```python
# Simple average
y_pred_simple = np.mean(R_test, axis=0)

# Stacking
from sklearn.linear_model import LogisticRegression
stacker = LogisticRegression().fit(R_train.T, y_train)
y_pred_stack = stacker.predict_proba(R_test.T)[:, 1]

# CF-Ensemble should beat the simple average
# and be competitive with stacking
```
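To make the comparison quantitative, score each baseline with PR-AUC via scikit-learn's `average_precision_score`. The synthetic data below is an assumption purely so the snippet is self-contained: each base classifier is a noisy view of the true label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
m, n_train, n_test = 5, 200, 100

# Synthetic base-model predictions: rows = classifiers, columns = instances.
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)
R_train = np.clip(y_train + rng.normal(0, 0.6, (m, n_train)), 0, 1)
R_test = np.clip(y_test + rng.normal(0, 0.6, (m, n_test)), 0, 1)

# The two baselines from the snippet above
y_pred_simple = np.mean(R_test, axis=0)
stacker = LogisticRegression().fit(R_train.T, y_train)
y_pred_stack = stacker.predict_proba(R_test.T)[:, 1]

ap_simple = average_precision_score(y_test, y_pred_simple)
ap_stack = average_precision_score(y_test, y_pred_stack)
```

If your CF-Ensemble PR-AUC lands below `ap_simple` on your own data, go back to check #1 (data split) before tuning anything else.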
## Related Documentation
## Contributing

Found a new failure mode? Please document it:

- **Describe the problem**: What goes wrong?
- **Show symptoms**: How do you recognize it?
- **Explain the cause**: Why does it happen?
- **Provide a solution**: How do you fix it?
- **Add examples**: Code snippets showing wrong vs. right
Follow the format in existing documents. PRs welcome!
## Lessons Learned

### From Amazon Recommender Systems

Warm start vs. cold start:

- Movies in the database with some ratings hidden → warm start
- Brand-new movies → cold start
- CF-Ensemble is (usually) warm start!
### From Machine Learning

Not all ML is inductive:

- Inductive: learn from train, apply to unseen test
- Transductive: have test inputs (but not labels) during training
- CF-Ensemble is transductive by design
### From Optimization Theory

Alternating optimization is fragile:

- Works when all steps optimize the SAME objective
- Fails when the objectives conflict
- Joint optimization with unified gradients is more robust
Remember: Most CF-Ensemble failures are NOT bugs, but misunderstandings of the method's assumptions. Understanding these failure modes will save you hours of debugging!