CF-Ensemble Quick Reference¶
Essential equations and concepts at a glance
Core Idea¶
Combine matrix reconstruction with supervised learning, inspired by knowledge distillation:
Pure reconstruction reproduces errors. Adding supervision teaches what "signal" means.
The Complete Objective¶
Term 1: Reconstruction Loss¶
Purpose: Faithfully reproduce probability matrix (all points)
Interpretation: "Learn latent factors that reconstruct base model predictions"
Term 2: Supervised Loss¶
where: $\(\text{CE}(y, \hat{p}) = -y \log \hat{p} - (1-y) \log(1-\hat{p})\)$
Purpose: Predict correct labels (labeled points only)
Interpretation: "Ensure aggregated predictions match ground truth"
Key Objects and Notation¶
| Symbol | Meaning | Typical Size |
|---|---|---|
| \(R \in [0,1]^{m \times n}\) | Probability matrix | 10 × 1000 |
| \(r_{ui}\) | Classifier \(u\)'s prob for point \(i\) | [0, 1] |
| \(X \in \mathbb{R}^{d \times m}\) | Classifier latent factors | 20 × 10 |
| \(Y \in \mathbb{R}^{d \times n}\) | Instance latent factors | 20 × 1000 |
| \(x_u \in \mathbb{R}^d\) | Latent vector for classifier \(u\) | dim 20 |
| \(y_i \in \mathbb{R}^d\) | Latent vector for point \(i\) | dim 20 |
| \(\hat{r}_{ui} = x_u^\top y_i\) | Reconstructed probability | [0, 1] |
| \(C \in \mathbb{R}_+^{m \times n}\) | Confidence/reliability weights | 10 × 1000 |
| \(c_{ui}\) | Trust in \(r_{ui}\) | ≥ 0 |
| \(\mathcal{L} \subset \{1,\ldots,n\}\) | Labeled point indices | e.g., |
| \(\mathcal{U} \subset \{1,\ldots,n\}\) | Unlabeled point indices | e.g., |
| \(y_i \in \{0,1\}\) | Ground truth label | 0 or 1 |
| \(\hat{p}_i \in [0,1]\) | Final predicted probability | [0, 1] |
| \(g_\theta: \mathbb{R}^m \to [0,1]\) | Aggregation function | e.g., mean |
| \(\rho \in [0,1]\) | Trade-off parameter | 0.3 - 0.7 |
| \(\lambda > 0\) | Regularization strength | 0.01 - 0.1 |
| \(d \in \mathbb{N}\) | Latent dimension | 10 - 50 |
Aggregation Function¶
Maps reconstructed probabilities to final prediction:
where \(\hat{r}_{\cdot i} = [\hat{r}_{1i}, \ldots, \hat{r}_{mi}]^\top\)
Common Choices¶
Simple mean: $\(g(\hat{r}_{\cdot i}) = \frac{1}{m}\sum_{u=1}^m \hat{r}_{ui}\)$
Weighted mean: $\(g_w(\hat{r}_{\cdot i}) = \sigma(w^\top \hat{r}_{\cdot i} + b)\)$ where \(\sigma(z) = 1/(1+e^{-z})\)
Confidence Weights¶
Purpose¶
Weight which probability predictions to trust more during factorization.
Strategies¶
1. Certainty-based (default): $\(c_{ui} = |r_{ui} - 0.5|\)$
2. Label-aware (for labeled data): $\(c_{ui} = \begin{cases} r_{ui} & \text{if } y_i = 1 \\ 1 - r_{ui} & \text{if } y_i = 0 \end{cases}\)$
3. Calibration-based: $\(c_{ui} = 1 - \text{Brier}_u = 1 - \frac{1}{N}\sum_j (r_{uj} - y_j)^2\)$
4. Agreement-based: $\(c_{ui} = 1 - \text{Var}_u(r_{\cdot i})\)$
Optimization: Alternating Least Squares¶
Update Classifier Factors (fix \(Y\))¶
For each classifier \(u\): $\(x_u = \left( Y C_u Y^\top + \lambda I \right)^{-1} Y C_u r_u\)$
where: - \(C_u = \text{diag}(c_{u1}, \ldots, c_{un})\) - \(r_u = [r_{u1}, \ldots, r_{un}]^\top\)
Update Instance Factors (fix \(X\))¶
For each point \(i\): $\(y_i = \left( X C_i X^\top + \lambda I \right)^{-1} X C_i r_i\)$
where: - \(C_i = \text{diag}(c_{1i}, \ldots, c_{mi})\) - \(r_i = [r_{1i}, \ldots, r_{mi}]^\top\)
Update Aggregator (fix \(X, Y\))¶
Gradient descent on supervised loss: $\(\theta \leftarrow \theta - \eta \nabla_\theta L_{\text{sup}}(X, Y, \theta)\)$
Algorithm Summary¶
Input: R (m × n), labels (n,), hyperparameters
Output: X (d × m), Y (d × n), θ
1. Initialize:
- X ← random(d, m) × 0.01
- Y ← random(d, n) × 0.01
- θ ← initialize_aggregator()
2. For epoch = 1 to max_iter:
a. X ← ALS_update_classifiers(Y, R, C, λ)
b. Y ← ALS_update_instances(X, R, C, λ)
c. θ ← gradient_step(θ, X, Y, labels, η)
d. loss ← compute_combined_loss(ρ)
e. If converged: break
3. Return X, Y, θ
Prediction¶
On Training/Test Data (transductive)¶
Given fitted \(X, Y, \theta\): $\(\hat{p}_i = g_\theta(X^\top y_i)\)$
On New Data (inductive)¶
Given new point \(i_{\text{new}}\) with predictions \(r_{\cdot, i_{\text{new}}}\):
-
Solve for latent factor: $\(y_{i_{\text{new}}} = (X C X^\top + \lambda I)^{-1} X C r_{i_{\text{new}}}\)$
-
Predict: $\(\hat{p}_{i_{\text{new}}} = g_\theta(X^\top y_{i_{\text{new}}})\)$
Hyperparameter Tuning¶
Recommended Ranges¶
| Parameter | Range | Default | Tuning Strategy |
|---|---|---|---|
| \(\rho\) | [0.3, 0.7] | 0.5 | Grid search |
| \(d\) | [10, 50] | 20 | Rule: \(d \approx \sqrt{m}\) |
| \(\lambda\) | [0.01, 0.1] | 0.01 | Grid search |
| Aggregator | {mean, weighted} | weighted | Compare both |
Validation Strategy¶
for rho in [0.3, 0.5, 0.7]:
for d in [10, 20, 30]:
for lambda in [0.01, 0.05, 0.1]:
# Train on L_train ∪ U (transductive)
# Validate on L_val
# Select best by validation AUC
Connection to Knowledge Distillation¶
| Aspect | Knowledge Distillation | CF-Ensemble |
|---|---|---|
| Soft targets | Teacher predictions \(q_t\) | Probability matrix \(R\) |
| Student | Small neural network | Latent factors \(X, Y\) |
| Soft loss | \(\text{KL}(q_t \| q_s)\) | \(\sum c_{ui}(r_{ui} - x_u^\top y_i)^2\) |
| Hard loss | \(\text{CE}(y_g, q_s)\) | \(\sum_{i \in \mathcal{L}} \text{CE}(y_i, g(\hat{r}_{\cdot i}))\) |
| Combined | \(\rho \cdot T^2 \cdot \text{KL} + (1-\rho) \cdot \text{CE}\) | \(\rho \cdot L_{\text{recon}} + (1-\rho) \cdot L_{\text{sup}}\) |
| Key insight | Match teacher + match labels | Reconstruct matrix + match labels |
Diagnostic Metrics¶
Reconstruction Quality¶
Monitor separately on \(\mathcal{L}\) and \(\mathcal{U}\). Large gap suggests overfitting.
Prediction Performance¶
- ROC-AUC: Primary metric
- Brier score: Calibration quality
- F1 score: Balanced accuracy
Comparison Baseline¶
Always compare against: 1. Simple averaging: \(\frac{1}{m}\sum_u r_{ui}\) 2. Best single model: \(\max_u \text{AUC}_u\) 3. Stacking: Logistic regression on \(R\)
When to Use CF-Ensemble¶
✅ Good fit when:¶
- Base models have complementary errors
- Moderate labeled data (100-10,000 samples)
- Test set is known at training time (transductive)
- Need interpretable latent structure
❌ Poor fit when:¶
- Base models are nearly perfect
- Very large labeled dataset (stacking better)
- Test distribution very different from train
- Need strict inductive guarantees
Common Pitfalls¶
❌ Setting \(\rho = 1\)¶
Pure reconstruction reproduces errors. Always use \(\rho \in [0.3, 0.7]\).
❌ Ignoring confidence weights¶
All predictions treated equally. Use label-aware confidence for \(\mathcal{L}\).
❌ Too large \(d\)¶
Overfitting. Start with \(d \approx \sqrt{m}\), tune down if needed.
❌ Forgetting regularization¶
Latent factors explode. Always use \(\lambda > 0\).
❌ Not checking convergence¶
Stopped too early or too late. Monitor loss curves.
Checklist for Success¶
- Diverse base models (different algorithms/hyperparameters)
- Reasonable \(\rho\) (not 0 or 1)
- Label-aware confidence for labeled data
- Validation set for hyperparameter tuning
- Convergence monitoring (loss curves)
- Baseline comparisons (averaging, stacking)
- Latent space visualization (sanity check)
- Test on ≥3 different datasets
One-Liner Summary¶
CF-Ensemble = Collaborative Filtering + Supervised Learning
We factorize the ensemble's probability matrix to discover latent structure, while using ground truth labels to distinguish signal from noise.
Further Reading¶
- Knowledge Distillation Tutorial
- CF-Ensemble Optimization Tutorial
- Class-Weighted Gradients
- Failure Modes
Print this page for quick reference during implementation!