Failure Mode: ALS Without Label-Aware Confidence¶
Category: Implementation Error
Severity: High (Wrong objective being optimized)
Date Identified: 2026-01-25
Status: FIXED - Use label-aware confidence or PyTorch
TL;DR¶
Problem: Original ALS implementation optimized reconstruction only, ignoring the supervised loss term.
Root Cause: ALS was minimizing \(L_{\text{recon}}\) when it should approximate \(\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}} + (1-\rho) \cdot L_{\text{sup}}\)
Solutions: 1. ALS + Label-Aware Confidence (approximation, fast) 2. PyTorch Joint Gradient Descent (exact, recommended)
The Real Problem: Wrong Objective¶
Original Implementation (WRONG)¶
for iteration in range(max_iter):
# Step 1: Update X given Y (ALS - ONLY reconstruction)
X = argmin_X ||C ⊙ (R - X^T Y)||² + λ||X||²
# Step 2: Update Y given X (ALS - ONLY reconstruction)
Y = argmin_Y ||C ⊙ (R - X^T Y)||² + λ||Y||²
# Step 3: Update θ given X, Y (GD - ONLY supervision)
θ = θ - lr * ∇_θ CE(y, g_θ(X^T Y))
The fundamental error: ALS optimizes \(L_{\text{recon}}\) only, completely ignoring \(L_{\text{sup}}\)!
This is NOT what the KD-inspired combined loss intended: $\(\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}} + (1-\rho) \cdot L_{\text{sup}}\)$
Why This Fails¶
In Knowledge Distillation: - You don't optimize soft targets and hard labels separately - You compute \(\mathcal{L}_{\text{KD}} = \rho \cdot L(\text{soft}) + (1-\rho) \cdot L(\text{hard})\) - Then backprop through the combined loss
Original CF-Ensemble ALS was doing the equivalent of:
# WRONG KD approach (nobody does this!)
for epoch in range(epochs):
loss_soft = compute_soft_loss()
loss_soft.backward()
optimizer_soft.step() # Only optimize for soft targets
loss_hard = compute_hard_loss()
loss_hard.backward()
optimizer_hard.step() # Separately optimize for hard labels
Of course this fails! The gradients point in different directions!
Iteration N:
ALS: "Let me change X, Y to minimize reconstruction error"
→ Reconstruction loss: 2.5 → 1.8 ✓
→ Supervised loss: 0.35 → 0.52 ✗ (got worse!)
Iteration N+1:
GD: "Let me change θ to fix supervised loss"
→ Supervised loss: 0.52 → 0.38 ✓
→ But now X, Y are still optimized for old θ
Iteration N+2:
ALS: "Reconstruction is bad again, let me fix it"
→ Changes X, Y again
→ Undoes what GD tried to do
→ Cycle repeats...
Evidence from Experiments¶
Diagnostic Results¶
Running on simple, balanced data (positive_rate=0.50, quality=0.85):
Config: latent_dim=20, lambda_reg=0.01, rho=0.5
Iter 0: Recon Loss=30.45, Sup Loss=0.502, PR-AUC=0.500
Iter 20: Recon Loss=5.23, Sup Loss=0.501, PR-AUC=0.502
Iter 40: Recon Loss=2.15, Sup Loss=0.509, PR-AUC=0.501
Iter 60: Recon Loss=1.68, Sup Loss=0.507, PR-AUC=0.503
Iter 100: Recon Loss=1.55, Sup Loss=0.509, PR-AUC=0.504
❌ Supervised loss FLAT (oscillates around 0.50)
❌ PR-AUC FLAT (stuck at ~0.50, basically random)
❌ Never converges
Compare to baselines: - Simple Average: PR-AUC = 0.942 - Stacking: PR-AUC = 0.951 - CF-Ensemble: PR-AUC = 0.504 (random!)
Isolated Component Tests¶
Test 1: ALS alone (ρ=1.0)
Result: ✅ Converges! RMSE = 0.012 (excellent reconstruction)Test 2: Aggregator alone
# Fixed X, Y, train aggregator only
for iter in range(20):
aggregator.update(X_fixed, Y_fixed, labeled_idx, labels, lr=0.1)
Test 3: Together (ρ=0.5)
Result: ❌ Fails! PR-AUC stuck at ~0.50, no convergenceConclusion: Each component works, but together they interfere!
Why Alternating Optimization Fails Here¶
Classic Conditions for Convergence¶
Alternating optimization (e.g., EM, ALS) converges when:
- Each step decreases the SAME objective
-
❌ We have: ALS minimizes reconstruction, GD minimizes supervision
-
Objective is jointly convex (or has nice structure)
-
❌ Our combined loss is non-convex in (X, Y, θ)
-
Steps don't undo each other
- ❌ ALS changes invalidate aggregator weights
- ❌ Aggregator needs different X, Y than reconstruction wants
The Fundamental Conflict¶
Reconstruction wants: - X, Y to faithfully reproduce R - Even if R contains systematic errors - Smooth, low-rank approximation
Supervision wants: - X, Y (and θ) to predict labels correctly - Amplify signal, suppress noise - May require higher rank or different structure
With alternating updates: - Reconstruction pulls X, Y one direction - Supervision pulls θ (which depends on X, Y) another direction - They never reach equilibrium
Why Can't ALS Optimize the Combined Loss Directly?¶
The challenge: The combined loss is: $\(\mathcal{L}_{\text{CF}} = \rho \cdot \underbrace{\sum c_{ui}(r_{ui} - x_u^\top y_i)^2}_{\text{quadratic}} + (1-\rho) \cdot \underbrace{\sum CE(y_i, g_\theta(X^\top Y))}_{\text{NOT quadratic}}\)$
ALS requires quadratic objectives to get closed-form solutions.
The supervised term contains: - Sigmoid: \(\sigma(w^\top(X^\top y_i) + b)\) - Logarithm: \(\log(\sigma(...))\)
These are non-quadratic, so no closed-form ALS solution exists!
Analogy: VAE also has reconstruction + KL terms, but uses gradient descent, not closed-form, because the combined loss is non-quadratic.
Solutions¶
Two valid approaches with different trade-offs:
Solution 1: ALS + Label-Aware Confidence (Fast Approximation)¶
Strategy: Make ALS approximate the combined loss by modulating confidence weights.
Instead of directly optimizing the combined loss, optimize: $\(\min_{X,Y} \sum_{u,i} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda(\|X\|^2 + \|Y\|^2)\)$
where \(\tilde{c}_{ui}\) is label-aware: - For \(y_i = 1\): \(\tilde{c}_{ui} = c_{ui}(1 + \alpha \cdot r_{ui})\) (reward high predictions) - For \(y_i = 0\): \(\tilde{c}_{ui} = c_{ui}(1 + \alpha \cdot (1-r_{ui}))\) (reward low predictions)
How this works: - High \(\tilde{c}_{ui}\) → ALS prioritizes reconstructing this \(r_{ui}\) accurately - Label-aware weighting: High \(\tilde{c}_{ui}\) when prediction agrees with label - Result: ALS indirectly learns to preserve correct predictions
Implementation:
from cfensemble.optimization import CFEnsembleTrainer
trainer = CFEnsembleTrainer(
n_classifiers=m,
latent_dim=20,
rho=0.5,
use_label_aware_confidence=True, # Enable approximation
label_aware_alpha=1.0 # Supervision strength
)
trainer.fit(data)
Advantages: - ✅ Fast (O(d³) closed-form ALS) - ✅ No PyTorch dependency - ✅ Reasonable approximation of supervision
Disadvantages: - ❌ Approximate (not exact combined gradient) - ❌ Requires tuning α parameter - ❌ Less flexible than gradient descent
Solution 2: Joint Gradient Descent via PyTorch (Exact, Recommended)¶
Use PyTorch/JAX for automatic differentiation:
import torch
class CFEnsembleNet(torch.nn.Module):
def __init__(self, m, n, d):
super().__init__()
self.X = torch.nn.Parameter(torch.randn(d, m) * 0.01)
self.Y = torch.nn.Parameter(torch.randn(d, n) * 0.01)
self.w = torch.nn.Parameter(torch.ones(m) / m)
self.b = torch.nn.Parameter(torch.zeros(1))
def forward(self, indices):
# Reconstruct probabilities
R_hat = self.X.T @ self.Y # (m × n)
R_hat_subset = R_hat[:, indices]
# Aggregate
logits = self.w @ R_hat_subset + self.b
return torch.sigmoid(logits)
def combined_loss(self, R, C, labels, labeled_idx, rho, lambda_reg):
# Reconstruction loss
R_hat = self.X.T @ self.Y
recon_loss = torch.sum(C * (R - R_hat)**2)
reg_loss = lambda_reg * (torch.sum(self.X**2) + torch.sum(self.Y**2))
# Supervised loss
y_pred = self.forward(labeled_idx)
y_true = labels[labeled_idx]
sup_loss = torch.nn.functional.binary_cross_entropy(y_pred, y_true)
# Combined
return rho * (recon_loss + reg_loss) + (1 - rho) * sup_loss
# Training loop
model = CFEnsembleNet(m, n, d)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(max_epochs):
optimizer.zero_grad()
loss = model.combined_loss(R, C, labels, labeled_idx, rho=0.5, lambda_reg=0.01)
loss.backward() # ✅ Gradients for ALL parameters
optimizer.step() # ✅ Update ALL parameters together
Advantages: - ✅ Single unified objective - ✅ All parameters updated consistently - ✅ Guaranteed descent (with proper learning rate) - ✅ Can use modern optimizers (Adam, AdamW, etc.) - ✅ Automatic differentiation - no manual gradient derivation
Disadvantages: - Slower per iteration than closed-form ALS - Need to tune learning rates - Requires PyTorch/JAX dependency
Solution 2: Two-Stage Training¶
Decouple the conflicting objectives:
# Stage 1: Pure reconstruction (ρ=1.0)
trainer_recon = CFEnsembleTrainer(
rho=1.0, # Pure reconstruction
max_iter=100
)
trainer_recon.fit(ensemble_data)
# Extract learned factors
X_final = trainer_recon.X
Y_final = trainer_recon.Y
# Stage 2: Train aggregator only (fix X, Y)
aggregator = WeightedAggregator(m)
for epoch in range(max_epochs):
aggregator.update(
X_final, Y_final,
labeled_idx, labels,
lr=0.1
)
Advantages: - ✅ Simple to implement (minimal changes) - ✅ Each stage converges reliably - ✅ Fast (uses closed-form ALS) - ✅ Interpretable (clear separation of concerns)
Disadvantages: - X, Y don't benefit from supervision - May be suboptimal vs. joint optimization - Essentially "stacking on reconstructed features"
Solution 3: Damped Alternating Updates¶
Slow down aggregator to reduce oscillations:
for iteration in range(max_iter):
# ALS updates (fast)
X = update_X(Y, R, C, lambda_reg)
Y = update_Y(X, R, C, lambda_reg)
# Aggregator update (SLOW)
if iteration % 10 == 0: # Update less frequently
aggregator.update(X, Y, labeled_idx, labels, lr=0.001) # Small LR
Advantages: - Minimal code changes - Lets reconstruction stabilize first
Disadvantages: - Still fundamentally unstable - Very slow convergence for aggregator - Requires careful tuning of update frequency and LR
Solution 4: Weighted ALS (Modify ALS to See Supervision)¶
Make ALS aware of supervised loss:
# Standard ALS update for X (ignores labels):
X = (Y C Y^T + λI)^{-1} Y C R
# Modified ALS update (incorporates labels):
# Add penalty for X, Y that produce bad predictions
# This is complex - need to linearize supervised loss
Advantages: - Keeps closed-form updates - Unifies objectives
Disadvantages: - Complex to derive - No longer closed-form (need approximation) - Loses ALS speed advantage
Recommendation¶
For production: Use Solution #1 (PyTorch) because: 1. Most flexible and extensible 2. Proven optimization (Adam, learning rate schedules) 3. Can add advanced features (attention, deep aggregator, etc.) 4. Standard in modern ML
For quick validation: Use Solution #2 (Two-Stage) because: 1. Fast to implement and test 2. Provides baseline performance 3. If it beats baselines, validates the approach 4. Can then invest in Solution #1 for better results
Mathematical Analysis¶
Why Joint Optimization Helps¶
Combined loss: $\(\mathcal{L}(X, Y, \theta) = \rho \cdot L_{\text{recon}}(X, Y) + (1-\rho) \cdot L_{\text{sup}}(X, Y, \theta)\)$
Joint gradient descent: $\(\begin{align} X &\leftarrow X - \eta_X \cdot \nabla_X \mathcal{L} \\ Y &\leftarrow Y - \eta_Y \cdot \nabla_Y \mathcal{L} \\ \theta &\leftarrow \theta - \eta_\theta \cdot \nabla_\theta \mathcal{L} \end{align}\)$
Key property: All gradients computed w.r.t. the SAME loss - \(\nabla_X \mathcal{L}\) considers both reconstruction AND supervision - \(\nabla_Y \mathcal{L}\) considers both reconstruction AND supervision - \(\nabla_\theta \mathcal{L}\) only affects supervision (but consistent with X, Y gradients)
Result: Monotonic decrease in loss (with appropriate learning rates)
Comparison: Alternating vs. Joint¶
| Property | Alternating ALS+GD | Joint GD |
|---|---|---|
| Objective per step | Different | Same |
| Convergence guarantee | ❌ No | ✅ Yes (with LR schedule) |
| Speed per iteration | Fast (closed form) | Slower (gradient computation) |
| Total iterations to converge | ∞ (doesn't converge) | ~100-500 |
| Optimization quality | Poor | Good |
| Ease of extension | Hard | Easy (autodiff) |
Implementation Priorities¶
Phase 1: Validate Approach (Week 1)¶
- ✅ Implement two-stage training
- ✅ Test on benchmark data
- ✅ Verify it beats baselines
- Document results
Phase 2: Production Solution (Week 2-3)¶
- Implement PyTorch-based joint optimization
- Add learning rate scheduling (ReduceLROnPlateau)
- Add early stopping (based on validation loss)
- Comprehensive benchmarking vs. two-stage
Phase 3: Advanced Features (Week 4+)¶
- Try different optimizers (Adam vs. AdamW vs. SGD)
- Implement advanced aggregators (attention-based)
- Add batch training for large datasets
- GPU acceleration
Lessons Learned¶
1. Alternating Optimization is Fragile¶
- Works great for single-objective problems (e.g., NMF, ALS for recommendations)
- Fails when objectives conflict
- Always check: are all steps optimizing the SAME thing?
2. Closed-Form ≠ Better¶
- ALS is fast per iteration
- But if it doesn't converge, speed is useless
- Gradient descent slower per iteration, but converges
3. Modern Tools Help¶
- PyTorch/JAX handle complex gradients automatically
- Don't need to derive update equations manually
- Can focus on model design, not optimization
4. Validate Components Separately¶
- Test ALS alone → works!
- Test aggregator alone → works!
- Test together → fails!
- This isolation was KEY to finding the bug
Related Failure Modes¶
See also:
- transductive_vs_inductive.md - Using wrong train/test split
- hyperparameter_sensitivity.md - Tuning λ, ρ, d
- confidence_weights.md - Label-aware weighting
References¶
- Alternating Optimization:
- Boyd, S., et al. (2011). "Distributed Optimization and Statistical Learning via ADMM." Foundations and Trends in ML.
-
Bezdek, J., Hathaway, R. (2003). "Convergence of Alternating Optimization." Neural, Parallel & Scientific Comp.
-
Matrix Factorization Optimization:
- Zhou, Y., et al. (2008). "Large-scale Parallel Collaborative Filtering." KDD.
-
Gemulla, R., et al. (2011). "Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent." KDD.
-
Joint Training with Multiple Objectives:
- Kendall, A., et al. (2018). "Multi-Task Learning Using Uncertainty to Weigh Losses." CVPR.
- Chen, Z., et al. (2018). "GradNorm: Gradient Normalization for Adaptive Loss Balancing." ICML.
Key Takeaway: When combining multiple objectives, use joint optimization with unified gradients. Alternating updates only work when all steps minimize the same objective!