# ALS with Label-Aware Confidence: Mathematical Derivation
How ALS approximates the combined reconstruction + supervision objective
## Table of Contents
- Introduction
- The Combined Objective
- Why ALS Cannot Optimize It Directly
- Label-Aware Confidence Approximation
- Deriving the ALS Updates
- Complete Algorithm
- Theoretical Analysis
- Implementation
## Introduction
Goal: Optimize the combined KD-inspired objective:

$$\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}}(X, Y) + (1-\rho) \cdot L_{\text{sup}}(X, Y, \theta)$$
This balances:

- Reconstruction: faithfully representing base classifier predictions
- Supervision: ensuring aggregated predictions match ground truth labels
Challenge: The supervised term is non-quadratic (contains sigmoid + log), preventing closed-form ALS solutions.
Solution: Use label-aware confidence weighting to approximate supervision within ALS.
This document derives the mathematical foundation for this approximation.
## The Combined Objective

### Full Loss Function
Following the knowledge distillation analogy:

$$\mathcal{L}_{\text{CF}} = \rho \sum_{u=1}^{m} \sum_{i=1}^{n} c_{ui}\,(r_{ui} - x_u^\top y_i)^2 + (1-\rho) \sum_{i \in \mathcal{L}} CE\big(y_i,\, g_\theta(X^\top y_i)\big) + \lambda\big(\|X\|_F^2 + \|Y\|_F^2\big)$$

where:

- \(R \in [0,1]^{m \times n}\): probability matrix from base classifiers
- \(X \in \mathbb{R}^{d \times m}\): classifier latent factors
- \(Y \in \mathbb{R}^{d \times n}\): instance latent factors
- \(C \in \mathbb{R}_+^{m \times n}\): confidence weights
- \(\mathcal{L} \subset \{1,\ldots,n\}\): labeled instances
- \(g_\theta\): aggregation function (e.g., \(\sigma(w^\top(\cdot) + b)\))
- \(CE\): binary cross-entropy
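To make the objective concrete, here is a minimal NumPy sketch that evaluates \(\mathcal{L}_{\text{CF}}\) for the linear-sigmoid aggregator above; the function name `combined_loss` and the default mean-aggregator weights are illustrative assumptions, not part of the package API.

```python
import numpy as np

def combined_loss(X, Y, R, labels, C, rho=0.5, lam=0.01, w=None, b=0.0):
    """Evaluate L_CF = rho * L_recon + (1 - rho) * L_sup + regularization.

    Sketch only: g_theta(z) = sigma(w^T z + b); `w` defaults to uniform
    averaging weights over the m classifiers (an illustrative assumption).
    """
    m = R.shape[0]
    if w is None:
        w = np.full(m, 1.0 / m)              # simple mean aggregator
    R_hat = X.T @ Y                          # (m x n) low-rank reconstruction
    L_recon = np.sum(C * (R - R_hat) ** 2)   # confidence-weighted squared error

    labeled = ~np.isnan(labels)              # the set L of labeled instances
    z = w @ R_hat[:, labeled] + b            # aggregator logits
    p = 1.0 / (1.0 + np.exp(-z))             # sigma(w^T (X^T y_i) + b)
    y = labels[labeled]
    eps = 1e-12                              # numerical guard for log
    L_sup = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    reg = lam * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2)
    return rho * L_recon + (1 - rho) * L_sup + reg
```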
### What Each Term Does

Reconstruction term (\(L_{\text{recon}}\)):

- Quadratic in \(X\) and \(Y\)
- Has a closed-form ALS solution
- Encourages \(X^\top Y \approx R\) (low-rank approximation)

Supervised term (\(L_{\text{sup}}\)):

$$CE\big(y_i, g_\theta(X^\top y_i)\big) = -y_i \log \sigma(w^\top(X^\top y_i) + b) - (1-y_i) \log\big(1-\sigma(w^\top(X^\top y_i) + b)\big)$$

- Non-quadratic (sigmoid + log)
- No closed-form solution
- Encourages predictions to match labels
## Why ALS Cannot Optimize It Directly

### ALS Requires Quadratic Objectives

Standard ALS works when the objective is quadratic in each variable block:

$$\min_{x_u} \sum_{i=1}^{n} c_{ui}\,(r_{ui} - x_u^\top y_i)^2 + \lambda \|x_u\|^2$$

This gives the closed-form solution:

$$x_u = (YC_uY^\top + \lambda I)^{-1} YC_u r_u$$
### The Supervised Term is Non-Quadratic

Taking the gradient of \(L_{\text{sup}}\) w.r.t. \(X\) (with \(g_\theta = \sigma(w^\top(\cdot) + b)\)):

$$\nabla_X L_{\text{sup}} = \sum_{i \in \mathcal{L}} \big(\sigma(w^\top(X^\top y_i) + b) - y_i\big)\, y_i w^\top$$

The problem:

- Contains \(\sigma(\cdot)\) (non-linear)
- Depends on \(\theta\) (aggregator parameters)
- No closed-form inverse
Conclusion: Cannot derive ALS update equations for the full combined loss.
## Label-Aware Confidence Approximation

### Key Insight

Instead of directly minimizing:

$$\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}} + (1-\rho) \cdot L_{\text{sup}}$$

approximate it by minimizing:

$$\mathcal{L}_{\text{approx}}(X, Y) = \sum_{u,i} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda\big(\|X\|_F^2 + \|Y\|_F^2\big)$$
where \(\tilde{c}_{ui}\) is label-aware: encodes supervision signal through confidence modulation.
### Label-Aware Confidence Definition

$$\tilde{c}_{ui} = \begin{cases} c_{ui}^{\text{base}}\,(1 + \alpha\, r_{ui}) & \text{if } i \in \mathcal{L} \text{ and } y_i = 1 \\ c_{ui}^{\text{base}}\,\big(1 + \alpha\,(1 - r_{ui})\big) & \text{if } i \in \mathcal{L} \text{ and } y_i = 0 \\ c_{ui}^{\text{base}} & \text{if } i \notin \mathcal{L} \end{cases}$$

where:

- \(c_{ui}^{\text{base}}\): base confidence (typically the certainty \(|r_{ui} - 0.5|\))
- \(\alpha \geq 0\): supervision strength parameter
### How This Incorporates Supervision

For positive instances (\(y_i = 1\)):

- High \(r_{ui}\) (correct prediction) → high \(\tilde{c}_{ui}\) → ALS prioritizes preserving this
- Low \(r_{ui}\) (incorrect prediction) → lower \(\tilde{c}_{ui}\) → ALS deprioritizes this

For negative instances (\(y_i = 0\)):

- Low \(r_{ui}\) (correct prediction) → high \(\tilde{c}_{ui}\) → ALS prioritizes preserving this
- High \(r_{ui}\) (incorrect prediction) → lower \(\tilde{c}_{ui}\) → ALS deprioritizes this
Result: ALS learns to reconstruct correct predictions well while downweighting errors.
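A quick worked example with \(\alpha = 1\) and \(c^{\text{base}}_{ui} = |r_{ui} - 0.5|\): on a positive instance, a classifier predicting \(r_{ui} = 0.9\) gets \(\tilde{c}_{ui} = 0.4 \cdot (1 + 0.9) = 0.76\), while one predicting \(r_{ui} = 0.1\) gets \(\tilde{c}_{ui} = 0.4 \cdot (1 + 0.1) = 0.44\). Both are equally certain (same base confidence), but the correct prediction carries roughly 1.7× the weight in the ALS objective.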
### Mathematical Justification

Expanding the approximate objective using the definition of \(\tilde{c}_{ui}\):

$$\mathcal{L}_{\text{approx}} = \sum_{u,i} c_{ui}^{\text{base}}\,(1 + \alpha\, s_{ui})\,(r_{ui} - x_u^\top y_i)^2 + \lambda\big(\|X\|_F^2 + \|Y\|_F^2\big)$$

where \(s_{ui}\) is the supervision signal:

$$s_{ui} = \begin{cases} r_{ui} & \text{if } y_i = 1 \\ 1 - r_{ui} & \text{if } y_i = 0 \\ 0 & \text{if unlabeled} \end{cases}$$

This splits into:

$$\mathcal{L}_{\text{approx}} = \underbrace{\sum_{u,i} c_{ui}^{\text{base}}(r_{ui} - x_u^\top y_i)^2}_{\text{reconstruction}} + \alpha \cdot \underbrace{\sum_{u,i} c_{ui}^{\text{base}}\, s_{ui}\,(r_{ui} - x_u^\top y_i)^2}_{\text{supervision-weighted reconstruction}} + \lambda\big(\|X\|_F^2 + \|Y\|_F^2\big)$$
The second term implicitly encourages \(X^\top y_i\) to match predictions that agree with labels!
## Deriving the ALS Updates

### Classifier Factor Update

Objective (fix \(Y\), optimize \(X\)):

$$\min_X \sum_{u=1}^{m} \sum_{i=1}^{n} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda \|X\|_F^2$$

This decomposes into \(m\) independent problems (one per classifier):

$$\min_{x_u} \sum_{i=1}^{n} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda \|x_u\|^2$$

Gradient:

$$\nabla_{x_u} = -2\sum_{i=1}^{n} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)\,y_i + 2\lambda x_u$$

Set to zero:

$$\sum_{i=1}^{n} \tilde{c}_{ui}\, y_i y_i^\top x_u + \lambda x_u = \sum_{i=1}^{n} \tilde{c}_{ui}\, r_{ui}\, y_i$$

Rearrange:

$$(Y\tilde{C}_u Y^\top + \lambda I)\,x_u = Y\tilde{C}_u r_u$$

where:

- \(\tilde{C}_u = \text{diag}(\tilde{c}_{u1}, \ldots, \tilde{c}_{un})\): label-aware confidence for classifier \(u\)
- \(r_u = [r_{u1}, \ldots, r_{un}]^\top\): predictions from classifier \(u\)

Closed-form solution:

$$\boxed{x_u = (Y\tilde{C}_u Y^\top + \lambda I)^{-1} Y\tilde{C}_u r_u}$$
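As a sanity check on this boxed update, the closed-form solve can be compared against a generic ridge-regularized least-squares solution of the same per-classifier problem; this is a standalone illustrative snippet, not library code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 40, 0.01
Y = rng.normal(size=(d, n))              # fixed instance factors
r_u = rng.uniform(size=n)                # predictions of one classifier u
c_u = rng.uniform(0.1, 1.0, size=n)      # label-aware confidences, row u

# Boxed closed-form update: x_u = (Y C_u Y^T + lam I)^{-1} Y C_u r_u
A = (Y * c_u) @ Y.T + lam * np.eye(d)
x_closed = np.linalg.solve(A, (Y * c_u) @ r_u)

# Same minimizer via generic weighted ridge least squares:
# min_x sum_i c_i (r_i - x^T y_i)^2 + lam ||x||^2
sqrt_c = np.sqrt(c_u)
design = np.vstack([(Y * sqrt_c).T, np.sqrt(lam) * np.eye(d)])
target = np.concatenate([sqrt_c * r_u, np.zeros(d)])
x_lstsq, *_ = np.linalg.lstsq(design, target, rcond=None)

assert np.allclose(x_closed, x_lstsq)    # both give the same x_u
```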
### Instance Factor Update

Objective (fix \(X\), optimize \(Y\)):

$$\min_Y \sum_{u=1}^{m} \sum_{i=1}^{n} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda \|Y\|_F^2$$

This decomposes into \(n\) independent problems (one per instance):

$$\min_{y_i} \sum_{u=1}^{m} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda \|y_i\|^2$$

Gradient:

$$\nabla_{y_i} = -2\sum_{u=1}^{m} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)\,x_u + 2\lambda y_i$$

Set to zero:

$$\sum_{u=1}^{m} \tilde{c}_{ui}\, x_u x_u^\top y_i + \lambda y_i = \sum_{u=1}^{m} \tilde{c}_{ui}\, r_{ui}\, x_u$$

Rearrange:

$$(X\tilde{C}_i X^\top + \lambda I)\,y_i = X\tilde{C}_i r_i$$

where:

- \(\tilde{C}_i = \text{diag}(\tilde{c}_{1i}, \ldots, \tilde{c}_{mi})\): label-aware confidence for instance \(i\)
- \(r_i = [r_{1i}, \ldots, r_{mi}]^\top\): predictions for instance \(i\)

Closed-form solution:

$$\boxed{y_i = (X\tilde{C}_i X^\top + \lambda I)^{-1} X\tilde{C}_i r_i}$$
### Key Observation
The update equations have the same form as standard ALS, but use label-aware confidence \(\tilde{C}\) instead of base confidence \(C\).
This is why the approximation is elegant: we don't modify the ALS algorithm structure, just the confidence weights!
## Complete Algorithm

### Algorithm: ALS with Label-Aware Confidence

Input:

- \(R \in [0,1]^{m \times n}\): probability matrix
- \(y \in \{0,1\}^n\): labels (with NaN for unlabeled)
- \(\mathcal{L}\): set of labeled indices
- \(\rho \in [0,1]\): reconstruction vs. supervision trade-off
- \(\alpha \geq 0\): label-aware supervision strength
- \(\lambda > 0\): regularization strength
- \(d\): latent dimensionality

Output:

- \(X \in \mathbb{R}^{d \times m}\): classifier factors
- \(Y \in \mathbb{R}^{d \times n}\): instance factors
Steps:

1. Initialize \(X\) and \(Y\) (e.g., with small random values).
2. Compute base confidence: \(c_{ui}^{\text{base}} = |r_{ui} - 0.5|\).
3. Compute label-aware confidence \(\tilde{c}_{ui}\) from \(c^{\text{base}}\), \(y\), and \(\alpha\) as defined above.
4. Iterate until convergence:

   ```
   for iteration in 1 to max_iter:
       # Update classifier factors
       for u in 1 to m:
           A ← Y @ diag(C_label_aware[u,:]) @ Y.T + λ*I
           b ← Y @ (C_label_aware[u,:] * R[u,:])
           X[:,u] ← solve(A, b)
       # Update instance factors
       for i in 1 to n:
           A ← X @ diag(C_label_aware[:,i]) @ X.T + λ*I
           b ← X @ (C_label_aware[:,i] * R[:,i])
           Y[:,i] ← solve(A, b)
       # Check convergence (optional: recompute loss)
       if ||ΔX|| < tol and ||ΔY|| < tol:
           break
   ```

5. Return \(X, Y\). (A complete runnable sketch follows this list.)
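Putting steps 1-5 together, a minimal self-contained NumPy sketch; names such as `als_label_aware` are illustrative, not the package API.

```python
import numpy as np

def als_label_aware(R, labels, d=10, alpha=1.0, lam=0.01,
                    max_iter=50, tol=1e-4, seed=0):
    """Minimal sketch of ALS with label-aware confidence (steps 1-5)."""
    m, n = R.shape
    rng = np.random.default_rng(seed)
    X = 0.01 * rng.normal(size=(d, m))      # step 1: initialize factors
    Y = 0.01 * rng.normal(size=(d, n))

    C = np.abs(R - 0.5)                     # step 2: base confidence
    labeled = ~np.isnan(labels)             # step 3: label-aware modulation
    pos = labeled & (labels == 1)
    neg = labeled & (labels == 0)
    C[:, pos] *= 1 + alpha * R[:, pos]
    C[:, neg] *= 1 + alpha * (1 - R[:, neg])

    I_d = lam * np.eye(d)
    for _ in range(max_iter):               # step 4: alternating solves
        X_old, Y_old = X.copy(), Y.copy()
        for u in range(m):
            Yw = Y * C[u, :]
            X[:, u] = np.linalg.solve(Yw @ Y.T + I_d, Yw @ R[u, :])
        for i in range(n):
            Xw = X * C[:, i]
            Y[:, i] = np.linalg.solve(Xw @ X.T + I_d, Xw @ R[:, i])
        if (np.linalg.norm(X - X_old) < tol and
                np.linalg.norm(Y - Y_old) < tol):
            break
    return X, Y                             # step 5
```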
### Aggregator Training (Separate Step)
After ALS converges, train the aggregator \(\theta\):
```
R_hat ← X.T @ Y
for epoch in 1 to max_epochs:
    y_pred ← aggregator(R_hat[:, labeled_indices], θ)
    loss ← cross_entropy(y[labeled_indices], y_pred)
    θ ← θ - lr * ∇_θ loss
```
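For concreteness, here is a runnable NumPy version of this step for the linear-sigmoid aggregator; `train_aggregator`, `lr`, and `max_epochs` are illustrative choices, and plain full-batch gradient descent is assumed.

```python
import numpy as np

def train_aggregator(R_hat, labels, lr=0.1, max_epochs=500):
    """Gradient descent on binary cross-entropy for g(z) = sigma(w^T z + b).

    Sketch only: full-batch updates, no early stopping or class weighting.
    """
    labeled = ~np.isnan(labels)
    Z = R_hat[:, labeled]                 # (m x n_L) reconstructed scores
    y = labels[labeled]
    m = Z.shape[0]
    w = np.full(m, 1.0 / m)               # start from the mean aggregator
    b = 0.0
    for _ in range(max_epochs):
        p = 1.0 / (1.0 + np.exp(-(w @ Z + b)))   # aggregator probabilities
        resid = p - y                             # dCE/dlogit
        w -= lr * (Z @ resid) / y.size
        b -= lr * resid.mean()
    return w, b
```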
## Theoretical Analysis

### Convergence Guarantee

Theorem: Each ALS half-update (fixing one set of factors) does not increase the approximate objective:

$$\mathcal{L}_{\text{approx}}(X^{t+1}, Y^t) \leq \mathcal{L}_{\text{approx}}(X^t, Y^t)$$

$$\mathcal{L}_{\text{approx}}(X^{t+1}, Y^{t+1}) \leq \mathcal{L}_{\text{approx}}(X^{t+1}, Y^t)$$

Proof: Each update exactly minimizes a strictly convex quadratic (the \(\lambda I\) term makes the system positive definite), so the objective cannot increase.

Corollary: The objective values are non-increasing and bounded below, so ALS converges to a stationary point (in practice, a local minimum) of \(\mathcal{L}_{\text{approx}}\).
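The monotonicity claim is easy to verify empirically. A small self-contained check on random data (any positive confidence matrix works), using the update equations above:

```python
import numpy as np

def l_approx(X, Y, R, C, lam):
    """The approximate objective for the current factors."""
    return (np.sum(C * (R - X.T @ Y) ** 2)
            + lam * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2))

rng = np.random.default_rng(1)
m, n, d, lam = 8, 30, 4, 0.01
R = rng.uniform(size=(m, n))
C = rng.uniform(0.1, 1.0, size=(m, n))   # any positive confidences work
X = 0.01 * rng.normal(size=(d, m))
Y = 0.01 * rng.normal(size=(d, n))
I_d = lam * np.eye(d)

prev = l_approx(X, Y, R, C, lam)
for _ in range(10):
    for u in range(m):                    # X half-update
        Yw = Y * C[u, :]
        X[:, u] = np.linalg.solve(Yw @ Y.T + I_d, Yw @ R[u, :])
    cur = l_approx(X, Y, R, C, lam)
    assert cur <= prev + 1e-9             # theorem: no increase
    prev = cur
    for i in range(n):                    # Y half-update
        Xw = X * C[:, i]
        Y[:, i] = np.linalg.solve(Xw @ X.T + I_d, Xw @ R[:, i])
    cur = l_approx(X, Y, R, C, lam)
    assert cur <= prev + 1e-9
    prev = cur
```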
### Approximation Quality
Question: How well does \(\mathcal{L}_{\text{approx}}\) approximate \(\mathcal{L}_{\text{CF}}\)?
Intuition: Label-aware confidence creates a first-order approximation of the supervision gradient.
Formal analysis:
Consider the supervision gradient:

$$\nabla_{X,Y} L_{\text{sup}} \propto \sum_{i \in \mathcal{L}} \big(g_\theta(X^\top y_i) - y_i\big) \cdot \nabla_{X,Y}\, g_\theta$$

Label-aware confidence implicitly adds a term proportional to:

$$\alpha \sum_{i \in \mathcal{L}} s_{ui} \cdot \nabla_{X,Y} \|R - X^\top Y\|^2$$
where \(s_{ui}\) correlates with \((g_\theta - y_i)\) when \(g_\theta\) is well-calibrated.
Result: the approximation is good when:

1. Base classifiers are well-calibrated
2. The aggregator is simple (e.g., mean or weighted mean)
3. \(\alpha\) is tuned appropriately
### Comparison to PyTorch

ALS + label-aware confidence:

- Approximate combined loss
- Fast (closed-form updates)
- May have a small optimality gap

PyTorch:

- Exact combined loss
- Slower (per-iteration gradient computation)
- Typically reaches a better local minimum of the exact objective
Expected gap: 5-15% in PR-AUC (ALS slightly worse but much faster).
## Implementation

### Python Implementation
```python
from typing import Optional

import numpy as np


def update_classifier_factors_label_aware(
    Y: np.ndarray,
    R: np.ndarray,
    C_label_aware: np.ndarray,
    lambda_reg: float
) -> np.ndarray:
    """
    Update classifier factors using label-aware confidence.

    Parameters:
        Y: Instance factors (d × n)
        R: Probability matrix (m × n)
        C_label_aware: Label-aware confidence (m × n)
        lambda_reg: Regularization strength

    Returns:
        X: Updated classifier factors (d × m)
    """
    d, n = Y.shape
    m = R.shape[0]
    X = np.zeros((d, m))
    lambda_I = lambda_reg * np.eye(d)

    for u in range(m):
        # Label-aware confidence for this classifier
        c_u = C_label_aware[u, :]
        # Weighted Gram matrix: Y @ diag(c_u) @ Y.T
        Y_weighted = Y * c_u[None, :]
        A = Y_weighted @ Y.T + lambda_I
        # Weighted target: Y @ (c_u * r_u)
        r_u = R[u, :]
        b = Y_weighted @ r_u
        # Solve: A x_u = b
        X[:, u] = np.linalg.solve(A, b)

    return X


def compute_label_aware_confidence(
    R: np.ndarray,
    labels: np.ndarray,
    alpha: float = 1.0,
    base_confidence: Optional[np.ndarray] = None
) -> np.ndarray:
    """
    Compute label-aware confidence weights.

    Parameters:
        R: Probability matrix (m × n)
        labels: Ground truth labels (n,) with NaN for unlabeled
        alpha: Supervision strength
        base_confidence: Base confidence (default: certainty |r - 0.5|)

    Returns:
        C_label_aware: Label-aware confidence (m × n)
    """
    if base_confidence is None:
        base_confidence = np.abs(R - 0.5)

    C = base_confidence.copy()
    labeled_mask = ~np.isnan(labels)

    for i in np.where(labeled_mask)[0]:
        if labels[i] == 1:
            # Positive: reward high predictions
            C[:, i] = base_confidence[:, i] * (1 + alpha * R[:, i])
        else:
            # Negative: reward low predictions
            C[:, i] = base_confidence[:, i] * (1 + alpha * (1 - R[:, i]))

    return C
```
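The derivation also gives the symmetric instance-factor update, which is not shown above; here is a sketch of that counterpart, where the name `update_instance_factors_label_aware` is assumed to mirror the function above.

```python
import numpy as np


def update_instance_factors_label_aware(
    X: np.ndarray,
    R: np.ndarray,
    C_label_aware: np.ndarray,
    lambda_reg: float
) -> np.ndarray:
    """
    Update instance factors using label-aware confidence (illustrative sketch).

    Parameters:
        X: Classifier factors (d × m)
        R: Probability matrix (m × n)
        C_label_aware: Label-aware confidence (m × n)
        lambda_reg: Regularization strength

    Returns:
        Y: Updated instance factors (d × n)
    """
    d, m = X.shape
    n = R.shape[1]
    Y = np.zeros((d, n))
    lambda_I = lambda_reg * np.eye(d)

    for i in range(n):
        # Label-aware confidence for this instance (one weight per classifier)
        c_i = C_label_aware[:, i]
        # Weighted Gram matrix: X @ diag(c_i) @ X.T
        X_weighted = X * c_i[None, :]
        A = X_weighted @ X.T + lambda_I
        # Weighted target: X @ (c_i * r_i)
        b = X_weighted @ R[:, i]
        # Solve: A y_i = b  (the boxed instance update from the derivation)
        Y[:, i] = np.linalg.solve(A, b)

    return Y
```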
### Usage Example

```python
from cfensemble.data import EnsembleData
from cfensemble.optimization import CFEnsembleTrainer

# Create data
data = EnsembleData(R, labels)

# Train with label-aware confidence
trainer = CFEnsembleTrainer(
    n_classifiers=m,
    latent_dim=20,
    rho=0.5,                          # Note: rho used in loss computation, not directly in ALS
    lambda_reg=0.01,
    use_label_aware_confidence=True,  # Enable approximation
    label_aware_alpha=1.0,            # Supervision strength
    max_iter=100
)
trainer.fit(data)

predictions = trainer.predict()
```
## Class-Weighted Gradients for Imbalanced Data

### The Challenge
While label-aware confidence approximates supervision in the ALS updates (X, Y), the aggregator parameters (θ) are trained separately via gradient descent. On imbalanced data (e.g., 10% positive, 90% negative), standard gradient descent can cause aggregator weights to collapse to negative values, destroying predictions.
### The Problem

The standard aggregator gradient treats all instances equally:

$$\nabla_\theta L_{\text{sup}} = \frac{1}{|\mathcal{L}|} \sum_{i \in \mathcal{L}} \big(g_\theta(X^\top y_i) - y_i\big)\, \nabla_\theta\, g_\theta$$

With imbalanced data, the majority class dominates the gradient computation:

```
10% positive: residual ≈ -0.5 (minority)
90% negative: residual ≈ +0.5 (MAJORITY DOMINATES!)
→ Total gradient: 0.9·(+0.5) + 0.1·(−0.5) = +0.40 (positive)
→ Weights decrease continuously → collapse!
```
### The Solution

Class-weighted gradients weight instances by inverse class frequency:

$$\nabla_\theta L_{\text{sup}} = \frac{1}{|\mathcal{L}|} \sum_{i \in \mathcal{L}} w_{y_i}\,\big(g_\theta(X^\top y_i) - y_i\big)\, \nabla_\theta\, g_\theta, \qquad w_c = \frac{|\mathcal{L}|}{2\,|\{i \in \mathcal{L} : y_i = c\}|}$$

This ensures each class (not each instance) contributes equally to the gradient, preventing majority-class domination.
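A minimal sketch of one class-weighted gradient step for the linear-sigmoid aggregator, assuming the balanced weighting \(w_c = |\mathcal{L}| / (2\,n_c)\) shown above; the function name and signature are illustrative.

```python
import numpy as np

def class_weighted_gradient(Z, y, w, b):
    """One class-weighted gradient for g(z) = sigma(w^T z + b).

    Assumes the balanced convention w_c = n_L / (2 * n_c), so each class
    contributes equally to the total gradient.
    """
    n_pos = (y == 1).sum()
    n_neg = (y == 0).sum()
    weights = np.where(y == 1, y.size / (2 * n_pos), y.size / (2 * n_neg))
    p = 1.0 / (1.0 + np.exp(-(w @ Z + b)))   # aggregator probabilities
    resid = weights * (p - y)                # class-weighted residuals
    return (Z @ resid) / y.size, resid.mean()
```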
### Usage

```python
trainer = CFEnsembleTrainer(
    n_classifiers=m,
    latent_dim=20,
    use_label_aware_confidence=True,  # ALS approximation
    use_class_weights=True,           # Aggregator class weighting (default)
    max_iter=100
)
```
Result: Weights remain positive and stable, achieving perfect performance even with 10% positive rate.
See: docs/methods/optimization/class_weighted_gradients.md for complete derivation and experimental results.
## Conclusion

Key takeaways:

1. ALS cannot optimize the combined loss exactly because \(L_{\text{sup}}\) is non-quadratic.

2. Label-aware confidence provides an elegant approximation:
   - Modulates \(\tilde{c}_{ui}\) based on label agreement
   - Preserves the ALS structure (closed-form updates)
   - Incorporates supervision indirectly

3. The mathematical foundation is sound:
   - Each ALS step provably does not increase \(\mathcal{L}_{\text{approx}}\)
   - The approximation is first-order in the supervision gradient
   - Empirically works well for calibrated base models

4. Trade-off with PyTorch:
   - ALS: fast approximate optimization
   - PyTorch: exact but slower optimization
The label-aware confidence trick makes ALS viable for CF-Ensemble while maintaining computational efficiency!
## References

- ALS for matrix factorization: Hu, Koren, Volinsky (2008), "Collaborative Filtering for Implicit Feedback Datasets"
- Knowledge distillation: Hinton, Vinyals, Dean (2015), "Distilling the Knowledge in a Neural Network"
- Confidence weighting: Koren, Bell, Volinsky (2009), "Matrix Factorization Techniques for Recommender Systems"
- Multi-task learning: Chen et al. (2018), "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks"