
ALS Approximation vs. Exact PyTorch Optimization

Category: Implementation Choice
Status: Both valid approaches with different trade-offs
Date: 2026-01-25


TL;DR

Goal: Optimize the combined KD-inspired loss:

\[\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}}(X, Y) + (1-\rho) \cdot L_{\text{sup}}(X, Y, \theta)\]

Two approaches:

  1. ALS with label-aware confidence (APPROXIMATION) - Fast but approximate
  2. PyTorch joint gradient descent (EXACT) - Slower but exact

Neither is "wrong" - they're different algorithmic choices with trade-offs.


The Core Challenge

Why Can't ALS Optimize the Full Loss?

ALS works for quadratic objectives:

\[\min_{X, Y} \sum_{u,i} c_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda(\|X\|^2 + \|Y\|^2)\]

This has closed-form solutions:

  • Fix Y, solve for X: \((YC_uY^\top + \lambda I)x_u = YC_u r_u\)
  • Fix X, solve for Y: \((XC_iX^\top + \lambda I)y_i = XC_i r_i\)
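
To make the closed-form step concrete, here is a minimal NumPy sketch of the fix-Y, solve-for-X half of a sweep; the array shapes and the als_update_x helper are illustrative assumptions, not the library's internals.

import numpy as np

def als_update_x(Y, R, C, lam):
    """One half of an ALS sweep: solve for each x_u with Y held fixed (sketch).

    Assumed shapes (illustrative): Y is (d, n) instance factors, R is (m, n)
    base-classifier scores r_ui, C is (m, n) confidence weights c_ui,
    lam is the L2 regularization strength lambda. Returns X of shape (d, m).
    """
    d, n = Y.shape
    m = R.shape[0]
    X = np.zeros((d, m))
    for u in range(m):
        Cu = np.diag(C[u])                   # per-row confidences as a diagonal matrix
        A = Y @ Cu @ Y.T + lam * np.eye(d)   # (Y C_u Y^T + lambda I)
        b = Y @ Cu @ R[u]                    # Y C_u r_u
        X[:, u] = np.linalg.solve(A, b)      # closed-form solve, O(d^3) per row
    return X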

The supervised loss is NOT quadratic:

\[L_{\text{sup}} = \sum_{i \in \mathcal{L}} \underbrace{-y_i \log \sigma(w^\top(X^\top y_i) + b)}_{\text{non-quadratic!}} - (1-y_i) \log(1-\sigma(\cdot))\]

  • Contains \(\sigma(\cdot)\) (sigmoid)
  • Contains \(\log(\cdot)\)
  • No closed-form solution for \(\nabla_X, \nabla_Y\)
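
To see the non-quadratic structure concretely, the supervised term can be written out directly; the sketch below is illustrative only (variable names and shapes are assumptions).

import numpy as np

def supervised_loss(X, Y, w, b, labels, labeled_idx):
    """Binary cross-entropy over the labeled instances (illustrative sketch).

    z_i = w^T (X^T y_i) + b aggregates the reconstructed column for instance i;
    the sigmoid and log make this objective non-quadratic in X and Y.
    """
    logits = w @ (X.T @ Y[:, labeled_idx]) + b      # shape: (n_labeled,)
    p = 1.0 / (1.0 + np.exp(-logits))               # sigmoid
    y = labels[labeled_idx]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))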

Conclusion: Cannot derive closed-form ALS for the full combined loss.


Approach 1: ALS with Label-Aware Confidence (Approximation)

Strategy

Key insight: Modulate confidence weights to incorporate supervision indirectly.

Instead of directly optimizing:

\[\mathcal{L}_{\text{CF}} = \rho \cdot L_{\text{recon}} + (1-\rho) \cdot L_{\text{sup}}\]

Approximate by optimizing:

\[\mathcal{L}_{\text{approx}} = \sum_{u,i} \tilde{c}_{ui}(r_{ui} - x_u^\top y_i)^2 + \lambda(\|X\|^2 + \|Y\|^2)\]

where \(\tilde{c}_{ui}\) is label-aware:

# For labeled instances (alpha = supervision strength, as in label_aware_alpha):
if y_i == 1:
    c_ui = base_confidence * (1 + alpha * r_ui)        # Reward high predictions
else:
    c_ui = base_confidence * (1 + alpha * (1 - r_ui))  # Reward low predictions

# For unlabeled instances:
c_ui = base_confidence  # Typically |r_ui - 0.5| (prediction certainty)

How This Approximates Supervision

Intuition:

  • High \(c_{ui}\) → ALS prioritizes matching \(r_{ui}\) during reconstruction
  • Label-aware weighting: high \(c_{ui}\) when the prediction agrees with the label
  • Result: ALS "wants" to preserve correct predictions, discard errors

Mathematical connection:

\[\min_X \sum_{u,i} c_{ui}(r_{ui} - x_u^\top y_i)^2 \approx \min_X \left[\text{recon} + \alpha \cdot \text{supervision signal}\right]\]

The label-aware weighting creates an implicit supervision gradient.
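
As a quick worked example of the weighting rule above: with base_confidence = 1.0 and alpha = 1.0, a labeled positive scored 0.9 gets weight 1.9 while one scored 0.1 gets only 1.1, so reconstruction is pulled hardest toward predictions that already agree with the labels. A tiny sketch (the label_aware_confidence helper is illustrative, not the library API):

def label_aware_confidence(r_ui, label, base=1.0, alpha=1.0):
    """Illustrative restatement of the weighting rule above (not the library API)."""
    if label is None:
        return abs(r_ui - 0.5)                  # unlabeled: certainty-style weight
    if label == 1:
        return base * (1 + alpha * r_ui)        # reward high predictions on positives
    return base * (1 + alpha * (1 - r_ui))      # reward low predictions on negatives

print(label_aware_confidence(0.9, 1))   # 1.9 -- agrees with the positive label
print(label_aware_confidence(0.1, 1))   # 1.1 -- disagrees, weighted less
print(label_aware_confidence(0.1, 0))   # 1.9 -- agrees with the negative label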

Implementation

from cfensemble.data import EnsembleData
from cfensemble.optimization import CFEnsembleTrainer

# Create data
data = EnsembleData(R, labels)

# Train with label-aware confidence
trainer = CFEnsembleTrainer(
    n_classifiers=m,
    latent_dim=20,
    rho=0.5,
    use_label_aware_confidence=True,   # Enable approximation
    label_aware_alpha=1.0               # Supervision strength
)
trainer.fit(data)

Trade-offs

Advantages:

  ✅ Fast: O(d³) closed-form ALS updates
  ✅ No autodiff: Works without PyTorch/JAX
  ✅ Interpretable: Confidence weights show which predictions matter
  ✅ Scalable: Parallelizable across classifiers/instances

Disadvantages:

  ❌ Approximate: Not the exact gradient of the combined loss
  ❌ Indirect supervision: α parameter needs tuning
  ❌ Potential instability: Alternating updates can oscillate
  ❌ Limited flexibility: Hard to extend (e.g., attention aggregators)


Approach 2: PyTorch Joint Gradient Descent (Exact)

Strategy

Directly optimize the combined loss via backpropagation:

# Forward pass
R_hat = X.T @ Y
y_pred = aggregator(R_hat)

# Combined loss (exact)
loss = rho * reconstruction_loss(R, R_hat, C) + (1 - rho) * supervised_loss(y, y_pred)

# Backward pass (unified gradients)
loss.backward()  # Computes ∇_X L_CF, ∇_Y L_CF, ∇_θ L_CF

# Update (all parameters together)
optimizer.step()  # X, Y, θ updated consistently
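
A self-contained, runnable version of this loop is sketched below; the toy data, tensor shapes, and the linear aggregator are illustrative assumptions rather than the library's internals, and labeled/unlabeled handling is omitted for brevity.

import torch

m, n, d, rho = 10, 500, 20, 0.5
R = torch.rand(m, n)                       # base-classifier scores
C = torch.ones(m, n)                       # confidence weights
y_true = (torch.rand(n) < 0.1).float()     # toy labels (all instances treated as labeled)

X = torch.randn(d, m, requires_grad=True)  # classifier factors
Y = torch.randn(d, n, requires_grad=True)  # instance factors
w = torch.zeros(m, requires_grad=True)     # linear aggregator weights
b = torch.zeros(1, requires_grad=True)

optimizer = torch.optim.Adam([X, Y, w, b], lr=0.01)

for epoch in range(200):
    optimizer.zero_grad()
    R_hat = X.T @ Y                                     # reconstructed score matrix
    recon = (C * (R - R_hat) ** 2).mean()               # confidence-weighted reconstruction
    logits = w @ R_hat + b                              # linear aggregator over classifiers
    sup = torch.nn.functional.binary_cross_entropy_with_logits(logits, y_true)
    loss = rho * recon + (1 - rho) * sup                # exact combined objective
    loss.backward()                                     # unified gradients for X, Y, w, b
    optimizer.step()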

Why This is Exact

Key property: All gradients computed w.r.t. same unified loss

\[\begin{align}
\nabla_X \mathcal{L}_{\text{CF}} &= \rho \cdot \nabla_X L_{\text{recon}} + (1-\rho) \cdot \nabla_X L_{\text{sup}} \\
\nabla_Y \mathcal{L}_{\text{CF}} &= \rho \cdot \nabla_Y L_{\text{recon}} + (1-\rho) \cdot \nabla_Y L_{\text{sup}} \\
\nabla_\theta \mathcal{L}_{\text{CF}} &= (1-\rho) \cdot \nabla_\theta L_{\text{sup}}
\end{align}\]
  • Each gradient considers both reconstruction AND supervision
  • Updates move in direction that decreases total loss
  • Guaranteed descent (with appropriate learning rate)

Implementation

from cfensemble.optimization import CFEnsemblePyTorchTrainer

# Train with exact optimization
trainer = CFEnsemblePyTorchTrainer(
    n_classifiers=m,
    latent_dim=20,
    rho=0.5,
    max_epochs=200,
    optimizer='adam',
    lr=0.01
)
trainer.fit(data)

Trade-offs

Advantages:

  ✅ Exact: True gradient of the combined loss
  ✅ Unified: All parameters updated consistently
  ✅ Flexible: Easy to extend (deep aggregators, attention, etc.)
  ✅ Modern: Standard approach in deep learning (VAE, multi-task, etc.)
  ✅ GPU acceleration: Can scale to large problems

Disadvantages:

  ❌ Slower per iteration: Gradient computation vs. closed-form updates
  ❌ Requires autodiff: PyTorch/JAX dependency
  ❌ More hyperparameters: Learning rates, optimizers, schedules


The Knowledge Distillation Analogy

How is KD Actually Optimized?

Loss:

\[\mathcal{L}_{\text{KD}} = \rho \cdot \underbrace{KL(q_{\text{teacher}} \| q_{\text{student}})}_{\text{soft targets}} + (1-\rho) \cdot \underbrace{CE(y, q_{\text{student}})}_{\text{hard labels}}\]

Optimization:

# Nobody uses closed-form for KD!
loss = rho * soft_loss + (1 - rho) * hard_loss
loss.backward()
optimizer.step()  # Standard gradient descent
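
A concrete, runnable version of this loss (the temperature T and tensor names are illustrative assumptions):

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, rho=0.5, T=2.0):
    """Soft-target + hard-label KD loss; no closed form, so it is optimized by backprop."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                # rescale to keep gradients comparable across T
    hard = F.cross_entropy(student_logits, targets)
    return rho * soft + (1 - rho) * hard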

Why? Because:

  • KL divergence is not quadratic
  • Cross-entropy is not quadratic
  • No closed-form solution exists

CF-Ensemble is the same: Combined loss requires gradient descent, not closed-form.


The VAE Analogy

VAE loss (the negative ELBO):

\[\mathcal{L}_{\text{VAE}} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{reconstruction}} + \underbrace{KL(q(z|x) \| p(z))}_{\text{prior regularization}}\]

Optimization:

  • Reconstruction: the \(\log p(x|z)\) term (not quadratic in general; depends on the likelihood)
  • KL term: Closed-form for Gaussians, but the combined loss still needs gradients
  • Solution: Reparameterization trick + backprop (not closed-form)

Key insight: Even with some closed-form components, the combined objective usually requires gradient descent.
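
For completeness, a minimal sketch of the two VAE terms, assuming a diagonal-Gaussian posterior and a Bernoulli decoder:

import torch

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu and log_var
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction: negative log-likelihood under a Bernoulli decoder
    recon = torch.nn.functional.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)): closed form for a diagonal Gaussian posterior,
    # but the combined loss is still minimized by gradient descent
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0)
    return recon + kl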


When to Use Which Approach

Use ALS (Approximation) When:

Speed is critical

  • Need fast iterations (O(d³) vs. O(d²n) gradients)
  • Large-scale problems where closed-form helps

No autodiff framework

  • Production environment without PyTorch/JAX
  • Minimal dependencies required

Interpretability matters

  • Want to analyze confidence weights
  • Need to explain which predictions were prioritized

Good enough approximation

  • Label-aware confidence captures the supervision signal adequately
  • Results competitive with exact optimization

Example use cases:

  • Real-time systems needing <100ms inference
  • Embedded systems with limited compute
  • Exploratory analysis where speed matters

Use PyTorch (Exact) When:

Accuracy is critical

  • Need best possible performance
  • Research/publication requiring exact optimization

Complex models

  • Advanced aggregators (attention, transformers)
  • Deep architectures beyond a simple weighted average

GPU available

  • Can leverage hardware acceleration
  • Large-batch training

Standard ML pipeline

  • Already using PyTorch for other models
  • Want consistency with modern ML practices

Example use cases:

  • Production ML systems with GPU infrastructure
  • Research exploring new aggregation architectures
  • Scenarios where PyTorch is already in the stack


Empirical Comparison

Expected Performance (Hypothesis)

Based on approximation theory:

Metric         ALS + Label-Aware   PyTorch Joint
PR-AUC         0.30-0.40           0.35-0.45
Convergence    50-200 iter         30-100 epochs
Speed (CPU)    1-2 sec             5-10 sec
Speed (GPU)    N/A                 1-2 sec
Memory         Low                 Medium

Prediction: PyTorch should be 5-15% better PR-AUC, ALS should be 2-5x faster on CPU.

Benchmark Plan

Run pytorch_vs_als_benchmark.py with both methods:

python examples/benchmarks/pytorch_vs_als_benchmark.py

This will compare:

  1. CF-ALS with label-aware confidence (α=1.0)
  2. CF-PyTorch with exact optimization
  3. Baselines (simple average, stacking)

Across 3 imbalance levels (10%, 5%, 1% positive).


Implementation Recommendations

For Research/Development (Use Both)

Phase 1 - Fast iteration with ALS:

# Quick experiments
trainer_als = CFEnsembleTrainer(
    latent_dim=20,
    use_label_aware_confidence=True,
    max_iter=100
)
trainer_als.fit(data)  # Fast prototyping

Phase 2 - Optimize with PyTorch:

# Best performance
trainer_pt = CFEnsemblePyTorchTrainer(
    latent_dim=20,
    max_epochs=200,
    optimizer='adam'
)
trainer_pt.fit(data)  # Production quality

For Production (Choose One)

CPU-constrained → ALS:

# Optimized for speed
trainer = CFEnsembleTrainer(
    latent_dim=20,
    use_label_aware_confidence=True,
    label_aware_alpha=1.0,
    lambda_reg=0.01
)

GPU-available → PyTorch:

# Optimized for accuracy
trainer = CFEnsemblePyTorchTrainer(
    latent_dim=20,
    device='cuda',
    max_epochs=200,
    optimizer='adamw'
)


Conclusion

Key Takeaways:

  1. ALS cannot exactly optimize the combined loss (supervision term is non-quadratic)
  2. Label-aware confidence is a smart approximation (incorporates supervision via weighting)
  3. PyTorch is the exact solution (like KD, VAE, all modern ML)
  4. Both approaches are valid (trade speed vs. accuracy)

Recommendation:

  • Research: Use PyTorch (exact, flexible, extensible)
  • Production: Evaluate both, choose based on constraints
  • Default: Start with PyTorch unless there is a strong reason to use ALS

The failure wasn't using ALS—it was using ALS without label-aware confidence. With label-aware confidence, ALS becomes a reasonable fast approximation!


References

  1. Matrix Factorization: Koren et al. (2009) - ALS for recommender systems
  2. Knowledge Distillation: Hinton et al. (2015) - Combined soft/hard targets
  3. VAE: Kingma & Welling (2014) - Reparameterization trick
  4. Multi-task Learning: Chen et al. (2018) - Joint gradient descent