
CF-Ensemble: Meta-learning via Latent-Factor-Based Collaborative Filtering

License: MIT | Python 3.10+ | Tests | Code style: black

A breakthrough framework for ensemble classification using collaborative filtering


🌟 Overview

Ensemble learning combines multiple base models to improve predictive performance. This project introduces a novel ensemble transformation stage using latent factor-based collaborative filtering (CF) – an additional layer of meta-learning that transforms base-level predictions before traditional ensemble integration.

💡 The Core Idea

We treat ensemble learning as a collaborative filtering problem:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Recommender Systems         →      Ensemble Learning       │
│  ──────────────────                  ─────────────────      │
│                                                             │
│  👥 Users                    →      🤖 Base Classifiers      │
│  🎬 Items (Movies)           →      📊 Data Points           │
│  ⭐ Ratings (1-5)            →      🎯 Predictions (0-1)     │
│                                                             │
│  Matrix Factorization        →      CF-Ensemble Transform   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
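To make the analogy concrete, the sketch below arranges toy predictions into a matrix R with one row per base classifier and one column per data point. It then lists the entries as (classifier, instance, prediction) triples, the same form a recommender sees as (user, item, rating). The classifier names and numbers are invented for illustration and do not come from the project.

```python
import numpy as np

# Hypothetical toy data: 3 base classifiers scoring 5 data points.
# Each row plays the role of a "user", each column an "item", and each
# entry a "rating" in [0, 1] (the predicted probability of the positive class).
predictions_by_classifier = {
    "logreg": [0.91, 0.12, 0.55, 0.80, 0.05],
    "random_forest": [0.85, 0.20, 0.40, 0.95, 0.10],
    "svm": [0.70, 0.35, 0.60, 0.75, 0.15],
}

classifier_names = list(predictions_by_classifier)
R = np.array([predictions_by_classifier[name] for name in classifier_names])
m, n = R.shape  # m classifiers x n instances

# The same data as (user, item, rating)-style triples.
triples = [
    (name, j, float(R[i, j]))
    for i, name in enumerate(classifier_names)
    for j in range(n)
]

print(f"R has {m} classifiers and {n} instances")
print(triples[:3])
```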

🎯 Why This Matters

Classification in biomedical domains faces unique challenges:

  • ⚖️ Class imbalance and skewed distributions
  • 🔍 Missing values and noisy measurements
  • 🧬 Complex biological relationships that vary by problem
  • 🎲 No consensus on best classifiers (problem-dependent)

Our Solution: Transform ensemble predictions using matrix factorization to:

  1. ✨ Increase reliability of probability estimates
  2. 🔬 Discover patterns in how classifiers perform
  3. 🧭 Interpret results through latent factor analysis
  4. 🎯 Identify challenging instances automatically


📊 Basic Workflow

From Base Classifiers to Final Prediction

flowchart TD
    subgraph group1["📥 Stage 1: Base Prediction & Transformation"]
        A["🤖 Base Classifiers<br/><small>Diverse heterogeneous models</small>"]
        B["📊 Prediction Matrix R<br/><small>m classifiers × n instances</small>"]
        C["✨ CF Transformation<br/><small>Matrix factorization</small>"]
    end

    subgraph group2["📤 Stage 2: Reconstruction & Integration"]
        D["🔄 Reconstructed Matrix P<br/><small>Improved probability estimates</small>"]
        E["🎯 Ensemble Integration<br/><small>Weighted aggregation</small>"]
        F["📈 Final Prediction<br/><small>Class probabilities</small>"]
    end

    A --> B
    B --> C
    C -.->|"Matrix<br/>Factorization"| D
    D --> E
    E --> F

    style A fill:#E3F2FD,stroke:#1976D2,stroke-width:3px,color:#1A237E
    style B fill:#FFF3E0,stroke:#F57C00,stroke-width:3px,color:#E65100
    style C fill:#C8E6C9,stroke:#388E3C,stroke-width:4px,color:#1B5E20
    style D fill:#FFF3E0,stroke:#F57C00,stroke-width:3px,color:#E65100
    style E fill:#B3E5FC,stroke:#0288D1,stroke-width:4px,color:#01579B
    style F fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px,color:#4A148C

    style group1 fill:#BEBEBE,stroke:#CED4DA,stroke-width:2px,color:#495057
    style group2 fill:#BEBEBE,stroke:#CED4DA,stroke-width:2px,color:#495057

Or see the original workflow diagram with the probability matrix view:

[Figure: CF-Ensemble Workflow]

The process consists of three stages:

  1. 🏗️ Ensemble Generation: Train diverse base classifiers
  2. 🔄 Ensemble Transformation (⭐ Our Innovation): Apply CF to transform predictions
  3. 🎯 Ensemble Integration: Combine transformed predictions
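The three stages can be sketched end to end with plain NumPy. This is a minimal illustration, not the project's implementation: it uses unweighted, ridge-regularized alternating least squares on a fully observed matrix, ignores labels and confidence weights, and integrates with a simple mean. The function name `als_factorize` and all parameter values are assumptions made for this example.

```python
import numpy as np


def als_factorize(R, latent_dim=2, reg=0.1, n_iters=50, seed=0):
    """Approximate R (m x n) as X.T @ Y using regularized ALS.

    X has shape (latent_dim, m): one latent vector per classifier.
    Y has shape (latent_dim, n): one latent vector per instance.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.normal(scale=0.1, size=(latent_dim, m))
    Y = rng.normal(scale=0.1, size=(latent_dim, n))
    ridge = reg * np.eye(latent_dim)
    for _ in range(n_iters):
        # Fix Y and solve the ridge normal equations for X.
        X = np.linalg.solve(Y @ Y.T + ridge, Y @ R.T)
        # Fix X and solve the ridge normal equations for Y.
        Y = np.linalg.solve(X @ X.T + ridge, X @ R)
    return X, Y


# Stage 1 (generation): stand-in for base classifier outputs.
# Four noisy "classifiers" observe the same hidden per-instance signal.
rng = np.random.default_rng(42)
hidden_signal = rng.uniform(0.05, 0.95, size=8)
R = np.clip(hidden_signal + rng.normal(scale=0.1, size=(4, 8)), 0.0, 1.0)

# Stage 2 (transformation): factorize, then reconstruct.
X, Y = als_factorize(R, latent_dim=2)
P = np.clip(X.T @ Y, 0.0, 1.0)

# Stage 3 (integration): simple mean over classifiers.
final = P.mean(axis=0)
print(np.round(final, 3))
```

A low latent dimension forces the reconstruction to keep only structure shared across classifiers, which is the intuition behind using the factorization as a denoising step.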

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/pleiadian53/cf-ensemble.git
cd cf-ensemble

# Create environment
mamba env create -f environment.yml
mamba activate cfensemble

# Install package
pip install -e .

Basic Usage

from cfensemble.data import EnsembleData
from cfensemble.optimization import CFEnsembleTrainer

# Your ensemble predictions (m classifiers × n instances)
R = ...  # probability matrix
labels = ...  # ground truth with NaN for unlabeled

# Train CF-Ensemble
ensemble_data = EnsembleData(R, labels)
trainer = CFEnsembleTrainer(latent_dim=10, rho=0.5)
trainer.fit(ensemble_data)

# Get improved predictions
P = trainer.predict(R)  # Reconstructed probabilities
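The `R = ...` and `labels = ...` placeholders above are left for you to fill in. As a rough guide, the sketch below builds toy inputs with the shapes the comments describe: an m × n probability matrix and a length-n label vector that uses NaN for unlabeled instances. The values are invented, and the exact dtypes that `EnsembleData` accepts are an assumption here, so check the package documentation before relying on this.

```python
import numpy as np

# Hypothetical toy inputs: 3 classifiers, 6 instances.
R = np.array([
    [0.92, 0.10, 0.30, 0.81, 0.55, 0.20],
    [0.85, 0.25, 0.15, 0.77, 0.60, 0.35],
    [0.70, 0.05, 0.40, 0.90, 0.45, 0.10],
])

# Ground truth for the first four instances; the last two are unlabeled.
# A float dtype is needed because NaN cannot be stored in an integer array.
labels = np.array([1, 0, 0, 1, np.nan, np.nan], dtype=float)

labeled_mask = ~np.isnan(labels)

# Basic sanity checks before handing the arrays to any trainer.
if R.shape[1] != labels.shape[0]:
    raise ValueError("R must have one column per entry in labels")
if R.min() < 0.0 or R.max() > 1.0:
    raise ValueError("R must contain probabilities in [0, 1]")

print(f"{int(labeled_mask.sum())} labeled, {int((~labeled_mask).sum())} unlabeled")
```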

🎯 Features

✅ Semi-Supervised Learning

  • Leverages unlabeled data to learn classifier reliabilities
  • No labels needed for calibration
  • Optimal at 5-10% minority class (validated!)

✅ Confidence Weighting

  • Multiple strategies (uniform, certainty, label-aware, learned)
  • Handles systematic biases and miscalibration
  • Interpretable confidence weights
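This README does not define the strategies listed above, so the following is only an illustrative guess at what "uniform" and "certainty" weighting could look like. It assumes that certainty is the distance of a prediction from 0.5, rescaled to [0, 1]. The function names are hypothetical and are not the package's API.

```python
import numpy as np


def uniform_confidence(R):
    """Every prediction gets the same weight."""
    return np.ones_like(R)


def certainty_confidence(R):
    """Assumed definition: predictions far from 0.5 get more weight."""
    return 2.0 * np.abs(R - 0.5)


def weighted_average(R, C, eps=1e-12):
    """Confidence-weighted mean over classifiers, one value per instance."""
    return (C * R).sum(axis=0) / (C.sum(axis=0) + eps)


# Two classifiers, two instances (toy values).
R = np.array([
    [0.9, 0.5],
    [0.6, 0.1],
])

uniform_result = weighted_average(R, uniform_confidence(R))
certainty_result = weighted_average(R, certainty_confidence(R))

print("uniform:  ", np.round(uniform_result, 3))
print("certainty:", np.round(certainty_result, 3))
```

In the second column, the first classifier predicts exactly 0.5 and so receives zero weight under the assumed certainty rule, leaving the second classifier's 0.1 as the combined estimate.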

✅ Optimized for Imbalanced Data

  • Best performance at 5% minority class (+3.94% PR-AUC gain)
  • PR-AUC as primary metric
  • Realistic biomedical scenarios (rare diseases, splice sites)
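When reading PR-AUC numbers on imbalanced data, it helps to remember that a classifier with random scores has an expected PR-AUC roughly equal to the positive-class prevalence, not 0.5. The sketch below checks this on synthetic data with 5% positives, using average precision as the PR-AUC estimate. The data are randomly generated and have nothing to do with the project's experiments.

```python
import random


def average_precision(y_true, scores):
    """Mean of precision@k taken at the rank k of each positive instance."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def random_baseline(y_true):
    """Expected PR-AUC of random scores: the fraction of positives."""
    return sum(y_true) / len(y_true)


# Synthetic labels with 5% positives.
y_true = [1] * 1000 + [0] * 19000

rng = random.Random(0)
random_scores = [rng.random() for _ in y_true]

baseline = random_baseline(y_true)
ap_random = average_precision(y_true, random_scores)
ap_perfect = average_precision(y_true, y_true)  # labels used as scores

print(f"baseline={baseline:.3f} random={ap_random:.3f} perfect={ap_perfect:.3f}")
```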

✅ Dual Optimization Backends

  • ALS (Alternating Least Squares): CPU-friendly, stable
  • PyTorch: GPU acceleration for large-scale problems

✅ Comprehensive Documentation

  • Random baseline calculations
  • Clinical significance thresholds
  • State-of-the-art methods comparison (2026)
  • Complete mathematical derivations

📊 Validated Results (2026-01-24)

The 5% Sweet Spot Discovery 🏆

Imbalance        Peak Improvement    Status
10% positives    +1.06%              ✅ Recommended
5% positives     +3.94% 🏆           ✅✅✅ OPTIMAL
1% positives     +0.10%              ❌ Skip

Key Finding: Gains peak at 5% minority class (+3.94%) and are smaller at both 10% (+1.06%) and 1% (+0.10%), so improvement is non-monotonic in the degree of imbalance.

See: Complete Results


📖 Documentation

Essential Reading

Deep Dives


💡 Examples

See Examples for complete runnable examples:

Confidence Weighting

  • quality_threshold_experiment.py - Validate when confidence weighting helps
  • phase3_confidence_weighting.py - Compare all strategies
  • reliability_model_demo.py - Learned reliability weights

Optimization

  • compare_als_pytorch.py - Compare ALS vs PyTorch gradient descent

🤝 Contributing

Contributions welcome! Please open an issue or pull request on GitHub.


📄 License

MIT License - see LICENSE file for details.


📚 Citation

If you use this code in your research, please cite:

@software{cfensemble2026,
  title={CF-Ensemble: Semi-supervised Ensemble Learning with Confidence Weighting},
  author={CF-Ensemble Research Team},
  year={2026},
  url={https://github.com/pleiadian53/cf-ensemble}
}

Documentation site: https://pleiadian53.github.io/cf-ensemble/