Incorporation of Explainable Artificial Intelligence in Ensemble Machine Learning-Driven Pancreatic Cancer Diagnosis

Scientific Reports, 2024

Plain-English Explanations
Pages 1-2
What This Study Is About and Why It Matters

Pancreatic cancer kills over 400,000 people annually worldwide, and its five-year survival rate remains below 10%. By 2030 it is projected to surpass lung cancer as a leading cause of cancer death. A major reason for these grim numbers is the lack of reliable, non-invasive screening tools. Traditional imaging methods such as CT, MRI, PET, and endoscopic ultrasonography (EUS) are expensive, have limited sensitivity, and struggle to detect small lesions because of the deep anatomical position of the pancreas.

This study, published in Scientific Reports (2024) by researchers at King Saud University and Near East University, proposes an alternative approach: using machine learning (ML) algorithms trained on urine biomarkers to classify pancreatic cancer cases. The key advantage of urine-based testing is that it is completely non-invasive, low cost, and allows for easy repeat measurements. Unlike blood-based biomarkers such as CA 19-9, urine collection requires no needles, no fasting, and no specialized equipment.

The authors trained six conventional ML models and developed a novel ensemble voting classifier, then hybridized it with each standalone model to create six additional hybrid algorithms. Crucially, they incorporated Shapley Additive Explanations (SHAP), an explainable AI (XAI) technique, to make the model outputs interpretable to clinicians. The goal is not just high accuracy but also transparency: understanding which features drive each prediction so that healthcare professionals can trust and act on the results.

TL;DR: This study builds ensemble and hybrid ML models trained on urine biomarkers to classify pancreatic cancer non-invasively, then uses SHAP to explain which clinical features drive each prediction, aiming for both high accuracy and clinical transparency.
Pages 3-4
The Urine Biomarker Dataset and How It Was Prepared

The study used a publicly available Kaggle dataset originally assembled by Debernardi et al., containing 590 urine samples with 14 attributes. Samples fell into three categories: 183 healthy controls (no pancreatic conditions), 208 benign cases (including chronic pancreatitis, gallbladder disorders, cystic lesions, and gastrointestinal symptoms), and 199 malignant cases (confirmed PDAC). All samples were collected before surgery or chemotherapy and were age- and sex-matched where feasible.

The clinical attributes included age, sex, plasma CA 19-9, creatinine, LYVE1, REG1B, TFF1, and REG1A. The output variable was reclassified into a binary target: "cancerous" versus "non-cancerous" (grouping healthy and benign cases together). Several non-predictive columns (sample ID, patient cohort, sample origin) were removed during feature selection.

For preprocessing, categorical variables were label-encoded (e.g., sex was transformed to 1 for male, 0 for female), and missing numerical values were filled using mean imputation. The dataset was split 80/20 for training and testing, yielding 118 test samples. GridSearchCV was applied for hyperparameter tuning across all models to ensure optimal performance.
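These preprocessing steps can be sketched in plain Python. The column names and values below are illustrative stand-ins for the urine-biomarker rows; the paper itself used scikit-learn utilities (and GridSearchCV for tuning), which are omitted here:

```python
import random

# Toy records mimicking the urine-biomarker rows (values are illustrative).
rows = [
    {"sex": "M", "age": 60, "LYVE1": 4.2, "diagnosis": 1},
    {"sex": "F", "age": 55, "LYVE1": None, "diagnosis": 0},
    {"sex": "F", "age": 47, "LYVE1": 1.1, "diagnosis": 0},
    {"sex": "M", "age": 71, "LYVE1": 6.8, "diagnosis": 1},
    {"sex": "F", "age": 63, "LYVE1": 2.5, "diagnosis": 1},
]

# Label encoding: sex -> 1 for male, 0 for female (as in the paper).
for r in rows:
    r["sex"] = 1 if r["sex"] == "M" else 0

# Mean imputation: fill missing LYVE1 values with the column mean.
observed = [r["LYVE1"] for r in rows if r["LYVE1"] is not None]
mean_lyve1 = sum(observed) / len(observed)
for r in rows:
    if r["LYVE1"] is None:
        r["LYVE1"] = mean_lyve1

# 80/20 train-test split (shuffled with a fixed seed for reproducibility).
random.seed(0)
shuffled = rows[:]
random.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]
```

On the real 590-sample dataset the same 80/20 cut yields the 118 test samples reported in the study.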

TL;DR: The dataset contained 590 urine samples with biomarkers like LYVE1, REG1B, TFF1, REG1A, and plasma CA 19-9. After preprocessing (label encoding, mean imputation, feature selection), the data was split 80/20 and used to train six ML models with GridSearchCV tuning.
Pages 4-5
The Six Standalone Classification Algorithms

The study evaluated six conventional ML classifiers, each chosen for distinct strengths. Logistic Regression (LR) was included for its simplicity, low computational cost, and reliability in binary classification. K-Nearest Neighbors (KNN) is a non-parametric method that classifies samples based on the majority class of their nearest neighbors, making it effective at detecting local data patterns. Random Forest (RF) builds multiple independent decision trees on random data subsets and uses majority voting for the final prediction, providing robust performance across diverse datasets.

Support Vector Machine (SVM) finds the optimal hyperplane separating classes by maximizing the margin between data points, making it well-suited for complex, high-dimensional datasets despite higher computational cost. Naive Bayes (NB) applies Bayes' theorem with the assumption that features are independent, computing posterior probabilities for each class and assigning the sample to the most probable category. Decision Tree (DT) makes hierarchical splits on attributes to partition data into increasingly pure subgroups, offering straightforward interpretability.

Each model was selected to bring a different algorithmic perspective: parametric versus non-parametric, linear versus nonlinear, low versus high computational cost. This diversity was intentional, as combining complementary classifiers in an ensemble can compensate for individual weaknesses and improve generalization on unseen clinical data.

TL;DR: Six ML models were used: Logistic Regression, KNN, Random Forest, SVM, Naive Bayes, and Decision Tree. Each brings different strengths (simplicity, robustness, interpretability, handling complexity), and their diversity was key to building a stronger ensemble classifier.
Pages 5-7
How Each Individual Model Performed

Among the six standalone classifiers, Random Forest and Naive Bayes tied for the highest accuracy at 94.07%. RF achieved the best recall (96.25%), F1-score (95.65%), and AUC-ROC (99.08%), making it the strongest all-around performer. NB had the highest precision at 98.67%, meaning that when it predicted a case as cancerous, it was almost always correct, though it missed more true positives (recall of 92.50%).

Decision Tree placed third with 89.83% accuracy and balanced precision/recall of 92.50% each. Logistic Regression achieved 86.44% accuracy with a precision of 91.03%. KNN reached 83.90% accuracy but maintained strong recall at 92.50%. SVM had the lowest accuracy at 78.81% and precision at 77.78%, though its recall matched RF at 96.25%, indicating it rarely missed cancerous cases but produced many false positives.
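The reported metrics all follow the standard confusion-matrix definitions. As a minimal sketch, the function below computes them from raw counts; the counts used here are hypothetical (the paper does not publish its confusion-matrix cells), chosen so that they would reproduce the ensemble's headline figures if the 118-sample test set contained 80 cancerous cases, which is an assumption:

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 118-sample test set (illustrative, not the paper's).
acc, prec, rec, f1 = metrics(tp=77, fp=1, fn=3, tn=37)
# acc ~ 0.9661, prec ~ 0.9872, rec = 0.9625, f1 ~ 0.9747
```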

These results established that RF and NB were the two best-performing models, making them natural candidates for the ensemble voting classifier. The performance gap between the top and bottom models (94.07% vs. 78.81% accuracy) underscored the importance of algorithm selection and demonstrated that no single model was optimal across all metrics simultaneously.

TL;DR: Random Forest and Naive Bayes both achieved 94.07% accuracy, with RF leading on AUC-ROC (99.08%) and NB on precision (98.67%). SVM had the lowest accuracy (78.81%) but high recall, showing that algorithm choice significantly affects diagnostic performance.
Pages 6-8
Building and Evaluating the Ensemble Model

The ensemble voting classifier combined the two highest-accuracy standalone models (RF and NB) using soft voting, which averages the predicted probabilities from each model rather than simply counting majority votes. This approach leverages RF's strong recall and AUC performance alongside NB's exceptional precision, creating a classifier that balances sensitivity and specificity.
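Soft voting itself is simple to state in code: average the per-class probability vectors and pick the class with the highest mean. The sketch below uses hard-coded mock probabilities in place of trained RF and NB models (the paper built the real thing with scikit-learn classifiers):

```python
def soft_vote(prob_lists, weights=None):
    """Average class-probability vectors from several models (soft voting)."""
    n = len(prob_lists)
    weights = weights or [1.0 / n] * n
    n_classes = len(prob_lists[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_lists))
            for c in range(n_classes)]

# Mock per-class probabilities [P(non-cancerous), P(cancerous)] for one sample.
rf_probs = [0.10, 0.90]   # stand-in for Random Forest output
nb_probs = [0.30, 0.70]   # stand-in for Naive Bayes output

avg = soft_vote([rf_probs, nb_probs])              # -> [0.20, 0.80]
label = max(range(len(avg)), key=avg.__getitem__)  # predicted class index
```

Because probabilities rather than hard votes are averaged, a confident model can outweigh an uncertain one, which is how RF's recall and NB's precision complement each other.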

The ensemble model outperformed all standalone algorithms, achieving 96.61% accuracy, 98.72% precision, 96.25% recall, 97.47% F1-score, and 98.98% AUC-ROC. These results surpassed prior ensemble methods in the literature, including a rand index classifier with gradient descent (92.0% accuracy), a stacking ensemble (91.0% AUC), a boosting ensemble (82.6% accuracy), and an XGBoost-based model (94.0% accuracy).

The confusion matrix analysis confirmed that the ensemble model was highly specific and sensitive in distinguishing cancerous from non-cancerous cases. Its precision of 98.72% is particularly important in a clinical context: it means that nearly every positive prediction corresponds to a true cancer case, minimizing unnecessary follow-up procedures, patient anxiety, and healthcare costs associated with false positives.

TL;DR: The soft-voting ensemble of RF and NB achieved 96.61% accuracy and 98.72% precision, outperforming all standalone models and prior published ensemble approaches. Its near-perfect precision minimizes false positives, which is critical for clinical deployment.
Pages 7-9
Novel Hybrid Models: Combining the Ensemble with Each Standalone Classifier

After building the ensemble voting classifier, the authors hybridized it with each of the six standalone models, producing six novel hybrid algorithms: Voting Classifier-RF, Voting Classifier-NB, Voting Classifier-SVM, Voting Classifier-DT, Voting Classifier-KNN, and Voting Classifier-LR. This hybridization leverages the ensemble's overall strength while incorporating each individual model's unique perspective on the data.
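One plausible reading of this hybridization (the paper does not spell out the exact combination rule) is a second soft-voting layer over the RF+NB ensemble and the chosen standalone model. A minimal sketch under that assumption, again with mock probabilities:

```python
def average(prob_lists):
    """Equal-weight soft vote over class-probability vectors."""
    n = len(prob_lists)
    return [sum(p[c] for p in prob_lists) / n for c in range(len(prob_lists[0]))]

# Mock probabilities [P(non-cancerous), P(cancerous)] for one test sample.
rf  = [0.10, 0.90]
nb  = [0.30, 0.70]
svm = [0.60, 0.40]   # a weak standalone vote for this sample

ensemble = average([rf, nb])            # the RF+NB voting classifier
hybrid_svm = average([ensemble, svm])   # "Voting Classifier-SVM" hybrid

# The ensemble's confidence tempers the standalone SVM's mistake:
prediction = 1 if hybrid_svm[1] > 0.5 else 0
```

In this toy case the standalone SVM would have voted non-cancerous, but the hybrid still classifies the sample correctly, mirroring how hybridization lifted SVM's accuracy in the study.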

The Voting Classifier-RF hybrid achieved the best overall performance among the six hybrid models, with 94.92% accuracy, 96.25% precision and recall, and an AUC of 99.05% (95% CI: 0.93-1.00). Voting Classifier-NB and Voting Classifier-SVM also reached 94.92% accuracy, though with different precision-recall trade-offs. Even weaker standalone models showed substantial improvement when hybridized: SVM jumped from 78.81% to 94.92% accuracy, and KNN rose from 83.90% to 94.07%.

All six hybrid models exceeded the performance of their corresponding standalone counterparts, confirming that the hybridization strategy successfully compensates for individual model weaknesses. The AUC-ROC curves for the hybrid models were consistently higher, demonstrating improved discrimination between cancerous and non-cancerous cases across all classification thresholds.

TL;DR: Hybridizing the ensemble with each standalone model boosted every classifier's performance. Voting Classifier-RF achieved the highest AUC (99.05%), and even SVM's accuracy jumped from 78.81% to 94.92% through hybridization, proving the approach compensates for individual model weaknesses.
Pages 9-10
Five-Fold Cross-Validation for Model Robustness

To verify that the results were not artifacts of a single train-test split, the authors applied 5-fold cross-validation. This technique divides the data into five equal subsets, trains on four, tests on the remaining one, and rotates through all five combinations. The 5-fold approach was chosen over 10-fold because it maintains class balance within each fold and offers better computational efficiency without a meaningful loss of accuracy, especially for low-variance models like LR and KNN.
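The fold rotation can be sketched as plain index bookkeeping. Note this sketch is unstratified; preserving the class balance the authors describe requires stratified folds, such as scikit-learn's StratifiedKFold:

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # Last fold absorbs any remainder so every sample is tested exactly once.
        stop = n_samples if i == k - 1 else start + fold_size
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

# With the study's 590 samples, each fold tests 118 and trains on 472.
folds = list(k_fold_indices(590, k=5))
```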

Cross-validation confirmed Random Forest as the most robust model, both standalone (mean accuracy 0.92 +/- 0.01, AUC 0.98) and in hybrid form (Voting Classifier-RF: 0.92 +/- 0.01, AUC 0.98, 95% CI: 0.87-0.99). The tight standard deviation of 0.01 indicates highly consistent performance across different data partitions. SVM remained the weakest standalone model (0.77 +/- 0.03) but showed major improvement when hybridized (0.89 +/- 0.02).

The cross-validation results were statistically significant, with the Voting Classifier-RF hybrid surpassing all other models. The consistency between the single train-test split results and the cross-validation findings strengthens confidence that the models generalize well and are not overfitting to the specific training data partition.

TL;DR: Five-fold cross-validation confirmed Random Forest as the most robust model (0.92 +/- 0.01 accuracy, AUC 0.98) both standalone and hybridized. The tight standard deviations across folds demonstrate that the models generalize reliably and are not overfitting.
Pages 10-11
What SHAP Reveals About the Most Important Biomarkers

SHAP (Shapley Additive Explanations) is a game-theory-based XAI method that assigns each feature an importance value for every individual prediction. It treats each clinical variable as a "player" in a cooperative game, calculating how much each player contributes to the final classification. SHAP supports both global explanations (overall model behavior) and local explanations (why the model made a specific prediction for a specific patient).
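The game-theoretic idea can be made concrete with a brute-force computation of exact Shapley values for a toy model. The SHAP library does this efficiently with specialized explainers; the sketch below enumerates every feature coalition, replacing absent features with a baseline value (a common simplifying convention). The three-feature "risk score" and its weights are entirely made up for illustration:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all feature coalitions.

    Features absent from a coalition are replaced by their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight |S|! (n-|S|-1)! / n! for a coalition S.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy linear "risk score" over (TFF1, LYVE1, creatinine) -- illustrative only.
predict = lambda v: 0.5 * v[0] + 0.3 * v[1] + 0.1 * v[2]
phi = shapley_values(predict, x=[4.0, 2.0, 1.0], baseline=[1.0, 1.0, 1.0])
```

For a linear model the result is simply weight times deviation from baseline, and the values sum to the gap between the sample's prediction and the baseline prediction, which is the "efficiency" property that makes SHAP attributions add up to the model output.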

The SHAP summary plot revealed three features with the greatest positive impact on pancreatic cancer prediction: benign sample diagnosis, TFF1, and LYVE1. Higher values of these three parameters substantially increased the likelihood of a cancer diagnosis. TFF1 (Trefoil Factor 1) and LYVE1 (Lymphatic Vessel Endothelial Hyaluronan Receptor 1) are both urine-measurable biomarkers that have been independently associated with pancreatic malignancy in prior research.

For clinical practice, these SHAP insights mean that when a patient presents with elevated TFF1 and LYVE1 levels, clinicians can use this information to justify ordering confirmatory tests, recommend liquid biopsies, or proceed to advanced imaging. The transparency provided by SHAP also allows oncologists to validate AI predictions against their own clinical judgment rather than relying on opaque black-box outputs.

The authors note that applying SHAP to complex ensemble and hybrid models requires substantial computation, as it must calculate feature contributions across large configuration sets. They suggest two practical solutions: using the SHAP Kernel Explainer technique or performing feature selection before SHAP analysis to reduce dimensionality and speed up real-time deployment in time-sensitive oncologic settings.

TL;DR: SHAP analysis identified TFF1, LYVE1, and benign sample diagnosis as the top three features driving cancer predictions. This transparency allows clinicians to validate AI output against clinical judgment and use elevated biomarker levels to trigger confirmatory testing.
Pages 11-14
What This Study Cannot Do Yet and Where Research Goes Next

Computational complexity: Merging models with different architectures and functionalities demands significant processing time and resources. Real-time deployment of hybrid models in clinical environments may be challenging, particularly in low-resource settings with minimal computing capacity. The SHAP analysis adds further computational overhead, especially for ensemble and hybrid algorithms dealing with high-dimensional data.

Dataset limitations: The study relied on a single publicly available dataset of 590 samples from a specific population with predetermined demographic and clinical characteristics. This raises concerns about generalizability to populations with diverse genetic backgrounds, environmental exposures, and levels of healthcare access. Missing data, which is common in clinical records, was handled with mean imputation, but more sophisticated techniques may be needed for larger, noisier real-world datasets.

Urine biomarker variability: While urine-based testing is non-invasive and cost-effective, biomarker concentrations in urine can fluctuate based on food consumption, metabolic state, and fluid intake. Standardized protocols for sample collection, processing, and measurement are essential before these models can be deployed clinically. Real-world validation studies are needed to confirm that the high accuracies observed in this controlled dataset translate to diverse clinical populations.

Future directions: The authors recommend multi-center data collection across different geographic regions and ethnic populations, integration of multi-omics data (genomics, proteomics, metabolomics), and exploration of multimodal algorithms combining urine biomarkers with imaging data such as CT, PET, and MRI. They also highlight the potential of transfer learning for adapting models trained on one cancer type to related malignancies, and the growing role of AIoT and IoMT for remote monitoring and continuous early detection.

TL;DR: Key limitations include a single-site dataset (590 samples), computational demands of hybrid models, and urine biomarker variability from diet and metabolism. Future work should focus on multi-center validation, multi-omics integration, multimodal imaging, and standardized sample collection protocols.
Citation: Almisned FA, Usanase N, Ozsahin DU, Ozsahin I. Open Access, 2025. Available at: PMC12018965. DOI: 10.1038/s41598-025-98298-0. License: CC BY.