Pancreatic cancer (PC) remains one of the deadliest malignancies worldwide, with a five-year survival rate persistently below 10%. It ranks as the 12th most common cancer globally but is the third leading cause of cancer-related mortality in high-income countries such as Australia and the United States. In Saudi Arabia, the incidence has climbed dramatically, from just 131 new cases in 2005 to 579 in 2022, and the disease carries the lowest five-year survival rate among all cancers in the Kingdom. The dismal prognosis is driven primarily by late-stage diagnosis, aggressive tumour biology, and limited treatment efficacy. Surgical resection combined with adjuvant chemotherapy remains the gold standard, but eligibility is restricted to patients with localised disease and adequate performance status.
Diagnostic bottlenecks: No safe and effective population-level screening method currently exists for detecting PC at asymptomatic or early stages. Conventional imaging modalities, including endoscopic ultrasonography (EUS), CT, MRI, and PET, are hindered by high costs, limited sensitivity, and anatomical constraints associated with the pancreas's retroperitoneal location. Serum carbohydrate antigen 19-9 (CA19-9) shows promise as a biomarker for tailoring treatment intensity, with markedly elevated levels (>500 U/mL) in anatomically resectable PC potentially justifying intensive neoadjuvant chemotherapy. However, the absence of standardised thresholds and inconsistent predictive performance limit its clinical utility.
The machine learning opportunity: Machine learning (ML) can integrate heterogeneous clinical, imaging, and molecular variables to detect patterns invisible to human observation. By incorporating demographics, patient history, laboratory markers, imaging features, and pathology findings, ML-based systems can support earlier diagnosis, improve risk stratification, and assist in treatment planning. However, real-world adoption remains limited by small and heterogeneous datasets, lack of external validation, and the "black box" nature of many high-performing models. Even minor errors in oncology diagnostics can have serious consequences, making model reliability, reproducibility, and interpretability essential for safe clinical deployment.
This review by Alharbi and Alfayez (2025) is among the first technically rigorous reviews dedicated exclusively to the role of explainable artificial intelligence (XAI) in ML-based prediction of PC. It is the first article in a two-part series, with a companion paper addressing feature engineering and clinical integration strategies.
The authors conducted a focused literature search using PubMed as the primary database and Google Scholar for supplementary retrieval. Subscription-based databases such as Scopus and Web of Science were not included due to institutional access constraints. The search targeted peer-reviewed studies published between 2020 and 2025 that were relevant to ML and XAI in pancreatic cancer prediction.
Selection process: From an initial pool of 177 records, the authors removed duplicates, restricted results to full-text peer-reviewed articles, and screened for studies specifically addressing ML-based early prediction of PC. This process yielded 21 eligible ML studies. Of those 21, only three directly incorporated XAI methods. The search strategy, filtering steps, and eligibility criteria are summarised in a flow diagram (Figure 1). The authors also present a temporal trend plot (Figure 2) illustrating the emergence of ML and XAI studies in PC from 2020 to 2025, showing that despite a gradual increase in ML publications, XAI adoption remains minimal.
Scope and classification: The review grouped methods by model architecture, data modality, and interpretability framework. Rather than conducting a meta-analysis, the authors synthesised findings narratively to evaluate the technical underpinnings, interpretability outcomes, and clinical relevance of XAI applications. The review focuses specifically on the three studies that integrated XAI, while feature-engineering works and clinical variables are addressed in the companion review.
This methodological approach has notable limitations. Restricting the search to PubMed and Google Scholar may have excluded relevant studies indexed only in Scopus or Web of Science. Additionally, although the small number of eligible XAI studies (3 of 21) reflects the genuine scarcity of explainability research in pancreatic cancer rather than a filtering artefact, it narrows the evidence base on which the review's conclusions rest.
The review provides a thorough overview of ML methods relevant to PC prediction. Supervised learning algorithms, including logistic regression (LR), support vector machines (SVM), random forests (RF), and gradient boosting (GB), use labelled datasets to predict known outcomes. These are applied to classification tasks (e.g., distinguishing malignant from benign lesions) and regression tasks (e.g., estimating survival probabilities). Unsupervised learning techniques, including principal component analysis (PCA), k-means clustering, Gaussian mixture models, DBSCAN, and BIRCH, operate on unlabelled data to uncover latent patterns, identify patient subgroups, and reduce dimensionality.
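To make the supervised/unsupervised distinction concrete, the sketch below (not drawn from the review; the data are synthetic and the variables are illustrative) fits one classifier from the supervised families listed above and then applies PCA and k-means to the same data without labels.

```python
# Minimal sketch contrasting supervised and unsupervised learning on
# synthetic data; clinical variables are simulated, not real.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Supervised: labelled data -> classify malignant (1) vs benign (0) lesions.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Unsupervised: unlabelled data -> reduce dimensionality, find subgroups.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)
subgroups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print("Patients per latent subgroup:", np.bincount(subgroups))
```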
Deep learning (DL): A specialized branch of ML based on artificial neural networks (ANNs), DL expands these capabilities through multi-layered architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These models extract hierarchical representations from high-dimensional data and have achieved state-of-the-art performance in oncology tasks including tumour detection, segmentation, staging, mutation prediction, and radiomics-based analysis. By integrating demographics, comorbidities, biomarker trends, imaging features, and pathology findings, ML systems can identify high-risk individuals earlier and more consistently than rule-based approaches.
Persistent barriers: Despite these strengths, several obstacles limit real-world application: heterogeneous and incomplete clinical datasets, limited external validation, challenges in multimodal data harmonisation, poor workflow integration, and the inherent opacity of many high-performing models. The authors position these barriers as the core motivation for XAI. Transparent, clinically aligned insights into model decision-making are needed to bridge the gap between algorithmic performance and clinical usability.
The review categorises XAI methods across three key dimensions. The first dimension, staging, differentiates between ante-hoc and post-hoc methods. Ante-hoc (intrinsic) explainability is embedded directly within the model architecture, such as decision trees or rule-based models. Post-hoc methods are applied after model training to approximate or uncover the reasoning behind predictions from complex architectures like deep neural networks.
The second dimension, model compatibility, distinguishes model-agnostic methods (applicable to any ML algorithm regardless of structure) from model-specific methods (tailored to particular architectures). SHAP and LIME are model-agnostic, while GradCAM is model-specific to CNNs. The third dimension, scope, differentiates between local methods (explaining individual predictions, critical for case-by-case clinical decisions) and global methods (characterising overall model behaviour across the dataset).
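As an illustration of a model-agnostic, local method, the hedged sketch below applies LIME to a single prediction from the classifier trained in the earlier sketch (it assumes `clf`, `X_train`, and `X_test` from that block and that the `lime` package is installed; the feature names are generic placeholders).

```python
# Minimal LIME sketch: fit a simple local surrogate around one prediction.
# Assumes `clf`, `X_train`, `X_test` from the earlier supervised sketch.
from lime.lime_tabular import LimeTabularExplainer

feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]
lime_explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["benign", "malignant"],
    mode="classification",
)

# Local explanation for a single patient: works with any model exposing
# predict_proba, regardless of its internal structure (model-agnostic).
explanation = lime_explainer.explain_instance(
    X_test[0], clf.predict_proba, num_features=4
)
print(explanation.as_list())  # (feature condition, local weight) pairs
```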
SHAP has emerged as a reference standard for clinical model interpretation. It uses game theory to attribute importance to each feature by computing Shapley values, offering both local and global explanations. While SHAP values are inherently local, they can be aggregated across patients to yield global insights through mean absolute SHAP values or global importance plots. LIME builds simple surrogate models locally around each prediction but is less stable. GradCAM highlights image regions relevant to CNN-based predictions using gradient information, supporting visual diagnostic interpretation. Partial Dependence Plots (PDPs) show the average effect of a feature on predictions but are limited by their assumption of feature independence. DeepLIFT traces contributions of neurons by comparing activations to a reference baseline, useful for genomic networks.
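The local-to-global SHAP workflow described above can be sketched as follows (again reusing `clf` and `X_test` from the earlier synthetic example; the `shap` package is assumed to be installed).

```python
# Minimal SHAP sketch: per-patient (local) attributions aggregated into
# global importance via mean absolute SHAP values.
import numpy as np
import shap

explainer = shap.TreeExplainer(clf)          # tree-ensemble-specific SHAP variant
shap_values = explainer.shap_values(X_test)  # shape: (n_patients, n_features)

# Local: contribution of each feature to one patient's predicted risk.
print("Patient 0 feature contributions:", shap_values[0])

# Global: aggregate local attributions across the cohort.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(mean_abs_shap)[::-1]
print("Features ranked by mean |SHAP|:", ranking)
```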
In practice, post-hoc approaches dominate clinical applications because they can be applied without retraining models and generate outputs (feature rankings, heatmaps, rule sets) that align with clinical reasoning. However, choosing the right method is nontrivial, as techniques vary in assumptions, stability, and interpretive depth.
Almisned et al. (2025): This study proposed an ensemble ML framework for early PC detection using clinical and biomarker features. Six ML algorithms were evaluated alongside an ensemble voting classifier, with SHAP applied for model interpretation. SHAP identified benign sample diagnosis, Trefoil Factor 1 (TFF1), and Lymphatic vessel endothelial hyaluronan receptor 1 (LYVE1) as the top predictors with strong positive influence on early-stage PC prediction. Elevated LYVE1 levels were consistently associated with malignancy, suggesting the need for targeted interventions such as non-invasive imaging or liquid biopsies. SHAP also enabled patient-level risk assessments, reinforcing its value for multidisciplinary decision-making and tailored treatment planning.
Keyl et al. (2022): This study applied SHAP within a random survival forest (RSF) model trained on clinical data from 203 patients with advanced pancreatic ductal adenocarcinoma (PDAC). Baseline predictors included CA19-9, C-reactive protein (CRP), neutrophil-to-lymphocyte ratio (NLR), age, and metastatic status. SHAP analysis revealed CRP and NLR as the dominant predictors of poor survival, followed by age and CA19-9. Higher serum protein levels and absence of metastatic disease (M0 status) were associated with improved survival outcomes. This study validates XAI's potential not just in classification tasks but also in survival modelling critical to advanced-stage cancer management.
Chen et al. (2023): This study compared the predictive performance of RSF, eXtreme gradient boosting (XGB), and Cox proportional hazards regression using electronic health record (EHR) data from two US healthcare systems. The study incorporated SurvSHAP, an XAI method specifically tailored for survival models. Age emerged as the most influential predictor across all three models, while abdominal pain contributed minimally in the RSF and XGB models but was more prominent in the Cox regression model. SurvSHAP visualisations uncovered heterogeneous predictive logic across algorithms, increasing transparency and confidence in model outputs.
Together, these three studies illustrate how SHAP and SurvSHAP can enhance model transparency, clinical trust, and actionability. They also reflect the growing recognition that interpretability is not optional but a core necessity for translating AI solutions into meaningful clinical practice.
The review draws a careful distinction between interpretability and explainability, two terms often used interchangeably but embodying distinct objectives. Interpretability refers to models whose internal structure is directly understandable by humans without external explanatory tools. This is achieved through domain-aligned constraints such as sparsity, rule-based logic, monotonicity, or additive structure. Examples include logistic regression with clinically meaningful coefficients, decision trees, Certifiably Optimal Rule Lists (CORELS), and generalized additive models (GAMs). In oncology, such models enable clinicians to confirm that predictors align with biological and epidemiological knowledge, for example, elevated CA19-9 or rapid weight loss increasing PC risk.
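For illustration, a minimal sketch of one such intrinsically interpretable model: a sparse logistic regression whose coefficients translate directly into odds ratios. The cohort and feature names below are synthetic stand-ins, not values from the review.

```python
# Minimal sketch of an ante-hoc interpretable model: L1-regularised logistic
# regression on standardised features, read off as odds ratios per SD.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["ca19_9", "weight_loss", "age", "new_onset_diabetes"]

# Synthetic cohort in which elevated CA19-9 and weight loss raise risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, len(feature_names)))
logits = 1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.2 * X[:, 2]
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-logits))).astype(int)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_[0]
for name, beta in zip(feature_names, coefs):
    # exp(beta): odds ratio per one standard deviation of the feature, which
    # clinicians can check against epidemiological expectations.
    print(f"{name}: OR per SD = {np.exp(beta):.2f}")
```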
Explainability, in contrast, refers to post-hoc analytic techniques that approximate or summarise the behaviour of complex, non-transparent "black-box" models after training. Methods such as SHAP, SurvSHAP, LIME, GradCAM, Anchors, and Partial Dependence Plots generate feature-level or instance-level explanations without exposing internal computations. These tools are widely used because high-performing models like XGBoost, random forests, and deep neural networks often sacrifice inherent transparency to maximise predictive accuracy.
The authors present interpretability, explainability, and completeness as overlapping but non-identical constructs (Figure 4). Completeness describes the degree to which either approach faithfully captures the true computational or causal pathways. As Rudin (2019) argues, post-hoc explanations cannot fully replicate a black-box model's reasoning. If an explanation perfectly captured the model, the model itself would be unnecessary. This fidelity gap is particularly consequential in medicine, where misaligned or incomplete explanations can mislead clinicians, obscure confounding, or mask algorithmic biases.
The authors advocate a hybrid strategy: prioritising interpretable models when feasible and applying rigorous, high-fidelity post-hoc XAI when complex models are unavoidable. This provides the most robust pathway toward clinically actionable and ethically sound AI for PC prediction.
Physician trust deficit: Many high-performing ML models operate as black boxes with deeply layered, nonlinear structures containing millions of parameters, making their decision processes inherently opaque. Even with XAI tools, explanations often remain partial. Attribution heatmaps may highlight influential features without explaining their clinical significance. For frontline clinicians lacking advanced ML or statistical training, these explanations may be inaccessible or insufficiently actionable, widening the gap between developers and end-users.
Methodological instability: Trust is further undermined by the inconsistency of some XAI methods, whose explanations can be unstable or misleading. Several studies question the reliability of post-hoc explanations, citing their susceptibility to instability when subjected to minor data perturbations. The lack of standardised metrics for assessing explanation quality worsens this issue, limiting comparison across clinical contexts. Without benchmarks analogous to AUC or accuracy for predictive performance, there is no consistent way to evaluate whether an explanation is faithful to the model's actual reasoning.
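In the absence of such benchmarks, one simple ad-hoc probe (a sketch, not a standard proposed by the review) is to perturb a patient's input slightly and measure how much the attribution ranking moves; it reuses `explainer` and `X_test` from the SHAP sketch above and assumes SciPy is available.

```python
# Minimal stability probe: rank correlation of SHAP attributions before and
# after small input perturbations. Assumes `explainer`, `X_test` from the
# earlier SHAP sketch.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = X_test[:1]
base_attr = explainer.shap_values(x)[0]

rhos = []
for _ in range(20):
    x_perturbed = x + rng.normal(scale=0.01 * x.std(), size=x.shape)
    perturbed_attr = explainer.shap_values(x_perturbed)[0]
    rho, _ = spearmanr(base_attr, perturbed_attr)
    rhos.append(rho)

# Low or highly variable rank correlation flags an unstable explanation.
print("Mean rank correlation under perturbation:", np.mean(rhos))
```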
System-level barriers: At the infrastructure level, limited EHR integration, non-intuitive interfaces, and shifting regulatory standards further erode clinician confidence in AI-driven recommendations. Current studies are largely confined to single-institution cohorts or limited datasets, which restricts generalisability. The adoption of XAI in PC prediction is strikingly limited, with only a handful of studies incorporating interpretability frameworks. Even within these applications, explanations are typically restricted to surface-level feature attributions, rarely aligned with pathophysiological knowledge.
The review also acknowledges its own methodological limitations. Because only PubMed and Google Scholar were searched, relevant studies indexed exclusively in Scopus or Web of Science may have been missed, and the small number of eligible XAI studies makes it difficult to draw broad conclusions with confidence.
Standardised benchmarks: The authors call for developing standardised evaluation metrics for explanation quality, analogous to AUC or accuracy metrics for predictive performance. This would allow robust clinical evaluation and meaningful comparison of XAI approaches across studies and clinical settings.
External validation and federated learning: Future work should prioritise scalability and cross-institutional validation. Current studies are confined to single centres or limited datasets, restricting generalisability. The authors suggest that cross-institutional federated learning frameworks combined with XAI could strengthen model robustness while preserving data privacy. Integration into EHR systems and clinical workflows is similarly underdeveloped, and effective deployment will require intuitive, clinician-facing interfaces where model explanations are presented alongside conventional diagnostic tools.
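As a toy illustration of the federated idea (not the authors' proposal), the sketch below trains a logistic model at each of three simulated institutions and aggregates only the fitted coefficients, FedAvg-style, so that no patient-level records leave a site.

```python
# Minimal federated-averaging sketch over three simulated institutions.
# All data are synthetic; only model parameters are shared with the "server".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_site(n_patients):
    X = rng.normal(size=(n_patients, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_patients) > 0)
    return X, y.astype(int)

sites = [make_site(n) for n in (200, 350, 150)]

# Each institution fits locally on its private data.
coefs, intercepts, sizes = [], [], []
for X, y in sites:
    local = LogisticRegression().fit(X, y)
    coefs.append(local.coef_[0])
    intercepts.append(local.intercept_[0])
    sizes.append(len(y))

# Central aggregation: sample-size-weighted average of parameters.
weights = np.array(sizes) / np.sum(sizes)
global_coef = np.average(coefs, axis=0, weights=weights)
global_intercept = np.average(intercepts, weights=weights)
print("Federated coefficients:", global_coef)
print("Federated intercept:", float(global_intercept))
```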
Multimodal explainability: PC prediction increasingly involves EHR variables, biomarkers, imaging, and genomics. Future work should aim to unify explanations across these modalities, enabling a holistic understanding of patient-level insights and supporting collaborative decision-making in multidisciplinary oncology boards. Moving beyond feature rankings to clinically grounded narratives that link model predictions to biological plausibility and treatment relevance is essential.
Human-centred design: Technical innovation in explainability will have limited impact unless it resonates with clinician reasoning and patient communication. The authors emphasise human-centred design, co-development with end-users, and iterative feedback between data scientists and healthcare providers. The authors also note that they are conducting an IRB-approved study (NRR25/67/3, KAIMRC) developing an ML-based model for predicting PC risk in patients with chronic metabolic disorders using local EHR data, with XAI integration to ensure interpretability and clinical trustworthiness.