Artificial intelligence in breast cancer survival prediction: a comprehensive systematic review and meta-analysis

Frontiers in Oncology, 2024

Plain-English Explanations
Pages 1-2
Why a Systematic Review of AI for Breast Cancer Survival Prediction Is Needed

Breast cancer (BC) remains the most prevalent cancer and the leading cause of cancer-related mortality in women globally. The disease is highly heterogeneous: individual patients can have very different survival outcomes depending on patient demographics (e.g., age), tumor characteristics (e.g., size, lymph node involvement, TNM staging), and biomarker status (e.g., estrogen receptor [ER], progesterone receptor [PR], and HER2 expression). Accurate survival prediction is therefore essential for guiding clinical decision-making, evaluating treatment efficacy, and developing personalized therapeutic strategies.

The promise of AI and machine learning: Artificial intelligence (AI) and machine learning (ML) algorithms have emerged as powerful tools for analyzing complex medical data. Researchers have developed various ML-based survival models using classifiers such as multilayer perceptron (MLP), random forest (RF), decision tree (DT), and support vector machine (SVM) on datasets with thousands of patients. Deep learning (DL) approaches have also been explored for predicting post-operative breast cancer survival. However, the rapid evolution of AI techniques means that the landscape of available methods changes quickly, necessitating regular review updates.

The gap this review fills: While previous systematic reviews have examined ML applications in breast cancer, many focused narrowly on 5-year survival rates or specific algorithm types. This review aims to comprehensively synthesize and classify all AI algorithms used in BC survival prediction from 2016 to 2023, presenting them in a structured form. The authors evaluate performance across diverse contexts, critically examine validation strategies, and identify limitations with a particular focus on external validation and real-world clinical applicability. The review was registered in PROSPERO (CRD42024513350) to ensure transparency and methodological rigor.

TL;DR: Breast cancer survival varies enormously based on tumor characteristics and biomarkers like ER/PR/HER2. AI and ML offer promising tools for prediction, but the field evolves rapidly. This systematic review synthesizes all AI methods used for BC survival prediction from 2016 to 2023, evaluating their performance, validation approaches, and clinical readiness.
Pages 2-3
PRISMA-Guided Methodology: How the 32 Studies Were Identified

PRISMA framework: This systematic review and meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Three electronic databases were searched: Web of Science, PubMed, and Scopus, covering the period from January 2016 to August 2023. The search strategy, developed collaboratively by two authors, used three core concepts ("Breast Cancer," "Survival Prediction," and "Machine Learning") along with their synonyms derived from MeSH terms. Only English-language journal articles were included.

Inclusion and exclusion criteria: Included studies had to be original research published as journal articles, written in English, available in full text, and employing ML algorithms specifically for BC survival prediction using clinical data. Excluded were conference papers, protocols, letters, book chapters, observational studies, pilot studies, reviews, and meta-analyses. Studies that applied ML to breast cancer but not for survival prediction, and studies predicting survival for other cancer types, were also excluded.

Screening and selection results: The systematic search yielded 140 articles initially. After deduplication using EndNote software, 82 duplicates were removed, leaving 58 studies for screening. Titles and abstracts were independently screened by four reviewers, with disagreements resolved by two additional reviewers. Full texts of potentially eligible studies were then assessed. Ultimately, 32 articles met the eligibility criteria and were included in the final analysis, as depicted in the PRISMA flow diagram.

Quality assessment: Methodological quality was evaluated using the Qiao Quality Assessment tool, which encompasses five key domains: unmet need, reproducibility, robustness, generalizability, and clinical significance, further divided into nine specific items. Studies were scored on a binary scale (Yes/No), and those achieving a threshold of 5 or more points were categorized as high-quality for inclusion in the meta-analysis.
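The Qiao scoring rule described above is simple enough to sketch directly. This is an illustrative stand-in, not the authors' implementation: the nine item names are hypothetical placeholders, while the binary Yes/No scoring and the 5-point high-quality threshold come from the review.

```python
# Sketch of the binary Qiao-style scoring: nine Yes/No items across five
# domains, with >= 5 "Yes" answers marking a study as high quality.
# Item names below are hypothetical placeholders.

def qiao_score(answers):
    """answers: dict mapping item name -> True (Yes) / False (No)."""
    if len(answers) != 9:
        raise ValueError("expected nine Yes/No items")
    return sum(answers.values())

def is_high_quality(answers, threshold=5):
    return qiao_score(answers) >= threshold

example = {f"item_{i}": (i <= 6) for i in range(1, 10)}  # 6 Yes, 3 No
print(qiao_score(example), is_high_quality(example))     # 6 True
```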

TL;DR: Following PRISMA guidelines, the authors searched Web of Science, PubMed, and Scopus (2016-2023) and identified 140 articles. After deduplication and screening, 32 studies met inclusion criteria. Quality was assessed using the Qiao tool across five domains, with a threshold of 5+ points required for inclusion.
Pages 3-5
Traditional, Modern, and Hybrid ML Approaches in the Reviewed Studies

Three generations of ML methods: The review categorizes algorithms into three generations. Traditional methods include well-understood algorithms with strong theoretical foundations and interpretable results, such as K-nearest neighbors (KNN), support vector machines (SVM), decision trees (DT), Naive Bayes (NB), and artificial neural networks (ANN). Modern methods leverage deep neural networks inspired by the brain's architecture, including convolutional neural networks (CNN), deep neural networks (DNN), long short-term memory networks (LSTM), recurrent neural networks (RNN), and ensemble methods like random forests and XGBoost. Hybrid methods strategically combine traditional and modern approaches to capitalize on complementary strengths.

Distribution across studies: Hybrid models were the most commonly used, appearing in 40.62% (13 studies) with the highest average accuracy of 91.73%. Traditional methods were used in 37.5% (12 studies) with an average accuracy of 87.23%. Modern techniques appeared in 21.87% (7 studies) with an average accuracy of 88.72%. The majority of included studies (81.3%, n=26) were published between 2019 and 2023, reflecting the accelerating pace of research in this area.

Learning paradigms: Supervised learning dominated, employed in 75% (24 studies) with an average accuracy of 88.82%. The combination of supervised and unsupervised learning was the second most frequent approach (25%, 8 studies), achieving a higher average accuracy of 92.07%. Notably, no studies used semi-supervised learning, reinforcement learning, or unsupervised learning alone. This finding suggests that the labeled data available for breast cancer survival prediction naturally lends itself to supervised approaches, though combining paradigms may yield improved performance.

TL;DR: Hybrid models combining traditional ML and deep learning were used most often (40.62%) and achieved the highest accuracy (91.73%). Supervised learning dominated at 75% of studies, but studies combining supervised and unsupervised paradigms achieved a higher average accuracy of 92.07%. Modern deep learning approaches are accelerating in adoption since 2019.
Pages 5-7
Data Sources, Sample Sizes, and Clinical Features Used for Prediction

Data availability: Private datasets were the predominant choice, used in 56.25% (18 studies, avg. accuracy 89.42%). Publicly available datasets, including well-known repositories like TCGA (The Cancer Genome Atlas), METABRIC (Molecular Taxonomy of Breast Cancer International Consortium), and SEER (Surveillance, Epidemiology, and End Results), were used in 34.37% (11 studies, avg. accuracy 89.80%). A small subset (9.37%, 3 studies) combined both public and private data, achieving the highest average accuracy of 93.1%. Among single-dataset studies, METABRIC (n=4, 12.5%), SEER (n=4, 12.5%), and TCGA (n=3, 9.38%) were the most frequently used.

Sample sizes and data modalities: Dataset sizes varied dramatically, ranging from 38 to 163,413 cases per dataset, with an average of 12,041 cases. Studies using multiple datasets (n=8) achieved a superior mean validation accuracy of 95.76%, compared to 88.52% for single-dataset studies (n=24). This finding underscores the importance of training on diverse data. Data modalities included clinical records, genomic/omics data, and imaging data, with clinical data being the most common. Notably, only one study employed all BC data modalities simultaneously.

Time windows for prediction: The 5-year window was the most frequently explored timeframe (37.5%, 12 studies, avg. accuracy 86.81%). However, the 10-year prediction window achieved the highest accuracy, with 5 studies reaching an impressive 97.95% average accuracy. Shorter windows (1 and 3 years) yielded lower accuracy, each represented by only one study. These findings suggest that longer prediction horizons may yield more reliable models, though further investigation is warranted.

Clinical features: Studies incorporated between 5 and 625 features. Commonly used clinicopathological features included age, cancer stage, tumor grade, tumor size, estrogen receptor (ER) status, progesterone receptor (PR) status, and HER2 status. These biomarkers are central to breast cancer prognosis and are routinely collected in clinical practice, making them natural candidates for ML-based prediction models.

TL;DR: Over half the studies used private datasets, but combining public and private data achieved the highest accuracy (93.1%). Multi-dataset studies reached 95.76% accuracy versus 88.52% for single-dataset studies. The 10-year prediction window showed the best accuracy (97.95%). Key features included age, tumor size, stage, grade, and ER/PR/HER2 biomarker status.
Pages 7-8
How Studies Prepared Data and Selected Informative Features

Preprocessing techniques: The majority of studies (78.125%, n=25) acknowledged employing at least one preprocessing technique. Common methods included mean removal and unit variance scaling (standardization), transformation and resampling, data integration, SMOTE for handling imbalanced datasets, outlier detection with boxplots, one-hot encoding for categorical data, segmentation, and min-max scaling. For missing value imputation, studies used approaches ranging from deep learning and KNN-based methods to predictive mean matching (PMM) via the R mice package, multiple imputation, and manual imputation. Seven studies (21.875%) did not explicitly report their preprocessing steps.
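Two of the simpler techniques named above, min-max scaling and one-hot encoding, can be shown concretely. This is a minimal, dependency-free sketch with made-up feature values; real studies would apply such steps through library pipelines.

```python
# Min-max scaling rescales a numeric feature into [0, 1]; one-hot encoding
# turns a categorical value into a binary indicator vector. Feature values
# below are hypothetical.

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

tumor_size_mm = [12.0, 25.0, 40.0, 18.0]
print(min_max_scale(tumor_size_mm))    # smallest maps to 0.0, largest to 1.0

er_levels = ["negative", "positive"]
print(one_hot("positive", er_levels))  # [0, 1]
```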

A notable gap in data augmentation: Despite being a well-established technique for mitigating overfitting and expanding limited training data, none of the 32 reviewed studies employed data augmentation. This is a significant finding, given that medical datasets are often small and imbalanced. The authors note that the effect of incorporating augmentation methods (such as generating synthetic samples, cropping, flipping, or rotating existing data) remains an unexplored avenue for improving BC survival prediction models.
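One augmentation idea the authors mention, generating synthetic samples, can be sketched as interpolation between two same-class records — the core move of SMOTE-style oversampling. The feature vectors here are invented for illustration.

```python
# Create a synthetic record on the line segment between two existing
# same-class records (SMOTE-style interpolation). Values are hypothetical.

import random

def synthetic_sample(a, b, rng):
    """New point between records a and b, mixed feature by feature."""
    t = rng.random()  # mixing factor in [0, 1)
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
patient_a = [55.0, 2.1, 1.0]  # e.g. age, tumor size (cm), positive nodes
patient_b = [61.0, 2.9, 3.0]
new_point = synthetic_sample(patient_a, patient_b, rng)
print(new_point)  # each feature lies between the two originals
```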

Feature selection and dimensionality reduction: Most studies (68.75%, n=22) employed feature selection techniques, while 31.25% (n=10) did not use any. Supervised methods were the dominant choice (46.875%, n=15, avg. accuracy 87.88%), followed by unsupervised techniques (21.875%, n=7, avg. accuracy 91.28%). Traditional feature selection techniques included manual selection, K-means clustering, PCA (Principal Component Analysis), mRMR (minimum Redundancy Maximum Relevance), and RFE (Recursive Feature Elimination). Modern feature selection relied on random forest importance, CNNs for automated feature extraction, and SHAP (SHapley Additive exPlanations) for identifying the most predictive variables.
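To make the filter-style selection idea concrete, here is a tiny ranking sketch: score each feature by its absolute Pearson correlation with the survival label and keep the top-k. This mirrors only the "relevance" half of mRMR — real mRMR also penalizes redundancy between features — and the data is made up.

```python
# Rank features by |Pearson correlation| with the outcome label and keep
# the top-k. A relevance-only filter, not full mRMR. Data is hypothetical.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, labels, k):
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

labels = [1, 1, 0, 0, 1, 0]  # 5-year survival indicator (hypothetical)
columns = {
    "tumor_size": [1.0, 1.2, 3.5, 4.0, 1.1, 3.8],  # tracks the label closely
    "age":        [50, 62, 55, 47, 66, 58],         # weakly related here
}
print(rank_features(columns, labels, k=1))  # ['tumor_size']
```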

Performance impact: Interestingly, studies that did not use any feature selection technique achieved a slightly higher average accuracy (91.11%) compared to those that did (89.08%). This may indicate that powerful modern classifiers can effectively handle high-dimensional data without requiring explicit feature reduction, although feature selection remains valuable for model interpretability and computational efficiency.

TL;DR: Most studies used preprocessing (78%), but zero studies applied data augmentation, a significant gap. Feature selection was employed in 68.75% of studies using methods like PCA, mRMR, RFE, and SHAP. Studies without feature selection achieved slightly higher accuracy (91.11% vs. 89.08%), suggesting modern classifiers can handle high-dimensional data effectively.
Pages 8-9
Which ML and DL Algorithms Performed Best for Survival Classification

Traditional classifiers: Traditional classification algorithms were employed in 62.5% (n=20) of studies. Among supervised traditional classifiers, SVM was the most popular method (n=12, 37.5%, avg. accuracy 89.24%), followed by decision trees (n=11, 34.37%, avg. accuracy 88.74%), neural networks (n=10, 31.25%, avg. accuracy 87.52%), Naive Bayes (n=6, 18.75%, avg. accuracy 96.20%), KNN (n=5, 15.625%, avg. accuracy 94.51%), and AdaBoost (n=4, 12.5%, avg. accuracy 93.0%). SVM's popularity stems from its strong generalization capabilities, excellent classification performance, and ability to handle high-dimensional data. Only one study used unsupervised traditional classification (K-means clustering for predicting survival curves, achieving 93.1% accuracy).

Modern classifiers: Modern classification algorithms were adopted in 87.5% of studies (n=28, avg. accuracy 90.45%), with a notable increase since late 2018. Deep neural networks (DNNs) were frequently employed (n=12, 37.5%, avg. accuracy 91.14%). Among DNNs, CNNs were most common (n=6, avg. accuracy 90.57%), followed by dense neural networks (n=5, avg. accuracy 91.89%), LSTM networks (n=2, avg. accuracy 98.0%), and RNN (n=1, accuracy 91.0%). Bagging algorithms (n=16, 50%, avg. accuracy 90.46%) and XGBoost (n=10, 31.25%, avg. accuracy 90.77%) were also widely used. Only one study employed modern unsupervised classification using Restricted Boltzmann Machines (RBM), achieving 96.8% accuracy.

Key performance highlights: LSTM networks achieved the highest average accuracy among specific architectures at 98.0%, though they were represented in only two studies. Naive Bayes, despite being a simpler traditional method, achieved a surprisingly high average accuracy of 96.20%. XGBoost demonstrated consistently strong performance at 90.77% accuracy across 10 studies, making it one of the most reliable and widely adopted algorithms. The overall mean validation accuracy across all 32 studies was 89.73%, with individual study accuracies ranging from 72% to 99.04%.

TL;DR: SVM was the most popular traditional classifier (37.5% of studies, 89.24% accuracy), while DNNs led modern methods (37.5%, 91.14% accuracy). LSTM achieved the highest accuracy (98.0%) but in only two studies. XGBoost was widely used (10 studies, 90.77% accuracy). The overall mean accuracy across all 32 studies was 89.73%, ranging from 72% to 99.04%.
Pages 9-10
How Models Were Validated and the Critical Gap in External Testing

Evaluation metrics: Accuracy was the dominant evaluation metric, used in 78.13% (n=25) of studies. Other reported metrics included sensitivity, specificity, precision, F1-score, C-index (concordance index, commonly used in survival analysis), AUC (area under the ROC curve), and the Matthews correlation coefficient (MCC). Accuracy served as the primary basis for cross-study performance comparison in this meta-analysis.
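For reference, the classification metrics listed above all derive from a 2x2 confusion matrix. The counts in this sketch are invented for illustration.

```python
# Compute accuracy, sensitivity, specificity, precision, F1, and MCC from
# confusion-matrix counts (tp/fp/tn/fn). Counts below are hypothetical.

import math

def metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_denom   = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc         = (tp * tn - fp * fn) / mcc_denom
    return accuracy, sensitivity, specificity, precision, f1, mcc

acc, sens, spec, prec, f1, mcc = metrics(tp=80, fp=10, tn=90, fn=20)
print(round(acc, 3), round(sens, 3), round(spec, 3), round(mcc, 3))
```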

Validation methods and the external validation gap: A critical finding of this review is the overwhelming reliance on internal validation. 81.25% (n=26) of studies used only internal validation, assessing model performance on the same dataset used for training. Only 6.25% (n=2) used external validation alone (testing on a completely separate dataset), and 12.5% (n=4) combined both methods. Studies that employed both internal and external validation achieved the highest mean accuracy of 93.4%, compared to 89.20% for internal-only and 89.55% for external-only validation. This heavy dependence on internal validation raises serious concerns about the generalizability of reported results to new patient populations and clinical settings.

Validation strategies: K-fold cross-validation was the dominant approach (71.88%, n=23, avg. accuracy 89.69%), followed by train/test split (21.88%, n=7, avg. accuracy 90.73%). Bootstrapping was rarely used (3.13%, n=1, accuracy 87.1%), and one study did not report its validation strategy. While K-fold cross-validation provides more robust internal estimates than a simple train/test split, it still cannot substitute for external validation on independent datasets from different hospitals, patient demographics, or time periods.
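The k-fold procedure described above can be sketched at the index level: each record is held out exactly once, and a study would average accuracy over the k rounds. No model is trained here; the sketch only shows the partitioning.

```python
# Partition n_samples indices into k folds; each fold serves once as the
# held-out test set while the remaining folds form the training set.

def k_fold_indices(n_samples, k):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```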

Statistical summary of validation accuracy: Across the 25 studies reporting validation accuracy (78.13% of total), model accuracy ranged from 72% to 99.04%. The average accuracy was 89.73%, with a median of 91%. The variance was 50.005 and the standard deviation was 7.07, indicating substantial heterogeneity in model performance across studies. This variability reflects differences in datasets, preprocessing approaches, algorithm choices, and evaluation methodologies.
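As a quick consistency check on the dispersion figures quoted above, the standard deviation should equal the square root of the variance, and sqrt(50.005) does round to 7.07.

```python
# Verify that the reported standard deviation matches the reported variance.

import math

variance = 50.005
std_dev = math.sqrt(variance)
print(round(std_dev, 2))  # 7.07
```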

TL;DR: A striking 81.25% of studies relied solely on internal validation, with only 6.25% using external validation alone. K-fold cross-validation was used in 71.88% of studies. Validation accuracy ranged from 72% to 99.04% (mean 89.73%, median 91%, SD 7.07). The lack of external validation is a critical limitation for clinical translation.
Pages 10-12
Traditional vs. Deep Learning vs. Hybrid: Strengths, Weaknesses, and Trends

Traditional ML strengths and weaknesses: Traditional methods like DT, SVM, and logistic regression (LR) offer strong interpretability, which is vital in clinical environments where understanding the decision-making process builds trust. They work well with smaller datasets, which are common in medical research. However, they require extensive manual feature engineering, may struggle with high-dimensional data, and generally do not achieve the same accuracy as deep learning on complex predictive tasks.

Deep learning advantages and limitations: DL methods, particularly CNNs and DNNs, demonstrated superior accuracy for breast cancer survival prediction. They autonomously extract relevant features from raw data, eliminating the need for manual feature engineering. This is especially advantageous for complex data such as medical images and genomic profiles. However, DL models are often characterized as "black boxes" lacking interpretability, require substantial amounts of labeled data, and demand significant computational resources. These limitations present real barriers to clinical adoption.

The hybrid approach as the best of both worlds: Hybrid models that integrate traditional ML and DL techniques achieved the highest average accuracy (91.73%) by leveraging the strengths of both approaches. For example, they might employ DL for automated feature extraction and traditional ML for interpretable classification. The review observed a discernible shift toward hybrid frameworks, with studies increasingly combining DL-based feature extraction (especially CNNs) with traditional classifiers. Techniques like SHAP and LIME are also being investigated to enhance the interpretability of deep learning components.
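The hybrid pattern — a learned feature extractor feeding an interpretable classifier — can be shown structurally. Both components in this sketch are trivial stand-ins: the "extractor" is a fixed function in place of a trained CNN, and the classifier is a nearest-centroid rule with hypothetical centroids.

```python
# Structural sketch of a hybrid pipeline: deep-style feature extraction
# followed by a simple, interpretable classification stage.

def extract_features(record):
    """Stand-in for a deep feature extractor (e.g. a trained CNN)."""
    return [sum(record) / len(record), max(record) - min(record)]

def nearest_centroid_predict(features, centroids):
    """Interpretable stage: assign the class with the closest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(features, centroids[label]))

centroids = {"survived": [1.0, 0.5], "deceased": [4.0, 3.0]}  # hypothetical
record = [0.8, 1.2, 1.0]
print(nearest_centroid_predict(extract_features(record), centroids))
```

In a real hybrid system, the extractor's weights would be learned end to end on imaging or omics data, while the downstream classifier stays auditable for clinicians.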

Emerging trends: The analysis identified several forward-looking developments. Transfer learning, where pre-trained models are fine-tuned on breast cancer datasets, is gaining momentum as a way to reduce data and computational requirements. Multi-modal approaches combining clinical, genomic, and imaging data are becoming more prevalent for holistic prediction. Explainable AI (XAI) techniques are receiving increased attention to address the "black box" problem. Cloud computing and federated learning are proposed as solutions to computational resource constraints and data privacy concerns.

TL;DR: Traditional ML methods offer interpretability but lower accuracy; deep learning achieves higher accuracy but lacks transparency and needs more data. Hybrid models combining both achieved the highest accuracy (91.73%) and are the dominant trend. Emerging directions include transfer learning, multi-modal data fusion, explainable AI (SHAP/LIME), and federated learning for privacy.
Pages 12-13
What the Numbers Tell Us: Pooled Results and Performance Patterns

Overall meta-analytic results: The meta-analysis extracted the best accuracy from each study and averaged performance across methodological categories. The overall mean validation accuracy across all ML models was 89.73% (n=32 studies, 26 reporting accuracy). Studies using multiple datasets achieved a substantially higher mean accuracy of 95.76% compared to 88.52% for single-dataset studies. Studies combining public and private data sources reached 93.1% average accuracy, outperforming either source alone.

Performance by algorithm category: Among specific classifier families, LSTM networks led with 98.0% average accuracy (2 studies), followed by Restricted Boltzmann Machines at 96.8% (1 study), Naive Bayes at 96.20% (6 studies), KNN at 94.51% (5 studies), and AdaBoost at 93.0% (4 studies). The widely used algorithms showed solid but somewhat lower averages: DNN at 91.14% (12 studies), XGBoost at 90.77% (10 studies), bagging at 90.46% (16 studies), SVM at 89.24% (12 studies), and DT at 88.74% (11 studies). These patterns suggest that while more complex architectures can achieve higher peak performance, simpler methods still perform competitively.

Impact of study design choices: Several design factors correlated with higher accuracy. Using multiple datasets improved performance (95.76% vs. 88.52%). Longer prediction windows (10-year) yielded higher accuracy (97.95%) than shorter ones (5-year: 86.81%). Combining supervised and unsupervised learning paradigms outperformed supervised-only approaches (92.07% vs. 88.82%). Studies that combined internal and external validation achieved the highest mean accuracy (93.4%). These correlations suggest that methodological rigor and data diversity are at least as important as algorithm selection.

Heterogeneity considerations: The substantial variance (50.005) and standard deviation (7.07) in validation accuracies across studies reflect significant heterogeneity in study designs, datasets, and evaluation approaches. This heterogeneity limits the ability to definitively identify a single superior method. The authors note that the observed differences in performance may partly stem from variations in dataset quality, feature engineering, and preprocessing rather than inherent algorithmic superiority.

TL;DR: Mean validation accuracy was 89.73% across all studies. Multi-dataset studies reached 95.76%, and 10-year prediction windows achieved 97.95%. LSTM led individual algorithms at 98.0%, but with only 2 studies. Significant heterogeneity (SD 7.07) across study designs means no single algorithm can be declared definitively superior.
Page 13
What Needs to Change Before AI Survival Models Reach the Clinic

Critical limitations identified in the literature: The review identified several persistent shortcomings across the 32 studies. The most significant is the overwhelming reliance on internal validation (81.25% of studies), which severely limits confidence in model generalizability. Many studies used outdated or inappropriate datasets whose relevance to current clinical practice is questionable. The complete absence of data augmentation across all reviewed studies represents a missed opportunity for improving model robustness. Additionally, over 21% of studies did not report their preprocessing steps, hindering reproducibility.

Limitations of this review itself: The search strategy was restricted to English-language articles, potentially excluding relevant publications in other languages. Limited access to some full texts prevented their review. The reliance on accuracy as the primary comparison metric, while practical given its widespread use, overlooks more nuanced survival analysis metrics like C-index, time-dependent AUC, and calibration measures that better capture the temporal nature of survival prediction.

Recommendations for future research: The authors outline several priorities. First, rigorous external validation on independent datasets from different hospitals and populations is essential before any model can be considered for clinical deployment. Second, incorporating data augmentation techniques could strengthen prediction models, particularly given the limited size of many medical datasets. Third, multi-modal data integration (clinical, genomic, imaging) should be pursued more systematically, as only one study used all available data modalities. Fourth, developing explainable AI models using techniques like SHAP and LIME is critical for building clinician trust.

Infrastructure and ethical considerations: The authors emphasize the need for cloud computing and federated learning to address computational resource constraints and data privacy concerns simultaneously. Establishing ethical guidelines and regulatory frameworks for responsible AI deployment in healthcare is vital, addressing data privacy, algorithmic bias, and transparency. Cost-effective AI solutions that can be widely adopted in resource-constrained settings are also needed. Extensive prospective clinical studies will be necessary to validate the real-world applicability of AI models before they can be integrated into clinical workflows for personalized breast cancer care.

TL;DR: The biggest gap is the lack of external validation (only 6.25% of studies). No studies used data augmentation. Future work must prioritize external validation on diverse populations, multi-modal data fusion (clinical + genomic + imaging), explainable AI for clinician trust, federated learning for privacy, and prospective clinical trials before any model can reach bedside use.
Citation: Javanmard Z, Zarean Shahraki S, Safari K, et al. Artificial intelligence in breast cancer survival prediction: a comprehensive systematic review and meta-analysis. Frontiers in Oncology, 2024. PMC11747035. DOI: 10.3389/fonc.2024.1420328. Open access (CC BY).