AI for Predicting Breast Cancer Recurrence: Systematic Review

Plain-English Explanations

Background & Motivation

Pages 1-2

Why Predicting Breast Cancer Recurrence Remains a Critical Challenge

Breast cancer remains a leading cause of mortality among women worldwide, and one of the most daunting aspects of the disease is the risk of recurrence. Roughly 30% of early-stage breast cancer patients experience a return of the disease within a decade, with many recurrences clustering in the first five years after initial diagnosis. Recurrence not only threatens survival but also severely diminishes quality of life and creates substantial socioeconomic burdens through lost work hours and elevated healthcare costs. Conventional prediction methods have struggled to capture the full complexity of factors that determine which patients will relapse.

The promise of AI: Artificial intelligence techniques, including machine learning (ML) and convolutional neural networks (CNNs), have emerged as powerful tools for analyzing large volumes of medical data. These approaches can identify subtle, non-linear patterns in clinical records, genetic profiles, and medical images that traditional statistical methods may miss. AI-driven models have already shown value in breast cancer diagnosis and prognosis, offering opportunities for personalized care, improved therapy matching, fewer adverse effects, and lower costs by avoiding unnecessary treatments.

Ethical and practical hurdles: Despite the technical promise, adopting AI in clinical practice comes with significant ethical and practical challenges. Issues of patient privacy, data quality, and model interpretability remain barriers to widespread implementation. Many high-performing AI models function as "black boxes," making it difficult for clinicians to understand and trust their predictions. The authors emphasize that AI integration in recurrence prediction requires rigorous validation to ensure safe clinical application, and that the results of such reviews can directly influence public health policies and screening strategies.

Scope of this review: This systematic review differentiates itself from previous work by focusing on the most recent developments in AI for breast cancer recurrence prediction. It evaluates the effectiveness of different AI techniques, training and testing methodologies, and evaluation metrics across studies published from 2003 through May 2023. The goal is to provide an updated, comprehensive analysis of how AI can improve clinical decision-making and patient outcomes in the context of breast cancer recurrence.

TL;DR: About 30% of early-stage breast cancer patients experience recurrence within 10 years, and conventional prediction methods struggle to capture the factors that drive relapse. This systematic review examines AI techniques (ML, CNNs, SVMs) for recurrence prediction using clinical data, imaging data, and combined datasets, covering studies from 2003 through May 2023.

Methodology

Pages 2-4

How 62 Studies Were Selected Using the PRISMA Framework

PRISMA-guided process: The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework, a widely adopted reporting guideline for transparent and replicable systematic reviews. The authors searched four major electronic databases: Scopus, Embase, PubMed, and Web of Science. Each database was chosen for a specific reason: PubMed for health research, Embase for broader biomedicine coverage, Scopus for multidisciplinary reach, and Web of Science for high-impact studies across diverse disciplines.

Search strategy and eligibility: The search string combined terms for "artificial intelligence," "machine learning," "deep learning," "breast cancer," "recurrence," and "recurrence free survival" using Boolean operators, applied to Title/Abstract fields. The review included studies from 2003 (a milestone year for AI advancement and medical record digitization) through May 2023. Eligible studies had to explore AI approaches for classifying or predicting breast cancer recurrence using medical images (X-rays, CT scans, ultrasounds) or clinical/laboratory data. Studies had to be in English, in scientific format, and available in full.
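
As a rough illustration of how a Boolean query of this kind can be assembled, the sketch below joins the six quoted terms. The grouping (AI terms OR'd together, then AND'd with the disease term and the OR'd outcome terms) and the PubMed-style `[Title/Abstract]` field tag are assumptions made for the example; the review's exact search string is not reproduced here.

```python
# Terms quoted in the review; how they were grouped is an assumption.
AI_TERMS = ["artificial intelligence", "machine learning", "deep learning"]
DISEASE_TERMS = ["breast cancer"]
OUTCOME_TERMS = ["recurrence", "recurrence free survival"]


def or_group(terms, field="Title/Abstract"):
    """Join terms with OR, tagging each with a PubMed-style field (assumed)."""
    tagged = [f'"{term}"[{field}]' for term in terms]
    return "(" + " OR ".join(tagged) + ")"


def build_query(*groups):
    """AND together several OR-groups."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(AI_TERMS, DISEASE_TERMS, OUTCOME_TERMS)
print(query)
```

Each database has its own field syntax, so a real search would need one variant of this string per database.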

Exclusion criteria: The authors excluded studies addressing diseases other than breast cancer recurrence, research using non-medical data types, publications that were reviews or conference summaries rather than original research, and studies that failed to describe their AI methods or databases in adequate detail. This strict filtering ensured that only research with clear and replicable methodologies was considered. Three researchers independently reviewed titles, abstracts, and full texts, with disagreements resolved through consensus or third-party consultation.

Selection results: The initial search across all four databases identified 459 articles. After removing duplicates with Mendeley software, 168 unique articles remained. Screening of titles and abstracts reduced the pool to 88 studies. A detailed full-text assessment by a second group of reviewers resulted in the final selection of 62 articles that met all inclusion criteria. These 62 studies form the evidentiary basis for the entire review.
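
The selection funnel is simple arithmetic, and a short sketch makes the number dropped at each stage explicit. The counts are the ones reported above; the stage labels and the helper name `removed_per_stage` are illustrative.

```python
# Counts reported in the review at each PRISMA stage.
stages = [
    ("records identified", 459),
    ("after duplicate removal", 168),
    ("after title/abstract screening", 88),
    ("included after full-text review", 62),
]


def removed_per_stage(stages):
    """Return (stage label, number of records dropped on entering that stage)."""
    return [
        (label, prev_count - count)
        for (_, prev_count), (label, count) in zip(stages, stages[1:])
    ]


for label, dropped in removed_per_stage(stages):
    print(f"{label}: {dropped} removed")

retention = stages[-1][1] / stages[0][1]
print(f"overall retention: {retention:.1%}")  # 62 / 459, about 13.5%
```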

TL;DR: Using the PRISMA framework across four databases (Scopus, Embase, PubMed, Web of Science), the authors started with 459 articles, removed duplicates to reach 168, screened down to 88, and ultimately selected 62 studies for analysis. Studies had to describe AI methods and databases in detail and focus specifically on breast cancer recurrence.

AI Methods Overview

Pages 4-7

The Landscape of AI Algorithms Used Across 62 Studies

Machine learning dominance: Analysis of the 62 selected studies reveals a strong prevalence of traditional machine learning techniques, though deep learning methods are growing in adoption. The most frequently used algorithms include Support Vector Machines (SVMs), Random Forest, XGBoost, Naive Bayes, and various neural network architectures. SVMs emerged as the single most popular algorithm, favored for their ability to handle high-dimensional clinical data and achieve effective generalization. Random Forest was valued for consistent performance across diverse datasets, while XGBoost was selected for its high efficiency and shorter processing times.

Deep learning approaches: Among deep learning methods, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) stood out. CNNs are particularly effective at analyzing visual data because they learn to extract important spatial features from medical images. RNNs handle data with temporal dependencies, processing sequences of information that are critical for tracking disease progression over time. Long Short-Term Memory (LSTM) networks, a specialized type of RNN, were applied to capture long-term dependencies in patient records. Multi-layer Perceptrons (MLPs) and Deep Neural Networks (DNNs) were used for general supervised learning and complex feature extraction from large datasets.

Algorithm categorization: The authors organized the algorithms into seven categories: (1) Linear algorithms like Logistic Regression for binary classification; (2) Tree-based algorithms including Decision Trees, Random Forest, XGBoost, and AdaBoost; (3) SVM-based algorithms using kernel functions for non-linearly separable data; (4) Neighborhood-based algorithms like K-Nearest Neighbors; (5) Probabilistic algorithms such as Naive Bayes for handling uncertain clinical data; (6) Artificial neural networks spanning MLPs, RNNs, LSTMs, and DNNs; and (7) Special approaches including multi-objective multi-classifier systems, Cox survival models, and eTumorMetastasis for whole exome sequencing data.
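
The seven-category taxonomy maps naturally onto a lookup table. The sketch below encodes only the algorithms named in the paragraph above; the dictionary keys and the `category_of` helper are illustrative names, not something taken from the review.

```python
# The seven categories as summarized above; key names are illustrative.
ALGORITHM_CATEGORIES = {
    "linear": ["Logistic Regression"],
    "tree-based": ["Decision Tree", "Random Forest", "XGBoost", "AdaBoost"],
    "svm-based": ["SVM"],
    "neighborhood-based": ["K-Nearest Neighbors"],
    "probabilistic": ["Naive Bayes"],
    "neural networks": ["MLP", "RNN", "LSTM", "DNN"],
    "special approaches": [
        "multi-objective multi-classifier systems",
        "Cox survival models",
        "eTumorMetastasis",
    ],
}


def category_of(algorithm):
    """Return the category that lists `algorithm`, or None if it is not listed."""
    wanted = algorithm.lower()
    for category, members in ALGORITHM_CATEGORIES.items():
        if wanted in (member.lower() for member in members):
            return category
    return None


print(category_of("XGBoost"))  # tree-based
print(category_of("lstm"))     # neural networks
```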

Dataset variability: A critical factor across all studies was the size and diversity of datasets. The review found datasets ranging from as few as 38 patients to as many as 79,483 patients, with the majority of studies using datasets of fewer than 2,000 patients. This wide variability directly impacts model reliability, as deep learning models in particular require large datasets to avoid overfitting and achieve robust generalization.

TL;DR: Across 62 studies, SVMs, Random Forest, and XGBoost were the most popular algorithms, with deep learning methods (CNNs, RNNs, LSTMs) growing in use. Dataset sizes ranged from 38 to 79,483 patients. The authors categorized all methods into seven groups spanning linear models, tree-based approaches, SVMs, neural networks, and specialized techniques.

Clinical Data Results

Pages 7-9

How AI Performs on Clinical Data: SVMs Average 87%, Neural Networks Reach 97.62% AUC

Clinical data defined: Forty-three of the 62 selected studies focused specifically on clinical data, which encompasses patients' laboratory information, prescription and treatment details, and records of disease progression. The most frequently used clinical features included age, tumor grade, lymph node status, estrogen receptor status, tumor size, invasion markers, and histological type. These features were typically selected through feature selection algorithms that identify variables with the highest statistical significance or clinical relevance for predicting recurrence.

SVM performance: Support Vector Machines emerged as one of the most effective methods for clinical data, achieving an average performance of 87.03%. This success is attributed to SVM's ability to operate effectively in high-dimensional spaces and provide robust classifications even with complex clinical datasets. Studies frequently reported that SVMs outperformed other traditional classifiers when applied to clinical variables. Algorithms such as Naive Bayes and Decision Trees showed moderate but lower performance, suggesting they may be too simple for the inherent complexity of clinical recurrence data.

Neural network superiority: Studies using artificial neural networks, particularly Multi-layer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs), demonstrated superior performance, especially when applied to genetic and molecular datasets. A remarkable average performance of 97.62% (measured by AUC) was observed in studies applying these techniques. This highlights the effectiveness of neural networks in capturing complex, non-linear patterns in high-dimensional data that simpler models struggle to detect. The ability of RNNs to process sequential data makes them particularly well-suited for tracking disease progression patterns over time.
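
For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen patient who recurred receives a higher risk score than a randomly chosen patient who did not. The sketch below computes it directly from that definition; the labels and scores are made-up values for illustration, not data from any reviewed study.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties in score count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical risk scores for six patients (1 = recurrence, 0 = no recurrence).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auc(labels, scores), 3))  # 8 of 9 pairs ranked correctly: 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why an average of 97.62% stands out.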

Feature importance patterns: Word frequency analysis of feature descriptions across studies revealed that age, tumor grade, lymph node involvement, and hormone receptor status (estrogen and progesterone receptors) are the most commonly selected predictors. Their strong correlation with breast cancer recurrence has been validated across multiple independent studies, and feature selection algorithms consistently identify them as high-impact variables. This convergence across diverse research groups reinforces that these clinical markers carry genuine predictive power for recurrence risk.
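
A frequency analysis of this kind can be approximated by counting how many studies list each feature. The sketch below is a minimal version under that assumption; the per-study feature lists are hypothetical placeholders, and the review's actual word-frequency procedure may differ.

```python
from collections import Counter

# Hypothetical feature lists, one per study; not data from the review.
study_features = [
    ["age", "tumor grade", "lymph node status", "estrogen receptor"],
    ["age", "tumor size", "lymph node status"],
    ["tumor grade", "estrogen receptor", "progesterone receptor", "age"],
    ["age", "histological type", "tumor grade"],
]


def feature_frequency(studies):
    """Count in how many studies each feature appears (at most once per study)."""
    counts = Counter()
    for features in studies:
        counts.update(set(features))
    return counts


freq = feature_frequency(study_features)
for name, n in freq.most_common(2):
    print(f"{name}: used in {n} of {len(study_features)} studies")
```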

TL;DR: Of the 62 reviewed studies, 43 used clinical data. SVMs averaged about 87% performance, while neural networks (MLPs, RNNs) reached 97.62% AUC on genetic and molecular datasets. The most predictive clinical features were age, tumor grade, lymph node status, and hormone receptor status.

Imaging Data Results

Pages 9-11

Medical Imaging and AI: CNNs Average 80% While Hybrid Approaches Push Higher

Imaging modalities: Eleven of the selected articles focused specifically on imaging data for recurrence prediction. The imaging modalities used included Digitized Whole Slide Images (WSI), Multiparametric Magnetic Resonance Imaging (mpMRI), standard MRI, and Quantitative Ultrasound (QUS). Each modality captures different aspects of tumor biology: WSI provides detailed cellular-level views of tissue architecture, mpMRI offers multi-parameter characterization of tumor structure, and QUS captures raw radiofrequency data that can provide more detailed characterization of tissue microstructure than conventional ultrasound.

CNN performance: Convolutional Neural Networks proved to be the most suitable models for imaging data, achieving an average performance of approximately 80%. CNNs excel at recognizing spatial patterns in images, which is critical for analyzing complex medical visual data. However, the review found that the inherent complexity of medical images often requires hybrid or specialized approaches to achieve optimal performance. Raw image data from different modalities presents unique challenges in terms of resolution, noise, and the subtlety of recurrence-associated features that may not be visually obvious even to trained pathologists.

ResNet architectures: The ResNet architecture, particularly ResNet-34 and ResNet-50, was frequently employed across imaging studies. ResNet-34, with 34 weighted layers arranged in 16 two-layer residual blocks, offers a good balance between computational cost and performance, making it suitable for resource-constrained clinical environments. ResNet-50 reaches 50 layers by using a three-layer "bottleneck" design in each of its 16 residual blocks. In both, the skip connections inside the residual blocks help mitigate the vanishing gradient problem, addressing a fundamental challenge in deep learning: training much deeper networks without degradation in training performance.
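
The layer counts in the two names follow from the standard ResNet layout, which can be checked with a few lines of arithmetic. The sketch assumes the usual counting convention (one stem convolution, the convolutions inside the residual blocks, and one final fully connected layer, with shortcut projections not counted). The toy `residual_block` function is only an illustration of a skip connection on plain lists, not a real network layer.

```python
# Residual blocks per stage, shared by ResNet-34 and ResNet-50.
BLOCKS_PER_STAGE = [3, 4, 6, 3]


def weighted_layers(blocks_per_stage, convs_per_block):
    """Count weighted layers: stem conv + convs inside blocks + final FC layer."""
    stem_conv = 1
    final_fc = 1
    return stem_conv + sum(blocks_per_stage) * convs_per_block + final_fc


resnet34 = weighted_layers(BLOCKS_PER_STAGE, convs_per_block=2)  # basic blocks
resnet50 = weighted_layers(BLOCKS_PER_STAGE, convs_per_block=3)  # bottleneck blocks
print(sum(BLOCKS_PER_STAGE), resnet34, resnet50)  # 16 34 50


def residual_block(x, transform):
    """Skip connection: add the block's input back to its transformed output."""
    return [fx + xi for fx, xi in zip(transform(x), x)]


# If the learned transform outputs zeros, the block reduces to the identity,
# so adding more blocks need not make the network worse.
identity_out = residual_block([1.0, -2.0], lambda v: [0.0 for _ in v])
print(identity_out)  # [1.0, -2.0]
```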

Context matters more than architecture: The review emphasizes that an algorithm's performance on imaging data is not solely dependent on its inherent capabilities. Factors such as data collection methods, image quality, imaging modality, preprocessing techniques, and the specific characteristics of the patient population all play significant roles. CNN and XGBoost were the most frequently used algorithms for image-based studies, but their reported performance varied substantially depending on these contextual factors, underscoring the importance of standardized imaging protocols for reproducible AI research.

TL;DR: Eleven studies used imaging data (WSI, mpMRI, MRI, QUS), with CNNs averaging about 80% performance. ResNet-34 and ResNet-50 architectures were popular choices for their balance of depth and efficiency. The review stresses that image quality, modality, and preprocessing matter as much as algorithm choice for prediction accuracy.

Combined Data Results

Pages 11-13

Multi-Modal Approaches: Combining Clinical and Imaging Data Reaches 95.0-95.2% Performance

The multi-modal advantage: Nine articles in the review utilized a combination of clinical and imaging data, and these multi-modal approaches consistently outperformed single-modality studies. Integrating clinical records with genetic or imaging information resulted in a significant improvement in predictive model performance. Hybrid approaches that combined SVM with Artificial Neural Networks (ANN) or SVM with CNNs achieved average performance between 95.0% and 95.2%. This improvement makes biological sense: clinical data captures patient-level systemic factors (age, receptor status, treatment history) while imaging data reveals local tumor-level morphological and spatial characteristics.

Key combined features: In studies using both data types, the most important clinical variables included age, estrogen receptor status, progesterone receptor status, tumor stage and grade, Ki67 proliferation index, and histological type. For the imaging component, the specific features extracted depended heavily on the modality used, resulting in limited overlap of imaging features across different studies. This heterogeneity in imaging feature sets makes direct comparison between studies challenging but also suggests that different imaging modalities capture complementary aspects of tumor biology.

Algorithm preferences: The most frequently used algorithms in combined-data studies were Random Forest, ResNet-18, and SVM. Notably, the best performance came not from any single algorithm but from hybrid architectures that paired different model types together. The SVM + ANN and SVM + CNN combinations leveraged the complementary strengths of each algorithm: SVMs provide robust classification boundaries in high-dimensional feature spaces, while neural networks capture complex non-linear relationships in the data. This synergy proved more effective than either approach alone.
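
This summary does not say how each hybrid wired its two components together. One common pattern, shown here purely as an assumption, is late fusion: each model outputs a recurrence probability and the two are averaged. The function name `late_fusion`, the equal default weighting, the 0.5 threshold, and the patient values are all hypothetical.

```python
def late_fusion(p_clinical, p_imaging, weight_clinical=0.5):
    """Weighted average of two recurrence probabilities, each in [0, 1]."""
    if not 0.0 <= weight_clinical <= 1.0:
        raise ValueError("weight_clinical must be between 0 and 1")
    return weight_clinical * p_clinical + (1.0 - weight_clinical) * p_imaging


# Hypothetical outputs of a clinical-data model and an imaging model.
patients = [
    {"id": "A", "p_clinical": 0.80, "p_imaging": 0.60},
    {"id": "B", "p_clinical": 0.20, "p_imaging": 0.40},
]
for patient in patients:
    fused = late_fusion(patient["p_clinical"], patient["p_imaging"])
    risk = "high risk" if fused >= 0.5 else "low risk"
    print(patient["id"], round(fused, 2), risk)
```

Other designs are equally plausible, for example feeding CNN-extracted image features into an SVM, so this should be read as one possible arrangement rather than the method used in the reviewed studies.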

Implications for clinical practice: These findings strongly suggest that the future of AI-driven recurrence prediction lies in multi-modal, multi-algorithm approaches. By incorporating diverse data streams and combining complementary AI techniques, researchers can build models that capture a more holistic picture of each patient's recurrence risk. However, the practical challenge of integrating multiple data types in real clinical workflows, where imaging and clinical records may exist in separate systems with different formats, remains a significant implementation hurdle.

TL;DR: Nine studies combined clinical and imaging data, achieving 95.0-95.2% performance with hybrid approaches (SVM + ANN, SVM + CNN). Multi-modal models consistently outperformed single-modality approaches. Key clinical features included age, hormone receptor status, Ki67, and tumor grade, while imaging features varied by modality.

Limitations

Pages 13-14

Five Critical Limitations That Constrain Current AI Recurrence Models

Dataset diversity and generalizability: The most significant limitation identified across the reviewed studies is the lack of diversity in the datasets used. Most studies relied on data from a single institution or a specific database, which does not adequately reflect the diversity of patient populations and clinical contexts found across different geographical regions. This homogeneity limits the applicability of AI models to broader populations, raising serious concerns about whether a model trained on one demographic group will perform equally well for patients of different ethnicities, ages, or healthcare settings.

Small sample sizes and overfitting: Many reviewed studies used relatively small datasets. AI models, particularly deep learning architectures, typically require large datasets to achieve robust and reliable performance. Small sample sizes increase the risk of overfitting, where the model memorizes patterns specific to the training data rather than learning generalizable features. An overfitted model may report excellent performance on its test set but fail dramatically when deployed on new, unseen patient data in a different clinical environment.
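
The gap between memorizing and generalizing can be shown with a deliberately extreme toy case. In the sketch below the labels are random coin flips, so there is nothing to learn, yet a 1-nearest-neighbour model scores perfectly on its own training data. The data are synthetic and the setup is an illustration only, not a reconstruction of any reviewed study.

```python
import random


def nearest_label(x, train):
    """1-nearest-neighbour prediction on a single numeric feature."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(data, train):
    """Share of rows in `data` whose label matches the 1-NN prediction."""
    hits = sum(nearest_label(x, train) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)


def make_noise(n):
    """n rows of (feature, label) where the label is an independent coin flip."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


train, test = make_noise(40), make_noise(200)

train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(test, train)    # labels are noise, so this hovers near chance
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

With only a few dozen patients, a flexible model can fit noise in the same way, which is why held-out performance is the number that matters.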

Interpretability gap: Despite advances in model development, interpretability remains a major challenge. Many studies employed complex algorithms such as deep neural networks that function as "black boxes," obscuring the reasoning behind their predictions. This opacity undermines the confidence of healthcare professionals and hinders integration into clinical practice. Clinicians need to understand why a model predicts recurrence for a specific patient before they can responsibly act on that prediction, yet few of the 62 reviewed studies incorporated explainability methods.

Evaluation inconsistencies: The heterogeneity in evaluation metrics and validation tools across studies poses another limitation. Some studies reported accuracy, others used AUC, sensitivity, or specificity as their primary metric, making direct comparisons difficult. Additionally, many studies relied on internal validation (testing on a subset of the same dataset) rather than rigorous external validation on independent datasets. Internal validation can overstate model performance and provides weaker evidence that the model will generalize to real clinical populations.
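
A small worked example shows why studies reporting different metrics cannot be compared directly. The confusion-matrix counts below are hypothetical: a cohort of 100 patients, 10 of whom recur, and a model that detects only 2 of those 10.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical cohort: 100 patients, 10 recurrences, only 2 of them detected.
m = metrics(tp=2, fp=1, tn=89, fn=8)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.91, sensitivity: 0.20, specificity: 0.99
```

The same model looks strong on accuracy and specificity while missing 80% of recurrences, so a study reporting only accuracy and another reporting only sensitivity are not describing comparable results.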

Confounding variables: The analysis of included studies may have been affected by confounding factors not consistently addressed in the AI models. Critical clinical variables such as specific treatments received, treatment responses, cancer stage, and comorbidities may not have been included or appropriately weighted in many of the analyzed models. These omissions can lead to biased predictions and limit the practical utility of otherwise technically sound algorithms.

TL;DR: Five major limitations constrain current AI recurrence models: (1) lack of dataset diversity across institutions and demographics, (2) small sample sizes leading to overfitting, (3) "black box" models that rarely include explainability methods, (4) inconsistent evaluation metrics and reliance on internal validation, making cross-study comparisons unreliable, and (5) uncontrolled confounding variables like treatment type and cancer stage.

Conclusions & Future Directions

Pages 14-15

The Path Forward: Multi-Modal AI, Standardized Validation, and Clinical Integration

Key findings summarized: This systematic review shows that AI techniques, particularly SVMs and neural networks, have considerable potential to improve the accuracy and personalization of breast cancer recurrence prediction. SVMs provided consistent results in high-dimensional clinical data scenarios, averaging around 87% performance. Neural networks excelled in genetic and molecular datasets, reaching 97.62% AUC. CNNs and ResNet architectures were most effective for imaging data at approximately 80% average performance. Combined clinical-imaging approaches averaged 95.0-95.2%, well above clinical-only SVMs and imaging-only CNNs, which the review takes as evidence that multi-modal data integration is the most promising path forward.

The case for hybrid approaches: The review's most actionable finding is that combining different data types with hybrid AI architectures consistently delivers superior results compared to any single-modality, single-algorithm approach. SVM + ANN and SVM + CNN combinations on combined clinical and imaging data outperformed all other configurations. This suggests that future research and clinical tool development should prioritize multi-modal frameworks that can ingest diverse patient data streams and leverage complementary algorithmic strengths to produce more robust, comprehensive recurrence risk assessments.

Recommendations for future research: The authors call for several concrete improvements in the field. Researchers should prioritize increasing dataset diversity by incorporating multi-institutional, multi-ethnic patient cohorts. Larger sample sizes are essential, particularly for deep learning models that require substantial training data. A broader spectrum of algorithms should be explored, including newer architectures that may not have been captured within the review's time frame. Standardized evaluation criteria and rigorous external validation protocols are needed to enable meaningful cross-study comparisons and ensure model robustness across diverse clinical settings.

Clinical translation: The adoption of AI techniques for predicting breast cancer recurrence represents a significant advancement with the potential to transform clinical care through more accurate and personalized diagnostics. However, the authors stress that these technologies must be accompanied by rigorous and continuous validation to ensure their safety and effectiveness. Ethical considerations, including patient privacy, data governance, informed consent, and the responsible use of algorithmic predictions in treatment decisions, must be carefully addressed before any model is deployed at scale in clinical practice.

TL;DR: SVMs (about 87% on clinical data), neural networks (97.62% AUC on genetic and molecular data), and hybrid SVM + ANN and SVM + CNN approaches (95.0-95.2% on combined clinical and imaging data) show the most promise. The review recommends larger, more diverse datasets, standardized evaluation metrics, external validation, and hybrid multi-modal architectures as the path toward clinically deployable AI recurrence prediction tools.

Harnessing artificial intelligence for predicting breast cancer recurrence: a systematic review of clinical and imaging data
