The clinical challenge: Colorectal cancer (CRC) is the third most commonly diagnosed cancer worldwide and the second leading cause of cancer death. While surgery and endoscopic resection remain the mainstay for localized disease, and fluoropyrimidine-based chemotherapy has improved outcomes in metastatic patients, the landscape of available therapies has expanded dramatically. Targeted therapy, immunotherapy, and refined surgical techniques now offer multiple treatment paths, but selecting the right strategy for each patient has become increasingly complex in the era of precision medicine.
Where AI enters the picture: Artificial intelligence, encompassing both machine learning (ML) and deep learning (DL), offers a data-driven approach to this selection problem. ML algorithms like support vector machines (SVMs) parse structured data for disease stratification and prediction. DL extends this with multi-layered neural networks capable of processing high-dimensional data such as medical images, gene expression profiles, and digital pathology slides. Cancer patients generate multimodal data, including electronic medical records, molecular profiles, radiological scans, and histopathology images, and DL techniques like convolutional neural networks (CNNs) can process these complex inputs to support personalized treatment decisions.
Scope of this review: The authors conducted a comprehensive literature search across MEDLINE, EMBASE, and Web of Science, identifying 49 articles through October 2023. The review covers AI applications across the full treatment spectrum: neoadjuvant chemoradiotherapy efficacy prediction, chemotherapy efficacy and toxicity prediction, immunotherapy and targeted therapy response prediction, endoscopic therapy selection, surgical therapy management, and long-term prognosis evaluation. The contributions of ML and DL approaches are evaluated separately.
Key distinction in AI types: The review carefully distinguishes between ML and DL. While ML works well with structured tabular data and clinicopathological variables, DL has proven more proficient at finding complex structures in high-dimensional data such as images. DL has outperformed conventional ML in areas like image recognition, gene expression prediction, and disease impact assessment. This distinction matters because different treatment prediction tasks require different AI approaches depending on the data modality involved.
Why prediction matters: For locally advanced rectal cancer (LARC), preoperative neoadjuvant chemoradiotherapy (nCRT) followed by surgery is the standard of care. The goal is tumor shrinkage and increased probability of complete tumor clearance. However, only about 15% to 27% of patients achieve a pathologic complete response (pCR) after nCRT. Patients who achieve a complete response to neoadjuvant therapy may be able to avoid bowel surgery, making accurate pCR prediction essential for sparing them unnecessary surgical risk and its associated complications.
MRI-based radiomics models: Multiple studies have constructed AI models using radiomic texture features extracted from MRI. A random forest (RF) model built on pretreatment T2-weighted MRI achieved an AUC of 0.712 with 70.5% accuracy on a 44-case validation set. A logistic regression classifier using recursive feature elimination on T2-weighted MRI yielded a higher AUC of 0.80. When researchers combined pretreatment, mid-treatment, and post-treatment MRI radiomics, an RF model could identify nonresponders with a mean AUC of 0.83 in the validation cohort, suggesting that longitudinal imaging data can improve predictive power.
CT and pathomics approaches: Beyond MRI, a deep neural network (DNN) based on CT radiomic features predicted pCR with 80% accuracy on an external validation set. In an innovative pathomics approach, one study used multiphoton imaging to analyze collagen structural features in the tumor microenvironment. An SVM classifier built on these collagen features achieved an AUC of 0.854 on a validation dataset of 428 LARC patients, suggesting that tissue-level microstructural analysis holds strong predictive value for treatment response.
Deep learning vs. handcrafted features: An important finding came from a study comparing CNN-extracted features with hand-crafted radiomics features from diffusion-weighted MRI. The CNN-based LASSO-logistic regression model achieved a higher mean AUC of 0.73 compared to 0.64 for the handcrafted model, supporting the advantage of deep learning in automatically discovering predictive imaging patterns. Additionally, automatic DL-based tumor segmentation reached 75% accuracy, outperforming manual segmentation by radiology residents at 68%.
Predicting treatment response in liver metastases: For CRC patients with liver metastases (CRLM) who are ineligible for surgery, chemotherapy is the standard option, but identifying which lesions will respond is challenging for physicians. A radiomics model based on ML algorithms achieved a per-lesion sensitivity of 73% and specificity of 47% on portal CT scans, representing only moderate performance. However, a delta-radiomics model, which tracks changes in radiomic features between baseline and post-first-cycle CT, achieved a much higher sensitivity of 85% and specificity of 92% in identifying CRLM lesions that did not respond to FOLFOX chemotherapy.
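Illustrative sketch, delta features: The delta-radiomics idea above can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not the cited study's pipeline: the feature names and values are hypothetical, and relative change is only one of several ways a delta feature might be defined.

```python
def delta_features(baseline, followup, eps=1e-12):
    """Relative change per feature: (followup - baseline) / |baseline|.

    Only features present in both scans are kept. `eps` guards against
    division by zero when a baseline value is exactly 0.
    """
    deltas = {}
    for name, base in baseline.items():
        if name not in followup:
            continue
        deltas[name] = (followup[name] - base) / max(abs(base), eps)
    return deltas


# Hypothetical radiomic features for one liver lesion (illustrative values only).
baseline = {"volume_mm3": 1200.0, "glcm_entropy": 4.0, "mean_hu": 60.0}
followup = {"volume_mm3": 900.0, "glcm_entropy": 4.5, "mean_hu": 45.0}

deltas = delta_features(baseline, followup)
for name, value in deltas.items():
    print(f"{name}: {value:+.3f}")
```

The resulting per-lesion delta vector, rather than the single-timepoint features, would then be passed to whatever classifier is being trained.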
Molecular biomarker-based prediction: Long noncoding RNAs (lncRNAs) have emerged as potential CRC biomarkers. One study proposed a consensus immune-related lncRNA signature using ML that identified nonresponders to fluorouracil-based adjuvant chemotherapy with an AUC of 0.854. Another study used 10 ML algorithms to construct a consensus ML-derived lncRNA signature that could characterize patients who benefited from fluorouracil-based adjuvant chemotherapy. These molecular-level approaches complement imaging-based models by capturing biological mechanisms of treatment resistance.
Toxicity prediction: AI has also been applied to predict chemotherapy side effects, a critical concern for patient safety. ML models for predicting cardiotoxicity in CRC patients receiving fluoropyrimidine-based chemotherapy found that XGBoost achieved the highest precision of 0.607 for 30-day cardiotoxicity prediction. For irinotecan toxicity, ML models predicted leukopenia, neutropenia, and diarrhea with accuracies exceeding 75% for all three adverse events. The best individual AUCs reached 0.74 for leukopenia, 0.88 for neutropenia, and 0.95 for diarrhea.
Practical considerations: While these results are promising, practical limitations exist. The irinotecan toxicity model required pharmacokinetic data obtained after drug administration, limiting its use as a purely pretreatment screening tool. Cardiotoxicity prediction factors are relatively few and closely tied to specific treatment regimens, constraining pretreatment predictions for individuals on the same regimen. Nevertheless, personalized adverse reaction prediction remains highly significant for medical decision-making, as it could help clinicians balance the benefits and risks of specific chemotherapy protocols for individual patients.
MSI status and immunotherapy: Microsatellite instability (MSI) is found in approximately 15% of colorectal tumors and serves as a key biomarker for immunotherapy response. Patients with MSI-high tumors are more responsive to immune checkpoint inhibitors (ICIs), making accurate MSI determination essential. Traditional methods rely on immunohistochemistry or genetic analysis of biopsy specimens, but these are costly and technically constrained. AI offers an alternative: a DL model based on MRI achieved an AUC of 0.868 when combining imaging features with clinical factors to predict MSI status in rectal cancer, far outperforming a clinical-only model (AUC 0.573).
Histopathology-based MSI detection: A multiple-instance-learning DL model called EPLA was developed to identify MSI-high vs. MSI-low/MSS status from histopathology images. Initially, EPLA had a low AUC of 0.6497 on external validation. However, after applying transfer learning to address variations in scanning protocols across clinical sites, its AUC jumped to 0.8504. This dramatic improvement highlights transfer learning as a critical technique for generalizing AI models across diverse clinical settings. Separately, a ResNet-50-based DL method achieved an AUC of 0.774 for predicting tumor mutational burden (TMB) from H&E-stained sections, offering a lower-cost alternative to standard TMB measurement.
KRAS mutation detection for anti-EGFR therapy: KRAS mutations occur in approximately 40% of CRC cases and serve as a negative biomarker for anti-EGFR targeted therapy. AI models have been developed to predict KRAS status noninvasively. An SVM classifier based on T2-weighted MRI achieved an AUC of 0.714 on external validation, while a DL model combining T2-weighted images with clinicopathological characteristics improved to an AUC of 0.841. For combined KRAS, NRAS, and BRAF mutation prediction in CRLM, a DL model using radiomics and semantic features reached an AUC of 0.79, enabling rapid selection of anti-EGFR therapy candidates.
HER2-targeted therapy response: About 70% of metastatic CRC patients with HER2 amplification or overexpression benefit from trastuzumab plus lapatinib treatment. Researchers built a DL model using pretreatment CT to distinguish responders from nonresponders in CRLM patients receiving this dual HER2-targeted therapy. The model achieved a per-lesion sensitivity of 90% but a specificity of only 42% on external validation, indicating a strong ability to identify likely responders but a need for more cases to refine the model and reduce false positives.
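Illustrative sketch, what low specificity means in practice: The trade-off above can be made concrete by converting sensitivity and specificity into predictive values. The sketch below is a worked calculation, not part of the cited study. The 70% responder prevalence is an assumption borrowed from the benefit rate quoted above and applied per lesion purely for illustration; the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test at a given prevalence.

    All arguments are fractions in [0, 1]; prevalence must be strictly
    between 0 and 1 so that both classes are present.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    tp = sensitivity * prevalence                # responders flagged as responders
    fn = (1.0 - sensitivity) * prevalence        # responders missed
    tn = specificity * (1.0 - prevalence)        # nonresponders correctly flagged
    fp = (1.0 - specificity) * (1.0 - prevalence)  # nonresponders flagged as responders
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("predictive value undefined: no predictions in one class")
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity from the text; prevalence is an assumed value.
ppv, npv = predictive_values(sensitivity=0.90, specificity=0.42, prevalence=0.70)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumed inputs, roughly 78% of lesions flagged as responders would truly respond and about 64% of lesions flagged as nonresponders would truly fail to respond. Both figures shift with the assumed prevalence.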
Predicting lymph node metastasis for endoscopic decisions: Endoscopic resection is an effective, less invasive option for early-stage CRC, but 8% of T1 and 18.5% of T2 CRC patients have lymph node metastases (LNM), which is a contraindication for endoscopic treatment. Accurate LNM prediction is therefore critical. An ML model using 45 clinicopathological factors achieved 100% sensitivity for T1 CRC LNM prediction, though specificity varied from 0% to 66% depending on the guideline comparison. For T2 CRCs, an RF-based model reached a robust AUC of 0.93 on validation, potentially helping LNM-negative patients avoid unnecessary additional surgery after endoscopic full-thickness resection.
Real-time endoscopic assistance: Beyond decision support, AI can directly assist during endoscopic procedures. A DeepLabv3-based DL model was developed to depict blood vessels and tissue structures on endoscopic images in real time, achieving a mean vessel detection rate of 85%. This capability could reduce the risk of bleeding and perforation during endoscopic submucosal dissection. However, the authors note that studies investigating AI-assisted endoscopic resection specifically for CRC remain scarce, and more primary and validation studies are needed.
Predicting surgical complications: AI models have been developed to predict several postoperative complications. For anastomotic leakage (AL), a common and serious complication of CRC surgery, SVM models using preoperative electronic health records achieved a high AUC of 0.92. A LASSO-based model using clinical data reached an AUC of 0.690 for AL prediction. For predicting perineural invasion (PNI), which correlates with higher postoperative mortality, SVM models based on preoperative CT achieved an AUC of 0.793. A gated recurrent unit (GRU)-based DL framework predicted postoperative wound infection (AUC 0.68) and organ space infection (AUC 0.78) in real time using dynamic and static clinical variables.
Low anterior resection syndrome: ML models including logistic regression, SVM, decision trees, RF, and artificial neural networks were tested for predicting low anterior resection syndrome following CRC surgery. Logistic regression emerged as the most practical option with a sensitivity of 0.911, suitable as a screening tool. However, the results lacked external validation. These preoperative prediction capabilities collectively support better surgical planning, risk counseling, and postoperative management for CRC patients.
Recurrence prediction: For stage II and III CRC patients, 5-year cumulative local recurrence rates after surgery are 11.0% and 23.5%, respectively. Radiomics models using preoperative CT features showed that multivariate regression combined with clinicopathological factors achieved the highest balanced accuracy of 0.78 and a Matthews correlation coefficient of 0.6. For stage IV CRC, GradientBoosting achieved the highest AUC of 0.761 among four tested ML algorithms. An innovative end-to-end multi-size convolutional neural network (MSCNN) was developed to evaluate disease-free survival (DFS) from CT, with Kaplan-Meier analysis confirming the CT signature could predict DFS (P < 0.001).
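Illustrative sketch, balanced accuracy and MCC: The two metrics reported for the recurrence models are less familiar than AUC, so a short Python sketch of their standard definitions may help. The confusion-matrix counts below are hypothetical and are not taken from any study in the review.

```python
import math


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 50 recurrences and 50 non-recurrences.
tp, fn, tn, fp = 40, 10, 35, 15
ba = balanced_accuracy(tp, fp, tn, fn)
mcc = matthews_corrcoef(tp, fp, tn, fn)
print(f"balanced accuracy = {ba:.3f}, MCC = {mcc:.3f}")
```

Unlike plain accuracy, both metrics penalize a model that does well on only one class, which matters when recurrences are the minority.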
Metastasis prediction: About 18% to 25% of CRC patients without distant metastases at primary diagnosis will develop them within 5 years, with the liver being the most frequently affected organ (approximately 50% of cases over the disease course). A stacking bagging model that optimized and integrated 7 ML algorithms achieved a remarkable AUC of 0.9631 for predicting liver metastases in T1 CRC patients at primary diagnosis, providing reliable early warning for high-risk patients.
Survival prediction from clinical and molecular data: For predicting 30-day mortality after chemotherapy, a gradient-boosted trees algorithm achieved an AUC of 0.924 for CRC patients on external validation, potentially helping reduce unnecessary high-risk chemotherapy. FOLFOXai, an ML-derived molecular signature, directly predicted treatment efficacy for metastatic CRC patients on oxaliplatin-containing chemotherapy. A PET/CT-based random survival forest model achieved a C-index of 0.780 for prognosis prediction in stage III colon cancer, while a fusion model combining radiomics and deep CNN features from CT achieved AUCs of 0.76 for DFS and 0.91 for OS in stage II CRC.
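Illustrative sketch, the C-index: The concordance index reported for the random survival forest generalizes AUC to censored survival data. The following is a minimal pure-Python sketch of Harrell's C-index, assuming a simple pairwise definition in which tied event times are skipped. The patient data are hypothetical, and a production analysis would more likely rely on an established survival library.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when i, who failed earlier, also has the higher predicted risk.
    Tied risk scores count as half. Tied times are ignored (simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), event flags (1 = event, 0 = censored),
# and model risk scores for five patients.
times = [5, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.3, 0.1]

c_index = concordance_index(times, events, risk)
print(f"C-index = {c_index:.3f}")
```

In this toy data 7 of the 8 comparable pairs are ordered correctly, giving a C-index of 0.875. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.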
Histopathology-based prognosis: Multiple CNN models using H&E-stained tissue slides demonstrated prognostic value. A "deep stroma score" was validated as an independent prognosticator correlated with OS, disease-specific survival, and relapse-free survival across external cohorts. CNN-quantified tumor-stroma ratio (TSR) correlated with increased OS (P < 0.004), and CNN-assessed mucus proportion predicted prognosis in mucinous adenocarcinoma (P < 0.008). A DL model for classifying consensus molecular subtypes (CMSs) from histopathology images achieved an AUC of 0.85 on external validation, facilitating cost-effective molecular subtyping without bulk transcriptomics.
Multi-omics frontier: A microbiome-based ML model using tissue bacterial biomarkers demonstrated superior predictive performance compared to models based on mRNA or miRNA data alone. This finding, combined with recent studies revealing mechanisms behind the relationship between microbes and tumor progression, points to multi-omics integration as a promising new direction for CRC prognostication.
Clinical Decision Support Systems: AI-based Clinical Decision Support Systems (CDSSs) have been developed to directly assist physicians in making personalized treatment decisions. A DL-based CDSS stratified CRC patients into 3 risk groups using H&E-stained tissue sections and pathological staging markers, enabling low-risk patients to avoid adjuvant chemotherapy, thereby reducing morbidity, mortality, and costs. Another system, AImmunoscore, evaluated multiple immunohistopathological images of various immune cell subtypes and was validated as an independent prognostic factor that can predict response to neoadjuvant therapy, serving as a decision tool for precision medicine.
IBM Watson and real-world CDSS deployment: IBM Watson for Oncology and similar CDSSs have been validated for providing personalized CRC treatment recommendations. In most validation studies, the primary criterion was consistency between AI and human expert recommendations. The authors note that the best setting for AI-based CDSSs is likely in centers with limited expert CRC resources. Importantly, AI-suggested regimens that differ from expert recommendations are not necessarily incorrect; they may represent evidence-based options that physicians had not considered, potentially expanding the decision-making framework.
Ongoing clinical trials: The review identified multiple registered trials (on ClinicalTrials.gov) evaluating AI in CRC treatment. Most focus on predicting neoadjuvant chemotherapy response, with estimated enrollments ranging from 100 to 1,700 patients. Others address predicting surgical complications (the PANIC trial in Switzerland with 11,000 patients), lung metastasis prediction (2,779 patients in China), and major complications after elective colectomy (130,000 patients in Canada). Additional trials focus on targeted therapy response, including cetuximab treatment response prediction and bevacizumab outcome prediction using PET/CT.
Surgical video analysis: One trial (VAMIS, 500 patients in the UK) takes a unique approach by analyzing surgical phases, skill levels, and errors from video and kinematic recordings of minimally invasive procedures. This represents an expansion of AI beyond prediction into real-time surgical quality assessment. The breadth of these trials, spanning multiple countries with diverse populations, reflects growing global investment in validating AI tools for CRC treatment decisions.
Validation gaps: A substantial proportion of the 49 included studies lacked external validation, raising concerns about overfitting and generalizability. Without testing on independent datasets from different institutions and patient populations, models may perform well in controlled settings but fail in real-world clinical deployment. The authors also highlight that only a small number of studies reported calibration, which measures agreement between estimated and true outcome risk. Models evaluated solely on discrimination metrics like AUC may appear strong but produce poorly calibrated probability estimates that mislead clinical decision-making.
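Illustrative sketch, discrimination versus calibration: The point that a strong AUC can coexist with poor calibration can be demonstrated with a toy example. The sketch below uses hypothetical labels and probabilities. The "inflated" model is an artificial construction (a monotone transform of the first model's outputs) chosen so that ranking, and therefore AUC, is unchanged while the probabilities themselves become badly overestimated. The Brier score is used here as one simple calibration-sensitive metric; it is not necessarily the measure used by the reviewed studies.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, probs):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical outcomes and predicted risks for eight patients.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9]

# Artificially miscalibrated model: same ranking, systematically higher risks.
inflated = [p ** 0.25 for p in probs]

auc_a, auc_b = auc(labels, probs), auc(labels, inflated)
brier_a, brier_b = brier_score(labels, probs), brier_score(labels, inflated)
print(f"AUC:   {auc_a:.4f} vs {auc_b:.4f}")
print(f"Brier: {brier_a:.4f} vs {brier_b:.4f}")
```

Both models share the same AUC, yet the Brier score of the inflated model is more than twice as large, which is the kind of gap a discrimination-only evaluation would miss.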
The "weak AI" reality: All current AI models in CRC treatment are classified as "weak AI," meaning they do not perform tasks the way humans do but rather follow programmed algorithms to achieve satisfactory results. These models play a supplementary role in clinical practice, and all eventual clinical decisions must be determined by physicians. The authors speculate that "strong AI" with consciousness and intentionality may eventually be proposed for clinical decision-making, but they caution that the ethical and validation challenges of such systems should be considered well in advance.
Ethical and legal concerns: AI models are built on patients' private and sensitive information, including identity, health status, diagnosis, and treatment data. Data security risks are real, and collecting patient information without informed consent could infringe on patient rights if data is stolen or misused. While approaches like k-anonymity privacy protection algorithms have been developed, their application is limited by time and cost. Perhaps most critically, the question of attributing responsibility for medical decision errors caused by AI remains legally undefined, creating a significant barrier to large-scale clinical adoption.
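Illustrative sketch, checking k-anonymity: As a small illustration of the k-anonymity property mentioned above, the Python sketch below measures the k of a table rather than performing anonymization itself. A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The column names and records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifier columns.

    The dataset is k-anonymous for any k up to the returned value.
    """
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical, already-generalized patient records (illustrative only).
records = [
    {"age_band": "60-69", "sex": "F", "zip3": "021", "stage": "II"},
    {"age_band": "60-69", "sex": "F", "zip3": "021", "stage": "III"},
    {"age_band": "70-79", "sex": "M", "zip3": "021", "stage": "II"},
    {"age_band": "70-79", "sex": "M", "zip3": "021", "stage": "IV"},
    {"age_band": "70-79", "sex": "M", "zip3": "021", "stage": "I"},
]

k = k_anonymity(records, ["age_band", "sex", "zip3"])
print(f"dataset is {k}-anonymous on the chosen quasi-identifiers")
```

Raising k generally means coarsening or suppressing quasi-identifier values, which is one source of the time and cost burden noted above.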
Technical advancement opportunities: The review points to several promising directions. Transformer architectures, which have shown superior performance in image recognition, could potentially improve treatment outcome prediction for CRC patients. Multimodal DL approaches that integrate high-dimensional omics data with imaging and clinical variables represent the next frontier, as DL-based multimodal methods allow for better integration than traditional ML. The authors also emphasize that studies should move beyond single-omics models toward multi-omics integration, as demonstrated by the superior performance of microbiome-based ML models over mRNA or miRNA models alone.
Sample size and study design: Many included studies had small sample sizes that limit the robustness of their conclusions. The authors call for large retrospective studies with robust external validation and note that the narrative review format itself has limitations, as it did not employ systematic review methodology. Some studies with poor results or insufficient data were excluded, introducing potential selection bias. Moving forward, adequately powered prospective multicenter trials will be essential to demonstrate that AI tools can reliably guide CRC treatment decisions in diverse clinical environments.