Effectiveness of Artificial Intelligence Models in Predicting Lung Cancer Recurrence: A Generalized Review

Plain-English Explanations
Pages 1-3
Why Predicting Lung Cancer Recurrence Remains a Critical Unmet Need

Lung cancer accounts for approximately 11.6% of all newly diagnosed cancer cases worldwide and remains the leading cause of cancer-related death. Among patients with non-small cell lung cancer (NSCLC), recurrence after initial treatment occurs at alarmingly high rates, ranging from 30% to 70% depending on the stage. Even patients diagnosed at stage I face roughly a 30% chance of relapse, while those at stage IV see rates climb to approximately 70%. These statistics underscore a fundamental gap in how we manage post-treatment surveillance and adjuvant therapy decisions.

Conventional prediction tools fall short: Traditional methods for estimating recurrence risk rely on the TNM staging system (which evaluates tumor size, lymph node involvement, and metastasis), tumor histology, and surgical margin status. While these approaches provide a foundation, they focus on static anatomical and cellular features and fail to capture dynamic biological processes such as immune evasion, epigenetic reprogramming, and tumor microenvironment interactions. For example, TNM staging alone achieves a concordance probability estimate (CPE) of only 0.61, meaning it performs only modestly better than random chance at distinguishing who will relapse from who will not.
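
To make the idea of a concordance score concrete, the sketch below computes Harrell's C-index, a closely related but distinct estimator from the CPE quoted above (the review does not give the CPE formula, so this is a swapped-in illustration, not the authors' method). The function name and the toy cohort are hypothetical. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk score ranks correctly.

    times:       follow-up time per patient
    events:      1 if recurrence was observed, 0 if the patient was censored
    risk_scores: higher value means higher predicted recurrence risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times and pairs whose earlier time is censored:
        # we cannot tell who recurred first in those cases.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort (not data from the review).
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.6, 0.1]
c = harrell_c_index(times, events, scores)
print(f"C-index: {c:.2f}")
```

On this toy cohort, 6 of the 8 comparable pairs are ranked correctly, so the score sits between the 0.61 reported for TNM staging and a perfect 1.0.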

The biomarker landscape: Several genetic markers have been linked to recurrence, including mutations in TP53, KRAS, and APC, as well as IGF1R expression and matrix metalloproteinases (MMPs). MicroRNAs show diagnostic potential but lack standardized clinical protocols. Newer methods like next-generation sequencing (NGS) and immune checkpoint blockade (ICB) marker tracking offer deeper genomic profiling, but integrating all of these data streams into a coherent predictive framework has remained elusive without computational assistance.

Enter AI: This review evaluates how artificial intelligence models that integrate gene biomarkers, including TP53, KRAS, FOXP3, PD-L1, and CD8, can enhance recurrence prediction and improve personalized risk stratification beyond what conventional clinical tools can achieve. The authors specifically focus on the synergy between AI algorithms and genomic data, arguing that AI's ability to decode multifaceted biological drivers of recurrence represents a step change in precision oncology.

TL;DR: Lung cancer recurrence hits 30-70% of NSCLC patients post-treatment. TNM staging alone achieves a CPE of just 0.61. This review evaluates AI models integrating gene biomarkers (TP53, KRAS, FOXP3, PD-L1, CD8) to improve recurrence prediction across 18 studies covering 4,861 patients.
Pages 4-5
PRISMA-Guided Systematic Review Across Six Major Databases

The authors followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for this review. Their search, conducted in September 2024, spanned six databases: PubMed, Embase, Cochrane Library, Scopus, Web of Science, and Google Scholar. Search terms were organized around three themes: AI methods (e.g., "machine learning," "neural networks," "deep learning"), lung cancer terminology (e.g., "lung carcinoma," "pulmonary tumors"), and recurrence indicators (e.g., "relapse," "recurrent disease"). Each database's unique indexing features were accommodated, with PubMed queries integrating Medical Subject Headings and Boolean operators, while Scopus and Web of Science filtered by title, abstract, and keyword fields.
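
The three-theme search structure can be sketched as a Boolean query in which synonyms within a theme are joined with OR and the themes are joined with AND. The example terms are the ones quoted above. The exact search strings, field tags, and MeSH headings used by the authors are not given here, so the helper below (a hypothetical name) only illustrates the general pattern.

```python
def build_boolean_query(themes):
    """OR the terms inside each theme, then AND the themes together."""
    groups = []
    for terms in themes:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


themes = [
    ["machine learning", "neural networks", "deep learning"],  # AI methods
    ["lung carcinoma", "pulmonary tumors"],                    # lung cancer terms
    ["relapse", "recurrent disease"],                          # recurrence indicators
]
query = build_boolean_query(themes)
print(query)
```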

Eligibility and screening: Included studies had to develop machine learning or deep learning models that predicted lung cancer recurrence using genomic biomarkers. Exclusions covered in vitro models, reviews, meta-analyses, editorials, letters, non-English articles, and non-human studies. Two trained reviewers (N.P. and M.P.) independently screened titles and abstracts via EndNote (version 21.3), then performed full-text evaluation. Discrepancies were resolved by consensus or a third arbitrator. Studies that failed to integrate genomic data into AI frameworks were systematically excluded during secondary screening.

Data extraction and synthesis: A structured template captured study authorship, publication year, research design, cancer classification, participant demographics (sample size, average age, gender distribution), and technical details including feature selection methods, training protocols, and evaluation measures (AUC/ROC, sensitivity/specificity, accuracy, and F1 scores). One researcher entered data into a spreadsheet, which was independently verified by two additional team members. Due to heterogeneity across the included studies, a formal meta-analysis was not performed. Instead, the extracted data were narratively synthesized and presented in tables and figures.

Selection results: From 3,298 initially identified articles, 2,702 remained after duplicate removal. Title screening narrowed these to 274 studies for abstract review, and 74 underwent full-text evaluation. Fifty-six were excluded for unrelatedness, leaving 18 studies (published 2019-2024) that met all inclusion criteria. These 18 studies spanned 14 countries and covered 4,861 patients total, with sample sizes ranging from 41 to 1,348 participants.
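
The selection funnel can be checked with simple arithmetic. The sketch below uses only the counts stated above; the per-stage exclusion numbers are derived by subtraction and are not reported separately in this summary.

```python
# Counts stated in the selection results.
identified = 3298
after_duplicates = 2702
abstract_review = 274
full_text = 74
excluded_full_text = 56

# Derived by subtraction at each screening stage.
duplicates_removed = identified - after_duplicates
excluded_at_title = after_duplicates - abstract_review
excluded_at_abstract = abstract_review - full_text
included = full_text - excluded_full_text

print(duplicates_removed, excluded_at_title, excluded_at_abstract, included)
```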

TL;DR: PRISMA-guided search across 6 databases yielded 3,298 articles, narrowed to 18 studies from 14 countries. Total patient pool: 4,861 (sample sizes 41-1,348). No meta-analysis due to study heterogeneity. Two independent reviewers screened with a third arbitrator for disputes.
Pages 5-8
18 Studies, 12 Unique ML Approaches, and a Global Research Footprint

Geographic and demographic spread: Seven of the 18 studies were conducted in China, three in the United States, and two in the UK. Single studies originated from France, India, Korea, Japan, Sweden, Canada, Iraq, Ireland, Spain, and the Czech Republic. The mean participant age ranged from 33 to 86 years. Men accounted for a significantly larger share of participants (3,228 men vs. 1,633 women in studies reporting gender). Median follow-up durations ranged from 3 to 10 years.

Cancer subtypes: NSCLC was the most researched type, appearing in seven studies, with the largest single NSCLC study containing 827 participants. Lung adenocarcinoma (LUAD) was specifically investigated in six studies totaling 3,026 patients. Small cell lung cancer (SCLC) was represented in only one study with 102 patients, a notable gap given SCLC's aggressive biology and high recurrence rates. Lung squamous cell carcinoma (LUSC) appeared in studies combining it with LUAD, but received far less dedicated attention.

Machine learning techniques: At least 12 unique ML approaches were reported. Support vector machines (SVM) and regression models (LASSO and Cox) were the most commonly used, each appearing in four studies. Other algorithms included Random Forest, gradient boosting, XGBoost, neural networks, K-Nearest Neighbors, and naive Bayes. Notably, several studies employed custom-developed models, including the Optuna-optimized XGBoost (Optuna_XGB), the Back Propagation Network optimized with an Ant Lion Optimization algorithm (BPN-ALO), and the Interpretable Biological Pathway Graph Neural Network (IBPGNET). The diversity of techniques reflects an active, exploratory phase in the field rather than convergence on a single best approach.

Data sources: Most studies drew from publicly available repositories including The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). Some studies used institutional cohorts with next-generation sequencing data (e.g., the MSK-IMPACT platform). Data types ranged from gene expression profiles and DNA methylation levels to RNA-seq data, copy number variations, and clinical characteristics.

TL;DR: 18 studies from 14 countries, 4,861 patients total. NSCLC dominated (7 studies), LUAD appeared in 6 studies (3,026 patients), SCLC only 1 study (102 patients). SVM and LASSO/Cox regression were most common (4 studies each). At least 12 unique ML methods used, including custom models like IBPGNET and Optuna_XGB.
Pages 9-11
AI Models Consistently Outperformed TNM Staging, With AUCs Reaching 0.92-0.965

Gradient boosting and XGBoost: Jones et al. (2021) developed PRecur, a gradient-boosted model combining clinicopathologic variables (tumor size, histological subtype) with genomic data (TP53 and SMARCA4 mutations). PRecur achieved a CPE of 0.73, significantly outperforming TNM classification alone (CPE = 0.61). Jiang et al. (2021) built an immune risk score using FOXP3, PD-L1 on tumor-infiltrating lymphocytes (TILs), and CD8 markers via XGBoost, achieving an AUC of 0.866. Combining that immune risk score with clinical staging improved the AUCs for 1-, 3-, and 5-year relapse-free survival to 0.656, 0.737, and 0.698, respectively. Abdu-Aljabar et al. (2023) introduced an Optuna-optimized XGBoost model that extracted BTBD6, KLHL7, and BMPR1A as predictive biomarkers, achieving 93% accuracy on one dataset and 81% on another.

Support vector machines and regression: Shen et al. (2023) developed an SVM-based combo-classifier integrating clinical and gene expression data, achieving an AUC of 0.920 (95% CI: 0.890-0.950), which markedly outperformed models relying solely on clinical parameters. Training set sensitivity reached 89.5% with 62.5% specificity, while the validation set achieved 75.0% sensitivity and 100.0% specificity. Zhong et al. (2019) achieved an AUC of 0.95 in internal cross-validation with sensitivity of 0.88 and specificity of 0.90 using SVM with recursive feature elimination. Xu et al. (2020) reported an AUC of 0.963 using LASSO Cox regression on a 12-gene signature for LUAD recurrence prediction.
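
The metrics quoted in this paragraph can be computed from a model's scores as sketched below. This is a minimal illustration on hypothetical toy data, not a reproduction of any study's results; the function names and the 0.5 threshold are assumptions. AUC is computed here as the probability that a randomly chosen recurrence case scores higher than a randomly chosen non-recurrence case.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_rank(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = recurrence) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]

auc = auc_rank(y_true, scores)
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"AUC={auc:.4f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

AUC summarizes ranking quality across all thresholds, while sensitivity and specificity describe a single operating point, which is why a study can report a high AUC alongside an uneven sensitivity/specificity pair.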

Hybrid and feature-enriched models: Timilsina et al. (2022, 2023) showed that integrating imputed aneuploidy scores with clinical features improved AUC from 0.78 to 0.79, and incorporating pathway scores pushed it to 0.80 using Random Forest. Wang et al. developed the Acetylation-Related Score (ARS) based on RBBP7 and YEATS2, which achieved AUCs of 0.679, 0.669, and 0.600 for 1-, 3-, and 5-year recurrence-free survival with a pooled hazard ratio of 1.88 (p < 0.001). Luo et al. (2020) achieved the highest reported AUC of 0.965 using LASSO and Random Forest on CpG methylation markers.

Head-to-head with conventional methods: Across all studies, AI models integrating gene expression data consistently outperformed TNM staging and single-indicator clinical models. Reported discrimination for AI-driven approaches ranged from a CPE of 0.73 (PRecur) to an AUC of 0.965, compared to a CPE of 0.61 for TNM staging alone. Because CPE and AUC are different metrics, the comparison is indicative rather than exact. Multi-modal approaches that combined gene expression, radiomics, and clinical data consistently yielded the strongest results.

TL;DR: AI models reached discrimination scores from 0.73 (CPE) to 0.965 (AUC) vs. a CPE of 0.61 for TNM staging. Top performers: Luo et al. AUC 0.965 (LASSO + Random Forest), Xu et al. AUC 0.963 (LASSO Cox), Zhong et al. AUC 0.95 with 88% sensitivity and 90% specificity, Shen et al. AUC 0.920 (SVM combo-classifier, 95% CI: 0.890-0.950).
Pages 10-12
Deep Learning Frameworks Push Beyond Traditional ML With Multi-Omics Integration

While the majority of studies relied on classical machine learning techniques, three studies advanced the field by deploying deep learning architectures capable of integrating multi-omics data. These approaches go beyond simple gene expression analysis by combining diverse biological data types, including copy number variations (CNVs), single nucleotide variants (SNVs), and imaging-derived radiomics features, into unified predictive frameworks.

Genotype-Guided Radiomics (GGR): Aonpong et al. (2021) proposed the GGR framework, which fuses handcrafted radiomics features extracted from CT images with estimated gene expressions generated through deep learning. By combining imaging and genomic modalities, GGR achieved an accuracy of 83.28% for recurrence prediction, significantly outperforming conventional radiomic methods, which reached only 78.61%. The model also reported a sensitivity of 0.95, though specificity was lower at 0.59, suggesting the model favors catching recurrences at the cost of some false positives.
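
To see what a sensitivity of 0.95 with a specificity of 0.59 means for an individual prediction, the sketch below converts them into positive and negative predictive values using Bayes' rule. The 30% recurrence prevalence is an assumption borrowed from the stage I relapse figure quoted earlier in this summary; the actual prevalence in the GGR cohort is not given here, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with given prevalence."""
    tp = sensitivity * prevalence               # true positives per patient
    fn = (1 - sensitivity) * prevalence         # missed recurrences
    tn = specificity * (1 - prevalence)         # true negatives
    fp = (1 - specificity) * (1 - prevalence)   # false alarms
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity from the text; prevalence is an assumed value.
ppv, npv = predictive_values(sensitivity=0.95, specificity=0.59, prevalence=0.30)
print(f"PPV={ppv:.3f} NPV={npv:.3f}")
```

Under that assumed prevalence, roughly half of the positive calls would be false alarms, while a negative call would be wrong only about 3.5% of the time, which matches the reading that the model favors catching recurrences.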

IBPGNET: Xu et al. (2024) developed the Interpretable Biological Pathway Graph Neural Network, a deep learning model that integrates multi-omics data and latent biological pathway relationships for predicting LUAD recurrence. IBPGNET achieved an AUC of 0.88, an accuracy of 0.82, and an area under the precision-recall curve (AUPR) of 0.79. It outperformed Random Forest, SVM, PathCNN, and DeepOmix. The integration of SNV + amplification CNV + deletion CNV yielded the highest AUPR. The model's hierarchical visualization identified key genes and pathways, including PSMC1 and PSMD11, that contribute to LUAD recurrence and drug resistance.
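
AUPR is less familiar than AUC, so a minimal sketch of one common way to compute it, average precision, may help. This is an illustration on hypothetical toy data and is not taken from the IBPGNET study, which may have used a different interpolation of the precision-recall curve. The function name is an assumption, and the code assumes no tied scores.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes distinct scores)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank patients from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at the rank where recall increases
    return ap / total_pos


# Hypothetical labels (1 = recurrence) and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
ap = average_precision(y_true, scores)
print(f"AUPR (average precision): {ap:.4f}")
```

Unlike AUC, this score depends on how common recurrence is in the cohort, so it is most informative when compared across models evaluated on the same data.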

BPN-ALO: Senthil et al. (2019) demonstrated a Back Propagation Network optimized with the Ant Lion Optimization algorithm, which integrated dimensionality reduction, feature optimization, and a neural network structure. This model achieved up to 99.1% accuracy, 88.6% sensitivity, and 96.8% specificity. While these numbers are impressive, the study used the UCI Machine Learning Repository rather than a clinical cohort, so generalizability to real-world patient populations requires further validation.

TL;DR: Three deep learning studies advanced multi-omics integration. GGR combined CT radiomics with gene expression for 83.28% accuracy (vs. 78.61% for conventional radiomics). IBPGNET achieved AUC 0.88 and AUPR 0.79 by integrating SNVs and CNVs via graph neural networks. BPN-ALO reached 99.1% accuracy but on a benchmark dataset, not clinical data.
Pages 11-15
Gene Biomarkers Driving Recurrence Prediction Across Four Major Categories

LUAD/NSCLC prediction biomarkers: Zhong et al. (2019) identified eight genes as pivotal for risk assessment frameworks: PDIA3, MYH11, PDK1, SDC3, RPE65, LAMC3, BTK, and UPK1B. Jones et al. (2021) found that SMARCA4 mutations, TP53, and the Fraction of Genome Altered (FGA) were significant recurrence predictors. Luo et al. (2020) demonstrated the utility of CpG methylation markers, including ART4, KCNK9, FAM83A, and C6orf10, for prognostic insights. Xu et al. (2020) identified a 12-gene signature (ACTR2, ALDH2, FBP1, HIRA, ITGB2, MLF1, P4HA1, S100A10, S100B, SARS, SCGB1A1, SERPIND1, and VSIG4) that achieved an AUC of 0.963.

Immune-related markers: Jiang et al. (2021) found that FOXP3 expression and PD-L1 on tumor-infiltrating lymphocytes played crucial roles in predicting SCLC recurrence, with the XGBoost-derived immune risk score achieving an AUC of 0.866. Rakaee et al. (2023) demonstrated that STK11 and KEAP1 co-mutations were associated with distinct immune phenotypes impacting recurrence risk, yielding AUCs up to 0.856. These immune markers connect to major signaling pathways: NF-κB, JAK-STAT, and MAPK all play roles in cancer cell proliferation, while PD-L1 mutations enable tumor immune evasion.

Multi-omics approaches: Xu et al. (2024) showed that PSMC1, PSMD11, PRKCB, CCNE1, NRG1, ZNF521, and NGF significantly improved prediction through the IBPGNET graph neural network. Zhou et al. (2023) identified long non-coding RNAs LINC00675 and MEG3 as critical recurrence predictors. Shi et al. (2021) reported CPS1, CCR2, NT5E, ANLN, and ABCC2 as biomarkers with strong prognostic value. Wang et al. identified RBBP7 and YEATS2 as key acetylation-related genes, developing the Acetylation-Related Score (ARS) with a pooled hazard ratio of 1.88 (p < 0.001).

Tumor and immune tissue markers: Shen et al. (2023) identified MR1, BCL6, and CCL13 in tumor tissues and TBX21, IL-17RB, and GZMB in the buffy coat as key recurrence predictors. Abdu-Aljabar et al. (2023) found BTBD6, KLHL7, and BMPR1A to be highly predictive. Across all studies, AI-driven models incorporating these molecular features achieved AUC values ranging from 0.76 to 0.965, underscoring the transformative potential of combining genomic and immune markers with computational modeling.

TL;DR: Four categories of biomarkers emerged: LUAD/NSCLC genes (PDIA3, MYH11, SMARCA4, TP53), immune markers (FOXP3, PD-L1, STK11/KEAP1), multi-omics targets (PSMC1, PSMD11, lncRNAs LINC00675/MEG3), and tissue-specific markers (MR1, BCL6, TBX21). Combined with AI, these achieved AUCs of 0.76-0.965.
Pages 16-18
Treatment Context: Surgery, Chemotherapy, Immunotherapy, and How AI Could Guide Decisions

Surgical recurrence rates: Surgical resection was the primary treatment across the reviewed studies, but recurrence remains a major concern. Approximately 18-75% of patients with various stages of LUAD experienced recurrence after surgery. Among LUSC patients, roughly 21% suffered from post-surgical recurrence. Stage I NSCLC patients faced about a 30% relapse rate, rising to approximately 70% at stage IV. These wide ranges highlight the inadequacy of stage-based estimates alone for guiding individual patient management.

Chemotherapy: Patients receiving adjuvant chemotherapy after surgery had a recurrence rate of 26.2%, compared to 32.2% for surgery alone, suggesting a meaningful benefit from post-surgical chemotherapy. PD-1 and PD-L1 blockers combined with first-line chemotherapy showed promise for SCLC, with patients having high FOXP3 expression on TILs experiencing longer recurrence-free survival. Among stage I-II patients, approximately 42% experienced recurrence, climbing to 74% at stage III.
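
The size of the chemotherapy benefit can be put in standard epidemiological terms. The sketch below takes the two recurrence rates quoted above and derives the absolute risk reduction, relative risk, and number needed to treat. It assumes the two groups are otherwise comparable, which pooled observational figures do not guarantee, so the outputs are illustrative only. The function name is hypothetical.

```python
def risk_comparison(risk_treated, risk_control):
    """Return (ARR, RR, RRR, NNT) for two event rates expressed as fractions."""
    arr = risk_control - risk_treated   # absolute risk reduction
    rr = risk_treated / risk_control    # relative risk
    rrr = 1 - rr                        # relative risk reduction
    nnt = 1 / arr                       # number needed to treat
    return arr, rr, rrr, nnt


# Recurrence rates from the text: 26.2% with adjuvant chemo, 32.2% surgery alone.
arr, rr, rrr, nnt = risk_comparison(0.262, 0.322)
print(f"ARR={arr:.3f} RR={rr:.3f} RRR={rrr:.3f} NNT={nnt:.1f}")
```

Read this way, the difference is about 6 percentage points, or roughly one recurrence avoided for every 17 patients treated under the stated assumption.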

Immunotherapy: Immune checkpoint blockers (ICBs) demonstrated promising results among high-risk NSCLC and LUAD patients, with ICB-treated patients experiencing better recurrence-free survival than untreated counterparts. Patients with LUAD and a low Tumor Stemness Index (TSI) had significantly lower recurrence rates, which also predicted clinical response to radiotherapy. Radiotherapy itself was associated with a low recurrence rate of 2.82% in NSCLC patients who received it, though patients receiving adjuvant radiotherapy paradoxically showed higher recurrence rates, likely because radiotherapy is typically reserved for advanced or terminal stages with inherently elevated relapse risk.

AI's therapeutic role: The authors hypothesize that AI models may become essential for identifying which patients are most likely to benefit from specific adjuvant treatments, particularly through genomic biomarker-guided selection. The ability of AI to integrate complex genetic data with clinical characteristics could enable more precise treatment planning, moving beyond one-size-fits-all adjuvant therapy toward individualized regimens that balance efficacy against treatment burden.

TL;DR: Post-surgical recurrence: 18-75% for LUAD, ~21% for LUSC, 30% for stage I NSCLC, 70% for stage IV. Adjuvant chemo reduced recurrence to 26.2% vs. 32.2% without it. ICBs improved RFS in high-risk patients. AI could guide adjuvant therapy selection through genomic biomarker integration.
Pages 19-21
Small Cohorts, Missing External Validation, and the Path to Clinical Adoption

Sample size and data heterogeneity: The included studies reported sample sizes ranging from just 41 to 1,348 patients. This variability, combined with the absence of a unified benchmark dataset, makes direct performance comparisons difficult. Some studies relied on publicly available repositories like TCGA and GEO, while others used single-center institutional cohorts. The field lacks large-scale, standardized, multi-institutional datasets that would support rigorous model training, testing, and external validation.

Validation gaps: Most models demonstrated strong internal validation through cross-validation or train/test splits, but very few underwent external testing on independent cohorts from different institutions. Without external validation, it remains unclear whether these models will perform reliably across diverse populations, healthcare systems, and resource levels. The variability in model performance across studies reflects fundamental differences in data characteristics, algorithm selection, and validation approaches, and a traditional meta-analysis was not feasible due to this heterogeneity.

Interpretability and the black-box problem: Many AI models operate as "black boxes," offering predictions without transparent reasoning. This lack of interpretability hinders clinical trust and adoption. Tools such as SHAP (SHapley Additive exPlanations) and NetShap provide insights into feature contributions and model decisions, but not all reviewed studies implemented these methods. The authors recommend broader adoption of explainability tools to enhance transparency. Additionally, the cost of genomic profiling, particularly for technologies like next-generation sequencing, poses accessibility barriers, especially in low-resource settings.
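
The idea behind SHAP can be illustrated by computing exact Shapley values for a tiny model. The sketch below enumerates every feature subset, which is only feasible for a handful of features; the SHAP library itself relies on faster approximations and is not used here. The toy risk model, its coefficients, and the function names are hypothetical, with feature names borrowed from the text purely as labels.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley value of each feature for a set-valued scoring function."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to subset s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_risk(present):
    """Hypothetical recurrence risk given which risk factors are present."""
    score = 0.10                      # baseline risk
    if "TP53" in present:
        score += 0.20
    if "SMARCA4" in present:
        score += 0.10
    if "TP53" in present and "SMARCA4" in present:
        score += 0.10                 # interaction term, split between both genes
    if "stage" in present:
        score += 0.05
    return score


phi = shapley_values(toy_risk, ["TP53", "SMARCA4", "stage"])
print({name: round(value, 3) for name, value in phi.items()})
```

The contributions sum to the gap between the full prediction and the baseline, which is the property that makes this kind of attribution easy to present to clinicians.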

Future directions: The authors identify several priorities for advancing the field. Expanding multi-institutional datasets and including underrepresented populations will improve generalizability. Federated learning could enable institutions to pool data securely without sharing raw patient information. Advancing multi-omics integration, combining transcriptomic, proteomic, and epigenetic data with standardized fusion protocols, is critical for reproducibility. Non-invasive biomarkers such as circulating tumor DNA (ctDNA) and liquid biopsies could streamline recurrence monitoring while reducing patient burden. Finally, developing clinician-friendly AI tools with explainable outputs that integrate seamlessly into electronic health records (EHRs) will be essential for real-world clinical deployment.
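
As a rough illustration of how federated learning avoids sharing raw patient data, the sketch below shows a single aggregation round in the style of federated averaging: each site trains locally and sends only its model parameters, which a coordinator averages in proportion to cohort size. The review does not name a specific algorithm, so this choice, the function name, and all numbers are assumptions; local training and secure communication are omitted.

```python
def federated_average(site_weights, site_sizes):
    """Sample-size-weighted average of per-site model parameter vectors."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[k] * n for w, n in zip(site_weights, site_sizes)) / total
        for k in range(n_params)
    ]


# Hypothetical parameters from three hospitals after one round of local training.
site_weights = [[0.2, -1.0], [0.4, -0.5], [0.1, -0.8]]
site_sizes = [100, 300, 600]  # patients per site (made-up values)

global_weights = federated_average(site_weights, site_sizes)
print([round(w, 3) for w in global_weights])
```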

Ethical considerations: The SNMMI AI Task Force's lifecycle ethics framework emphasizes protecting patient privacy through anonymization, mitigating biases in training datasets, ensuring equitable representation of marginalized populations, and transparently documenting limitations. Post-deployment, the priorities shift to preserving clinician-patient autonomy, disclosing population-specific performance gaps, preventing systemic underdiagnosis in underrepresented groups, and establishing clear accountability structures. The authors advocate embedding safeguards at every stage, from data collection through governance, to ensure AI enhances diagnostic accuracy without exacerbating health disparities.

TL;DR: Key limitations: small cohorts (41-1,348 patients), limited external validation, no unified benchmark dataset, and black-box interpretability concerns. Future priorities include federated learning, multi-omics integration with standardized protocols, non-invasive biomarkers (ctDNA, liquid biopsies), EHR-integrated explainable AI tools, and lifecycle ethics frameworks to ensure equitable deployment.
Citation: Pourakbar N, Motamedi A, Pashapour M, et al. Open Access, 2025. Available at: PMC12153899. DOI: 10.3390/cancers17111892. License: CC BY.