ML Prognostic and Metastasis Models of Kidney Cancer

Overview & Background

Why Machine Learning for Kidney Cancer Prognosis and Metastasis?

Kidney cancer originates in the tubular epithelium of the renal parenchyma and accounts for roughly 20% of all urinary system tumors. As of 2018, new cases worldwide exceeded 400,000 annually, with over 170,000 deaths. In China alone, the 2016 incidence exceeded 15,000 cases (4.02 per 100,000 population) with a mortality rate of 1.37 per 100,000. Approximately 70% of cases are localized at diagnosis and can often be cured with surgery, but the remaining 30% present with metastatic disease. Even among localized cases, 20-30% of patients experience recurrence and metastasis after nephrectomy. Immunotherapy, the most promising option for metastatic kidney cancer, succeeds in only 10-15% of patients.

Existing risk models: Traditional clinical prediction systems such as the Memorial Sloan-Kettering Cancer Center (MSKCC) model, the International Metastatic Renal-Cell Carcinoma Database Consortium (IMDC) criteria, the Leibovich score, and the University of California Los Angeles Integrated Staging System (UISS) have been used to stratify patients by recurrence risk. However, these models rely on conventional statistical inference, which makes distributional assumptions about the data and struggles with the high-dimensional complexity of modern cancer datasets.

The machine learning advantage: Machine learning methods generally make fewer explicit assumptions about data distribution than conventional statistical inference. Instead, they learn patterns from rich datasets and generalize to predict unknown outcomes. Previous work has demonstrated this potential: Byun et al. applied deep learning to predict prognosis in nonmetastatic clear cell renal cell carcinoma, and Ji et al. used ML models to predict hepatocellular carcinoma recurrence after resection. This study aims to build on that foundation by developing and comparing eight ML models for predicting both 3-year survival and metastasis in kidney cancer patients using the SEER database.

TL;DR: With over 400,000 new kidney cancer cases per year globally and 20-30% of surgical patients experiencing recurrence, accurate prognostic tools are critical. This study developed eight ML models to predict 3-year survival and metastasis using SEER data from 12,394 patients, aiming to outperform traditional statistical approaches like MSKCC and IMDC.

Methodology

Patient Cohort and Data Preprocessing from SEER

The study drew on the Surveillance, Epidemiology, and End Results (SEER) database, which covers approximately 30% of the US population. Patients with histologically diagnosed kidney cancer and complete survival time and active follow-up data from 2004 to 2015 were included. The authors specifically selected cases with ICD-10-CM code C64.9 (malignant neoplasm of unspecified kidney), excluding renal pelvis cancers. Patients diagnosed only through autopsy or death certificates, those with missing follow-up records, and those with pathological features coded as "Blank," "N/A," or "Unknown" were excluded.

Final cohort: After filtering, the cohort comprised 12,394 eligible patients. Among them, 6,432 (51.90%) survived more than 3 years and 2,519 (20.32%) had metastases. The median age was 61 years (range 49-73), 65.84% were male, and the most common histological subtype was clear cell renal cell carcinoma (ccRCC) at 59.20%. Localized and regional stages accounted for 41.01% and 37.97% of cases, respectively, with 21.02% presenting at a distant stage. Tumor staging followed the 6th edition of the AJCC TNM system.
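The two headline proportions can be rechecked from the reported counts. The following is a minimal arithmetic sketch; the variable and function names are illustrative and do not come from the paper.

```python
# Counts reported for the final SEER cohort.
total_patients = 12_394
survived_over_3_years = 6_432
with_metastasis = 2_519


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


survival_pct = percent(survived_over_3_years, total_patients)
metastasis_pct = percent(with_metastasis, total_patients)
print(f"survived >3 years: {survival_pct:.2f}%")
print(f"with metastasis:   {metastasis_pct:.2f}%")
```

Both values agree with the percentages quoted in the cohort description.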

Data preprocessing: After cleaning, 12,082 patients were included in the survival analysis and 12,192 in the metastasis analysis. Variables were reviewed by clinicians for clinical relevance. The chi-squared test assessed differences between categorical variables and the t-test was used for continuous variables, with statistical significance set at p < 0.05. The observation period ran from the date of diagnosis until death, recurrence, or the end of data inclusion (2018). Statistical analyses were performed in R (version 4.2.0) and RStudio (version 1.3.1093).
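To make the categorical comparison concrete, here is a minimal sketch of a Pearson chi-squared test on a 2x2 table using only the Python standard library. The authors ran their tests in R; the function name and the counts below are hypothetical and serve only to illustrate the calculation, which here omits the continuity correction.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table.

    Uses 1 degree of freedom and no continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (NOT from the paper): rows are two patient groups,
# columns are presence/absence of a categorical feature.
example_table = [[10, 20], [30, 40]]
statistic, p_value = chi_squared_2x2(example_table)
print(f"chi2={statistic:.3f}, p={p_value:.3f}, significant={p_value < 0.05}")
```

With the study's threshold of p < 0.05, this hypothetical table would not count as a significant difference.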

TL;DR: The SEER database yielded 12,394 kidney cancer patients (2004-2015), with 51.90% surviving 3+ years and 20.32% developing metastases. Median age was 61, 65.84% male, and 59.20% had clear cell RCC. After preprocessing, 12,082 and 12,192 patients were used for survival and metastasis modeling, respectively.

Feature Selection

Variable Selection: Balancing Statistical Significance and Clinical Relevance

The authors recorded a comprehensive set of variables for each patient: age at first visit, race (Black, White, other, unknown), sex, tumor size, marital status, year of birth, year of diagnosis, histologic type, Fuhrman nuclear grade (I through IV), T stage (T1-T4), lymph node status (N0, N1, N2), distant metastasis status (M0, M1), primary site surgery information, and site-specific metastasis data for bone, brain, liver, and lung. The primary endpoints were death within 36 months and tumor metastasis within 3 years of diagnosis.

Survival prediction variables: For the 3-year survival model, the finalized predictors included race, age at diagnosis, tumor size, differentiation grade, stage, histologic type, TNM staging, primary tumor surgery type, and lymph node clearance status. These variables were selected through a combination of statistical testing (chi-squared and t-tests) and clinician review, ensuring both data-driven and clinical justification for each feature.

Metastasis prediction variables: For the metastasis model, a slightly different and more focused set was used: race, gender, age at diagnosis, Fuhrman grade, histologic type, and T and N staging. Notably, variables like primary surgery type and tumor size were excluded from the metastasis model, reflecting the different clinical context, since metastasis prediction ideally relies on features available at the time of initial diagnosis rather than post-surgical information.

Using separate feature sets for the two tasks is notable because it acknowledges that survival and metastasis are distinct clinical questions. The inclusion of Fuhrman nuclear grade is particularly relevant: Grade IV (undifferentiated/anaplastic) tumors appeared in 43.75% of metastatic patients compared with only 16.20% of non-metastatic patients (p < 0.001), highlighting the prognostic weight of tumor differentiation.
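The two feature sets described above can be written down as a small configuration. This is only a sketch: the column names are hypothetical labels chosen for illustration, not the actual SEER field names or the authors' variable names.

```python
# Hypothetical column names for the predictors listed in the text.
SURVIVAL_FEATURES = [
    "race",
    "age_at_diagnosis",
    "tumor_size",
    "differentiation_grade",
    "stage",
    "histologic_type",
    "tnm_stage",
    "primary_surgery_type",
    "lymph_node_clearance",
]

METASTASIS_FEATURES = [
    "race",
    "sex",
    "age_at_diagnosis",
    "fuhrman_grade",
    "histologic_type",
    "t_stage",
    "n_stage",
]

# Features that appear under the same name in both lists.
shared = sorted(set(SURVIVAL_FEATURES) & set(METASTASIS_FEATURES))
# Features used only by the survival model, in their original order.
survival_only = [f for f in SURVIVAL_FEATURES if f not in METASTASIS_FEATURES]

print(len(SURVIVAL_FEATURES), len(METASTASIS_FEATURES))
print("shared:", shared)
print("survival only:", survival_only)
```

Keeping the lists separate makes it easy to confirm that post-surgical variables such as surgery type stay out of the metastasis model.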

TL;DR: The survival model used 9 features (including surgery type and tumor size), while the metastasis model used 7 features focused on diagnosis-time data. Fuhrman Grade IV appeared in 43.75% of metastatic vs. 16.20% of non-metastatic patients (p < 0.001), underscoring the importance of tumor grade in metastasis risk.

Machine Learning Models

Eight Models Head-to-Head: From Logistic Regression to Neural Networks

The authors implemented eight distinct machine learning algorithms using scikit-learn (version 0.23.2) with an 80/20 train-test split and 5-fold cross-validation. The models spanned a wide range of ML paradigms. Support vector machines (SVM) are supervised models that find optimal hyperplanes for classification. Logistic regression uses a logistic function to model binary outcomes and estimate class probabilities. Decision trees recursively partition data based on feature thresholds. Random forests aggregate predictions from multiple decision trees to reduce variance and overfitting.
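The splitting scheme can be illustrated with plain index bookkeeping. This is a sketch under assumptions: it presumes the 5-fold cross-validation was run inside the 80% training portion (the summary does not say), and the helper names `split_indices` and `k_fold` are invented here rather than taken from scikit-learn.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Toy example with 100 samples instead of the real cohort.
train_idx, test_idx = split_indices(100)
folds = list(k_fold(train_idx, k=5))
print(len(train_idx), len(test_idx), [len(val) for _, val in folds])
```

Each training sample is used for validation exactly once, while the held-out 20% is never seen during cross-validation.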

Boosting and distance-based methods: XGBoost (Extreme Gradient Boosting) builds trees sequentially, with each new tree correcting the errors of the previous ensemble. AdaBoost (Adaptive Boosting) is a meta-algorithm that iteratively reweights misclassified samples to focus subsequent learners on harder cases. K-nearest neighbors (KNN) is a non-parametric method that classifies based on the majority vote of the k closest training examples in feature space.
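The reweighting idea behind AdaBoost can be shown in a few lines. This sketch implements one round of the textbook discrete AdaBoost update; it is not necessarily what scikit-learn does internally, and the function name and toy weights are assumptions made for illustration. It presumes the weights sum to 1 and that the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete-AdaBoost round: return (alpha, normalized new weights)."""
    # Weighted error of the current weak learner.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Learner weight: larger when the weak learner is more accurate.
    alpha = 0.5 * math.log((1 - error) / error)
    # Up-weight misclassified samples, down-weight correct ones.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy example: five equally weighted samples, the third one misclassified.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the misclassified sample carries half of the total weight, which is what pushes the next learner toward the harder case.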

Neural network approach: The multilayer perceptron (MLP) is a feedforward artificial neural network with at least three layers (input, hidden, and output), using nonlinear activation functions and backpropagation for training. Unlike linear models, MLP can capture complex nonlinear relationships in the data, making it a powerful but potentially harder-to-interpret alternative.
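A forward pass through a tiny MLP shows where the nonlinearity enters. This is a minimal sketch with one hidden layer, a ReLU hidden activation, and a sigmoid output; the architecture, weights, and inputs are hypothetical and are not the configuration the authors used.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input -> hidden (ReLU) -> output (sigmoid probability)."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical two-feature input and hand-picked weights.
features = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [-1.0, 1.0]]
hidden_biases = [0.0, 0.5]
output_weights = [1.0, -1.0]
output_bias = 0.0

probability = mlp_forward(
    features, hidden_weights, hidden_biases, output_weights, output_bias
)
print(round(probability, 4))
```

Training by backpropagation would adjust these weights; the sketch only covers prediction.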

All models were evaluated using six metrics: accuracy, precision, sensitivity (recall), specificity, F1 score, and area under the ROC curve (AUROC). This comprehensive evaluation framework allowed the authors to assess not just overall correctness but also the balance between identifying true positives and avoiding false positives, which is critical in clinical prediction where both missed metastases and unnecessary interventions carry significant consequences.
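Five of the six metrics follow directly from the confusion matrix. The sketch below uses the standard binary definitions with hypothetical counts; the paper may have used averaged variants of some metrics, which this summary does not specify.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of true positives found
    specificity = tn / (tn + fp)   # share of true negatives found
    precision = tp / (tp + fp)     # share of positive calls that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts (NOT from the paper).
metrics = classification_metrics(tp=30, fp=10, tn=140, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUROC is the exception: it needs the model's continuous scores rather than a single set of thresholded counts.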

TL;DR: Eight ML models (SVM, logistic regression, decision tree, random forest, XGBoost, AdaBoost, KNN, and MLP) were trained with 80/20 splits and 5-fold cross-validation in scikit-learn, then evaluated across 6 metrics (accuracy, precision, sensitivity, specificity, F1, AUROC).

Results: 3-Year Survival

Survival Prediction: Logistic Regression Leads with an AUROC of 0.741

For 3-year survival prediction, logistic regression achieved the best overall performance with an AUROC of 0.741, accuracy of 0.684, sensitivity of 0.702, specificity of 0.670, precision of 0.686, and F1 score of 0.683. This means the model correctly identified about 70% of patients who died within 3 years while maintaining a 67% rate of correctly classifying survivors.

Comparative performance: AdaBoost came close with an AUROC of 0.736, followed by MLP at 0.735 and XGBoost at 0.729. Decision tree had a higher raw accuracy (0.690) than logistic regression (0.684) but a lower AUROC (0.710), suggesting it was less consistent across different classification thresholds. KNN performed worst overall, with an AUROC of only 0.609 and accuracy of 0.607. Random forest underperformed expectations with an AUROC of 0.690 and accuracy of 0.645.
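The gap between accuracy and AUROC is easier to see in code. In this sketch AUROC is computed as the fraction of (positive, negative) pairs that the scores rank correctly, while accuracy depends on a chosen cutoff. The labels and scores are toy values, not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def accuracy_at(labels, scores, threshold):
    """Accuracy when scores at or above the threshold are called positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print("AUROC:", auroc(labels, scores))
print("accuracy at 0.5:", accuracy_at(labels, scores, 0.5))
print("accuracy at 0.9:", accuracy_at(labels, scores, 0.9))
```

Accuracy changes with the cutoff while AUROC does not, which is why a model can post a higher accuracy at one threshold yet a lower AUROC overall.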

SVM and MLP specifics: SVM achieved an accuracy of 0.685 and sensitivity of 0.713 (the second-highest sensitivity) but had a lower AUROC of 0.684, indicating that while it caught more true positives, its overall discriminative ability was limited. MLP had high sensitivity (0.732) but lower specificity (0.654), meaning it was better at detecting patients who would die but more prone to false alarms.

The relatively modest AUROCs across all models (0.609-0.741) reflect the inherent difficulty of predicting 3-year survival from clinical and pathological variables alone, without incorporating molecular or genetic biomarker data. Despite these moderate numbers, logistic regression had the highest AUROC and balanced performance across the six metrics, making it the best overall model for this task, although its margin over AdaBoost (0.736) and MLP (0.735) was small.

TL;DR: For 3-year survival, logistic regression achieved the best AUROC (0.741), with accuracy 0.684, sensitivity 0.702, and specificity 0.670. AdaBoost (0.736) and MLP (0.735) were close runners-up, while KNN (0.609) performed worst. All models had moderate AUROCs, reflecting the complexity of survival prediction from clinical variables alone.

Results: Metastasis Prediction

Metastasis Prediction: Stronger Performance with AUROC up to 0.804

The metastasis prediction task yielded notably better results than survival prediction across nearly all models. Logistic regression again led with an AUROC of 0.804, accuracy of 0.800, sensitivity of 0.540, specificity of 0.830, precision of 0.769, and F1 score of 0.772. The higher specificity (0.830) indicates the model was good at correctly ruling out metastasis in non-metastatic patients, though the sensitivity of 0.540 means it missed nearly half of patients who actually had metastases. Under the standard binary definition, a precision of 0.769 and sensitivity of 0.540 would give an F1 score of about 0.63, so some of the reported metrics may be class-averaged rather than computed for the metastatic class alone.

Close competitors: MLP performed almost identically (AUROC 0.802, accuracy 0.800, sensitivity 0.542) and AdaBoost was close behind (AUROC 0.799, accuracy 0.797). XGBoost had the highest accuracy for metastasis (0.806) and the highest sensitivity among the top performers (0.548), but a lower AUROC of 0.747, suggesting less robust threshold-independent performance. Decision tree showed an AUROC of only 0.666 for metastasis despite an accuracy of 0.791, revealing the danger of relying on accuracy alone in imbalanced datasets (only 20.32% of patients had metastasis).
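The class-imbalance point can be quantified with a trivial baseline that predicts "no metastasis" for every patient. The sketch below uses the full-cohort counts reported earlier (12,394 patients, 2,519 with metastases); the metastasis analysis itself used 12,192 patients after cleaning, so the figure is approximate.

```python
# Full-cohort counts reported in the summary.
total_patients = 12_394
metastatic = 2_519
non_metastatic = total_patients - metastatic

# A baseline that labels every patient as non-metastatic is correct
# for every non-metastatic patient and wrong for every metastatic one.
baseline_accuracy = non_metastatic / total_patients
baseline_sensitivity = 0 / metastatic  # it never detects a metastasis

print(f"baseline accuracy:    {baseline_accuracy:.4f}")
print(f"baseline sensitivity: {baseline_sensitivity:.4f}")
```

A model with roughly 79% accuracy is therefore about level with this baseline, which is why AUROC and sensitivity matter more than raw accuracy here.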

SVM anomaly: SVM performed particularly poorly on metastasis prediction, with an AUROC of just 0.573, barely better than random chance (0.50). This is striking given its acceptable performance on survival prediction (AUROC 0.684), and suggests that SVM's linear kernel may struggle with the feature interactions relevant to metastasis. KNN also performed poorly (AUROC 0.682), consistent with its distance-based approach being less suited to the mixed categorical and continuous features in this dataset.

The overall improvement in metastasis prediction (best AUROC 0.804 vs. 0.741 for survival) suggests that the clinical and pathological features used in this study, particularly T stage and N stage, carry stronger predictive signal for metastatic disease than for long-term survival, where molecular and treatment response factors likely play a larger role.

TL;DR: For metastasis prediction, logistic regression achieved the best AUROC (0.804) with 80.0% accuracy and 83.0% specificity, though sensitivity was moderate at 54.0%. MLP (0.802) and AdaBoost (0.799) were close behind. SVM collapsed to an AUROC of 0.573. By AUROC, metastasis prediction outperformed survival prediction for most models, but not for SVM (0.573 vs. 0.684) or decision tree (0.666 vs. 0.710).

Clear Cell RCC Subgroup

Clear Cell Renal Cell Carcinoma: Subtype-Specific Model Validation

Because clear cell renal cell carcinoma (ccRCC) accounts for 59.20% of the cohort and has distinct biological characteristics, the authors performed a subgroup analysis using the best-performing logistic regression model. For 3-year survival in ccRCC patients, the model achieved an AUROC of 0.710, accuracy of 0.658, sensitivity of 0.674, specificity of 0.649, precision of 0.661, and F1 score of 0.653. These results were slightly lower than the full-cohort survival model (AUROC 0.741), which is expected since subgroup analysis reduces sample diversity and statistical power.

Metastasis in ccRCC: For metastasis prediction in ccRCC, the logistic regression model actually performed slightly better than the full cohort, achieving an AUROC of 0.811, accuracy of 0.826, sensitivity of 0.593, specificity of 0.851, precision of 0.802, and F1 score of 0.803. The improvement in metastasis prediction (0.811 vs. 0.804) may reflect the more homogeneous biology of ccRCC, which is characterized by loss of chromosome 3p and biallelic inactivation of the VHL gene through mechanisms like loss of heterozygosity (LOH), mutation, and methylation.

Clinical context: In the multivariate logistic regression, ccRCC histology had an odds ratio of 0.897 (95% CI: 0.825-0.975, p = 0.01) for the survival endpoint and 1.372 (95% CI: 1.235-1.525, p < 0.001) for metastasis compared to other histological types. One estimate lies below 1 and the other above 1, so how favorable each is depends on how the outcome was coded in the model. High-frequency mutations in epigenetic regulation genes (PBRM1, SETD2, BAP1) are found in over 50% of ccRCC cases. The authors noted that genomic risk prediction tools like ClearCode34 and the 16-gene assay have already been developed for localized ccRCC, and integrating these genetic biomarkers into their ML models could further improve prediction accuracy.
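The reported odds ratios, confidence intervals, and p-values can be cross-checked against each other. This sketch assumes the intervals are Wald intervals (coefficient plus or minus 1.96 standard errors on the log-odds scale), which the summary does not state; the function name is invented for illustration.

```python
import math


def wald_from_odds_ratio(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover coefficient, standard error, z, and two-sided p from an OR and CI."""
    beta = math.log(odds_ratio)  # logistic regression coefficient
    # The CI spans 2 * z_crit standard errors on the log scale.
    std_err = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, std_err, z, p_value


survival = wald_from_odds_ratio(0.897, 0.825, 0.975)
metastasis = wald_from_odds_ratio(1.372, 1.235, 1.525)

for name, (beta, std_err, z, p_value) in [("survival", survival), ("metastasis", metastasis)]:
    print(f"{name}: beta={beta:.3f}, se={std_err:.3f}, z={z:.2f}, p={p_value:.4g}")
```

Under that assumption the recovered p-values are in line with the reported p = 0.01 and p < 0.001.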

TL;DR: In ccRCC-only analysis, logistic regression achieved AUROC 0.710 for survival and 0.811 for metastasis (slightly better than full-cohort metastasis AUROC of 0.804). ccRCC had an odds ratio of 0.897 for survival and 1.372 for metastasis vs. other subtypes. ccRCC is characterized by VHL gene inactivation, and mutations in PBRM1, SETD2, and BAP1 are found in over 50% of cases.

Limitations & Future Directions

Key Limitations: Single Data Source, Missing Genomics, and No External Validation

Single data source: The study relied exclusively on the SEER database, which, while large and publicly available, represents only about 30% of the US population and may not generalize to other populations, particularly non-US cohorts. The authors acknowledge this limited generalizability. Furthermore, the SEER database does not capture many treatment details (systemic therapy specifics, radiation protocols) that could influence survival outcomes. Apart from the surgery variables in the survival model, these models therefore do not account for treatment received.

No external validation: All models were trained and tested on the same SEER dataset using an 80/20 split. No independent external validation cohort was used, which is a significant limitation for clinical deployment. The authors noted that external validation using hospital patient data is needed before these models can be considered reliable decision-support tools. Without external validation, it remains unclear whether the reported performance would hold in other populations, and overfitting to SEER-specific patterns cannot be ruled out.

Missing molecular and genomic data: The models used only clinical and pathological variables. Genetic biomarkers, which are increasingly important in kidney cancer prognostication, were entirely absent. Biomarker panels like ClearCode34 and the 16-gene assay have shown prognostic value in localized ccRCC and could meaningfully enhance model performance. The authors specifically call for incorporating genetic biomarkers in future iterations and evaluating them in independent populations.

Moderate performance ceiling: The best AUROC achieved was 0.804 for metastasis and 0.741 for survival. While useful, these numbers leave substantial room for improvement. The sensitivity for metastasis prediction (0.540) is particularly concerning for clinical use, as missing nearly half of metastatic patients would be unacceptable in a screening or triage context. Future work should explore ensemble strategies beyond the random forest, XGBoost, and AdaBoost models already tested, feature engineering from imaging or genomic data, and multi-institutional datasets. The authors envision these models eventually being integrated into hospital information systems to provide real-time decision support for kidney cancer care.

TL;DR: Key limitations include reliance on a single data source (SEER), no external validation, absence of molecular/genomic biomarkers, and moderate sensitivity (54.0% for metastasis). Future work should incorporate genetic data (ClearCode34, 16-gene assay), validate on independent hospital cohorts, and aim for integration into hospital information systems.

Machine Learning-Based Prognostic and Metastasis Models of Kidney Cancer
