Kidney Cancer Diagnosis and Surgery Selection by Machine Learning from CT Scans Combined with Clinical Metadata

Published in Cancers, 2023.

Plain-English Explanations
Pages 1-2
Why Kidney Cancer Needs Better AI-Assisted Diagnosis and Surgical Guidance

Kidney cancer is the 14th most common malignancy worldwide (9th among men), with over 430,000 new cases diagnosed globally in 2020. It is closely linked to chronic kidney disease (CKD), which affects roughly 800 million people and carries mortality rates between 7 and 12% depending on region. The disease is typically diagnosed through blood tests, urinalysis, imaging, and occasionally biopsy, but these methods are lengthy, complex, and often unreliable.

Once a kidney tumor is detected, clinicians face a critical decision: should the patient undergo partial nephrectomy (removing only the tumor and a margin of healthy tissue) or radical nephrectomy (removing the entire kidney)? Getting this wrong in either direction carries serious consequences. A partial procedure on a patient who truly needed full removal could leave behind cancer, while radical surgery on a less severe case could consign the patient to lifelong dialysis or the need for a future transplant without sufficient cause.

The authors point out that previous AI studies using contrast-enhanced CT images to classify kidney tumors have largely ignored the integration of patient clinical metadata, such as demographics, comorbidities, and tumor staging. This study proposes a combined approach that fuses CT image analysis with clinical data to both classify kidney cancer subtypes and guide surgical decisions for renal cell carcinoma (RCC) patients.

TL;DR: Kidney cancer affects 430,000+ people per year, and choosing between partial and radical nephrectomy is a high-stakes decision. Previous AI approaches used CT images alone. This study integrates CT scans with patient clinical metadata to improve both tumor subtype classification and surgical procedure selection for RCC.
Pages 2-4
The KiTS21 Dataset: 300 Patients with CT Scans and Rich Metadata

The study uses the publicly available KiTS21 (Kidney and Kidney Tumor Segmentation Challenge 2021) dataset, which contains contrast-enhanced CT scans from n = 300 subjects who underwent either partial or radical nephrectomy between 2010 and 2018. Subjects came from M Health Fairview and Cleveland Clinic medical centers. Each case includes 3D-CT volumes with expert-annotated ground-truth segmentation masks for kidney, tumor, and cyst regions.

The dataset is heavily dominated by clear cell RCC (ccRCC) at 203 cases (67.7%), followed by papillary RCC (pRCC) at 28 cases (9.3%), chromophobe RCC (chRCC) at 27 cases (9%), and oncocytoma (ONC) at 16 cases (5.3%). Regarding malignancy, 275 cases (91.7%) were malignant and only 25 (8.3%) were benign. For surgical procedures, 188 patients (62.6%) underwent partial nephrectomy while 112 (37.3%) received radical nephrectomy. The majority of surgeries were robotic (57.3%), followed by open (26.3%) and laparoscopic (16.3%).

Patient demographics show a median age of 60 years, a median BMI of 29.82 kg/m², and a 60/40 male-to-female ratio. The median tumor diameter was 4.2 cm with a median volume of 34.93 cm³. For pathological staging, 60% of cases were T-stage 1, while 23.3% had reached T-stage 3, indicating that the dataset contains a realistic mix of early and advanced kidney cancers.

TL;DR: The KiTS21 dataset provides 300 patients with contrast-enhanced CT scans and clinical metadata from two major medical centers. The data is heavily imbalanced toward ccRCC (67.7%) and malignant cases (91.7%). Of the surgical procedures, 62.6% were partial nephrectomy and 37.3% radical nephrectomy, predominantly performed robotically.
Pages 5-7
From Raw CT Volumes to Tumor ROIs: Scope Reduction and YOLO Detection

The pipeline begins by extracting 64,603 2D slices from the 300 3D-CT volumes. A critical preprocessing step limits the Hounsfield Unit (HU) range to (-200, 500), which focuses the image contrast on kidney tissue by clipping extreme radiodensity values. Approximately 64.5% of all slices (41,675) contain no kidney at all because some scans capture the entire body from head to toe, creating a massive amount of irrelevant data.
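The HU windowing step can be sketched in a few lines (a minimal pure-Python illustration; the function name and sample values are hypothetical, not from the paper's code):

```python
def clip_hu(slice_values, lo=-200, hi=500):
    """Clip Hounsfield Unit values to a window that keeps soft-tissue
    contrast around the kidneys and discards extreme radiodensities
    such as air (~ -1000 HU) or dense bone (> +700 HU)."""
    return [min(max(v, lo), hi) for v in slice_values]

# Air and bone values collapse to the window edges; tissue values pass through.
print(clip_hu([-1000, -50, 30, 1200]))  # [-200, -50, 30, 500]
```

In practice this clipping is applied to the whole 2D slice array before normalization, so the usable grayscale range is spent on tissue contrast rather than on irrelevant extremes.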

To handle this, the authors developed a scope reduction step using a novel architecture called DenseAUXNet201. This is a modified version of DenseNet201 that adds auxiliary loss outputs after each of the three intermediate dense blocks and concatenates intermediate features with the final layer output before passing through an MLP classifier. The auxiliary losses help optimize feature flow through the network. DenseAUXNet201 outperformed ResNet152, InceptionV3, and MobileNetV2, achieving 98.02% accuracy in separating kidney from non-kidney slices, with only 1,277 misclassified cases out of 64,603 total slices.

After scope reduction, the authors used YOLOv7 for region of interest (ROI) extraction to locate kidney, tumor, and cyst bounding boxes within the remaining slices. They tested six YOLO variants across two generations (YOLOv5s/m/l/x and YOLOv7/v7x). While YOLOv5 variants achieved better precision, YOLOv7 delivered superior mean average precision (mAP) at IoU thresholds of 0.5 and 0.5-0.95, indicating more efficient detection of small and challenging tumor regions. For kidney ROIs, YOLOv7 achieved mAP of 0.988 at IoU 0.5; for tumor ROIs, mAP was 0.756 at IoU 0.5.
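The mAP figures above are computed at intersection-over-union (IoU) thresholds between predicted and ground-truth boxes. A minimal sketch of box IoU (function name and coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Partially overlapping boxes: 25 px overlap over 175 px union ≈ 0.14,
# which would not count as a detection at the 0.5 threshold.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

At "IoU 0.5-0.95" the mAP is averaged over thresholds from 0.5 to 0.95 in steps of 0.05, which is why those numbers are always lower than the single-threshold mAP at 0.5.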

TL;DR: A custom DenseAUXNet201 model filters out 64.5% of irrelevant non-kidney CT slices at 98.02% accuracy. YOLOv7 then extracts tumor ROIs with mAP of 0.756 (tumors) and 0.988 (kidneys) at IoU 0.5, outperforming YOLOv5 variants on challenging small-region detection.
Pages 7-9
Tumor Subtype Classification: Deep CNN Models on Extracted ROIs

With tumor ROIs extracted, the next step is classifying each case into one of four subtypes: ccRCC, pRCC, chRCC, or ONC. The five angiomyolipoma (AML) cases were excluded due to insufficient samples for deep learning. Three ccRCC cases with extremely small or distorted tumor ROIs were also removed, leaving n = 272 cases for the classification task. The dataset was split into five subject-wise folds with 60 test subjects per fold (20%).

The same architectures used for scope reduction were tested here: ResNet152, DenseNet201, InceptionV3, MobileNetV2, and DenseAUXNet201. To address the severe class imbalance (ccRCC dominates at 67.7%), training images in minority classes were augmented using random rotation (-90 to 90 degrees) and random vertical/horizontal flips. Image-level predictions were converted to subject-level results through majority voting across all slices for each patient.
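The slice-to-subject aggregation described above is a plain majority vote; a minimal sketch (function name and labels are illustrative):

```python
from collections import Counter

def subject_prediction(slice_predictions):
    """Aggregate per-slice class predictions into a single subject-level
    label by majority vote across all of a patient's tumor-ROI slices."""
    return Counter(slice_predictions).most_common(1)[0][0]

# A patient whose slices mostly vote ccRCC is labeled ccRCC overall.
print(subject_prediction(["ccRCC", "pRCC", "ccRCC", "ccRCC"]))  # ccRCC
```

Majority voting makes the subject-level result robust to a few misclassified slices, which matters because a single 3D volume contributes many 2D slices of varying quality.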

DenseAUXNet201 again outperformed all other architectures. The custom auxiliary losses and feature concatenation from intermediate layers provided a significant performance boost. From the penultimate layer of DenseAUXNet201, 4,480 image features were extracted per image and compressed to 20 features via principal component analysis (PCA) for later combination with clinical data.
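The 4,480-to-20 compression is standard PCA; a sketch via SVD on toy-sized data (dimensions here are deliberately small stand-ins, and this is not the authors' code):

```python
import numpy as np

def pca_compress(features, k=20):
    """Project feature vectors onto their top-k principal components.
    Rows of Vt from the SVD of the centered data are the principal directions."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
x = rng.normal(size=(272, 128))   # toy stand-in for 272 subjects x 4,480 features
z = pca_compress(x, k=20)
print(z.shape)  # (272, 20)
```

Compressing the image features to 20 dimensions keeps them from swamping the 28 clinical features in the later combined ranking step.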

TL;DR: Four kidney cancer subtypes (ccRCC, pRCC, chRCC, ONC) were classified from extracted tumor ROIs across 272 patients using five CNN architectures. DenseAUXNet201 was the top performer. Its 4,480 latent features were compressed to 20 via PCA for fusion with clinical metadata.
Pages 9-11
Fusing CT Image Features with Clinical Metadata for Better Classification

The clinical metadata from KiTS21 includes 29 raw features divided into preoperative, intraoperative, and postoperative categories. Critically, only preoperative features were used since only presurgical information would be available to guide real-world decision-making. These include tumor size (radiographic and pathologic), demographics (gender, age, BMI), lifestyle factors (alcohol, smoking, tobacco history), and the presence of 19 common comorbidities such as myocardial infarction and congestive heart failure.

The authors converted TNM staging (tumor, node, metastasis) into Fuhrman nuclear grade, the standard four-stage severity scale for kidney cancer. For example, a tumor at T-stage 4 with no metastasis is classified as Fuhrman stage IV, while any case with metastasis (M = 1) is automatically stage IV regardless of T and N values. This yielded 171 stage I, 17 stage II, 56 stage III, and 15 stage IV patients among the 259 malignant cases.
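The conversion can be sketched as a small rule function. Only the two rules stated in the text are certain (any M = 1 case is stage IV, and T-stage 4 without metastasis is stage IV); the remaining branches below are an illustrative assumption, not the paper's exact mapping:

```python
def stage_from_tnm(t, n, m):
    """Sketch of the TNM -> four-stage conversion described in the paper.
    M=1 -> IV and T4 -> IV come from the text; the rest is assumed."""
    if m == 1:
        return "IV"   # any metastasis is stage IV regardless of T and N
    if t == 4:
        return "IV"   # locally most-advanced tumor, no metastasis
    if n == 1:
        return "III"  # assumed: nodal spread without distant metastasis
    return {1: "I", 2: "II", 3: "III"}[t]

print(stage_from_tnm(t=4, n=0, m=0))  # IV
print(stage_from_tnm(t=1, n=0, m=1))  # IV
```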

Feature engineering involved removing highly correlated features (threshold of 0.85 cross-correlation), which eliminated "pathologic size" due to its redundancy with "radiographic size." The remaining 48 features (20 PCA-compressed image features plus 28 clinical features) were ranked using three techniques: XGBoost, Random Forest, and Extra Trees. Only the top 20 features were retained for optimal performance. Thirteen classical ML algorithms were then tested, including logistic regression, SVM, kNN, Random Forest, XGBoost, LightGBM, AdaBoost, and MLP.
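The correlation-based pruning step can be sketched as follows (greedy drop of any feature whose absolute correlation with an already-kept feature exceeds 0.85; feature names and data are synthetic illustrations):

```python
import numpy as np

def drop_correlated(features, names, threshold=0.85):
    """Keep each feature only if its absolute cross-correlation with every
    previously kept feature is at or below the threshold."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for i in range(len(names)):
        if all(corr[i, j] <= threshold for j in keep):
            keep.append(i)
    return [names[i] for i in keep]

rng = np.random.default_rng(1)
radiographic = rng.normal(size=200)
pathologic = radiographic + rng.normal(scale=0.05, size=200)  # near-duplicate
bmi = rng.normal(size=200)
x = np.column_stack([radiographic, pathologic, bmi])

# The near-duplicate "pathologic_size" is dropped, mirroring the paper's result.
print(drop_correlated(x, ["radiographic_size", "pathologic_size", "bmi"]))
```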

TL;DR: Only preoperative clinical features were used to reflect real-world conditions. TNM staging was converted to Fuhrman grade. After removing correlated features and applying PCA to image features, the top 20 of 48 combined features were selected via XGBoost, Random Forest, and Extra Trees ranking, then fed into 13 classical ML classifiers.
Pages 12-15
Pipeline Performance: Filtering Slices and Detecting Tumor Regions

For scope reduction, DenseAUXNet201 achieved the best performance across all metrics: 98.02% accuracy, 98.03% precision, 98.02% recall, 97.13% specificity, and 98.03% F1-score. Out of 64,603 total 2D-CT slices, only 1,277 were misclassified (909 kidney slices mistakenly labeled as non-kidney, and 368 non-kidney slices mistakenly labeled as kidney). ScoreCAM heatmap visualization confirmed that the model correctly focused on kidney regions in slices containing kidneys and on alternative tissue features in non-kidney slices.
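The reported error counts and accuracy are mutually consistent, which is easy to verify from the numbers above:

```python
# Sanity-check the scope-reduction figures: 909 + 368 errors out of 64,603 slices.
total = 64_603
kidney_as_nonkidney = 909
nonkidney_as_kidney = 368

errors = kidney_as_nonkidney + nonkidney_as_kidney
accuracy = (total - errors) / total
print(errors, round(accuracy * 100, 2))  # 1277 98.02
```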

For ROI extraction using YOLOv7, kidney detection was highly reliable with mAP of 0.988 at IoU 0.5 and 0.918 at IoU 0.5-0.95. Tumor detection was more challenging but still strong at mAP 0.756 at IoU 0.5 and 0.525 at IoU 0.5-0.95. Cyst detection performed poorly (mAP 0.278 at IoU 0.5) due to the tiny size of cyst structures and limited training examples, but cyst detection was not a primary goal of the study.

The qualitative evaluation showed that YOLOv7 generated bounding boxes that closely matched the ground-truth annotations for both kidney and tumor regions. Performance curves (F1, precision-confidence, precision-recall, and recall-confidence) demonstrated robust detection across varying confidence thresholds, with kidney detection consistently outperforming tumor and cyst detection as expected given the relative sizes of these structures.

TL;DR: DenseAUXNet201 filtered non-kidney slices at 98.02% accuracy with only 1,277 errors out of 64,603 slices. YOLOv7 detected kidney ROIs at mAP 0.988 and tumor ROIs at mAP 0.756 (IoU 0.5). ScoreCAM heatmaps confirmed the models learned to focus on the correct anatomical regions.
Pages 16-18
Combining CT Features with Clinical Data Reaches 85.66% Classification Accuracy

For image-only tumor subtype classification, DenseAUXNet201 outperformed all other deep learning models. When clinical metadata was integrated with the extracted image features, XGBoost emerged as the best-performing classical ML classifier, achieving 85.66% accuracy, 84.18% precision, 85.66% recall, and 84.92% F1-score. Random Forest and CatBoost also performed well, but XGBoost was selected for its superior F1-score.

Feature ranking by Random Forest revealed that the most influential feature for tumor classification was the tumor class predicted from images, followed by the malignancy marker from clinical data. Tumor size, age, BMI, gender, and habitual features such as smoking and drinking history were also ranked as important predictors. This confirms that both imaging and clinical information contribute meaningfully to classification.

The combined approach significantly improved ONC classification (which is benign) and pRCC classification compared to image-only analysis. However, chRCC identification worsened when clinical features were added, as the model confused it with ccRCC. This degradation was attributed to the severe class imbalance in the dataset that could not be fully mitigated through augmentation alone. The ccRCC class, being vastly overrepresented, tended to absorb borderline chRCC predictions.

TL;DR: The combined image + clinical metadata approach achieved 85.66% accuracy and 84.92% F1-score using XGBoost. Adding clinical data improved ONC and pRCC classification but worsened chRCC identification due to class imbalance. Tumor class from images and the malignancy marker were the top-ranked features.
Pages 18-20
Predicting Partial vs. Radical Nephrectomy at 90.63% Accuracy

For the binary classification of partial versus radical nephrectomy in malignant RCC cases, the study used 28 clinical features (after removing redundant pathologic size). Feature ranking by Random Forest identified radiographic tumor size/volume and Fuhrman cancer stage as the two most important predictors, followed by patient BMI, age, smoking habits, and history of chronic kidney diseases. This aligns with clinical practice, where surgeons choose radical approaches for large tumors at advanced stages and for elderly patients with high BMIs.

Among 14 classical ML classifiers tested, logistic regression was selected as the best model due to its simplicity combined with high performance. It achieved 90.63% accuracy, 90.83% precision, 90.61% recall, and 90.50% F1-score. SVM, LDA, ridge regression, and MLP also performed comparably. The confusion matrix showed that only 7 partial nephrectomy cases were misclassified as radical, while 17 radical cases were misclassified as partial, across all 256 malignant patients.
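The confusion-matrix counts line up with the reported accuracy, as a quick check shows:

```python
# Reproduce the surgical-prediction accuracy from the confusion-matrix counts.
total = 256
partial_as_radical = 7
radical_as_partial = 17

correct = total - (partial_as_radical + radical_as_partial)
accuracy = correct / total
print(correct, accuracy)  # 232 0.90625  (reported as 90.63%)
```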

Importantly, the model's performance was consistent across all three RCC subtypes (ccRCC, chRCC, and pRCC), confirming that tumor subtype did not bias the surgical procedure classifier. The XGBoost feature ranking additionally highlighted the presence of solid metastatic tumors and chronic obstructive pulmonary disease (COPD) as relevant factors, both of which are clinically linked to kidney pathology and influence surgical decision-making in practice.

TL;DR: Logistic regression predicted partial vs. radical nephrectomy at 90.63% accuracy and 90.50% F1-score using only preoperative clinical data. Tumor volume and Fuhrman stage were the top predictors. Only 24 of 256 malignant cases were misclassified (7 partial-as-radical, 17 radical-as-partial), with consistent performance across all RCC subtypes.
Pages 20-21
Dataset Constraints and the Path Toward Clinical Application

The primary limitation is the limited generalizability of the KiTS21 dataset. With only 300 patients drawn from two medical centers, the dataset lacks geographic and demographic diversity. The severe class imbalance (ccRCC at 67.7% versus chRCC at 9% and ONC at 5.3%) limited the model's ability to accurately classify minority subtypes, particularly chromophobe RCC. The five AML cases were too few for any deep learning application and had to be excluded entirely.

The authors note that their approach could be significantly improved by collecting a large, objective-driven dataset specifically designed for this task. Such a dataset should be more balanced across tumor subtypes and cancer stages, ideally containing thousands of diverse cases. It should also include additional clinical biomarkers, detailed patient medical histories, surgical-table parameters, and a broader range of comorbidities to make the AI tool more reliable and robust for a wider population.

Despite these constraints, the study demonstrates that integrating clinical metadata with CT image features measurably improves kidney cancer classification over imaging alone. The surgical procedure prediction component, achieving over 90% accuracy with a simple logistic regression model using only preoperative data, shows particular promise for real-world clinical deployment. The feature ranking results validate the model by matching known clinical decision-making criteria, where tumor volume, cancer stage, patient age, BMI, and smoking history all align with factors that surgeons already consider when choosing between partial and radical nephrectomy.

TL;DR: The KiTS dataset's 300 patients and heavy class imbalance are the main limitations. A larger, balanced, multi-center dataset with richer clinical features would improve performance. Despite this, the combined approach outperforms image-only classification, and the 90.63% surgical prediction accuracy using only preoperative data represents a promising step toward clinical AI-assisted nephrectomy planning.
Citation: Mahmud S, Abbas TO, Mushtak A, Prithula J, Chowdhury MEH. Kidney Cancer Diagnosis and Surgery Selection by Machine Learning from CT Scans Combined with Clinical Metadata. Cancers, 2023. Open access (CC BY). PMC: PMC10296307. DOI: 10.3390/cancers15123189.