Using deep learning to differentiate among histology renal tumor types in computed tomography scans

Journal of Medical Imaging, 2024

Plain-English Explanations
Page 1
Why Classifying Renal Tumor Subtypes From CT Scans Matters

Renal cell carcinoma (RCC) is one of the most common cancers in Taiwan, and its incidence continues to rise. Computed tomography (CT) is the standard tool for diagnosing and staging renal tumors, with prior studies reporting staging accuracy around 91%. However, CT interpretation depends heavily on radiologist experience, and inter-rater and intra-rater variability can significantly affect diagnostic accuracy. Studies have found that 6.4% to 40.4% of renal tumors classified as malignant based on preoperative CT turned out to be benign after surgical resection, leading to unnecessary procedures.

Renal tumor biopsy (RTB) provides a tissue-based alternative, but it carries risks including tumor cell seeding, bleeding, fistula formation, pseudoaneurysm, infection, and pneumothorax. Biopsies are also nondiagnostic in roughly 11-14% of cases, which limits their routine clinical use. Meanwhile, treatment options for kidney cancer have expanded beyond surgery to include active surveillance, targeted therapy, and immunotherapy. Being able to classify tumor subtypes noninvasively from imaging alone would be a major clinical advantage.

Most existing deep learning models for renal tumors are limited to binary classification, such as benign vs. malignant, or clear cell RCC vs. non-clear cell RCC. Many of these studies also used small cohorts of fewer than 200 patients. This study aimed to go beyond binary classification by training convolutional neural network (CNN) models to distinguish among the five most common renal tumor subtypes: angiomyolipoma (AML), oncocytoma, clear cell RCC (ccRCC), chromophobe RCC (chRCC), and papillary RCC (pRCC).

TL;DR: CT is the standard for renal tumor diagnosis, but 6.4-40.4% of tumors deemed malignant preoperatively are actually benign. Biopsy is nondiagnostic in 11-14% of cases and carries procedural risks. This study trained deep learning models for 5-way subtype classification rather than simple binary classification.
Pages 2-3
Patient Population and Inclusion Criteria

This retrospective study was approved by the Institutional Review Board of Chang Gung Memorial Hospital, Linkou Branch, Taiwan (IRB No. 201901321B0). Between January 2008 and September 2018, the researchers enrolled 691 patients who had been diagnosed with renal tumors and undergone surgical resection. Patients were excluded if they lacked preoperative CT scans or had only non-enhanced CT. Additional exclusion criteria included renal cysts, polycystic kidney disease, maintenance hemodialysis, tumors smaller than 1 cm, and severe imaging artifacts.

The final cohort comprised 554 patients: 328 males (59.2%) and 226 females (40.8%) with a median age of 56 years (IQR: 47-66 years). The subtype distribution was as follows: ccRCC (n = 246, 44.4%), chRCC (n = 124, 22.4%), pRCC (n = 83, 15%), AML (n = 67, 12%), and oncocytoma (n = 34, 6.1%). The median largest tumor diameter was 53.5 mm (IQR: 36-74 mm). This distribution reflects the known epidemiology of renal tumors, where ccRCC is the most common malignant subtype and oncocytoma is comparatively rare (3-7% of solid renal tumors).

Many patients were referred from external hospitals, so contrast-enhanced CT images came from multiple institutions. Standard scanning parameters included 5 mm slice thickness, contrast agent injection rate of 1-2 mL/sec, contrast dose of 1-2 mL/kg, and whole-abdomen coverage with a non-contrast phase followed by an enhanced phase taken 80-120 seconds after injection. For patients with bilateral or multiple tumors, pathology reports were correlated with CT images, and cases with disagreement between pathology and imaging were excluded.

TL;DR: From 691 initial patients, 554 met inclusion criteria: ccRCC (n = 246), chRCC (n = 124), pRCC (n = 83), AML (n = 67), and oncocytoma (n = 34). Median age was 56, median tumor size was 53.5 mm, and CT images came from multiple institutions.
Pages 2-3
Image Preprocessing, Augmentation, and Data Splits

Renal tumor outlines on axial nephrographic-phase CT images were manually segmented by two urologists, who defined regions of interest (ROIs). The CT images were then converted to PNG format using the default abdominal imaging window of Chang Gung Medical Center, mapping Hounsfield Units (HUs) in the range of -115 to 227 onto 8-bit PNG pixel values (0 to 255). This range was chosen to clearly image abdominal organs, though it meant the model could not learn features from tissue densities outside this window. After conversion, the renal tumor was cropped using a minimal bounding rectangle.
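The windowing step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the clip-then-scale behavior are assumptions, but the HU range (-115 to 227) and 8-bit output range (0 to 255) come from the paper.

```python
import numpy as np

def hu_to_uint8(hu, lo=-115.0, hi=227.0):
    """Map Hounsfield Units in [lo, hi] linearly onto 8-bit pixel
    values [0, 255]. Densities outside the abdominal window are
    clipped, so the model cannot see contrast beyond this range."""
    hu = np.asarray(hu, dtype=np.float32)
    scaled = (np.clip(hu, lo, hi) - lo) / (hi - lo) * 255.0
    return scaled.round().astype(np.uint8)
```

Note how a very fatty voxel (e.g. -200 HU, common in AML) and a dense calcification (e.g. 500 HU) both saturate at the window edges, which is exactly the limitation the authors acknowledge.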

The 554 patients were randomly split into a training set (90%, n = 501) and a testing set (10%, n = 53). The test set contained the following distribution: AML (n = 6), oncocytoma (n = 3), ccRCC (n = 24), chRCC (n = 12), and pRCC (n = 8). The training set was further split 8:2 into training and validation subsets for 5-fold cross-validation.

Data augmentation: To address class imbalance, images from underrepresented groups (AML and oncocytoma) were augmented to approximately 50% of the count in the largest group (ccRCC). Augmentation techniques included horizontal flipping, vertical flipping, and rotation. After augmentation, the training dataset expanded to 4,238 CT images: AML (966 images), oncocytoma (881 images), ccRCC (1,811 images), chRCC (1,087 images), and pRCC (642 images). Importantly, only the training data was augmented; the original test data remained unmodified for unbiased performance evaluation.
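A minimal sketch of the augmentation operations the paper names (horizontal flip, vertical flip, rotation). The summary does not specify rotation angles, so 90-degree rotations are used here purely for illustration:

```python
import numpy as np

def augment(img):
    """Yield simple variants of a 2D image array: horizontal flip,
    vertical flip, and 90/180/270-degree rotations. Applied only to
    training images from underrepresented classes."""
    yield np.fliplr(img)   # horizontal flip
    yield np.flipud(img)   # vertical flip
    for k in (1, 2, 3):
        yield np.rot90(img, k)  # 90, 180, 270 degrees
```

Each source image thus yields several geometrically transformed copies, which is how the minority classes (AML, oncocytoma) were inflated toward roughly half the size of the ccRCC group.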

TL;DR: Two urologists manually segmented tumors. HU range of -115 to 227 was mapped to 8-bit PNG values. A 90/10 train-test split was used. Data augmentation (flipping, rotation) expanded training images from the original count to 4,238 total, addressing class imbalance between subtypes.
Pages 3-4
Inception V3, ResNet-50, and Transfer Learning Strategy

The researchers trained two well-established CNN architectures: Inception V3 (311 layers) and ResNet-50 (175 layers). Both models were developed using Python 3.8.5 and TensorFlow 2.5.0, with initial weights pretrained on ImageNet. The study systematically explored how many layers to "unfreeze" for fine-tuning, rather than using a fixed approach. For Inception V3, the team tested 0 (pure transfer learning), 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, and 280 trainable layers. For ResNet-50, they tested 0, 25, 50, 75, 100, 125, and 150 trainable layers.
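The sweep over fine-tuning depths amounts to toggling each layer's `trainable` flag before compiling, as one would in Keras (`model.layers[i].trainable = ...`). The `Layer` class below is a minimal stand-in so the sketch runs without TensorFlow installed; the freezing logic is the part that mirrors the study's approach:

```python
class Layer:
    """Minimal stand-in for a Keras layer's `trainable` attribute."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def set_trainable_depth(layers, n_trainable):
    """Freeze all but the last `n_trainable` layers, mirroring the
    paper's sweep (0, 20, 40, ... for Inception V3; 0, 25, 50, ...
    for ResNet-50). n_trainable=0 is pure transfer learning: only
    the new classification head would learn."""
    for layer in layers:
        layer.trainable = False
    for layer in layers[len(layers) - n_trainable:]:
        layer.trainable = True
    return sum(l.trainable for l in layers)
```

For example, the best Inception V3 configuration corresponds to `set_trainable_depth(layers_311, 220)`, leaving the earliest 91 layers frozen at their ImageNet weights.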

Both models were trained with a learning rate of 10^-5 for 30 epochs. This systematic layer-by-layer exploration of fine-tuning depth is notable because it demonstrated that pure transfer learning (zero trainable layers) performed poorly on this medical imaging task. Inception V3 with zero trainable layers achieved only 0.689 average accuracy and 0.727 weighted precision. ResNet-50 with zero trainable layers achieved only 0.717 average accuracy and 0.760 weighted precision. Allowing additional layers to be retrained significantly improved performance.

Patient-level aggregation: Since each patient had multiple 2D CT slices through the tumor, the study employed a pixel-weighted voting scheme to produce a final per-patient classification. Each image's predicted class probabilities were multiplied by the number of pixels in the cropped ROI for that slice. The products were summed across all slices, and the class with the highest cumulative score became the patient-level prediction. This approach gave greater weight to slices with larger tumor cross-sections.
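The pixel-weighted voting scheme described above can be written directly from its description (the function name is illustrative):

```python
import numpy as np

def patient_prediction(slice_probs, roi_pixel_counts):
    """Aggregate per-slice class probabilities into one patient-level
    label: weight each slice's probability vector by its cropped ROI
    pixel count, sum across slices, and take the argmax class.
    Slices with larger tumor cross-sections therefore count more."""
    probs = np.asarray(slice_probs, dtype=float)         # (n_slices, n_classes)
    weights = np.asarray(roi_pixel_counts, dtype=float)  # (n_slices,)
    scores = (probs * weights[:, None]).sum(axis=0)
    return int(scores.argmax())
```

A slice confidently predicting one class can still be outvoted if another slice with a much larger tumor cross-section favors a different class.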

TL;DR: Inception V3 (311 layers) and ResNet-50 (175 layers) were fine-tuned from ImageNet weights. Pure transfer learning yielded poor results (accuracy of 0.689 for Inception V3, 0.717 for ResNet-50). A pixel-weighted voting scheme aggregated slice-level predictions into patient-level classifications.
Pages 4-5
Inception V3 Performance Across Fine-Tuning Depths

The Inception V3 model achieved its best performance when 220 of its 311 layers were set as trainable. At this configuration, the model reached a peak accuracy of 0.830, peak weighted precision (WP) of 0.885, peak macro F1-score of 0.786, and peak weighted F1-score of 0.833. These represent single-fold peak values; the 5-fold cross-validation averages were slightly lower but consistent.

5-fold cross-validation averages at 220 trainable layers: accuracy of 0.804 +/- 0.019, weighted precision of 0.847 +/- 0.021, macro F1-score of 0.757 +/- 0.028, and weighted F1-score of 0.813 +/- 0.018. The relatively tight standard deviations across folds suggest stable model performance rather than overfitting to a particular data split.

The gap between the macro F1-score (0.757) and the weighted F1-score (0.813) is worth noting. Macro F1 gives equal weight to every class regardless of sample size, while weighted F1 accounts for class frequency. The lower macro F1 indicates that the model performed less well on the minority classes (oncocytoma and AML) compared to the more common subtypes, reflecting the persistent challenge of class imbalance even after augmentation.
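A toy example makes the macro/weighted gap concrete. With hypothetical data in which a common class is classified well but a rare class is always missed, weighted F1 stays high while macro F1 collapses:

```python
def f1_scores(y_true, y_pred, classes):
    """Per-class F1 combined two ways: macro (unweighted mean over
    classes) and weighted (mean weighted by each class's support).
    Macro F1 punishes poor minority-class performance equally."""
    per_class, support = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        support.append(sum(t == c for t in y_true))
    macro = sum(per_class) / len(classes)
    weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)
    return macro, weighted
```

If 8 of 10 patients are ccRCC and all are predicted ccRCC, the two oncocytoma patients are always wrong: weighted F1 is about 0.71 but macro F1 is only about 0.44, the same direction of gap seen in the paper's results.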

TL;DR: Inception V3 at 220 trainable layers: peak accuracy 0.830, 5-fold average accuracy 0.804 +/- 0.019, weighted precision 0.847 +/- 0.021, macro F1 0.757 +/- 0.028, weighted F1 0.813 +/- 0.018. The gap between macro and weighted F1 reflects weaker performance on rare subtypes.
Pages 5-6
ResNet-50 Performance and Comparison With Prior Work

ResNet-50 achieved its highest accuracy of 0.849 using just 50 trainable layers out of 175, outperforming Inception V3's peak accuracy of 0.830. The 5-fold cross-validation average accuracy was 0.811 +/- 0.027 (50 trainable layers). The highest weighted precision (0.887) came at 150 trainable layers, with an average of 0.865 +/- 0.015. The highest macro F1-score (0.813) was achieved using 75 trainable layers, with an average of 0.753 +/- 0.040. The highest weighted F1-score (0.852) was achieved with 50 trainable layers, averaging 0.838 +/- 0.027.

Interestingly, ResNet-50 needed far fewer trainable layers to reach peak accuracy compared to Inception V3 (50 vs. 220), suggesting that its residual connections allowed more efficient adaptation to this medical imaging domain. The overall finding is that ResNet-50 was slightly more accurate (0.849 vs. 0.830 peak, 0.811 vs. 0.804 average) and also achieved higher weighted F1-scores (0.838 vs. 0.813).

Comparison with prior studies: Most earlier deep learning work on renal tumors was limited to binary classification. Lee et al. achieved 76.6% accuracy distinguishing AML from ccRCC. Baghdadi et al. reached 95% accuracy differentiating oncocytoma from chRCC. Zhou et al. achieved 97% accuracy for benign vs. malignant classification. While these binary results are higher in absolute terms, they solve a much simpler clinical problem. In multi-class renal subtype discrimination, Uhlig et al. used radiomic features with extreme gradient boosting (XGBoost) and achieved an AUC of only 0.72 across the same five subtypes. The current study's deep learning approach substantially outperformed this radiomic/machine learning baseline.

TL;DR: ResNet-50 achieved 0.849 peak accuracy (50 trainable layers) and 0.811 +/- 0.027 average accuracy, slightly outperforming Inception V3. For the same 5-class task, prior radiomic methods achieved only AUC 0.72. Binary classification studies achieved higher absolute numbers but addressed much simpler clinical questions.
Pages 6-7
Single-Center Design, HU Range, and Class Imbalance

Single-center cohort: All patients came from a single tertiary center (Chang Gung Memorial Hospital), even though some were referrals from other hospitals. This limits the generalizability of results to broader populations, different scanner types, and varying imaging protocols across institutions. External validation on independent multi-center datasets was not performed.

Hounsfield Unit windowing: The study used an HU range of -115 to 227 for image preprocessing, which is the standard abdominal window at their institution. However, this means the models could not learn features from tissue densities outside this range. Some renal tumor characteristics, such as fat content in AML or calcifications, may produce HU values outside this window, potentially limiting classification performance for certain subtypes.

Persistent class imbalance: Despite augmentation, oncocytoma (n = 34 patients, only 3 in the test set) and AML (n = 67, only 6 in the test set) remained underrepresented. The lower macro F1-scores compared to weighted F1-scores across both models confirm that the minority classes were harder to classify accurately. With only 3 oncocytoma patients in the test set, per-class performance estimates for this subtype have very wide confidence intervals.

Manual segmentation: Tumor ROIs were drawn manually by two urologists, which is not scalable for clinical deployment. Automatic segmentation would be needed for real-world use. Additionally, the study included only five renal tumor subtypes, and rarer histologic variants were not represented.

TL;DR: Key limitations include single-center design (no external validation), a fixed HU range of -115 to 227 that may miss some tumor features, persistent class imbalance (only 34 oncocytoma patients total, 3 in the test set), manual segmentation that is not clinically scalable, and coverage of only five subtypes.
Page 7
Paths Toward Clinical Applicability

Multi-center validation: The most critical next step is validating these models on datasets from multiple medical centers with different CT scanners, protocols, and patient demographics. Without this, the clinical utility of an 80-85% accuracy 5-class model remains theoretical. Multi-site data would also increase sample sizes for the underrepresented subtypes, particularly oncocytoma.

Automated segmentation: Replacing manual tumor delineation with an automated segmentation model (such as a U-Net or similar architecture) would be essential for practical deployment. The current workflow requires urologists to manually outline each tumor, which is time-consuming and introduces its own variability. An end-to-end pipeline that takes a CT scan and outputs subtype classification without manual input would be far more useful in clinical practice.

Expanded subtype coverage and advanced architectures: Future work could incorporate additional rare renal tumor subtypes and explore more modern architectures beyond Inception V3 and ResNet-50, such as EfficientNet, Vision Transformers (ViT), or 3D CNNs that use volumetric information rather than individual 2D slices. Incorporating multi-phase CT data (corticomedullary, nephrographic, and excretory phases) as separate input channels could also improve discriminative power, particularly for subtypes that show distinct enhancement patterns across phases.

The study's finding that transfer learning alone was insufficient, and that substantial fine-tuning was needed, is an important insight for the field. It suggests that medical imaging features for renal tumor classification diverge significantly from the natural image features learned by ImageNet-pretrained models, and future studies should budget for extensive fine-tuning or consider pretraining on large medical imaging datasets instead.

TL;DR: Next steps include multi-center external validation, automated segmentation to replace manual ROI drawing, expanded subtype coverage, and exploration of modern architectures (EfficientNet, Vision Transformers, 3D CNNs). The finding that pure transfer learning was inadequate (accuracy of 0.689-0.717) underscores the need for domain-specific fine-tuning.
Citation: Kan HC, Lin PH, Shao IH, et al. Open Access, 2025. Available at: PMC11866614. DOI: 10.1186/s12880-025-01606-3. License: CC BY.