KidneyNeXt: A Lightweight Convolutional Neural Network for Multi-Class Renal Tumor Classification in Computed Tomography Imaging


Plain-English Explanations
Pages 1-3
Why Automated Renal Tumor Classification on CT Is a Clinical Priority

Kidney cancer is a major global health burden. In 2018, more than 400,000 new cases were reported worldwide, and that number is projected to reach approximately 475,400 annually by 2030. Renal cell carcinoma (RCC) is the most prevalent malignant subtype, accounting for roughly 85% of all adult kidney tumors. Its histological subtypes, including clear cell, papillary, and chromophobe variants, each carry distinct prognoses and treatment responses. Clinically, renal masses are grouped into three broad categories: malignant tumors requiring surgical or systemic treatment, benign tumors (such as oncocytoma and angiomyolipoma) that are usually noninvasive, and normal tissue or simple cysts that need no intervention. Reliably distinguishing among these three categories on imaging is essential to avoid unnecessary surgery and to guide patient-specific treatment plans.

The CT interpretation bottleneck: Computed tomography is the workhorse imaging modality for evaluating renal masses, but interpretation depends heavily on clinician experience. Inter-observer variability is a well-documented problem, and the overlapping visual characteristics of benign, malignant, and normal kidney tissue on CT make manual classification time-consuming and error-prone. These limitations have motivated the development of AI-driven classification tools that can streamline image analysis and reduce diagnostic ambiguity.

Prior deep learning approaches and their gaps: Over the past decade, researchers have tested a wide range of architectures for kidney tumor classification, including EfficientNet, U-Net, Swin Transformers, VGG16, ConvLSTM-Inception hybrids, and 3D Trans-ResUNet models. Many of these delivered high accuracy on individual datasets, but they share common weaknesses: high computational cost due to large parameter counts, reliance on single-center or single-dataset evaluations, and a focus on binary classification or segmentation rather than multi-class tumor typing. For example, Khan et al. (2025) achieved 99.30% accuracy with a ConvLSTM-Inception model on a two-class dataset, but performance dropped to 91.31% on a four-class problem. Loganathan et al. (2025) reached 98.87% with EACWNet but struggled with low precision in the stone class.

What KidneyNeXt aims to solve: The authors set out to build a CNN that is simultaneously lightweight, high-performing, and generalizable across multiple datasets. The model targets multi-class renal tumor classification (benign, malignant, normal, cyst, stone) rather than binary detection, and it was evaluated on three geographically and clinically distinct CT datasets to test robustness. The architecture prioritizes parameter efficiency, with only approximately 7.1 million trainable parameters, compared to the much heavier ensemble or transformer-based alternatives in the literature.

TL;DR: Kidney cancer affects over 400,000 people per year globally, with RCC making up 85% of cases. CT-based classification suffers from inter-observer variability. Prior deep learning models are often parameter-heavy and validated on single datasets. KidneyNeXt targets multi-class renal tumor classification with only 7.1 million parameters, tested across three diverse CT datasets.
Pages 4-6
Three CT Datasets Spanning Turkey, Bangladesh, and Jordan

Collected Dataset (3,199 images): The first dataset was retrospectively curated from the archives of Elazig Fethi Sekin City Hospital in Turkey, under ethics approval (Session No: 2025/6-20). CT scans were cropped, anonymized, and saved in PNG format. An experienced nephrologist and radiologist independently classified images into three categories. The malignant group contained 1,147 images (mean patient age 60.23 +/- 7.95 years), the benign group had 1,919 images (mean age 65.27 +/- 5.22 years), and the control group included 1,133 images (mean age 60.18 +/- 6.25 years). This dataset provided a clinically curated, single-institution baseline for model evaluation.

Kaggle CT KIDNEY Dataset (12,446 images): The second dataset is publicly available on Kaggle, originally collected from the Picture Archiving and Communication System (PACS) of several hospitals in Dhaka, Bangladesh. It contains four classes: 3,709 cyst images, 5,077 normal images, 1,377 stone images, and 2,283 tumor images. Both coronal and axial sections from contrast-enhanced and non-contrast CT studies were included. DICOM scans were segmented, anonymized, and converted to JPG format using a lossless method. Each image was independently verified by a radiologist and a medical technologist.

KAUH Jordan Dataset (7,770 images): The third dataset came from King Abdullah University Hospital (KAUH) in Jordan and includes four categories: benign (2,660 images, mean age 60.35 +/- 8.92 years), malignant (1,540 images, mean age 61.75 +/- 11.23 years), normal with cyst (1,330 images, mean age 62.84 +/- 13.17 years), and control (2,240 images, mean age 52.47 +/- 14.21 years). Each image underwent diagnostic validation for accuracy.

Preprocessing: Across all three datasets, images were resized to 224 x 224 pixels in RGB format. No data augmentation was applied. Each dataset was split into 80% training and 20% testing, with 30% of the training portion reserved for validation. The multi-source strategy was deliberate: by training and evaluating on datasets that differ in acquisition protocols, image quality, and patient demographics, the authors aimed to reduce dataset-specific learning bias and demonstrate real-world generalizability.
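The split arithmetic above can be reproduced with a short sketch. This is our own illustration, not the authors' code, and the integer-rounding behavior is an assumption:

```python
def split_counts(n_images: int, test_frac: float = 0.20, val_frac: float = 0.30):
    """Return (train, val, test) image counts under the paper's split scheme."""
    n_test = int(n_images * test_frac)      # 20% held-out test set
    n_train_pool = n_images - n_test        # remaining 80% for training
    n_val = int(n_train_pool * val_frac)    # 30% of the training pool
    n_train = n_train_pool - n_val
    return n_train, n_val, n_test

# KAUH Jordan dataset: 7,770 images -> 1,554 test images,
# matching the test-set size reported in the results section.
print(split_counts(7770))
```

Applied to the KAUH Jordan dataset, this yields the 1,554-image test set quoted later in the results.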

TL;DR: KidneyNeXt was evaluated on three CT datasets totaling 23,415 images: a clinical Turkish dataset (3,199 images, 3 classes), the Kaggle CT KIDNEY dataset from Bangladesh (12,446 images, 4 classes), and the KAUH Jordan dataset (7,770 images, 4 classes). All images were resized to 224 x 224 with no augmentation and split 80/20 for training/testing.
Pages 7-9
KidneyNeXt: Multi-Branch Convolutions with Grouped Processing and Hierarchical Depth

The KidneyNeXt architecture is a hierarchically structured CNN with four processing stages and a design philosophy centered on parallel feature extraction. The network begins with a stem layer consisting of two parallel convolutional layers (4 x 4 kernels), each followed by batch normalization (BN) and GELU activation. The outputs of these two branches are averaged to form the base feature map with dimensions 56 x 56 x 96. This dual-path stem is intended to capture complementary low-level features from the input image before passing them to deeper layers.

KidneyNeXt blocks: Each of the four main processing stages (KidneyNeXt 1 through KidneyNeXt 4) processes the feature map through four parallel streams: 3 x 3 max pooling, 3 x 3 average pooling, and two separate grouped convolutions. The outputs of all four streams are concatenated and then projected to the target channel depth via a 1 x 1 convolution. For example, in the first stage the concatenated output has 384 channels (4 x 96), which is projected to 192 channels for the next stage. Channel sizes double at each stage (96, 192, 384, and 768) while spatial dimensions halve (56 x 56 down to 7 x 7). Residual connections and compressed transition layers are incorporated between stages to improve learning capacity and reduce overfitting.
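The stage-by-stage shapes can be traced with a small sketch (ours, not the authors' code). We assume the halving/doubling happens in the transitions between stages, which is consistent with the 56 x 56 x 96 stem output and the final 7 x 7 x 768 tensor:

```python
def kidneynext_shape_trace():
    """Return (stage, spatial, channels, concat_channels) for each stage."""
    trace = []
    h, c = 56, 96                    # stem output: 56 x 56 x 96
    for stage in (1, 2, 3, 4):
        concat_c = 4 * c             # four parallel streams, concatenated
        trace.append((stage, h, c, concat_c))
        if stage < 4:                # transition between stages:
            h, c = h // 2, c * 2     # halve spatial size, double channels
    return trace

# stage 1 concatenates to 4 x 96 = 384 channels; stage 4 runs at 7 x 7 x 768
print(kidneynext_shape_trace())
```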

Classification head: After the fourth block, the final 7 x 7 x 768 tensor passes through global average pooling (GAP) to produce a 768-dimensional feature vector. This vector is fed into a fully connected layer followed by a softmax function for class prediction. The entire architecture contains approximately 7.1 million trainable parameters, making it substantially lighter than many alternatives. For context, EfficientNet-B7 has around 66 million parameters, and transformer-based models like Swin-T have roughly 28 million.
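A minimal numpy sketch of this head, with random placeholder weights and three classes chosen to match the Collected Dataset (this is an illustration of GAP + FC + softmax, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((7, 7, 768))     # output of the fourth block

pooled = feature_map.mean(axis=(0, 1))             # GAP -> 768-dim vector

n_classes = 3                                      # e.g. benign/malignant/control
W = 0.01 * rng.standard_normal((768, n_classes))   # placeholder FC weights
b = np.zeros(n_classes)
logits = pooled @ W + b

exp = np.exp(logits - logits.max())                # numerically stable softmax
probs = exp / exp.sum()
print(probs)                                       # class probabilities, sum to 1
```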

Design rationale: The combination of max pooling and average pooling within each block captures both sharp local features (edges, boundaries) and smooth regional patterns (texture gradients). Grouped convolutions reduce parameter count while still learning diverse filter representations. The hierarchical doubling of channels ensures the network builds increasingly abstract feature maps at each stage without an explosion in computational cost.
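The parameter saving from grouped convolutions is simple arithmetic: with g groups, each output filter sees only C_in / g input channels. The numbers below are illustrative; the paper does not report its group count:

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution with grouped channels (bias ignored)."""
    return k * k * (c_in // groups) * c_out

dense   = conv_params(3, 96, 96, groups=1)   # standard 3x3 convolution
grouped = conv_params(3, 96, 96, groups=4)   # same in/out shape, 4 groups

print(dense, grouped, dense // grouped)      # grouped uses 4x fewer weights
```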

TL;DR: KidneyNeXt uses a dual-path stem (4 x 4 convolutions), four hierarchical blocks with parallel max pooling, average pooling, and grouped convolutions, and a GAP-to-softmax classification head. Channel sizes progress from 96 to 768 across stages. The full architecture has only 7.1 million parameters.
Pages 9-10
Two-Stage Training: ImageNet Pretraining Followed by Fine-Tuning on Kidney CT

The training protocol followed a two-stage approach combining transfer learning with task-specific fine-tuning. In the first stage, KidneyNeXt was pretrained on the ImageNet-1K dataset for 90 epochs using the AdamW optimizer with a learning rate of 1 x 10^-3 and a weight decay coefficient of 1 x 10^-4. This pretraining phase initialized the model's convolutional filters with general-purpose visual features (edges, textures, shapes) learned from over one million natural images spanning 1,000 categories.

Fine-tuning on kidney CT data: After pretraining, the model was fine-tuned on the combined kidney CT dataset. This stage used the Stochastic Gradient Descent with Momentum (SGDM) optimizer, with a momentum value of 0.9, over 30 epochs. The initial learning rate was set to 0.01, batch size was 128, and L2 regularization was applied with a coefficient of 1 x 10^-4. Data shuffling occurred at the beginning of each epoch to improve stochastic sampling. Neither checkpointing nor early stopping was used during training.
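The fine-tuning update rule can be sketched with the textbook SGD-with-momentum step and the hyperparameters quoted above (momentum 0.9, learning rate 0.01, L2 coefficient 1e-4). This is a generic illustration, not the authors' MATLAB code:

```python
import numpy as np

def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.9, l2=1e-4):
    """One SGDM update with L2 regularization folded into the gradient."""
    grad = grad + l2 * w                       # L2 regularization term
    velocity = momentum * velocity - lr * grad # momentum-smoothed step
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgdm_step(w, grad=np.array([0.5, -0.5]), velocity=v)
print(w)
```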

Hardware and software: All experiments were conducted on a workstation running Windows 11 with a 13th-generation Intel Core i9-13900K processor, 128 GB of RAM, 1 TB SSD storage, and an NVIDIA GeForce RTX 4080 Super GPU. The entire pipeline, from preprocessing to training and validation, was implemented in MATLAB 2023b. The choice of MATLAB rather than Python-based frameworks (PyTorch, TensorFlow) is notable, as it may affect reproducibility for researchers more familiar with the dominant deep learning ecosystems.

TL;DR: KidneyNeXt was pretrained on ImageNet-1K for 90 epochs (AdamW, lr=0.001) and fine-tuned on kidney CT for 30 epochs (SGDM, momentum 0.9, lr=0.01, batch size 128). Training ran on an RTX 4080 Super GPU in MATLAB 2023b with no early stopping or checkpointing.
Pages 10-15
Near-Perfect Classification Across All Three Datasets

Collected Dataset (3 classes): On the clinical Turkish dataset, KidneyNeXt achieved an overall accuracy of 99.76% and a macro-averaged F1 score of 99.71%. The benign class scored a perfect 100% across precision, recall, and F1 score with zero false positives and zero false negatives. The control class reached 99.56% F1 (225 of 227 correctly classified, with 2 misclassified as malignant). The malignant class achieved 99.57% F1 with 99.13% precision and 100% recall. All three ROC curves yielded an AUC of 1.00, indicating perfect class separability on this dataset.

Kaggle CT KIDNEY Dataset (4 classes): Performance on this larger public benchmark was even higher: 99.96% overall accuracy and a 99.94% macro-averaged F1 score. The normal class achieved perfect scores (100% precision, recall, and F1). The tumor class reached 99.94% F1 (99.88% precision, 100% recall). Only two misclassifications occurred across the entire 4,480-image test set: one cyst image was misclassified, and one stone image was misclassified. The cyst class still achieved 99.92% F1, and the stone class reached 99.90% F1.

KAUH Jordan Dataset (4 classes): On this third dataset, the model achieved 99.74% overall accuracy and a 99.72% macro-averaged F1 score. The normal class again scored a perfect 100% across all metrics. The malignant class reached 99.84% F1 (100% precision, 99.68% recall). The benign class scored 99.62% F1, and the cyst class scored 99.44% F1. A total of four misclassifications occurred across the 1,554-image test set: two benign images, one cyst image, and one malignant image.
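For reference, macro-averaged F1 (the headline metric in these results) is the unweighted mean of per-class F1 scores computed from a confusion matrix. The matrix below is a small hypothetical example, not the paper's actual counts:

```python
import numpy as np

def macro_f1(cm):
    """Macro-averaged F1 from a confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)          # column sums = predicted counts
    recall = tp / cm.sum(axis=1)             # row sums = true counts
    f1 = 2 * precision * recall / (precision + recall)
    return f1.mean()                         # unweighted mean over classes

# hypothetical 3-class confusion matrix
cm = [[100, 0, 0],
      [0, 98, 2],
      [1, 0, 99]]
print(round(macro_f1(cm), 4))
```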

Training dynamics: Across all three datasets, the training and validation accuracy curves showed rapid convergence, surpassing 95% within the early iterations and stabilizing quickly. Loss curves converged to minimal values without divergence between training and validation sets, suggesting no overfitting despite the absence of early stopping or checkpointing mechanisms.

TL;DR: KidneyNeXt achieved 99.76% accuracy (F1 99.71%) on the clinical dataset, 99.96% accuracy (F1 99.94%) on the Kaggle dataset, and 99.74% accuracy (F1 99.72%) on the KAUH Jordan dataset. All ROC curves on the clinical dataset reached AUC 1.00. Total misclassifications across all three test sets were minimal (2, 2, and 4 respectively).
Pages 12-13
Deep Feature Validation and Grad-CAM Interpretability

Traditional classifier validation: To independently confirm the quality of features learned by KidneyNeXt, the authors extracted deep features from the model's final fully connected layer and fed them into several traditional machine learning classifiers using 10-fold cross-validation on the Collected Dataset test partition. A standard SVM achieved 100.00% accuracy on these extracted features. Efficient Linear SVM, Ensemble methods, KNN, and Neural Network classifiers all exceeded 99.80% accuracy. This secondary analysis demonstrates that the feature representations learned by KidneyNeXt are highly discriminative, even when separated from the CNN's own classification head.
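This validation pipeline, deep features in, classical classifier with 10-fold cross-validation out, can be sketched with scikit-learn. Synthetic features stand in for the real 768-dimensional deep features, and the study also tested several classifier families beyond the linear SVM shown here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# synthetic 3-class "deep features" standing in for the extracted FC-layer features
X, y = make_classification(n_samples=300, n_features=64, n_informative=16,
                           n_classes=3, random_state=0)

# 10-fold cross-validation of an SVM on the extracted features
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())
```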

Misclassification analysis: The only misclassifications on the Collected Dataset involved two control images incorrectly predicted as malignant. The authors visualized these misclassified cases and attributed the errors to visual ambiguity, specifically subtle texture similarities between certain normal kidney structures and pathological tissue patterns. These cases highlight the importance of incorporating uncertainty estimation tools in clinical deployment scenarios.

Grad-CAM visualizations: Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to generate class-specific heatmaps showing which image regions most influenced the model's predictions. In malignant cases, the activated regions were broader and asymmetrical, consistent with tumoral spread patterns. In benign cases, attention was more localized and well-defined, correlating with the typically circumscribed nature of benign lesions. These visualizations provide clinical plausibility for the model's decision-making and serve as a transparency mechanism for radiologist review.
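The core Grad-CAM computation reduces to a weighted sum of activation maps, where the weights are globally averaged gradients of the class score. A numpy sketch with placeholder arrays (a real implementation would pull A and the gradients from the trained network):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((7, 7, 768))       # final-stage activation maps (H x W x K)
dY_dA = rng.standard_normal((7, 7, 768))   # gradient of class score w.r.t. A

alpha = dY_dA.mean(axis=(0, 1))            # per-channel importance weights
cam = np.maximum((A * alpha).sum(axis=2), 0.0)   # ReLU of weighted channel sum

if cam.max() > 0:
    cam = cam / cam.max()                  # normalize to [0, 1] for heatmap overlay
print(cam.shape)
```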

TL;DR: Deep features from KidneyNeXt's FC layer achieved 100% accuracy via SVM in 10-fold cross-validation, confirming high feature discriminability. Grad-CAM heatmaps showed the model focuses on clinically relevant regions: broad asymmetric activation for malignant cases and localized attention for benign lesions.
Pages 17-19
How KidneyNeXt Stacks Up Against 16 State-of-the-Art Models

The authors benchmarked KidneyNeXt against 16 recent deep learning studies on kidney CT classification. Several key comparisons stand out. Islam et al. (2022) achieved 99.30% accuracy with a Swin Transformer on the same Kaggle dataset, but KidneyNeXt surpassed this with 99.96%. Khan et al. (2025) reported 99.30% accuracy on a two-class problem with a ConvLSTM-Inception hybrid, but that model's accuracy dropped to 91.31% on a four-class task, whereas KidneyNeXt maintained above 99.7% across both three-class and four-class datasets. Loganathan et al. (2025) reached 98.87% with EACWNet but suffered from low precision in the stone class, a problem KidneyNeXt did not exhibit (100% precision for stones on the Kaggle dataset).

Transformer and ensemble comparisons: Rehman et al. (2025) combined Swin ViT with DeepLabV3+ for 99.20% accuracy, but at significantly higher computational cost. Ayogu et al. (2025) reached 99.67% with an ensemble of InceptionV3, CCT, SwinT, VGG16, EANet, and ResNet50, still below KidneyNeXt's 99.96% on the same dataset. Kashyap et al. (2025) reported 99.60% with a Vision Transformer using transfer learning, again lower than KidneyNeXt. Hossain et al. (2025) achieved 99.75% with EfficientNet-B7 combined with ROI extraction and pixel reduction, but with a lower macro F1 of 98.78% compared to KidneyNeXt's 99.94%.

Lightweight advantage: A critical differentiator is parameter efficiency. Many competing models rely on parameter-heavy architectures: EfficientNet-B7 has approximately 66 million parameters, and ensemble methods multiply this further. KidneyNeXt achieves its top-tier results with only 7.1 million parameters, translating to faster inference times and lower memory requirements. This makes the model more practical for deployment on standard clinical hardware rather than requiring dedicated GPU servers.

Multi-dataset generalization: Most prior studies validated on only one or two datasets. KidneyNeXt was evaluated across three datasets from three different countries (Turkey, Bangladesh, Jordan), with different imaging protocols and patient demographics, and maintained consistently high performance across all of them. This multi-source validation is a meaningful step toward demonstrating real-world clinical applicability.

TL;DR: KidneyNeXt outperformed 16 state-of-the-art models, achieving 99.96% accuracy on the Kaggle dataset versus 99.30% for Swin Transformer, 98.87% for EACWNet, and 99.67% for a six-model ensemble. It does this with only 7.1M parameters and was validated across three international datasets, a combination no prior study achieved.
Pages 19-20
Single-Modality Design, Missing Clinical Metadata, and the Path to Clinical Integration

No multi-center framework: Although the model was tested on three datasets from different geographic sources, the datasets were not collected under a coordinated multi-center study protocol. This means differences in CT scanner manufacturers, contrast protocols, and slice thickness were not systematically controlled or documented. The lack of a prospective multi-institutional design could introduce population bias and limit generalizability to clinical environments with substantially different imaging practices.

Imaging-only input: KidneyNeXt relies exclusively on CT imaging data. Important clinical variables such as patient sex, body mass index (BMI), renal function indicators (e.g., eGFR, creatinine), and tumor staging information were not incorporated into the model. In clinical practice, these variables significantly influence diagnostic confidence and treatment decisions. The absence of clinical metadata constrains the model's ability to account for inter-individual variability and limits its usefulness as a standalone diagnostic tool.

No external prospective validation: All three datasets were used in a retrospective setting with pre-labeled images. No prospective clinical trial or real-time deployment study was conducted. The extremely high accuracy figures (above 99.7% across all datasets) may partly reflect the relatively standardized nature of the curated images compared to the noise, motion artifacts, and ambiguous cases encountered in daily clinical practice. Prospective validation on consecutive, unselected patient scans would provide a more realistic assessment of clinical performance.

Future directions: The authors outline three priorities for future work. First, external validation using datasets collected from institutions across varied geographic locations and ethnic populations to test fairness and robustness. Second, integration of structured clinical metadata (demographics, laboratory values) into the classification pipeline to improve interpretability and diagnostic relevance. Third, practical deployment studies including inference time analysis, PACS compatibility testing, and system interoperability assessments, followed by pilot usability evaluations with clinicians in real clinical settings.

TL;DR: Key limitations include no coordinated multi-center protocol, reliance on imaging alone without clinical metadata (BMI, eGFR, staging), and no prospective validation. The 99.7%+ accuracy figures may not fully reflect real-world conditions. Future work targets external validation across diverse populations, clinical metadata integration, and PACS-compatible deployment studies.
Citation: Maçin G, Genç F, Taşcı B, Dogan S, Tuncer T. KidneyNeXt: A Lightweight Convolutional Neural Network for Multi-Class Renal Tumor Classification in Computed Tomography Imaging. 2025. Open Access (CC BY). PMC: PMC12295850. DOI: 10.3390/jcm14144929.