Kidney tumors, especially renal cell carcinoma (RCC), are a major public health concern. RCC accounts for roughly 90% of all kidney cancers, with malignant subtypes including clear cell, papillary, and chromophobe RCC; benign renal lesions such as angiomyolipomas and oncocytomas also occur. In the United States alone, an estimated 76,000 new kidney cancer cases and 13,780 deaths occurred in 2021, and incidence has risen by roughly 1.1% per year over the past five years.
Computed tomography (CT) is the primary imaging modality for kidney tumor diagnosis due to its high spatial resolution and ability to detect small tumors. However, conventional diagnostic methods rely heavily on radiologists' manual assessments, which are labor-intensive, subjective, and prone to both intra-observer variability (the same radiologist interpreting the same image differently at different times) and inter-observer variability (different radiologists disagreeing on the same image). These inconsistencies, combined with the growing volume of imaging data, create an urgent need for automated systems.
The authors identify five key gaps in existing AI approaches: (1) limited dataset diversity reducing generalizability, (2) poor model reproducibility across trials, (3) lack of workflow-compatible designs for clinical integration, (4) insufficient classification granularity that stops at benign versus malignant without identifying specific subtypes, and (5) limited interpretability that undermines clinician trust. This study directly targets all five gaps with a hierarchical, clinically aligned framework.
Several deep learning studies have targeted kidney tumor detection using the same KAUH dataset as this paper. Alzu'bi et al. built 2D-CNN models, including CNN-6, ResNet50, and VGG16, on 8,400 CT images from 120 patients, achieving detection accuracies of 97%, 96%, and 60%, respectively, and a classification accuracy of 92%. Praveen et al. fused ResNet and ResNeXt architectures on a 2,170-image subset, reaching 94% accuracy. Kaur et al. used a Sequential CNN with data augmentation, achieving 97.69% training accuracy and 95.31% validation accuracy.
Outside the KAUH dataset, Qadir et al. combined DenseNet-201 for feature extraction with Random Forest classification, reaching 99.719% accuracy on 12,446 CT images. Almuayqil et al. introduced KidneyNet, a CNN integrated with Grad-CAM for interpretability, achieving 99.88% accuracy, 99.92% specificity, and 99.76% sensitivity on chronic kidney disease diagnosis.
Despite these impressive single-run accuracies, the authors note that none of these studies report confidence intervals, variance measures, or results from repeated trials. Most emphasize algorithmic accuracy while overlooking robustness, reproducibility, and clinical workflow integration. Furthermore, existing models typically stop at binary classification (benign versus malignant) without addressing the clinically important task of malignant subtype differentiation needed for treatment planning.
The study used a dataset of renal CT scans from King Abdullah University Hospital (KAUH) in Jordan, comprising 8,400 images from 120 adult patients (aged 30 to 80) who underwent CT scans for suspected kidney masses between 2020 and 2021. Ethical approval was obtained from the KAUH IRB, and all patient data were anonymized. The dataset included 60 patients with kidney tumors (38 benign, 22 malignant) and 60 normal cases, some presenting with cysts, stones, or hydronephrosis.
The imaging data featured contrast-enhanced CT scans for assessing tumor vascularity, non-contrast scans for baseline anatomy, and multiphase CT imaging (non-contrast, corticomedullary, and nephrographic phases) for tumor differentiation. Images were converted from DICOM to JPEG format, with 70 images selected per patient. Radiologists annotated each image as normal, benign, or malignant, and recorded metadata including tumor stage (I through IV), location (upper, middle, or lower kidney), and subtype. Approximately 60% of patients received contrast material while 40% did not due to contraindications.
The dataset was organized hierarchically across four classification levels: (1) Normal versus Tumor with 51 normal and 60 tumor cases, (2) Tumor Type with 38 benign and 22 malignant cases, (3) Benign Tumor Subtypes including adenoma (28 cases), angiomyolipoma (8 cases), and rare subtypes, and (4) Malignant Tumor Subtypes with 10 RCC and 12 secondary metastasis cases. This four-level structure mirrors the clinical diagnostic workflow.
The framework's first stage uses a specialized encoder called RAD-DINO-MAIRA-2 to extract discriminative features from CT scans. This encoder inherits the Vision Transformer (ViT) architecture, which processes input images by dividing them into smaller patches, each linearly projected into a lower-dimensional embedding space. For an input image of height H, width W, and C channels, the image is split into N = HW/P² non-overlapping patches of size P × P, producing a sequence of patch embeddings of dimension D.
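As a minimal sketch of this patch-and-project step (NumPy; the 224 × 224 × 3 input, P = 16, and D = 768 are illustrative choices, not values stated in the paper):

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = (H/P)*(W/P) flattened P x P patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)     # group pixels by patch grid cell
    return patches.reshape(-1, P * P * C)          # (N, P*P*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))         # H = W = 224, C = 3
patches = patchify(image, P=16)                    # N = (224/16)**2 = 196 patches
E = rng.standard_normal((16 * 16 * 3, 768))        # learnable projection to D = 768
embeddings = patches @ E                           # sequence of 196 D-dim embeddings
```

Each row of `embeddings` corresponds to one image patch, forming the token sequence the transformer layers consume.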
A learnable CLS (classification) token is prepended to the patch embedding sequence, and the combined sequence passes through transformer layers that alternate multi-head self-attention (MSA) and feed-forward networks (FFN). The self-attention mechanism computes attention weights from query, key, and value projections, allowing the model to capture global relationships across all image patches simultaneously. The encoder was originally optimized on chest radiographs, and this knowledge was then transferred to kidney CT scans.
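A single-head version of the scaled dot-product attention described here can be sketched as follows (token count and dimensions are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a token sequence X of shape (N, D)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V                             # every output mixes all tokens

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                    # e.g., CLS token + 4 patches, D = 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                # shape (5, 8)
```

Because the attention weights span every token pair, the CLS token's output row aggregates information from the whole image, which is what makes it useful as a global feature.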
The encoder produces two key outputs: the last hidden state (including the CLS token embedding and patch embeddings) and flattened mean embeddings. These high-dimensional embeddings capture essential image features including shape, texture, and intensity patterns critical for distinguishing tumor types. The embeddings are stored in a lookup table and normalized using z-score normalization (zero mean, unit standard deviation) before being split into training and testing sets using an 80/20 split by case number (not individual images) to prevent data leakage.
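The case-level 80/20 split and z-score step can be sketched with scikit-learn's GroupShuffleSplit (the array shapes and case counts here are illustrative, not the paper's exact pipeline):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((700, 768))        # one embedding row per image
groups = np.repeat(np.arange(10), 70)      # patient/case ID: 10 cases x 70 images
y = rng.integers(0, 2, size=700)

# 80/20 split by case, so all images from one patient land in the same fold
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# z-score normalization, fitted on the training images only
mu, sigma = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
X_train, X_test = (X[train_idx] - mu) / sigma, (X[test_idx] - mu) / sigma

# no case contributes images to both splits -- this is what prevents leakage
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

Splitting by case rather than by image matters because adjacent slices from the same patient are highly correlated; an image-level split would leak near-duplicates into the test set and inflate accuracy.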
In the training and optimization stage, the study evaluated a total of 32 machine learning classifiers implemented using the Scikit-Learn Python library. These spanned a wide range of algorithm families: conventional classifiers such as Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN), ensemble methods like Gradient Boosting and AdaBoost, probabilistic models like Gaussian Process, linear models like Passive Aggressive, and neural network-based classifiers such as Multi-Layer Perceptrons (MLPs).
Five normalization options were evaluated for the embedding data, among them StandardScaler, MinMaxScaler, RobustScaler, and QuantileTransformer. The optimization procedure tuned hyperparameters such as learning rate, regularization strength, and network depth using grid search and random search, and cross-validation was used to verify model reliability across different data splits.
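A grid search over such hyperparameters might look like the following sketch (the parameter grid and synthetic data are assumptions for illustration, not the paper's settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=32, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", MLPClassifier(max_iter=300, random_state=0))])

grid = {
    "clf__learning_rate_init": [1e-3, 1e-2],       # learning rate
    "clf__alpha": [1e-4, 1e-2],                    # L2 regularization strength
    "clf__hidden_layer_sizes": [(32,), (64, 32)],  # network depth/width
}

# 3-fold cross-validation over every grid combination
search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

Wrapping the scaler and classifier in one Pipeline ensures the normalization statistics are refit inside each cross-validation fold, which keeps the tuning itself leakage-free.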
The rationale for testing multiple algorithms rather than committing to a single one is that different classifiers have distinct strengths: SVMs excel in high-dimensional spaces, Random Forest handles noisy data well, and MLPs can model complex non-linear relationships. By evaluating all 32 classifiers at each hierarchical level independently, the framework identifies the best-performing model for each specific classification task. Performance was assessed using accuracy, precision, recall, F1-score, and specificity, with results visualized through ROC curves, precision-recall curves, and confusion matrices.
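The per-level model sweep can be sketched as a loop over scaler-classifier pipelines scored by cross-validation (only two of the 32 candidates are shown, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, RobustScaler

X, y = make_classification(n_samples=300, n_features=64, random_state=0)

# two of the many scaler-classifier combinations, shown as examples
candidates = {
    "PassiveAggressive+RobustScaler": make_pipeline(
        RobustScaler(), PassiveAggressiveClassifier(random_state=0)),
    "kNN+QuantileTransformer": make_pipeline(
        QuantileTransformer(n_quantiles=100, random_state=0),
        KNeighborsClassifier()),
}

# score each candidate with 5-fold cross-validation and keep the best
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}
best = max(results, key=results.get)
```

Running this sweep independently at each hierarchical level is what lets different classifier families win at different levels, as the results below show.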
The framework was validated across 25 independent trials to establish confidence intervals and assess reproducibility. The maximum performance results (best of 25 trials) were outstanding. For Normal vs. Tumor classification (Level 1), the Passive Aggressive classifier with RobustScaler achieved 95.50% accuracy, 96.55% precision, 96.67% recall, 95.87% F1-score, and 96.08% specificity.
For Tumor Type classification (Level 2), the Gaussian Process classifier with QuantileTransformer achieved a perfect 100% across all metrics: accuracy, precision, recall, F1, and specificity. For Benign Tumor Subtype classification (Level 3), the KNeighbors classifier with QuantileTransformer reached 97.66% accuracy, 97.37% precision, 97.51% recall, 97.51% F1, and 99.30% specificity. For Malignant Tumor Subtype classification (Level 4), the MLP classifier with RobustScaler also achieved a perfect 100% across all metrics.
The overall framework achieved a maximum accuracy of 98.29%, with a mean precision of 98.48%, recall of 98.55%, F1-score of 98.34%, and specificity of 98.84%. That two of the four hierarchical levels reached perfect classification (Gaussian Process for tumor type, MLP for malignant subtypes) suggests that decomposing a complex multi-class problem into sequential sub-tasks substantially improves performance at each stage.
While peak performance shows what the framework can achieve at its best, the mean performance across 25 trials reveals its consistency and robustness. At Level 1 (Normal vs. Tumor), the Passive Aggressive classifier maintained a mean accuracy of 92.86%, with 93.39% precision, 93.47% recall, 93.39% F1, and 92.16% specificity. At Level 2 (Tumor Types), the Gaussian Process classifier achieved a mean accuracy of 95.82%, with 94.26% recall, 95.87% F1, and an impressive 97.58% specificity.
At Level 3 (Benign Tumor Subtypes), KNeighbors maintained a mean accuracy of 94.67%, with 94.53% precision, 94.60% recall, and 94.83% F1. At Level 4 (Malignant Tumor Subtypes), the MLP classifier achieved a mean accuracy of 95.51%, with 93.74% recall, 93.27% F1, and 94.40% specificity. The overall framework delivered a mean accuracy of 94.72% across all levels and trials.
Statistical analyses using box plots and violin plots across all 25 trials confirmed the framework's stability. The narrow interquartile ranges and minimal outliers indicated high reproducibility. The violin plots showed symmetrical distributions with high density around median values, confirming no significant skewness. This level of statistical rigor, with 25 repeated trials and variance analysis, sets this study apart from prior work that reported only single-run results without confidence intervals.
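The variance analysis described here reduces to computing a mean and confidence interval over the per-trial accuracies; a sketch with simulated trial values (not the paper's actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
# accuracies from 25 independent trials (simulated here for illustration)
accs = rng.normal(loc=0.947, scale=0.01, size=25)

mean = accs.mean()
# 95% confidence interval half-width under a normal approximation
half_width = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
print(f"mean accuracy: {mean:.3f} +/- {half_width:.3f}")
```

Reporting the interval alongside the best single run is what distinguishes this study's evaluation from the single-run accuracies in prior work.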
The hierarchical classification strategy was designed to mirror actual clinical diagnostic reasoning. Radiologists typically first determine if a lesion is present, then assess whether it is benign or malignant, and finally identify the specific subtype if malignant. By structuring the AI framework to follow this same logical progression, the system gains several advantages over traditional flat (single-stage) classifiers.
Improved performance through task decomposition: Breaking a complex multi-class problem into simpler sequential binary or multi-class sub-problems reduces the classification burden on each individual model. Distinguishing normal tissue from any tumor (Level 1) is far simpler than classifying all subtypes simultaneously in a single model, which helps explain why classifiers achieved perfect 100% scores at Levels 2 and 4.

Class imbalance mitigation: In the KAUH dataset, some benign subtypes have very few samples (e.g., lipoma with only 1 case). A flat model would be biased toward majority classes, but the hierarchical approach isolates rare classes into later stages where they compete against fewer alternatives.
Enhanced clinical utility: Instead of a single opaque prediction, the system provides intermediate results at each level (e.g., "Tumor Detected" then "Malignant" then "RCC"), allowing clinicians to follow the reasoning step by step. Each level can also be analyzed independently to identify failure modes.

Scalability: New tumor subtypes can be added to existing levels without retraining the entire model. If future research identifies new biomarkers relevant to a specific stage, the corresponding classifier can be updated independently, making the framework adaptable to evolving medical knowledge.
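The four-level cascade described above can be sketched as a thin wrapper around independently trained per-level models (the class and its wiring are assumptions for illustration, not the authors' code):

```python
class HierarchicalKidneyClassifier:
    """Cascade of per-level models mirroring the clinical workflow:
    detect -> benign/malignant -> specific subtype."""

    def __init__(self, detector, typer, benign_sub, malignant_sub):
        self.detector = detector            # Level 1: normal vs tumor
        self.typer = typer                  # Level 2: benign vs malignant
        self.benign_sub = benign_sub        # Level 3: benign subtypes
        self.malignant_sub = malignant_sub  # Level 4: malignant subtypes

    def predict(self, x):
        """Return the step-by-step decision trail for one embedding x."""
        if self.detector.predict([x])[0] == "normal":
            return ["Normal"]
        trail = ["Tumor Detected"]
        if self.typer.predict([x])[0] == "benign":
            trail += ["Benign", self.benign_sub.predict([x])[0]]
        else:
            trail += ["Malignant", self.malignant_sub.predict([x])[0]]
        return trail

class Always:
    """Stand-in model that always predicts one label (for demonstration only)."""
    def __init__(self, label): self.label = label
    def predict(self, X): return [self.label] * len(X)

clf = HierarchicalKidneyClassifier(Always("tumor"), Always("malignant"),
                                   Always("adenoma"), Always("RCC"))
trail = clf.predict([0.1, 0.2, 0.3])   # a dummy 3-dim embedding
# trail == ['Tumor Detected', 'Malignant', 'RCC']
```

Because each stage is a separate model behind a shared `predict` interface, any single level can be retrained or swapped without touching the others, which is exactly the scalability property noted above.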
A major limitation of deep learning models in medical imaging is their "black-box" nature, which undermines clinician trust. This framework addresses interpretability at multiple levels. The Gaussian Process classifier (Level 2) inherently provides uncertainty estimates (variance) alongside predictions, meaning high-uncertainty cases can be automatically flagged for human review. The Passive Aggressive classifier (Level 1) is a linear model variant whose feature weights reveal which embedding dimensions are most discriminative for tumor detection. The MLP classifier (Level 4) can be explained using techniques like Layer-wise Relevance Propagation (LRP) or Integrated Gradients.
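Flagging high-uncertainty Gaussian Process predictions for human review can be sketched as follows (the 0.7 confidence threshold and the synthetic data are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
gp = GaussianProcessClassifier(random_state=0).fit(X[:150], y[:150])

proba = gp.predict_proba(X[150:])        # per-class posterior probabilities
confidence = proba.max(axis=1)           # 0.5 = maximally uncertain, 1.0 = certain
flagged = np.where(confidence < 0.7)[0]  # route these cases to a radiologist
```

This triage pattern turns the classifier's calibrated probabilities into an operational rule: confident predictions pass through automatically, while ambiguous cases receive human review.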
The authors outline several future directions to further strengthen interpretability: integration of Grad-CAM (Gradient-weighted Class Activation Mapping) on the RAD-DINO-MAIRA-2 encoder outputs to generate heatmaps highlighting influential anatomical regions, application of SHAP (SHapley Additive exPlanations) values to quantify feature contributions, and development of a clinical feedback loop where physicians can provide input on AI predictions to refine the model's focus areas over time.
Additional future work includes extending the dataset to incorporate more diverse populations and imaging modalities, refining the framework for real-time clinical deployment, and undertaking multi-center validation studies using independent datasets from collaborating institutions. The authors also highlight the need for domain adaptation techniques to ensure robust performance across diverse clinical settings, and the consideration of ethical issues around data privacy and algorithmic bias for responsible AI implementation in healthcare.