Kidney Tumor Semantic Segmentation Using Deep Learning: A Survey of State-of-the-Art

Journal of Imaging, 2022

Plain-English Explanations
Pages 1-3
Why Kidney Tumor Segmentation Matters: Scale of the Problem

Kidney cancer ranks among the top 10 malignancies in both men and women, with a lifetime probability of about 1 in 75 (1.34%). Renal cancer (RC) affects over 400,000 individuals each year globally, and more than 175,000 deaths are attributed to it according to the Global Cancer Observatory. Renal cell carcinoma (RCC) has the third-highest incidence among urological cancers, behind prostate and bladder cancer. In the United States alone, RCC is the seventh most frequent cancer in men and the ninth in women, with 48,780 new cases and 27,300 deaths reported annually. Clear cell RCC accounts for approximately 80-90% of all kidney cancers, and the worldwide incidence rate has risen by 2% per year over the past two decades.

The segmentation challenge: Distinguishing between benign kidney tumors and malignant renal cell carcinoma on radiography can be extremely difficult, yet the majority of kidney tumors turn out to be cancerous. Manual segmentation by expert radiologists is time-consuming and suffers from significant intra-rater variability (the same person segmenting differently at different times) and inter-rater variability (different specialists producing different segmentations). This inconsistency directly impacts treatment planning, since radical nephrectomy (removing the entire kidney) versus partial nephrectomy (removing only the tumor) depends heavily on accurate delineation of the tumor boundaries.

CT imaging as the standard: Computed tomography (CT) is the preferred imaging modality for kidney tumors because it produces high-resolution images with excellent contrast and spatial resolution. CT imaging is frequently used in clinics for therapy planning and kidney tumor segmentation. The paper focuses primarily on CT-based deep learning segmentation, though it acknowledges that ultrasound and MRI have complementary roles. CT results can also help classify benign versus malignant lesions, making accurate automated segmentation a critical tool for improving kidney cancer diagnosis and treatment.

TL;DR: Kidney cancer affects over 400,000 people per year worldwide with 175,000+ deaths. Clear cell RCC makes up 80-90% of cases. Manual tumor segmentation on CT is slow and inconsistent, creating a strong need for automated deep learning approaches.
Pages 4-5
What This Survey Covers and How It Was Organized

This survey reviews the state-of-the-art in deep learning methods for kidney tumor semantic segmentation. The authors note that while medical image segmentation using deep learning has grown rapidly, relatively few review articles have specifically examined kidney segmentation strategies. The paper catalogues methods across multiple dimensions: the type of imaging modality (primarily CT), the segmentation approach (one-stage, two-stage, and hybrid), the network architecture (U-Net, V-Net, Alex-Net, boundary-aware FCN, cascaded networks), and the evaluation metrics used to assess performance.

Historical context: Medical image analysis evolved from rule-based systems in the 1970s-1990s (edge detectors, line filters, region growing, mathematical modeling) to supervised machine learning approaches using hand-crafted features around the 1990s (active shape models, atlas techniques, statistical classifiers). Deep learning, particularly convolutional neural networks (CNNs), began to dominate after Alex-Net won the ImageNet competition in December 2012 by a large margin. CNNs have been under development since the late 1970s (Fukushima), were first applied to medical images in 1995, and had their first major real-world application in 1998, for handwritten digit recognition.

Semantic segmentation defined: The authors distinguish four types of image segmentation: manual, semi-automatic, fully automatic, and semantic. Semantic segmentation is pixel-level classification that assigns every pixel in an image to a meaningful category. Deep learning approaches for semantic segmentation are further divided into region-based methods (extracting free-form regions, then classifying them), FCN-based methods (learning direct pixel-to-pixel mappings), and semi-supervised methods (reducing annotation burden). Additional categories include encoder-based, recurrent neural network-based, upsampling/deconvolution-based, and CRF/MRF-based methods.

TL;DR: This survey covers deep learning methods for kidney tumor segmentation on CT images, organized by architecture type (U-Net, V-Net, cascaded, hybrid) and segmentation strategy (one-stage vs. two-stage). It traces the field from 1970s rule-based systems through modern CNNs.
Pages 5-7
Single-Model Approaches: Predicting Kidney and Tumor Labels in One Pass

One-stage methods predict multi-class segmentation results directly from whole images using a single model. Myronenko et al. presented a boundary-aware fully convolutional network (FCN) for kidney and tumor segmentation from arterial phase 3D CT images. Efremova et al. combined U-Net and LinkNet-34 with ImageNet-pretrained ResNet-34 to reduce convergence time and overfitting. Guo et al. proposed RAU-Net specifically for renal tumor segmentation, using a cross-entropy function to help identify positive samples, though generalizability remained limited.

KiTS challenge results: Isensee et al. designed a U-Net that performed well on the KiTS2019 dataset by either reducing the number of layers or adding residual blocks for regularization. Causey et al. proposed the Arkansas AI-Campus model, an ensemble of U-Net variants that placed in the top five among US teams in the KiTS19 competition and performed consistently on both local and independent test data. Yang et al. achieved an average Dice coefficient of 0.931 for kidney segmentation and 0.802 for tumor segmentation using a 3D fully convolutional network with a pyramid pooling module.

Newer innovations: Shen et al. proposed COTRNet, which uses a transformer to capture long-range dependencies for accurate tumor segmentation, inspired by the DETR architecture for representing global characteristics. Heo et al. built a one-stage model for the KiTS21 Challenge using U-Net with a combined Focal and Dice Loss to address class imbalance in 3D abdominal CT images. Christina et al. employed a strategic clinical-data-based sampling approach, training a baseline 3D U-Net with random sampling and then using LASSO regression to identify the clinical features most strongly related to segmentation success.

TL;DR: One-stage methods segment kidneys and tumors in a single pass. Top performers include Yang et al. (Dice 0.931 kidney, 0.802 tumor) and Causey et al. (top-5 in KiTS19). Newer models like COTRNet introduce transformers for capturing long-range spatial dependencies.
Pages 6-8
Cascaded Pipelines: Localize First, Then Segment

Two-stage rationale: Two-stage methods address the foreground/background imbalance problem by first detecting the volume of interest (VOI) and then segmenting the target organs from within that region. Cruz et al. used deep CNNs with image processing techniques to delimit kidneys in CT images, achieving up to 93.03% accuracy but noting that further improvements were needed. Zhang et al. studied a cascaded two-stage framework using 3D FCN that first locates the kidney and removes irrelevant background before performing fine segmentation.
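The localize-then-segment idea behind these two-stage pipelines can be sketched in plain Python. This is an illustrative sketch only, not code from any of the surveyed papers: it assumes a first-stage network has already produced a coarse binary mask, and the hypothetical helpers `bounding_box_3d` and `crop_to_voi` show how the volume of interest would be extracted before fine segmentation.

```python
def bounding_box_3d(mask):
    """Tight bounding box of foreground voxels in a coarse binary mask.

    mask: nested lists indexed [z][y][x] with 0/1 values, e.g. the output
    of a (hypothetical) low-resolution first-stage network.
    Returns ((z0, z1), (y0, y1), (x0, x1)) with half-open ranges.
    """
    zs = [z for z, sl in enumerate(mask) if any(any(row) for row in sl)]
    ys = [y for sl in mask for y, row in enumerate(sl) if any(row)]
    xs = [x for sl in mask for row in sl for x, v in enumerate(row) if v]
    return (min(zs), max(zs) + 1), (min(ys), max(ys) + 1), (min(xs), max(xs) + 1)


def crop_to_voi(volume, box):
    """Crop a [z][y][x] volume to the VOI, discarding irrelevant background
    before the fine-segmentation stage runs."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return [[row[x0:x1] for row in sl[y0:y1]] for sl in volume[z0:z1]]
```

In a real cascade the crop would typically be padded with a safety margin so that the second-stage network still sees some context around the kidney.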

Multi-stage refinement: Hou et al. proposed a triple-stage self-guided network in which a low-resolution net finds the VOI from down-sampled CT, and then a full-resolution net and a tumor-refinement net extract precise kidney and tumor borders, though the approach consumed significant computational resources. Zhao et al. developed MSS U-Net (multi-scale supervised 3D U-Net), combining deep supervision with an exponential-logarithmic loss to improve training efficiency. Lv et al. offered a three-step automated approach based on 3D U-Net achieving average Dice scores of approximately 0.93 for kidneys, 0.57 for tumors, and 0.73 for cysts, indicating that tumor and cyst accuracy remained insufficient.

Hybrid approaches: Abdul Qayyum et al. designed a hybrid 3D residual network with squeeze-and-excitation (SE) blocks to acquire spatial information, tested across multiple datasets for kidney, liver, and related malignancy segmentation. Cheng et al. enhanced 3D SEAU-Net by aggregating residual networks, dilated convolutions, SE networks, and attention mechanisms, decomposing the multi-class task into two binary segmentations. Cruz et al. applied post-processing to a 2.5D DeepLabv3+ model with a DPN-131 encoder, incorporating normalization, proportional dataset distribution, and DART for improved results.

Key finding across methods: Xiao et al. used a two-stage 3D ResUnet architecture on the KiTS21 benchmark and achieved a mean Dice of 0.6543 for kidney masses and tumors combined, with a mean surface Dice of 0.4658. These numbers highlight the difficulty of the segmentation task, particularly for smaller or more ambiguous tumor regions.

TL;DR: Two-stage methods first localize the kidney region, then segment tumors within it. Kidney Dice scores often exceed 0.93, but tumor Dice varies widely (0.57 to 0.88). Hybrid models combine architectures like ResNet, SE blocks, and attention mechanisms for improved multi-class segmentation.
Pages 9-13
U-Net, V-Net, and Beyond: The Building Blocks of Kidney Segmentation

U-Net: Developed at the University of Freiburg for biomedical image segmentation, U-Net uses an encoder-decoder architecture with a contracting path (capturing context) and a symmetric expanding path (enabling precise localization). It processes a 512 x 512 image in under one second on a modern GPU. The architecture uses 3 x 3 convolutional layers, 2 x 2 max pooling, ReLU activation, and a final 1 x 1 convolutional layer. Fabian Isensee's nnU-Net, a self-configuring variant based on 3D U-Net, won multiple medical image segmentation competitions, supporting the claim that "a well-trained U-Net is difficult to surpass." U-Net remains the dominant architecture in the kidney tumor segmentation literature.

V-Net: V-Net extends U-Net with a volumetric design suited for organs and tumors that are difficult to recognize on CT, such as the prostate or kidney. Unlike standard pooling, V-Net uses convolutions for downsampling to avoid losing critical features. It includes up-convolution layers in the upsampling phase and horizontal linkages (skip connections) that pass encoder features to the decoder. V-Net introduced the Dice loss function as an alternative to standard cross-entropy, directly optimizing the metric most commonly used for evaluating segmentation quality.
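The Dice loss introduced by V-Net can be written as a short function. The sketch below follows the squared-denominator form from the V-Net paper on flat lists of per-voxel probabilities; the function name and the epsilon smoothing term are illustrative choices, not taken from the survey.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss in the V-Net style: 1 - 2|P.G| / (|P|^2 + |G|^2).

    pred:   predicted foreground probability per voxel (flat list of floats)
    target: ground-truth label per voxel (flat list of 0/1)
    eps:    small smoothing constant to avoid division by zero
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```

A perfect prediction drives the loss toward 0, while predicting all background on a foreground-containing volume drives it toward 1, which is why Dice loss copes better with class imbalance than plain cross-entropy.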

CNN fundamentals: All these architectures rely on CNN building blocks: convolutional layers (learnable filters typically 3 x 3 or 3 x 3 x 3), pooling layers (reducing feature map size while retaining critical features, commonly 2 x 2 max pooling), and fully connected layers (performing final classification). The dropout approach is used to reduce overfitting by randomly deleting nodes and connections during training. Common optimizers include Adam and SGD, with ReLU as the dominant activation function across the surveyed methods.
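The three building blocks above can be demonstrated in a few lines of plain Python. This is a minimal sketch on 2D lists-of-lists for clarity, not framework code; real networks apply these operations to batched tensors on a GPU.

```python
def relu(fmap):
    """Element-wise ReLU activation on a 2D feature map (list of lists)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def conv2d_valid(img, kernel):
    """'Valid' 2D convolution (technically cross-correlation, as in most
    deep learning frameworks): slide the kernel over the image with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2: halves each spatial dimension while
    keeping the strongest response in every window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]
```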

Specialized architectures: Boundary-Aware FCN transforms single-network segmentation into a multitask problem with two separate upsampling paths: one for the tumor territory and one for the tumor border. The two results are fused through additional convolutional layers. Cascaded networks feed the output of one CNN as input to the next, with hierarchical segmentation first identifying the organ ("rough segmentation") and then performing precise tumor segmentation. The survey also covers Alex-Net (eight layers, five convolutional plus three fully connected) and feature pyramid networks (FPN) for multi-scale object recognition.

TL;DR: U-Net dominates kidney tumor segmentation with its encoder-decoder design and skip connections. V-Net adds volumetric processing and Dice loss. Boundary-Aware FCN splits the task into tumor region and tumor border segmentation. Most models use 3D convolutions, Adam optimizer, ReLU activation, and batch normalization.
Pages 14-16
KiTS Challenges and How Segmentation Performance Is Measured

Benchmark datasets: The KiTS19 and KiTS21 challenges are the dominant benchmarks in kidney tumor segmentation research. KiTS19 provides 300 total cases split across various studies, with typical training/validation/test splits of 210/60/30 or 240/30/30 depending on the research group. KiTS21 similarly uses 300 cases. Some researchers also used private or alternative datasets with 113-140 total cases. The majority of papers surveyed relied on these public challenge datasets, which include contrast-enhanced CT scans with expert annotations for kidneys, tumors, and in KiTS21, cysts as well.

Primary metrics: Dice Similarity Coefficient (DSC) is the most widely used evaluation metric. DSC measures the overlap between the predicted segmentation and the ground truth, ranging from 0 (no overlap) to 1 (perfect match), and is calculated as 2TP / (2TP + FP + FN). The Jaccard Index (intersection over union) measures the overlap between the segmented region and the hand-drawn delineation. Hausdorff Distance (HD) measures the maximum boundary error between predicted and ground-truth surfaces, with values from 5.10 to 33.47 reported across studies. Additional metrics include sensitivity (true positive rate), specificity (true negative rate), accuracy, precision, surface distance, volume overlap, and relative volume difference.
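The DSC formula quoted above, along with the closely related Jaccard Index, can be computed directly from the true-positive, false-positive, and false-negative counts of two binary masks. A minimal sketch on flattened 0/1 lists:

```python
def confusion_counts(pred, truth):
    """TP, FP, FN counts between two flattened binary masks (lists of 0/1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def dice(pred, truth):
    """Dice Similarity Coefficient: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    return 2 * tp / (2 * tp + fp + fn)


def jaccard(pred, truth):
    """Jaccard Index (intersection over union): TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp + fn)
```

Note that Dice is always at least as large as Jaccard for the same masks, which is one reason reported Dice scores tend to look more favorable.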

Semantic segmentation metrics: For semantic segmentation specifically, the three most commonly used measures are pixel accuracy (proportion of correctly classified pixels), mean intersection over union (mIoU, averaging the IoU across all classes), and mean per-class accuracy. These metrics can produce divergent results because there is no universal definition of what constitutes "successful" segmentation, particularly when class sizes are imbalanced, as in kidney scans where the tumor is a small fraction of the total image volume.
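The two most common semantic segmentation measures named above, pixel accuracy and mIoU, are easy to state precisely in code. A minimal sketch on flattened label lists (class labels per pixel), with illustrative function names:

```python
def pixel_accuracy(pred, truth):
    """Proportion of pixels whose predicted class matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def mean_iou(pred, truth, classes):
    """Mean intersection over union: average per-class IoU over all classes
    that appear in either the prediction or the ground truth."""
    ious = []
    for c in classes:
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

The divergence mentioned in the text is visible even on toy inputs: a prediction can score high pixel accuracy while a small class (such as a tumor) drags the mIoU down.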

TL;DR: KiTS19 (300 cases) and KiTS21 (300 cases) are the primary benchmarks. Dice Similarity Coefficient is the standard metric. Hausdorff Distance ranges from 5.10 to 33.47 across methods. Class imbalance (tiny tumors in large CT volumes) makes evaluation challenging.
Pages 20-23
How the Models Actually Performed: Kidney vs. Tumor Dice Scores

KiTS19 results: On the KiTS19 benchmark, kidney Dice scores ranged from 0.852 to 0.980 across 21 methods, with the top performers including Santini et al. (0.980 using a multi-stage U-Net with ResNet), the V-Net model by Mu et al. (0.977), and Isensee et al.'s nnU-Net (0.974). Tumor Dice scores showed much wider variation, ranging from 0.32 to 0.868. The best tumor Dice was achieved by the hybrid SE model of Abdul Qayyum et al. (0.868), followed by the V-Net approach (0.865) and Isensee et al. (0.851). Composite Dice scores (combining kidney and tumor) ranged from 0.887 to 0.912.

KiTS21 results: Performance on the more challenging KiTS21 dataset (which added cyst segmentation) was generally lower. Kidney Dice ranged from 0.916 to 0.975, with Yasmeen et al. achieving 0.975. Tumor Dice ranged from 0.39 to 0.881, with Yasmeen et al. again leading at 0.881. The gap between kidney and tumor Dice scores was even more pronounced, reflecting the difficulty of segmenting smaller and more irregularly shaped tumors. The mean surface Dice, a stricter boundary-accuracy metric, was particularly low at 0.4658 in the Xiao et al. two-stage approach.

Other metrics reported: For sensitivity and specificity, Zhao et al. (MSS U-Net) reported 0.913 sensitivity and 0.914 specificity. The hybrid SE model achieved 0.862 sensitivity and 0.894 specificity. Cruz et al. reported the highest specificity at 0.998 but with lower sensitivity at 0.842. Jaccard Index values ranged from 0.716 to 0.756 across reported studies. Accuracy ranged from 0.957 to 0.997. The V-Net and U-Net family architectures consistently emerged as the strongest overall performers.

The survey highlights a consistent pattern: kidney segmentation has largely reached clinical-grade accuracy (Dice above 0.95), while tumor segmentation remains substantially harder and more variable. The border between tumor and kidney parenchyma is ambiguous on CT imaging, and smaller tumors are particularly difficult to detect and delineate accurately.

TL;DR: Kidney Dice scores reached 0.980 on KiTS19, but tumor Dice ranged widely from 0.32 to 0.868. V-Net and U-Net variants were the top performers. On KiTS21, the best tumor Dice was 0.881. Kidney segmentation is near clinical-grade, but tumor segmentation remains a significant challenge.
Pages 17-18
Preprocessing, Post-Processing, and Data Augmentation Pipelines

Preprocessing: Nearly all methods apply preprocessing before feeding data to the deep learning network. Common techniques include bias field correction (particularly the N4ITK method, an improved version of the N3 nonparametric nonuniform intensity normalization approach), intensity normalization to achieve a consistent distribution across patients and acquisitions, and zero-mean unit-variance normalization to de-bias features. Cropping to remove background pixels is also widely used. The authors emphasize that feeding raw images directly into a deep neural network without preprocessing considerably reduces performance, and in some cases proper preprocessing is critical for the model to function at all.
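The zero-mean unit-variance step mentioned above is a one-liner in spirit. A minimal sketch on a flat list of voxel intensities; the function name and the constant-region fallback are illustrative choices:

```python
def zscore_normalize(voxels):
    """Zero-mean, unit-variance intensity normalization of a flat voxel list.

    Subtracting the mean and dividing by the standard deviation gives every
    scan a comparable intensity distribution regardless of scanner settings.
    """
    n = len(voxels)
    mean = sum(voxels) / n
    std = (sum((v - mean) ** 2 for v in voxels) / n) ** 0.5
    std = std if std > 0 else 1.0  # constant region: only center, don't divide by 0
    return [(v - mean) / std for v in voxels]
```

In practice the statistics are often computed only over the body region (after cropping background air), since including background voxels skews the mean and variance.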

Post-processing: The output of deep neural networks is not always directly usable for clinical decision-making. Conditional random fields (CRF) and Markov random fields (MRF) are applied to combine model predictions with low-level image information like local pixel interactions and edges, effectively eliminating false positives. However, these approaches are computationally intensive. Connected component analysis identifies and removes unnecessary blobs using thresholding. Morphological operations such as erosion and dilation are applied around segmentation boundaries to reduce false positives.
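Connected component analysis with a size threshold, as described above, can be sketched in plain Python. This is an illustrative 2D implementation (real pipelines operate on 3D volumes, often via library routines such as those in scipy.ndimage); the function name and 4-connectivity choice are assumptions for the sketch:

```python
from collections import deque


def remove_small_blobs(mask, min_size):
    """Keep only 4-connected foreground components with >= min_size pixels.

    mask: 2D binary mask (list of lists of 0/1). Small isolated blobs are
    treated as false positives and removed, mimicking the post-processing
    step described in the text.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                # Breadth-first flood fill to collect one component.
                blob, queue = [], deque([(i, j)])
                seen[i][j] = True
                while queue:
                    y, x = queue.popleft()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(blob) >= min_size:  # threshold on component size
                    for y, x in blob:
                        out[y][x] = 1
    return out
```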

Data augmentation: Given the scarcity of annotated medical imaging data, augmentation is essential. Common techniques include flip, rotation, shift, shear, zoom, brightness adjustment, and elastic distortion. The authors note that data augmentation provides benefits equivalent to model architecture updates when dealing with limited training data, while being far simpler to implement. The problem of limited access to large kidney tumor datasets makes augmentation particularly critical for this application domain.
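Several of the geometric augmentations listed above are simple index manipulations. A minimal sketch on 2D lists-of-lists (rotation here is the lossless 90-degree case; arbitrary-angle rotation and elastic distortion need interpolation and are left out):

```python
def hflip(img):
    """Horizontal flip: mirror each row of a 2D image."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate a 2D image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def shift_right(img, k, fill=0):
    """Shift an image k pixels to the right, padding the left edge with fill."""
    return [[fill] * k + row[:len(row) - k] for row in img]
```

When augmenting segmentation data, the identical transform must be applied to the image and its label mask so that the pixel-level annotation stays aligned.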

TL;DR: N4ITK bias correction, intensity normalization, and cropping are standard preprocessing steps. CRF, MRF, connected component analysis, and morphological operations are used for post-processing. Data augmentation (flip, rotate, elastic distortion) is essential given the scarcity of annotated kidney CT data.
Pages 24-25
What Holds Kidney Tumor Segmentation Back

Data scarcity and annotation burden: The absence of large-scale medical training datasets is identified as a primary reason for poor segmentation performance. Annotating even a single CT volume requires substantial time from a well-trained radiologist, and the work is prone to both intra- and inter-rater variability. This means the "ground truth" itself is imperfect. Deep learning models require large quantities of training data to avoid overfitting, which Isensee's group described as a network learning a function with very high variance in order to memorize training data rather than generalize.

Computational constraints: DL techniques demand massive computation, and GPU hardware is essential. Most researchers are constrained by GPU memory (typically 12 gigabytes), which limits batch sizes and model complexity. The use of 3D deep learning models significantly increases computational and memory requirements compared to 2D approaches. Training multiple networks in cascaded architectures further compounds the resource demands.

Generalizability concerns: Test images should ideally come from the same platform as training images, which limits real-world deployment. The KiTS challenge datasets were obtained from individuals sharing the same geographic region and healthcare system, raising questions about whether these algorithms would perform well across different institutions, scanner manufacturers, and patient populations. No successful study at the time of this survey had unified segmentation algorithms and transfer learning across different tumor types.

Architectural limitations: While the surveyed algorithms have improved year over year, their robustness continues to fall behind expert radiologist performance. End-to-end models reduce error accumulation from multi-stage processing and simplify the pipeline, but a single model with high integration reduces flexibility, operability, and interoperability. Single models may also require more training data to achieve comparable outcomes, creating a tension between pipeline simplicity and practical performance.

TL;DR: Key limitations include scarce annotated data, imperfect ground truth labels, GPU memory constraints (typically 12 GB), single-institution training data, and algorithms that still lag behind expert radiologist performance. Generalizability across different scanners and populations remains unproven.
Pages 25-26
Where Kidney Tumor Segmentation Research Is Headed

Multi-modal imaging: The authors recommend incorporating additional imaging modalities such as magnetic resonance imaging (MRI) and contrast-enhanced ultrasound (CEUS) alongside CT to improve diagnostic accuracy. Currently, the field is almost entirely CT-focused, but combining modalities could provide complementary information that helps resolve ambiguous tumor boundaries and improve classification of tumor subtypes.

Architectural simplification and training efficiency: Future research should avoid overly complicated architectures and instead focus on reducing training time for deep learning models. Ensemble approaches and U-Net-based models show significant potential for improving the state of the art when combined with proper preprocessing, weight initialization, and sophisticated training schemes. The survey notes that well-tuned U-Nets consistently perform competitively against more complex novel architectures.

Multi-institutional validation: Extending these systems beyond the sampled populations from KiTS challenges to multi-institutional cohorts with prospectively generated test sets is essential. The authors call for testing on data from diverse geographic regions and healthcare systems to establish true clinical generalizability. Properly analyzing 3D slice data and compressing models as the number of network parameters grows are also identified as key technical challenges.

Transfer learning and data strategies: Although individual tumor features vary, there are similarities across tumor types that could be exploited through cross-tumor transfer learning. Unsupervised learning methods and improved data augmentation strategies, potentially combining data warping and oversampling techniques through search algorithms, represent promising research directions. The layered architecture of deep neural networks provides multiple opportunities for enhancement through learned data representations at different levels of abstraction.

TL;DR: Key future directions include multi-modal imaging (adding MRI and CEUS to CT), simpler and faster architectures, multi-institutional validation, cross-tumor transfer learning, and advanced data augmentation. Well-tuned U-Nets remain hard to beat and should serve as strong baselines.
Citation: Abdelrahman A, Viriri S. Kidney Tumor Semantic Segmentation Using Deep Learning: A Survey of State-of-the-Art. Journal of Imaging, 2022. Open access; available at PMC8954467. DOI: 10.3390/jimaging8030055. License: CC BY.