Bladder Segmentation in CT Using Deep Learning

PLoS ONE, 2016

Plain-English Explanations
Pages 1-3
Why Automated Bladder Segmentation Matters for Cancer Detection

Clinical motivation: Bladder cancer is the fourth most common cancer diagnosed in men. In 2015, the American Cancer Society estimated approximately 74,000 new cases (56,320 in men and 17,680 in women) and 16,000 deaths in the United States alone. Multidetector row CT urography (CTU) is the imaging modality of choice for evaluating urinary tract abnormalities. Each CTU scan generates on average 300 axial slices (range: 200 to 600), and the radiologist must visually inspect the entire urinary tract for lesions. The resulting workload creates substantial variability, with reported detection sensitivities ranging from 59% to 92% across radiologists.

Role of segmentation in CAD: The authors at the University of Michigan are developing a computer-aided detection (CAD) system for bladder cancer in CTU. Accurate bladder segmentation is a critical prerequisite for such a system. The segmented bladder defines the search region for subsequent lesion detection steps. Any lesion excluded from the segmented region will be missed entirely, while non-bladder structures included in the segmented region increase false positives. Previous bladder segmentation efforts on MRI and cone-beam CT used small datasets (6 to 22 patients) and could not generalize well to CTU imaging.

Key challenges in CTU segmentation: Bladders may be partially or fully opacified by intravenous (IV) contrast material, creating stark internal boundaries between contrast-enhanced and non-contrast regions. The boundary between the bladder wall and surrounding soft tissue has very low contrast, making delineation difficult. Bladders also appear in a wide variety of shapes and sizes across patients. These factors make CTU bladder segmentation significantly more challenging than segmentation in other imaging modalities.

Prior work by this group: Hadjiiski et al. previously developed the conjoint level set analysis and segmentation system (CLASS), which segments contrast-enhanced and non-contrast regions of the bladder separately using two manually placed bounding boxes, then joins them. Cha et al. further improved CLASS with model-guided refinement and energy-driven wavefront propagation. However, CLASS required two user inputs and struggled with the strong boundary between contrast and non-contrast regions. This study proposes a new approach using deep-learning CNNs to overcome these limitations.

TL;DR: Bladder cancer CAD systems need accurate segmentation as a prerequisite for lesion detection. CTU generates 300+ slices per scan, and radiologist sensitivity varies from 59% to 92%. This study introduces a DL-CNN approach to replace the previous two-box CLASS method with a single-input system that handles contrast boundaries more effectively.
Pages 3-5
Patient Cohort and Data Acquisition

Study population: The study used 173 patients who underwent CTU followed by cystoscopy and biopsy, collected retrospectively from the Department of Radiology at the University of Michigan with IRB approval. These were split into a training set of 81 cases and a test set of 92 cases, balanced by case difficulty. The training set contained 42 focal mass-like lesions (40 malignant, 2 benign), 21 wall thickenings (16 malignant, 5 benign), and 18 normal bladders. The test set contained 43 focal lesions (42 malignant, 1 benign), 36 wall thickenings (23 malignant, 13 benign), and 13 normal bladders.

Imaging protocol: All CTU scans were acquired with GE Healthcare LightSpeed MDCT scanners. Excretory phase images were obtained 12 minutes after the first bolus of a split-bolus IV contrast injection and 2 minutes after the second bolus of 175 ml of nonionic contrast material (300 mg iodine/ml). Images were reconstructed at slice intervals of 1.25 or 0.625 mm using 120 kVp and 120 to 280 mA. Most bladders were partially filled with IV contrast (61 of 81 training, 85 of 92 test), though some were completely filled or had no visible contrast.

Reference standards: An experienced radiologist provided 3D hand-segmented contours for all 173 cases (RS1), outlining the bladder on every 2D CT slice where it was visible, producing a total of 16,197 slices. A second reader provided independent outlines for a subset of cases with lesions (41 training, 50 test cases, totaling 8,420 slices) as a second reference standard (RS2). This dual-reader design allowed the authors to assess interobserver variability and evaluate the computer segmentation performance relative to human disagreement.

TL;DR: The study used 173 CTU patients (81 training, 92 test) with a mix of malignant lesions, wall thickenings, and normal bladders. Two independent radiologist reference standards covering 16,197 slices allowed assessment of both segmentation accuracy and interobserver variability.
Pages 4-7
Deep-Learning CNN for Bladder Likelihood Map Generation

Network architecture: The authors adapted the DL-CNN architecture developed by Krizhevsky et al. (the AlexNet-era network) for classifying 2D ROIs as inside or outside the bladder. The network consists of five main layers: two convolution layers, two locally connected layers, and one fully connected layer. The first convolution layer filters input images with 64 kernels of size 5x5. The second convolution layer applies an additional 64 kernels of 5x5. The locally connected layers use 64 and 32 kernels of 3x3, respectively. The fully connected layer outputs two values fed into a Softmax layer, producing a likelihood score between 0 and 1. Rectified linear units (ReLU) serve as the activation function, and local response normalization aids generalization.

ROI extraction and training: For each axial slice in the training set, 32x32-pixel ROIs inside and outside the bladder were extracted using the radiologist's hand-outlines. An ROI was labeled "inside" if over 90% of its area fell within the outlined bladder, and "outside" if less than 5% was within the bladder. ROIs falling between these thresholds were excluded. This process generated approximately 160,000 balanced ROIs from the training cases. The network was trained for 1,500 iterations, but the model at 1,000 iterations was selected because it showed similar classification error rates while producing comparable or better likelihood maps. Training took approximately 5.5 hours on a Tesla C2075 GPU. The training classification error rate was 0.054.
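The labeling rule above is simple enough to state directly in code. A minimal sketch, using the paper's stated thresholds (over 90% inside, under 5% outside, everything in between excluded); the function name `label_roi` is an illustrative choice, not from the paper:

```python
def label_roi(inside_fraction):
    """Label a candidate 32x32 ROI by the fraction of its area that falls
    within the radiologist's hand-outlined bladder.
    >90% -> "inside", <5% -> "outside", otherwise excluded from training."""
    if inside_fraction > 0.90:
        return "inside"
    if inside_fraction < 0.05:
        return "outside"
    return None  # ambiguous ROI, dropped from the training set

# Example fractions for three candidate ROIs
labels = [label_roi(f) for f in (0.97, 0.02, 0.40)]
```

Excluding the ambiguous middle band keeps the two training classes cleanly separated, which matters when roughly 160,000 ROIs are harvested automatically from hand outlines.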

Likelihood map generation: For each axial CTU slice containing the bladder, a volume of interest (VOI) approximately enclosing the bladder is marked by a single user input. The trained DL-CNN is applied to every voxel within this VOI: a 32x32-pixel ROI centered at each voxel is extracted and input to the network, which outputs the likelihood of that ROI being inside the bladder. This voxelwise collection of scores forms the bladder likelihood map. The map is then thresholded at a score of 0.85 (determined by histogram analysis of training data) to generate a binary bladder mask.
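The voxelwise scoring and thresholding step can be sketched as a sliding-window pass over the VOI. Here `score_fn` stands in for the trained DL-CNN (a hypothetical callable, not the paper's implementation), and border pixels where a full patch does not fit are scored 0:

```python
def likelihood_map(image, score_fn, roi=32):
    """Center a roi x roi patch at each pixel of the VOI and record the
    classifier's "inside bladder" score; the collection of scores is the
    bladder likelihood map."""
    h, w = len(image), len(image[0])
    half = roi // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = [row[x - half:x + half] for row in image[y - half:y + half]]
            out[y][x] = score_fn(patch)
    return out

def binary_mask(lmap, thresh=0.85):
    """Threshold the likelihood map at 0.85 (the paper's DL-CNN cutoff,
    chosen by histogram analysis of training data) to obtain the binary
    bladder mask."""
    return [[1 if v >= thresh else 0 for v in row] for row in lmap]
```

In practice this is done in 3D across every slice of the VOI; the 2D version above is just the per-slice view of the same operation.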

ROI size comparison: Three ROI sizes were evaluated: 16x16, 32x32, and 64x64 pixels. The 16x16 ROIs produced finer detail but hindered initial contour generation. The 64x64 ROIs lost important structural details and tended to misclassify large lesions as outside the bladder. The 32x32-pixel ROI provided the best balance, yielding the highest segmentation accuracy across all metrics.

TL;DR: A five-layer DL-CNN (two convolution, two locally connected, one fully connected) was trained on 160,000 ROIs of 32x32 pixels to classify regions as inside or outside the bladder. The network achieved a training error rate of 0.054 and generated voxelwise likelihood maps thresholded at 0.85 to initialize segmentation.
Pages 7-8
From Likelihood Map to Final Bladder Contour via Level Sets

Four-stage pipeline: The segmentation system proceeds through four stages: (1) preprocessing, (2) initial segmentation from the DL-CNN mask, (3) 3D level set refinement, and (4) 2D level set refinement. In preprocessing, smoothing, anisotropic diffusion, gradient filters, and rank transform of gradient magnitude are applied to the VOI in 3D. These produce gradient magnitude and gradient vector images used during level set propagation.

Initial contour generation: The binary bladder mask (DLMask) is generated by thresholding the likelihood map at 0.85. An ellipsoid with axes 1.5 times the width and height of the VOI is placed at the centroid of the mask. The intersection of the mask and ellipsoid defines the object region, preventing leakage into structures above the bladder (such as the pelvic bone). Morphological dilation (spherical structuring element, 2-voxel radius), 3D flood fill, and morphological erosion (same element) connect neighboring components and extract the initial segmentation surface.

Cascading 3D level sets: Three 3D level sets with predefined parameters are applied sequentially to the initial surface. The level set equation includes advection, propagation, and curvature terms with coefficients alpha, beta, and gamma. The first level set slightly expands and smooths the initial contour (10 iterations). The second level set, running for 150 iterations, drives the contour toward sharp edges while expanding slightly in low-gradient regions. The third level set (10 iterations) further refines edge adherence. A final 2D level set (100 iterations) is applied slice-by-slice to refine the 3D contours.
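The evolution described above can be written in the standard level set form. The exact sign conventions and speed functions are assumptions here (this is the common geodesic-active-contour formulation, not copied from the paper); alpha, beta, and gamma are the coefficients named in the text:

```latex
\frac{\partial \phi}{\partial t}
  \;=\; -\,\alpha\,\mathbf{A}(\mathbf{x})\cdot\nabla\phi
  \;-\; \beta\,P(\mathbf{x})\,\lvert\nabla\phi\rvert
  \;+\; \gamma\,\kappa\,\lvert\nabla\phi\rvert
```

Here \(\mathbf{A}\) is the advection field derived from the gradient-vector images, \(P\) is the propagation (expansion) speed from the gradient-magnitude images, and \(\kappa\) is the contour curvature. The three cascading 3D level sets reuse this equation with different predefined coefficient settings, shifting emphasis from smoothing, to edge-seeking expansion, to final edge adherence.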

Significance of the approach: A major contribution of this pipeline is that the DL-CNN likelihood map can overcome the strong boundary between contrast-enhanced and non-contrast bladder regions, which has been a persistent problem for gradient-based segmentation methods. By providing a seamless mask that treats both regions as "inside the bladder," the likelihood map allows the level sets to propagate across the entire bladder without being blocked by the internal contrast boundary. This eliminates the need for separate bounding boxes for each region.

TL;DR: The DL-CNN likelihood map is thresholded at 0.85, intersected with an ellipsoid to prevent leakage, then refined by three cascading 3D level sets and a final 2D level set. This pipeline overcomes the contrast boundary problem that blocked previous gradient-based methods, requiring only one user input instead of two.
Pages 8-9
Haar Features with Random Forest as a Comparison Method

Feature extraction: To benchmark the DL-CNN, the authors also generated bladder likelihood maps using 59 Haar features extracted from the same 32x32-pixel ROIs. Haar features capture edge, line, and checkerboard-pattern information at multiple scales (8x8 through 32x32 pixels). The feature set was selected based on representative bladder boundary shapes after experimentation on training cases, rather than using all possible Haar features, which would be computationally prohibitive.
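Haar features of this kind are typically computed from an integral image (summed-area table), which makes any rectangle sum a four-lookup operation. A minimal sketch under that assumption; the specific two-rectangle vertical-edge feature below is illustrative, not one of the paper's 59 selected features:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img over rows < y, cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the height x width rectangle with upper-left corner (top, left)."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

def haar_edge_vertical(ii, top, left, height, width):
    """Two-rectangle edge feature: left half minus right half. Edge, line,
    and checkerboard patterns are built the same way from 2-4 rectangles
    at scales from 8x8 up to 32x32 pixels."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```

Because each feature costs only a handful of lookups regardless of scale, evaluating 59 features per 32x32 ROI stays cheap even across a full VOI.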

Random forest classifier: A random forest classifier with 100 trees was trained on the same 160,000 training ROIs used for the DL-CNN. For each ROI, the 59 Haar feature values were input to the classifier, which output a likelihood score for the ROI being inside the bladder. The collection of voxelwise scores formed the Haar-feature-based likelihood map. A different threshold of 0.56 (versus 0.85 for DL-CNN) was experimentally determined for binary mask generation, reflecting the different score distributions between the two methods.

Identical downstream processing: After the binary mask was generated from the Haar-feature-based likelihood map, the remainder of the segmentation pipeline (ellipsoid intersection, morphological operations, cascading level sets) was identical to the DL-CNN approach. This controlled experimental design isolates the effect of the likelihood map quality on final segmentation accuracy, allowing a fair comparison between the DL-CNN and Haar feature representations.

TL;DR: A random forest classifier with 100 trees trained on 59 Haar features from the same 160,000 ROIs served as the baseline. The Haar-based map used a threshold of 0.56 versus 0.85 for the DL-CNN. Identical level set processing downstream ensured a fair comparison of the two feature representations.
Pages 9-12
Segmentation Performance Across All Methods

DL-CNN with level sets (test set): The primary method achieved a volume intersection ratio of 81.9% +/- 12.1%, percent volume error of 10.2% +/- 16.2%, absolute volume error of 14.0% +/- 13.0%, average minimum distance of 3.6 +/- 2.0 mm, and Jaccard index of 76.2% +/- 11.8%. On the training set, all metrics were stronger: volume intersection ratio 87.2%, average distance 3.0 mm, and Jaccard index 81.9%. Of the 92 test cases, 66.3% achieved a volume intersection ratio above 80%, and 79.3% had absolute volume error below 20%.
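The evaluation metrics quoted throughout this section reduce to set operations on voxel coordinates. A minimal sketch of the common definitions (the paper's exact normalization conventions are assumed here, e.g. which volume the intersection ratio is normalized by):

```python
def jaccard(seg, ref):
    """Jaccard index |seg & ref| / |seg | ref|; seg and ref are sets of
    voxel coordinates."""
    return len(seg & ref) / len(seg | ref)

def volume_intersection_ratio(seg, ref):
    """Fraction of the reference-standard volume covered by the computer
    segmentation: |seg & ref| / |ref| (assumed normalization)."""
    return len(seg & ref) / len(ref)

def pct_volume_error(seg, ref):
    """Signed volume error (|ref| - |seg|) / |ref| * 100; the sign
    convention here is an assumption."""
    return 100.0 * (len(ref) - len(seg)) / len(ref)
```

Note why both intersection ratio and volume error are reported: a segmentation can cover most of the reference volume (high intersection ratio) while still leaking badly into surrounding tissue, which the Jaccard index and volume error penalize.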

Haar features with level sets (test set): The Haar-feature-based approach produced a volume intersection ratio of 74.3% +/- 12.7%, absolute volume error of 20.5% +/- 15.7%, average distance of 5.7 +/- 2.6 mm, and Jaccard index of 66.7% +/- 12.6%. All metrics except volume error showed statistically significant differences compared to the DL-CNN approach on the test set. Without level set refinement, the DL-CNN initial contour alone (Jaccard 66.2%) still outperformed the Haar initial contour (Jaccard 55.6%), confirming the superior quality of the DL-CNN likelihood maps.

CLASS with LCR (test set): The previous method achieved a volume intersection ratio of 78.0% +/- 14.7%, absolute volume error of 18.2% +/- 15.0%, average distance of 3.8 +/- 2.3 mm, and Jaccard index of 73.9% +/- 13.5%. DL-CNN with level sets outperformed CLASS with LCR on all five metrics for both training and test sets. For the training set, differences in volume intersection ratio, absolute volume error, average distance, and Jaccard index were statistically significant (p-values of 0.01, 0.007, 0.01, and 0.002 respectively, by two-tailed paired t-test at alpha = 0.01 with Bonferroni correction). For the test set, differences in volume intersection ratio (p = 0.004), volume error (p = 0.001), and absolute volume error (p = 0.005) were significant.

Lesion inclusion: Critically for CAD purposes, DL-CNN with level sets better enclosed lesions within the segmented region. In the training set, 50 out of 59 lesions (84.7%) were enclosed better or comparably versus CLASS with LCR. In the test set, 64 out of 78 lesions (82.1%) showed improvement or equivalent performance. The DL-CNN approach also better handled non-contrast regions without leaking into adjacent organs, which is essential for avoiding false positives in downstream detection.

Interobserver comparison: When evaluated against both reference standards (RS1, RS2) on the lesion subset, the DL-CNN showed consistent performance: Jaccard index of 76.4% versus RS1 and 75.1% versus RS2 on the test set. The interobserver agreement between the two human readers was higher (Jaccard 86.1% on test set), placing the computer segmentation approximately 10 percentage points below human-to-human agreement, which the authors consider acceptable for defining the CAD search region.

TL;DR: DL-CNN with level sets achieved a Jaccard index of 76.2% and volume intersection ratio of 81.9% on 92 test cases, outperforming both the Haar-feature baseline (Jaccard 66.7%) and the previous CLASS with LCR method (Jaccard 73.9%). The improvements were statistically significant, and 82.1% of lesions were better enclosed compared to CLASS.
Pages 12-13
Effect of Pooling Strategy, ROI Size, and Network Parameters

Maximum vs. average pooling: Replacing maximum pooling with average pooling in the DL-CNN while keeping all other parameters constant degraded segmentation performance. With average pooling on the test set, the volume intersection ratio was 81.0% (versus 81.9% for max pooling), average distance was 4.5 mm (versus 3.6 mm), and Jaccard index was 72.1% (versus 76.2%). The differences were statistically significant for volume error, average distance, and Jaccard index, confirming that maximum pooling provides better feature preservation for this task.
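The two pooling strategies being compared differ only in how each window is reduced. A minimal sketch with non-overlapping 2x2 windows (the paper's actual pooling window size and stride are not restated here, so this configuration is an assumption):

```python
def pool2x2(fmap, mode="max"):
    """Downsample a feature map with non-overlapping 2x2 windows, keeping
    either the maximum response (preserves the strongest feature
    activation) or the window average (blurs it)."""
    out = []
    for y in range(0, len(fmap) - 1, 2):
        row = []
        for x in range(0, len(fmap[0]) - 1, 2):
            window = [fmap[y][x], fmap[y][x + 1],
                      fmap[y + 1][x], fmap[y + 1][x + 1]]
            row.append(max(window) if mode == "max" else sum(window) / 4.0)
        out.append(row)
    return out
```

The intuition behind the result: max pooling keeps a sharp bladder-edge response alive through the network even when it occupies only part of the window, whereas averaging dilutes it, which is consistent with the degraded boundary adherence reported for average pooling.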

ROI size sensitivity: The 16x16-pixel ROI produced finer boundary details in the likelihood map but yielded a volume intersection ratio of only 79.2% and Jaccard index of 72.6% on the test set, because the fine details interfered with initial contour generation for the whole bladder. The 64x64-pixel ROI performed worst (volume intersection ratio 67.1%, Jaccard 62.8%) due to excessive smoothing, loss of structural detail, and misclassification of large lesions as external to the bladder. The 32x32-pixel ROI clearly offered the optimal tradeoff between spatial detail and global context.

Network kernel sensitivity: The number of convolution kernels in the first two layers was varied between 32, 64, and 96. The resulting changes across metrics were modest: volume intersection ratio varied by 0.5% to 1.9%, absolute volume error by 0.2% to 9.3%, average distance by 0.6% to 10.1%, and Jaccard index by 0.1% to 2.2%. These results demonstrate that the segmentation system is robust within a reasonable range of architectural parameters and does not require extensive hyperparameter tuning to achieve stable performance.

TL;DR: Maximum pooling outperformed average pooling with statistically significant differences. The 32x32-pixel ROI was optimal, beating both 16x16 (Jaccard 72.6%) and 64x64 (Jaccard 62.8%) alternatives. Varying convolution kernel counts between 32 and 96 produced only 0.1% to 2.2% variation in Jaccard index, confirming robustness.
Pages 13-15
Current Constraints and Next Steps for the Segmentation System

Failure cases: Several test cases performed well below average. Some failures resulted from poor image quality caused by large patient body habitus or the presence of hip prostheses, which degrade CT image quality. Other significant errors occurred in patients with advanced bladder cancer spreading into neighboring organs, causing the segmentation to leak into those areas. Additionally, the DL-CNN sometimes assigned relatively high likelihood scores to non-bladder structures such as the femoral heads, contributing to oversegmentation, or assigned low scores to portions of the bladder itself, leading to undersegmentation.

Runtime considerations: The DL-CNN training required approximately 5.5 hours for 160,000 ROIs and 1,500 iterations on a Tesla C2075 GPU (an older model chosen for compatibility). Once trained, generating the bladder likelihood map for a single case took approximately 4 minutes. The VOI marking and level set segmentation required an additional 2 to 5 minutes depending on bladder size, bringing total runtime to approximately 10 minutes per case. This is comparable to the 4 to 10 minutes required by CLASS with LCR. The authors note that the processes had not been optimized, and faster hardware would reduce these times.

Comparison to external methods: Direct comparison with other published methods is difficult due to differences in datasets and difficulty levels. The only roughly comparable quantitative result comes from Chai et al., who achieved Jaccard indices of 70.5% (automatic) and 77.7% (semiautomatic) using 8 patients for training and 22 for testing on cone-beam CT. The DL-CNN method surpassed the automatic result and achieved comparable performance to the semiautomatic method, while using a substantially larger and more diverse test set of 92 patients.

Future directions: The authors identify several priorities for future work. Improving the DL-CNN to reduce errors from hip prostheses, large patient size, and advanced cancer invasion is a key goal. Optimizing the segmentation process and hardware to reduce runtime will be important for clinical deployment. Most critically, the authors emphasize the need to ensure that bladder lesions are consistently included within the segmented boundaries, as this directly impacts downstream CAD detection sensitivity. This study represents a step toward a reliable bladder segmentation component for a complete CAD system targeting urothelial lesion detection in CT urography.

Broader significance: This 2016 study was among the early applications of deep learning to organ segmentation in CT urography. The core insight that a DL-CNN can generate a seamless likelihood map bridging the contrast boundary, a longstanding challenge for gradient-based methods, has influenced subsequent work in bladder and pelvic organ segmentation. The combination of learned feature representations with classical level set refinement demonstrated a practical hybrid approach that leverages the strengths of both paradigms.

TL;DR: Failure cases arose from poor image quality (large patients, hip prostheses) and advanced cancer invading adjacent organs. Runtime was approximately 10 minutes per case, comparable to the previous method. The DL-CNN surpassed external automatic methods (Jaccard 76.2% vs. 70.5%) and matched semiautomatic approaches, while future work targets improved lesion inclusion and hardware optimization.
Citation: Cha KH, Hadjiiski L, Samala RK, Chan HP, Caoili EM, Cohan RH. 2016. Open access; available at PMC4808067. DOI: 10.1118/1.4944498.