Bladder Cancer Segmentation in CT with Deep Learning

Tomography, 2016

Plain-English Explanations
Pages 1-2
Why Bladder Cancer Treatment Response Needs Better Measurement Tools

Clinical motivation: Bladder cancer is the fourth most common cancer in men, with an estimated 76,960 new cases and 16,390 deaths in the USA in 2016. The standard treatment for invasive disease is radical cystectomy (surgical removal of the bladder), but roughly 50% of patients who undergo this procedure develop metastatic disease within two years. Neoadjuvant chemotherapy with methotrexate, vinblastine, doxorubicin, and cisplatin (MVAC) can improve outcomes by treating micrometastases before surgery, but the side effects are severe, including neutropenic fever, sepsis, mucositis, nausea, and alopecia. Because no reliable method currently exists to predict whether a patient will respond to chemotherapy, early and accurate assessment of treatment response is critical so that clinicians can stop ineffective treatment and pursue alternatives.

Limitations of current response criteria: The clinical standard for measuring tumor response relies on the World Health Organization (WHO) criteria, which use the product of the longest diameter and its perpendicular, and the Response Evaluation Criteria In Solid Tumors (RECIST), which use only the longest diameter. Both are fundamentally 1D or 2D measurements applied to a 3D problem. These approaches can be inaccurate, especially for irregularly shaped tumors, and suffer from significant inter-observer and intra-observer variability. The full 3D volumetric information available in CT scans is not used, even though gross tumor volume (GTV) has been shown to predict outcomes more effectively.

Study overview: This pilot study by Cha et al. from the University of Michigan applied a deep-learning convolution neural network (DL-CNN) to automatically segment bladder tumors in CT images for treatment response assessment. The study used a dataset of 62 cases with 64 tumors forming 74 temporal pairs (pre- and post-treatment scans). The DL-CNN approach was compared against the authors' previous method called auto-initialized cascaded level sets (AI-CALS), radiologists' manual contours, and the WHO and RECIST criteria for predicting complete response (pathological stage pT0) to chemotherapy.

TL;DR: This 2016 pilot study applied a deep-learning CNN to segment bladder tumors in CT scans of 62 patients, aiming to improve treatment response assessment over the 1D/2D WHO and RECIST criteria by using full 3D volumetric information. About 50% of cystectomy patients develop metastases within two years, making accurate early response prediction essential.
Pages 2-4
Dataset Construction and DL-CNN Training Strategy

Patient cohort: The dataset comprised 62 cases collected retrospectively from the University of Michigan Department of Radiology, with Institutional Review Board approval. All patients underwent CT examination before and after chemotherapy, followed by cystoscopy, biopsy, or radical cystectomy. CT scans were acquired on GE Healthcare LightSpeed MDCT scanners using 120 kVp and 120-280 mA, with reconstruction intervals of 0.625, 1.25, 2.5, or 5 mm and pixel sizes ranging from 0.586 to 0.977 mm. Of the 62 patients, 27% (17/62) showed pathological stage pT0 after surgery, indicating complete response to treatment.

Reference standards: Two radiologists provided independent measurements. The first radiologist (27 years of CT bladder experience) manually outlined the full 3D contour for all 62 cases (reference standard 1) and measured the longest diameter and its perpendicular. The second radiologist (17 years of experience) independently outlined bladder tumors for a subset of 29 cases (reference standard 2) and performed WHO and RECIST measurements for all 62 cases. This dual-reader setup allowed the study to assess consistency across reference standards.

ROI extraction and training: The DL-CNN was based on the cuda-convnet architecture by Krizhevsky et al., adapted from the authors' prior work on whole bladder segmentation. For each axial CT section, overlapping 16 x 16 pixel ROIs were extracted from regions around the cancer marked by the radiologist. An ROI was labeled "inside" if more than 80% of it fell within the hand-outlined cancer boundary, and "outside" only if it lay completely outside the cancer; ROIs meeting neither criterion were excluded as ambiguous. The two classes were balanced, resulting in approximately 65,000 ROIs for training. Training used leave-one-case-out cross-validation, meaning for each of the 62 partitions, all ROIs from one case were removed and the DL-CNN was trained on the remaining data.
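The labeling rule above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: the function name, the stride, and the toy mask are assumptions, and the paper does not specify its sampling step for the overlapping ROIs.

```python
import numpy as np

def label_rois(cancer_mask, roi_size=16, stride=8, inside_frac=0.80):
    """Label overlapping ROIs against a hand-outlined cancer mask.

    Sketch of the paper's rule: an ROI is 'inside' (1) if more than
    80% of its pixels fall within the outlined cancer, 'outside' (0)
    only if it contains no cancer pixels, and skipped otherwise.
    `stride` is an assumption, not from the paper.
    """
    rois, labels = [], []
    h, w = cancer_mask.shape
    for r in range(0, h - roi_size + 1, stride):
        for c in range(0, w - roi_size + 1, stride):
            patch = cancer_mask[r:r + roi_size, c:c + roi_size]
            frac = patch.mean()          # fraction of cancer pixels
            if frac > inside_frac:
                labels.append(1)         # inside the cancer
            elif frac == 0.0:
                labels.append(0)         # completely outside
            else:
                continue                 # ambiguous: excluded
            rois.append((r, c))
    return rois, labels

# Toy example: a 32x32 section with a filled 16x16 cancer region.
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0
rois, labels = label_rois(mask)
```

In the study, the two resulting classes were then balanced before training, and the leave-one-case-out partitioning removed every ROI belonging to the held-out case.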

Network architecture: The DL-CNN consisted of 5 main layers: two convolution layers (each with 64 kernels of 5 x 5 pixels), two locally connected layers (with 64 and 32 kernels of 3 x 3 pixels, respectively), and one fully connected layer. The fully connected layer output two values to a softmax layer that converted them to a probability range of 0 to 1, representing the likelihood that an input ROI was inside or outside the tumor. The network was trained for 1,500 iterations per partition, which was sufficient for the error rate to converge. Training each partition took approximately 1.5 hours on an Nvidia Tesla K20 GPU.
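The final softmax step is simple enough to show concretely. The scores below are hypothetical, standing in for the two values produced by the fully connected layer:

```python
import numpy as np

def softmax(z):
    """Map raw scores to probabilities that sum to 1.

    Shifting by max(z) is a standard numerical-stability trick;
    it does not change the result.
    """
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Two hypothetical outputs from the fully connected layer:
# score for "inside the tumor" vs. score for "outside the tumor".
scores = np.array([2.0, 0.5])
probs = softmax(scores)   # e.g. ~0.82 likelihood of "inside"
```

The first probability is the per-ROI tumor likelihood that the segmentation pipeline consumes in the next stage.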

TL;DR: The DL-CNN was trained on approximately 65,000 balanced ROIs (16 x 16 pixels) extracted from CT scans of 62 patients using leave-one-case-out cross-validation. The 5-layer cuda-convnet architecture with softmax output was trained for 1,500 iterations per partition (~1.5 hours each on an Nvidia Tesla K20 GPU). Two radiologists provided independent reference standards.
Pages 4-5
From Likelihood Maps to Final Tumor Boundaries

Likelihood map generation: For each left-out case, the trained DL-CNN was applied to a manually marked volume of interest (VOI) in the CT scan that approximately enclosed the bladder cancer. For every voxel within the VOI, a 16 x 16 pixel ROI centered at that voxel was extracted from the corresponding axial section and fed into the DL-CNN. The network produced a likelihood score indicating the probability that each voxel was inside the tumor. The stack of 2D likelihood maps across all axial sections formed a 3D likelihood map for the entire VOI. This process was applied to both the pre-treatment and post-treatment scans for each case.
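The per-section scoring loop can be sketched as follows. The scoring function here is a trivial stand-in for the trained DL-CNN, and the border handling (leaving pixels at 0 where a full 16 x 16 ROI does not fit) is an assumption the paper does not describe:

```python
import numpy as np

def likelihood_map(section, score_fn, roi_size=16):
    """Score every pixel of one axial section with a trained classifier.

    A 16x16 ROI centered at each pixel is cut out and passed to
    `score_fn`, a stand-in for the trained DL-CNN returning
    P(inside tumor). Border pixels where a full ROI does not fit
    are left at 0 for simplicity.
    """
    h, w = section.shape
    half = roi_size // 2
    lmap = np.zeros((h, w))
    for r in range(half, h - half):
        for c in range(half, w - half):
            roi = section[r - half:r + half, c - half:c + half]
            lmap[r, c] = score_fn(roi)
    return lmap

# Stand-in scorer: mean intensity of the ROI (the real network
# learns a far richer mapping from appearance to likelihood).
section = np.random.default_rng(0).random((64, 64))
lmap = likelihood_map(section, score_fn=lambda roi: roi.mean())
```

Stacking the 2D maps of all axial sections of the VOI yields the 3D likelihood map used downstream.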

Binary mask and morphological processing: The likelihood map identified tumor regions effectively, but the tumor boundary was not sharply demarcated. To convert the continuous likelihood map into a segmentation, a threshold of 0.60 was applied, experimentally determined to produce reasonable binary masks compared to the radiologist's manual segmentation. The resulting binary cancer mask then underwent morphological processing: a dilation filter with a spherical structuring element of 2 voxels radius, a 3D flood fill algorithm, and an erosion filter (also 2-voxel radius) to smooth the mask and connect neighboring components into an initial segmentation surface.
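The threshold-and-smooth sequence maps naturally onto standard morphology routines. This is a sketch using scipy.ndimage under the paper's stated parameters (0.60 threshold, 2-voxel spherical element); the helper name and library choice are mine, not the authors':

```python
import numpy as np
from scipy import ndimage

def postprocess(likelihood_3d, threshold=0.60, radius=2):
    """Threshold a 3D likelihood map and smooth it morphologically.

    Steps as described in the paper: binarize at 0.60, dilate with a
    spherical structuring element of 2-voxel radius, flood-fill
    interior holes, then erode with the same element.
    """
    # Spherical structuring element of the given radius.
    z, y, x = np.ogrid[-radius:radius + 1,
                       -radius:radius + 1,
                       -radius:radius + 1]
    ball = (x * x + y * y + z * z) <= radius * radius

    mask = likelihood_3d >= threshold
    mask = ndimage.binary_dilation(mask, structure=ball)
    mask = ndimage.binary_fill_holes(mask)   # 3D flood fill of holes
    mask = ndimage.binary_erosion(mask, structure=ball)
    return mask

# Toy map: a high-likelihood 8x8x8 cube with one interior hole.
lmap = np.zeros((16, 16, 16))
lmap[4:12, 4:12, 4:12] = 0.9
lmap[8, 8, 8] = 0.1                          # hole inside the cube
mask = postprocess(lmap)
```

The dilate/fill/erode combination closes small gaps and joins neighboring components without permanently enlarging the mask, which is exactly why the paper pairs the dilation and erosion with the same structuring element.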

Level set refinement: The initial surface was refined using level sets governed by advection, propagation, and curvature terms. A 3D level set was first applied with 20 iterations to bring the contour toward sharp edges while slightly expanding it in low-gradient regions. This was followed by a 2D level set applied to every section with 10 iterations for further refinement. The authors noted that unlike their previous work on whole bladder segmentation (where high contrast allowed aggressive level set refinement), bladder cancer segmentation involved low contrast between the tumor and bladder interior because contrast material is generally not used for chemotherapy patients. Therefore, only a few iterations of level sets were applied to avoid leakage or incorrect shrinkage.

Comparison method (AI-CALS): The auto-initialized cascaded level sets (AI-CALS) method, developed in the authors' prior work, served as the baseline comparison. AI-CALS uses cascaded level sets with automatic initialization but does not incorporate deep learning for the initial tumor region estimation. The DL-CNN approach effectively replaced the initialization step of AI-CALS with a learned likelihood map, providing a more robust starting point for the level set refinement.

TL;DR: The DL-CNN generated a 3D likelihood map by scoring every voxel in the VOI. A threshold of 0.60 converted this into a binary mask, which was smoothed by morphological operations and refined by 3D and 2D level sets. Low contrast in non-contrast CT made aggressive level set refinement impractical, so only minimal iterations were used to avoid segmentation leakage.
Pages 5-6
How Segmentation Quality and Response Prediction Were Measured

Average minimum distance (AVDIST): This metric quantified the spatial agreement between the automatic segmentation and the radiologist's hand-drawn contours. For each point along one contour, the Euclidean distance to the closest point on the other contour was computed. These minimum distances were averaged in both directions (automatic-to-manual and manual-to-automatic), and the two directed averages were combined into a single symmetric value. A lower AVDIST indicated better agreement between the segmentation and the reference standard.
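A minimal numpy sketch of this metric, assuming the two directed averages are simply averaged together (a common convention consistent with, though not explicitly stated in, the description above):

```python
import numpy as np

def avg_min_distance(a, b):
    """Symmetric average minimum distance between two contours.

    `a` and `b` are (N, 2) and (M, 2) arrays of contour points. For
    each point on one contour, take the Euclidean distance to the
    closest point on the other; average those minima in both
    directions, then average the two directed values.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).mean()   # each point of a -> closest of b
    b_to_a = d.min(axis=0).mean()   # each point of b -> closest of a
    return 0.5 * (a_to_b + b_to_a)

# Two unit squares offset by 1 along x: two corners coincide,
# two are 1 unit apart, so the symmetric average is 0.5.
sq = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
dist = avg_min_distance(sq, sq + np.array([1.0, 0.0]))
```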

Jaccard index: The 3D Jaccard index measured volumetric overlap between the reference and segmented tumor volumes. Defined as the ratio of the intersection to the union of the two volumes, a Jaccard index of 1.0 would indicate perfect overlap, while 0.0 indicates completely disjoint volumes. This complemented the distance-based metric by capturing overall volume agreement rather than just boundary proximity.
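The Jaccard index is a one-liner on boolean masks; a small illustration with hypothetical box-shaped volumes:

```python
import numpy as np

def jaccard(ref, seg):
    """3D Jaccard index: |intersection| / |union| of two boolean masks."""
    ref, seg = ref.astype(bool), seg.astype(bool)
    union = np.logical_or(ref, seg).sum()
    if union == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return np.logical_and(ref, seg).sum() / union

# Two overlapping boxes in a small toy volume.
ref = np.zeros((10, 10, 10), dtype=bool)
seg = np.zeros((10, 10, 10), dtype=bool)
ref[2:6, 2:6, 2:6] = True    # 64 voxels
seg[4:8, 4:8, 4:8] = True    # 64 voxels, overlap 2x2x2 = 8
score = jaccard(ref, seg)    # 8 / (64 + 64 - 8) = 8/120 ≈ 0.067
```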

ROC analysis for treatment response: The primary clinical endpoint was prediction of complete response (pathological stage pT0) after surgery. The change in gross tumor volume (GTV) between pre-treatment and post-treatment CT scans was calculated for each segmentation method (DL-CNN, AI-CALS, and manual outlines). The area under the receiver operating characteristic curve (AUC) was used to estimate the accuracy of predicting pT0 based on the percentage volume change. The WHO criteria (2D) and RECIST (1D) estimates from both radiologists were also evaluated using AUC for direct comparison with the 3D volumetric approaches.
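An AUC of this kind can be computed directly via the Mann-Whitney U statistic, without fitting anything. The data below are hypothetical, purely to show the mechanics; they are not the study's measurements:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive case
    (here, a pT0 complete responder) receives a higher score than a
    randomly chosen negative case, counting ties as 1/2.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical data (NOT the study's): percentage GTV decrease
# between pre- and post-treatment scans, and whether the patient
# was pT0 at surgery.
pct_decrease = [95, 80, 70, 60, 40, 30, 10, 5]
is_pt0       = [1,  1,  0,  1,  0,  0,  0,  0]
print(auc(pct_decrease, is_pt0))   # 14/15 ≈ 0.933
```

The same routine applies unchanged whether the score is a DL-CNN volume change, an AI-CALS volume change, or a WHO/RECIST measurement change, which is what makes the study's head-to-head AUC comparison possible.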

TL;DR: Segmentation accuracy was measured by average minimum distance (lower is better) and Jaccard index (higher is better, max 1.0). Clinical utility was assessed by AUC for predicting complete response (pT0) using the volume change between pre- and post-treatment scans, compared against WHO (2D) and RECIST (1D) estimates.
Pages 6-8
Segmentation Accuracy and Treatment Response Prediction

Segmentation performance (full dataset): Across all 126 lesions (pre- and post-treatment combined), the DL-CNN achieved an average minimum distance of 4.7 +/- 2.1 mm compared to 5.5 +/- 3.2 mm for AI-CALS (P = .001). The Jaccard index was 36.3 +/- 17.7% for DL-CNN versus 33.8 +/- 15.1% for AI-CALS (P = .058, approaching significance). When broken down by treatment phase, the DL-CNN showed statistically significant superiority for pre-treatment lesions: average minimum distance of 4.8 +/- 2.3 mm versus 6.1 +/- 3.6 mm (P < .001), and Jaccard index of 39.5 +/- 17.1% versus 34.7 +/- 15.8% (P = .015). For post-treatment lesions, the two methods performed comparably with no significant differences.

Subset analysis with two readers: In the 29-case subset where both radiologists provided manual outlines, the DL-CNN and AI-CALS performances were consistent regardless of which reference standard was used. For example, the DL-CNN achieved average minimum distances of 4.6 +/- 1.8 mm (vs. reference standard 1) and 4.8 +/- 3.2 mm (vs. reference standard 2) across both treatment phases. None of the paired differences between DL-CNN and AI-CALS reached statistical significance in this smaller subset, likely due to insufficient sample size.

Treatment response prediction (AUC): For predicting complete response (pT0), the GTV change calculated from DL-CNN segmentation achieved an AUC of 0.73 +/- 0.06, compared to 0.70 +/- 0.07 for AI-CALS and 0.70 +/- 0.06 for the radiologist's hand outlines. The differences between these three volumetric methods did not reach statistical significance. Critically, all 3D volume-based methods outperformed the traditional clinical criteria: the WHO criteria yielded AUCs of 0.63 +/- 0.07 and 0.61 +/- 0.06 for the two radiologists, while RECIST produced AUCs of 0.65 +/- 0.07 and 0.63 +/- 0.06.

Qualitative observations: Visual inspection of segmentation examples showed characteristic failure modes for each method. The DL-CNN tended to undersegment post-treatment lesions, particularly when the cancer had become part of the bladder wall after treatment. AI-CALS tended to oversegment by leaking into the bladder lumen, especially when the boundary between tumor and bladder interior was indistinct. Both methods struggled with post-treatment lesions that had shrunk considerably, as these small residual tumors were inherently more difficult to delineate.

TL;DR: DL-CNN significantly outperformed AI-CALS for pre-treatment segmentation (P < .001 for distance, P = .015 for Jaccard). For treatment response prediction, DL-CNN achieved AUC 0.73, versus 0.70 for AI-CALS and manual outlines. All 3D volumetric methods substantially outperformed WHO criteria (AUC 0.61-0.63) and RECIST (AUC 0.63-0.65).
Pages 8-9
Why 3D Volume Outperforms 1D/2D Criteria and Where DL-CNN Falls Short

Superiority of 3D volumetric measurement: The results demonstrated that 3D gross tumor volume (GTV) changes provided more accurate treatment response prediction than the 2D WHO criteria or 1D RECIST measurements used in current clinical practice. This finding makes intuitive sense: bladder tumors are often irregularly shaped, and reducing a complex 3D structure to one or two linear measurements inevitably discards spatial information. The DL-CNN segmentation, by producing a full 3D contour, captured tumor shape changes that single-diameter measurements could not reflect. The AUC advantage of the volumetric methods over WHO and RECIST ranged from 0.07 to 0.12 points, a clinically meaningful difference for treatment decision-making.

DL-CNN versus AI-CALS: The DL-CNN showed clear advantages over AI-CALS for pre-treatment lesions, which tend to be better defined with more distinct boundaries. For post-treatment lesions, where tumors have shrunk and boundaries become less distinct due to treatment effects, both methods performed comparably. The key advantage of the DL-CNN approach was its ability to learn tumor appearance patterns directly from the training data, producing a likelihood map that served as a more robust initialization for level set refinement compared to the handcrafted initialization of AI-CALS.

Contrast limitations: A fundamental challenge in this application is that contrast material is generally not used for CT scans of chemotherapy patients. This results in low contrast between the tumor and the bladder interior, making segmentation substantially harder than whole bladder segmentation (where the authors previously achieved high accuracy). The level sets, which work by following gradient edges, did not perform well under these low-contrast conditions, which is why the authors limited them to only a few iterations for smoothing rather than aggressive boundary refinement.

Automated versus manual effort: The clinical significance of automated segmentation extends beyond accuracy. Manual 3D contouring of bladder tumors section by section is extremely time-intensive and labor-intensive for radiologists. Even though the DL-CNN and manual segmentation produced comparable AUC values for response prediction (0.73 versus 0.70), the automated approach eliminates the substantial workload of manual delineation, making 3D volumetric assessment practical for routine clinical use where it would otherwise be infeasible.

TL;DR: 3D volumetric measurement outperformed WHO (2D) and RECIST (1D) criteria by 0.07-0.12 AUC points for response prediction. DL-CNN excelled on pre-treatment lesions but matched AI-CALS on post-treatment scans. The lack of contrast material in chemotherapy CT scans created low tumor-bladder contrast, limiting level set refinement. Automation makes 3D assessment clinically feasible by eliminating labor-intensive manual contouring.
Page 9
Study Limitations and Plans for Expanding DL-CNN Segmentation

Small sample size: With only 62 cases and 27% (17/62) complete responders, the study had limited statistical power. This likely explains why the AUC difference between DL-CNN (0.73) and AI-CALS (0.70) did not reach statistical significance, and why comparisons in the 29-case subset showed no significant differences. The authors acknowledged that testing on a larger dataset with wider ranges of tumor sizes and types would be necessary to validate the generalizability of the method. They stated plans to continue enlarging the dataset.

Single-reader reference standard: Only one radiologist provided 3D hand-segmented contours for the full 62-case dataset, while a second radiologist outlined only 29 cases. This limited the ability to study inter-observer and intra-observer variability in manual bladder cancer segmentation. Additional independent segmentations from multiple radiologists would be needed to properly characterize the variability in the reference standard itself, which is important because any automated method is ultimately evaluated against an imperfect human reference.

Post-treatment segmentation challenges: Both the DL-CNN and AI-CALS methods showed notably weaker performance on post-treatment lesions compared to pre-treatment ones. Post-treatment tumors that have responded to chemotherapy often shrink dramatically and become integrated into the bladder wall, making boundaries extremely difficult to identify. This is precisely the scenario where accurate segmentation matters most (to determine whether treatment is working), yet it is where current methods are least reliable. Further development specifically targeting post-treatment tumor segmentation remains a key priority.

Future directions: The authors outlined plans to investigate whether radiomic features extracted from the segmented bladder cancers, combined with GTV change, could further improve treatment response assessment. Radiomics involves extracting quantitative features (texture, shape, intensity patterns) from medical images that capture information not visible to the human eye. This approach could potentially enhance the predictive power beyond what volume change alone provides. The study also noted that further development of both the DL-CNN and AI-CALS methods is needed, with room for improvement particularly in post-treatment tumor segmentation.

Broader significance: This 2016 pilot study was among the early applications of deep learning to bladder cancer segmentation in CT. The finding that automated 3D volumetric assessment outperforms the standard 1D and 2D clinical criteria (WHO, RECIST) for predicting treatment response has important implications for clinical practice. If validated on larger cohorts, DL-CNN-based segmentation could enable routine 3D tumor monitoring during chemotherapy, helping clinicians make more informed decisions about whether to continue, modify, or discontinue treatment for individual patients.

TL;DR: Key limitations include the small 62-case dataset, single-reader reference standard for most cases, and poor post-treatment segmentation performance. Future work will explore radiomic feature extraction from segmented tumors and validation on larger, more diverse cohorts. The study represents an early, promising application of deep learning to 3D bladder cancer volumetric assessment that could eventually replace the limited 1D/2D WHO and RECIST criteria in clinical practice.
Citation: Cha KH, Hadjiiski LM, Samala RK, et al. Tomography, 2016. Open access (CC BY). Available at PMC5241049. DOI: 10.18383/j.tom.2016.00184.