Deep Learning for Bladder Cancer Treatment Response

Published in Tomography, 2019.

Plain-English Explanations
Pages 1-3
Why Predicting Treatment Response Matters in Bladder Cancer

Bladder cancer is the fourth most common cancer in men. Radical cystectomy (complete surgical removal of the bladder) is the gold standard for treating localized muscle-invasive bladder cancer (MIBC), but roughly 50% of patients develop metastases within two years after surgery and ultimately die. Neoadjuvant chemotherapy, given before surgery, has been shown to improve survival by treating micrometastases and improving resectability. The standard regimen involves 12 weeks of methotrexate, vinblastine, doxorubicin, and cisplatin (MVAC) followed by radical cystectomy. In clinical trials, this approach increased the probability of finding no residual cancer at surgery.

However, MVAC chemotherapy carries substantial toxicity, including leucopenia, sepsis, mucositis, nausea, vomiting, and alopecia. Because no reliable method exists for predicting which individual patients will respond to chemotherapy, some patients endure these adverse effects without any therapeutic benefit, while also missing the window for alternative treatments as their condition deteriorates. Early prediction of treatment failure would allow physicians to discontinue ineffective chemotherapy sooner, reducing unnecessary morbidity, improving quality of life, and lowering costs.

If a patient can be reliably identified as having a complete pathological response (stage T0), the option of bladder preservation may be considered instead of cystectomy, dramatically reducing morbidity. This study by Wu et al. from the University of Michigan explored whether deep learning convolutional neural networks (DL-CNNs) could predict treatment response using pre- and post-treatment computed tomography (CT) scans, comparing different network architectures and transfer learning strategies against the performance of experienced radiologists.

TL;DR: About 50% of MIBC patients develop metastases after cystectomy, so neoadjuvant chemotherapy with MVAC is standard. But MVAC is toxic, and there is no reliable way to predict who will respond. This study tested whether DL-CNNs on CT scans could predict complete response (T0) to help guide treatment decisions and potentially spare some patients from surgery.
Pages 3-4
Patient Cohort, CT Imaging, and Hybrid ROI Construction

The study collected pre- and post-treatment CT scans of 123 patients with a total of 129 bladder cancers undergoing chemotherapy, with IRB approval. After chemotherapy, each patient underwent cystectomy, and the pathological cancer stage from the surgical specimen served as the ground truth. Thirty-three percent of patients achieved complete response (stage T0), meaning no residual cancer was found at surgery, while the remainder had persistent disease (stage > T0).

CT scans were acquired on GE Healthcare LightSpeed MDCT scanners at 120 kVp with 120 to 280 mA. Pixel sizes ranged from 0.586 to 0.977 mm, and slice thickness ranged from 0.5 to 7.5 mm. Bladder lesions on pre- and post-treatment scans were segmented using a previously developed auto-initialized cascaded level sets system. Regions of interest (ROIs) were extracted as 32 x 16 pixel images from the segmented lesions, and pre- and post-treatment images were combined into hybrid pre-post image pairs (h-ROIs) of 32 x 32 pixels.
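As a rough illustration of the pairing step, a hybrid ROI can be formed by concatenating a pre-treatment and a post-treatment ROI side by side. The side-by-side layout is an assumption for illustration; the paper specifies only the input (32 x 16) and output (32 x 32) sizes.

```python
def make_hybrid_roi(pre_roi, post_roi):
    """Concatenate a 32x16 pre-treatment ROI and a 32x16 post-treatment
    ROI column-wise into a single 32x32 hybrid ROI.

    ROIs are row-major lists of pixel rows. The side-by-side layout is
    an assumption; the paper only states the 32x16 and 32x32 sizes.
    """
    assert len(pre_roi) == 32 and len(post_roi) == 32
    assert all(len(row) == 16 for row in pre_roi + post_roi)
    return [p_row + q_row for p_row, q_row in zip(pre_roi, post_roi)]
```

Pairing each pre-treatment ROI of a lesion with each post-treatment ROI of the same lesion would, presumably, explain how 94 lesion pairs expand into 6,209 hybrid training ROIs.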

The data was split into three sets: a training set of 77 lesions from 73 patients (19 T0, 58 > T0) forming 94 lesion pairs and 6,209 hybrid ROIs; a validation set of 10 lesions (5 T0, 5 > T0) forming 10 pairs and 521 ROIs; and a test set of 42 lesions from 41 patients (12 T0, 30 > T0) forming 54 pairs. Two experienced radiologists, blinded to clinical outcomes, independently rated each test pair for the likelihood of being stage T0.

TL;DR: 123 patients (129 cancers, 33% complete responders) provided pre- and post-chemotherapy CT scans. Lesions were segmented and combined into 32x32-pixel hybrid pre-post ROI pairs. The training set contained 6,209 hybrid ROIs from 94 lesion pairs, and the test set included 54 pairs evaluated by two blinded radiologists as the human baseline.
Pages 4-6
DL-CNN Architecture Based on AlexNet and Structural Variations

The base DL-CNN architecture was derived from AlexNet and implemented in TensorFlow. It consisted of two convolutional layers (C1 and C2), two locally connected layers (L3 and L4), and one fully connected layer (FC10). Within C1 and C2, convolution filtering used 64 kernels of size 5x5 with stride 1, followed by local response normalization and max pooling with a 3x3 filter at stride 2. L3 used 64 kernels of size 3x3, L4 used 32 kernels of size 3x3, and the FC10 output was a softmax layer producing a likelihood score from 0 (> T0) to 1 (T0).
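To make the layer arithmetic concrete, the sketch below traces feature-map sizes through C1 and C2 using the standard output-size formulas. The padding choices ("same" for convolutions, "valid" for pooling) are assumptions for illustration; the paper only states that DL-CNN-3 changed the C1 pooling padding from "valid" to "same".

```python
import math

def out_size(n, kernel, stride, padding):
    """Spatial output size of a conv/pool layer on an n x n input."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1  # "valid" padding

# Trace the base architecture on a 32x32 hybrid ROI
# (padding choices are assumptions, see above).
n = 32
n = out_size(n, 5, 1, "same")   # C1 conv: 64 kernels 5x5, stride 1 -> 32
n = out_size(n, 3, 2, "valid")  # C1 max pool: 3x3, stride 2        -> 15
n = out_size(n, 5, 1, "same")   # C2 conv: 64 kernels 5x5, stride 1 -> 15
n = out_size(n, 3, 2, "valid")  # C2 max pool: 3x3, stride 2        -> 7
print(n)  # -> 7
```

The same helper makes it easy to see how the structural variants below change the feature-map sizes feeding the locally connected layers.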

Three structural modifications were tested, each altering the max pooling filter sizes, strides, and padding in the C1 and C2 layers. DL-CNN-1 increased the C1 max pooling filter to 5x5 and reduced the C2 max pooling to 2x2 with stride 1. DL-CNN-2 changed the C1 convolution stride to 2 and the C2 max pooling to 2x2 with stride 1. DL-CNN-3 changed the C1 max pooling padding from "valid" to "same" and set the C2 max pooling filter to 4x4 with stride 2. These modifications explored how early-layer feature extraction influenced classification of treatment response.

The network was relatively small compared to deeper architectures like GoogLeNet Inception or ResNet, which the authors noted they would investigate in the future when a larger dataset becomes available. The compact design was a deliberate choice to avoid overfitting given the limited training data of 6,209 hybrid ROIs.

TL;DR: The base DL-CNN used an AlexNet-derived architecture with two convolutional layers (C1, C2), two locally connected layers, and a softmax output, implemented in TensorFlow. Three structural variants modified the max pooling filter sizes, strides, and padding in C1 and C2. The compact design was chosen to prevent overfitting on the small dataset.
Pages 5-6
Transfer Learning from CIFAR-10 and Layer Freezing Strategies

Transfer learning is a widely used technique in medical imaging where training datasets are typically small. Instead of training a CNN from scratch with random weights, the network is first pre-trained on a large dataset from a different domain and then fine-tuned on the target medical imaging task. This study used the CIFAR-10 image set, which contains 60,000 32x32 images across 10 natural-object classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), to provide initial pre-trained weights.

The authors explored layer freezing, where certain layers have their weights locked during fine-tuning to preserve learned features. Three freezing configurations were tested on the base DL-CNN structure: freezing only C1, freezing C1 and C2 together, and freezing C1, C2, and L3 together. The rationale is that early layers in neural networks tend to learn universal visual features like edges and curves, while deeper layers learn features more specific to the target domain (in this case, bladder lesion patterns).
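The three freezing configurations can be sketched as a simple partition of the layer stack (layer names from the paper; the mechanics shown here are a generic illustration, not the authors' code):

```python
# Layer stack of the base DL-CNN, shallowest first (names from the paper).
LAYERS = ["C1", "C2", "L3", "L4", "FC10"]

def trainable_layers(n_frozen):
    """Return the layers whose weights are updated during fine-tuning
    when the first n_frozen layers keep their CIFAR-10 pre-trained
    weights locked."""
    return LAYERS[n_frozen:]

# The three configurations tested in the study:
for n in (1, 2, 3):
    print(f"freeze {LAYERS[:n]} -> fine-tune {trainable_layers(n)}")
```

In a framework like TensorFlow/Keras, this partition corresponds to marking the frozen layers as non-trainable before fine-tuning on the bladder CT data.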

All models were trained for 10,000 epochs, with validation AUC recorded every 100 epochs. The epoch number where validation AUC peaked (typically around 2,000) was selected, and a final model was then trained on the combined training and validation sets up to that epoch before deployment on the test set. Training for 10,000 epochs took approximately 8.3 hours on an NVIDIA GeForce GTX 1080 Ti GPU, and deployment on the test set took less than one minute per case.
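The epoch-selection step amounts to picking the checkpoint with the highest validation AUC. A minimal sketch (the monitoring loop and the sample AUC curve are generic reconstructions, not the authors' code):

```python
def select_stopping_epoch(val_auc_by_epoch):
    """Given validation AUC recorded every 100 epochs (epoch -> AUC),
    return the epoch with the highest validation AUC. The final model
    is then retrained on training + validation data up to this epoch."""
    return max(val_auc_by_epoch, key=val_auc_by_epoch.get)

# Hypothetical AUC curve peaking near epoch 2,000, as in the paper
curve = {100: 0.62, 1000: 0.74, 2000: 0.80, 5000: 0.78, 10000: 0.75}
print(select_stopping_epoch(curve))  # -> 2000
```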

TL;DR: The DL-CNN was pre-trained on CIFAR-10 (60,000 natural images) and fine-tuned on bladder CT data, a standard transfer learning approach for small medical datasets. Three layer-freezing strategies were tested (C1 only, C1+C2, C1+C2+L3) to study how preserving early learned features affects performance. Training took about 8.3 hours on a GTX 1080 Ti GPU.
Pages 6-8
AUC Performance Across Network Variants and Transfer Learning

The base DL-CNN with randomly initialized weights achieved a test AUC of 0.73 for predicting complete response (T0). When the same architecture was pre-trained on CIFAR-10 with transfer learning (no frozen layers), the test AUC improved to 0.79, demonstrating that pre-trained weights provide a meaningful advantage even when the source domain (natural images) differs substantially from the target domain (bladder CT scans).

Among the structural variants, DL-CNN-2 achieved the best performance with a test AUC of 0.86, followed by DL-CNN-1 at 0.72 and DL-CNN-3 at 0.69. The only statistically significant difference was between DL-CNN-2 and DL-CNN-3 (p = 0.007 by DeLong test, p = 0.006 by ROC-kit). DL-CNN-2 modified the C1 convolution stride to 2 and used smaller C2 max pooling, suggesting that these particular changes to early feature extraction were beneficial for distinguishing treatment response patterns.

For layer freezing, freezing only C1 produced a test AUC of 0.81, slightly better than the unfrozen baseline (0.79). Freezing C1 and C2 together yielded 0.78, and freezing C1, C2, and L3 dropped to 0.71. This progressive decline follows the principle that early CNN layers capture universal features (edges, curves) that may not need fine-tuning, but deeper layers must adapt to the specific domain. None of the layer-freezing differences reached statistical significance.

The two radiologists achieved AUCs of 0.76 and 0.77, respectively. At a clinically selected specificity of 80%, sensitivities ranged from 41.7% to 75.0% and accuracies from 64.1% to 78.9% across all models. DL-CNN-2 achieved the highest sensitivity of 75.0% and accuracy of 78.9% at 80% specificity.
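The operating-point comparison works by sliding the decision threshold on the network's T0 likelihood score until specificity reaches the chosen 80%, then reading off sensitivity at that threshold. A minimal sketch of the idea (the scores in the test are made up, not study data):

```python
def sensitivity_at_specificity(scores, labels, target_spec=0.80):
    """Find the most sensitive operating point whose specificity is at
    least target_spec. labels: 1 = complete response (T0), 0 = >T0.
    A case is called T0 when its score >= threshold."""
    for t in sorted(set(scores)):  # ascending: specificity rises with t
        preds = [s >= t for s in scores]
        tn = sum(not p and y == 0 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fn = sum(not p and y == 1 for p, y in zip(preds, labels))
        if tn / (tn + fp) >= target_spec:
            return tp / (tp + fn)
    return 0.0  # no threshold reaches the target specificity
```

Sweeping the threshold from low to high trades sensitivity for specificity, which is why a model with higher AUC (like DL-CNN-2) can afford a better sensitivity at the same fixed specificity.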

TL;DR: Transfer learning improved the base DL-CNN from AUC 0.73 (random weights) to 0.79 (CIFAR-10 pre-trained). The best structural variant (DL-CNN-2) reached AUC 0.86, outperforming both radiologists (AUC 0.76 and 0.77). Freezing only the first convolutional layer slightly improved performance (AUC 0.81), but freezing too many layers degraded it.
Pages 8-10
Comparison with Radiologists and Radiomics Methods

The study compared DL-CNN results against both human readers and prior radiomics-based machine learning methods from a study by Cha et al. The two experienced radiologists, blinded to treatment outcomes, achieved AUCs of 0.76 and 0.77 when evaluating pre- and post-treatment CT scans displayed side by side on medical-grade monitors. The DL-CNN from the earlier Cha et al. study, trained on a smaller cohort (82 patients, 87 cancers), achieved AUC 0.73, matching the base DL-CNN with random weights in this study.

Cha et al. also evaluated two radiomics feature-based classification methods: RF-SL (radiomics features with segmented lesion-level analysis) and RF-ROI (radiomics features with ROI-level analysis). RF-SL achieved AUC 0.77, comparable to the radiologists, while RF-ROI achieved AUC 0.69. These radiomics methods extracted morphological, gray-level, and texture features from pre- and post-treatment scans and predicted response based on estimated feature changes.

The best-performing model in the current study, DL-CNN-2, with its AUC of 0.86, outperformed all comparison methods: both radiologists, both radiomics methods, and the earlier DL-CNN. However, the difference was statistically significant only against DL-CNN-3 (the worst-performing variant), highlighting how difficult it is to demonstrate significant differences with a 54-pair test set. The authors noted that the DL-CNN approach has the advantage of automatically learning discriminative features directly from the image data, bypassing the manual feature engineering required by radiomics methods.

TL;DR: DL-CNN-2 (AUC 0.86) outperformed both radiologists (AUC 0.76, 0.77), radiomics methods RF-SL (AUC 0.77) and RF-ROI (AUC 0.69), and the prior DL-CNN by Cha et al. (AUC 0.73). The DL-CNN learns features automatically from images, avoiding the manual feature engineering needed for radiomics, though the small test set limited statistical significance.
Pages 10-13
Study Limitations, Clinical Implications, and Next Steps

The most important limitation is the relatively small dataset of 123 patients with 129 cancers. The training set contained only 77 lesions forming 94 pairs, and the test set had 42 lesions forming 54 pairs. This small sample size likely contributed to the lack of statistically significant differences between most model comparisons and may limit generalizability to other patient populations. Only two radiologists provided comparison ratings, which is insufficient to fully characterize the variability of human reader performance.

The CT scans were acquired with non-uniform pixel sizes (0.586 to 0.977 mm) and slice thicknesses (0.5 to 7.5 mm). While this variability mirrors real clinical conditions and helps the network handle diverse imaging parameters, it may also introduce bias during training. The authors suggested future work on voxel size matching using interpolation. Additionally, CIFAR-10 is not a medical imaging dataset, so transfer learning from CT-specific pre-trained networks could yield further improvements.
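The suggested voxel-size matching could look like the 1-D nearest-neighbour sketch below. This is illustrative only: real preprocessing would interpolate in 3-D and likely use a higher-order scheme, but the index mapping is the same idea.

```python
def resample_nearest(samples, src_spacing, dst_spacing):
    """Resample a 1-D profile (e.g., one image row) from src_spacing
    mm/pixel to dst_spacing mm/pixel using nearest-neighbour lookup."""
    n_out = round(len(samples) * src_spacing / dst_spacing)
    return [samples[min(len(samples) - 1, int(i * dst_spacing / src_spacing))]
            for i in range(n_out)]

# Halving the pixel size doubles the sample count
print(resample_nearest([10, 20, 30, 40], 1.0, 0.5))
# -> [10, 10, 20, 20, 30, 30, 40, 40]
```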

The clinical implications are substantial. Accurate prediction of pathological response to neoadjuvant chemotherapy could enable earlier discontinuation of toxic MVAC regimens for non-responders and could support bladder preservation decisions for complete responders, avoiding the morbidity of radical cystectomy. The authors envision a computerized decision support system (CDSS-T) that integrates with clinical workflows to provide noninvasive, objective treatment monitoring.

Future directions include collecting larger datasets, exploring deeper architectures such as GoogLeNet Inception and ResNet that require more training data, using CT-specific pre-trained weights instead of CIFAR-10, and validating models across multiple institutions. The compact DL-CNN design was necessary given current data constraints, but the study demonstrated that even relatively simple deep learning architectures can match or exceed radiologist performance for this challenging treatment response prediction task.

TL;DR: The study's main limitations are the small dataset (123 patients), non-uniform CT acquisition parameters, and only two radiologist comparisons. Future work targets larger cohorts, deeper networks like Inception and ResNet, CT-specific transfer learning, and multi-institution validation. Clinically, accurate response prediction could spare non-responders from toxic chemotherapy and enable bladder preservation for complete responders.
Citation: Wu E, Hadjiiski LM, Samala RK, et al. Tomography, 2019. Open access (CC BY-NC-ND). PMC6403041. DOI: 10.18383/j.tom.2018.00036.