Bladder cancer (BCa) is the tenth most common cancer worldwide, with approximately 573,000 new cases and 213,000 deaths reported in 2020. Early diagnosis and treatment are crucial for reducing morbidity and mortality. The current gold standard for diagnosis relies on transurethral resection of bladder tumor (TURBT) and cystoscopy, both of which are invasive and expensive. As a result, noninvasive imaging techniques including magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET) play an increasingly important role in BCa detection.
Deep learning (DL) is a subfield of machine learning (ML) that uses multi-layered neural networks to automatically learn relevant features from data, eliminating the need for manual feature selection that classical ML algorithms require. DL has achieved significant success in large-scale image classification, speech recognition, and other complex computational domains. Medical images contain vast amounts of data with valuable signals that exceed human analytical capacity, making DL a natural fit for this domain.
This 2022 review by Li et al., authored by researchers from Shanghai Jiao Tong University School of Medicine and the AI Institute at Shanghai Jiao Tong University, represents the first comprehensive review of DL applications specifically in BCa imaging. The authors conducted a literature search across PubMed, Web of Science, and IEEE Xplore using terms including "Bladder Cancer," "Deep Learning," and "Medical Imaging." Eligibility criteria required papers to be in English, non-review, primarily related to BCa, discussing DL, and involving imaging data. The review covers four major application areas: bladder segmentation, diagnosis, staging, and treatment response prediction.
Clinical motivation: Radiologists face significant difficulty making accurate BCa diagnoses based on imaging alone because of the complex and variable imaging features of BCa. Non-muscle-invasive bladder cancer (NMIBC) accounts for approximately 75% of BCa cases, while muscle-invasive bladder cancer (MIBC) accounts for 25%. MIBC carries a poor prognosis, with a 5-year survival rate of approximately 45-68% after radical cystectomy, and survival time for patients with metastases generally does not exceed 2 years. These stark outcomes underscore the urgency for better diagnostic imaging tools.
Medical image segmentation is a foundational step in BCa diagnosis and staging. Accurate segmentation of the bladder wall and tumor regions allows quantitative assessment of tumor size, shape, and extent of invasion. However, the bladder is a hollow organ that undergoes substantial variation in position, shape, and volume. Combined with complex noise and artifacts in medical images, this makes automated segmentation a challenging task. Most early DL studies focused only on bladder wall segmentation because high variability in tumor shape and intensity makes it difficult to distinguish between the bladder wall and a tumor.
U-Net is one of the most successful fully convolutional architectures for medical image segmentation. In 2018, Dolz et al. enhanced U-Net with progressive dilated convolutional modules, where increasing dilation rates provided larger receptive fields to leverage multi-scale contextual information. Trained on T2-weighted MRI from 60 BCa patients, their model achieved mean Dice similarity coefficient (DSC) values of 0.98 for the bladder inner wall, 0.84 for the outer wall, and 0.69 for the tumor region, outperforming the original U-Net, E-Net, and ERF-Net. Inference time for the entire 3D volume remained under 1 second.
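The effect of increasing dilation rates on the receptive field can be illustrated with a short calculation. This is a minimal sketch, not Dolz et al.'s configuration: the kernel size of 3 and the dilation rates (1, 2, 4) are assumptions chosen only to show the principle for stacked stride-1 convolutions.

```python
def receptive_field(kernel_size, dilations):
    """Per-axis receptive field of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Illustrative (assumed) settings: three 3x3 layers in both cases.
plain = receptive_field(3, [1, 1, 1])        # ordinary convolutions
progressive = receptive_field(3, [1, 2, 4])  # progressively dilated

print(plain, progressive)  # 7 15
```

With the same number of layers and weights, the progressively dilated stack covers roughly twice the context per axis, which is the multi-scale benefit the paragraph describes.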
PiPNet (Pyramid in Pyramid Network): In 2019, Liu et al. proposed PiPNet, built on U-Net with atrous spatial pyramid pooling (ASPP) using four parallel atrous convolutions with increasing dilation rates. The model generated three prediction masks from the last three feature map layers to compute an overall loss function for multi-scale feature extraction. Depthwise separable convolution improved efficiency. Evaluated on T2W MR images from 47 BCa patients, PiPNet achieved DSC values of 0.89 for the outer wall and 0.95 for the tumor, outperforming SegNet, U-Net, and Dolz's model.
CPA-Unet: Yu et al. (2022) developed the Cascade Path Augmentation Unet, employing a two-stage segmentation strategy. The first stage used U-Net for rough segmentation, then the segmented image was concatenated with the original image and fed into a Path Augmentation structure (PA-Unet) based on the Path Aggregation Network. A hybrid loss function combining dice and cross-entropy losses further improved performance. CPA-Unet achieved DSC values of 0.98 for the inner wall, 0.82 for the outer wall, and 0.87 for the tumor, with better multi-scale feature extraction and small target classification than U-Net, Prog Dilated, and PiPNet.
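A hybrid of Dice and cross-entropy losses can be sketched as follows. This is an illustrative version for a binary mask flattened to a list of probabilities; the equal weighting (`dice_weight=0.5`) and the smoothing constants are assumptions, since the summary does not state how Yu et al. combined the two terms.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice overlap between predicted probabilities and a 0/1 mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def hybrid_loss(probs, targets, dice_weight=0.5):
    """Weighted sum of the two losses (the weighting is an assumed choice)."""
    return (dice_weight * soft_dice_loss(probs, targets)
            + (1.0 - dice_weight) * binary_cross_entropy(probs, targets))


mask = [1, 0, 1, 0]
good = hybrid_loss([0.9, 0.1, 0.8, 0.2], mask)
bad = hybrid_loss([0.1, 0.9, 0.1, 0.9], mask)
print(good < bad)  # True
```

The Dice term responds to region overlap, which helps when the target occupies few pixels, while the cross-entropy term provides a per-pixel gradient signal.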
Beyond U-Net: While U-Net variants improve network performance through more elaborate architecture design, the review notes they do not take advantage of the unique geometric characteristics of BCa data. The bladder's hollow structure and distinctive shape offer information that generic architectures cannot exploit. DeepMedic, another well-known CNN architecture for medical image segmentation, was adapted for BCa to better leverage this geometric information.
Hammouda et al. (2019) adopted a dual-pathway 2D CNN based on DeepMedic to segment T2-weighted MRI images. In addition to the MRI data, they incorporated subject-specific adaptive shape prior (ASP) information derived by co-aligning the MRI images with the ground truth using affine and B-spline transformations. Combining adaptive shape and contextual information yielded DSC values of 0.99 for the bladder inner wall, 0.98 for the outer wall, and 0.97 for the tumor, with Hausdorff distances of just 0.17 mm (inner wall), 0.18 mm (outer wall), and 0.25 mm (tumor), evaluated with leave-one-out cross-validation on 20 patients.
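For readers unfamiliar with the Hausdorff distance reported above, the following minimal sketch computes it for two small sets of boundary points. The coordinates are hypothetical, and the brute-force search is for illustration only; the result is in whatever units the coordinates use, so pixel coordinates would need scaling by the voxel spacing to obtain millimetres.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical boundary points: the prediction has one stray point at (1, 3).
truth = [(0, 0), (1, 0)]
prediction = [(0, 0), (1, 0), (1, 3)]

print(hausdorff(truth, prediction))  # 3.0
```

Because it is driven by the single worst boundary point, the Hausdorff distance penalizes outliers that an overlap measure such as DSC can largely hide.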
3D extension (2020): Hammouda et al. further extended their approach to 3D bladder segmentation using T2W MRI on 17 patients. The proposed 3D CNN used two branch networks: the first segmented the bladder wall with the tumor, and the second extracted only the bladder. A 3D ASP model was mixed with the original training data for the second network, and outputs were refined using a fully connected conditional random field (CRF). The CRF effectively reduced isolated small regions and holes caused by local minima during training and noise in input images. DSC values were 0.98 for the inner wall, 0.97 for the outer wall, and 0.96 for the tumor.
Evaluation metrics discussion: The review highlights an important methodological concern. Different studies adopted different evaluation metrics, which makes direct comparison difficult. Most articles used the Dice similarity coefficient (DSC), but some employed the Jaccard index, average distance (AVDIST), average symmetric surface distance (ASSD), or Hausdorff distance (HD). The authors advocate consistent adoption of both DSC and HD to facilitate meaningful comparisons. They also note that metrics relevant to clinical application, such as model computation time, should be reported. Before deep learning, the best tumor segmentation DSC achieved with traditional (level-set) methods was 86.3% (0.863), compared with 97.05% with deep learning; bladder wall segmentation improved from 87.28% to consistently above 90%.
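Part of the comparison problem is that DSC and the Jaccard index give different numbers for the same segmentation, even though one can be converted into the other. The sketch below shows this on hypothetical masks represented as sets of pixel indices; the masks are made up for illustration.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two sets of foreground pixels."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def jaccard(pred, truth):
    """Jaccard index (intersection over union) for the same inputs."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)


# Hypothetical masks: 4 pixels each, 2 in common.
pred = {1, 2, 3, 4}
truth = {3, 4, 5, 6}

d = dice(pred, truth)
j = jaccard(pred, truth)
print(d)                            # 0.5
print(abs(j - d / (2 - d)) < 1e-12)  # True: J = D / (2 - D)
```

Since J = D / (2 - D), a Jaccard value is always lower than or equal to the corresponding DSC, so the two should not be compared directly across studies without conversion.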
Clinical importance of staging: BCa is classified as NMIBC or MIBC according to whether the cancer invades the muscle layer. Early and accurate differentiation between the two is crucial because MIBC carries a significantly worse prognosis. Before DL, approaches combining artificial intelligence with radiomics had already replaced the traditional practice of manually defining the region of interest (ROI) and extracting image features for BCa diagnosis and staging.
Small DL-CNN approach: Yang et al. (2021) proposed a compact DL-CNN containing four convolutional and max-pooling layers to differentiate NMIBC from MIBC. Trained on 1,200 CT images from 369 patients, the small DL-CNN was designed to limit the risk of overfitting and achieved a sensitivity of 0.722 and a specificity of 1.000. For comparison, they also evaluated eight well-known models pretrained on ImageNet, among which VGG16 and VGG19 showed the highest performance, with AUROC values ranging from 0.762 to 0.997. However, the study required an additional artificial enhancement step before data were fed into the model, which prevented fully automated processing.
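Sensitivity and specificity follow directly from the confusion matrix. The sketch below uses hypothetical labels (1 = MIBC, 0 = NMIBC is an assumed encoding) to show how a classifier can reach perfect specificity while still missing some positive cases; the numbers are not from Yang et al.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumes at least one positive and one negative case are present.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test set: one MIBC case is missed, no NMIBC case is overcalled.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 1.0
```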
FGP-Net multicenter study: Zhang et al. (2021) conducted one of the rare multicenter DL studies in BCa, using CT urography images from 441 patients across two medical centers (183 training, 110 validation, 73 internal test, 75 external test). They developed a novel 3D DL-CNN called the Filter-guided Pyramid Network (FGP-Net), which incorporated dense blocks for enhanced feature transmission and discriminative filter learning (DFL) modules for class-specific patch detection. The network achieved an AUC of 0.861 and an accuracy of 0.795 on the internal test set, and an AUC of 0.791 and an accuracy of 0.747 on the external cohort. Although this performance leaves room for improvement, the DL model provided slightly better, more objective, and more stable results than two radiologists.
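The AUC values quoted for the internal and external cohorts can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following sketch computes AUC in exactly that pairwise way; the labels and scores are hypothetical, and the quadratic loop is meant for clarity rather than efficiency.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    Assumes labels are 0/1 and that both classes are present.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model outputs: one positive case scores below one negative case.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]

print(auc(labels, scores))  # 0.75
```

Unlike accuracy, this measure does not depend on a decision threshold, which is one reason both figures are reported side by side.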
ResNet18 and VI-RADS: Liu et al. (2022) adopted ResNet18 with a super-resolution module and non-local attention module for MRI-based BCa diagnosis and staging, achieving a sensitivity of 94.74% on a dataset of 75 patients (51 training, 8 validation, 16 test). Taguchi et al. (2021) used a CNN-based denoising deep learning reconstruction (dDLR) to improve the signal-to-noise ratio in high-spatial-resolution MRI images, demonstrating the potential of DL in indirectly assisting BCa diagnosis through the Vesical Imaging Reporting and Data System (VI-RADS).
Clinical need: Neoadjuvant chemotherapy has been shown to improve overall survival for BCa patients, but not all patients benefit and some suffer severe side effects. Early assessment of tumor size changes and treatment response is therefore essential for personalized treatment planning. Current clinical assessment tools, including the WHO criteria and RECIST (Response Evaluation Criteria in Solid Tumors), can be inaccurate because they do not capture 3D measurements, and their results are heavily influenced by observer experience, particularly for tumors with complex and irregular shapes.
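A small worked example illustrates why diameter-based criteria and 3D volume can disagree. It is a sketch under strong assumptions: the lesions are idealized as ellipsoids and all measurements are hypothetical. The 30% figure in the first case matches RECIST's threshold for partial response, a decrease of at least 30% in the sum of diameters.

```python
import math


def ellipsoid_volume(a, b, c):
    """Volume of an ellipsoid with full axis lengths a, b, c (same units)."""
    return math.pi * a * b * c / 6.0


def percent_change(before, after):
    """Signed percentage change from before to after."""
    return 100.0 * (after - before) / before


# Case 1: a spherical lesion shrinks from 40 mm to 28 mm in diameter.
d1 = percent_change(40.0, 28.0)
v1 = percent_change(ellipsoid_volume(40, 40, 40), ellipsoid_volume(28, 28, 28))
print(f"{d1:.1f}% diameter, {v1:.1f}% volume")  # -30.0% diameter, -65.7% volume

# Case 2: the longest axis is unchanged while the other two halve.
d2 = percent_change(40.0, 40.0)
v2 = percent_change(ellipsoid_volume(40, 30, 30), ellipsoid_volume(40, 15, 15))
print(f"{d2:.1f}% longest diameter, {v2:.1f}% volume")  # 0.0% longest diameter, -75.0% volume
```

In the second case a longest-diameter measurement would record no change even though three quarters of the volume has gone, which is the kind of irregular response a volumetric segmentation can capture.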
Pioneering work by Cha et al.: In 2016, Cha et al. developed a network with 2 convolution layers, 2 locally connected layers, and 1 fully connected layer (based on AlexNet) to segment and measure gross tumor volume (GTV) from CT images for predicting treatment response in 62 patients using leave-one-out cross-validation. They achieved an AUC of 0.73. In 2017, the same group developed a similar DL-CNN to predict the response to neoadjuvant chemotherapy in 82 patients, pairing ROIs extracted from pre- and post-treatment tumor regions to form 6,700 image pairs, and again achieved AUC 0.73.
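Leave-one-out cross-validation, as used in the 62-patient study, holds out each patient once and trains on all the others. The sketch below generates the splits only; the patient identifiers are placeholders, and the model training and scoring steps are omitted.

```python
def leave_one_out(patient_ids):
    """Yield (training_ids, held_out_id) once for every patient."""
    for i, held_out in enumerate(patient_ids):
        training = patient_ids[:i] + patient_ids[i + 1:]
        yield training, held_out


# Placeholder identifiers for a 62-patient cohort.
patients = [f"patient_{n:02d}" for n in range(1, 63)]
folds = list(leave_one_out(patients))

print(len(folds), len(folds[0][0]))  # 62 61
```

Splitting by patient rather than by image keeps all images from the held-out patient out of training, which matters when many ROIs are drawn from the same person.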
Network optimization by Wu et al. (2019): Building on Cha's work, Wu et al. developed seven DL-CNN variants by modifying filter size, stride, and padding in the convolution and max-pooling layers. Only one variant (DL-CNN-2, with a modified C1 convolution stride and C2 max pooling) showed a significant improvement, achieving an AUC of 0.86. Pretraining on the CIFAR10 dataset also improved performance, reaching an AUC of 0.79 with pretrained weights versus 0.73 with random initialization. Performance generally decreased as more layers were frozen, although freezing only the C1 layer slightly improved results, possibly because the first layer learns generic features that transfer well, whereas subsequent layers need to adapt to features specific to bladder lesions.
CDSS-T clinical decision support: Cha et al. (2019) developed a CT-based decision-support system for MIBC treatment response assessment (CDSS-T) combining their DL-CNN with a radiomics assessment model. In an observer study in which 12 physicians assessed 123 patients, CDSS-T alone achieved an AUC of 0.80, higher than the 0.77 achieved by physicians assisted by the system and the 0.74 achieved by physicians without assistance. This was the first observer study using a CAD system for this purpose, and the finding that standalone system accuracy exceeded physician-assisted accuracy suggests that clinician trust in and experience with such systems still need development.
Imaging modality gaps: Clinical BCa diagnosis often requires integration of various imaging data, including CT and different MRI sequences. While CT is the most commonly used imaging technique for BCa, MRI has been shown to be more effective for staging due to increased soft-tissue contrast resolution. Diffusion-weighted imaging (DWI) and dynamic contrast enhancement (DCE) are particularly useful for assessing tumor invasiveness and infiltration. However, most current DL studies still use CT as input data, and all studies using MRI have focused only on T2-weighted sequences, with no research exploring DWI or DCE sequences.
Small and non-standardized datasets: The limited quantity of medical imaging data restricts the development of DL, as the amount of data significantly affects model performance. Many BCa studies used datasets so small that they lacked independent validation or test sets, which biases the assessment of model performance. The review's summary tables confirm this: segmentation studies ranged from just 17 to 220 patients, diagnosis studies used 68 to 369 patients, and treatment studies included 62 to 123 patients. Transfer learning and data augmentation can improve performance to some extent, but they cannot replace the need for large datasets.
Cross-institutional generalizability: Different hospitals employ different scanning methods and equipment, making established models difficult to use across institutions. This is a fundamental limitation for clinical application. The FGP-Net study by Zhang et al. was one of the few to include external cohort validation, and it showed a notable drop in performance from AUC 0.861 (internal) to 0.791 (external), illustrating the generalizability problem. Semi-supervised and self-supervised methods could address data scarcity, but their application in BCa remains limited.
Single-modality focus: Most BCa DL research has focused on only one modality of medical imaging, whether CT or MRI. Multiple studies in other domains have shown that processing multiple modalities simultaneously can significantly improve DL model performance. The authors advocate for increasing data diversity, multimodal methods, and comprehensive BCa datasets including multi-center data or a nationwide BCa imaging database to advance the field significantly.
Lack of BCa-specific optimization: Most DL models in current BCa research simply apply existing network architectures without optimizing for the unique imaging characteristics of bladder cancer. BCa data possess distinctive structures, including the bladder's unique geometry, hollow structure, and variable shape. These characteristics are not being well utilized by current approaches. The review contrasts U-Net-based methods, which improve results through generic network design, with DeepMedic-based approaches that incorporate bladder-specific geometric information and achieve superior results.
Interpretability ("black box" problem): Compared with other ML methods, DL operates as a complex black box. For future optimization, it is essential to reflect physicians' diagnostic ideas and clinical experience within the DL model and improve its interpretability. Only when a physician can understand why a DL model makes a particular assessment can the model effectively assist in clinical decision-making. The CDSS-T observer study highlighted this challenge: standalone system accuracy (AUC 0.80) exceeded physician-assisted accuracy (AUC 0.77), suggesting that clinicians may not fully trust or effectively integrate AI recommendations.
Underexplored advanced techniques: Many state-of-the-art DL techniques, including self-supervised learning, pre-training models, transformers, and contrastive learning, have not yet been applied in BCa research. These methods have shown impressive results in other medical imaging domains and represent significant untapped potential. The majority of BCa research remains focused on image segmentation, while the authors believe DL could assist physicians in many more ways across the diagnostic and treatment workflow.
Application gap: Despite DL's demonstrated potential to match or exceed physician performance in specific tasks, transitioning from research to clinical application faces substantial hurdles. The radiologist's subjective judgment still offers advantages in certain scenarios. For instance, radiologists may deliberately upstage ambiguous tumors out of concern for missing MIBC, which could benefit patients through earlier clinical intervention. Integrating AI recommendations with clinical judgment requires careful consideration of when and how to deploy these tools in practice.
Multimodal fusion: The review identifies multimodal processing as one of the most promising research directions. Combining imaging-based assessment with other clinical data such as genomics and pathology has been shown to outperform unimodal approaches. BCa is heterogeneous at the molecular level, and different molecular classifications may help stratify patients for prognosis or treatment response. Including multimodal information can complement the shortcomings of BCa imaging in these areas. However, the small number of BCa open datasets has limited the adoption of multimodal processing methods.
Advanced MRI sequences: Combining DL with the most appropriate and advanced imaging techniques in BCa will be a key research direction. Specifically, DWI and DCE MRI sequences, which are far more useful for assessing tumor invasiveness and infiltration than T2-weighted imaging alone, have not yet been explored with DL methods. Integrating these sequences with DL could significantly improve staging accuracy, particularly for distinguishing T1 from T2 disease where current methods perform poorly.
Semi-supervised and self-supervised methods: Given the fundamental constraint of small, non-standardized datasets in BCa, the authors highlight semi-supervised and self-supervised learning as necessary approaches. These methods can extract useful representations from unlabeled data, reducing the dependency on large annotated datasets. Combined with transfer learning from pretrained models and data augmentation, these techniques could substantially improve model robustness even with limited BCa-specific training data.
Overall conclusions: The review positions DL as having extremely broad application prospects in BCa. In the era of precision medicine and individualized diagnosis, the central challenge is transforming DL from a research tool into one that effectively helps physicians in clinical practice. The authors emphasize three priorities: expanding multi-center and nationwide BCa imaging databases, developing BCa-specific DL architectures that exploit the bladder's unique anatomical features, and improving model interpretability to build clinician trust. The powerful potential demonstrated by DL is expected to bring about a new revolution in BCa management.