Bladder cancer recurrence: Nonmuscle-invasive bladder cancer (NMIBC) is one of the most recurrence-prone malignancies in urology. The standard treatment, transurethral resection of bladder tumor (TUR-BT), is followed by intravesical recurrence within two years in approximately 50% of cases. Much of this recurrence is attributed not to new tumor formation but to dissemination of tumor cells, expansion of precancerous lesions, and micro-disseminated daughter tumors overlooked during the initial procedure. Recurrence rates at first follow-up cystoscopy vary widely across institutions, pointing to a significant dependence on surgeon skill and experience.
Limitations of white-light cystoscopy: Cystoscopy is the essential tool for both diagnosing and monitoring bladder cancer, yet white-light imaging (WLI) misses lesions in 10 to 20% of cases. Reported sensitivity and specificity of diagnosis under WLI are only about 60% and 70%, respectively. Flat tumors such as carcinoma in situ (CIS), small-diameter lesions, and the flat extensions surrounding elevated tumors are particularly difficult to identify. While advanced endoscopic techniques like narrow band imaging (NBI) and photodynamic diagnosis (PDD) have improved visibility, WLI remains the primary observation method in most clinical settings.
Study objective: The authors, a collaborative team from the University of Tsukuba Hospital and the National Institute of Advanced Industrial Science and Technology (AIST) in Japan, aimed to develop an AI-based support system for cystoscopic diagnosis. Their goal was not to replace the urologist but to provide objective, automated evaluation of white-light cystoscopy images using a convolutional neural network (CNN) trained via transfer learning. The study used images collected from routine clinical practice between February 2017 and July 2018.
Image acquisition: The dataset consisted of 2102 cystoscopic images captured using a flexible endoscope (CYF-VHA, Olympus) at the outpatient clinic of the University of Tsukuba Hospital. All images were white-light still images stored as TIFF files at 1350 x 1080 pixels. Images degraded by urine turbidity or out-of-focus blur were excluded. One experienced urologist annotated all images, marking sites judged to be tumors using two categories: elevated lesions and flat lesions. Annotations were confirmed against pathologic results as ground truth.
Dataset composition: Of the 2102 images, 431 were tumor images and 1671 were normal images. The normal images were obtained from 1637 cystoscopy procedures performed during the same period and judged by the same urologist as showing no tumor lesions. The tumor images came from 124 bladder endoscopies performed on 109 patients (97 men, 12 women; median age 74 years). All but one image (a papilloma) showed urothelial carcinoma, and 96.3% of tumor images were NMIBC (stages Ta, Tis, and T1), while only 3.5% were T2 muscle-invasive disease.
Tumor morphology and size distribution: Annotation data categorized lesions as elevated in 265 images (61.5%), flat in 76 images (17.6%), and mixed (both elevated and flat) in 90 images (20.9%). Regarding tumor size relative to the overall image, 10.2% of images had lesions occupying less than 10% of the frame, 56.9% had lesions occupying 10-50%, and 32.9% had lesions filling more than 50%. The grade distribution was roughly balanced, with 45.3% low-grade and 54.7% high-grade tumors by the 1973 WHO classification.
Data split: The full dataset was randomly divided into training and test sets at an 8:2 ratio, yielding 344 tumor and 1336 normal training images, and 87 tumor and 335 normal test images. This split ensured that the model was evaluated on images it had never encountered during training.
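The paper does not publish its splitting code; a minimal sketch of an 8:2 split with scikit-learn might look like the following (the `stratify` option and the filenames are assumptions for illustration, used here to keep the tumor/normal ratio comparable across the two sets):

```python
# Illustrative 8:2 train/test split. `stratify` and the filenames are
# assumptions, not details reported in the paper.
from sklearn.model_selection import train_test_split

labels = [1] * 431 + [0] * 1671                 # 1 = tumor, 0 = normal (2102 images)
paths = [f"img_{i:04d}.tif" for i in range(len(labels))]  # hypothetical names

train_paths, test_paths, train_y, test_y = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(train_paths), len(test_paths))        # roughly 1680 / 420 images
print(sum(train_y), sum(test_y))                # tumor images in each split
```

With a purely random (unstratified) split, as the paper describes, the per-class counts can drift slightly from the 8:2 ratio, which is consistent with the reported 344/87 tumor split.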
GoogLeNet backbone: The classifier was built on GoogLeNet, the CNN architecture that won first place at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. GoogLeNet was originally trained on the ImageNet dataset of 1.2 million natural images across 1000 object categories, giving it a rich set of pre-learned visual features. The authors employed transfer learning, taking the pre-trained network parameters and using them as the initial weights for learning cystoscopic image features. This approach is analogous to how physicians first learn to see general visual patterns before specializing in endoscopic diagnosis through clinical training.
Transfer learning and fine-tuning: All network parameters of the pre-trained GoogLeNet were fine-tuned using the Adam optimizer with a learning rate of 1e-5 over 150 epochs. The final classification layer was a multilayer perceptron (MLP) discriminator that output a score between 0 (normal) and 1 (tumor). Transfer learning was critical here because gathering millions of medical images for training from scratch is impractical. By leveraging features already learned from natural images, the model could extract meaningful patterns from only several thousand cystoscopy images.
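The paper states that Adam was used with a learning rate of 1e-5 but does not reproduce the update rule. For reference, a single Adam step can be sketched in NumPy; the beta and epsilon values below are Adam's conventional defaults, not values reported by the authors:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# For a unit gradient, the first step moves each parameter by about lr = 1e-5.
theta = np.zeros(3)
theta, m, v = adam_step(theta, np.ones(3), np.zeros(3), np.zeros(3), t=1)
```

In fine-tuning, this update is applied to every pre-trained GoogLeNet parameter rather than to randomly initialized weights, which is what lets the small learning rate gently adapt ImageNet features to cystoscopic images over 150 epochs.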
Data augmentation: To address the class imbalance between tumor images (344 in training) and normal images (1336 in training), the authors augmented the training data by generating new images through random rotation and blurring of original images. This augmentation brought the tumor-to-normal ratio in the training set to 1:1, preventing the model from developing a bias toward classifying images as normal simply because normal images were more prevalent.
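The paper specifies random rotation and blurring but not the exact implementation. A minimal NumPy sketch of the idea follows, where 90-degree rotations and a simple 3x3 box blur stand in for whatever transforms the authors actually used:

```python
import numpy as np

def augment(img, rng):
    """Return a randomly rotated and optionally blurred copy of img (H, W, C)."""
    out = np.rot90(img, k=rng.integers(1, 4))   # rotate by 90, 180, or 270 degrees
    if rng.random() < 0.5:                      # blur roughly half the time
        pad = np.pad(out, ((1, 1), (1, 1), (0, 0)), mode="edge")
        # 3x3 box blur via shifted sums (a crude stand-in for Gaussian blur)
        out = sum(
            pad[i:i + out.shape[0], j:j + out.shape[1]]
            for i in range(3) for j in range(3)
        ) / 9.0
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                     # toy square "image"
aug = augment(img, rng)
```

Each original tumor image would be passed through such a transform repeatedly until the tumor class reaches parity (1:1) with the normal class in the training set.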
Implementation tools: The experiment was conducted in Python using Chainer, a deep learning framework, for model construction and training. Evaluation metrics including the confusion matrix and receiver operating characteristic (ROC) curve were computed using scikit-learn. The classifier's performance was measured by true-positive rate and false-positive rate across varying decision thresholds, plotted as the ROC curve, with the area under the curve (AUC), maximum Youden index, sensitivity, and specificity reported.
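On a toy example, the evaluation pipeline described here can be reproduced directly with scikit-learn (the labels and scores below are made-up illustrative values, not data from the study):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])               # 0 = normal, 1 = tumor
y_score = np.array([0.1, 0.4, 0.35, 0.8])     # classifier outputs in [0, 1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)                       # area under the ROC curve
youden = np.max(tpr - fpr)                    # maximum Youden index J = TPR - FPR
best_threshold = thresholds[np.argmax(tpr - fpr)]
```

Sweeping the decision threshold traces out the ROC curve, and the threshold maximizing TPR minus FPR (the Youden index) is a common choice for reporting a single sensitivity/specificity operating point, as the authors do.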
Primary outcomes: On the test set of 422 images (87 tumor, 335 normal), the CNN classifier achieved an area under the ROC curve of 0.98, a maximum Youden index of 0.837, sensitivity of 89.7%, and specificity of 94.0%. The confusion matrix showed 78 true positives, 315 true negatives, 20 false positives, and 9 false negatives. These results represent a substantial improvement over white-light cystoscopy alone, for which reported sensitivity and specificity are only about 60% and 70%, respectively.
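The reported summary statistics follow directly from the confusion matrix; a quick check using only the counts quoted above confirms the arithmetic:

```python
tp, tn, fp, fn = 78, 315, 20, 9          # counts reported on the test set

sensitivity = tp / (tp + fn)             # 78 / 87  = 0.897 (to 3 decimals)
specificity = tn / (tn + fp)             # 315 / 335 = 0.940 (to 3 decimals)
youden = sensitivity + specificity - 1   # 0.837, matching the reported index

print(round(sensitivity, 3), round(specificity, 3), round(youden, 3))
```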
False negative analysis: Of the 9 false-negative cases (tumors the AI missed), 6 were early-stage elevated lesions classified as Ta tumors, 2 were flat Ta lesions, and 1 was a T1 tumor with mixed elevated and flat morphology. Critically, 8 of these 9 missed tumors were small lesions occupying less than 10% of the overall image. This finding indicates that small lesion size, rather than tumor type or morphology, was the dominant factor behind misclassification. These small lesions are also the ones most commonly missed by human observers during routine cystoscopy.
False positive analysis: The system generated 20 false-positive results (normal images incorrectly classified as containing tumors). While the paper does not extensively detail the characteristics of these false-positive images, the 94.0% specificity means that only about 6% of normal images triggered a false alarm. In a clinical support context, a modest false-positive rate is generally more acceptable than missed tumors, as false positives would prompt closer inspection rather than missed diagnoses.
Performance by lesion size: When stratified by the proportion of the image occupied by the tumor, smaller lesions proved significantly harder to classify. For images where the lesion occupied less than 10% of the frame, the AUC was 0.88 and the Youden index was only 0.62. For lesions occupying 10-50%, AUC was 0.88 with a Youden index of 0.90. For lesions filling more than 50%, the AUC was 0.88 with a Youden index of 0.92. The consistent AUC of 0.88 across size groups suggests similar discriminative capacity overall, but the markedly lower Youden index for small lesions reflects difficulty in finding an optimal classification threshold when tumors are tiny. Only about 10% of images contained these small lesions, likely limiting the model's opportunity to learn their features.
Performance by T stage: The classifier performed strongly across nearly all tumor stages. For Ta tumors (the most common, at 76.5% of tumor images), AUC was 0.98 and the Youden index was 0.84. For T1 tumors, AUC was 0.98 with a Youden index of 0.86. Notably, the combined Ta+T1 category and the T2 category achieved a perfect AUC of 1.00 and Youden index of 1.00, meaning every image in these categories was correctly classified. Tis (carcinoma in situ) had AUC 0.98 and Youden index 0.96, and combined Ta+Tis images achieved AUC 0.99 and Youden index 0.98. These results are particularly encouraging for CIS detection, which is notoriously difficult under white light.
Performance by tumor form: Classification accuracy varied modestly with tumor morphology. Elevated lesions had AUC 0.98 and Youden index 0.85. Flat lesions showed AUC 0.96 and Youden index 0.87. Mixed lesions (containing both elevated and flat components) performed best with AUC 0.99 and Youden index 0.92. The slightly lower AUC for flat lesions (0.96 vs. 0.98-0.99) reflects the inherent difficulty of detecting lesions that lack the papillary or elevated morphology that makes tumors more visually distinct. Still, even flat lesions were classified with high accuracy, suggesting the CNN learned subtle texture and color features beyond gross morphology.
Clinical significance: The authors emphasize that intravesical recurrence of NMIBC is driven largely by dissemination of tumor cells, expansion of precancerous lesions, and missed micro-disseminated daughter tumors rather than de novo development. Ensuring complete resection of these difficult-to-see lesions during TUR-BT is therefore critical. The proposed AI system, by objectively evaluating cystoscopic images with an AUC of 0.98, offers a tool that could reduce the diagnostic miss rate and improve the quality of endoscopic treatment.
Transfer learning as a practical solution: One of the key contributions of this study is demonstrating that transfer learning can achieve high diagnostic accuracy with a relatively small medical image dataset. Training an AI system from scratch typically requires millions of high-quality labeled images, which is impractical for most medical applications. By transferring features learned from 1.2 million general images in ImageNet and then fine-tuning on approximately 1700 cystoscopy images, the authors achieved robust classification performance. This mirrors human learning, as the authors note: we first learn general visual recognition from birth, then specialize through medical training.
Broader AI context in endoscopy: At the time of this study (2020), AI-assisted endoscopic diagnosis had already been clinically applied in gastroenterology for tasks such as gastric cancer detection, Helicobacter pylori infection evaluation, and colon polyp classification. In urology, however, only one prior study had applied deep learning to urologic endoscopy. This paper represents an early and important contribution to the field, demonstrating that CNN-based systems can objectively classify bladder tumors from standard white-light cystoscopy images without requiring specialized imaging modalities like NBI or PDD.
Support, not replacement: The authors are explicit that their system is not intended to replace urologists' diagnoses. Rather, they envision that an automatic visual detection method for cystoscopic images, once established, could support real-time cystoscopy by processing not only still images but also video feeds. This would help with accurate identification of tumor boundaries and reduction of the diagnostic miss rate, particularly for less experienced physicians, whose recurrence rates after TUR-BT are known to be higher than those of experienced specialists.
Annotation limitations: All annotation data were prepared by a single urologist, which ensures internal consistency but introduces potential bias. The possibility that this annotator overlooked some lesions cannot be ruled out, and the study did not verify how multiple physicians would classify the test images as tumor or normal. Additionally, non-tumor abnormalities such as inflammation-induced changes in the bladder mucosa were not included in the training data, meaning the model was not trained to distinguish between tumors and inflammatory changes that could mimic tumor appearance.
Small lesion challenge: The most significant performance gap was for small lesions occupying less than 10% of the image, where the Youden index dropped to 0.62 compared to 0.90-0.92 for larger lesions. Only about 10% of the total image set contained these small tumors, limiting the data available for the model to learn their subtle features. These small lesions are also the ones most likely to be overlooked by human observers, making this a critical area for improvement. The authors acknowledge that AI performance is fundamentally dependent on the amount and quality of training data.
Image quality factors: Practical challenges in cystoscopy, including urine turbidity, out-of-focus images, and variable distances between the endoscope and the bladder wall, may all affect classification difficulty. While turbid and out-of-focus images were excluded from this study, a clinically deployed system would need to handle these conditions robustly. The study also used only still images, whereas real clinical cystoscopy involves continuous video examination.
Future directions: The authors identify several avenues for improvement. Learning from NBI, PDD, and high-quality endoscopic images such as 4K resolution could help detect lesions that are difficult to see under standard white light. New algorithms and additional images representing a greater variety of lesion types, sizes, and appearances are needed to close the accuracy gap for small and flat tumors. Extending the system from still image classification to real-time video analysis during live cystoscopy is the ultimate clinical goal, which would enable continuous AI-assisted surveillance throughout the procedure.
Single-institution scope: All data came from one hospital (University of Tsukuba), so external validation at other institutions is essential before clinical deployment. Differences in endoscope models, image capture protocols, patient demographics, and tumor prevalence across centers could all affect generalizability. Multi-center studies with larger and more diverse datasets would strengthen the evidence base for this approach.