Bladder cancer is the ninth most common malignancy worldwide, with approximately 430,000 new diagnoses each year. Standard diagnosis and ongoing surveillance depend on white light cystoscopy (WLC), a procedure performed over 2 million times annually in the USA and Europe alone. When a suspicious lesion is spotted, the patient undergoes transurethral resection of bladder tumor (TURBT) in the operating room for tissue diagnosis and staging. Non-muscle-invasive bladder cancer accounts for roughly 75% of new diagnoses and is typically managed endoscopically, but high recurrence rates demand frequent surveillance and repeat interventions.
The detection gap: Despite being the standard of care, WLC misses up to 20% of bladder tumors. In patients with multifocal disease, incomplete initial resection occurs in up to 40% of cases. Blue light cystoscopy, which uses intravesical hexaminolevulinate to make tumors fluoresce, can detect papillary tumors and flat lesions that WLC misses. However, blue light cystoscopy requires preoperative drug instillation and specialized fluorescence cystoscopes, and adoption has remained modest despite demonstrated benefit.
The AI opportunity: Recent advances in deep learning, particularly convolutional neural networks (CNNs), have demonstrated expert-level performance in image classification tasks across multiple medical fields. CNNs learn complex visual patterns through successive convolutional and pooling layers, making them well-suited to identify tumors in cystoscopy video frames. This study from Stanford University and the Chinese University of Hong Kong aimed to develop CystoNet, a CNN-based image analysis platform for augmented, real-time bladder cancer detection during cystoscopy and TURBT.
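The conv-and-pool building block described above can be sketched in plain numpy. This is an illustrative toy, not CystoNet's actual architecture (which the study does not specify here); the kernel and input are invented for demonstration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Elementwise nonlinearity applied after each convolution."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling, reducing spatial resolution."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# One conv -> ReLU -> pool stage on a toy 8x8 "frame":
frame = np.arange(64, dtype=float).reshape(8, 8)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude column-difference filter
features = max_pool(relu(conv2d(frame, edge_kernel)))
print(features.shape)  # (3, 3)
```

Stacking many such stages, with learned rather than hand-written kernels, is what lets a CNN build up from edges to the complex textures of tumor tissue.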
Study design: The researchers recruited patients undergoing cystoscopy or TURBT between 2016 and 2019, recording white light videos. Video frames containing histologically confirmed papillary urothelial carcinoma were selected and manually annotated using the LabelMe tool. The algorithm was developed on a dataset of 100 patients (95 for training, 5 for testing) and then prospectively validated in an additional 54 patients. This prospective validation design is a notable strength, as most AI-cystoscopy studies at the time relied on retrospective analysis only.
Development dataset composition: The algorithm development dataset consisted of 141 videos from 100 patients who underwent TURBT. From these videos, frames containing pathologically confirmed papillary urothelial carcinoma were selected and tumor boundaries were manually outlined using LabelMe, a standard image annotation tool. The training set contained 2,335 frames of normal or benign bladder mucosa and 417 labeled frames containing histologically confirmed papillary urothelial carcinoma. Flat lesions were excluded from the development dataset because their margins could not be annotated accurately on cystoscopy frames.
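To make the annotation step concrete, the sketch below parses a hypothetical polygon annotation in the JSON layout produced by the open-source labelme tool and reduces it to a bounding box; the label name and coordinates are invented for illustration:

```python
import json

# A minimal, hypothetical annotation (one outlined papillary tumor per frame)
# in the polygon JSON layout used by the open-source labelme tool.
annotation = json.loads("""
{
  "shapes": [
    {"label": "papillary_tumor",
     "shape_type": "polygon",
     "points": [[120, 80], [180, 75], [200, 140], [130, 150]]}
  ],
  "imageHeight": 480,
  "imageWidth": 640
}
""")

def tumor_bounding_boxes(ann, label="papillary_tumor"):
    """Reduce each annotated polygon to an axis-aligned (xmin, ymin, xmax, ymax) box."""
    boxes = []
    for shape in ann["shapes"]:
        if shape["label"] != label:
            continue
        xs = [p[0] for p in shape["points"]]
        ys = [p[1] for p in shape["points"]]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

print(tumor_bounding_boxes(annotation))  # [(120, 75, 200, 150)]
```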
Exclusion learning: In addition to tumor annotations, the researchers labeled anatomical landmarks and artifacts for exclusion learning. Specifically, the bladder neck, ureteral orifices, and air bubbles were all labeled so that the algorithm could learn to distinguish these common cystoscopic features from actual tumors. This is an important design choice, as these structures can mimic or obscure tumors during cystoscopy and are a frequent source of false positives in automated image analysis systems.
Tumor diversity in training data: The development dataset included a broad range of tumor characteristics. Among the 95 training patients, there were 142 tumors: 42 low-grade Ta, 54 high-grade Ta, 15 high-grade T1, and 9 high-grade T2 lesions. The 5-patient test set contained 10 tumors (1 low-grade Ta, 7 high-grade Ta, 2 high-grade T1). This mixture of tumor grades and stages ensured the model was exposed to the visual heterogeneity of bladder cancer during training.
Validation dataset: For prospective validation, videos from an additional 54 patients were collected. Of these, 31 patients had normal or benign bladder mucosa (the normal cohort), while 23 patients comprised the tumor cohort with 26 videos and 44 tumors identified. The tumor cohort included 13 low-grade Ta, 15 high-grade Ta, 9 high-grade T1, 3 high-grade T2, 3 carcinoma in situ (CIS), and 1 inverted papilloma. Importantly, all patients undergoing cystoscopy or TURBT for bladder cancer evaluation were eligible, including patients with nonpapillary tumors, making the validation cohort representative of real clinical practice.
Platform design: CystoNet is described as an image analysis platform based on convolutional neural networks (CNNs). Unlike systems that classify entire images as "tumor" or "normal," CystoNet performs tumor segmentation and localization. The system generates both segmentation overlays (blue shading indicating pixel-level tumor boundaries) and alert boxes (red bounding boxes flagging regions of interest). This dual-output approach enables two complementary modes of clinical feedback: precise tumor boundary delineation for surgical planning and rapid visual alerts for diagnostic screening.
Probability threshold optimization: The model outputs a probability score for each pixel or region indicating the likelihood of cancer presence. The researchers evaluated performance across a range of probability thresholds and selected 0.98 as the operating threshold for cancer presence. This is a notably high threshold, reflecting a deliberate design choice to minimize false positives. In clinical cystoscopy, false alerts could lead to unnecessary biopsies and erode clinician trust in the system, so a high threshold that maintains specificity while preserving acceptable sensitivity is pragmatically important.
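Schematically, applying the 0.98 operating threshold to a per-pixel probability map yields both outputs described above: a binary segmentation mask and, when any pixel crosses the threshold, an enclosing alert box. The function below is a hypothetical sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

THRESHOLD = 0.98  # operating point chosen by the authors to suppress false alerts

def segment_and_alert(prob_map, threshold=THRESHOLD):
    """Threshold a per-pixel tumor probability map into a binary mask and,
    if any pixel exceeds the threshold, an enclosing alert bounding box."""
    mask = prob_map >= threshold
    if not mask.any():
        return mask, None  # no alert on this frame
    rows, cols = np.where(mask)
    box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))
    return mask, box  # (xmin, ymin, xmax, ymax)

# Toy 6x6 probability map with a small high-confidence region:
prob = np.full((6, 6), 0.05)
prob[2:4, 3:5] = 0.99
mask, box = segment_and_alert(prob)
print(mask.sum(), box)  # 4 (3, 2, 4, 3)
```

Raising the threshold shrinks the mask and suppresses borderline alerts, which is exactly the specificity-over-sensitivity trade-off the 0.98 operating point encodes.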
Development test set performance: In the 5-patient development test set, the per-frame sensitivity for tumor detection was 88.2% (95% CI, 83.0-92.2%), and 9 of 10 tumors were accurately identified. The per-frame specificity was 99.0% (95% CI, 98.2-99.5%). These initial results demonstrated that CystoNet could detect the vast majority of tumors while generating very few false alarms, establishing a strong baseline before prospective validation.
Video-based analysis advantage: A key architectural distinction of CystoNet is that it was designed from the outset for video frame analysis rather than static image classification. Prior AI-assisted WLC systems had focused on analysis of individual bladder images, limiting their real-time clinical applicability. Because CystoNet processes video frames, it can be integrated directly into the live cystoscopy workflow, providing dynamic overlays of regions of interest as the cystoscope navigates the bladder. This represents a meaningful step toward practical clinical deployment compared to static image classifiers.
Per-frame metrics: In the prospective validation dataset of 54 patients, CystoNet achieved a per-frame sensitivity of 90.9% (95% CI, 90.3-91.6%) and per-frame specificity of 98.6% (95% CI, 98.5-98.8%). Sensitivity improved over the development test set (88.2%) while specificity remained essentially unchanged (99.0%), suggesting the algorithm generalized well to new patients. The validation dataset contained 7,542 tumor frames and 31,330 normal frames in the tumor cohort, plus 20,643 normal frames in the normal cohort, providing a substantial volume of data for performance evaluation.
Per-tumor sensitivity: The clinically more relevant metric, per-tumor sensitivity, was 95.5% (95% CI, 84.5-99.4%). This means CystoNet successfully detected 39 of 41 papillary tumors and all 3 carcinoma in situ (CIS) lesions in the prospective cohort. The per-tumor sensitivity is higher than the per-frame sensitivity because a tumor only needs to be detected in at least one video frame to count as a true positive. Since cystoscopy captures multiple frames of each lesion from different angles and distances, even if the algorithm misses a tumor in some frames, it typically detects it in others.
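The frame-to-tumor aggregation logic can be made concrete with a small sketch (illustrative data, not from the study): a tumor counts as detected if any of its frames is flagged, which is why the per-tumor figure exceeds the per-frame one.

```python
def per_tumor_sensitivity(frame_detections):
    """frame_detections maps tumor id -> list of per-frame booleans (was the
    tumor flagged in that frame?). A tumor counts as detected if it was
    flagged in at least one frame."""
    detected = sum(any(frames) for frames in frame_detections.values())
    return detected / len(frame_detections)

# Toy example: tumor "b" is missed in two of three frames but still counts
# as detected because one frame flagged it; "c" is never flagged.
detections = {
    "a": [True, True, True],
    "b": [False, False, True],
    "c": [False, False, False],
}
print(per_tumor_sensitivity(detections))  # 0.6666666666666666 (2 of 3 tumors)
```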
False alert analysis: The rate of false alerts was remarkably low. The false-alert rate per cystoscopy did not differ significantly between normal and tumor-containing examinations (difference 0.1%; 95% CI, -0.9% to 1.4%; p = 0.856). In contrast, significantly more alerts overall were generated during cystoscopies containing a tumor than during normal examinations (difference 12.5%; 95% CI, 10.3-14.6%; p < 0.001). This pattern indicates that CystoNet's alert behavior was driven primarily by true tumor presence rather than noise or artifact, which is essential for clinical trust.
True positive and true negative counts: In the validation tumor cohort, CystoNet generated 6,857 true positive frames and 685 false negative frames. In terms of normal frames, there were 23,382 true negatives in the tumor cohort and 20,359 true negatives in the normal cohort. The corresponding false positive counts were 406 in the tumor cohort and 284 in the normal cohort. The low false positive rate across both cohorts underscores the algorithm's high specificity and practical reliability.
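As a quick sanity check, the reported per-frame sensitivity follows directly from these counts:

```python
# Validation tumor-cohort frame counts reported in the study:
tp, fn = 6857, 685            # tumor frames flagged / missed
total_tumor_frames = tp + fn  # 7,542 tumor frames in the tumor cohort

sensitivity = tp / total_tumor_frames
print(f"per-frame sensitivity = {sensitivity:.1%}")  # per-frame sensitivity = 90.9%
```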
Unexpected CIS detection: One of the most striking findings was CystoNet's ability to detect carcinoma in situ (CIS), despite being trained exclusively on papillary urothelial carcinoma. In the prospective validation cohort, all 3 CIS cases were accurately identified by the algorithm. CIS is a flat, high-grade lesion that is notoriously difficult to detect with white light cystoscopy because it does not protrude from the bladder surface like papillary tumors. The fact that CystoNet flagged these lesions suggests that the CNN learned to recognize subtle visual features common to both papillary and flat bladder cancers.
Clinical diversity of the validation set: The cystoscopy videos analyzed in both development and validation phases were representative of clinical practice. The validation cohort included low-grade and high-grade cancers, tumors ranging from a few millimeters to over 5 cm, both solitary and multifocal disease, and varying degrees of cystoscopic visibility. This diversity is important because real-world cystoscopy encounters a wide range of tumor presentations, and an AI system that only works on textbook-quality images would have limited clinical utility.
Representative examples of CystoNet output: The paper presents multiple examples of CystoNet's performance across different clinical scenarios. These include detection of small papillary tumors at the bladder dome, posterior wall, and anterior wall, as well as larger multifocal tumors. The system also detected a small tumor at the dome as seen from the bladder neck and identified a large papillary tumor and a multifocal papillary tumor with limited background contrast. One illustrative case showed CystoNet detecting a flat CIS lesion under white light that was also confirmed by photodynamic diagnosis under blue light cystoscopy.
False positive behavior: The paper describes an informative false positive case involving a small bladder diverticulum. CystoNet initially generated an alert for this structure, but as the cystoscope moved closer to inspect the area, the alerting box disappeared. This self-correcting behavior suggests that the algorithm's confidence recalibrated with improved visualization, which could serve as a clinical indicator of benign versus malignant findings. The authors note that further work is needed to determine algorithm performance across a larger variety of flat lesions.
Prior static image approaches: Before CystoNet, AI-assisted white light cystoscopy had focused on analysis of static bladder images rather than video. Gosnell et al. developed a color segmentation system that achieved good sensitivity for tumor identification but at the cost of a 50% false positive rate. False alerting at that level would be clinically unusable, eroding physician trust and prompting unnecessary biopsies. CystoNet's false positive rate of approximately 1.4% represents a dramatic improvement.
Curated atlas limitation: Eminaga et al. achieved high sensitivity and specificity for cystoscopy image classification using CNNs, but their model was trained and validated on a highly curated, previously published image atlas. This approach limits clinical translation because curated images may not reflect the variability of real-world cystoscopy, where lighting conditions, image quality, camera angles, and tissue distortion differ substantially from idealized atlas photographs. CystoNet's training on actual clinical cystoscopy videos avoids this generalization gap.
Video-based real-time integration: The critical advantage of CystoNet over prior systems is its foundation on cystoscopy video analysis. Because the algorithm was developed using video frames from real clinical procedures, integration of CystoNet in real time during cystoscopy and TURBT is feasible. Dynamic overlays of regions of interest hold promise for improving both diagnostic yield (finding more tumors) and thoroughness of bladder tumor resection (ensuring complete removal). This real-time capability distinguishes CystoNet from retrospective, static-image classifiers.
Blue light cystoscopy comparison: While the paper does not directly compare CystoNet to blue light cystoscopy in a head-to-head study, the context is significant. Blue light cystoscopy improves tumor detection and reduces recurrence but requires preoperative intravesical instillation of hexaminolevulinate and specialized fluorescence cystoscopes. CystoNet, by contrast, works with standard white light cystoscopy equipment and requires no preoperative preparation, drug administration, or specialized hardware. This makes it a potentially lower-cost, more easily adoptable adjunct imaging technology.
Definition of "normal": A key limitation acknowledged by the authors is the definition of normal bladder mucosa. While bladder tumors were defined histopathologically through biopsy confirmation, "normal" was based on cystoscopic interpretation without tissue diagnosis. This means some frames classified as normal could theoretically contain subclinical or microscopic disease not visible on cystoscopy. However, this reflects standard clinical practice, where normal-appearing mucosa is not routinely biopsied, and it would be impractical to require histological confirmation of every normal frame.
Training set size: The number of patients in the training set was relatively small at 95 patients. However, the analysis of cystoscopy videos from these patients provided 2,752 frames for algorithm development, which proved sufficient to achieve excellent performance in distinguishing cancer from benign lesions. The authors attribute this to the relative homogeneity of gross tumor structure in papillary urothelial carcinoma. For more complex tasks such as subclassification of benign and malignant lesions or grading, a larger training set would be needed.
Field of view constraint: For CystoNet to detect bladder cancer, the tumor must be within the visual field of the cystoscope. This means the system cannot compensate for poor cystoscopic technique, incomplete bladder surveys, or tumors located in anatomically difficult positions such as the anterior wall or dome. While CystoNet can improve detection of tumors that the cystoscope visualizes, it cannot find tumors that are never brought into view. This underscores that AI augmentation complements, rather than replaces, thorough cystoscopic examination technique.
Clinical potential: Cystoscopic tumor detection is affected by clinician experience, clarity of visual field, and tumor characteristics including size, morphology, and location. CystoNet-augmented cystoscopy has the potential to aid in training and diagnostic decision making, standardize performance across providers in a noninvasive fashion, and do so without requiring costly specialized equipment. As demand rises from an aging population, deep learning algorithms like CystoNet may serve to improve the quality and availability of cystoscopy globally by enabling providers with limited experience to perform high-quality examinations.
Quality control and scalability: The authors envision CystoNet facilitating a streamlined quality control process for cystoscopy. Just as colonoscopy quality metrics (such as adenoma detection rate) have improved screening outcomes, AI-augmented cystoscopy could establish standardized benchmarks for tumor detection completeness. The system could also serve as a training tool for urology residents, providing objective real-time feedback on lesion identification. Despite the study's limitations, CystoNet represents a critical step toward computer-augmented cystoscopy and TURBT, with the next steps being larger multicenter validation trials and integration into real-time surgical workflows.