Real-Time Deep Learning Bladder Tumor Detection

Published in Diagnostics, 2024.

Plain-English Explanations
Pages 1-2
Why Cystoscopy Needs AI and What This Study Set Out to Do

The clinical problem: Cystoscopy and transurethral resection of bladder tumor (TURBT) are the primary methods for identifying and treating bladder lesions. Yet modern cystoscopy still relies on the same white-light technique developed over eight decades ago. This reliance introduces significant subjectivity: detection rates vary between examiners of different experience levels, and even the same examiner can reach inconsistent conclusions across repeat procedures. Prior literature estimates that approximately 10 to 20% of bladder cancers are missed during routine cystoscopy, with reported sensitivity ranging from 68 to 100% and specificity from 57 to 97%.

Enhanced imaging techniques: Blue-light cystoscopy and narrow-band imaging have improved tumor detection rates relative to white light alone, but they still depend on subjective visual assessment by the endoscopist. Additionally, these enhanced modalities involve more complex procedural workflows, which limits their widespread adoption. The fundamental gap remains: there is no objective, automated system integrated into standard cystoscopy that can flag suspicious regions in real time.

Study objective: This diagnostic study, conducted at Peking Union Medical College Hospital between July 2022 and July 2023, enrolled 94 patients undergoing cystoscopy or TURBT. The investigators collected 102 white-light cystoscopy videos and extracted frames containing suspected bladder lesions for manual annotation. The primary goal was to evaluate the HRNetV2 deep learning model for real-time intelligent bladder lesion detection, with a particular focus on how image resolution affects diagnostic performance.

Design rationale: The study used semantic segmentation rather than simple object detection because segmentation precisely delineates the spatial boundaries of lesions, providing richer information for downstream classification and subgroup analysis. By targeting all suspicious lesions rather than confirmed tumors alone, the study aimed to build a foundation for comprehensive bladder cavity assessment that could assist clinicians in catching lesions the naked eye might miss.

TL;DR: White-light cystoscopy misses 10-20% of bladder cancers due to subjective visual interpretation. This study tested the HRNetV2 deep learning model on 102 cystoscopy videos from 94 patients to determine whether real-time semantic segmentation could improve bladder lesion detection, especially at varying image resolutions.
Pages 2-3
The HRNetV2 Architecture and FCN Decoder Pipeline

High-Resolution Network V2: The study employed HRNetV2, an enhanced version of the original High-Resolution Network (HRNet) first introduced in 2019. Unlike most deep learning architectures that progressively downsample feature maps (reducing spatial resolution at each stage), HRNetV2 maintains high-resolution representations throughout the entire forward pass. This architectural choice is critical for medical image segmentation, where preserving fine-grained positional detail is necessary to accurately delineate lesion boundaries.

Multi-scale feature extraction: HRNetV2 produces feature maps at four different scales (denoted s4, s8, s16, and s32 in the paper's architecture diagram). These multi-resolution feature maps capture both fine local details and broader contextual information. The feature maps are then unified to a consistent resolution using deconvolution operations. This approach resolves the common downsampling problem in segmentation networks, where spatial information is lost as the network deepens, and closes semantic gaps that arise when integrating information across different abstraction levels.
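The idea of unifying the four scales can be sketched in a toy example. This is a minimal illustration, not the paper's implementation: nearest-neighbor upsampling stands in for HRNetV2's learned deconvolutions, and the constant-valued "feature maps" are placeholders.

```python
def upsample_nearest(fmap, factor):
    """Upsample a 2-D grid (list of lists) by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

# Toy single-channel maps at strides 4, 8, 16, 32 of a 32x32 input.
sizes = {4: 8, 8: 4, 16: 2, 32: 1}  # stride -> spatial side length
maps = {s: [[float(s)] * n for _ in range(n)] for s, n in sizes.items()}

# Bring every map up to the stride-4 resolution (8x8 here), then stack
# the four values per pixel, as the decoder consumes them jointly.
unified = {s: upsample_nearest(m, sizes[4] // sizes[s]) for s, m in maps.items()}
fused = [[[unified[s][i][j] for s in (4, 8, 16, 32)]
          for j in range(sizes[4])] for i in range(sizes[4])]
```

After this step every spatial location carries information from all four abstraction levels, which is what lets the decoder produce a full-resolution, context-aware prediction.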

FCN decoder: The decoder portion of the network uses a Fully Convolutional Network (FCN) architecture, forming what the authors call the FCN_HRNetV2 segmentation network. The FCN restores the multi-scale feature maps back to the original image dimensions, producing pixel-level classification output. Each pixel is assigned a class label (lesion or background), enabling precise spatial delineation of bladder tumors within each cystoscopy frame.

Why HRNetV2 over alternatives: The authors note that HRNetV2 significantly enhances contextual feature extraction and improves semantic segmentation accuracy compared to conventional encoder-decoder designs that use a bottleneck structure. While HRNetV2 had shown strong performance in non-medical tasks such as mask detection for social distancing, this study represents one of the first applications of HRNetV2 to medical image semantic segmentation, making the architecture choice both novel and scientifically motivated.

TL;DR: The study used HRNetV2 as the encoder paired with an FCN decoder. HRNetV2 maintains high-resolution feature maps across four scales (s4 through s32) throughout the network, avoiding the information loss typical of bottleneck designs. This is among the first applications of HRNetV2 to medical image segmentation.
Page 3
Data Collection, Preprocessing, and Model Training Pipeline

Dataset composition: The study recruited 94 patients and collected 102 white-light cystoscopy videos. From these videos, frames containing suspected bladder lesions were extracted at fixed intervals and annotated frame by frame. Two urologists independently verified all annotations. In total, 33,657 frames were manually annotated, outlining 37,947 individual targets. The data was split into training and test sets at a 4:1 ratio: 75 patients (82 videos, 26,654 frames, 30,103 targets) for training and 19 patients (20 videos, 7,003 frames, 7,844 targets) for testing.
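Because multiple frames come from each patient, a sound 4:1 split must be made at the patient level so that frames from the same patient never appear in both sets. A minimal sketch of that grouping (the function name and seed are illustrative, not from the paper):

```python
import random

def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split at the patient level so no patient's frames leak across sets."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test = set(ids[:n_test])
    return [p for p in ids if p not in test], sorted(test)

# With 94 patients and a 4:1 ratio this yields the paper's 75/19 split.
train, test = split_by_patient(range(94))
```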

Input preprocessing: Original video frames came in three resolutions: 544 x 672, 480 x 720, and 1080 x 1920. Before feeding into the network, all images were standardized across RGB channels using mean values of (123.675, 116.28, 103.53) and variance values of (57.12, 57.12, 57.12). During training, images were cropped to 512 x 512 dimensions. Data augmentation included flipping, contrast enhancement, and brightness adjustment to improve model robustness against the natural variation in lesion size, angle, and lighting encountered during cystoscopy.
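The standardization step can be sketched as follows. One caveat: the paper reports (57.12, 57.12, 57.12) as "variance values," but in the usual `(x - mean) / std` formula they are applied as standard deviations, which is what this sketch assumes.

```python
MEAN = (123.675, 116.28, 103.53)  # per-channel RGB means from the paper
STD = (57.12, 57.12, 57.12)       # reported as "variance"; used here as std

def normalize_pixel(rgb):
    """Standardize one RGB pixel channel-wise: (x - mean) / std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, MEAN, STD))

def center_crop(image, size=512):
    """Crop an H x W image (rows of pixels) to size x size from the center."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]
```

A pixel exactly at the channel mean maps to zero, and one standard deviation above the mean maps to one, putting all three source resolutions on a common numeric scale before cropping.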

Training configuration: The model was trained on a single NVIDIA Tesla V100 Tensor Core GPU with a batch size of 16. The backbone used an ImageNet-pretrained model optimized with the stochastic gradient descent (SGD) optimizer at an initial learning rate of 0.01. A polynomial learning rate decay strategy was employed with a minimum learning rate of 1 x 10^-4. A feature pyramid structure integrated feature maps of different scales to enhance performance across diverse scenarios. The final model was produced after 80,000 iterations.
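The polynomial decay from 0.01 down to 1e-4 over 80,000 iterations can be sketched as below. The decay exponent (`power=0.9`) is a common default and an assumption here; the paper does not report it.

```python
def poly_lr(step, base_lr=0.01, min_lr=1e-4, max_steps=80_000, power=0.9):
    """Polynomial decay from base_lr toward min_lr over max_steps.

    power=0.9 is the conventional default for poly schedules in
    segmentation work; the paper does not state the value it used.
    """
    frac = min(step / max_steps, 1.0)
    return (base_lr - min_lr) * (1.0 - frac) ** power + min_lr

# Learning rate at the start, midpoint, and end of training.
schedule = [poly_lr(s) for s in (0, 40_000, 80_000)]
```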

Post-processing: After the network produced its segmentation output, a morphological post-processing step was applied using a 29 x 29 kernel. An opening operation removed edge spikes in predicted regions, followed by a closing operation to eliminate small fragmented areas. This cleanup step helps produce cleaner lesion boundaries for clinical interpretation.
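The opening-then-closing sequence can be illustrated on a binary mask. This is a from-scratch sketch, not the paper's code: a 3x3 kernel on a tiny grid stands in for the 29 x 29 kernel on full-size frames, and real pipelines would typically use an optimized library routine instead.

```python
def erode(mask, k=3):
    """Binary erosion with a k x k square kernel (zero padding)."""
    h, w, r = len(mask), len(mask[0]), k // 2
    get = lambda i, j: mask[i][j] if 0 <= i < h and 0 <= j < w else 0
    return [[int(all(get(i + di, j + dj)
                     for di in range(-r, r + 1) for dj in range(-r, r + 1)))
             for j in range(w)] for i in range(h)]

def dilate(mask, k=3):
    """Binary dilation with a k x k square kernel."""
    h, w, r = len(mask), len(mask[0]), k // 2
    get = lambda i, j: mask[i][j] if 0 <= i < h and 0 <= j < w else 0
    return [[int(any(get(i + di, j + dj)
                     for di in range(-r, r + 1) for dj in range(-r, r + 1)))
             for j in range(w)] for i in range(h)]

def open_then_close(mask, k=3):
    """Opening (erode, dilate) removes spikes and isolated speckles;
    closing (dilate, erode) then smooths the surviving regions."""
    opened = dilate(erode(mask, k), k)
    return erode(dilate(opened, k), k)
```

On a mask containing a solid 5x5 lesion plus one stray pixel, the opening deletes the stray pixel while the closing leaves the lesion region intact.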

TL;DR: The dataset comprised 33,657 annotated frames (37,947 targets) from 102 cystoscopy videos across 94 patients, split 4:1 for training and testing. Training used an ImageNet-pretrained HRNetV2 backbone with SGD optimization on a V100 GPU for 80,000 iterations. Post-processing with morphological operations cleaned up the segmentation output.
Pages 3-4
Overall Diagnostic Performance and Resolution-Based Analysis

Overall test performance: On the held-out test set of 7,003 frames (19 patients), the HRNetV2 model achieved an overall sensitivity of 91.6% and precision of 91.3%, with a mean Dice (mDice) score of 80.3%. The model evaluation used an intersection over union (IOU) threshold of 0.1, deliberately set low to prioritize bladder lesion detection over precise localization. This threshold choice reflects the clinical reality that identifying that a lesion exists matters more than perfectly outlining its edges during a live cystoscopic procedure.
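A pixel-wise IOU check at the paper's 0.1 threshold can be sketched as follows; the function names are illustrative.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks (nested lists)."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p & g
            union += p | g
    return inter / union if union else 0.0

def counts_as_detection(pred, gt, threshold=0.1):
    """At the paper's IOU threshold of 0.1, even a rough overlap
    between prediction and ground truth counts as a hit."""
    return iou(pred, gt) >= threshold
```

For example, a prediction overlapping the ground truth on just one of five union pixels gives IOU 0.2 and still counts as a detection, whereas a stricter threshold such as 0.5 would reject it.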

High-resolution vs. low-resolution subgroups: Test frames were categorized into high-resolution and low-resolution groups based on lesion clarity and contrast with surrounding tissues. High-resolution frames typically showed clear morphology with intact structures resembling aquatic grass or cauliflower shapes. Low-resolution frames featured incomplete appearance due to excision, unclear features from camera distance, or turbid fluid interference. In the high-resolution group (5,897 frames, 6,522 targets), the model achieved 94.8% sensitivity, 94.4% precision, and an mDice of 84.7%. In the low-resolution group (1,106 frames, 1,322 targets), performance dropped to 75.6% sensitivity, 74.8% precision, and an mDice of 56.6%.

Video increment experiment: The authors conducted a data scaling experiment to measure how additional training data affected model performance. With only 15 training videos (5,421 frames), the model achieved 76.7% sensitivity but just 39.2% precision, indicating a high false-positive rate. Increasing to 68 videos (22,270 frames) improved performance dramatically to 91.1% sensitivity and 90.1% precision. The final model trained on all 82 videos (29,528 frames) reached the best results of 91.6% sensitivity and 91.3% precision. This experiment confirmed that model stability improves with increasing data volume, a finding consistent with the broader deep learning literature.
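The precision figures translate directly into a false-positive burden. The arithmetic below is a back-of-the-envelope illustration (not a calculation from the paper): from the definition precision = TP / (TP + FP), the number of false positives per true positive is 1/precision - 1.

```python
def fp_per_tp(precision):
    """False positives per true positive implied by a precision value,
    from precision = TP / (TP + FP)."""
    return 1.0 / precision - 1.0

early = fp_per_tp(0.392)  # 15-video model: roughly 1.55 FPs per hit
final = fp_per_tp(0.913)  # 82-video model: roughly 0.10 FPs per hit
```

In other words, the small-data model flagged more false alarms than true lesions, while the full model produced about one false alarm per ten correct detections.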

Clinical significance: The overall sensitivity of 91.6% means the model successfully detected more than 9 out of every 10 bladder lesions in the test set. Given that conventional white-light cystoscopy misses 10-20% of cancers, this level of performance suggests HRNetV2 could serve as a meaningful real-time second opinion for endoscopists, particularly when image quality is high.

TL;DR: HRNetV2 achieved 91.6% sensitivity and 91.3% precision overall, with an mDice of 80.3%. High-resolution frames reached 94.8% sensitivity and 84.7% mDice, while low-resolution frames dropped to 75.6% sensitivity and 56.6% mDice. A data scaling experiment showed precision jumped from 39.2% to 91.3% as training videos increased from 15 to 82.
Pages 4-5
Impact of Lesion Size on Detection Performance

Ground truth area proportion: The authors investigated how the proportion of the image occupied by the ground truth (GT) annotation affected diagnostic performance. This is a critical analysis because bladder lesions vary enormously in apparent size depending on camera distance, lesion stage, and whether partial resection has already occurred. The GT area proportion was divided into ten bins ranging from 0 to 1.0 (where 1.0 means the lesion fills the entire frame).

Small lesion challenge: When the GT area proportion was very small (0 to 0.02, representing lesions occupying less than 2% of the frame), the model performed poorly with a recall of only 56.4% and precision of 60.6%. This means the model missed nearly half of the smallest lesions. This is clinically important because small, early-stage lesions are precisely the targets where early detection matters most. The sensitivity improved dramatically as lesion size increased: 93.3% at the 0.02-0.05 range, 96.2% at 0.05-0.1, and peaking at 99.0-100% for medium-to-large lesions (0.1-0.5 GT area proportion).
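The binning scheme can be sketched as below. The bin edges are inferred from the ranges quoted in the text (0-0.02, 0.02-0.05, 0.05-0.1, then 0.1-wide steps up to 0.7, and 0.7-1.0), giving ten bins in total; the exact edges are an assumption, not stated verbatim in the paper.

```python
# Bin edges inferred from the ranges quoted in the text (ten bins total).
EDGES = [0.0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0]

def gt_area_proportion(mask):
    """Fraction of frame pixels covered by the ground-truth annotation."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total

def area_bin(proportion):
    """Index of the bin (0-9) a GT area proportion falls into."""
    for i in range(len(EDGES) - 1):
        if proportion <= EDGES[i + 1]:
            return i
    return len(EDGES) - 2
```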

Large lesion anomaly: Interestingly, performance declined again when the GT area approached half of the image (0.5-0.7 range), with sensitivity dropping to 81.6% and 55.1% for the 0.5-0.6 and 0.6-0.7 bins respectively. However, the authors caution that these bins contained very few targets (114 and 49 respectively), making the estimates unreliable. When the lesion nearly filled the frame (0.7-1.0), all targets were correctly detected, though only 3 targets fell in this category.

Practical implications: The GT area analysis reveals a clear pattern: the model excels when lesions occupy a moderate portion of the frame (roughly 5-50%) but struggles with very small targets. In clinical practice, this suggests that endoscopists should ensure adequate camera distance to capture lesions at a size where the AI system performs optimally. It also highlights the need for future training data that is enriched with small-lesion examples to improve performance in this critical detection range.

TL;DR: Lesion size strongly affected detection: very small lesions (under 2% of frame area) had only 56.4% recall, while medium-sized lesions (10-50% of frame) achieved 99-100% sensitivity. Performance dropped for very large lesions occupying over half the frame, though sample sizes in those bins were too small to draw firm conclusions.
Pages 5-6
False Positives, False Negatives, and Image Quality Factors

Sources of false negatives: The study systematically analyzed the causes of missed detections (false negatives), which the authors correctly prioritize over false positives since missed lesions carry more severe clinical consequences. Three main factors drove false negatives: (1) targets positioned too close to the camera lens, creating out-of-focus images; (2) relatively small target sizes that fell below the model's effective detection threshold; and (3) atypical target features, particularly flat lesions that lack the distinctive cauliflower or aquatic-grass morphology the model was primarily trained on.

Sources of false positives: False-positive detections were primarily caused by two factors: (1) abnormal mucosal texture features that visually mimicked lesion patterns, and (2) misidentification of small non-lesion targets as tumors. These error modes are not unique to AI systems; human endoscopists also struggle to distinguish atypical mucosal textures from early neoplastic changes. However, the systematic nature of these errors in an AI model means they can be addressed through targeted training data augmentation.

Impact of the liquid environment: Unlike gastrointestinal endoscopy, which uses air to conduct light, cystoscopy operates in a liquid (water) medium. Cloudy urine, blood mist, and light scattering are common in clinical practice and degrade imaging quality. The study demonstrated that these visibility factors directly impact model performance: the substantial gap between high-resolution results (94.8% sensitivity) and low-resolution results (75.6% sensitivity) is largely attributable to fluid turbidity and related optical interference. The authors emphasize that maintaining good transparency of bladder cavity fluid is essential for effective AI-assisted detection.

Strategies for improvement: The authors suggest several approaches to mitigate these error sources. Moving the camera lens to adjust focus can partially address proximity-related false negatives. Multi-angle observation training data could improve identification of flat or morphologically ambiguous lesions. Real-time panoramic image stitching could help with very small lesions by providing a broader field of view. Image enhancement techniques such as noise filtering and image dehazing could further improve performance in low-visibility conditions.

TL;DR: False negatives were caused by targets too close to the lens, small size, and atypical morphology (especially flat lesions). False positives arose from abnormal mucosal textures and misidentified small targets. The liquid cystoscopy environment, with cloudy fluid and light scattering, was the dominant factor behind the 19-percentage-point sensitivity gap between high- and low-resolution frames.
Pages 6-7
Study Limitations, Context Among Competing Models, and Next Steps

Comparison with prior work: The authors situate their findings within the existing literature. Ikeda et al. used GoogLeNet on 1,671 normal and 431 tumor images and achieved 89.7% sensitivity and 94.0% specificity. Zhang et al. applied attention mechanisms to U-Net for bladder tumor segmentation and achieved an mDice of 82.7%, slightly above this study's overall 80.3%. Wu et al. developed the CAIDS framework using a pyramid scene parsing network trained on over 69,000 images from more than 10,000 patients, achieving accuracies above 97% across validation sets. A comparative study of eight deep neural networks for cystoscopy segmentation found the Pyramid Attention Network (PAN) model to be superior. These comparisons reveal that while HRNetV2's performance is competitive, especially at high resolution (94.8% sensitivity), other architectures trained on larger datasets have achieved higher benchmarks.

Single-center limitation: The most significant limitation is that this is a single-center study with a relatively limited dataset of 94 patients. While the 33,657 annotated frames provide reasonable training volume, all data came from Peking Union Medical College Hospital, which limits generalizability. The model has not been tested on cystoscopy equipment from other institutions, different patient populations, or varying procedural techniques. The authors acknowledge this and plan to collect more data and conduct multicenter validation studies.

Pathological type imbalance: The dataset lacked sufficient samples of certain pathological subtypes, particularly carcinoma in situ (CIS), which prevented the model from performing tumor classification beyond simple lesion detection. CIS is notoriously difficult to detect visually because it grows as a flat lesion without the papillary morphology that makes other bladder tumors more conspicuous. The absence of CIS from the training data is a meaningful gap, as CIS detection is one of the areas where AI could potentially add the most clinical value.

Model comparison gap: Although HRNetV2 showed satisfactory results, the authors acknowledge that its performance compared with other state-of-the-art architectures may not be optimal. No head-to-head comparison with alternative segmentation models (such as U-Net variants, PSPNet, or PAN) was performed on the same dataset. Future research should focus on both algorithm optimization and architecture benchmarking to identify the best-performing model for bladder lesion segmentation.

Future directions: The study lays the groundwork for several next steps: expanding the dataset with multicenter data, enriching the training set with CIS and other underrepresented pathological types, optimizing the HRNetV2 architecture specifically for the cystoscopic domain, and eventually integrating pathological classification into the real-time detection pipeline. The ultimate vision is a system that not only identifies lesions during cystoscopy but also provides preliminary pathological assessment to guide immediate clinical decisions.

TL;DR: Key limitations include single-center data (94 patients), lack of CIS samples preventing tumor classification, and no head-to-head comparison with competing architectures like PAN or PSPNet. Larger multicenter studies (e.g., CAIDS used 69,000+ images from 10,000+ patients) have achieved higher accuracy. Future work will focus on multicenter validation, dataset enrichment, and integrating pathological classification into the real-time detection pipeline.
Citation: Ye Z, Li Y, Sun Y, He C, He G, Ji Z. Open access, 2025. Available at: PMC11976823. DOI: 10.1245/s10434-025-17015-3. License: CC BY.