Deep Learning Screening of Urothelial Carcinoma in Cytology

The American Journal of Pathology, 2022

Plain-English Explanations
Pages 1-2
Why Urine Cytology Screening Needs Deep Learning

Clinical motivation: Urothelial carcinoma is the most common malignancy detected by urine cytology, with bladder cancer ranking as the tenth most commonly diagnosed cancer worldwide, accounting for 573,278 new cases and 212,536 deaths in 2020 according to GLOBOCAN statistics. Approximately 90% of bladder cancers are urothelial in origin. While cystoscopy with biopsy remains the gold standard for diagnosis, it is invasive and inconvenient for follow-up monitoring. Urine cytology provides a non-invasive alternative, and studies have shown that 48.6% of biopsy-proven low-grade urothelial carcinomas were already identified as atypical or neoplastic suspicious on urine cytology, confirming its diagnostic utility.

Liquid-based cytology advantages: Liquid-based cytology (LBC), developed in the 1990s as an alternative to conventional smear cytology, offers several improvements. LBC preserves cells in a liquid medium and removes debris, blood, and exudate through filtering or density gradient centrifugation, producing uniformly distributed, cell-enriched slides. The sensitivity of LBC was reported at 0.58 (CI: 0.51-0.65) compared to 0.38 for conventional smear, and LBC performed significantly better for high-grade urothelial carcinoma (HGUC) detection. In Japan alone, there were 2,041,547 urine cytodiagnosis reports in 2021, making it the second most common cytology after cervical screening.

Study goal: Conventional microscope-based screening by cytoscreeners and cytopathologists is constrained by limited human resources. This study by Tsuneki, Abe, and Kanavati from Medmain Inc. and Tochigi Cancer Center investigated whether deep learning models based on convolutional neural networks (CNNs) could classify urine LBC whole-slide images (WSIs) into neoplastic versus non-neoplastic (negative) categories. They used a total of 786 WSIs for training and evaluated on 750 WSIs across two test sets, achieving ROC area under the curve (AUC) values in the range of 0.984 to 0.990 with the best model.

TL;DR: This study applied deep learning to classify urine liquid-based cytology whole-slide images as neoplastic or negative for urothelial carcinoma. Using 786 training WSIs and 750 test WSIs, the best model achieved AUC 0.984-0.990, offering a potential tool to accelerate screening of the over 2 million urine cytology cases processed annually in Japan alone.
Pages 3-4
Dataset Composition and Cytopathological Classification System

Total dataset: The study collected LBC SurePath specimens of human urine cytology from a private clinical laboratory in Japan. After excluding 21 inadequate specimens (due to insufficient cellularity or artifacts such as dust and ink markings), the remaining 1,556 WSIs were divided into four subsets: 786 training WSIs, 20 validation WSIs, 200 equal-balance test WSIs (100 negative, 100 neoplastic), and 550 clinical-balance test WSIs (500 negative, 50 neoplastic). The clinical-balance test set used a 10:1 negative-to-neoplastic ratio, reflecting real-world prevalence as reported by the Japanese Society of Clinical Cytology from 2016 to 2021.

Classification scheme: WSIs were classified into two categories: negative and neoplastic. Negative WSIs corresponded to Class I (negative for HGUC) and Class II (negative for HGUC with reactive urothelial epithelial cells). Neoplastic WSIs included Class III (atypical urothelial epithelial cells, suspicious for low-grade urothelial carcinoma [LGUC]), Class IV (LGUC, suspicious for HGUC), and Class V (HGUC). In the training set, 724 WSIs were negative and 62 were neoplastic. Each WSI was reviewed by at least two cytoscreeners or pathologists, with final verification by a senior cytoscreener or pathologist.

Scanning and preparation: All WSIs were scanned at magnification x20 using the Leica Aperio AT2 Digital Whole Slide Scanner and saved in SVS format with JPEG2000 compression. The specimens were prepared using the SurePath (Becton Dickinson) method, one of two FDA-approved LBC preparation techniques alongside ThinPrep. The LBC approach produces single-layer WSIs that are particularly suitable for high-throughput automated image analysis, unlike conventional smear preparations where cells may overlap.

TL;DR: The dataset comprised 1,556 urine LBC specimens split into training (786), validation (20), equal-balance test (200), and clinical-balance test (550) sets. Cases were classified as negative (Class I/II) or neoplastic (Class III/IV/V), with clinical-balance testing using a realistic 10:1 negative-to-neoplastic ratio. All slides were digitized at x20 magnification and verified by multiple pathologists.
Pages 4-5
Manual Annotation of Neoplastic Cells for Supervised Learning

Annotation scope: A subset of 62 neoplastic training cases and 10 neoplastic validation cases were manually annotated by experienced pathologists using a web-based tool built on the open-source OpenSeadragon viewer. On average, cytoscreeners and pathologists manually annotated 180 cells or cellular clusters per WSI, with an average annotation time of approximately 90 minutes per WSI. The negative subset of training and validation sets was not annotated, and entire cell-spreading areas within those WSIs were used directly.

Three annotation labels: The annotation scheme defined three neoplastic labels: atypical cell, low-grade urothelial carcinoma (LGUC) cell, and high-grade urothelial carcinoma (HGUC) cell. These labels were assigned based on representative neoplastic urothelial epithelial cell morphology, including features such as hyperchromatism, irregular chromatin distribution, abnormalities of nuclear shape, increased nuclear-to-cytoplasmic ratio, irregular nuclear distribution, nuclear enlargement, abnormal cytoplasm, prominent nucleoli, and cellular and nuclear pleomorphism. A single WSI classified as Class V could contain all three annotation types simultaneously.

Annotation totals: Across all annotated WSIs, the study produced a total of 13,207 annotations: 9,950 atypical cell annotations, 1,646 LGUC cell annotations, and 1,611 HGUC cell annotations. Importantly, pathologists did not annotate areas where it was difficult to cytologically determine that cells were neoplastic. All annotations were verified and, if necessary, modified by a senior cytoscreener, ensuring quality control throughout the labeling process. For the purpose of model training, all three labels were grouped into a single "neoplastic" category.

TL;DR: Pathologists manually annotated 72 neoplastic WSIs (62 training, 10 validation) with 13,207 total annotations across three labels: atypical cells (9,950), LGUC cells (1,646), and HGUC cells (1,611). Each WSI took about 90 minutes to annotate, averaging 180 cells per slide. All annotations were verified by a senior cytoscreener.
Pages 5-6
Deep Learning Architecture and Training Approaches

Four model variants: The study trained four primary models using a modified EfficientNetB1 (ENB1) architecture with an input tile size of 1024 x 1024 pixels at x10 magnification. The larger-than-standard tile size was chosen based on cytologists' input that they typically need to view neighboring cells around a given cell for more accurate diagnosis. The four models differed in weight initialization and training approach: ENB1-UC-FS+WS (uterine cervix pretrained, fully and weakly supervised), ENB1-UC-WS (uterine cervix pretrained, weakly supervised only), ENB1-IN-FS+WS (ImageNet pretrained, fully and weakly supervised), and ENB1-IN-WS (ImageNet pretrained, weakly supervised only).

Transfer learning strategy: Two weight initialization strategies were compared. The first used standard ImageNet pretrained weights. The second used weights from a uterine cervix neoplastic LBC screening model previously trained by the same group. Both leveraged partial fine-tuning, which updates only the affine parameters of the batch normalization layers and the final classification layer rather than all network weights. This approach reduces overfitting risk, especially when training data is limited, and was validated in the group's prior studies.
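As a rough illustration, partial fine-tuning amounts to a filter over the network's parameters: batch-normalization gamma/beta and the classifier head stay trainable, everything else (including BN running statistics) stays frozen. This is a minimal, framework-agnostic sketch; the parameter names below are hypothetical, not the actual EfficientNetB1 variable names.

```python
# Sketch of partial fine-tuning: train only the batch-normalization
# affine parameters (gamma/beta) and the final classification layer.
def is_trainable(param_name: str) -> bool:
    """Return True for BN affine params and the classifier head."""
    return (
        "bn" in param_name and param_name.endswith(("gamma", "beta"))
    ) or param_name.startswith("classifier")

# Hypothetical parameter names for a small CNN.
params = [
    "stem/conv/kernel",
    "block1/bn/gamma",
    "block1/bn/beta",
    "block1/bn/moving_mean",   # BN running statistics stay frozen
    "block1/conv/kernel",
    "classifier/kernel",
    "classifier/bias",
]

trainable = [p for p in params if is_trainable(p)]
```

In a TensorFlow model the same effect is achieved by setting `trainable = False` on all layers except the BN layers and the head; the filter above just makes the selection rule explicit.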

Additional architectures: For comparison, the team also trained models using ResNet50V2, DenseNet121, and InceptionV3 architectures, all with ImageNet initialization and the fully plus weakly supervised (FS+WS) training method. These comparisons demonstrated how the EfficientNetB1 architecture performed relative to other well-established CNN designs for this specific cytology classification task.

Tile extraction and inference: Tissue regions were detected by thresholding a grayscale version of each WSI using Otsu's method, eliminating white background. During inference, a sliding window approach with a 512 x 512 pixel stride (half the tile size) was applied, generating a grid of predictions across all cell-containing areas. Each tile received a neoplastic probability, and the WSI-level prediction was computed as the maximum probability across all tiles, enabling visualization as a heatmap superimposed on the original WSI.
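The sliding-window inference and max-probability aggregation described above can be sketched in a few lines of NumPy. The `tile_prob` callable is a toy stand-in for the trained CNN, not the paper's model:

```python
import numpy as np

def wsi_score(tile_prob, wsi_shape, tile=1024, stride=512):
    """Slide a tile-sized window across the slide at half-tile stride,
    score every position, and take the maximum tile probability as the
    WSI-level neoplastic prediction. The per-tile grid doubles as the
    heatmap overlaid on the original WSI."""
    h, w = wsi_shape
    grid = np.array([
        tile_prob(y, x)
        for y in range(0, h - tile + 1, stride)
        for x in range(0, w - tile + 1, stride)
    ])
    return grid.max(), grid

# Toy stand-in for the CNN: probability rises toward the lower-right corner.
score, heatmap = wsi_score(lambda y, x: (y + x) / 4096, (2048, 2048))
# A 2048x2048 slide yields a 3x3 grid of tile positions (9 predictions).
```

Taking the maximum (rather than the mean) makes the slide-level call sensitive to even a single strongly suspicious tile, which matches the screening intent: one convincing neoplastic cell cluster should flag the whole slide.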

TL;DR: Four EfficientNetB1-based models were trained with 1024x1024 pixel tiles at x10 magnification, differing by weight initialization (ImageNet vs. uterine cervix LBC model) and supervision strategy (fully+weakly supervised vs. weakly supervised only). Partial fine-tuning was used to reduce overfitting. WSI-level predictions were derived from the maximum tile probability across a sliding window with 512-pixel stride.
Pages 6-7
Fully Supervised, Weakly Supervised, and Hard Mining Training

Fully supervised learning: During fully supervised training, the model maintained an equal balance of positively and negatively labeled tiles in each training batch. Positive tiles were extracted randomly from annotated regions of neoplastic WSIs, ensuring at least one annotated cell was visible within the 1024 x 1024 pixel tile. Negative tiles were extracted randomly from tissue regions of negative WSIs. The two tile types were interleaved to construct equally balanced batches. To reduce false positives, the training incorporated hard mining at the end of each epoch, performing full sliding window inference on all negative WSIs and adjusting the sampling probability so that falsely predicted positive tiles were more likely to be sampled in subsequent training.
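The epoch-end hard negative mining step can be sketched as a reweighting of the negative-tile sampling distribution. The boost factor and six-tile pool below are illustrative choices, not values from the paper:

```python
import random

def update_sampling_weights(weights, false_positive_idx, boost=4.0):
    """Hard negative mining: after the epoch-end sliding-window sweep over
    the negative WSIs, up-weight tiles the model wrongly scored positive so
    they are more likely to be sampled in the next epoch. The boost factor
    is an illustrative choice, not a value from the paper."""
    for i in false_positive_idx:
        weights[i] *= boost
    total = sum(weights)
    return [w / total for w in weights]  # renormalize to a distribution

# Hypothetical pool of six negative tiles, initially sampled uniformly;
# the epoch-end sweep flagged tiles 2 and 5 as false positives.
weights = update_sampling_weights([1.0] * 6, false_positive_idx=[2, 5])
hard_negatives_favored = weights[2] > weights[0]

# Draw the next epoch's negative tiles with the updated weights.
next_negatives = random.choices(range(6), weights=weights, k=4)
```

The effect is that the negative half of each balanced batch gradually concentrates on exactly the benign patterns the model currently confuses with neoplasia.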

Weakly supervised learning: In the weakly supervised approach, the model did not use per-cell annotations. Instead, it used only the WSI-level labels (neoplastic or negative). To maintain balance, tiles were oversampled from WSIs to ensure the model trained on tiles from all WSIs in each epoch. The approach then switched to hard mining, alternating between training and inference. During inference, the CNN applied a sliding window across all tissue regions, selecting the k = 8 tiles with the highest positive probability per WSI. These tiles were placed in a training subset, and once that subset contained N = 256 tiles, training was initiated with a batch size of 32.
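The top-k tile selection at the heart of the weakly supervised loop is simple to state in NumPy. The random probabilities below are toy inputs standing in for one WSI's sliding-window pass:

```python
import numpy as np

def top_k_tiles(tile_probs, k=8):
    """Weak supervision's hard-mining step: keep the k tiles with the
    highest positive probability from one WSI's sliding-window pass."""
    return np.argsort(tile_probs)[::-1][:k]

rng = np.random.default_rng(0)
probs = rng.random(20)              # toy tile probabilities for one WSI
picked = top_k_tiles(probs, k=8)

# Selected tiles accumulate in a training subset; a training step fires
# once the subset reaches N = 256 tiles, consumed in batches of 32.
subset = list(picked)
ready = len(subset) >= 256          # still waiting on more WSIs
```

For a neoplastic WSI these k tiles are the model's best current guesses at where the neoplastic cells are; for a negative WSI they are its hardest false alarms, so training on both sharpens the decision boundary without any per-cell labels.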

Training configuration: Real-time data augmentation was applied using variations in brightness, saturation, and contrast. The model was trained with the Adam optimizer (beta1 = 0.9, beta2 = 0.999, learning rate = 0.001) and binary cross-entropy loss. A learning rate decay of 0.95 every 2 epochs was applied, and early stopping was triggered when validation loss showed no improvement for 10 consecutive epochs. The model with the lowest validation loss was selected as the final model. All models were implemented using TensorFlow, with AUCs calculated via scikit-learn and 95% confidence intervals estimated using 1,000 bootstrap iterations.
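The 1,000-iteration bootstrap CI for the AUC can be sketched NumPy-only. The paper computed the AUC itself with scikit-learn; the rank-based version here is a stand-in that assumes continuous, untied scores:

```python
import numpy as np

def auc_rank(y_true, y_score):
    """AUC via the Mann-Whitney rank statistic (assumes untied scores);
    a NumPy stand-in for scikit-learn's roc_auc_score."""
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for the AUC (1,000 resamples, matching
    the paper's evaluation setup)."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if 0 < y_true[idx].sum() < len(idx):       # resample needs both classes
            aucs.append(auc_rank(y_true[idx], y_score[idx]))
    return tuple(np.quantile(aucs, [alpha / 2, 1 - alpha / 2]))

# Toy perfectly separated scores: AUC is 1.0 and the CI collapses to it.
y = np.array([0, 0, 0, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
lo, hi = bootstrap_auc_ci(y, s)
```

With realistic, imperfect scores the resampled AUCs spread out and the percentile interval widens, which is what produces ranges like the paper's 0.969-0.995.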

TL;DR: Two training strategies were used: fully supervised (with per-cell annotations and balanced positive/negative tile sampling) and weakly supervised (using only WSI-level labels with hard mining of the top 8 most suspicious tiles per WSI). Both incorporated hard negative mining to reduce false positives. Adam optimizer with binary cross-entropy and early stopping after 10 epochs of no improvement controlled the training process.
Pages 7-10
Classification Performance Across All Models and Test Sets

Baseline performance: Before training urine-specific models, the researchers applied an existing uterine cervix LBC neoplastic screening model to the urine test sets. This off-the-shelf model achieved only AUC 0.836 (CI: 0.775-0.885), confirming that domain-specific training was essential and that models trained on one cytology type do not generalize well to urine specimens without fine-tuning.

Best model (ENB1-UC-FS+WS): The best-performing model used uterine cervix pretrained weights with combined fully and weakly supervised learning. On the equal-balance test set, it achieved AUC 0.984 (CI: 0.969-0.995), accuracy 0.945, sensitivity 0.960, and specificity 0.929. On the clinical-balance test set (reflecting real-world prevalence), it achieved AUC 0.990 (CI: 0.982-0.996), accuracy 0.946, sensitivity 0.940, and specificity 0.946. These results substantially outperformed a previously reported urine LBC deep learning model that achieved only 0.842 accuracy, 0.795 sensitivity, and 0.845 specificity.

Comparison across EfficientNetB1 variants: All four EfficientNetB1 models showed comparable WSI-level performance. ENB1-UC-WS achieved AUC 0.990 on both test sets. ENB1-IN-FS+WS achieved AUC 0.982/0.986, and ENB1-IN-WS achieved AUC 0.980/0.995. However, heatmap visualization revealed meaningful differences at the tile level: the ENB1-UC-FS+WS model produced the highest neoplastic probabilities in individual tiles corresponding to actual neoplastic cells, making it the best model for localization-based screening despite similar aggregate AUC scores.

Alternative architectures: Models based on other CNN architectures performed worse than EfficientNetB1. ResNet50V2 achieved AUCs of 0.962/0.972 on the equal-balance/clinical-balance test sets. DenseNet121 achieved 0.945/0.957, and InceptionV3 achieved 0.959/0.978. These results confirm that EfficientNetB1, particularly with transfer learning from a related cytology domain, was the superior architecture for this classification task.

TL;DR: The best model (ENB1-UC-FS+WS) achieved AUC 0.984-0.990, accuracy 0.945-0.946, sensitivity 0.940-0.960, and specificity 0.929-0.946 across both test sets. It outperformed an off-the-shelf uterine cervix model (AUC 0.836) and alternative architectures like ResNet50V2 (AUC 0.962-0.972) and DenseNet121 (AUC 0.945-0.957). Heatmap analysis confirmed superior tile-level localization.
Pages 11-13
Analysis of True Positives, False Positives, and False Negatives

True positive predictions: The ENB1-UC-FS+WS model correctly identified neoplastic cells across all cytodiagnostic classes. For Class III WSIs, the heatmap highlighted atypical urothelial epithelial cells. For Class IV, it detected low-grade urothelial carcinoma (LGUC) cells. For Class V, it identified high-grade urothelial carcinoma (HGUC) cells. In each case, two independent cytoscreeners confirmed that low-probability tiles (blue heatmap regions) contained no neoplastic cells, validating the model's localization accuracy for true positives.

True negative predictions: The model correctly classified negative cases, including pyuria (Class I) consisting of infective fluid with a small number of non-atypical epithelial cells, and Class II cases containing urothelial epithelial cells with slight nuclear enlargement. The heatmap images showed zero-probability or near-zero-probability tiles across these WSIs, demonstrating that the model was not confused by common benign findings like reactive nuclear changes or inflammatory cells.

False positive analysis: A notable false positive case involved a Class I (negative) WSI that contained metaplastic squamous epithelial cells alongside non-neoplastic urothelial epithelial cells. These cells showed a slightly increased nuclear-to-cytoplasmic (N/C) ratio, which the model interpreted as neoplastic. The metaplastic squamous morphology, which can mimic certain features of neoplastic cells, was identified as a likely cause of this misclassification.

False negative analysis: A representative false negative was a Class III WSI containing clusters of atypical neoplastic urothelial epithelial cells with high nuclear-to-cytoplasmic ratio. The model failed to detect these cells, producing very low probability predictions in the corresponding tiles. The researchers hypothesized that the cellular clustering, where neoplastic cells overlapped and obscured nuclear shapes, caused the model to misidentify the clusters as urine crystals or cell debris. This overlapping morphology represents a key challenge for tile-based classification approaches.

TL;DR: The model correctly localized neoplastic cells across all grades (Class III-V) and successfully ignored benign findings in negatives. False positives were triggered by metaplastic squamous cells with increased N/C ratios. False negatives occurred when neoplastic cells formed overlapping clusters that mimicked the appearance of urine crystals or cell debris.
Pages 14-16
Clinical Implications, Study Limitations, and Next Steps

Clinical workflow integration: The authors envision their model as a primary screening tool that ranks urine LBC cases by priority, highlighting suspected neoplastic regions via heatmap overlays. Cytoscreeners and cytopathologists would still perform full screening and subclassification (negative, atypical cells, suspicious for malignancy, and malignant) after the deep learning model's initial pass. This workflow would reduce working time by eliminating the need for exhaustive manual searching across entire WSIs, a significant benefit given the scale of urine cytology workloads.

Transfer learning from cervical cytology: A notable finding was that pre-training on uterine cervix LBC specimens improved performance compared to ImageNet-only initialization. The uterine cervix pretrained model achieved the best tile-level heatmap predictions, even though WSI-level AUC scores were comparable across initialization strategies. This cross-domain transfer learning approach, where knowledge from one cytology type improves another, is a promising strategy for future development of cytology screening models in settings where annotated data is scarce.

Single-source limitation: The primary limitation is that all training and test WSIs came from a single private clinical laboratory in Japan using a single LBC method (SurePath). This means the models may be biased toward specimens from that specific laboratory. The authors acknowledge that validation on specimens from multiple different origins, including both clinical laboratories and hospitals, and on alternative LBC methods such as ThinPrep, would be essential for demonstrating robustness and generalizability.

Future validation needs: Beyond multi-site validation, the authors recommend comparison of model performance against cytoscreeners and cytopathologists in a clinical setting. The concordance between cytological and histological diagnosis for LBC has been reported at 92%, with 20.5% of LGUC cases revealed by urinary cytology and validated by histology, and only an 8% rate of misjudgement. Demonstrating that the deep learning model can match or exceed these benchmarks in a prospective study would be a critical step toward clinical adoption. Additionally, addressing the false negative problem caused by overlapping cell clusters remains an important area for architectural improvement.

TL;DR: The model is intended as a priority-ranking screening tool, not a replacement for pathologists. Transfer learning from cervical cytology improved tile-level localization. Key limitations include single-laboratory, single-LBC-method data. Future work requires multi-site validation, head-to-head comparison with cytopathologists, and architectural improvements to handle overlapping cell clusters that caused false negatives.
Citation: Tsuneki M, Abe M, Kanavati F. Open Access, 2022. Available at: PMC9818219. DOI: 10.3390/cancers15010226. License: CC BY.