Clinical context: Bladder cancer is the seventh most prevalent cancer worldwide and ranks 13th in cancer-related mortality, with an estimated 82,290 new cases in the United States in 2023. After first-line platinum-based chemotherapy fails in locally advanced or metastatic urothelial carcinoma, targeted therapy against human epidermal growth factor receptor 2 (HER2) has emerged as a critical second-line treatment option. HER2 is a transmembrane glycoprotein with tyrosine kinase activity that controls growth and differentiation of epithelial cells, and its overexpression (IHC staining 2+ and 3+) is found in 18.1% to 36% of bladder urothelial carcinoma cases.
Therapeutic significance: A phase II clinical trial of the antibody-drug conjugate RC48-ADC (disitamab vedotin) in HER2-positive locally advanced and metastatic urothelial carcinoma demonstrated an overall response rate of 51.2%, along with prolonged median progression-free survival and median overall survival. Anti-HER2 antibody-drug conjugates have now entered international treatment guidelines from the EAU and NCCN. The efficacy of these therapies depends entirely on accurate identification of HER2 expression status, making rapid and reliable HER2 testing a prerequisite for personalized treatment.
Diagnostic limitations: The current gold-standard method for HER2 assessment is immunohistochemistry (IHC), which is resource-intensive in both time and cost, and its interpretation depends heavily on pathologist expertise. High-throughput sequencing can predict HER2 status but is mostly qualitative and expensive. Studies have shown that HER2 expression in bladder cancer is often low and exhibits high heterogeneity, which correlates with disease-free survival and overall survival. Not all medical facilities have the equipment or expertise for reliable IHC testing.
Study objective: This study from Renmin Hospital of Wuhan University (RHWU) pioneers the application of deep learning to predict HER2 expression status directly from routine hematoxylin and eosin (H&E)-stained pathological images of bladder cancer, bypassing the need for IHC staining or high-throughput sequencing. The researchers developed a weakly-supervised model using the clustering-constrained-attention multiple-instance learning (CLAM) framework on a cohort of 106 patients, aiming to deliver an economical, swift, and precise diagnostic solution that integrates into existing digital pathology workflows.
Cohort assembly: The researchers collected H&E-stained pathology slides from 115 patients who underwent bladder tumor surgery at RHWU between 2020 and 2023. All slides were prepared by skilled technicians and evaluated individually by molecular pathologists. The inclusion criteria required a definitive pathological diagnosis of bladder cancer, known HER2 status, no history of targeted or immunotherapy, and availability of clinical data on age, gender, T stage, lymphovascular invasion, and histologic grade.
Exclusions and final cohort: After quality assessment, 9 patients were excluded from the original 115: 1 for poor image quality, 5 for suboptimal bladder tumor tissue sampling, and 3 for incomplete pathological information. The remaining 106 patients and their corresponding 106 tissue pathology slides formed the final RHWU cohort. Among these, 70 patients (66%) were HER2-positive and 36 patients (34%) were HER2-negative. The cohort was divided in a 3:1:1 ratio into a training set (64 cases, 60.4%), a test set (21 cases, 19.8%), and a validation set (21 cases, 19.8%).
HER2 classification: IHC staining was used to determine HER2 status, scored according to current ASCO/College of American Pathologists guidelines. An IHC staining score of 3+ or 2+ was classified as HER2-positive, while scores of 0 or 1+ were classified as HER2-negative. The patient demographics showed a median age of 68 years (range 32 to 90), with 85.85% male and 14.15% female patients. The pT stage distribution spanned from pTis (1.9%) through pT4 (3.77%), with pTa (45.28%) and pT1 (21.70%) being the most common.
Clinical characteristics: The cohort included patients across nearly all pathological stages: 53.77% at Stage 0a, 21.70% at Stage I, 18.87% at Stage II, 2.83% at Stage III, and 0.94% at Stage IV. High-grade tumors accounted for 57.55% and low-grade for 42.45%. Lymphovascular invasion was present in 18.87% of cases. This diversity across TNM stages and histologic grades supports the generalizability of findings, although the distribution was not perfectly balanced across subgroups.
Digital scanning: Each of the 106 qualified tissue pathology slides was scanned into a whole slide image (WSI) in .svs format using a KFBIO KF-PRO-020 digital scanner. Every WSI was carefully reviewed by pathologists after scanning and stored in an external storage system. The images were imported at 20x magnification, which provides sufficient resolution for identifying tissue-level features while keeping computational requirements manageable.
Tissue segmentation and patching: The preprocessing pipeline began by segmenting the boundaries of bladder cancer tissue and detecting natural holes within the tissue on each slide image. The digital slides were then divided into 256 x 256-pixel patches suitable for convolutional neural network processing. Blank patches were filtered out using color thresholding, ensuring that only tissue-containing regions were retained for analysis. Notably, the researchers chose not to remove background from the WSIs before segmentation, relying instead on post-segmentation filtering.
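The blank-patch filtering step can be sketched with a simple saturation heuristic: blank glass is near-white (all color channels similar), while H&E-stained tissue is saturated. The exact thresholding rule is not specified in the study, so the threshold values below are illustrative assumptions.

```python
import numpy as np

def is_tissue(patch_rgb: np.ndarray, sat_thresh: int = 15, min_frac: float = 0.05) -> bool:
    """Keep a 256x256 RGB patch only if enough pixels are sufficiently
    saturated. Blank glass is near-white, i.e. low saturation.
    Threshold values are illustrative, not taken from the paper."""
    r = patch_rgb[..., 0].astype(int)
    g = patch_rgb[..., 1].astype(int)
    b = patch_rgb[..., 2].astype(int)
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)
    saturation = mx - mn  # crude proxy for HSV saturation on a 0-255 scale
    return bool((saturation > sat_thresh).mean() >= min_frac)

# A blank (near-white) patch is rejected; an H&E-like pink/purple patch passes.
blank = np.full((256, 256, 3), 245, dtype=np.uint8)
tissue = np.zeros((256, 256, 3), dtype=np.uint8)
tissue[..., 0] = 180; tissue[..., 1] = 120; tissue[..., 2] = 160
print(is_tissue(blank), is_tissue(tissue))  # → False True
```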
Feature extraction with ResNet-50: All retained patches were transformed into a low-dimensional feature embedding set using the ResNet-50 model with ImageNet pre-training weights. Each patch produced a 1024-dimensional feature vector through the feature extractor. ResNet-50 was selected for its ability to overcome the degradation problem in deep network training through its residual learning framework, which mitigates the vanishing gradient issue. The architecture's transfer learning capabilities, leveraging pre-trained weights from ImageNet, allow it to extract both low-level and high-level visual features relevant to histopathological analysis.
Design rationale: The choice of ResNet-50 as the backbone feature extractor reflects its balance between computational efficiency and representational power. Its modular design allows customization for varying task complexities, and the availability of pre-trained models enables rapid deployment in medical imaging contexts where training data is limited. The 1024-dimensional feature vectors capture sufficient information from each patch to feed into the downstream attention network for slide-level classification.
CLAM architecture: The study employed clustering-constrained-attention multiple-instance learning (CLAM), an advanced weakly supervised deep learning method that leverages attention mechanisms to automatically identify subregions with significant diagnostic value within each WSI. Unlike fully supervised approaches that require pixel-level or region-level annotations, CLAM operates at the slide level, meaning only a single label (HER2-positive or HER2-negative) is needed per slide. This dramatically reduces the annotation burden on pathologists.
Attention-based pooling: CLAM classifies unannotated WSIs by using an attention-based pooling function. The attention network assigns a score to each patch, indicating its relative importance in the overall slide-level diagnosis. Instance-level clustering on identified representative regions constrains and refines the feature space, leading to more precise WSI classification. This approach allows the model to focus on the most diagnostically relevant tissue areas while downweighting uninformative regions such as stroma, necrosis, or normal tissue.
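The attention-based pooling described above can be sketched as a gated attention module in the style of Ilse et al. (2018), which CLAM builds on. The 1024-input and 256-hidden dimensions follow the study's configuration; the rest is a minimal illustration, omitting CLAM's instance-level clustering branch.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Each patch embedding receives a scalar attention score; the
    slide-level representation is the attention-weighted sum of patch
    embeddings, which a linear head then classifies."""
    def __init__(self, in_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attn_V = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (n_patches, in_dim) for one slide
        scores = self.attn_w(self.attn_V(patch_feats) * self.attn_U(patch_feats))  # (n, 1)
        attn = torch.softmax(scores, dim=0)            # per-patch importance, sums to 1
        slide_repr = (attn * patch_feats).sum(dim=0)   # (in_dim,)
        return self.classifier(slide_repr), attn.squeeze(-1)

model = GatedAttentionPool()
logits, attn = model(torch.randn(500, 1024))  # 500 patches from one slide
print(logits.shape, attn.shape, float(attn.sum()))  # 2 logits, 500 scores summing to ~1
```

The attention vector `attn` is exactly what the heatmap visualization later in the study is built from.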
Training configuration: The model was trained using five-fold cross-validation to prevent overfitting. The Adam optimizer was configured with an initial learning rate of 1 x 10^-4 and L2 weight decay of 1 x 10^-5. The beta parameters were set to default values of 0.9 (beta1) and 0.999 (beta2). The input feature dimension was 1024 (matching ResNet-50 output), the hidden layer dimension was 256, and the dropout rate was 0.25. A smooth top-1 SVM loss function was selected for classification. The maximum training epoch was set to 200 with an early stopping strategy triggered after 20 consecutive epochs without improvement in validation loss.
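A minimal sketch of this configuration in PyTorch, with a stand-in model and a simple patience counter for early stopping. Only the hyperparameter values come from the study; the stand-in model and the exact stopping bookkeeping are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model so the optimizer setup is runnable.
model = nn.Linear(1024, 2)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,              # initial learning rate, as reported
    betas=(0.9, 0.999),   # default beta1/beta2, as stated
    weight_decay=1e-5,    # L2 weight decay
)

# Early stopping as described: halt when validation loss fails to
# improve for 20 consecutive epochs, up to a maximum of 200 epochs.
MAX_EPOCHS, PATIENCE = 200, 20
best_loss, counter = float("inf"), 0

def should_stop(val_loss: float) -> bool:
    global best_loss, counter
    if val_loss < best_loss:
        best_loss, counter = val_loss, 0
    else:
        counter += 1
    return counter >= PATIENCE
```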
Cross-validation strategy: Five-fold cross-validation segments the dataset into five equal parts, using one part for validation in each round while the remaining four constitute the training set. This approach provides a more consistent and dependable assessment of the model's generalization performance, confirming its stability across various data subsets. It is particularly important in small-cohort studies like this one, where a single train-test split could yield unreliable performance estimates due to the limited sample size of 106 cases.
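As a sketch, a five-fold split over 106 slide-level labels with the cohort's 70/36 class balance. Stratification is an assumption here (the study states five-fold cross-validation without specifying it), and the label ordering is synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 70 HER2-positive, 36 HER2-negative, matching the cohort.
labels = np.array([1] * 70 + [0] * 36)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    # Stratification keeps 14 positives in every validation fold (70 / 5).
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} "
          f"val_pos={int(labels[val_idx].sum())}")
```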
Validation set performance: On the validation set (N = 21), the CLAM model achieved an AUC of 0.92 (95% CI: 0.86 to 0.94), an accuracy of 0.86 (95% CI: 0.74 to 0.94), a sensitivity of 0.87 (95% CI: 0.63 to 0.95), a specificity of 0.83 (95% CI: 0.53 to 0.87), and an F1 score of 86.7%. These results indicate strong discriminative ability for identifying HER2-positive versus HER2-negative cases within the validation cohort, with the model correctly identifying the vast majority of HER2-positive cases while maintaining a low false-positive rate.
Test set performance: On the independent test set (N = 21), performance declined somewhat: AUC of 0.88 (95% CI: 0.82 to 0.92), accuracy of 0.67 (95% CI: 0.55 to 0.86), sensitivity of 0.56 (95% CI: 0.45 to 0.76), specificity of 0.75 (95% CI: 0.40 to 0.80), and F1 score of 77.8%. The drop in sensitivity from 0.87 to 0.56 between validation and test sets is notable, suggesting the model may miss a substantial proportion of HER2-positive cases in unseen data. The authors attribute this discrepancy to the use of a single-center sample for training, which limits diversity in tissue appearance and processing.
Human-machine competition: To validate clinical utility, two senior pathologists independently assessed each of the 21 H&E slides in the test set, blinded to each other's readings. The optimal CLAM model achieved an accuracy of 0.86, statistically outperforming both Pathologist A (accuracy 0.62, p < 0.01) and Pathologist B (accuracy 0.43, p < 0.001) by a two-sided McNemar's test. This comparison demonstrates that predicting HER2 expression from H&E images alone is extremely difficult for human experts, and that the deep learning model captures visual features not readily perceptible to pathologists.
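McNemar's test compares paired predictions through their discordant counts: cases one rater got right and the other got wrong. A minimal exact two-sided version is sketched below; the discordant counts used in the demo are hypothetical, since the study's contingency tables are not reproduced here.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = cases the model got right and the pathologist got wrong,
    c = the reverse. Under H0 each discordant case is a fair coin,
    so the p-value is a doubled binomial tail, capped at 1."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: model correct / pathologist wrong on 9 slides,
# the reverse on 1 slide.
print(round(mcnemar_exact_p(9, 1), 4))  # → 0.0215
```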
Key takeaway: While the AUC of 0.88 on the test set indicates good overall discriminative ability, the sensitivity of 0.56 highlights the risk of missed HER2-positive diagnoses, which could mean patients miss out on beneficial targeted therapy. Nevertheless, the model significantly outperformed both pathologists, confirming that H&E images do contain latent information about HER2 status that deep learning can extract but the human eye cannot reliably detect.
Heatmap generation: The CLAM model produces attention heatmaps that project the model's evidence for HER2 status onto the spatial coordinates of the original pathological image. The attention network assigns a score to each patch, and these scores are converted to percentile values scaled between 0 and 1, where 1 represents the highest attention and 0 the lowest. A diverging color map converts the normalized scores to RGB values, which are overlaid on the corresponding spatial locations in the slide.
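The score-to-color mapping can be sketched as percentile normalization followed by a diverging color ramp. A hand-rolled blue-to-red ramp stands in here for a library colormap (such as matplotlib's coolwarm) to keep the sketch dependency-free.

```python
import numpy as np

def attention_to_rgb(raw_scores: np.ndarray) -> np.ndarray:
    """Rescale raw per-patch attention scores to percentile ranks in
    [0, 1] (1 = highest attention), then map them onto a simple
    blue-to-red diverging ramp with a whitish midpoint."""
    ranks = raw_scores.argsort().argsort().astype(float)
    p = ranks / (len(raw_scores) - 1)       # percentile rank in [0, 1]
    red, blue = p, 1.0 - p                  # high attention -> red, low -> blue
    green = 1.0 - np.abs(2 * p - 1)         # peaks at the midpoint
    return np.stack([red, green, blue], axis=1)

scores = np.array([0.1, 2.3, -0.7, 5.0])    # raw scores for four patches
rgb = attention_to_rgb(scores)
print(rgb[scores.argmax()])  # highest-attention patch → [1. 0. 0.] (pure red)
```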
Interpreting the heatmaps: Red regions in the heatmap indicate areas of high interest that contribute strongly to the model's prediction (positive evidence for HER2 overexpression), while blue regions indicate low contribution relative to other patches. The researchers found a strong correlation between the intensely activated red regions in the heatmap and the overexpression of HER2. This spatial mapping allows pathologists to visually inspect which tissue areas drove the model's decision, providing a degree of interpretability that is often lacking in deep learning approaches.
Tumor region dominance: Heatmap visualization showed that attention scores within the tumor region were consistently higher than those in the surrounding microenvironment. This suggests that the tumor region carries more of the predictive signal for HER2 status than the adjacent stroma or normal tissue. It also supports the biological plausibility of the model's predictions, since HER2 expression is a property of tumor cells rather than of the surrounding microenvironment.
Clinical implications: The ability to generate interpretable heatmaps is significant for clinical adoption. Rather than providing a simple binary prediction, the CLAM model highlights specific tissue regions that warrant further pathological examination. This transparency could help build trust among clinicians and facilitate the integration of AI-based HER2 prediction into existing digital pathology workflows. Once training is complete, the model can identify feature-rich regions and perform classification at the WSI level, offering a rapid screening capability.
Comparison with existing models: Yan et al. previously introduced a hierarchical deep multiple-instance learning framework for predicting HER2 expression status in bladder cancer from the TCGA dataset (123 cases), achieving an AUC of 0.91. However, that study required labor-intensive annotation of bladder cancer tissues and used a quantitative approach. By contrast, the present study's CLAM model eliminates the need for manual tissue annotation and adopts a qualitative prediction approach (positive vs. negative). Loeffler et al. showed that deep learning can predict FGFR3 mutation status from bladder cancer pathological images with an AUC of 0.701, demonstrating that molecular biomarker prediction from H&E images is feasible across different targets.
Advantages over traditional testing: The deep learning model uses only WSIs as input, which can be easily obtained in surgical environments and adopted in economically underdeveloped and remote areas. Compared to IHC testing, which requires specific antibodies, equipment, and trained personnel, the model significantly reduces pathologist workload and patient burden. The approach is particularly valuable in settings where high-throughput sequencing is unavailable and IHC expertise is limited, potentially democratizing access to HER2 status determination for treatment planning.
Technical design choices: The use of ResNet-50 as the feature extractor was motivated by its ability to learn a spectrum of feature representations from rudimentary to sophisticated. The Adam optimizer with L2 weight decay (1 x 10^-5) and dropout rate of 0.25 were chosen to enhance generalization and prevent overfitting on the small training dataset. The smooth top-1 SVM loss function was selected specifically for the binary classification task. These hyperparameter choices aimed to balance model performance with generalization capability, although the authors acknowledge that further tuning may be necessary for different datasets.
Clinical perspective: From a practical standpoint, the study has the potential to alleviate pathologist workload, reduce the time and financial burden of traditional IHC testing, and shorten treatment cycles. During diagnosis, the model can serve as a decision-support tool, aiding clinicians in making well-informed choices. The researchers noted that HER2 was weakly expressed in the collected bladder cancer cases, and the quality of immunohistochemical staining determined the regimen of neoadjuvant chemotherapy or targeted therapy. Using deep learning to mine more detailed features provides potentially objective criteria that could benefit patients.
Single-center limitation: The most significant constraint of this study is that all data came from a single clinical center (RHWU), and the cohort of 106 patients is relatively small. This impacts accuracy and reproducibility, and may limit the applicability of the findings in broader clinical practice. The drop in test set sensitivity to 0.56 compared to 0.87 on the validation set directly reflects this limitation, as the model may have learned institution-specific tissue processing and staining patterns rather than universal features of HER2 expression.
Image quality issues: The authors acknowledge that some out-of-focus patches remained in the dataset even after quality filtering, which affected model performance. While 9 of the original 115 cases were excluded for quality reasons, the remaining dataset may still contain suboptimal regions that introduce noise into the training process. More rigorous automated quality control at the patch level could help address this issue in future iterations.
Class imbalance and stage distribution: The cohort included 70 HER2-positive and 36 HER2-negative cases, creating a roughly 2:1 class imbalance. Additionally, the distribution across pathological stages was not balanced, with Stage 0a comprising over half the cohort while Stages III and IV were represented by only a handful of cases. This imbalance may introduce bias in the model's predictions and limits the ability to assess performance across different disease stages.
Future directions: The researchers envision multi-center studies with large sample sizes to achieve early and accurate diagnosis of HER2 overexpression in bladder cancer. They advocate for active involvement of pathologists in research projects for quality control and overall improvement. The integration of clinical data, radiomics, and gene-related information using multi-instance learning is anticipated to create more robust and reliable clinical support models. Expanding beyond single-institution training data would help the model generalize across different tissue processing protocols, scanner types, and patient populations.
Broader context: This study represents an important proof-of-concept that HER2 status can be predicted from routine H&E-stained images of bladder cancer using deep learning. While the test set performance shows clear room for improvement, the model's superiority over experienced pathologists in the human-machine competition validates the fundamental premise that H&E images contain latent molecular information extractable by AI. The next steps require external validation, larger datasets, and prospective clinical trials to determine whether this approach can reliably guide HER2-targeted therapy decisions in practice.