Yet Another Automated Gleason Grading System (YAAGGS) by weakly supervised deep learning

PMC, 2021

Plain-English Explanations
Page 1
Why Automated Gleason Grading Matters

Prostate cancer is the second most common malignancy in men worldwide and the second leading cause of cancer death among men in the United States. After a prostate needle biopsy, a pathologist assigns a Gleason score by examining tissue under a microscope. This score is the single most powerful prognostic predictor for prostate cancer outcomes and directly determines which treatment a patient receives.

The Gleason grading system, devised in the late 1960s by Dr. Donald F. Gleason, categorizes tumor architectural features on a five-point scale, with pattern 1 being the most differentiated and pattern 5 the least differentiated. The Gleason score for a needle biopsy is the sum of the primary (most prevalent) and secondary (worst remaining) pattern numbers. In 2013, a new classification was proposed that maps these scores into five prognostic grade groups: grade group 1 (Gleason score 6 or below), grade group 2 (Gleason score 3+4=7), grade group 3 (Gleason score 4+3=7), grade group 4 (Gleason score 8), and grade group 5 (Gleason scores 9 and 10). This system was accepted at the 2014 ISUP consensus conference and by the WHO in 2016.
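The score-to-grade-group mapping described above is mechanical, so it can be written down directly. A minimal sketch (the function name and signature are my own, not from the paper):

```python
def grade_group(primary, secondary):
    """Map primary/secondary Gleason patterns to an ISUP grade group (1-5)."""
    score = primary + secondary
    if score <= 6:
        return 1                         # Gleason score 6 or below
    if score == 7:
        return 2 if primary == 3 else 3  # 3+4=7 vs. 4+3=7
    if score == 8:
        return 4
    return 5                             # Gleason scores 9 and 10
```

Note that the two Gleason score 7 combinations land in different grade groups: the primary (most prevalent) pattern breaks the tie, which is exactly why the grade group system carries more prognostic information than the raw score.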

A well-documented problem with Gleason grading is inter-observer variability. Concordance among general pathologists, measured by Cohen's kappa, ranges from only 0.40 to 0.50, while urologic pathology specialists achieve 0.56 to 0.70. This variability can meaningfully affect treatment decisions, making automated grading systems a high-value target for AI research.
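Cohen's kappa measures agreement beyond chance: 1.0 is perfect agreement and 0 is chance-level. A minimal NumPy implementation for illustration (sklearn.metrics.cohen_kappa_score computes the same quantity):

```python
import numpy as np

def cohens_kappa(y_true, y_pred, n_classes):
    """Unweighted Cohen's kappa between two label sequences."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_obs = np.trace(cm) / n                        # observed agreement
    p_exp = (cm.sum(0) * cm.sum(1)).sum() / n ** 2  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)
```

Two pathologists who agree on every slide score 1.0; two whose labels are statistically independent score near 0, even if they happen to match half the time.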

Prior automated systems: Previous deep learning approaches to Gleason grading relied on a two-stage architecture where a model first recognizes individual Gleason patterns 3, 4, and 5, then feeds pattern-wise features into a grade prediction model. However, these systems required extensive region-level manual annotations by experts, complex computer vision algorithms to extract diagnostic markers, or separate immunohistochemistry-based epithelial tissue detection models. All of these approaches carry large annotation costs and algorithmic complexity.

TL;DR: Gleason grading is the most important predictor for prostate cancer treatment decisions, but inter-observer variability among pathologists (kappa 0.40-0.70) is a major problem. Prior AI systems required expensive region-level annotations. YAAGGS aims to eliminate that requirement entirely.
Pages 1-2
Two-Stage Architecture with Weakly Supervised Learning

YAAGGS (Yet Another Automated Gleason Grading System) uses a two-stage convolutional neural network (CNN) approach trained using only slide-level labels, not region-level annotations. The key insight is that a cancer detection model, trained simply to distinguish cancer from benign tissue, may implicitly learn features that differentiate between Gleason patterns. This eliminates the need for pathologists to painstakingly outline individual Gleason pattern regions on whole-slide images (WSIs).
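The core of multiple-instance learning is that only the bag (the slide) carries a label: if any patch is cancerous, the slide is cancerous, so the slide score can be taken as the maximum patch score. A minimal max-pooling MIL sketch in NumPy (the paper's exact MIL variant and training bookkeeping may differ):

```python
import numpy as np

def mil_slide_loss(patch_logits, slide_label):
    """Binary cross-entropy on a slide scored by its most suspicious patch."""
    probs = 1.0 / (1.0 + np.exp(-patch_logits))  # per-patch cancer probability
    slide_prob = probs.max()                     # max-pooling over the bag
    return -(slide_label * np.log(slide_prob)
             + (1 - slide_label) * np.log(1 - slide_prob))
```

Gradient flows only through the top-scoring patch, which is how the model learns patch-level discrimination from slide-level labels alone.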

First stage (feature extraction): Patch images of 360 x 360 pixels are extracted from the entire WSI at 10x magnification and fed into a DenseNet-121 CNN model. The model was initially trained using multiple-instance learning (MIL) to classify patches as benign or cancer, using only slide-level labels. After training, the last hidden layer outputs 1024-dimensional feature vectors for each patch. These vectors are spatially arranged according to their original positions in the WSI, producing a 1024-channel two-dimensional feature map. WSIs up to 2 x 2 cm can be converted into 1024-channel, 64 x 64 pixel feature maps.
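Assembling the patch features back into a spatial map is a simple scatter: each 1024-dimensional vector is placed at the grid cell of its source patch. An illustrative NumPy sketch (the names and the zero-fill for empty background cells are my assumptions):

```python
import numpy as np

def assemble_feature_map(features, coords, grid_size=64, n_channels=1024):
    """Scatter per-patch feature vectors into a (C, H, W) feature map.

    features: (N, n_channels) array of patch embeddings
    coords:   (N, 2) array of (row, col) grid positions for each patch
    """
    fmap = np.zeros((n_channels, grid_size, grid_size), dtype=np.float32)
    for feat, (r, c) in zip(features, coords):
        fmap[:, r, c] = feat  # place the vector at its patch location
    return fmap
```

The resulting 1024 x 64 x 64 tensor preserves the tissue's spatial layout, which is what lets the second-stage CNN reason about where each pattern occurs and how much of the core it occupies.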

Second stage (grade group prediction): A separate CNN accepts these feature maps and classifies them into one of six categories: benign, grade group 1, grade group 2, grade group 3, grade group 4, or grade group 5. The architecture consists of five layers of 1x1 kernel convolutions (with batch normalization and ReLU activation) followed by 16 residual blocks. A weighted cross-entropy loss function addresses class imbalance, with weights of 1.0, 1.0, 1.5, 1.4, 1.7, and 1.6 for benign through grade group 5, respectively.
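Weighted cross-entropy simply scales each example's loss by its class weight, so the rarer higher-grade classes contribute proportionally more gradient. A NumPy sketch using the weights quoted above (single-example form, for illustration only):

```python
import numpy as np

# class weights for benign, GG1, GG2, GG3, GG4, GG5 (values from the paper)
CLASS_WEIGHTS = np.array([1.0, 1.0, 1.5, 1.4, 1.7, 1.6])

def weighted_cross_entropy(logits, label, weights=CLASS_WEIGHTS):
    """Cross-entropy for one example, scaled by its class weight."""
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    return -weights[label] * np.log(probs[label])
```

With uniform logits, a grade group 4 example costs 1.7x what a benign example costs, nudging the model away from defaulting to the majority benign class.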

Training details: The first stage used SGD optimizer with 0.9 momentum, 1e-5 weight decay, initial learning rate of 0.01 (decayed by 0.1 every 25 epochs), and a mini-batch size of 128 over 100 epochs. Data augmentations included random horizontal flips, 90-degree rotations, and color jittering. The second stage used SGD with initial learning rate 0.1, step decay every 40 epochs, and 150 total epochs with batch size 256. The best model was selected by quadratic-weighted kappa on the tuning set.
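The learning-rate schedule described for the first stage is a standard step decay; as a trivial sketch (not the authors' code):

```python
def step_lr(epoch, initial_lr=0.01, decay=0.1, step=25):
    """Learning rate for a given epoch under step decay."""
    return initial_lr * decay ** (epoch // step)
```

Over the 100 first-stage epochs this yields four plateaus: 0.01, 0.001, 1e-4, and 1e-5.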

TL;DR: YAAGGS uses a two-stage CNN pipeline. Stage 1 (DenseNet-121 trained via MIL) extracts 1024-dimensional features from WSI patches using only slide-level cancer/benign labels. Stage 2 (residual CNN) classifies the assembled feature maps into grade groups. No region-level annotations are required.
Pages 1-2
Multi-Institutional Dataset and Experimental Design

Data was collected from two Korean hospitals: Hanyang University Medical Center (HUMC) and Korea University Guro Hospital (KUGH). H&E-stained glass slides containing single prostate needle biopsy cores were digitized using Aperio AT2 scanners at 40x magnification (0.25 microns per pixel). A pathologist performed blinded quality checks, excluding slides from non-prostate tissues, slides with immunohistochemistry or special stains, and slides of inadequate quality, such as severely out-of-focus scans or indelible markings.

Dataset composition: The total dataset comprised 788 cases (7,600 WSIs). In the holistic setting, 689 cases (6,664 WSIs) were used for discovery and 99 cases (936 WSIs) for validation, with discovery further split into 5,716 training and 948 tuning WSIs. Benign tissue dominated the datasets (45-66% of slides), followed by grade groups 1 through 5. For external validation, 244 tissue microarray (TMA) images from the Gleason 2019 Challenge were used, with ground truth derived from pixel-wise majority voting among six pathologists (with 1-27 years of experience).

Two experimental settings: The holistic setting trained on a combined pool from both institutions and validated on a held-out portion, measuring the system's best achievable performance. The inter-institutional setting trained exclusively on 621 HUMC cases (6,071 WSIs) and validated on 167 KUGH cases (1,529 WSIs), testing generalization across institutional boundaries. An additional experiment in the holistic setting used a reduced training set of 5,206 slides to isolate the effect of dataset size on performance.

Reference standard: For HUMC, original hospital diagnoses (Gleason scores) were converted into grade groups. Five surgical pathologists (including one genitourinary pathologist) with 1 to 20 years of experience contributed to these diagnoses over the 2009-2017 collection period. For KUGH, a single board-certified pathologist with 9 years of experience re-reviewed all WSIs according to 2014 ISUP guidelines.

TL;DR: 788 cases (7,600 WSIs) from two Korean hospitals were used. The holistic setting used 689 cases for training and 99 for validation. The inter-institutional setting trained on HUMC alone and validated on KUGH. External validation used 244 TMA images from the Gleason 2019 Challenge.
Pages 2-3
Holistic and Inter-Institutional Performance

Holistic setting, cancer detection (stage 1): The first stage model achieved a ROC AUC of 0.983 (95% CI: 0.964-1.000) and a precision-recall AUC of 0.984 (95% CI: 0.965-1.000). Cancer detection accuracy was 94.7% (95% CI: 91.4-98.0%), with sensitivity of 0.936 (95% CI: 0.900-0.972) and specificity of 0.960 (95% CI: 0.931-0.989).

Holistic setting, grade group prediction (stage 2): The system achieved an overall accuracy of 77.5% (95% CI: 72.3-82.7%), Cohen's kappa of 0.650 (95% CI: 0.570-0.730), and quadratic-weighted kappa of 0.897 (95% CI: 0.815-0.979). This kappa value of 0.650 falls within the range reported for urologic pathology specialists (0.56-0.70) and substantially exceeds the typical range for general pathologists (0.40-0.50). When trained on a reduced subset of 5,206 slides, performance dropped to 69.3% accuracy, kappa of 0.521, and quadratic-weighted kappa of 0.824.
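Quadratic-weighted kappa penalizes disagreements by the squared distance between grades, so confusing GG1 with GG2 costs far less than confusing GG1 with GG5; that is why it reads much higher (0.897) than the unweighted kappa (0.650) on the same predictions. A compact NumPy implementation for illustration (sklearn.metrics.cohen_kappa_score with weights='quadratic' is equivalent):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Quadratic-weighted kappa over ordinal labels 0..n_classes-1."""
    obs = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        obs[t, p] += 1
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2  # penalty matrix
    exp = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()             # chance matrix
    return 1.0 - (w * obs).sum() / (w * exp).sum()
```

Perfect agreement gives 1.0; systematically picking the opposite end of the scale gives a strongly negative value, and near-miss errors barely register.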

Inter-institutional setting: Training on HUMC and validating on KUGH, the cancer detector maintained strong performance with ROC AUC of 0.982 (95% CI: 0.967-0.997) and PR AUC of 0.984 (95% CI: 0.970-0.998). However, grade group prediction accuracy dropped to 67.4% (95% CI: 63.2-71.6%), kappa to 0.553 (95% CI: 0.495-0.610), and quadratic-weighted kappa to 0.880 (95% CI: 0.822-0.938). The reduced-dataset experiment yielded a similar kappa of 0.521, suggesting the performance drop was at least partly due to having fewer training cases rather than institutional differences alone.

External validation: On the Gleason 2019 Challenge dataset, the cancer detector achieved ROC AUC of 0.943 (95% CI: 0.913-0.973) and PR AUC of 0.985 (95% CI: 0.972-0.998). Grade group prediction was notably lower: accuracy of 54.5% (95% CI: 48.3-60.8%), kappa of 0.389 (95% CI: 0.305-0.473), and quadratic-weighted kappa of 0.634 (95% CI: 0.468-0.800). The stain color distribution of the Gleason 2019 dataset was visually different from the training data, which likely contributed to this decline.

TL;DR: In the best setting, YAAGGS achieved 77.5% grade group accuracy with kappa 0.650 and quadratic-weighted kappa 0.897, matching urologic pathologist-level concordance. Cross-institution accuracy dropped to 67.4% (kappa 0.553), and external validation on Gleason 2019 data fell to 54.5% (kappa 0.389).
Pages 3-5
Comparison with Baseline Methods

The authors compared YAAGGS against three baseline approaches to validate the design choices of their system. All comparisons were conducted in the holistic setting using the same validation data.

ImageNet pre-trained feature extractor: Replacing the MIL-trained cancer detection model with a generic ImageNet pre-trained model for the first stage dropped accuracy from 77.5% to 72.6% (95% CI: 67.2-78.1%), kappa from 0.650 to 0.559 (95% CI: 0.471-0.647), and quadratic-weighted kappa from 0.897 to 0.845 (95% CI: 0.746-0.945). This demonstrates that domain-specific feature extraction, even when trained only for binary cancer detection, captures histological patterns more effectively than general-purpose ImageNet features.

Multi-class MIL model: Training the first stage model with a multi-class MIL method (benign, Gleason patterns 3, 4, and 5) rather than binary cancer detection yielded accuracy of 75.6% (95% CI: 70.3-80.9%), kappa of 0.622 (95% CI: 0.540-0.703), and quadratic-weighted kappa of 0.901 (95% CI: 0.819-0.982). Interestingly, this explicit Gleason pattern discrimination did not meaningfully outperform the binary cancer detection approach, supporting the hypothesis that the cancer detection model implicitly learns Gleason pattern-specific features.

CLAM (attention-based MIL): The recently proposed Clustering-constrained Attention Multiple Instance Learning (CLAM) method performed worst among all approaches, with accuracy of 67.3% (95% CI: 61.6-73.0%), kappa of 0.469 (95% CI: 0.376-0.562), and quadratic-weighted kappa of 0.779 (95% CI: 0.658-0.900). The authors tested CLAM with both ResNet-50 and DenseNet-121 backbones, explored attention-based pooling, linear layer sizes of 128 to 1024, bag weights from 0.0 to 1.0, and learning rates from 1e-1 to 1e-5. Despite this hyperparameter search, CLAM underperformed the proposed system by over 10 percentage points in accuracy.

TL;DR: YAAGGS (77.5% accuracy, kappa 0.650) outperformed ImageNet pre-trained features (72.6%), multi-class MIL (75.6%), and CLAM (67.3%). The binary cancer detection model proved as effective as explicit Gleason pattern training for feature extraction.
Pages 4-6
How the Model Learns to Distinguish Gleason Patterns

A critical question for any weakly supervised system is whether it actually learns the features it is hypothesized to learn. The authors performed two mechanism analyses to investigate whether the first stage model distinguishes Gleason patterns despite being trained only for cancer detection, and whether the second stage model is sensitive to the proportions of different patterns.

Feature space visualization (t-SNE): The authors sampled 600 cancer image patches each from WSIs with Gleason scores 3+3, 4+4, and 5+5 in the validation set. These 1,800 patches were embedded into 1024-dimensional space by the first stage model and then projected into two dimensions using t-distributed stochastic neighbor embedding (t-SNE). The visualization revealed clear clustering by Gleason pattern, with distinct spatial separations among patterns 3, 4, and 5. Results were consistent across perplexity values ranging from 5 to 1,000, confirming that the cancer detection model assigns distinguishable features to different Gleason patterns without ever being explicitly trained on pattern labels.
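The analysis can be reproduced in miniature with scikit-learn's TSNE; here the 1024-dimensional "patch features" are simulated by three synthetic clusters standing in for patterns 3, 4, and 5 (the data, cluster means, and sample counts are invented for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# simulate 20 patch embeddings per "pattern", shifting the mean per cluster
features = np.concatenate(
    [rng.normal(loc=mean, scale=0.5, size=(20, 1024)) for mean in (0.0, 2.0, 4.0)]
)
# project to 2-D; the paper swept perplexity from 5 to 1,000 on 1,800 real patches
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)
```

If the feature extractor really separates the patterns, the three groups form visibly distinct clusters in the 2-D embedding, which is what the authors observed across the full perplexity range.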

Proportion sensitivity analysis: To test whether the second stage model responds appropriately to varying ratios of Gleason patterns, the authors created synthetic WSIs by combining portions of pure 3+3 and 4+4 slides at six predefined ratios (100:0, 80:20, 60:40, 40:60, 20:80, 0:100). Five non-overlapping WSI pairs were used to generate 30 synthetic WSIs. As the proportion of Gleason pattern 4 increased, the predicted grade group 1 probability steadily decreased, grade group 4 and 5 probabilities increased, and grade group 2 and 3 probabilities first rose then fell. This mirrors how a pathologist weighs the relative ratio of Gleason patterns when assigning a grade group.
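The synthetic-slide construction amounts to taking complementary fractions of patches from the two source slides. A sketch of that mixing step (assuming, for simplicity, equal-sized patch pools from both slides; the function name is mine):

```python
import numpy as np

def mix_slides(patches_a, patches_b, ratio_a):
    """Combine patches from two slides at a given ratio (e.g. 0.6 -> 60:40)."""
    n = len(patches_a)
    k = int(round(ratio_a * n))
    # first k patches come from slide A, the remaining n-k from slide B
    return np.concatenate([patches_a[:k], patches_b[k:]], axis=0)
```

Sweeping ratio_a across 1.0, 0.8, 0.6, 0.4, 0.2, 0.0 for a 3+3/4+4 pair reproduces the six predefined mixtures fed to the second-stage model.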

This analysis provides meaningful evidence that the model operates in a manner analogous to the human diagnostic process: first recognizing pattern-level features and then assessing their relative proportions to determine the final grade group. The model appears to have independently discovered the clinical logic embedded in the Gleason grading system.

TL;DR: t-SNE visualization confirmed that the cancer detection model learns distinct feature representations for Gleason patterns 3, 4, and 5 without explicit pattern labels. Synthetic WSI experiments showed the second stage model responds proportionally to Gleason pattern ratios, mimicking pathologist behavior.
Pages 5-6
Data Quality, Reference Standards, and Generalization Gaps

Weak reference standard: The most significant limitation is the quality of the ground truth labels. Gleason grading is inherently variable among pathologists, and the reference standard in this study was derived from either a single pathologist's review or original hospital diagnoses rather than a consensus panel of experts. While this approach aligns with the study's goal of minimizing development costs, it means the system was trained and evaluated against a noisy ground truth. A stronger reference standard, such as consensus grading by multiple urologic pathologists, would better demonstrate clinical utility.

Limited inter-institutional generalization: The inter-institutional experiment showed a meaningful performance drop (kappa from 0.650 to 0.553). The additional experiment with a reduced training set (kappa 0.521) suggested that at least part of this decline was attributable to reduced data volume rather than institutional differences alone. However, with only two institutions in the study, the true generalization capability across diverse clinical settings remains uncertain.

External validation challenges: The steep performance drop on the Gleason 2019 Challenge data (kappa from 0.650 to 0.389) exposes sensitivity to stain color distribution differences between institutions. The authors incorporated color augmentation during training, but this proved insufficient for the visually distinct external dataset. The Gleason 2019 data also used tissue microarray (TMA) images rather than needle biopsy WSIs, introducing additional domain shift. This highlights the ongoing challenge of developing histopathology AI models that generalize across staining protocols, scanners, and tissue preparation methods.

Guideline inconsistency: The HUMC data spanned 2009-2017, during which Gleason grading guidelines changed. Some slides were graded with the 2005 ISUP guidelines and others with the 2014 guidelines. The authors could not analyze in detail whether these guideline changes and inter-observer variability among the five pathologists at HUMC affected the validation results. This temporal inconsistency introduces systematic noise into both the training labels and the reference standard.

TL;DR: Key limitations include a weak reference standard (single-pathologist or original hospital diagnoses), poor external generalization (kappa dropped to 0.389 on Gleason 2019 data), stain color sensitivity, only two training institutions, and inconsistent grading guidelines across the 9-year data collection period.
Pages 6-7
Scaling Weakly Supervised Learning Beyond Prostate Cancer

The central contribution of YAAGGS is not just its Gleason grading performance, but its demonstration that a weakly supervised approach, requiring only slide-level labels, can achieve competitive grading accuracy. This has significant implications for developing AI systems for other cancer types where expert region-level annotations are scarce or prohibitively expensive to obtain.

Multi-institutional expansion: The authors explicitly call for future research involving an increased number of hospitals and pathologists. Expanding beyond two Korean institutions to include international centers with diverse staining protocols, scanner hardware, and pathologist populations would be essential for validating the approach's generalizability. Techniques such as stain normalization, domain adaptation, and more aggressive color augmentation could help bridge the gap exposed by the external validation results.

Stronger reference standards: Future studies should incorporate consensus grading by multiple expert urologic pathologists, potentially combined with molecular or genomic ground truth (such as correlation with biochemical recurrence outcomes) to move beyond the limitations of single-pathologist reference standards. This would also enable more meaningful comparisons with existing systems like those by Bulten et al. (quadratic-weighted kappa 0.918) and Strom et al. (linear-weighted kappa 0.83), which used more complex annotation pipelines but also stronger evaluation frameworks.

Transferability to other cancers: The weakly supervised MIL framework demonstrated here could potentially be applied to grading systems in other cancers, such as breast cancer (Nottingham grading), renal cell carcinoma (Fuhrman/WHO-ISUP grading), or bladder cancer. Any cancer type that uses morphological grading from histopathology slides could benefit from the reduced annotation burden. The key requirement is that the first-stage detection model must implicitly learn features relevant to grading, which the t-SNE analysis in this study suggests is plausible for well-differentiated pattern systems.

Technical improvements: More advanced stain augmentation strategies, larger and more diverse training datasets, and modern attention-based architectures (which underperformed here via CLAM but have since been refined in the literature) could further close the gap between weakly supervised and fully supervised approaches. Integration with clinical outcome data, rather than just pathologist grading, would also strengthen the clinical relevance of these systems.

TL;DR: The weakly supervised approach could extend to other cancer grading systems (breast, renal, bladder) without expensive region-level annotations. Future work should prioritize multi-institutional validation, stronger reference standards, advanced stain normalization, and correlation with clinical outcomes rather than just pathologist agreement.
Citation: Mun Y, Paik I, Shin SJ, Kwak TY, Chang H. Open Access, 2021. Available at: PMC8203612. DOI: 10.1038/s41746-021-00469-6. License: CC BY.