Automated Gleason grading of prostate cancer tissue microarrays via deep learning


Plain-English Explanations
Pages 1-2
Why Automated Gleason Grading Matters for Prostate Cancer

The Gleason scoring system, first established by Donald Gleason in 1966, remains the most powerful prognostic tool for prostate cancer. It works by examining the architectural patterns of tumor tissue under a microscope. Pathologists assign numbers from 1 (well differentiated) to 5 (poorly differentiated) to different regions. The final Gleason score is the sum of the two most prevalent patterns; in modern practice the lowest assigned score is 6 (3+3). The system was revised by the International Society of Urological Pathology (ISUP) in 2005 and 2014, and it is recognized by the World Health Organization.

Despite its clinical importance, the Gleason scoring system suffers from a well-known problem: limited inter-pathologist reproducibility. The histological assessment is time-consuming and subjective, particularly for intermediate Gleason patterns 3 and 4, which can be very difficult to distinguish. This subjectivity directly affects patient stratification and treatment decisions. Gleason pattern 3 describes well-formed, separated glands of variable size. Gleason pattern 4 includes fused glands, cribriform and glomeruloid structures, and poorly formed glands. Gleason pattern 5 involves poorly differentiated individual cells, solid sheets, cords, and comedonecrosis.

Automated computational approaches operating on digital pathology images offer the potential to deliver reproducible results at scale. Earlier approaches relied on hand-crafted image features combined with conventional classification techniques. In recent years, deep learning has emerged as a disruptive alternative, using multi-layered neural networks that extract complex features directly from the data without manual feature engineering.

Prior deep learning work on prostate cancer Gleason grading had significant limitations. Källén et al. only assessed their method on tissue slides with homogeneous Gleason patterns. Zhou et al. achieved 75% accuracy differentiating Gleason 3+4 from 4+3 on TCGA whole slide images, but focused only on intermediate scores. del Toro et al. trained a binary classifier limited to low (7 or lower) versus high (8 or higher) discrimination. This study aims to move beyond those constraints by training and evaluating on detailed subregion annotations with heterogeneous patterns.

TL;DR: Gleason scoring is the most important prognostic tool for prostate cancer but suffers from inter-pathologist variability, especially for intermediate patterns 3 and 4. Prior deep learning approaches were limited to homogeneous patterns or binary classification. This study tackles the full grading problem with detailed subregion annotations.
Pages 2-3
Tissue Microarray Resource: 886 Patients Across Five TMAs

The study used five tissue microarrays (TMAs), each containing 200 to 300 spots. Spots with artifacts or non-prostate tissue (such as lymph node metastases) were excluded. A first pathologist (K.S.F.) carefully delineated cancerous regions in each TMA spot and assigned Gleason patterns of 3, 4, or 5 to each region. Spots without cancerous tissue were labeled as benign. This level of subregion annotation is more granular than most prior work, which typically assigned a single score per whole slide.

Training cohort: TMAs 111, 199, and 204 were combined for a total of 508 training spots. TMA 204 was notably enriched for high-grade disease (69 spots with Gleason 10). Validation cohort: TMA 76 (133 spots) was selected because it had the most balanced distribution of Gleason scores across all categories. Test cohort: TMA 80 (245 spots) was chosen for final evaluation because it had the highest number of cases and, critically, the most death events (n = 30) for survival analysis.

To enable a rigorous assessment of inter-pathologist variability, the entire test cohort (TMA 80) was independently annotated by a second pathologist (J.H.R.). This dual-annotation design allowed direct comparison of model-versus-pathologist agreement with pathologist-versus-pathologist agreement on the same dataset. Clinical survival data was available for three TMAs (76, 80, 111), with TMA 80 providing the most informative survival endpoints due to its 30 disease-specific death events.

The TMAs were digitized at 40x resolution (0.23 microns per pixel) using a NanoZoomer-XR Digital slide scanner at the University Hospital Zurich. Tumor stage and Gleason scores were assigned according to UICC and WHO/ISUP criteria. The TMARKER software was used for annotation delineation.

TL;DR: Five TMAs totaling 886 patients were used: 508 for training, 133 for validation, and 245 for testing. The test cohort was independently annotated by two uropathologists, enabling direct comparison of model performance against inter-pathologist variability. TMA 80 had 30 disease-specific death events for survival analysis.
Pages 2-3, 8-9
Architecture Selection, Transfer Learning, and Training Pipeline

The approach used a patch-based classification strategy. Image patches of 750 x 750 pixels were extracted from each TMA spot (original resolution 3100 x 3100) using a step size of 375 pixels. Each patch was labeled according to the annotation in its central 250 x 250 region, and patches with no annotation or multiple annotations in the center were discarded. The patches were resized to 250 x 250, with random cropping to 224 x 224 during training alongside random rotations, flipping, and color jittering for data augmentation.
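The patch-extraction step can be sketched in a few lines of NumPy. This is a minimal illustration under the assumptions above (750-pixel windows, 375-pixel stride, labels taken from the central 250 x 250 region of an annotation mask); function and variable names are ours, not the authors':

```python
import numpy as np

def extract_labeled_patches(image, mask, patch=750, stride=375, center=250):
    """Slide a patch x patch window over a TMA spot. Each patch is labeled
    by the single annotation class (mask value) covering its central
    center x center region; patches whose center is unannotated (0) or
    contains multiple classes are discarded."""
    height, width = mask.shape
    off = (patch - center) // 2
    patches = []
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            core = mask[y + off:y + off + center, x + off:x + off + center]
            classes = np.unique(core)
            if len(classes) == 1 and classes[0] != 0:
                patches.append((image[y:y + patch, x:x + patch], int(classes[0])))
    return patches
```

In the paper's pipeline, the surviving patches are then resized to 250 x 250 before random cropping and augmentation.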

Architecture benchmarking: Five convolutional neural network architectures were evaluated: VGG-16, Inception-V3, ResNet-50, DenseNet-121, and MobileNet. All fully connected layers were removed and replaced with a global average pooling layer followed by a softmax classification layer over four classes (benign, Gleason 3, Gleason 4, Gleason 5). Transfer learning from ImageNet pre-trained weights consistently outperformed training from scratch, as expected given the limited dataset size (641 patients across the combined training and validation cohorts).

MobileNet (width multiplier alpha = 0.5) won the benchmark. MobileNets use depthwise separable convolutions, a design that dramatically reduces parameter count compared to standard convolutions. With alpha = 0.5, the network starts at 16 channels and scales up to 512, further reducing parameters. This smaller model was better suited to the limited training data, avoiding severe overfitting without sacrificing performance. Fine-tuning used SGD with learning rate 0.0001 and Nesterov momentum 0.9, with dropout of 0.2. The categorical cross-entropy loss was minimized over 50,000 iterations with balanced mini-batches of size 32.
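The parameter savings from depthwise separable convolutions are easy to verify with a back-of-the-envelope count. This is a sketch that ignores biases and batch-norm parameters, and the layer sizes are illustrative rather than taken from the paper:

```python
def conv_params(k, c_in, c_out):
    # standard convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise step: one k x k filter per input channel;
    # pointwise step: a 1 x 1 convolution that mixes channels
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 256, 256)                   # 589,824 parameters
separable = depthwise_separable_params(3, 256, 256)   # 67,840 parameters
print(standard / separable)                           # roughly 8.7x fewer
```

Halving every channel count with alpha = 0.5 then roughly quarters the pointwise-convolution parameters again, which helps explain why the alpha = 0.5 MobileNet resisted overfitting on a few hundred training spots.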

Balanced mini-batches were crucial. At each training iteration, an equal number of examples from each class was randomly selected. This prevented the model from being biased toward more frequent classes. The authors also noted that networks trained from scratch without proper regularization overfit completely, producing confident but wrong predictions with exploding validation loss, while fine-tuned regularized models like MobileNet maintained stable validation loss trajectories.
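The balanced sampling scheme can be sketched as follows. This is a minimal NumPy version; the authors' exact sampling code is not reproduced in the text, so the names here are illustrative:

```python
import numpy as np

def balanced_batch_indices(labels, batch_size=32, n_classes=4, seed=0):
    """Return batch_size indices with an equal number drawn (with
    replacement) from each class, so no class dominates a mini-batch."""
    rng = np.random.default_rng(seed)
    per_class = batch_size // n_classes
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
        for c in range(n_classes)
    ])
    rng.shuffle(idx)
    return idx
```

With four classes and batch size 32, each mini-batch contains exactly eight examples per class regardless of how skewed the overall label distribution is.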

TL;DR: Five CNN architectures were benchmarked. MobileNet (alpha = 0.5) was selected for its resistance to overfitting on the small dataset. Transfer learning from ImageNet, dropout of 0.2, and balanced mini-batches were all critical. Patches of 750 x 750 were classified into four categories: benign, Gleason 3, 4, and 5.
Pages 3-4
Patch-Level Performance and Inter-Annotator Agreement

On the validation cohort (TMA 76), MobileNet achieved a macro-average recall of 70% across all four classes. The per-class recall values were: benign 63%, Gleason 3 72%, Gleason 4 58%, and Gleason 5 88%. Cohen's quadratic kappa for patch-level classification on the validation set was 0.67. The lower recall for Gleason 4 and benign classes reflects the inherent difficulty of distinguishing these patterns, particularly in borderline cases.

On the test cohort (TMA 80), the model's patch-level predictions were compared against both pathologists' annotations. Against the first pathologist, Cohen's quadratic kappa was 0.55 and macro-average recall was 0.58. Against the second pathologist, kappa was 0.49 and macro-average recall was 0.53. The inter-pathologist agreement on the same test patches was kappa = 0.67 with macro-average recall of 0.71. Most misclassifications occurred between neighboring Gleason patterns. For example, 31% of patches annotated as Gleason 5 by the first pathologist were predicted as Gleason 4 by the model, but only 5% were predicted as lower patterns.
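Cohen's quadratically weighted kappa, the agreement statistic used throughout, penalizes disagreements by the squared distance between the two assigned classes, so confusing adjacent Gleason patterns costs far less than confusing distant ones. A minimal NumPy implementation (ours, for illustration):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes=4):
    """Cohen's kappa with quadratic weights. a, b: integer class labels
    from two raters. Returns 1 for perfect agreement, 0 for chance-level
    agreement, and negative values for worse-than-chance agreement."""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (a, b), 1)           # confusion matrix
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.indices((n_classes, n_classes))
    weights = (i - j) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Under this weighting, the gap between the model-versus-pathologist values (0.55 and 0.49) and the inter-pathologist value (0.67) mostly reflects the adjacent-pattern confusions described above rather than gross errors.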

Using a "union" evaluation (a prediction counted as correct if it matched at least one pathologist), precision values were: benign 58%, Gleason 3 75%, Gleason 4 86%, and Gleason 5 48%. The lower precision for benign and Gleason 5 classes is expected given their lower frequency in the dataset. Detailed inspection revealed that model "errors" often reflected genuine ambiguity. Benign predictions were mostly stromal tissue or glands that truly appeared benign but fell within a larger region annotated as cancerous. Gleason 5 predictions often flagged patches with single cells indicative of pattern 5, where pathologists had assigned pattern 4 based on broader context.

This finding is clinically relevant: the model's patch-level focus sometimes identified small atypical regions that pathologists may overlook when evaluating larger tissue areas. The model consistently recognized key characteristics of each Gleason pattern, suggesting utility as an attention-directing tool even when full agreement with pathologists is not achieved.

TL;DR: Patch-level kappa between model and pathologists was 0.55 and 0.49, compared to inter-pathologist kappa of 0.67. Most errors were between adjacent Gleason patterns. The model detected small atypical regions that pathologists missed when looking at larger areas, showing potential as an attention-directing assistant.
Pages 4-6
Spot-Level Gleason Scoring Matches Pathologist Agreement

The clinical question is not patch-level accuracy but the final Gleason score assigned to each tissue sample. To generate spot-level scores, the trained network was applied in a sliding window fashion to produce pixel-level probability maps for each class (benign, Gleason 3, 4, and 5). A weighted score was computed for each class, and a threshold of c = 0.25 was applied so that minor detections below this threshold did not influence the final score. The primary and secondary Gleason patterns were then summed to produce the composite Gleason score.
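A simplified version of this aggregation can be sketched in NumPy. It assumes per-pixel argmax assignments rather than the paper's exact probability weighting, so treat the details as our reading, not a verbatim reimplementation:

```python
import numpy as np

def spot_gleason_score(prob_maps, threshold=0.25):
    """prob_maps: (4, H, W) per-pixel probabilities for benign, Gleason 3,
    4, 5. Pixels are assigned to their most probable class; Gleason
    patterns covering less than `threshold` of the cancerous area are
    ignored; the two largest remaining patterns are summed. A spot with a
    single surviving pattern is scored as that pattern doubled."""
    assignment = prob_maps.argmax(axis=0)
    cancerous = assignment > 0
    if not cancerous.any():
        return "benign"
    patterns = np.array([3, 4, 5])
    fractions = np.array([(assignment == k).sum() for k in (1, 2, 3)], float)
    fractions /= fractions.sum()
    kept = patterns[fractions >= threshold]
    kept_fracs = fractions[fractions >= threshold]
    order = np.argsort(kept_fracs)[::-1]
    primary = kept[order[0]]
    secondary = kept[order[1]] if len(kept) > 1 else primary
    return int(primary + secondary)
```

For example, a spot whose cancerous area is 60% pattern 3 and 40% pattern 4 would be scored 3 + 4 = 7, while a 10% minor pattern below the 0.25 threshold would be ignored.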

At the TMA spot level, the inter-annotator agreement between the model and each pathologist reached kappa = 0.75 (model vs. first pathologist) and kappa = 0.71 (model vs. second pathologist). Critically, the inter-pathologist agreement was kappa = 0.71. This means the model's agreement with the first pathologist actually exceeded the agreement between the two human experts. These results demonstrate that the deep learning system performs on par with pathologists for composite Gleason scoring.

The pixel-level probability maps provide transparent, visually interpretable outputs that pathologists can directly evaluate. In representative examples, the model correctly identified a Gleason 6 (3+3) case, correctly annotated a Gleason 8 (4+4) case, and in a disputed case where the first pathologist assigned Gleason 8 (4+4) and the second assigned Gleason 6 (3+3), the model assigned Gleason 7 (4+3). A third independent uropathologist confirmed the model's annotation for this case. In another disputed case (Gleason 8 vs. Gleason 10), the model assigned Gleason 9 (5+4), and the third pathologist noted the presence of ambiguous single cells supporting the model's intermediate interpretation.

The network was converted to a fully convolutional architecture for efficient evaluation. The global average pooling layer was replaced with a local 7x7 average pooling layer with stride 1, the classification layer was replaced with a convolutional layer with four output channels, and an upsampling layer with factor 32 restored the original image dimensions.
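The effect of this conversion can be illustrated in NumPy: a 7 x 7 average pooling applied at every position with stride 1, followed by the shared classifier weights acting as a 1 x 1 convolution, produces a dense map of class scores instead of a single prediction. On a feature map that is exactly 7 x 7, the converted head reduces to the original global-average-pooling classifier. This is a sketch with made-up weights, not the paper's code:

```python
import numpy as np

def local_avg_pool(fmap, k=7):
    """fmap: (H, W, C). k x k average pooling with stride 1, valid padding."""
    H, W, C = fmap.shape
    out = np.empty((H - k + 1, W - k + 1, C))
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            out[y, x] = fmap[y:y + k, x:x + k].mean(axis=(0, 1))
    return out

def dense_class_logits(fmap, w_cls, b_cls, k=7):
    """Applying the classifier weights (w_cls: (C, 4), b_cls: (4,)) at
    every pooled location is a 1 x 1 convolution, yielding one 4-way
    logit vector per spatial position."""
    return local_avg_pool(fmap, k) @ w_cls + b_cls
```

On larger inputs, the same weights slide across the whole spot, which is what makes the pixel-level probability maps cheap to compute.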

TL;DR: At the TMA spot level, model-pathologist kappa values were 0.75 and 0.71, matching the inter-pathologist kappa of 0.71. In disputed cases, an independent third pathologist confirmed the model's intermediate scores, demonstrating pathologist-level Gleason grading performance.
Pages 5-6
Class Activation Mapping Confirms the Model Learned Real Morphology

Deep learning models are frequently criticized as non-interpretable "black boxes," which limits clinical trust. To address this, the authors applied class activation mapping (CAM), a technique that highlights the image regions most important for each classification decision. CAM works by projecting the class-specific weights of the output layer back to the feature maps of the last convolutional layer, producing heatmaps that reveal where the network is focusing.
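For a network that ends in global average pooling, CAM is essentially a one-line computation: the heatmap for a class is the last convolutional feature maps weighted by that class's output-layer weights. A minimal sketch (names ours):

```python
import numpy as np

def class_activation_map(fmap, w_cls, target_class):
    """fmap: last-conv feature maps (H, W, C); w_cls: (C, n_classes)
    weights of the classification layer that follows global average
    pooling. Weighting the channels by the target class's weights gives
    a spatial map of evidence for that class, normalized to [0, 1]."""
    cam = fmap @ w_cls[:, target_class]      # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                     # rescale for display
    return cam
```

Upsampled to the input resolution and overlaid on the patch, this is the heatmap the authors inspected for each Gleason pattern.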

The CAM analysis confirmed that the model learned biologically meaningful patterns. For benign patches, the model focused on well-formed glands with intact basal cell layers and no cytological atypia. For Gleason 3, it concentrated on round-shaped, well-formed glands that were variable in size but clearly separated. For Gleason 4, the model focused on merged, irregularly shaped glands, including cribriform patterns. For Gleason 5, it detected the absence of gland formation and solid sheets of tumor.

A particularly notable finding was that for Gleason 3 predictions, the model specifically focused on gland junctions, verifying that glands were not fused. Fused glands would indicate Gleason pattern 4. This demonstrates that the network learned to discriminate between the two most clinically challenging patterns (3 vs. 4) by attending to the exact morphological features that pathologists use. The model also consistently ignored stromal regions and focused on epithelium, which aligns with correct pathological practice.

High-confidence predictions (probability greater than 0.8 for the correct class) were selected for CAM visualization. These confident and correctly classified patches showed clear architectural differences across Gleason patterns, providing evidence that the model internalized the Gleason grading criteria rather than relying on spurious correlations or artifacts.

TL;DR: Class activation mapping showed the model focuses on gland architecture, specifically checking gland junctions for fusion (the key Gleason 3 vs. 4 distinction). It ignores stroma and attends to epithelium, confirming it learned the same morphological features pathologists use.
Pages 6-8
Model Achieves Pathologist-Level Survival Stratification

The ultimate clinical test for a Gleason grading system is whether it can stratify patients into groups with meaningfully different outcomes. For the test cohort, patients were split into three risk groups based on their Gleason score: low risk (Gleason 6 or below), intermediate risk (Gleason 7), and high risk (Gleason 8 or above). Kaplan-Meier curves were generated for disease-specific survival, and pairwise two-tailed logrank tests with Benjamini-Hochberg correction were used to assess statistical significance.
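The Benjamini-Hochberg step-up procedure used to correct the pairwise logrank p-values can be sketched as follows (a standard textbook implementation, not code from the paper):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Adjust p-values with the Benjamini-Hochberg step-up procedure:
    sort ascending, scale each p by n / rank, then enforce monotonicity
    by taking running minima from the largest p downward."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adjusted = np.empty(n)
    running_min = 1.0
    for rank in range(n - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

With only three pairwise comparisons per grader (low vs. intermediate, low vs. high, intermediate vs. high), the correction is mild, but it keeps the reported p-values comparable across the model and the two pathologists.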

The model outperformed both pathologists in separating low-risk from intermediate-risk patients. The BH-corrected p-value for this separation was 0.098 for the model, compared to 0.79 for the first pathologist and 0.292 for the second. For low-risk versus high-risk separation, the model achieved p = 0.023, the first pathologist achieved p = 0.048, and the second pathologist achieved p = 0.217. Notably, the model assigned Gleason 7 more frequently than either pathologist, capturing heterogeneous cases that humans tended to classify as either purely low-grade or high-grade.

A striking result was that no disease-specific death events occurred among patients classified as low-risk by the model. In contrast, 3 deaths occurred in the first pathologist's low-risk group and 2 in the second pathologist's low-risk group. This suggests the model may be slightly better at identifying occult higher-grade patterns that elevate patient risk.

The inter-pathologist variability for Gleason 7 was substantial: the two pathologists agreed on only 19 cases and disagreed on 50 cases for this intermediate-risk group. Meanwhile, 59% of cases assigned Gleason 7 by the first pathologist and 59% of cases assigned Gleason 7 by the second pathologist were also assigned Gleason 7 by the model. The model's consistent identification of heterogeneous patterns contributed to its superior low-vs-intermediate risk stratification.

TL;DR: The model separated low-risk from intermediate-risk patients more significantly (p = 0.098) than either pathologist (p = 0.79 and p = 0.292). Zero disease-specific deaths occurred in the model's low-risk group, compared to 3 and 2 deaths in the pathologists' low-risk groups. The two pathologists disagreed on 50 out of 69 Gleason 7 cases.
Pages 7-8
From TMA Spots to Clinical Biopsies: What Remains to Be Done

Stromal misclassification: The model occasionally misclassified stromal regions as Gleason pattern 3, likely because training annotations for Gleason 3 sometimes included adjacent stromal tissue. Border artifacts from tissue preparation also caused errors at TMA spot edges. The authors suggest that a separate neural network trained to detect stroma and artifacts could serve as a preprocessing filter to eliminate these errors in clinical deployment.

Subjective ground truth: A fundamental limitation is that the model was trained on annotations from a single pathologist, inheriting that individual's biases and habits. Inter-pathologist variability is non-negligible, as the study itself demonstrates (kappa = 0.71 between the two pathologists). Consensus annotations from multiple experts would enable more objective model training. Furthermore, pathologists from different hospitals may follow slightly different guidelines, meaning larger multi-center studies are needed to build a generalizable system.

TMA spots versus biopsies: The study used resection specimens formatted as tissue microarray spots. In clinical practice, Gleason grading is typically performed on needle core biopsies, which present different challenges. While the Gleason scoring procedure itself does not change between TMAs and biopsies, the approach would need to be validated on biopsy specimens with corresponding annotations. Additionally, the current five Grade Group system (Grade Group 1 through 5, corresponding to Gleason 6 or below, 3+4, 4+3, 8, and 9-10) requires evaluation on larger tissue areas than TMA spots.
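The Grade Group mapping described above is a simple deterministic function of the primary and secondary patterns (a sketch; the helper name is ours):

```python
def grade_group(primary, secondary):
    """ISUP Grade Group from primary and secondary Gleason patterns:
    Gleason 6 or below -> 1, 3+4 -> 2, 4+3 -> 3, 8 -> 4, 9-10 -> 5."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        return 2 if (primary, secondary) == (3, 4) else 3
    if score == 8:
        return 4
    return 5
```

Note that the 3+4 vs. 4+3 distinction, which splits Grade Groups 2 and 3, is exactly the intermediate-pattern boundary where inter-pathologist agreement is weakest.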

Future work: The authors plan to evaluate the approach on whole slide tissue images with survival information. They also note that with a larger survival-annotated cohort, a deep neural network could be trained directly on survival endpoints rather than Gleason patterns, potentially discovering novel morphological features that predict outcome. The method is not limited to H&E staining and could be applied to immunohistochemistry or other specialized stains to study gene expression, tumor microenvironment, or drug uptake effects on clinical outcome.

TL;DR: Key limitations include occasional stromal misclassification, single-pathologist training annotations, and evaluation only on TMA spots rather than clinical biopsies. Future directions include whole slide image evaluation, multi-center validation, consensus annotations, and training directly on survival endpoints to discover novel prognostic morphological features.
Citation: Arvaniti E, Fricker KS, Moret M, et al. Automated Gleason grading of prostate cancer tissue microarrays via deep learning. Scientific Reports, 2018 (Open Access). Available at: PMC6089889. DOI: 10.1038/s41598-018-30535-1. License: CC BY.