Prostate cancer is the second most commonly diagnosed cancer among men, accounting for roughly 25% of cancer cases in the Western world. Diagnosis relies heavily on pathological grading using the Gleason system, which assigns histological patterns (Gleason patterns 1 through 5) based on the degree of glandular differentiation in biopsy tissue. The two most prevalent patterns are summed to produce a Gleason score, which in turn maps to one of five grade groups (GG 1 through GG 5) introduced to better predict patient prognosis.
The inter-observer variability problem: Despite the clinical importance of accurate grading, substantial disagreement exists between pathologists when assigning Gleason patterns. This variability persists even with the newer grade group system, and it directly impacts treatment decisions. For example, a patient graded as GG 2 (Gleason 3+4) might qualify for active surveillance, while a GG 3 (Gleason 4+3) classification could trigger more aggressive intervention. When pathologists disagree on these borderline cases, patients risk receiving suboptimal care.
The opportunity for deep learning: The digitization of histology slides into whole slide images (WSIs) has opened the door for computer-aided diagnosis (CAD). Convolutional neural networks (CNNs), a deep learning architecture especially suited for image classification, can automatically learn discriminative features from tissue images without requiring hand-engineered feature extraction. This study proposes a CNN-based approach to automatically detect Gleason patterns 3 and 4+ in heterogeneous prostate biopsies and then determine the overall grade group for each biopsy.
Prior work: Earlier automated approaches relied on gland detection followed by extraction of hand-crafted features such as gland and lumen surface area. These methods could miss regions of GP 4 where glandular structures are largely reduced. Litjens et al. used a CNN to differentiate tumorous from non-tumorous biopsies. Kallen et al. differentiated between GP 3 and GP 5 in homogeneous regions with 81% accuracy but did not address heterogeneous biopsies or full-slide grade group classification.
The study was approved by the Institutional Review Board of Amsterdam University Medical Centers (AMC). Hematoxylin and eosin (H&E) stained tissue sections were retrieved from the pathology archives of patients who underwent diagnostic biopsy between 2015 and 2017. A total of 96 tissue sections from 38 patients were included, with a median of two tissue blocks per patient (interquartile range: 1 to 4). Each tissue section could contain multiple biopsies or biopsy fragments.
Digitization process: The 4-micrometer-thick H&E sections were scanned using a Philips UltraFast scanner at 20x magnification, producing whole slide images with a pixel resolution of 0.5 micrometers. This high resolution is essential for distinguishing fine glandular structures that define different Gleason patterns.
Annotation protocol: Two trained observers manually annotated the digitized slides using a free-hand annotation tool, and a genitourinary pathologist subsequently checked all annotations. Four tissue classes were defined at the pixel level: (1) unaffected stroma (connective tissue), (2) non-atypical glands including healthy glands and low-grade prostatic intraepithelial neoplasia (LGPIN), (3) Gleason pattern 3, and (4) Gleason pattern 4 or higher with affected stroma. Because GP 5 was very rare in the dataset, GP 4 and GP 5 were merged into a single class to maintain balanced training data.
Adjusted grade groups: Since no differentiation was made between GP 4 and GP 5, the authors used a slightly modified grade group classification: adjusted GG 1 corresponded to Gleason score 6 or lower (3 + 3), GG 2 to 3 + ≥4, GG 3 to ≥4 + 3, and GG 4 to ≥4 + ≥4. Regions with ambiguous grading due to out-of-focus areas, tissue folds, excessive ink, or inconclusive immunohistochemistry were excluded.
Patch extraction: The CNN required input patches of 299 x 299 pixels, corresponding to approximately 150 x 150 micrometers of tissue. Patches were randomly extracted from annotated RGB images using MATLAB R2015b, with the central pixel of each patch determining its class label. To expand the training set, data augmentation was applied: rotation by 90, 180, and 270 degrees, plus horizontal and vertical mirroring of all patches.
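The rotation-and-mirroring augmentation described above can be sketched in a few lines. This is a minimal illustration assuming patches are NumPy arrays; the function name `augment` is ours, not from the study (which used MATLAB).

```python
import numpy as np

def augment(patch):
    """Return the original patch plus its 90-, 180- and 270-degree
    rotations and its horizontal and vertical mirror images
    (six variants per patch in total)."""
    variants = [patch]
    for k in (1, 2, 3):                 # np.rot90 rotates by k * 90 degrees
        variants.append(np.rot90(patch, k))
    variants.append(np.fliplr(patch))   # horizontal mirror
    variants.append(np.flipud(patch))   # vertical mirror
    return variants
```

Each extracted 299 x 299 patch thus contributes six training examples, all carrying the class label of the original patch's central pixel.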
Balanced partitioning: Patches were organized into four balanced partitions defined at the biopsy level: no biopsy appeared in more than one partition, although biopsies from the same patient could end up in different partitions. Within each partition, the number of patches per class was reduced to match the smallest class, ensuring the network did not develop bias toward overrepresented classes. Three partitions (approximately 268,000 patches) served as the training set, while the fourth (approximately 89,000 patches) was held out for testing. This four-fold cross-validation procedure allowed the CNN to be evaluated four separate times.
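The per-partition class balancing amounts to random undersampling to the smallest class. A small sketch under the assumption that patches are tracked as (id, label) pairs; `balance_classes` and the seeded RNG are our illustrative choices, not the authors' code.

```python
import random
from collections import defaultdict

def balance_classes(labeled_patches, seed=0):
    """Randomly undersample every class to the size of the smallest
    class. `labeled_patches` is a list of (patch_id, class_label)
    pairs; the returned list has equal counts for all classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patch_id, label in labeled_patches:
        by_class[label].append(patch_id)
    n_min = min(len(ids) for ids in by_class.values())
    balanced = []
    for label, ids in by_class.items():
        balanced.extend((pid, label) for pid in rng.sample(ids, n_min))
    return balanced
```

Undersampling (rather than oversampling the rare classes) avoids duplicating patches, at the cost of discarding some data from the larger classes.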
Inception v3 architecture: The authors retrained the Inception v3 CNN using CNTK (Microsoft Cognitive Toolkit), an open-source deep learning framework. Inception v3 is a well-established architecture composed of multiple layers of Inception modules and two classifying layers. The network outputs a probability distribution over the four tissue classes for each input patch. Training each network took approximately 175 hours.
Post-processing with SVM: The raw CNN probabilities were further refined using a cross-validated support vector machine (SVM) to differentiate between three groups: non-atypical tissue (combining stroma and non-atypical glands), GP 3, and GP 4 or higher. Probability maps were then generated by assigning each patch to the class with the highest probability. To reduce noise in grade group determination, a Gleason pattern was counted toward the grade group only if at least 4.5% of a biopsy's patches were assigned to it.
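Turning the refined probability map into per-pattern percentages is a simple argmax-and-count step. The sketch below assumes an (H, W, 3) probability array and the class ordering 0 = non-atypical, 1 = GP 3, 2 = GP ≥4; both the ordering and the function name are our assumptions for illustration.

```python
import numpy as np

def pattern_percentages(prob_map):
    """Assign each patch to its most probable class and return the
    percentage of patches classified as GP 3 and as GP 4 or higher.
    `prob_map` is an (H, W, 3) array of class probabilities with
    channel order: non-atypical, GP 3, GP >= 4 (assumed)."""
    labels = prob_map.argmax(axis=-1)
    n = labels.size
    pct_gp3 = 100.0 * np.count_nonzero(labels == 1) / n
    pct_gp4 = 100.0 * np.count_nonzero(labels == 2) / n
    return pct_gp3, pct_gp4
```

These percentages feed directly into the grade group determination described in the next section.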
The core innovation of this study is converting pixel-level Gleason pattern predictions into a whole-biopsy grade group classification. After the CNN and SVM produced probability maps for each biopsy, the percentages of patches classified as GP 3 and GP 4 or higher were calculated. These percentages were then combined according to the adjusted grade group table to assign an overall grade group.
Classification logic: The majority and minority Gleason patterns were combined into an adjusted Gleason score, which maps to the adjusted grade group. For example, if a biopsy had more GP ≥4 patches than GP 3 patches, the resulting score would be ≥4 + 3, yielding adjusted GG 3. If only one pattern was present, it was doubled (e.g., 3 + 3, yielding adjusted GG 1). The 4.5% minimum threshold ensured that scattered false-positive patches did not incorrectly elevate the grade group.
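The decision logic above can be written as a small function. This is our reconstruction of the described rules, not the authors' code; the return value 0 for "no pattern above threshold" is an illustrative convention.

```python
def adjusted_grade_group(pct_gp3, pct_gp4, threshold=4.5):
    """Map biopsy-level pattern percentages to the study's adjusted
    grade group. A pattern below `threshold` percent of patches is
    not counted; 0 means no malignant pattern was detected."""
    has_gp3 = pct_gp3 >= threshold
    has_gp4 = pct_gp4 >= threshold
    if not has_gp3 and not has_gp4:
        return 0                        # no malignant pattern detected
    if has_gp3 and not has_gp4:
        return 1                        # 3 + 3      -> adjusted GG 1
    if has_gp4 and not has_gp3:
        return 4                        # >=4 + >=4  -> adjusted GG 4
    # both patterns present: the majority pattern is listed first
    return 2 if pct_gp3 >= pct_gp4 else 3   # 3 + >=4 -> GG 2; >=4 + 3 -> GG 3
```

For instance, a biopsy with 30% GP 3 patches and 50% GP ≥4 patches yields adjusted GG 3, while 2% GP ≥4 falls below the threshold and is ignored.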
Probability map visualization: The study generated color-coded probability maps overlaid on the original H&E images. These maps provided a visual representation of where the CNN detected malignant tissue and GP 4+ regions. Post hoc visual evaluation of these maps was performed to identify common sources of false-positive detections, including tissue folds, out-of-focus regions, ink artifacts, and biopsy border effects where incomplete glands and cutting artifacts were present.
Accuracy assessment: Performance was evaluated using confusion matrices, sensitivity, specificity, accuracy, and F-measure (F1 score). The analysis was performed in two stages: first, a binary classification between non-atypical versus malignant tissue (GP 3 or higher), and second, a binary classification between GP 3 or lower versus GP 4 or higher. The kappa statistic with quadratic weighting was used to assess concordance between automated and pathologist-assigned grade groups.
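The evaluation metrics named above are standard; for concreteness, here is a minimal implementation of the binary metrics and of quadratically weighted kappa. The function names are ours, and the confusion-matrix orientation (rows: method, columns: pathologist) is an assumption.

```python
import numpy as np

def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy and F1 score from the four
    cells of a binary confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, acc, f1

def quadratic_weighted_kappa(cm):
    """Quadratically weighted kappa for a square confusion matrix
    `cm` (rows: method, columns: pathologist). Disagreements are
    penalized by the squared distance between the two grades."""
    cm = np.asarray(cm, dtype=float)
    n = cm.shape[0]
    i, j = np.indices((n, n))
    w = (i - j) ** 2 / (n - 1) ** 2          # quadratic weights
    # expected matrix if the two raters were independent
    expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / cm.sum()
    return 1.0 - (w * cm).sum() / (w * expected).sum()
```

Quadratic weighting is the appropriate choice for ordinal labels such as grade groups, since confusing GG 1 with GG 4 should count for more than confusing GG 2 with GG 3.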
The four trained networks showed only minor performance differences, so the authors reported results from one representative network. At the patch level, the CNN correctly classified 93% of non-atypical patches, 73% of GP 3 patches, and 77% of GP 4 or higher patches. The confusion matrix revealed that GP 3 was the most challenging class, with 14% of GP 3 patches misclassified as non-atypical and 13% misclassified as GP 4 or higher.
Malignancy detection (GP 3 or higher vs. non-atypical): When the classification was dichotomized into malignant versus non-atypical tissue, the system achieved an accuracy of 92%, with a sensitivity of 90% and specificity of 93%. The F-measure was 0.93, indicating strong overall performance in identifying cancerous regions.
High-grade detection (GP 4 or higher vs. GP 3 or lower): For the clinically critical distinction between GP 4+ and GP 3 or lower, accuracy reached 90%, with a sensitivity of 77% and specificity of 94%. The F-measure was 0.81. The lower sensitivity for GP 4+ reflects the well-documented difficulty of this classification, as fused or small glands without lumina can be categorized as either GP 3 or GP 4 even by expert pathologists.
Sources of error: Visual inspection of the probability maps identified several consistent sources of false positives. Tissue folds, out-of-focus regions, and areas obscured by ink all triggered erroneous high-probability predictions. Another major contributor was the border of biopsies, where incomplete glands and cutting artifacts mimicked malignant morphology. These artifacts represent addressable preprocessing challenges rather than fundamental model limitations.
The automated grade group classification agreed with the genitourinary pathologist in 65% of biopsies (N = 40), yielding a quadratic weighted kappa of 0.70. According to standard interpretation guidelines, this indicates substantial agreement. The confusion matrix showed that the method performed best for GG 1, correctly classifying 19 of 22 GG 1 biopsies. Performance was weaker for intermediate grade groups, particularly GG 2 and GG 3.
Comparison with inter-observer variability: The 65% concordance and kappa of 0.70 are in line with inter-observer agreement rates between two general pathologists, as reported by Ozkan et al. This is a significant finding because it suggests the CNN-based system has reached a performance level comparable to pathologist-to-pathologist agreement. Improving concordance with one observer would likely reduce concordance with another, reflecting the inherent subjectivity of Gleason grading.
GG 2 vs. GG 3 challenge: Differentiating between adjusted GG 2 (3 + ≥4) and adjusted GG 3 (≥4 + 3) remained the most difficult task. This distinction has the largest clinical implications, as it can determine whether a patient receives conservative management or more aggressive treatment. The difficulty mirrors the well-known challenge pathologists face when distinguishing borderline GP 3 from GP 4 patterns, particularly in fused or small glands without lumina.
Limited dataset: The study included only 96 tissue sections from 38 patients, which is relatively small for training a deep learning model. To maximize available data, the authors partitioned by biopsy rather than by patient. This means that biopsies from the same patient could appear in both training and testing partitions, potentially leading to an overestimation of accuracy due to patient-specific tissue patterns leaking between sets. However, the authors noted that visual inspection showed similar performance for patients present in only a single partition compared to those in multiple partitions.
Single-center, single-annotator bias: All annotations were created by two trained observers checked by one genitourinary pathologist. Given the well-documented inter-observer variability in Gleason grading, the reference standard may not fully represent the range of expert interpretations. The same concern applies to the annotation of low-grade prostatic intraepithelial neoplasia (LGPIN). However, the precise pixel-level delineation on high-resolution images and the two-stage annotation process likely improved reliability compared to typical clinical grading.
GP 5 underrepresentation: Because GP 5 was extremely rare in this biopsy cohort, the authors were forced to merge GP 4 and GP 5 into a single class and use an "adjusted" grade group system. This means the model cannot distinguish between standard GG 4 (Gleason score 8) and GG 5 (Gleason score 9-10), limiting its applicability to cases where this distinction is important. A larger, multi-institutional dataset with more heterogeneous GP 5 cases would be needed to train a full five-group classifier.
Artifact handling: The current pipeline does not automatically exclude artifacts such as tissue folds, out-of-focus regions, biopsy borders, and ink marks. These artifacts were identified as consistent sources of false positives. Implementing an automated artifact detection and exclusion step as a preprocessing module would likely improve both patch-level accuracy and grade group concordance.
Multi-annotator reference standards: Future datasets should be annotated by multiple genitourinary pathologists to reduce the influence of individual inter-observer variability on the training labels. Consensus annotations or probabilistic labeling strategies could produce a more robust ground truth that better represents the range of expert opinion.
Multi-institutional data: Incorporating biopsies from multiple hospitals is important for two reasons. First, it makes the model more robust against differences in tissue appearance caused by varying staining protocols, fixation methods, and scanner types. Second, larger and more diverse datasets generally improve CNN performance. The current single-scanner, single-institution design limits the model's generalizability to other clinical settings.
Advanced post-processing: The authors suggest that classification results could be improved by incorporating spatial context through techniques such as conditional random fields (CRFs). By considering the class assignments of neighboring patches, CRFs can smooth out noisy predictions and produce more spatially coherent probability maps. This approach has shown success in other histopathology segmentation tasks.
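A full CRF is beyond a short sketch, but the spatial-smoothing intuition can be illustrated with a crude stand-in: one pass of 3x3 majority voting over the patch-label map, which removes isolated mislabeled patches much as a CRF's pairwise potentials would. This is our simplification, not the technique the authors propose.

```python
import numpy as np

def majority_smooth(labels, n_classes):
    """One pass of 3x3 majority voting over a 2-D patch-label map.
    A crude stand-in for the spatial regularization a CRF would
    provide: each patch takes the most common label among itself
    and its eight neighbors."""
    h, w = labels.shape
    padded = np.pad(labels, 1, mode="edge")  # replicate border labels
    out = np.empty_like(labels)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3].ravel()
            out[y, x] = np.bincount(window, minlength=n_classes).argmax()
    return out
```

Unlike a CRF, this filter ignores the underlying class probabilities and image evidence; it only demonstrates why neighborhood context suppresses scattered false positives.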
Clinical integration: Automated diagnosis has the potential to reduce both the workload of and variability between pathologists. CNNs have already outperformed time-constrained pathologists in detecting breast cancer metastasis in lymph nodes. For prostate cancer, a practical deployment would likely involve the CNN generating probability maps and suggested grade groups that pathologists then review and confirm, serving as a decision support tool rather than a fully autonomous diagnostic system.