Radiomic-based machine learning model for the accurate prediction of prostate cancer

Plain-English Explanations
Page 1
Why Risk Stratification Matters in Prostate Cancer

Prostate cancer (PCa) is the most common malignant tumour of the male genital system. In 2021, the American Cancer Society estimated 248,530 new cases and 34,130 deaths in the United States alone. While PSA screening and digital rectal examination have improved detection rates and reduced advanced disease, they have also introduced significant overdiagnosis. In a large European randomised screening trial of 61,404 men, 76% of biopsies prompted by an elevated PSA found no cancer. Among those with a positive biopsy, an estimated 20 to 50% represented overdiagnosis.

Clinical risk groups: The EAU-EANM-ESTRO-ESUR-SIOG and NCCN guidelines classify PCa into low, intermediate, and high risk based on PSA levels, Gleason score, and clinical T stage. Low-risk patients are eligible for active surveillance, as most will not experience symptomatic progression. Intermediate- and high-risk patients typically receive surgery or radiation, though there is ongoing debate about whether active surveillance protocols can be extended to certain intermediate-risk patients.

The need for non-invasive tools: Current risk stratification depends on pathologic Gleason scores obtained through prostate biopsy, a procedure that carries risks including haemorrhage and urinary tract infections. Overtreatment itself can cause urinary incontinence and erectile dysfunction. This motivates the development of non-invasive, imaging-based approaches that can accurately classify risk without the complications of biopsy.

The authors set out to build a machine learning model using MRI-derived radiomic features to predict three-class PCa risk stratification (low, intermediate, and high risk), with particular attention to the intermediate-risk group, which had been underexplored in previous studies that typically merged it with the high-risk category.

TL;DR: Prostate cancer risk stratification currently relies on biopsy-derived Gleason scores, PSA, and T stage. Biopsy carries procedural risks, and PSA screening leads to many negative biopsies (76% of PSA-prompted biopsies found no cancer in one major trial). This study aims to use MRI radiomics and machine learning to classify PCa into three risk groups non-invasively.
Pages 2-3
Study Design, Patient Cohort, and MRI Protocol

Study population: This was a single-centre retrospective study conducted at the Second Affiliated Hospital of Chongqing Medical University. Between August 2016 and May 2021, 229 consecutive patients with histologically proven PCa who underwent pre-operative MRI were screened. After applying inclusion and exclusion criteria, 213 patients were enrolled: 16 low-risk, 65 intermediate-risk, and 132 high-risk. Inclusion required complete clinical data, pathological confirmation via biopsy or prostatectomy, and no prior prostate surgery, radiation, or endocrine therapy. Patients who had undergone biopsy within 6 months before MRI, had non-MRI-visible lesions, or had poor image quality were excluded.

Risk group assignment: Patients were stratified using the EAU-EANM-ESTRO-ESUR-SIOG classification. Low risk was defined as PSA less than 10 ng/mL, ISUP Grade 1 (Gleason score below 7), and clinical stage T1-2a, with all three criteria met. Intermediate risk was defined as PSA of 10 to 20 ng/mL, ISUP Grade 2 or 3 (Gleason score of 7), or clinical stage T2b. High risk was defined as PSA above 20 ng/mL, ISUP Grade 4 or 5 (Gleason score above 7), or clinical stage T2c and above. T staging was determined by a urologist with 10 years of experience using the AJCC/UICC system, and final risk assignment was confirmed by a urologist with 20 years of experience.
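
The thresholds above can be written as a small decision rule. The following is a minimal sketch, not the authors' code: the function name `eau_risk_group`, the T-stage string format (for example "T2a"), and the prefix matching are assumptions made for illustration.

```python
# Hypothetical helper: T stages are assumed to be strings such as "T1c", "T2a", "T2b", "T3a".
HIGH_T_PREFIXES = ("T2c", "T3", "T4")


def eau_risk_group(psa_ng_ml: float, isup_grade: int, clinical_t: str) -> str:
    """Return 'low', 'intermediate', or 'high' using the thresholds quoted in the text."""
    t = clinical_t.strip()
    # High risk: any single criterion is sufficient.
    if psa_ng_ml > 20 or isup_grade >= 4 or t.startswith(HIGH_T_PREFIXES):
        return "high"
    # Intermediate risk: any single criterion is sufficient.
    if psa_ng_ml >= 10 or isup_grade in (2, 3) or t.startswith("T2b"):
        return "intermediate"
    # Otherwise all low-risk criteria hold (PSA < 10, ISUP 1, cT1-2a),
    # assuming the inputs are valid values for a confirmed PCa case.
    return "low"


print(eau_risk_group(6.5, 1, "T1c"))   # low
print(eau_risk_group(8.0, 2, "T2a"))   # intermediate
print(eau_risk_group(5.0, 1, "T3a"))   # high
```

Checking the high-risk criteria first means a patient with one high-risk feature is never downgraded by otherwise favourable values.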

MRI acquisition: All imaging was performed on a 3.0T Siemens MAGNETOM Prisma scanner with an 8-channel phased array coil. The protocol included axial fat-suppression T2-weighted imaging (T2WI) with TR 3090 ms, TE 77 ms, slice thickness 3 mm, and a 320 × 240 matrix. Diffusion-weighted imaging (DWI) used an axial EPI sequence with TR 3800 ms, TE 84 ms, and a b-value of 1400 s/mm². Apparent diffusion coefficient (ADC) maps were then calculated on a GE Advanced Workstation.

No significant difference in age was observed across the three risk groups (one-way ANOVA, F = 0.617, p > 0.05), with mean ages of 71.00 ± 9.19, 73.22 ± 7.49, and 72.38 ± 8.20 years for the low-, intermediate-, and high-risk groups respectively.

TL;DR: 213 patients were stratified into low (n=16), intermediate (n=65), and high risk (n=132) per EAU guidelines. MRI was acquired at 3.0T with T2WI, DWI (b=1400), and ADC sequences. Risk assignment was confirmed by experienced urologists using PSA, Gleason score, and T stage criteria.
Pages 3-4
Radiomic Feature Extraction and Selection

Image segmentation: All pre-operative MRI images (T2WI, DWI, and ADC) were imported into GE Healthcare's Artificial Intelligence Kit (A.K. software) for region of interest (ROI) delineation. Two radiologists, each with more than 10 years of experience, manually delineated ROIs slice by slice. For multifocal prostate cancer, individual ROIs were drawn separately and then connected as a whole. All disagreements between the radiologists were resolved by consensus to ensure reproducibility.

Feature extraction: From each MRI sequence, the software extracted 107 radiomic features from the original images: 14 shape features, 18 first-order features, and 75 second-order features. The second-order features comprised grey level co-occurrence matrix (GLCM), grey level run-length matrix (GLRLM), grey level size zone matrix (GLSZM), grey level dependence matrix (GLDM), and neighbouring grey tone difference matrix (NGTDM) features. Wavelet transformation added further features, bringing the total to 851 per sequence (107 original plus 744 wavelet-derived). Across the three MRI sequences (T2WI, DWI, and ADC), this gave 3 × 851 = 2,553 radiomic features.
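
The feature counts can be checked with simple arithmetic. The sketch below assumes that 851 is the per-sequence total including the 107 original-image features, since that is the only reading under which three sequences give 2,553. The split of the wavelet features into 8 sub-bands of 93 non-shape features is an inference from the numbers, not something stated in the summary.

```python
# Counts quoted in the text for one MRI sequence (original, untransformed image).
shape, first_order, second_order = 14, 18, 75
original_per_sequence = shape + first_order + second_order

# Assumption: 851 is the per-sequence total (original + wavelet-derived).
total_per_sequence = 851
wavelet_per_sequence = total_per_sequence - original_per_sequence

sequences = ("T2WI", "DWI", "ADC")
total_features = total_per_sequence * len(sequences)

print(original_per_sequence)   # 107
print(wavelet_per_sequence)    # 744
print(total_features)          # 2553

# Inference only: 744 equals 8 wavelet sub-bands times the 93 non-shape features.
print(wavelet_per_sequence == 8 * (first_order + second_order))  # True
```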

Feature selection: All features were first normalised using Z-score standardisation. Mutual information (MI), a measure of the statistical dependence between two variables, was then used to rank features by their relevance to risk stratification. The MI scores were rescaled with min-max normalisation, and features scoring below the 80th percentile threshold were discarded. This process reduced the initial 2,553 features to 24 radiomic features.
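
The selection steps can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: features are discretised into equal-width bins before MI is computed, and the cut-off is taken to be 0.8 on the min-max-scaled MI score, which is only one possible reading of the threshold described above. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def z_score(values):
    """Standardise a feature to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def discretise(values, bins=3):
    """Equal-width binning (an assumption; needed for a discrete MI estimate)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """MI in nats between two discrete sequences of equal length."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def min_max(scores):
    """Rescale scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(s - lo) / span for s in scores]


def select_features(columns, labels, threshold=0.8):
    """Keep features whose scaled MI with the labels reaches the threshold."""
    names = list(columns)
    raw = [mutual_information(discretise(z_score(columns[n])), labels) for n in names]
    return [n for n, s in zip(names, min_max(raw)) if s >= threshold]


labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
columns = {
    "informative": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9],  # tracks the label
    "noise": [1.0, 5.0, 9.0, 1.0, 5.0, 9.0, 1.0, 5.0, 9.0],        # unrelated to it
}
print(select_features(columns, labels))  # ['informative']
```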

Of the final 24 features, 19 were second-order texture features and 5 were first-order statistics. By imaging sequence, 13 came from DWI, 9 from T2WI, and only 2 from ADC. The top-ranked feature was a T2WI-derived wavelet-LLL NGTDM Coarseness, indicating that texture patterns captured through wavelet transformation of T2-weighted images shared the most mutual information with risk group membership.

TL;DR: 2,553 radiomic features were extracted from T2WI, DWI, and ADC sequences (including wavelet transforms). Mutual information reduced this to 24 key features, dominated by second-order texture features (19 of 24) and DWI-derived features (13 of 24). The top-ranked feature was a T2WI wavelet-LLL NGTDM Coarseness metric.
Page 4
Machine Learning Model Building and Validation

Five ML classifiers: Based on the 24 selected radiomic features, five traditional machine learning algorithms were used to build predictive models: logistic regression (LR), random forest (RF), gradient boosting decision tree (GBDT), k-nearest neighbour (KNN), and support vector machine (SVM). Each model was trained to perform three-class classification, predicting whether a patient falls into the low-, intermediate-, or high-risk category.

Addressing class imbalance: The dataset was significantly imbalanced, with only 16 low-risk patients compared to 65 intermediate and 132 high-risk patients. To mitigate this, the authors applied the Synthetic Minority Oversampling Technique (SMOTE), which creates synthetic minority-class samples by interpolating between an existing minority sample and its nearest minority-class neighbours in feature space. This is a well-established approach for discouraging ML models from simply predicting the majority class.
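
The core idea of SMOTE can be sketched in a few lines. This is a simplified illustration of the general technique, not the implementation the authors used: the function name `smote_sketch`, the neighbour count `k`, and the toy points are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority class in a two-feature space (made-up values).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=10)
print(len(new_points))  # 10
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region already occupied by that class.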

Training and validation: The cohort was randomly split into 80% for training and 20% for testing. A five-fold cross-validation approach was applied to validate model performance. Multilesion cases were strictly constrained so that all lesions from a single patient appeared in either the training or the validation set, never both. This is an important methodological detail that prevents data leakage.
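
The patient-level constraint can be illustrated by assigning folds to patients rather than to lesions. This is a minimal sketch with hypothetical patient identifiers and a hypothetical function name; how the authors implemented the constraint is not described.

```python
import random


def patient_level_folds(lesion_patient_ids, n_folds=5, seed=0):
    """Return one fold index per lesion so that a patient's lesions share a fold."""
    patients = sorted(set(lesion_patient_ids))
    random.Random(seed).shuffle(patients)
    # Folds are dealt out to patients round-robin, never to individual lesions.
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(patients)}
    return [fold_of_patient[pid] for pid in lesion_patient_ids]


# One entry per lesion; "p1" and "p3" are multilesion patients (made-up IDs).
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6"]
folds = patient_level_folds(ids)
print(len(folds))  # 9
```

scikit-learn's `GroupKFold` provides comparable behaviour when patient IDs are passed as the `groups` argument.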

Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), along with accuracy, precision, and recall. The mean ADC value, a commonly used traditional imaging marker for PCa aggressiveness, served as the baseline comparator. Statistical analyses were performed using Python and IBM SPSS 25.0, with p less than 0.05 considered significant.
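
For reference, AUC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below applies this one class at a time (one-vs-rest). That multi-class scheme is an assumption, since the summary does not say how the three-class AUC was obtained, and the function names are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, classes):
    """Per-class AUC: each class in turn is treated as positive, the rest as negative."""
    return {
        c: binary_auc([row[i] for row in prob_rows], [1 if y == c else 0 for y in labels])
        for i, c in enumerate(classes)
    }


print(binary_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```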

TL;DR: Five ML models (LR, RF, GBDT, KNN, SVM) were trained on 24 radiomic features with an 80/20 train-test split. SMOTE addressed class imbalance (16 low vs. 132 high-risk patients). Five-fold cross-validation was used, and patient-level constraints prevented data leakage in multilesion cases.
Page 5
Overall Classification Performance Across Five Models

All five ML models achieved AUC values above 0.6, ranging from 0.65 to 0.87, confirming that the selected radiomic features carry meaningful discriminative information for PCa risk stratification. However, the models varied substantially in their predictive accuracy.

Random forest dominated: The RF model achieved the best overall performance with an AUC of 0.87, followed closely by GBDT at 0.85 and then LR at 0.75. KNN and SVM lagged behind with AUCs of 0.70 and 0.65, respectively. In terms of accuracy, precision, and recall, the RF model scored 0.79, 0.78, and 0.79, which were consistently the highest across all five models. The SVM model performed worst, with accuracy and recall of just 0.62.

Outperforming traditional markers: Critically, the RF model (AUC = 0.87) outperformed the traditional mean ADC value as a standalone predictor of PCa risk (AUC = 0.82). This is significant because ADC values have been widely used in clinical practice as a quantitative MRI marker for tumour aggressiveness, and the radiomic-based ML approach provided a measurable improvement of 0.05 in AUC over this established baseline.

The strong performance of ensemble methods (RF and GBDT) relative to single-model approaches such as SVM and KNN suggests that the relationship between radiomic features and PCa risk may involve nonlinear feature interactions. Ensemble techniques combine multiple individual learners, which tends to improve generalisation and is well suited to capturing complex feature interactions in high-dimensional radiomic data.

TL;DR: RF achieved the best AUC of 0.87 (accuracy 0.79, precision 0.78, recall 0.79), followed by GBDT at 0.85. Both ensemble methods outperformed the traditional mean ADC value (AUC = 0.82). SVM performed worst at AUC = 0.65.
Pages 5-6
Subgroup Analysis: RF Performance by Risk Category

The authors conducted a subgroup analysis using the top-performing RF model to evaluate how well it classified each risk category. AUC values were high across all three subgroups, ranging from 0.83 to 0.89, although sensitivity and specificity varied considerably between groups.

High-risk group: The RF model performed best for high-risk patients, achieving an AUC of 0.89, with precision of 0.81, recall (sensitivity) of 0.94, and an F1-score of 0.87. This pattern is clinically understandable, as high-risk tumours tend to have more distinctive radiomic signatures on MRI. Specificity was lower at 0.67, however, meaning some non-high-risk patients were incorrectly classified as high risk.

Intermediate-risk group: Performance was solid, with precision of 0.79, recall (sensitivity) of 0.61, F1-score of 0.69, and a notably high specificity of 0.91. The lower recall indicates that some intermediate-risk patients were misclassified into other groups, which reflects the inherent difficulty of this "middle" category where features overlap with both low- and high-risk disease.

Low-risk group: The low-risk subgroup had the weakest recall (sensitivity) at 0.33, with precision of 0.50 and an F1-score of 0.40. However, specificity was excellent at 0.98. The poor sensitivity is partly attributable to the small sample size of only 16 low-risk patients, making it difficult for the model to learn robust patterns for this class even with SMOTE oversampling.
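
The reported F1-scores follow directly from the reported precision and recall, as the short check below shows. The second function illustrates how one-vs-rest sensitivity and specificity are read from a three-class confusion matrix. The matrix in the example is made up for illustration and is not the study's data, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported for the RF model.
reported = {"high": (0.81, 0.94), "intermediate": (0.79, 0.61), "low": (0.50, 0.33)}
f1_by_group = {g: round(f1_score(p, r), 2) for g, (p, r) in reported.items()}
print(f1_by_group)  # {'high': 0.87, 'intermediate': 0.69, 'low': 0.4}


def one_vs_rest_rates(confusion, k):
    """Sensitivity and specificity for class k (rows = true class, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# Hypothetical confusion matrix, class order: low, intermediate, high.
toy_confusion = [
    [2, 1, 0],
    [1, 6, 3],
    [0, 2, 10],
]
rates = one_vs_rest_rates(toy_confusion, 2)  # index 2 = high-risk class
```

Recall and sensitivity are the same quantity, which is why the two always coincide in the figures above.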

TL;DR: RF subgroup AUCs ranged from 0.83 to 0.89. High-risk classification was strongest (AUC 0.89, sensitivity 0.94, specificity 0.67). Intermediate-risk achieved 0.91 specificity but only 0.61 sensitivity. Low-risk suffered from small sample size (n=16), yielding just 0.33 sensitivity but 0.98 specificity.
Pages 6-7
Why DWI and T2WI Features Dominated, and What It Means

A key finding was that the most informative radiomic features came primarily from DWI (13 of 24) and T2WI (9 of 24), with ADC contributing only 2 features. This contrasts with a previous study by Ahmad et al., who found that first-order ADC statistics and Gabor/Haralick features were most important. The authors attribute this difference to their three-class grouping strategy versus the two-class approach (low vs. combined intermediate/high) used in earlier work.

Biological rationale: DWI features primarily reflect tumour heterogeneity through the diffusion restriction patterns of water molecules within tissue. T2WI features, by contrast, are better at depicting the zonal anatomy of the prostate and the tumour's boundary characteristics. When the intermediate-risk group is analysed as a separate class rather than being merged with high-risk, the complexity of distinguishing features increases. T stage, which describes tumour extent and invasion borders, becomes particularly relevant, and T2WI excels at capturing these anatomical details.

The dominance of second-order features: 19 of the 24 selected features were second-order texture features, consistent with findings by Varghese et al. Second-order features quantify the spatial relationships between adjacent voxels' signal intensities, capturing complex patterns such as grey level co-occurrence and run-length distributions. These texture patterns may reflect inherently aggressive tumour biology that is not visible in simpler first-order statistics like mean or standard deviation.
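
A toy example shows why second-order features can separate images that first-order statistics cannot. The sketch below builds a grey level co-occurrence matrix for horizontally adjacent pixels and computes its contrast. It is a simplified illustration, not the extraction software's implementation, and the two 2 × 4 "images" are made up.

```python
from collections import Counter


def glcm(image, offset=(0, 1), symmetric=True):
    """Normalised co-occurrence probabilities for pixel pairs separated by `offset`."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
                if symmetric:
                    counts[(image[r2][c2], image[r][c])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def glcm_contrast(p):
    """Contrast: expected squared grey-level difference between neighbouring pixels."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())


# Same histogram (four 0s, four 1s), hence the same mean and standard deviation.
stripes = [[0, 0, 1, 1], [0, 0, 1, 1]]
checker = [[0, 1, 0, 1], [1, 0, 1, 0]]

print(glcm_contrast(glcm(checker)))  # 1.0
print(glcm_contrast(glcm(stripes)) < glcm_contrast(glcm(checker)))  # True
```

Both toy images have identical first-order statistics, yet their GLCM contrast differs by a factor of three, which is the kind of spatial information texture features add.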

Nearly all selected features (23 of 24) were derived from wavelet-transformed images rather than original images. Wavelet transforms decompose the signal at multiple scales and frequencies, preserving the full information content while localising high-frequency detail precisely (in time for a one-dimensional signal, in space for an image). This multi-scale analysis appears crucial for capturing the subtle textural differences between risk groups.
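
A one-dimensional, single-level Haar transform illustrates the idea of splitting a signal into a low-frequency approximation and a high-frequency detail band without losing information. This is only a sketch: the wavelet family used in the study is not stated in this summary, Haar is assumed here for simplicity, and the radiomic features themselves come from three-dimensional decompositions (for example, "LLL" denotes low-pass filtering along all three axes).

```python
import math


def haar_step(signal):
    """Split an even-length signal into approximation (low-pass) and detail (high-pass)."""
    s = math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original signal from the two bands."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
approx, detail = haar_step(signal)
rebuilt = haar_inverse(approx, detail)

# The detail coefficient is zero where neighbouring samples are equal (the 8.0, 8.0 pair).
print(detail[2])  # 0.0
```

Applying `haar_step` again to `approx` would give the next, coarser scale, which is how a multi-scale decomposition is built up.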

TL;DR: DWI features (reflecting tumour heterogeneity) and T2WI features (reflecting tumour boundaries) were most discriminative. Second-order texture features dominated (19 of 24), and wavelet transforms were critical, with 23 of 24 features coming from wavelet-processed images rather than originals.
Pages 7-8
Study Limitations and the Path Toward Clinical Translation

Manual segmentation: All tumour ROIs were manually delineated by two radiologists, which is time-consuming and introduces potential inter-observer variability. The authors acknowledge that semi-automatic or fully automatic segmentation methods should be explored in future work to improve reproducibility and clinical scalability.

Biopsy-only pathological validation: All risk stratification was based on biopsy-proven pathology, without further validation against radical prostatectomy specimens. Prostatectomy provides the definitive whole-gland pathological assessment, and biopsy alone can underestimate Gleason grade due to sampling limitations. This means the "ground truth" labels used for model training may themselves contain some inaccuracy.

Multifocal disease limitations: For patients with multifocal lesions, the authors did not have biopsy confirmation for every individual focus. Instead, only a per-patient level risk categorisation was assigned. This means the model may have been trained on ROIs that included lesions of varying grades without lesion-specific pathological correlation.

Single-centre and retrospective design: The study was conducted at a single institution with a retrospective design, which limits generalisability. The small low-risk cohort (n=16) is a particular concern, as it restricts the model's ability to learn robust patterns for low-risk disease. External validation in multicentre, prospective studies is needed to confirm these findings. Additionally, the value of comparing single-sequence versus multi-sequence radiomic approaches for risk stratification was not fully explored and warrants further investigation. Despite these limitations, the study provides proof of concept that an RF-based radiomic model can non-invasively characterise PCa risk, potentially reducing unnecessary biopsies and enabling more tailored treatment strategies.

TL;DR: Key limitations include manual ROI segmentation, biopsy-only pathological validation (no prostatectomy confirmation), small low-risk sample (n=16), single-centre retrospective design, and lack of lesion-level pathology for multifocal cases. External, multicentre, prospective validation is the critical next step.
Citation: Shu X, Liu Y, Qiao X, et al. 2023. Available at: PMC9975368. DOI: 10.1259/bjr.20220238. License: Open Access.