Artificial Intelligence for Clinical Diagnosis and Treatment of Prostate Cancer


Plain-English Explanations
Pages 1-2
Why AI Matters for Prostate Cancer: Clinical Context and Scope

Prostate cancer (PC) is the second most commonly diagnosed cancer in men worldwide and the most common cancer type among men in the United States. According to the WHO's 2020 cancer statistics, there were 1,414,259 cases of PC globally. For 2022, NIH data projected 268,490 new cases and 34,500 deaths in the United States. The mortality rate increases sharply with age, with individuals over 66 accounting for more than 55% of total deaths. Risk factors include age (above 40), race (higher incidence in Black/African American men per the SEER registry), BRCA2 gene mutations, family history, smoking, obesity, and high-fat diets.

Diagnostic data for prostate cancer patients are derived from prostate-specific antigen (PSA) tests, MRI-guided biopsies, genetic biomarkers, and the Gleason grading system. These are used for diagnosis, risk stratification, and patient monitoring. However, recording diagnoses and stratifying risks from these data involves significant subjectivity: different pathologists can arrive at different scores for the same biopsy, and human experts can retain only a limited body of reference cases, which may lead to misclassification of unseen ones.

Artificial intelligence (AI) techniques, including machine learning (ML) and deep learning (DL) algorithms based on artificial neural networks (ANNs), are being applied to extract precise, quantitative representations from clinical data. The authors classify ML into supervised (labelled data), semi-supervised (partially labelled data), unsupervised (unlabelled data), and weakly supervised (noisily or incompletely labelled data) categories. Deep learning extends ML by using multiple hidden layers to extract progressively higher-level features from data without human intervention.

The paper also describes specific neural network architectures relevant to prostate cancer diagnostics. Convolutional neural networks (CNNs) are composed of convolution layers (applying filters to create feature maps), pooling layers (reducing dimensionality to curb overfitting), and fully connected layers (applying transforms and activation functions). Fully convolutional networks (FCNs), such as U-Net, omit the fully connected layers entirely. Dropout is applied in deep architectures to further reduce overfitting and produce less biased predictions.
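To make the building blocks above concrete, here is a minimal NumPy sketch (not from the paper) of a convolution layer producing a feature map and a max-pooling layer downsampling it:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(fmap, size=2):
    """Non-overlapping max pooling: keeps the strongest response per window."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)        # toy 6x6 "slide patch"
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])      # vertical-edge filter
fmap = conv2d(image, edge_kernel)                        # feature map, shape (5, 5)
pooled = maxpool2d(fmap)                                 # downsampled to (2, 2)
```

In a real CNN these layers are stacked many times and the filter weights are learned; the sketch only shows how a feature map is formed and how pooling shrinks it.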

TL;DR: Prostate cancer affects over 1.4 million men globally, with 268,490 new U.S. cases in 2022. Diagnosis relies on PSA, MRI, biopsies, and Gleason grading, all of which suffer from subjectivity. AI and deep learning (CNNs, U-Net, DNNs) can reduce this subjectivity by extracting quantitative features from clinical data more consistently than human assessment.
Pages 3-4
Search Strategy, Article Selection, and Screening Process

The authors followed the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) protocols. They searched PubMed using a comprehensive keyword string combining terms such as "artificial intelligence," "machine learning," "deep neural network," "deep learning," "random forest," "decision tree," "support vector," "computer aided," "gaussian mixture modelling," and "natural language processing," all combined with "prostate cancer." The search was restricted to review and research articles using PubMed's advanced search options.

The search was conducted in three phases. First, keywords were searched in the title field, yielding 41 reviews and 369 research articles. Second, the title/abstract field produced 196 reviews and 2,735 research articles. Third, searching all fields returned 5,025 reviews and 46,326 journal articles. An intersection exercise confirmed consistency: all articles found by title search were also present in the title/abstract results (n1 ∩ n2 = n1), and all title/abstract results were contained in the all-fields results (n2 ∩ n3 = n2). This nesting supported the authors' decision to treat articles with the keywords directly in their titles as the most relevant.
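The phase-nesting check can be expressed directly with set operations; the IDs below are hypothetical placeholders for PubMed records, not the actual search results:

```python
# Hypothetical PubMed IDs standing in for the three search phases.
title_hits = {101, 102, 103}                        # keywords in title (n1)
title_abstract_hits = title_hits | {104, 105}       # title/abstract (n2)
all_field_hits = title_abstract_hits | {106, 107}   # all fields (n3)

# The paper's consistency check: each narrower search is a subset
# of the broader one, so intersecting with it returns it unchanged.
assert title_hits & title_abstract_hits == title_hits          # n1 ∩ n2 = n1
assert title_abstract_hits & all_field_hits == title_abstract_hits  # n2 ∩ n3 = n2
assert title_hits <= title_abstract_hits <= all_field_hits     # full nesting
```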

Exclusion criteria removed opinion letters, commentaries, case reports, and surveys. From the 369 research articles found via title search, 13 commentaries and 1 case report were excluded, leaving 355 articles in the final pool. No review articles fell under the exclusion categories. The selected papers were then evaluated in two phases: first, abstracts and introductions were reviewed for background content; then, the most relevant articles were closely examined for results and conclusions.

The authors also tracked publication trends, noting that the number of ML-based prostate cancer studies increased significantly in recent years, with 2021 showing the highest publication counts for both reviews and journal articles. References cited within the screened articles were also included when relevant.

TL;DR: A PRISMA-guided PubMed search using AI and prostate cancer keyword combinations, conducted in three phases (title, title/abstract, all fields). Title search yielded 41 reviews and 369 research articles. After excluding 14 articles (commentaries and case reports), selected papers were evaluated for direct relevance to AI applications in PC diagnosis and treatment.
Pages 5-7
AI for Gleason Grading and Biopsy Classification

The Gleason grading system is the primary method used in prostate cancer detection and treatment planning, determined by urological pathologists examining prostate biopsies. Grades range from grade group 1 (score 6 or below, well-differentiated, single glands with sharp boundaries) through grade group 5 (scores 9-10, poorly differentiated, solid sheets with no gland formation). The system correlates with pathological variables including radical prostatectomy margin status, PSA levels, tumor volume, and molecular markers. However, human scoring is error-prone due to ink on slides, cutting artifacts, and rare cancer subtypes, and in several cases the Gleason score underestimates disease severity.

Deep learning for Gleason scoring: Nagpal et al. developed a DL model trained on 112 million image patches from 1,226 slides annotated by pathologists, then tested on 331 slides from 331 patients. The mean accuracy of 29 expert urologic pathologists on the validation dataset was 0.61, while the DL algorithm achieved 0.70. This demonstrated both the direct applicability of DL for image classification and the scalability advantage: an algorithm trained on 112 million patches retains information perpetually and improves with more data, unlike human experts with limited memory capacity.

Pathologist-level AI systems: A separate AI system was developed using 5,209 hematoxylin and eosin (H&E)-stained digitized biopsies from 1,243 patients for training, with 550 biopsies for evaluation. Additionally, 160 random samples from the test set were used for manual evaluation to compare the AI system against pathologists. An ML-based cascade approach for differentiating needle biopsies into multiple classes used 698 biopsies from 174 patients for training and 37 biopsies from 21 patients for testing, achieving 100% sensitivity on the validation dataset.

CNN-based classification: CNNs were applied to improve the accuracy of Gleason pattern and grade classification. One study trained on 96 tissue specimens from 38 patients but acknowledged potential accuracy overestimation due to the small dataset. A larger study used 838 digitized biopsies for training and 162 for testing, employing a panel of three pathologists for evaluation; it concluded that AI assistance reduced both analysis time and inconsistency in results. The CNN remains the most appropriate algorithm for image classification, though training is slowed by the computational cost of the max-pooling layers, and large biopsy datasets demand more robust computing systems.

TL;DR: DL models outperformed 29 expert pathologists in Gleason scoring (accuracy 0.70 vs. 0.61) using 112 million image patches. A cascade ML approach achieved 100% sensitivity on 37 test biopsies. CNN-based systems on 838+ biopsy images reduced analysis time and grading inconsistency, though small datasets risk accuracy overestimation.
Pages 7-8
AI for MRI-Guided Prostate Cancer Detection and Classification

MRI is highly effective for differentiating clinically significant prostate cancers from non-significant ones and has emerged as an alternative to transrectal ultrasonography (TRUS)-guided biopsy for directing pathologists to accurate tissue sampling sites. The European Association of Urology (EAU) recommends multiparametric MRI (mpMRI) before biopsy. However, MRI images contain highly granular information that is particularly challenging to interpret, making AI-based automation a natural fit.

AI-assisted detection with PI-RADS: A multi-institutional study implementing AI algorithms improved imaging sensitivity to 78% when used with PI-RADS v2 (Prostate Imaging-Reporting and Data System version 2). In the transition zone specifically, AI-assisted automated detection achieved 84% sensitivity compared to 67% for MRI alone. A cascade deep learning algorithm was developed for enhanced detection and bi-parametric classification of prostate MRIs using PI-RADS scoring. This algorithm used 1,390 samples obtained at 3 Tesla for model training, testing, and validation, and employed a 3D U-Net-based residual network architecture for automated detection and image segmentation.

Radiomics and advanced imaging: Radiomics-based methods using T2-weighted (T2W) images proved more accurate than apparent diffusion coefficient (ADC) alone, though ADC associated with Gleason score was effective for categorizing disease stage. Innovative radiomics using computed high-B value diffusion-weighted imaging (CHB-DWI) and ADC modalities outperformed clinical heuristics-driven methods for prostate cancer detection. Aldoj et al. demonstrated a CNN employing multiple 3D input combinations (ADC, DWI, and T2-weighted images) that achieved an AUC of 0.91, with 81.2% sensitivity and 90.5% specificity, compared against radiologist performance using PI-RADS v2.

Biparametric vs. multiparametric MRI: A CNN was trained and validated on 300 prostate MRIs to decide whether biparametric MRI alone would suffice or whether dynamic contrast-enhanced (DCE) sequences were needed. The method achieved 94.4% sensitivity and 68.8% specificity for assessing DCE necessity. This research may help avoid DCE-MRI when not needed, preventing contrast-agent-induced negative effects. Key CNN architectures mentioned include AlexNet, VGG, ResNet, and GoogLeNet, all widely used for MRI image classification tasks.

TL;DR: AI with PI-RADS improved MRI sensitivity from 67% to 84% in the transition zone. A 3D U-Net residual network on 1,390 bpMRI samples showed strong detection performance. A multi-channel 3D CNN achieved AUC 0.91, 81.2% sensitivity, and 90.5% specificity. A CNN for biparametric vs. multiparametric MRI triage reached 94.4% sensitivity on 300 scans.
Pages 8-9
AI in Transrectal Ultrasound and MRI-TRUS Fusion Biopsies

Transrectal ultrasound (TRUS) biopsy uses a transducer in the patient's rectum to generate high-frequency sound waves that produce sonogram images of the prostate gland and surrounding tissues. Before MRI became widespread, TRUS was the gold standard for guiding prostate cancer needle biopsies. An ANN-based AI system was designed to assist clinicians, trained on variables including digital TRUS images, prostate-specific antigen (PSA) levels, and patient age. Validation demonstrated that the system could efficiently differentiate between malignant and non-malignant prostate tissues.

TRUS limitations and staging accuracy: TRUS has poor detection and staging accuracy due to low-contrast images. However, it offers advantages over MRI in being less costly, more convenient in the office setting, and providing real-time snapshots. TRUS has an overall tumor staging accuracy of 80-95% compared to 75-85% for MRI, except for T4 stage when TRUS only provides anterior tumor images. A significant limitation is TRUS's remarkably high degree of operator dependency.

MRI-TRUS fusion-guided biopsies: Advances in prostate cancer diagnosis have led to MRI-TRUS fusion-guided needle biopsies. The technique is time-consuming and laborious, so AI has been introduced to automate the process and reduce the burden on clinicians. A dataset of 436 TRUS images from 181 men across three institutions (using Aixplorer, iU22, and Pro Focus 2202a scanners) was used to develop a deep learning method for automated segmentation of TRUS images. The model achieved a median accuracy of 98%, a Hausdorff distance of 3.0 mm, and a Jaccard index of 0.93. Pixel-wise accuracies for zonal segmentation of the peripheral and transition zones were 97% and 98%, respectively.
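The segmentation metrics quoted above (Jaccard index, pixel-wise accuracy) are straightforward to compute on binary masks; here is a small NumPy sketch using toy masks, not the study's data:

```python
import numpy as np

def jaccard_index(pred, truth):
    """Jaccard index (intersection over union) between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def pixel_accuracy(pred, truth):
    """Fraction of pixels where the predicted label matches the reference."""
    return (pred == truth).mean()

truth = np.zeros((8, 8), dtype=int)
truth[2:6, 2:6] = 1                  # reference mask: 4x4 square
pred = np.zeros((8, 8), dtype=int)
pred[2:6, 3:7] = 1                   # prediction shifted one pixel right

print(jaccard_index(pred, truth))    # 12 / 20 = 0.6
print(pixel_accuracy(pred, truth))   # 56 / 64 = 0.875
```

The example makes the distinction visible: pixel accuracy rewards the large agreeing background, while the Jaccard index scores only the overlap of the segmented regions, which is why a 0.93 Jaccard is a stronger statement than 98% accuracy.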

AIUSP comparison study: A prospective multi-center trial compared AI ultrasound of the prostate (AIUSP) against TRUS-guided 12-core systematic biopsy and mpMRI. AIUSP detected PC in 49.5% of cases compared to 34.6% for TRUS-guided systematic biopsy and 35.8% for mpMRI. The clinically significant PC detection rate was 32.3% for AIUSP versus 26.3% for TRUS-SB and 23.1% for mpMRI. The overall biopsy core positive rate was 22.7% for AIUSP compared to 11.0% for TRUS-SB and 12.7% for mpMRI, indicating substantially better targeting accuracy.

TL;DR: AI-assisted TRUS segmentation achieved 98% median accuracy and 0.93 Jaccard index on 436 images from 181 men. MRI-TRUS fusion biopsies with AI reached 97-98% pixel-wise zonal accuracy. AIUSP detected clinically significant PC at 32.3% versus 26.3% (TRUS-SB) and 23.1% (mpMRI), with double the biopsy core positive rate.
Pages 9-11
AI in 3D Pathology, Genomics, and Proteomics-Based Detection

3D pathology advances: Traditional prostate cancer care relies on two-dimensional (2D) histopathology using formalin-fixed, paraffin-embedded (FFPE) core-needle biopsies. Gleason grading is based on visual interpretation of only about 1% of the whole biopsy, leading to substantial interobserver variability and only marginal correlation with outcomes, particularly in intermediate-grade cases. Kaneko et al. introduced a pilot AI method for 3D prediction of prostate cancer by integrating multiparametric MR-US image data with fusion biopsy trajectory-proven pathology data. The AI prediction concordance with clinically significant cancer (CSCa) data on robot-assisted radical prostatectomy (RARP) specimens was 83% versus 54% for the radiologist's reading (p = 0.036), and AI-predicted CSCa volumes were more accurate (r = 0.90, p = 0.001).

ITAS3D method: Xie et al. created a method for non-destructive 3D pathology and computational analysis of entire prostate biopsies tagged with a fluorescent equivalent of conventional H&E staining. Their image translation-assisted 3D segmentation (ITAS3D) technique is a generalizable deep learning approach for volumetrically segmenting tissue microstructures in an annotation-free and biomarker-based way. They imaged 300 ex vivo samples from 50 stored radical prostatectomy cases (118 containing malignancy) and demonstrated that 3D glandular characteristics of cancer biopsies outperformed similar 2D features for risk assessment of individuals with low- to intermediate-risk prostate cancer.

Genomics and biomarkers: The ANN technique can be beneficial for assessing genomic biomarkers. Green et al. built an ANN model to validate Ki67 gene expression while comparing it to another candidate, DLX2. Both Ki67 and DLX2 were significant predictors of future metastases in univariate analysis, though only 6.8% of individuals with prostate cancer had substantial Ki67 expression. The most suitable approach is multi-omics, combining genomic, transcriptomic, and metabolomic data in ML algorithms. Genomic deep learning (GDL) uses CNN and LSTM architectures to predict secondary structure, subcellular localization, and peptide binding to MHC Class II molecules from biological sequence data.

Proteomics-driven detection: Kim et al. used targeted proteomics integrated with computational biology to find new potential proteomic markers for prostate cancer, starting with 133 differentially expressed proteins in a cohort of 74 patients tested with synthetic peptides. ML methodology was then used to construct clinical prediction models. A deep learning algorithm using H&E-stained digital slides predicted ERG rearrangement status in prostatic adenocarcinoma with AUC values ranging from 0.82 to 0.85, sensitivity of 0.75, and specificity of 0.83.

TL;DR: AI-based 3D pathology achieved 83% concordance with CSCa data vs. 54% for radiologists (p = 0.036) and volume accuracy of r = 0.90. ITAS3D deep learning on 300 specimens showed 3D features outperform 2D for risk assessment. DL predicted ERG rearrangement with AUC 0.82-0.85. Multi-omics integration with ML is the recommended path forward.
Pages 11-13
AI in CT Scan Detection and Prostate Cancer Treatment Decisions

CT scan-based AI detection: Researchers at RMIT University and St Vincent's Hospital Melbourne developed an AI algorithm to detect early signs of prostate cancer by analyzing routine computed tomography (CT) images. The system was trained to identify disease characteristics in asymptomatic individuals with and without prostate cancer. The algorithm's performance was assessed against professional radiologists using cross-validation on a dataset of 571 CT images of the abdomen and pelvis. The AI algorithm produced better outcomes and identified malignant growths in a matter of seconds. The approach improves with each scanned image, learning to interpret scans from various machines and to detect even the smallest anomalies.

CT limitations: CT scans are not recommended for routine cancer examination because of high radiation doses with potential long-term effects. The contrast between the prostate and surrounding tissues is not strong enough in CT images for easy separation. However, CT scans can help detect prostate cancer's spread in bone tissue and determine whether brachytherapy is needed. The technology may be used to screen for cancer in men whose abdomen or pelvis CT scans were being taken for other complications. False positive predictions can be re-confirmed through other tests, but false negatives could lead to problematic outcomes, making diverse training data essential for reducing false negative rates.

Treatment decision support: In the ProtecT trial, 1,643 men were randomly assigned to radical prostatectomy, radiation, or active monitoring. Cancer-specific mortality was minimal and statistically equivalent across all therapies at a median follow-up of 10 years. Surgery and radiation reduced cancer progression and metastases but brought higher comorbidities affecting quality of life. Auffenberg et al. demonstrated a random forest ML prediction model using 7,543 men diagnosed with prostate cancer. The cohort included 45% treated with radical prostatectomy, 17% with radiotherapy, 30% with surveillance, 6% with androgen deprivation therapy, and 2% with watchful waiting. Data were divided 2:1 into training/test subgroups stratified by location.

The customized model achieved an AUC of 0.81 for treatment classification. The analysis showed that age was the most relevant variable influencing treatment decisions, followed by the number of positive cores and then the Gleason score. The authors note that random forest is less interpretable than a single decision tree but more accurate because it is composed of multiple decision trees. For cases with limited numerical data points, even basic algorithms like random forest produce acceptable predictions, though deep networks should be explored when extensive patient records are available.
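A hedged sketch of this kind of model: a random forest over a handful of clinical variables, with feature importances indicating which predictor drives the decision. The data below are synthetic stand-ins, not the Auffenberg et al. cohort, and the generating rule is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for the three predictors named in the study.
age = rng.normal(65, 8, n)
positive_cores = rng.integers(0, 12, n).astype(float)
gleason = rng.integers(6, 11, n).astype(float)
X = np.column_stack([age, positive_cores, gleason])

# Invented toy rule: aggressive treatment (1) more likely for younger men
# with more positive cores and higher Gleason scores, plus noise.
logit = -0.1 * (age - 65) + 0.3 * positive_cores + 0.1 * (gleason - 6)
y = (logit + rng.normal(0, 1, n) > 1.5).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
for name, imp in zip(["age", "positive_cores", "gleason"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

The `feature_importances_` vector (which sums to 1) is how such studies rank predictors; the tradeoff the authors mention is visible here too, since no single tree in the ensemble explains the final prediction on its own.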

TL;DR: AI on 571 CT images detected malignant growths in seconds, outperforming radiologists. A random forest model on 7,543 prostate cancer patients achieved AUC 0.81 for treatment classification, identifying age as the strongest predictor of treatment choice, followed by positive core count and Gleason score. The ProtecT trial showed equivalent 10-year mortality across surgery, radiation, and monitoring.
Pages 13-15
Benchmark Comparisons: Models, Datasets, and AUC Scores

The review compiled performance benchmarks across multiple studies using different ML/DL methods and datasets. For biparametric MRI (bpMRI) of 1,513 patients with clinical variables (PSA, PSA density, and age), a deep learning algorithm achieved an AUC of 0.86, a random forest model 0.83, a neural network 0.74, a Ctree model 0.74, a support vector machine (SVM) 0.72, while logistic regression on tPSA alone performed worst. On this dataset, deep learning outperformed the traditional ML methods when imaging data were integrated with clinical variables.

Clinical variable models: A study of 551 patients incorporating age, BMI, hypertension, diabetes, total PSA (tPSA), free PSA (fPSA), the fPSA/tPSA ratio, prostate volume (PV), PSA density (PSAD), neutrophil-to-lymphocyte ratio (NLR), and pathology reports showed strong results for tree-based methods. Random forest achieved AUC 0.92, decision tree achieved 0.91, multivariate logistic regression achieved 0.84, and SVM achieved 1.00 (likely overfitting) and 0.88 in separate evaluations. For 315 patients with preoperative T2WI, DWI, and ADC MR images combined with TRUS-guided 12-needle puncture and P504S/P63 immunohistochemistry status, random forest achieved AUC 0.92, gradient boosting decision tree achieved 0.91, and logistic regression, AdaBoost, and K-nearest neighbors each achieved 0.89.

Dense neural networks and radiomics: Among 356 patients undergoing transrectal ultrasound-guided prostate biopsy, a dense neural network achieved AUC of 0.94, substantially outperforming logistic regression (0.80) and decision tree classifier (0.78). For 103 patients with mpMRI scans and PI-RADS V2 scores of 4/5 confirmed by prostatic biopsy, R-logistic achieved AUC 0.93, R-SVM achieved 0.84, and R-AdaBoost achieved 0.73. In 438 men with metastatic prostate cancer, gradient boosting machine models ranged from AUC 0.73 to 0.86 across six different model configurations.
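All of the AUC figures above share one definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the Mann-Whitney formulation). A self-contained sketch with hypothetical scores:

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney U formulation); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores for 4 cancer (1) and 4 benign (0) biopsies.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]
print(auc_score(labels, scores))  # 13 of 16 pairs ranked correctly = 0.8125
```

An AUC of 0.5 means the scores rank cases no better than chance, which is why values in the 0.72-0.94 range reported above are read as meaningful but imperfect discrimination.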

The authors identify two major limitations across these computer-aided detection techniques: small dataset sizes and the absence of federated learning strategies. Federated learning models can enhance data collection and sharing across institutions for research objectives without centralizing sensitive patient data. Increasing the sample size would improve the performance of multilayer DL models by allowing neural networks with more hidden layers and nodes to extract a broader spectrum of feature sets while avoiding early overfitting.
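A minimal sketch of the federated idea the authors advocate, assuming a FedAvg-style weighted average of locally trained model weights; the site weights and sample counts below are invented for illustration:

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """FedAvg-style aggregation: weight each site's model parameters by its
    number of local samples. Only parameters are shared between institutions;
    raw patient data never leaves the site."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(local_weights, sample_counts))

# Hypothetical linear-model weights trained locally at three hospitals.
site_weights = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.3, 0.9])]
site_sizes = [100, 300, 600]  # biopsy counts per site

global_weights = federated_average(site_weights, site_sizes)
print(global_weights)  # [0.32 0.88]
```

In a full system this aggregation step alternates with rounds of local training, but the privacy argument is already visible: the server only ever sees the parameter vectors.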

TL;DR: DL achieved AUC 0.86 on 1,513 bpMRI patients, outperforming random forest (0.83) and SVM (0.72). Dense neural networks reached AUC 0.94 on 356 TRUS biopsy patients. Random forest hit 0.92 on 315 patients with combined MRI and clinical data. The two main limitations identified are small datasets and the lack of federated learning for multi-institutional data sharing.
Pages 15-17
Limitations, Open-Source Tools, and Future Directions

Dataset and generalizability concerns: Many studies reviewed used relatively small datasets (as few as 96 tissue specimens in one CNN study), which risks accuracy overestimation. The CNN method requires robust computing systems for training on large biopsy image sets, and the speed of CNN training is constrained by slower calculations in the maxpool layer. Multiple cross-validations (3-fold or 5-fold) are required to attain higher prediction accuracy, and dataset diversity must be accounted for in ML algorithm development. Deep neural networks behave as black boxes, creating a tradeoff between accuracy and interpretability. Overfitting in fully connected dense networks can be controlled using regularization techniques, but at the cost of higher computing power.
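The k-fold cross-validation mentioned above can be sketched in a few lines; this index-partitioning helper is illustrative, not from the paper:

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k folds; each fold serves once as the
    held-out validation set while the remaining folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

# With 10 samples and 5 folds, every sample is validated exactly once.
for train_idx, val_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(val_idx))  # 8 2, on each of the 5 folds
```

Averaging the metric across the k held-out folds gives a less optimistic estimate than a single split, which is why the review treats 3-fold or 5-fold validation as a minimum on small datasets.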

Clinical integration challenges: AI is not intended to replace human expertise in detecting prostate cancer but to reduce the chance of missing actual positive cases. The risk of false negatives is reported less often because doctors and pathologists also consider laboratory studies, patient history, and other relevant clinical information. AI-assisted diagnostics in PC biopsies can improve outcome quality and reduce cost and time. However, the subjectivity of the Gleason grading system (linked to substantial interobserver variability) remains a fundamental challenge even with AI augmentation, particularly in intermediate-grade prostate cancer cases.

Open-source tools: Several publicly available tools have emerged. The platform prostatecancer.ai provides an AI model in a web browser for computer-assisted detection, diagnosis, and prognosis (code at github.com/Tesseract-MI/prostatecancer.ai). A Hierarchical Probabilistic 3D U-Net for bpMRI-based detection is available at github.com/DIAGNijmegen/prostateMR_3D-CAD-csPCa. The sigminer.prediction tool predicts cancer subtypes based on mutational signatures. A CNN-based gland detection algorithm for digitized biopsies is also open-source.

Future directions: The authors recommend federated learning to enable multi-institutional data sharing while preserving patient privacy, which would increase sample sizes and improve DL model performance. Multi-omics integration (combining genomic, transcriptomic, and metabolomic data) is essential for building strong prediction platforms, as individual biomarkers like Ki67 are expressed in only 6.8% of patients. Data augmentation techniques are needed for medical imaging, particularly PC-MRI scans, to generate datasets that can be optimally used in CNN architectures like VGG. The FDA has authorized AI use in prostate cancer detection, signaling regulatory readiness for broader clinical deployment.

TL;DR: Key limitations include small datasets, black-box interpretability, computing demands, and Gleason grading subjectivity. Several open-source tools exist for prostate cancer AI (prostatecancer.ai, 3D U-Net for bpMRI, sigminer). Future work should prioritize federated learning for multi-institutional data sharing, multi-omics integration, data augmentation, and broader clinical deployment under FDA authorization.
Citation: Rabaan AA, Bakhrebah MA, AlSaihati H, et al. Artificial Intelligence for Clinical Diagnosis and Treatment of Prostate Cancer. Cancers. 2022;14(22):5595. DOI: 10.3390/cancers14225595. PMC9688370. Open access (CC BY).