Endometrial cancer (EC) is the most common malignancy of the female genital tract in middle- and high-income countries, with incidence increasing by 132% over the past 30 years. While early-stage EC is often curable, delayed diagnosis significantly worsens outcomes. Ultrasound remains the first-line imaging modality for EC risk assessment, but current methods have well-documented shortcomings. Endometrial thickness (ET) measurement, the standard screening approach, suffers from poor specificity of just 51.5% at the commonly used 5 mm threshold, leading to a high rate of unnecessary invasive procedures such as endometrial biopsy.
Advanced ultrasound techniques have limited utility: Doppler imaging and morphological evaluation of the endometrium have been explored as alternatives. The interrupted endo-myometrial junction sign, for example, achieved an AUROC of only 0.70 with 62% sensitivity and 78% specificity, while Doppler imaging performed slightly better with an AUROC of 0.745, sensitivity of 72.4%, and specificity of 74.4%. However, all of these methods remain heavily operator-dependent, with significant inter- and intra-observer variability that limits widespread clinical adoption.
The AI opportunity: Machine learning and deep learning approaches can extract quantitative features from ultrasound images that are invisible to the human eye, reducing subjectivity and operator variability. Radiomics enables high-throughput extraction of texture, shape, and intensity features from images, while convolutional neural networks (CNNs) learn hierarchical spatial representations directly from raw pixel data. Prior work has shown that combining these two approaches outperforms either one alone in oncological imaging tasks. However, no previous study had applied a deep learning radiomics (DLR) fusion model to predict endometrial cancer risk from ultrasound images.
Study objective: This multicenter study by Li et al., published in BMC Medical Imaging in 2025, aimed to develop and validate an AI-driven diagnostic model that integrates radiomics and deep learning features from transvaginal ultrasound, enhanced by super-resolution image preprocessing, to predict EC risk in postmenopausal women. The study followed the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines and enrolled 1,861 consecutive postmenopausal women from two hospitals in Shanghai between 2019 and 2024.
The study was an observational investigation conducted at two hospitals in Shanghai. Data were collected both retrospectively and prospectively. From an initial pool of 2,545 eligible women who had both ultrasound and endometrial pathology data available, 684 were excluded: 256 had intrauterine devices, 77 had cancers other than EC (including 53 cervical cancers, 11 primary ovarian cancers, 7 breast cancers, and others), 236 lacked endometrial pathology results, and 115 had images that were unavailable or unsuitable for ROI delineation. The final cohort comprised 1,861 women.
Dataset partitioning strategy: The authors employed a rigorous temporal and institutional split to minimize data leakage. Patients retrospectively collected from Hospital One between April 2021 and June 2023 formed the training set. Patients prospectively recruited from the same hospital between July 2023 and April 2024 constituted the internal testing set, providing a temporally distinct but institutionally consistent evaluation cohort. Patients retrospectively collected from Hospital Two between January 2019 and December 2023 comprised the external testing set, assessing generalizability across institutions.
Sample size justification: The authors performed a formal sample size calculation using the Buderer methodology. With expected sensitivity and specificity of 0.90, disease prevalence of 0.14, precision of 0.05, 95% confidence level, and 10% expected dropout rate, the minimum required testing set was 275 patients. Assuming a 4:1 training-to-testing ratio, the total required sample was 1,375, which the actual enrollment of 1,861 exceeded. Patients with endometrial cancer or atypical hyperplasia were categorized as the malignant group, while those with other pathological diagnoses were classified as non-malignant.
Ultrasound acquisition: All examiners had more than 10 years of experience in gynecological ultrasound. Examinations were performed using high-performance equipment (Voluson E10, Mindray R8, HD15, and Voluson E8) following a standardized protocol based on the IETA (International Endometrial Tumor Analysis) consensus statement. This standardization across centers was essential for ensuring reproducible image quality and consistent ROI delineation.
A major challenge in ultrasound-based AI is that ultrasound images often suffer from low resolution, speckle noise, and artifacts that degrade the quality of extracted features. To address this, the authors implemented a three-step standardized preprocessing pipeline applied to all images before feature extraction. First, two gynecologists independently delineated two regions of interest (ROIs) on each image, the endometrium and the uterine corpus, with discrepancies resolved through expert consensus. The ROIs were then cropped to remove irrelevant background information.
Three preprocessing steps: (1) Denoising using the Non-local Means (NLM) algorithm to reduce speckle noise. (2) Normalization by zero-centering pixel intensities based on the global mean and standard deviation across the dataset. (3) Super-resolution using a deep learning-based Super-Resolution Generative Adversarial Network (SRGAN) model to enhance resolution by fourfold. The SRGAN was trained on a separate private dataset of 34,117 uterine ultrasound images from the Voluson E10 machine, with no overlap with the study data, ensuring domain specificity for gynecological ultrasound.
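The three-step pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the NLM parameters are illustrative, and the SRGAN stage is replaced here by a bicubic-upsampling placeholder, since the authors' GAN was trained on a private dataset and is not reproducible from the paper.

```python
# Sketch of the three-step preprocessing pipeline (denoise, normalize,
# 4x super-resolve). Bicubic resize stands in for the SRGAN; parameters
# are illustrative, not the authors' settings.
import numpy as np
from skimage.restoration import denoise_nl_means, estimate_sigma
from skimage.transform import resize

def preprocess(img: np.ndarray, dataset_mean: float, dataset_std: float) -> np.ndarray:
    # Step 1: Non-local Means denoising to suppress speckle noise.
    sigma = float(estimate_sigma(img))
    den = denoise_nl_means(img, patch_size=5, patch_distance=6, h=0.8 * sigma)
    # Step 2: zero-center using dataset-level mean and standard deviation.
    norm = (den - dataset_mean) / dataset_std
    # Step 3: fourfold super-resolution (bicubic stand-in for the SRGAN).
    h, w = norm.shape
    return resize(norm, (4 * h, 4 * w), order=3)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
out = preprocess(img, img.mean(), img.std())
print(out.shape)  # (128, 128)
```

In practice the dataset-level mean and standard deviation would be computed once over the training images, not per image as in this toy call.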
Sensitivity analysis of preprocessing effects: The authors systematically tested each preprocessing step individually and in combination using ResNet-50 as the baseline CNN. Raw images yielded an AUROC of 0.794. Denoising alone improved AUROC to 0.809, but the improvement was not statistically significant (P > 0.05). Normalization alone reached 0.820, also not significant (P > 0.05). Super-resolution alone achieved 0.834, which was statistically significant (P < 0.05). The combined pipeline of all three steps achieved the highest AUROC of 0.853, with a highly significant improvement (P < 0.01). This demonstrated that SRGAN-based super-resolution was the single most impactful preprocessing step, and combining all methods yielded the best results.
Why SRGAN matters for ultrasound: Unlike CT and MRI, ultrasound images are inherently lower resolution with more noise. SRGAN reconstructs high-resolution images from low-resolution inputs by using a generator network that upsamples images and a discriminator network that enforces perceptual quality. The authors chose SRGAN specifically for its demonstrated effectiveness in preserving critical anatomical details in medical images, with superior visual fidelity and structural similarity compared to traditional upsampling methods.
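The generator's upsampling mechanism can be illustrated with a minimal skeleton that achieves 4x upscaling via two sub-pixel convolution (PixelShuffle) stages. This is an illustrative sketch of the idea only; the architecture, depth, and weights used in the study are not reproduced here, and the adversarial discriminator is omitted.

```python
# Minimal SRGAN-style generator skeleton for 4x upscaling. Each upsampling
# block produces 4x the channels, then PixelShuffle(2) trades those channels
# for a 2x gain in spatial resolution. Illustrative only.
import torch
import torch.nn as nn

class TinySRGenerator(nn.Module):
    def __init__(self, channels: int = 1, width: int = 32):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.PReLU())
        self.up = nn.Sequential(
            nn.Conv2d(width, width * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
            nn.Conv2d(width, width * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
        )
        self.tail = nn.Conv2d(width, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.tail(self.up(self.head(x)))

lr = torch.randn(1, 1, 16, 16)      # low-resolution grayscale input
sr = TinySRGenerator()(lr)          # 4x spatial upscaling
print(sr.shape)  # torch.Size([1, 1, 64, 64])
```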
Radiomics feature extraction: Features were extracted using the Pyradiomics package (version 3.0.1) in Python. Images were normalized with a fixed scale factor of 1,000, and intensity discretization used a fixed bin width of 5. No voxel resampling was performed due to the lack of consistent spatial resolution metadata in ultrasound images. Features were extracted from both original images and filtered versions, including wavelet, gradient, and logarithmic transforms. Laplacian of Gaussian (LoG) filtering used sigma = 1.0, chosen to balance noise suppression against over-smoothing of anatomical detail. Extracted feature categories included shape, first-order statistics, and texture features: Gray Level Co-occurrence Matrix (GLCM), Gray Level Size Zone Matrix (GLSZM), Gray Level Run Length Matrix (GLRLM), Neighboring Gray Tone Difference Matrix (NGTDM), and Gray Level Dependence Matrix (GLDM). In total, 1,562 radiomic features were computed per image.
Deep learning feature extraction: Four CNN architectures were evaluated: VGG19, ResNet18, ResNet50, and Inception-v3. All were pretrained on ImageNet and fine-tuned with binary cross-entropy loss, Adam optimizer at learning rate 0.01, batch size of 32, 50 epochs, and dropout rate of 0.5. Images were resized to 224 x 224 pixels. Data augmentation included random rotation (plus or minus 20 degrees), scaling (plus or minus 10%), horizontal and vertical flipping (probability 0.5), and zoom range of 0.8 to 1.2. The average pooling layer nearest to the fully connected layer was used for feature extraction, yielding 2,048 deep learning features per image from each architecture.
Variable selection pipeline: Starting from 3,616 total variables (1,562 radiomic, 2,048 deep learning, and 6 clinical), the authors applied a multi-step dimensionality reduction process. Variables with variance less than 0.1 were removed, eliminating 2,679 features. After MICE (Multiple Imputation by Chained Equations) imputation and data normalization, Variance Inflation Factor (VIF) analysis excluded 793 variables with VIF greater than 5 to address multicollinearity. Finally, LASSO (Least Absolute Shrinkage and Selection Operator) regression with 10-fold cross-validation selected 36 variables for the endometrium-level DLR model: 18 radiomic features, 16 deep learning features, and 2 clinical features. For the uterine-corpus-level ROI model, only 12 variables were selected.
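The three filtering stages can be sketched end to end on synthetic data. Thresholds match those reported (variance < 0.1, VIF > 5, LASSO with 10-fold CV); the data, the iterative VIF-dropping order, and the use of a regression LASSO rather than the authors' exact setup are illustrative assumptions.

```python
# Sketch of the variable-selection pipeline: variance filter, VIF filter
# (VIF_j = 1 / (1 - R^2_j), computed by regressing feature j on the rest),
# then LASSO with 10-fold CV. Synthetic data; illustrative only.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression, LassoCV

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 40))
X[:, 5] = X[:, 4] * 0.98 + rng.normal(scale=0.05, size=200)  # collinear pair
X[:, 6] *= 0.01                                              # near-zero variance
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

# Stage 1: drop features with variance < 0.1.
X1 = VarianceThreshold(threshold=0.1).fit_transform(X)

# Stage 2: iteratively drop the worst feature until all VIF <= 5.
def vif_filter(X, limit=5.0):
    keep = list(range(X.shape[1]))
    while True:
        vifs = []
        for j in range(len(keep)):
            others = [keep[k] for k in range(len(keep)) if k != j]
            r2 = LinearRegression().fit(X[:, others], X[:, keep[j]]) \
                                   .score(X[:, others], X[:, keep[j]])
            vifs.append(1.0 / (1.0 - min(r2, 0.999999)))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= limit:
            return keep
        keep.pop(worst)

kept = vif_filter(X1)

# Stage 3: LASSO with 10-fold CV; nonzero coefficients are the selected set.
lasso = LassoCV(cv=10, random_state=0).fit(X1[:, kept], y)
selected = np.flatnonzero(lasso.coef_)
print(len(kept), len(selected))
```

On this toy data the variance filter removes the near-constant column and the VIF filter removes one of the collinear pair, leaving LASSO to shrink the remaining uninformative coefficients toward zero.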
The authors constructed and compared three distinct model types to isolate the contributions of different feature sets. The Radiomics (R) model used only handcrafted radiomic features related to tissue heterogeneity, shape, and texture. The CNN model used only automatically learned deep features that capture hierarchical spatial patterns from the images. The DLR model integrated both radiomics and deep learning features to test whether their combination could achieve superior classification performance over either approach alone.
Six machine learning classifiers: Each model type was trained using six different ML algorithms: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost). Hyperparameters were tuned using 10-fold cross-validation within the training dataset, with a grid search strategy optimizing for average AUROC. This design resulted in 18 total model configurations per ROI level (3 model types x 6 classifiers), enabling a thorough comparison of feature extraction and classification strategies.
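The tuning protocol for one of the 18 configurations can be sketched as below: a grid search with 10-fold cross-validation, scored on mean AUROC, here for the SVM. The search space is a hypothetical example, since the paper does not report the exact grids.

```python
# Sketch of hyperparameter tuning for one classifier: grid search with
# 10-fold CV optimizing mean AUROC. The grid itself is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the selected feature matrix: 36 variables, ~14% prevalence.
X, y = make_classification(n_samples=300, n_features=36,
                           weights=[0.86, 0.14], random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, grid, cv=10, scoring="roc_auc").fit(X, y)
print(round(search.best_score_, 3), search.best_params_)
```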
Evaluation methodology: Model performance was assessed using bootstrap resampling with 1,000 iterations. In each iteration, random samples from the testing set were drawn with replacement and predictions were generated. Discrimination was measured using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), AUROC, and area under the precision-recall curve (AUPRC). DeLong's test was used to assess whether differences between AUROCs were statistically significant. Calibration was assessed visually with calibration plots and quantitatively with Brier scores. Decision curve analysis quantified the net benefit of the model for potential clinical use.
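The bootstrap procedure for the AUROC confidence interval can be sketched as follows. The percentile method for the 95% CI and the toy scores are assumptions; the paper does not specify how its intervals were derived.

```python
# Sketch of bootstrap evaluation: 1,000 resamples of the test set with
# replacement, AUROC recomputed each time, percentile 95% CI reported.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, aucs = len(y_true), []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)      # resample with replacement
        if len(np.unique(y_true[idx])) < 2:   # AUROC needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), lo, hi

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
score = y * 0.6 + rng.normal(scale=0.4, size=300)  # informative toy scores
auc, lo, hi = bootstrap_auroc(y, score)
print(f"AUROC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```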
Subgroup analyses by ROI level: All models were evaluated using two different ROI levels. The endometrium-level ROI focused specifically on the endometrial tissue. The uterine-corpus-level ROI encompassed a broader area including both the endometrium and the surrounding myometrium. This allowed the authors to assess whether a targeted or broader ROI approach yielded better diagnostic performance.
Best-performing model: Among all DLR model configurations using endometrium-level ROIs, the SVM classifier achieved the highest performance. On the internal testing dataset, the DLR-SVM model reached an AUROC of 0.893 (95% CI: 0.847-0.932), sensitivity of 0.847 (95% CI: 0.692-0.944), specificity of 0.810 (95% CI: 0.717-0.910), PPV of 0.404, NPV of 0.973, and AUPRC of 0.660 (95% CI: 0.532-0.770). Compared to SVM, the AUROC of all other DLR classifiers was significantly lower (all P < 0.05, DeLong test), confirming its superior discriminative ability.
External validation: Consistent performance was observed in the external testing dataset from a different hospital. The DLR-SVM model achieved an AUROC of 0.871 (95% CI: 0.804-0.930), sensitivity of 0.792 (95% CI: 0.622-0.955), specificity of 0.829 (95% CI: 0.644-0.936), and AUPRC of 0.649 (95% CI: 0.501-0.775). Pairwise AUROC comparisons showed that SVM significantly outperformed most other algorithms (P < 0.05), except XGBoost and LR in the external dataset (P > 0.05). Decision curve analysis confirmed a higher net benefit compared to treating all patients or treating none.
DLR outperforms both R and CNN models: The hybrid DLR approach consistently surpassed single-modality models. In the internal testing set, the DLR model (AUROC 0.893) significantly outperformed the R model (AUROC 0.778, P < 0.001) and the CNN model (AUROC 0.828, P < 0.01). In the external testing set, the DLR model (AUROC 0.871) again beat both the R model (AUROC 0.785, P < 0.001) and the CNN model (AUROC 0.798, P < 0.001). The improvement from radiomics-only to the DLR fusion was approximately 11.5 AUROC points internally and 8.6 points externally, highlighting the value of feature integration.
Endometrium-level ROI outperforms uterine-corpus-level ROI: Models built on the focused endometrium-level ROI consistently outperformed those using the broader uterine-corpus-level ROI. This was somewhat counterintuitive since the uterine ROI provides more anatomical information, including the endometrial-myometrial junction. The authors explained that additional structures such as uterine fibroids or adenomyosis likely introduce noise and confounding factors, disrupting feature extraction and reducing classification accuracy.
Variable importance and SHAP analysis: The top three contributing variables in the DLR-SVM model were a deep learning feature (ResNet_1179), a radiomic feature (wavelet_HLL_firstorder_Mean), and a clinical feature (age). SHapley Additive exPlanations (SHAP) analysis was used to provide interpretability for individual predictions. SHAP values showed how each feature pushed the prediction toward or away from an EC diagnosis for each patient, with positive values indicating increased likelihood of EC and negative values suggesting reduced likelihood. A SHAP force plot over the entire internal test dataset visualized the distribution of cumulative feature contributions across all samples.
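The underlying idea can be made concrete with a toy model where exact Shapley values have a closed form. For a linear model, the Shapley value of feature j reduces to w_j * (x_j - mean_j); the brute-force coalition enumeration below reproduces that, while real SHAP tooling (as used in the study) approximates the same quantity for arbitrary models. All values here are hypothetical.

```python
# Exact Shapley attributions for a toy linear model, by enumerating all
# feature coalitions. Demonstrates what SHAP values measure; real SHAP
# libraries approximate this for non-linear models.
import numpy as np
from itertools import combinations
from math import factorial

def shapley_linear(w, b, x, mu):
    n = len(x)
    def v(S):  # coalition value: features in S fixed, the rest at their mean
        return b + sum(w[j] * (x[j] if j in S else mu[j]) for j in range(n))
    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[j] += weight * (v(set(S) | {j}) - v(set(S)))
    return phi

w = np.array([2.0, -1.0, 0.5])   # hypothetical model weights
x = np.array([1.2, 0.3, -0.7])   # one patient's feature values
mu = np.array([1.0, 0.5, 0.0])   # dataset-mean baseline
phi = shapley_linear(w, b=0.1, x=x, mu=mu)
print(np.allclose(phi, w * (x - mu)))  # True
```

A positive phi[j] pushes this patient's prediction above the baseline output, mirroring how positive SHAP values indicated increased likelihood of EC.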
Comparison to the REC score: The Risk of Endometrial Cancer (REC) score, a previous model combining ultrasound-derived features and clinical parameters through logistic regression, achieved an AUROC of only 0.75 (95% CI: 0.70-0.79) with sensitivity of 79% and specificity of 61%. The DLR model substantially surpassed this performance, achieving an AUROC of 0.893 with sensitivity of 84.7% and specificity of 81.0%. The REC score was limited by its reliance on predefined metrics and its inability to capture complex spatial and textural patterns inherent in ultrasound images.
Comparison to prior radiomics-only models: A recently published radiomics-based model reported an AUROC of 0.90 in validation and 0.88 in the test set for EC diagnosis. While the DLR model achieved comparable performance (0.893 and 0.871), the current study differed in two important ways. First, the prior study focused on patients with postmenopausal bleeding, a population with higher pretest probability of malignancy. The current study included a broader cohort of all postmenopausal women, enhancing generalizability for screening scenarios. Second, the current study employed SRGAN-based super-resolution preprocessing, which was not implemented in the prior work.
Computational efficiency: All analyses were performed on a single consumer-grade GPU (NVIDIA RTX 3090), with inference times of approximately 0.2 seconds per image. The model's small file size and low computational requirements make it suitable for deployment via a mobile app (Android Studio) or web application (Docker containers), supporting real-time clinical use on standard devices such as smartphones or desktops without significant computational burden.
Missing clinical risk factors: Key variables such as obesity, nulliparity, and hormone therapy usage were not available in the dataset. These are established risk factors for EC due to their association with increased lifetime exposure to unopposed estrogen, a major contributor to endometrial carcinogenesis. The absence of these variables may have limited the model's ability to capture individualized risk profiles and could have affected prediction accuracy. Including these factors in future iterations could further improve the model's discriminative power, particularly in subgroups where these variables play a dominant role.
Retrospective design and limited subgroup analysis: Due to the retrospective component of the study, clinically relevant factors such as diabetes, polycystic ovary syndrome (PCOS), and detailed hormone therapy history were either not recorded or present in too few cases to allow for meaningful subgroup analyses. This limits the generalizability of the model in specific clinical subpopulations. The inability to evaluate performance across diverse clinical conditions, such as patients with obesity or those undergoing hormone therapy, represents a significant gap that prospective studies will need to address.
Grayscale-only imaging: The analysis was based solely on grayscale transvaginal ultrasound images. While grayscale imaging is routinely used in clinical practice, it does not capture blood flow information available through Doppler imaging. Doppler imaging can assess vascular patterns that may carry additional diagnostic value for EC detection. The authors acknowledged this limitation and reported that they are currently planning a prospective study integrating Doppler imaging and additional clinical and laboratory parameters to further enhance model performance.
Geographic and institutional scope: Both hospitals in the study were located in Shanghai, China. While the two-center design with temporal and institutional data splits enhanced generalizability compared to single-center studies, differences in ultrasound equipment, operator experience, image quality, and annotation consistency between retrospective and prospective data may still introduce bias. The authors implemented a unified imaging protocol and consistent preprocessing pipeline to mitigate these effects, and the model did maintain stable performance across both internal and external testing sets. Nonetheless, future large-scale, multicenter, prospective studies across geographically diverse settings are needed to fully validate the model's clinical impact and broad applicability.