Artificial Intelligence for Prediction of Endometrial Intraepithelial Neoplasia and Endometrial Cancer Risks in Pre- and Postmenopausal Women

AJOG Global Reports, 2022

Plain-English Explanations

1. Why This Study Matters: The Endometrial Cancer Screening Gap

Endometrial cancer is the most common gynecologic malignancy, affecting approximately 25.7 women per 100,000 each year. Despite advances in diagnostics and treatment, the mortality rate for this cancer has increased by 21% over the past two decades. The incidence is rising not only in industrialized nations but also in Asian countries and lower-middle-income regions. In Japan, endometrial cancer surpassed cervical cancer in morbidity by 2007, reaching over 13,600 cases by 2012.

The screening problem: Unlike cervical cancer, there is no established screening test for endometrial cancer. The current approach relies entirely on patients recognizing their own symptoms, such as abnormal bleeding, and reporting them to a clinician. This system creates a significant equity gap. Studies show that Black women are less likely than White women to recognize postmenopausal bleeding as a warning sign, and this disparity is associated with a higher five-year mortality rate. Women in rural areas and minority communities face additional barriers, including inadequate medical access, incomplete diagnosis, and delayed treatment.

The diagnostic gray zone: Even when symptoms are reported, clinicians face challenges. As women approach menopause, distinguishing physiological bleeding from abnormal premenopausal bleeding and postmenopausal bleeding becomes increasingly difficult. Different guidelines also use inconsistent cutoff values for endometrial thickness to decide when a biopsy should be performed, adding further variability to diagnostic decisions.

The AI opportunity: The authors note that while AI is growing rapidly in cancer diagnosis and risk prediction, it remains vastly understudied for endometrial cancer. Of 13 published studies on AI and endometrial cancer, only one had analyzed demographic data for prediction, and that study was limited to postmenopausal women with bleeding and endometrial thickness above 5 mm. This paper set out to fill that gap by testing multiple machine learning methods on a broader patient population that includes both pre- and postmenopausal women.

TL;DR: Endometrial cancer mortality has risen 21% in two decades despite better treatments, largely because there is no screening test and detection depends on patients recognizing symptoms. This study applied AI to predict endometrial cancer risk across both pre- and postmenopausal women, addressing a major gap in existing research.

2. Study Design: 564 Patients, Nine Features, Six Machine Learning Models

Patient cohort: The study was conducted at the Division of Gynecologic Oncology, Suleyman Demirel University, Turkey. Data were collected from 564 consecutive patients aged 35 or older, enrolled between January 2015 and May 2022. Patients with Lynch syndrome, those on hormone replacement therapy or selective estrogen receptor modulators, and those with a history of fertility-preserving endometrial cancer treatment were excluded.

Nine clinical features were collected: age, menopause status, premenopausal abnormal bleeding, postmenopausal bleeding, obesity, hypertension, diabetes mellitus, smoking history, endometrial thickness, and history of breast cancer. All postmenopausal women with bleeding underwent endometrial sampling, as did asymptomatic postmenopausal women with an endometrial thickness of at least 3 mm. Premenopausal women received biopsies if they had abnormal uterine bleeding or suspected endometrial lesions.

Outcome classification: Histopathological diagnoses from biopsies or hysterectomy specimens were classified according to 2014 World Health Organization guidelines. Benign lesions and hyperplasia without atypia were grouped as "benign," while atypical endometrial hyperplasia, endometrial intraepithelial neoplasia (EIN), and carcinoma were grouped as "precancerous" (a label that, despite its name, also covers carcinoma). The primary target was the highest histopathological diagnosis, with hysterectomy as a secondary outcome.

Machine learning pipeline: Six classification algorithms were tested: Random Forest (RF), Logistic Regression (LR), Multilayer Perceptron (MLP), CatBoost, XGBoost, and Naive Bayes. Data were split 80/20 into training (451 patients) and internal validation (113 patients) sets. Feature selection used the Boruta algorithm, a wrapper built around Random Forest that estimates feature importance. Because only 7.9% of cases were precancerous or cancerous, the Synthetic Minority Oversampling Technique (SMOTE) was applied to balance the training data. All models were tuned using 5-fold cross-validation.
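
The oversampling step can be illustrated with a minimal sketch of the idea behind SMOTE: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority-class neighbours. This is not the authors' code. The function name `smote`, its parameters, and the sample values are assumptions made for illustration, and real use would normally scale the features first and rely on an established library implementation.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of feature tuples (at least two, ideally on comparable scales).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Hypothetical, made-up minority samples: (age, BMI, endometrial thickness in mm)
minority = [(58, 34.0, 14.0), (63, 31.5, 18.0), (49, 38.2, 11.0), (70, 29.0, 21.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation, it always lies within the range of the real minority samples. Oversampling of this kind is applied to the training data only, as described above, so the validation set keeps its original class balance.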

TL;DR: The study enrolled 564 patients aged 35+, collected nine clinical features, and tested six ML algorithms (RF, LR, MLP, CatBoost, XGBoost, Naive Bayes). Data were split 80/20, SMOTE corrected class imbalance (only 7.9% precancerous cases), and 5-fold cross-validation was used for hyperparameter tuning.

3. Feature Selection: AI Drops Symptoms, Keeps Age, BMI, and Endometrial Thickness

The Boruta algorithm selected just 3 of 9 features as important for predicting precancerous and cancerous endometrial disease: age, body mass index (BMI), and endometrial thickness. This was one of the most striking findings of the study. Despite being offered information about menopause status, abnormal bleeding, postmenopausal bleeding, hypertension, diabetes, smoking, and breast cancer history, the feature-selection step discarded all of these, and the model built its predictions from three simple clinical measurements.
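
The core idea of Boruta can be sketched as follows: each real feature is compared against "shadow" features, which are shuffled copies that cannot carry real signal, and a feature is kept only if it repeatedly scores higher than the best shadow. The sketch below is a simplification under stated assumptions. It uses absolute Pearson correlation as a stand-in for Random Forest importance and a fixed hit-rate threshold instead of Boruta's statistical test, and the data, the function names, and the `min_hit_rate` parameter are invented for illustration.

```python
import math
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in for Random Forest importance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def shadow_selection(features, target, n_rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled (shadow) feature in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shuffling a column destroys any link between it and the target
        shadow_scores = []
        for column in features.values():
            shuffled = column[:]
            rng.shuffle(shuffled)
            shadow_scores.append(abs_corr(shuffled, target))
        best_shadow = max(shadow_scores)
        for name, column in features.items():
            if abs_corr(column, target) > best_shadow:
                hits[name] += 1
    return [name for name, h in hits.items() if h / n_rounds >= min_hit_rate]

# Synthetic data (not from the study): the outcome depends on thickness only
rng = random.Random(1)
n = 300
thickness = [rng.uniform(2, 25) for _ in range(n)]
noise = [rng.random() for _ in range(n)]
target = [1.0 if t + rng.gauss(0, 3) > 15 else 0.0 for t in thickness]
selected = shadow_selection({"thickness": thickness, "noise": noise}, target)
```

In this toy setup the informative feature is retained because it beats every shadow in every round, while a pure-noise feature usually fails to do so.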

Why this matters clinically: The features dropped by the AI are exactly the ones that create barriers in the current screening approach. Recognizing symptoms, understanding menopausal status, and distinguishing physiological from pathological bleeding all require a level of health literacy that varies across populations. Age and BMI, by contrast, are two basic pieces of information that any patient can provide regardless of educational background or social class. Endometrial thickness requires an ultrasound measurement, but it is a straightforward, objective value.

Feature coefficients: The study generated a feature importance chart (Figure 1) showing the relative contribution of each of the three selected features. All three were positively associated with the risk of developing EIN or endometrial cancer. The fact that the AI independently identified these factors aligns with decades of epidemiological research establishing age, obesity, and endometrial thickening as key risk factors, but now packages them into a predictive model rather than a list of isolated risk indicators.

TL;DR: The Boruta feature selection algorithm chose only 3 of 9 features: age, BMI, and endometrial thickness. The AI dropped symptoms and menopause status entirely, selecting instead the simplest, most universally accessible patient data points for its predictive model.

4. Model Performance: MLP Achieves 0.94 AUC for Cancer Prediction

Multilayer Perceptron (MLP) was the top performer for predicting precancerous and cancerous endometrial disease. After hyperparameter fine-tuning, MLP achieved an area under the receiver operating characteristic curve (AUC) of 0.938 in the internal validation cohort. An AUC of 0.94 indicates strong discriminative ability: a randomly chosen patient with precancerous or cancerous disease receives a higher risk score than a randomly chosen patient without it about 94% of the time. AUC does not depend on the decision threshold, however, so it does not by itself say how many cases are detected at the threshold actually used.
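
One way to read an AUC value is as a ranking statistic: the fraction of (diseased, benign) patient pairs in which the diseased patient receives the higher score. A minimal sketch of that definition follows. The scores are made up for illustration and are not from the study, and `auc_by_pairs` is a hypothetical helper name.

```python
def auc_by_pairs(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up model scores, for illustration only
diseased = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2, 0.1]
auc = auc_by_pairs(diseased, benign)  # 11 of the 12 pairs are ordered correctly
```

Because this quantity only looks at how scores are ordered, a model can have a high AUC and still miss many cases once a specific cutoff is applied.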

Detailed performance metrics: The MLP model achieved an overall accuracy of 0.94 on the 113-patient test set. Precision (positive predictive value) was 0.71, meaning that when the model flagged a patient as having precancerous disease, it was correct 71% of the time. Recall (sensitivity) was 0.50, meaning it correctly identified half of all actual precancerous cases. The F1 score, which balances precision and recall, was 0.59. The authors note that the F1 score is particularly valuable here because the 7.9% disease prevalence creates an uneven class distribution where accuracy alone can be misleading.
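
To see how these four numbers relate, the sketch below computes them from a confusion matrix. The counts are an assumption: they are one set of values on 113 patients that reproduces the reported metrics after rounding, not figures taken from the paper. The comparison with a trivial "always benign" rule shows why accuracy alone can mislead when one class is rare.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# ASSUMPTION: illustrative counts chosen to match the reported metrics on
# 113 patients; the study's actual confusion matrix is not given here.
tp, fp, fn, tn = 5, 2, 5, 101
precision, recall, f1, accuracy = metrics(tp, fp, fn, tn)

# Accuracy of a rule that labels every patient as benign, on the same counts
baseline_accuracy = (tn + fp) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.71 0.5 0.59 0.94
print(round(baseline_accuracy, 2))
# prints: 0.91
```

Under these assumed counts, a rule that never flags anyone would already reach about 0.91 accuracy while detecting no disease at all, which is why precision, recall, and F1 are the more informative figures here.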

Model comparison: Random Forest achieved the highest AUC for predicting hysterectomy as a treatment outcome, but this prediction task was far less successful overall. The AUC for predicting hysterectomy was only 0.53, essentially no better than a coin flip. This suggests that the decision to perform hysterectomy involves clinical factors beyond what the three selected features can capture.

Ablation test: When endometrial thickness was excluded from the model and only age and BMI were used, precision dropped from 0.71 to 0.14, recall dropped from 0.50 to 0.25, and the F1 score fell from 0.59 to 0.18. This dramatic decline demonstrates that endometrial thickness is by far the most critical feature in the model and that age and BMI alone, while associated with risk, are insufficient for reliable prediction.
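
The F1 values in the ablation test follow directly from the reported precision and recall, since F1 is their harmonic mean. A short check, using only the figures quoted above (the helper name `f1_score` is chosen here for illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

full_model = f1_score(0.71, 0.50)         # age + BMI + endometrial thickness
without_thickness = f1_score(0.14, 0.25)  # age + BMI only
```

Rounded to two decimals, these give 0.59 and 0.18, matching the reported scores. The harmonic mean is pulled toward the smaller of its two inputs, so the collapse in precision to 0.14 dominates the ablated result.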

TL;DR: The Multilayer Perceptron achieved an AUC of 0.938 and 0.94 accuracy for predicting precancerous/cancerous disease using just age, BMI, and endometrial thickness. Precision was 0.71, recall 0.50, and F1 score 0.59. Removing endometrial thickness caused the F1 score to collapse from 0.59 to 0.18.

5. Health Equity Implications: AI That Does Not Require Symptom Recognition

The model's independence from symptoms is its most important clinical feature. Traditional endometrial cancer detection depends on women recognizing and reporting symptoms like postmenopausal bleeding or abnormal uterine bleeding. Research has shown that Black women are less likely to recognize postmenopausal bleeding as a warning sign, contributing to higher five-year mortality rates compared to White women. Women in rural areas and minority communities also face barriers to adequate medical care and complete diagnosis.

A self-monitoring pathway: Because the AI model requires only age, BMI, and endometrial thickness, the authors propose that age and BMI could function as a self-monitoring system. Women could input these two values into a smartphone app to get an initial risk assessment. Those flagged as higher risk could then be directed to obtain an ultrasound measurement of endometrial thickness. The authors envision that new community health plans could place at least one sonographer in underserved areas to provide this measurement.

Patient engagement and prevention: The authors argue that having patients provide their own input data promotes engagement and motivation. When women understand their personal risk level, they may be more likely to pursue protective interventions such as weight loss, dietary changes, or bariatric surgery. Digital tools and AI can further promote healthy behaviors through gamification and personalized risk communication.

This approach could be particularly impactful in lower-middle-income countries, where endometrial cancer incidence is rising but access to specialists remains limited. By shifting the initial risk assessment from symptom-dependent clinical evaluation to a simple, data-driven model, the study proposes a path toward more equitable cancer detection across diverse populations.

TL;DR: Because the model does not require symptom recognition or menopausal status, it could reduce health disparities for Black women, minority groups, and rural populations who face barriers in the current screening approach. The authors propose a smartphone-based self-monitoring system using age and BMI, with ultrasound follow-up for high-risk patients.

6. Methodological Context: How This Study Compares to Prior Work

Only one prior study had attempted demographic-based AI prediction of endometrial cancer. That study (Pergialiotis et al.) included only 178 women, all of whom were postmenopausal with bleeding and an endometrial thickness greater than 5 mm. Among those 178 women, 106 had endometrial cancer, an unusually high prevalence that likely inflated model performance and limits generalizability. By contrast, the present study enrolled 564 patients across both pre- and postmenopausal women, with a disease prevalence of 7.9%, which is consistent with the 8% to 10% rates reported in the broader literature.
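
One way in which prevalence can affect reported performance is through positive predictive value (PPV): with sensitivity and specificity held fixed, PPV rises with the share of patients who have the disease. The sketch below compares the two study prevalences under that relationship. The sensitivity and specificity values are assumptions chosen purely for illustration and are not results from either study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

prior_prevalence = 106 / 178   # prior study: 106 cancers among 178 women
present_prevalence = 0.079     # present study

# ASSUMPTION: illustrative test characteristics, not taken from either study
sens, spec = 0.80, 0.90

ppv_prior = ppv(sens, spec, prior_prevalence)
ppv_present = ppv(sens, spec, present_prevalence)
```

With identical test characteristics, the same model would look far more precise in a cohort where roughly 60% of patients have cancer than in one where about 8% do, which is one reason results from a highly enriched cohort may not carry over to a general clinic population.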

Biopsy methodology matters: The authors highlight an important methodological distinction: their study used dilation and curettage (D&C) or high-pressure vacuum biopsy for all patients, rather than the pipelle biopsy used in some prior studies. Pipelle biopsy carries a reported 10% false-negative rate, which can introduce noise into the diagnostic labels that AI models learn from. By using D&C as the primary method, this study aimed to provide more reliable ground-truth labels for training.

AI vs. traditional statistics: The authors note that their goal was not to identify risk factors through conventional multivariate or logistic regression analysis. The known risk factors (age, obesity, hypertension, smoking, diabetes) were included in the feature set based on established epidemiological evidence. Instead, the study's contribution was building a predictive model that combines these factors into a practical tool. The AI approach also provided metrics such as precision and F1 score that are not standard in traditional diagnostic test evaluation, offering a more nuanced view of model performance in the context of imbalanced class distributions.

TL;DR: This study improves on the only prior demographic-based AI study by using a larger cohort (564 vs. 178 patients), including both pre- and postmenopausal women, reflecting real-world disease prevalence (7.9%), and employing D&C rather than pipelle biopsy for more reliable ground-truth labels.

7. Limitations and Future Directions: From Single-Center Study to Mobile AI

No external validation: The most significant limitation is that the model was validated only internally, on 113 patients from the same institution. Without testing on data from different hospitals, populations, and clinical settings, it is unknown how well the model would generalize. The authors acknowledge that a larger database would strengthen their findings. To analyze all potential predictors at an event rate of about 10%, a sample size of at least 1,200 patients would be needed.

Moderate recall: The model's recall of 0.50 means that half of all actual precancerous and cancerous cases were missed. While the F1 score of 0.59 indicates a reasonable balance between false positives and false negatives, a 50% miss rate would be concerning in a clinical screening context. The authors present the F1 score as a fair evaluation metric given the class imbalance, but future work would need to improve sensitivity substantially before deployment as a standalone screening tool.

Treatment prediction failed: The model achieved an AUC of only 0.53 for predicting which patients would undergo hysterectomy. This is essentially random performance, suggesting that surgical treatment decisions depend on factors (such as patient preferences, comorbidities, and disease extent) that go well beyond the three features available to the model.

The vision for mobile AI: Despite these limitations, the authors articulate an ambitious future direction. They propose that AI could eventually be developed to measure endometrial thickness directly from ultrasound images, eliminating the need for an expert sonographer. Such a system, integrated into their prediction model, could be deployed in mobile health units or public AI centers in rural and underserved communities. This would create a complete, low-barrier screening pipeline: a woman provides her age and BMI, receives an AI-assisted ultrasound, and obtains an immediate risk assessment, all without requiring symptom recognition or specialist expertise.

TL;DR: Key limitations include the lack of external validation, a moderate recall of 0.50 (half of precancerous and cancerous cases missed), and failure to predict treatment decisions (AUC 0.53). The authors envision future integration with AI-powered ultrasound measurement to create a fully automated, mobile screening system for underserved communities.
Citation: Erdemoglu E, Serel TA, Karacan E, et al. Open Access, 2023. Available at: PMC9860482. DOI: 10.1016/j.xagr.2022.100154. License: CC BY-NC-ND.