Endometrial cancer is the fourth most common cancer among women, and while symptoms like abnormal bleeding often appear early enough for a relatively high five-year survival rate of 82%, incidence and mortality rates have been climbing since 2010. It is expected to surpass ovarian cancer as the leading cause of gynecological cancer death. The standard detection method, endometrial biopsy, is invasive. Current American Cancer Society guidelines, unchanged since 2001, recommend screening only for very high-risk women (those with Lynch syndrome or strong familial predisposition to colon cancer). Average-risk and moderately elevated-risk women receive no screening recommendation at all.
Previous attempts to build risk prediction models using traditional epidemiological approaches have produced only moderate accuracy. Pfeiffer et al. (2013) trained a model on the PLCO and NIH-AARP datasets (304,950 women, 1,559 cancer cases) and achieved an AUC of just 0.68. Husing et al. (2016) trained on 201,811 women (855 positive cases) and achieved a somewhat better AUC of 0.77 by explicitly adding interaction terms. Neither model was strong enough to guide population-level screening decisions with confidence.
This study from Yale University asks whether machine learning can do substantially better. The authors trained seven different algorithms on non-invasive personal health data from the PLCO Cancer Screening Trial and then compared the best-performing models head-to-head against 15 practicing physicians. The central result: a random forest model achieved a testing AUC of 0.96, far exceeding both prior epidemiological models and the judgment of gynecologic oncologists and primary care physicians.
The models were built on data from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, a large prospective randomized controlled trial that enrolled participants from November 1993 through July 2001. From this dataset, the authors selected 78,215 female participants aged 55 to 75 for whom endometrial cancer outcomes were known within five years of enrollment. Of these women, 961 (1.2%) developed endometrial cancer and 77,254 (98.8%) did not. This severe class imbalance is typical of cancer screening datasets and poses a challenge for model training.
The input features were entirely non-invasive, requiring no genomic data, imaging, biomarkers, or invasive procedures. The model used: age, BMI, weight at ages 20 and 50 and at enrollment, race, smoking habits, diabetes status, emphysema, stroke, hypertension, heart disease, arthritis, history of another cancer, family history of breast, ovarian, and endometrial cancer, ovarian surgery history, age at menarche, parity (number of pregnancies), use of birth control, and age at menopause. All inputs were normalized to a [0, 1] range.
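The normalization step can be sketched as simple min-max scaling per feature (the text does not specify the exact scheme, so this is an assumption):

```python
# Min-max normalization of one feature column to [0, 1] -- a sketch of the
# preprocessing described in the text; the exact scheme used is an assumption.
def minmax_normalize(column):
    """Scale a list of numeric values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant feature: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages = [55, 60, 75, 62]
print(minmax_normalize(ages))          # smallest value -> 0.0, largest -> 1.0
```

Scaling every input to a common range keeps features with large raw magnitudes (e.g., weight in pounds) from dominating features with small ones (e.g., parity) in distance- and gradient-based learners.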
Many of these features, such as BMI, diabetes, and family history, are well-established risk factors for endometrial cancer and were also used in the Pfeiffer and Husing models. Other features, like smoking habits, emphysema, and heart disease, were included because they had contributed to good performance in the authors' prior work on lung, prostate, and pancreatic cancer prediction. Notably absent are known risk factors such as Lynch syndrome (HNPCC), which was not captured in the PLCO dataset.
The dataset was randomly split 70/30 into training and testing sets, maintaining the same cancer prevalence ratio in both splits. This gives the study a TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) level 2a, meaning the model was validated on a held-out portion of the same dataset but not yet on an external cohort.
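The stratified 70/30 hold-out split can be sketched as follows (a minimal pure-Python illustration, not the authors' MATLAB code):

```python
import random

def stratified_split(labels, test_frac=0.30, seed=0):
    """Split indices into train/test while preserving the class ratio in each
    part, mirroring the stratified hold-out split described in the text."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # randomize within each class
        n_test = round(len(idxs) * test_frac)   # 30% of *this* class to test
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Toy cohort at roughly the PLCO subset's 1.2% prevalence
labels = [1] * 12 + [0] * 988
train, test = stratified_split(labels)
print(len(train), len(test))  # 700 300
```

Splitting each class separately guarantees both partitions see the same ~1.2% prevalence, which a naive random split can badly miss when positives are this rare.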
The authors trained and compared seven different machine learning algorithms, each producing a continuous output between 0 and 1 representing the estimated probability that a woman would develop endometrial cancer within five years. The algorithms were: logistic regression (LR), neural network (NN), support vector machine (SVM), decision tree (DT), random forest (RF), linear discriminant analysis (LDA), and naive Bayes (NB). All were implemented in MATLAB.
Neural network: The NN was a multilayer perceptron with two hidden layers of 12 neurons each and a logistic activation function, built using in-house MATLAB code from the authors' prior cancer prediction work. Logistic regression: The LR was essentially the same NN architecture with zero hidden layers. Support vector machine: The SVM used a Gaussian kernel via MATLAB's fitrsvm function. Random forest: The RF was built with MATLAB's TreeBagger function using 50 trees. Linear discriminant analysis: The LDA was fitted with MATLAB's fitcdiscr using diagonal linear discrimination. Naive Bayes: The NB was fitted with MATLAB's fitcnb with auto-optimized hyperparameters.
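For readers working outside MATLAB, a rough scikit-learn analogue of the seven models might look like the sketch below. This is an assumption for illustration, not the authors' code: only the hyperparameters the text reports (two hidden layers of 12 logistic units, a Gaussian/RBF kernel, 50 trees) are mirrored; everything else is left at library defaults.

```python
# Hypothetical scikit-learn counterparts of the seven MATLAB models.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

models = {
    "LR": LogisticRegression(max_iter=1000),
    "NN": MLPClassifier(hidden_layer_sizes=(12, 12), activation="logistic"),
    "SVM": SVC(kernel="rbf", probability=True),      # Gaussian (RBF) kernel
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=50),   # 50 trees, as in TreeBagger
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
}
```

Each of these exposes `fit(X, y)` and `predict_proba(X)`, so the continuous 0-to-1 risk output described in the text corresponds to the probability of the positive class.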
Model selection used 10-fold cross-validation within both the training and testing sets to determine the mean AUC of each algorithm. The two algorithms with the highest testing mean AUCs were selected as the final models, then retrained on the full training set and evaluated on the held-out testing set. For each final model, a decision threshold was chosen by maximizing the sum of training sensitivity and specificity (i.e., maximizing balanced accuracy).
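The threshold rule, maximizing the sum of sensitivity and specificity (Youden's J statistic), can be sketched as a scan over candidate cutoffs:

```python
def best_threshold(scores, labels):
    """Pick the cutoff maximizing sensitivity + specificity (Youden's J),
    the threshold-selection rule described in the text."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s <  t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s <  t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        if sens + spec > best_j:
            best_t, best_j = t, sens + spec
    return best_t

scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0,   0,   0,    1,   1,   1  ]
print(best_threshold(scores, labels))  # 0.4 -- perfect separation in this toy case
```

Because the threshold is fixed on training data before the test set is touched, the reported test-set sensitivity and specificity are honest estimates rather than post hoc optima.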
Cross-validation results across all seven algorithms showed testing AUCs ranging from 0.68 (logistic regression) to 0.95 (random forest). Four algorithms (LR, NN, LDA, NB) generalized well, with similar training and testing AUCs, while SVM, DT, and RF showed a marked drop from training to testing, indicating some degree of overfitting. Despite this, the random forest and neural network achieved the highest testing AUCs and were selected for further analysis.
When retrained on the full training set and evaluated on the held-out test set, the random forest achieved a training AUC of 0.99 (95% CI: 0.99-1.00) and a testing AUC of 0.96 (95% CI: 0.94-0.97). At the optimal threshold, its testing sensitivity was 75.7% and specificity was 98.3%. The PPV was 16.3% and the NPV was 99.9% on the testing set. The neural network achieved a training AUC of 0.91 (95% CI: 0.90-0.93) and a testing AUC of 0.88 (95% CI: 0.86-0.91). Its testing sensitivity was 67.7%, specificity 91.1%, PPV 3.3%, and NPV 99.8%.
The relatively low PPV values (16.3% for RF, 3.3% for NN) reflect the low prevalence of endometrial cancer in the population (1.2%). Even with a highly accurate model, most flagged individuals in a low-prevalence setting will be false positives. However, the extremely high NPV (99.9% for both models) means the models are very effective at ruling out cancer in women they classify as low risk. The authors note that logistic regression and naive Bayes performed worst, likely due to their inability to capture interaction terms between input features.
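The prevalence effect on PPV follows directly from Bayes' rule. The sketch below uses the reported sensitivity and specificity as illustrative inputs only; the paper's PPV of 16.3% comes from its actual test-set confusion matrix, so these formula-based numbers will not reproduce it exactly.

```python
def ppv_npv(sens, spec, prev):
    """Positive and negative predictive value from sensitivity,
    specificity, and prevalence, via Bayes' rule."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same test characteristics at two prevalences: PPV collapses when the
# disease is rare, while NPV stays near 1.
for prev in (0.012, 0.20):
    ppv, npv = ppv_npv(sens=0.757, spec=0.983, prev=prev)
    print(f"prevalence {prev:.1%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

This is why a low PPV at 1.2% prevalence is not a defect of the model so much as arithmetic: even a specific test accumulates false positives across the vast cancer-free majority.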
Following a recommendation from Kitson et al. (2017) that dividing the population into low-, medium-, and high-risk groups would enable tailored cancer prevention strategies, the authors used both models to create a 3-tier stratification. The boundaries were set so the bottom 15.9% of predicted risks were classified as below average, the top 15.9% as above average, and the middle 68.2% as average (the fractions lying beyond and within one standard deviation of the mean of a normal distribution). These thresholds were determined from the training data and then applied to the testing set.
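The tier assignment can be sketched as quantile cut points learned on training-set risk scores and then frozen for the test set (a minimal illustration, not the authors' code):

```python
def tier_thresholds(train_scores, lower_frac=0.159, upper_frac=0.159):
    """Cut points such that ~15.9% of training scores fall below the first
    and ~15.9% above the second (the one-sigma tails of a normal curve)."""
    s = sorted(train_scores)
    n = len(s)
    lo = s[int(n * lower_frac)]
    hi = s[int(n * (1 - upper_frac))]
    return lo, hi

def assign_tier(score, lo, hi):
    """Map a risk score to one of the three tiers using frozen thresholds."""
    if score < lo:
        return "below-average"
    if score >= hi:
        return "above-average"
    return "average"

train = [i / 999 for i in range(1000)]   # toy uniform risk scores
lo, hi = tier_thresholds(train)
print(assign_tier(0.05, lo, hi), assign_tier(0.5, lo, hi), assign_tier(0.95, lo, hi))
```

Freezing the cut points on training data matters: recomputing quantiles on the test set would leak test-set information into the tier boundaries.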
For the random forest, the stratification was remarkably effective: 90.3% of women who developed endometrial cancer within five years were placed in the above-average risk group. Only 0.3% of cancer cases fell into the below-average group (just 1 out of 288 cancer cases). The incidence rates in the below-average, average, and above-average risk groups were 0.03%, 0.17%, and 6.17%, respectively. For the neural network, 72.3% of cancer cases were placed in the above-average group, and only 1.0% in the below-average group.
Kaplan-Meier survival plots over the full 13-year follow-up period confirmed clear separation between the three risk groups for both models. The hazard ratios between the above-average group and the other groups were statistically significant, demonstrating that the models could meaningfully stratify long-term risk, not just short-term outcomes. This suggests the models capture underlying risk factors that persist well beyond the initial five-year prediction window.
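For reference, the product-limit (Kaplan-Meier) estimate underlying such survival plots can be computed in a few lines; this is a minimal from-scratch sketch with right-censoring, not the authors' implementation:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate. events[i] is 1 if subject i had the
    event at times[i], 0 if censored then. Returns a list of (t, S(t))."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = removed = 0
        while i < len(order) and times[order[i]] == t:   # group ties at time t
            deaths += events[order[i]]
            removed += 1
            i += 1
        if deaths:                                       # step down only at events
            surv *= (at_risk - deaths) / at_risk
            curve.append((t, surv))
        at_risk -= removed                               # events + censorings leave
    return curve

# Toy cohort: events at years 2 and 5, censoring at years 3 and 6
print(kaplan_meier([2, 3, 5, 6], [1, 0, 1, 0]))  # [(2, 0.75), (5, 0.375)]
```

In the study's setting, one such curve would be computed per risk tier, and the visual separation between the three curves is what the hazard ratios quantify.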
The most distinctive aspect of this study is its direct comparison between ML models and practicing clinicians. The authors created an online survey presenting de-identified data for 100 women (50 who developed cancer, 50 who did not) to 15 physicians from Yale, Harvard, University of Michigan, and INOR Cancer Hospital (Pakistan). Each physician was shown a random subset of 20 women and asked to classify them as below-, at-, or above-average risk. Clinicians received no instructions on classification criteria, simulating real-world clinical judgment.
True positive rate (above-average risk): The random forest identified 94.0% of women who developed cancer as above-average risk, and the neural network identified 70.0%. Physicians identified only 38.0% (SD 24%). False positive rate (above-average risk): The random forest flagged 14.0% of cancer-free women as above-average risk, the neural network flagged 8.0%, while physicians flagged 27.9% (SD 20%). False negatives (below-average risk): Both models classified 0.0% of women who developed cancer as below-average risk. Physicians misclassified 22.0% (SD 17%) of cancer cases as below-average risk.
In summary, the random forest was 2.5 times better than physicians at identifying above-average risk women who actually developed cancer, with a 2-fold reduction in false positives. The neural network was 2 times better at identifying above-average risk women, with a 3-fold reduction in false positives. Perhaps most importantly, neither model placed any true cancer case in the below-average risk category, while physicians misclassified nearly a quarter of cancer cases as low risk.
Another notable finding was the large inter-observer variability among the physicians, with standard deviations of 16-24% across their risk assessments. The ML models, by contrast, produce identical predictions every time, offering consistency that human judgment cannot match.
The authors introduce a compelling concept they call "statistical biopsy." Analogous to traditional tissue biopsy (which analyzes cells from a specimen) and liquid biopsy (which evaluates circulating DNA from a blood sample), statistical biopsy mines personal health data to predict cancer risk. The key difference is that statistical biopsy seeks to uncover invisible correlations and interconnections between multiple medical conditions and health parameters through machine learning, rather than looking for physical biomarkers.
The practical appeal is significant: a statistical biopsy costs almost nothing, requires no specimen collection, produces no side effects, and can generate a holistic risk profile across multiple cancer types. The same research group has already built strongly discriminatory models for non-melanoma skin cancer, prostate cancer, lung cancer, and pancreatic cancer using similar approaches. If integrated into electronic medical record (EMR) systems, these models could provide real-time risk predictions to primary care physicians during routine visits.
For endometrial cancer specifically, the models could help identify a population that would benefit from screening but currently receives none under existing guidelines. The authors note that when using stricter risk thresholds, the neural network model identified a high-risk subgroup in which 47% of women developed endometrial cancer within five years, with most developing it within one year. This suggests the model could serve as a powerful triage tool for directing invasive screening (endometrial biopsy, transvaginal ultrasound) to those most likely to benefit.
No external validation: The most significant limitation is that the model was validated only on a held-out portion of the same PLCO dataset (TRIPOD level 2a). By comparison, the Pfeiffer et al. model, despite its lower AUC of 0.68, was validated on an external dataset (TRIPOD level 3), making it more robust in terms of generalizability. The authors acknowledge this gap and state they are seeking access to external datasets for further validation. Until external validation is completed, the reported AUC of 0.96 should be interpreted with caution.
Limited interpretability: While the models excel at stratifying risk, they do not explain which individual factors drive the prediction for a given patient. This means the model can flag a woman as high risk, but it cannot tell her clinician whether dietary changes, progestin therapy, anti-estrogen therapy, or insulin-lowering therapy would be the most effective intervention. The authors plan to address this in future work, but for now the model functions as a black-box risk flag rather than a clinical decision guide.
Missing risk factors: The PLCO dataset does not include certain known risk factors such as Lynch syndrome (HNPCC), one of the strongest predictors of endometrial cancer. Adding genetic and genomic features could potentially improve performance further. Additionally, the study population was aged 55-75, so the model's applicability to younger women is unknown.
Physician comparison design: Each physician assessed only 20 of the 100 women, and physicians received no standardized instructions. While this was intentional (to simulate real-world variability), it also means the physician performance estimates have high variance. The sample of 15 physicians is relatively small, and the aggregation method (averaging across physicians for each woman) may smooth out meaningful individual differences in clinical expertise.