Endometrial cancer (EC) is the sixth most commonly diagnosed cancer in women worldwide, with approximately 417,000 new cases and 97,000 deaths reported in 2020. In the United States alone, an estimated 66,570 new cases were expected in 2021, and the incidence rate has risen by about 1% per year since the mid-2000s. Prognosis depends heavily on the stage at diagnosis. Patients diagnosed at stage IA, where the tumor invades less than 50% of the myometrium (the muscular wall of the uterus), have a 5-year survival rate of 90 to 96%. For stage IB, where invasion reaches 50% or more, the 5-year survival rate drops to 78 to 87%. Framed another way, overall 5-year survival is roughly 66.0% when myometrial invasion exceeds 50%, compared to 92.6% when invasion stays below that threshold.
Clinical impact of staging: The distinction between stage IA and stage IB directly determines treatment strategy. Low-risk patients (stage IA) can typically undergo a simple hysterectomy without adjuvant radiation therapy. High-risk patients (stage IB) usually require adjuvant radiation therapy and may be recommended for pelvic and para-aortic lymphadenectomy. Getting the staging right before surgery therefore shapes the entire treatment plan and has significant implications for patient quality of life.
MRI as the imaging standard: Magnetic resonance imaging has been the preferred modality for preoperative staging of EC since the European Society of Urogenital Radiology (ESUR) issued updated guidelines in 2009. MRI provides excellent soft-tissue contrast and is widely accepted as the first-choice imaging tool for initial EC staging. However, the accuracy of MRI interpretation is heavily dependent on the individual radiologist's experience, and studies have shown that two different radiologists can reach substantially different conclusions when evaluating the same MRI images. This inter-observer variability creates a clear opportunity for deep learning to provide a more objective, reproducible assessment.
The study's aim: Mao et al., working at Xiamen University of Technology and Fujian Maternity and Child Health Hospital, set out to build an automatic deep learning pipeline that segments the uterus and tumor on MRI images and then uses the tumor-to-uterus area ratio (TUR) to classify patients as stage IA or stage IB. The approach was designed to be both interpretable and efficient, avoiding the "black box" criticism that affects many deep learning classification models.
The study retrospectively enrolled 117 patients with pathologically confirmed early-stage endometrial cancer from Fujian Maternity and Child Health Hospital in China, covering the period from January 1, 2018 to December 31, 2020. Of these, 73 patients were stage IA and 44 were stage IB. The mean age was 54.8 years (standard deviation: 9.7 years), with stage IA patients averaging 51.4 years and stage IB patients averaging 58.9 years. All diagnoses were confirmed by postoperative biopsy, which served as the gold standard throughout the study. The Institutional Review Board approved the retrospective design, and informed consent was waived.
Clinical characteristics: Among the stage IA group, 49 patients had Grade 1 tumors, 22 had Grade 2, and 2 had Grade 3 endometrioid carcinoma. In the stage IB group, the distribution was 21 Grade 1, 18 Grade 2, and 5 Grade 3. For maximum tumor diameter, 55 stage IA patients had tumors smaller than 3 cm compared to 14 stage IB patients. Regarding myometrial invasion on pathology, 71 of 73 stage IA patients had less than 50% invasion (as expected by FIGO definition), while 41 of 44 stage IB patients had 50% or more invasion. No statistically significant differences were found between the groups in histological grade, maximum tumor diameter, or the presence of mixed carcinoma.
MRI acquisition: All imaging was performed on a 1.5 Tesla GE Optima MR360 scanner with patients in the supine position. Three MRI sequences were acquired for each patient: axial T2-weighted imaging (T2WI) with repetition/echo time of 4,000/45 msec, axial diffusion-weighted imaging (DWI) at b-values of 0 and 800 s/mm² with repetition/echo time of 5,000/74 msec, and sagittal T2WI with repetition/echo time of 4,200/76 msec. Slice thickness was 5 mm for all sequences. Axial T2WI and sagittal T2WI images were acquired at 512 x 512 resolution with voxel sizes of 0.625 x 0.625 x 5 mm and 0.547 x 0.547 x 5 mm respectively, while axial DWI was acquired at 256 x 256 resolution with 1.25 x 1.25 x 5 mm voxels.
Image selection: An experienced radiologist selected the MRI slices that clearly visualized both the uterus and tumor from each patient's three sequences. This yielded a total of 455 MRI images: 161 axial T2WI, 161 axial DWI, and 133 sagittal T2WI. This manual slice selection step is important to note as a limitation, since it introduces radiologist involvement in what is otherwise intended to be an automated pipeline.
The core of the automatic staging system is a U-Net semantic segmentation model, a fully convolutional network (FCN) architecture originally designed for biomedical image segmentation. The U-Net has a symmetric encoder-decoder structure: the contracting (encoder) path captures context through repeated 3 x 3 convolutions followed by ReLU activations and 2 x 2 max pooling with stride 2, doubling the number of feature channels at each step. The expansive (decoder) path upsamples the feature maps, applies 2 x 2 up-convolutions that halve the number of feature channels, and concatenates the result with correspondingly cropped feature maps from the encoder path via skip connections. This architecture allows the network to propagate fine-grained spatial information to higher-resolution layers, which is essential for precise segmentation boundaries.
Training configuration: The model was trained using the Adam optimizer with an initial learning rate of 0.0003 and cross-entropy loss as the objective function. Training ran for up to 300 epochs with an early-stop strategy to prevent overfitting. The implementation used TensorFlow version 2.5.0, and experiments were conducted on a workstation equipped with an NVIDIA RTX 2080 Ti GPU. Total training time was approximately 3 hours. Ground-truth annotations were created by an experienced radiologist using the LabelMe annotation tool, which generated pixel-level labels for both the uterus region and the tumor region on each MRI slice.
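The described architecture and training setup can be sketched in Keras. This is a minimal illustration under stated assumptions, not the authors' code: the base channel width, 256 x 256 single-channel input, three-class output (background/uterus/tumor), and early-stopping patience are all assumptions not reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # two 3x3 convolutions with ReLU, as in the original U-Net
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 1), n_classes=3, base_filters=16):
    inputs = tf.keras.Input(input_shape)
    skips, x = [], inputs
    # contracting path: conv blocks + 2x2 max pooling, doubling channels
    for level in range(4):
        x = conv_block(x, base_filters * 2 ** level)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 16)  # bottleneck
    # expansive path: 2x2 up-convolutions halving channels + skip concatenation
    for level in reversed(range(4)):
        x = layers.Conv2DTranspose(base_filters * 2 ** level, 2,
                                   strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[level]])
        x = conv_block(x, base_filters * 2 ** level)
    # per-pixel softmax over background / uterus / tumor
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
# Adam with the reported initial learning rate and cross-entropy loss
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss="sparse_categorical_crossentropy")
# early-stop strategy; the patience value here is an assumption
early_stop = tf.keras.callbacks.EarlyStopping(patience=20,
                                              restore_best_weights=True)
```

Training would then call `model.fit(..., epochs=300, callbacks=[early_stop])`, matching the reported 300-epoch cap with early stopping.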
Data splitting: The 455 MRI images from 117 patients were divided in a 6:1:3 ratio into training, validation, and test sets at the patient level. The training set contained 70 patients (44 IA, 26 IB) with 272 images. The validation set had 12 patients (7 IA, 5 IB) with 46 images. The test set included 35 patients (22 IA, 13 IB) with 137 images. Splitting at the patient level rather than the image level is important because it prevents data leakage, where different slices from the same patient could appear in both training and test sets.
Two-stage pipeline: The overall workflow has two distinct stages. First, the trained U-Net segments both the tumor and the uterus on each MRI image, producing a pixel-level segmentation map where the tumor is marked in red and the uterus in blue. Second, the tumor-to-uterus area ratio (TUR) is calculated from the segmentation map by dividing the number of tumor pixels by the total number of pixels in both the tumor and uterus regions combined. This TUR value is then compared against an optimal threshold derived from ROC analysis to classify the patient as stage IA or stage IB.
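The second stage reduces to a pixel count over the segmentation map. In this sketch the label values are illustrative (the paper describes color overlays, not label integers); the default threshold of 0.198 is the paper's sagittal T2WI cutoff.

```python
import numpy as np

TUMOR, UTERUS = 2, 1  # hypothetical label values; 0 = background

def tumor_to_uterus_ratio(label_map):
    """TUR = tumor pixels / (tumor pixels + uterus pixels)."""
    m = np.asarray(label_map)
    tumor = np.count_nonzero(m == TUMOR)
    uterus = np.count_nonzero(m == UTERUS)
    total = tumor + uterus
    return tumor / total if total else 0.0

def classify_stage(label_map, threshold=0.198):
    """Stage IB if the TUR exceeds the ROC-derived cutoff."""
    return "IB" if tumor_to_uterus_ratio(label_map) > threshold else "IA"
```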
The segmentation model was evaluated using the Dice Similarity Coefficient (DSC), which measures the overlap between the predicted segmentation and the ground-truth annotation. DSC ranges from 0 (no overlap) to 1 (perfect overlap). Across all 137 test images, the average DSC for uterus segmentation was 0.959 (SD: 0.089, median: 0.974) and for tumor segmentation was 0.911 (SD: 0.123, median: 0.947). All differences between uterus and tumor DSCs were statistically significant (p < 0.001), reflecting the greater difficulty of tumor segmentation.
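The DSC used for evaluation can be computed directly from binary masks:

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """DSC = 2|A ∩ B| / (|A| + |B|): 1 = perfect overlap, 0 = none."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, true).sum() / denom
```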
Sequence-specific results: For axial T2WI images, the mean uterus DSC was 0.964 (SD: 0.050, median: 0.978) and tumor DSC was 0.918 (SD: 0.128, median: 0.951). For axial DWI images, the mean uterus DSC was 0.952 (SD: 0.139, median: 0.975) and tumor DSC was 0.915 (SD: 0.143, median: 0.953). For sagittal T2WI images, the mean uterus DSC was 0.961 (SD: 0.023, median: 0.968) and tumor DSC was 0.897 (SD: 0.082, median: 0.923). All three sequences achieved DSC values above 0.9 for both uterus and tumor, demonstrating consistent performance.
Why uterus is easier to segment: The consistently higher DSC for uterus compared to tumor is explained by the relatively fixed geometric shape of the uterus in the human body, which allows the model to learn characteristic parameters more effectively. Tumors, by contrast, appear in various patterns and sizes, making them inherently harder to learn. Additionally, interference factors such as pelvic effusion, hematocele, uterine fibroids, and cervical cancer can confuse the model during tumor segmentation. The standard deviations of the DSC for both uterus and tumor across all three MRI sequences were below 0.15, indicating stable model performance with low fluctuation.
Comparison with prior work: These segmentation results compare favorably to previous studies. Kurata et al. achieved a DSC of 0.82 for uterine segmentation on sagittal T2WI images using a U-Net, but did not segment tumors. Hodneland et al. used a 3D CNN for tumor segmentation and achieved median DSCs of 0.84 and 0.77 against two raters, but did not segment the uterus or analyze the TUR. The current study's mean DSC values of roughly 0.90 to 0.96 across both structures and all three sequences represent a notable improvement.
After segmenting both the tumor and uterus, the pipeline calculates the tumor-to-uterus area ratio (TUR) for each MRI slice. The TUR is defined as the ratio of tumor pixels to the total pixels in both the tumor and uterus regions. Stage IA patients showed consistently lower TUR values compared to stage IB patients across all three MRI sequences, confirming the clinical expectation that deeper myometrial invasion corresponds to a larger proportion of tumor within the uterine cross-section.
Mean TUR values: For axial T2WI images, the mean TUR was 0.165 (SD: 0.083) for stage IA and 0.307 (SD: 0.112) for stage IB (p < 0.001). For axial DWI images, the mean TUR was 0.190 (SD: 0.077) for stage IA and 0.335 (SD: 0.117) for stage IB (p < 0.001). For sagittal T2WI images, the mean TUR was 0.103 (SD: 0.077) for stage IA and 0.334 (SD: 0.125) for stage IB (p < 0.001). The sagittal T2WI sequence showed the largest separation between the two groups, with stage IB TUR more than three times larger than stage IA TUR.
ROC-derived thresholds: Receiver operating characteristic curves were plotted for each MRI sequence to determine the optimal threshold for classifying stage IA versus stage IB. For axial T2WI, a TUR threshold of 0.207 yielded 84.6% sensitivity, 86.4% specificity, and an AUC of 0.86. For axial DWI, a threshold of 0.331 yielded 69.2% sensitivity, 95.5% specificity, and an AUC of 0.85. For sagittal T2WI, a threshold of 0.198 yielded 92.3% sensitivity, 90.9% specificity, and an AUC of 0.94. The sagittal T2WI sequence clearly delivered the best overall performance, with the highest AUC, sensitivity, and specificity.
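An optimal cutoff of this kind is commonly selected by maximizing Youden's J statistic (sensitivity + specificity − 1) over candidate thresholds. The paper does not state its exact criterion, so the plain-NumPy sketch below is an assumption about the method, not a reproduction of it.

```python
import numpy as np

def best_threshold(tur_values, labels):
    """Sweep each observed TUR as a candidate cutoff (classify IB if TUR > t)
    and return the one maximizing Youden's J = sensitivity + specificity - 1."""
    tur = np.asarray(tur_values, dtype=float)
    y = np.asarray(labels)  # 1 = stage IB, 0 = stage IA
    best_t, best_j = None, -1.0
    for t in np.unique(tur):
        pred = tur > t
        sens = pred[y == 1].mean() if (y == 1).any() else 0.0
        spec = (~pred)[y == 0].mean() if (y == 0).any() else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```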
Clinical interpretation: These thresholds provide a concrete, interpretable decision rule. When the TUR on a sagittal T2WI image exceeds 0.198, the model classifies the patient as stage IB. This approach is more transparent than a purely neural network-based classifier because the clinician can directly see the segmentation overlay and the computed ratio, rather than relying on an opaque probability output. The combination of deep learning segmentation with traditional statistical analysis (ROC curves) is a deliberate design choice to improve clinical trust and interpretability.
Radiologists in clinical practice rarely rely on a single MRI sequence; they weigh multiple sequences when making staging decisions. To mirror this, the authors tested all possible combinations of two and three MRI sequences for staging, using a fuzzy logic voting approach. Under this scheme, a patient is classified as stage IB if the TUR exceeds its sequence-specific threshold on at least one, at least two, or all three of the combined sequences, depending on the chosen voting rule. This systematic comparison reveals the trade-offs between sensitivity and specificity at different levels of consensus.
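The voting rule described above can be expressed compactly. The dictionary keys are illustrative, and the thresholds echo the single-sequence ROC cutoffs reported earlier.

```python
def vote_stage(tur_by_sequence, thresholds, min_votes):
    """Classify as stage IB when the TUR exceeds the sequence-specific
    threshold on at least `min_votes` of the combined sequences."""
    votes = sum(tur_by_sequence[seq] > thresholds[seq]
                for seq in tur_by_sequence)
    return "IB" if votes >= min_votes else "IA"

# ROC-derived single-sequence thresholds from the study
THRESHOLDS = {"axial_t2wi": 0.207, "axial_dwi": 0.331, "sagittal_t2wi": 0.198}
```

Setting `min_votes=1` gives the lenient at-least-one rule, `min_votes=3` the strict all-must-agree rule.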
Single-sequence performance: Among the individual sequences, sagittal T2WI delivered the best overall accuracy at 0.914, with 0.923 sensitivity and 0.909 specificity. Axial T2WI and axial DWI each achieved an accuracy of 0.857, but with different sensitivity-specificity profiles. Axial T2WI had balanced performance (0.846 sensitivity, 0.864 specificity), while axial DWI had lower sensitivity (0.692) but very high specificity (0.955).
Two-sequence combinations: Combining axial DWI with sagittal T2WI achieved the highest two-sequence accuracy at 0.886, with 0.923 sensitivity and 0.864 specificity. The axial T2WI plus sagittal T2WI combination achieved perfect sensitivity of 1.000 but lower specificity of 0.773, meaning it caught every stage IB case but over-classified some stage IA patients. Axial T2WI plus axial DWI maintained balanced performance at 0.857 accuracy.
Three-sequence voting: Using all three sequences with different consensus thresholds produced notable results. Requiring at least one sequence to exceed its threshold yielded perfect sensitivity of 1.000 but reduced specificity to 0.773, since this lenient rule flags a patient as stage IB on any single positive vote. Requiring at least two sequences gave balanced performance with accuracy of 0.886, sensitivity of 0.769, and specificity of 0.955. Requiring all three sequences to agree produced the highest specificity of 1.000 (no false positives) but reduced sensitivity to 0.692. These results suggest that clinicians could select their voting rule based on whether they prioritize catching every stage IB case (at-least-one rule) or minimizing false positives (all-must-agree rule).
To demonstrate the advantage of the deep learning approach, the authors compared the U-Net segmentation against three traditional machine learning segmentation algorithms: OTSU thresholding, region growing, and edge detection. These comparisons were performed on MRI images from the test set, and the results clearly illustrated why conventional image processing methods fail on pelvic MRI data.
OTSU thresholding: OTSU is a threshold-based algorithm that attempts to separate foreground from background by finding an optimal intensity threshold. On pelvic MRI images, this approach failed because the grayscale values of the uterus and tumor are not significantly different from those of surrounding pelvic tissues. The algorithm could not find a threshold that reliably distinguished the target structures from the rest of the pelvis.
Region growing: This region-based method requires manual specification of initial seed points and growth criteria for each image. Because the location and size of the uterus and tumor vary across patients and slices, the region growing algorithm could not be applied in a generalizable way. Its dependence on manual initialization makes it impractical for an automated staging pipeline.
Edge detection: Due to the complex and dense distribution of tissues and organs in the pelvis, MRI images contain many edge features that create noise for edge-based segmentation. The algorithm detected too many irrelevant boundaries, making it unable to isolate the uterus and tumor from the surrounding anatomy.
U-Net superiority: In contrast to all three traditional methods, the CNN-based U-Net successfully segmented both the uterus and tumor on MRI images, producing results that closely matched the ground-truth contours created by the experienced radiologist. The key advantage of the deep learning approach is its ability to learn high-level semantic features from the training data, rather than relying on hand-crafted rules about intensity thresholds, region homogeneity, or edge gradients.
Small sample size: With only 117 patients from a single institution, the study is limited in generalizability. The test set contained just 35 patients (22 IA, 13 IB), meaning the ROC thresholds and classification metrics are derived from a relatively small number of cases. No external validation cohort was used, so it remains unknown how well the TUR thresholds and segmentation model would perform on data from different scanners, institutions, or patient populations. The authors plan to collect additional cases, including MRI images from healthy individuals, to improve robustness.
Manual slice selection: An experienced radiologist manually selected the MRI slices that best visualized the uterus and tumor for each patient. This step introduces subjectivity and radiologist dependency into what is otherwise an automated pipeline. The authors acknowledge this bottleneck and state that developing an automatic slice-picking method is a priority for future work to make the model fully end-to-end.
Segmentation errors from confounding structures: The study identified a specific failure mode where pelvic effusion was falsely segmented as tumor by the model, likely due to brightness similarity between the effusion and the tumor on T2WI images. Adjacent slices without the effusion were segmented correctly, suggesting that the error is slice-specific rather than systematic. Similar confounding factors such as hematocele, uterine fibroids, and cervical cancer could produce analogous errors. The dataset did not include healthy controls, which could help the model learn to distinguish normal anatomical variants from pathology.
2D versus 3D approach: The study used 2D slices rather than volumetric 3D data, which avoids the labor-intensive annotation requirements of 3D datasets but also discards spatial information across slices. Prior work by Bonatti et al. found that a tumor/uterus volume ratio greater than 0.13 could distinguish low-grade from high-grade EC with 50% sensitivity and 89% specificity, but volume computation from manual measurements was error-prone. A future direction would be to combine the 2D segmentation approach with slice-by-slice volumetric reconstruction.
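A slice-by-slice volumetric reconstruction of the kind proposed would be straightforward once per-slice segmented areas are available. This sketch assumes areas in mm² and the study's 5 mm slice thickness; the ratio definition follows the tumor/uterus volume ratio described by Bonatti et al., as an assumption about its exact form.

```python
def volume_from_slices(areas_mm2, slice_thickness_mm=5.0):
    """Approximate a structure's volume by summing per-slice segmented
    areas and multiplying by the slice thickness."""
    return sum(areas_mm2) * slice_thickness_mm

def tumor_uterus_volume_ratio(tumor_areas_mm2, uterus_areas_mm2,
                              thickness_mm=5.0):
    # ratio of reconstructed tumor volume to reconstructed uterus volume
    uterus_vol = volume_from_slices(uterus_areas_mm2, thickness_mm)
    tumor_vol = volume_from_slices(tumor_areas_mm2, thickness_mm)
    return tumor_vol / uterus_vol if uterus_vol else 0.0
```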
Comparison to prior DL staging methods: Chen et al. previously proposed a deep learning-based two-stage CAD method for assessing myometrial invasion depth that yielded 0.67 sensitivity, 0.88 specificity, and 0.85 accuracy, but used only T2WI data and suffered from the black-box interpretability problem. The current TUR-based approach offers a more interpretable alternative, though direct head-to-head comparison on the same dataset has not been performed. Future work should focus on developing methods that can imitate radiologist decision-making behavior more closely, incorporating multi-sequence fusion at the model level rather than the voting level.