Automated System for Diagnosing Endometrial Cancer by Adopting Deep-Learning Technology in Hysteroscopy


Plain-English Explanations

1. Why Hysteroscopy and Deep Learning? The Clinical Need for Automated Endometrial Cancer Detection

Endometrial cancer is the most common gynecologic malignancy worldwide, and its incidence has risen significantly in recent years. When it is detected early, particularly at stage IA without myometrial invasion, the prognosis is favorable, and patients may even preserve their fertility through progestin therapy. However, patients diagnosed at later stages face limited treatment options and substantially worse outcomes. This makes early, accurate detection critically important.

The screening gap: Unlike cervical cancer, which benefits from Pap smear screening, there is no established screening technique for endometrial cancer. Endometrial cytology is unreliable because it is essentially a blind test that produces a high rate of false negatives. The standard diagnostic procedure, endometrial biopsy via dilation and curettage, is invasive and not suitable for routine screening. Hysteroscopy, which allows direct visualization of the uterine cavity, has been suggested by recent studies as an effective technique for endometrial cancer diagnosis, but it is not yet widely used in this role.

The deep learning opportunity: Deep neural networks (DNNs) have demonstrated impressive results in medical image analysis, including dermatology (matching dermatologist-level skin cancer classification on 129,000+ images), gastrointestinal endoscopy (achieving 98% polyp detection rates), and radiographic imaging for breast and lung cancers. However, at the time of this study, no DNN-based system had been developed specifically for endometrial cancer diagnosis from hysteroscopic images. This study by Takahashi, Sone, and colleagues at the University of Tokyo represents the first attempt to combine deep learning with hysteroscopy for endometrial cancer detection.

The small-sample challenge: A major barrier in this domain is data scarcity. Deep learning typically requires 100,000 to 1,000,000 images for training, but hysteroscopy-based cancer diagnosis is uncommon, making it difficult to collect large datasets from a single institution. The central challenge of this research was therefore to develop analytical methods that achieve high diagnostic accuracy despite a limited number of cases, establishing a foundation for future large-scale, multi-institutional studies.

TL;DR: Endometrial cancer lacks an established screening method. Hysteroscopy can visualize the uterine cavity directly but has not been widely adopted for cancer diagnosis. This is the first study to apply deep learning to hysteroscopic images for endometrial cancer detection, tackling the key challenge of achieving high accuracy with limited training data from a single institution (177 patients total).

2. Study Design and Dataset: 177 Patients, 411,800 Images, Five Diagnostic Categories

The study used hysteroscopy videos collected from 177 patients at the University of Tokyo Hospital between 2011 and 2019. These patients were categorized into five clinical groups: normal endometrium (60 patients), endometrial polyp (60 patients), uterine myoma (21 patients), atypical endometrial hyperplasia or AEH (15 patients), and endometrial cancer (21 patients). The pathological diagnoses for AEH and endometrial cancer were confirmed by biopsy or surgery, while benign conditions were diagnosed based on a combination of endometrial cytology, histology, hysteroscopic findings, MRI, ultrasound, and clinical course.

Image extraction from video: The hysteroscopy videos ranged from 10.5 seconds to 395.3 seconds in length, with a mean duration of 77.5 seconds and a median of 63.5 seconds. From these 177 videos, a total of 411,800 still images were extracted. The breakdown by category was: normal endometrium 113,357 images (27.5%), endometrial polyp 143,449 images (34.8%), uterine myoma 45,037 images (11.0%), AEH 42,146 images (10.2%), and endometrial cancer 67,811 images (16.4%). Because the videos were captured using different hysteroscopic systems with varying resolutions and image positions, all extracted frames were resized to 256 x 256 pixels for Xception and 224 x 224 pixels for MobileNetV2 and EfficientNetB0.

Binary classification setup: Given the limited number of cancer and AEH cases, the researchers defined just two classes for training and prediction: "Malignant" (comprising AEH and endometrial cancer, 36 videos and 109,957 images) and "Others" (comprising uterine myoma, endometrial polyps, and normal endometrium, 141 videos and 301,843 images). Grouping AEH with cancer was clinically justified because AEH is considered a precancerous condition of endometrial cancer, and detecting it early is equally important for patient management.
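The grouping above reduces to a small labeling rule; the following sketch checks the reported video counts against it (function and variable names are illustrative, not from the paper):

```python
# Hypothetical helper mirroring the paper's binary grouping: AEH and
# endometrial cancer form "Malignant"; all benign categories form "Others".
MALIGNANT = {"AEH", "endometrial cancer"}
OTHERS = {"normal endometrium", "endometrial polyp", "uterine myoma"}

def binary_label(category: str) -> str:
    """Map one of the five diagnostic categories to the two training classes."""
    if category in MALIGNANT:
        return "Malignant"
    if category in OTHERS:
        return "Others"
    raise ValueError(f"unknown category: {category}")

# Video counts per category, as reported in the study.
video_counts = {
    "normal endometrium": 60,
    "endometrial polyp": 60,
    "uterine myoma": 21,
    "AEH": 15,
    "endometrial cancer": 21,
}
malignant_videos = sum(n for c, n in video_counts.items()
                       if binary_label(c) == "Malignant")  # 15 + 21 = 36
others_videos = sum(n for c, n in video_counts.items()
                    if binary_label(c) == "Others")        # 60 + 60 + 21 = 141
```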

Cross-validation approach: The 177 videos were randomly divided into four groups (pair-A through pair-D). In each cross-validation fold, three groups served as training data and one as evaluation data. This 4-fold cross-validation design ensured that every patient appeared in the evaluation set exactly once, providing a fair estimate of generalization performance despite the small overall dataset.
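A minimal sketch of this patient-level four-fold split, assuming a simple random partition (the paper does not describe its shuffling procedure; `make_folds` and the seed are illustrative):

```python
import random

def make_folds(video_ids, n_folds=4, seed=0):
    """Randomly partition videos into n_folds groups (pair-A ... pair-D),
    so each video lands in exactly one evaluation fold."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def cross_validation_splits(folds):
    """Yield (train, eval) pairs: three folds train, one fold evaluates."""
    for i, eval_fold in enumerate(folds):
        train = [v for j, f in enumerate(folds) if j != i for v in f]
        yield train, eval_fold

folds = make_folds(range(177))  # 177 videos in the study
splits = list(cross_validation_splits(folds))
```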

TL;DR: 177 patients yielded 411,800 hysteroscopic images across five categories: normal (60), polyp (60), myoma (21), AEH (15), and cancer (21). The task was simplified to binary classification, grouping AEH and cancer as "Malignant" (36 videos, 109,957 images) versus all benign conditions as "Others" (141 videos, 301,843 images). Four-fold cross-validation ensured every patient was evaluated exactly once.

3. Three Neural Network Architectures: Xception, MobileNetV2, and EfficientNetB0

The researchers selected three convolutional neural network architectures specifically because they offer high accuracy with relatively small datasets and low computational costs, making them suitable for real-time clinical applications and eventual deployment in medical devices. All models were built using Keras on TensorFlow and trained on an Intel Core i7-9700 CPU with an Nvidia GTX 1080 Ti GPU.

Xception: Developed by François Chollet, Xception (Extreme Inception) is inspired by the Inception architecture but replaces its Inception modules with "Depthwise Separable Convolutions." This technique decomposes a conventional convolution into two separate operations: Depthwise Convolution (which applies a single filter to each input channel independently) and Pointwise Convolution (a 1x1 convolution that combines the outputs). This decomposition dramatically reduces the number of parameters while preserving representational power. Xception processes images at 256 x 256 pixel resolution. It required the longest training time of the three networks, roughly three times that of MobileNetV2.
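The parameter savings from this decomposition can be checked with simple arithmetic; the sketch below counts only convolution weights (biases and batch-norm parameters omitted), and the layer sizes are an illustrative example, not taken from Xception itself:

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias terms omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) followed by a
    1x1 pointwise convolution that mixes channels."""
    depthwise = k * k * c_in   # one spatial filter per channel
    pointwise = c_in * c_out   # 1x1 convolution across channels
    return depthwise + pointwise

# Example: a 3x3 convolution with 128 input and 256 output channels.
full = standard_conv_params(3, 128, 256)             # 294,912 weights
separable = depthwise_separable_params(3, 128, 256)  # 1,152 + 32,768 = 33,920
```

For this layer the separable form uses roughly one-ninth the weights of the dense convolution, which is where the "same power, fewer parameters" claim comes from.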

MobileNetV2: Created by Google researchers, MobileNetV2 is optimized for mobile and edge devices. Its signature feature is the "Inverted Residual" block, which is applied across nearly every layer. Unlike standard residual blocks that narrow then widen, inverted residuals first expand the representation through a pointwise convolution, apply depthwise convolution, then project back to a thin bottleneck. This approach substantially reduces the total number of parameters. MobileNetV2 processes images at 224 x 224 pixels and had the shortest training time of the three architectures.
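Analogous arithmetic for a single inverted residual block, assuming MobileNetV2's default expansion factor of 6 (the helper name and example channel counts are illustrative):

```python
def inverted_residual_params(c_in, c_out, expansion=6, k=3):
    """Weights in one MobileNetV2-style inverted residual block (biases and
    batch-norm parameters omitted): expand -> depthwise -> project."""
    c_mid = c_in * expansion
    expand = c_in * c_mid      # 1x1 pointwise expansion to the wide representation
    depthwise = k * k * c_mid  # k x k depthwise conv on the expanded features
    project = c_mid * c_out    # 1x1 linear projection back to a thin bottleneck
    return expand + depthwise + project

# Example: 32 -> 32 channels with expansion factor 6 (wide layer has 192 channels).
block = inverted_residual_params(32, 32)  # 6,144 + 1,728 + 6,144 = 14,016
dense_equivalent = 3 * 3 * 192 * 192      # a standard 3x3 conv at the expanded width
```

Even though the block briefly widens to 192 channels, it carries far fewer weights than a dense 3x3 convolution at that width, which is the point of the inverted design.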

EfficientNetB0: Developed by Tan and Le at Google, EfficientNet introduces compound scaling coefficients that jointly optimize three dimensions of a convolutional network: depth (number of layers), width (number of channels per layer), and input resolution. Rather than scaling these independently, the compound coefficient balances all three for maximum efficiency. EfficientNetB0 is the baseline model in the EfficientNet family and processes images at 224 x 224 pixels. Its training time fell between Xception and MobileNetV2.
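The compound-scaling idea can be sketched numerically. The coefficients alpha=1.2, beta=1.1, gamma=1.15 are the values reported by Tan and Le for the EfficientNet family (not from this study), and phi=0 recovers the unscaled B0 baseline:

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet compound scaling: depth, width, and input resolution
    grow together as alpha**phi, beta**phi, and gamma**phi, rather than
    being tuned independently."""
    return {
        "depth_multiplier": alpha ** phi,       # number of layers
        "width_multiplier": beta ** phi,        # channels per layer
        "resolution_multiplier": gamma ** phi,  # input image size
    }

# phi = 0 is EfficientNetB0, the baseline model used in this study.
b0 = compound_scaling(0)
b1 = compound_scaling(1)
```

The coefficients are chosen so that alpha * beta^2 * gamma^2 is approximately 2, i.e. each increment of phi roughly doubles the network's computational cost.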

TL;DR: Three lightweight CNN architectures were chosen for their efficiency and suitability for real-time clinical use. Xception replaces standard Inception modules with Depthwise Separable Convolutions. MobileNetV2 uses Inverted Residual blocks for minimal parameter count and fastest training. EfficientNetB0 uses compound scaling to optimize depth, width, and resolution simultaneously. All were built in Keras/TensorFlow on a single GTX 1080 Ti GPU.

4. Training Strategy: Two Dataset Variants, 144 Models, and the Continuity Analysis Method

Two training datasets (Set X and Set Y): The researchers created two versions of the training data for the "Malignant" class. Set X included all frames from hysteroscopy videos, including frames showing the cervical canal and extrauterine regions. Set Y was a curated subset that excluded those non-diagnostic frames, retaining only images of the uterine cavity and lesion sites. The rationale for comparing both was that even frames without visible lesions might contain subtle features (such as cloudy uterine luminal fluid associated with malignancy) that computers could detect but the human eye might miss.

Model generation at scale: Because neural networks produce different results even when trained with identical data and architecture (due to random weight initialization and stochastic training processes), each of the three network types was trained six times on each dataset variant, across all four cross-validation pairs. This yielded a total of 144 trained models (3 architectures x 6 iterations x 2 datasets x 4 cross-validation folds), providing a robust statistical basis for comparing performance.
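The full model grid is easy to enumerate; the names below are illustrative labels, not identifiers from the paper:

```python
from itertools import product

architectures = ["Xception", "MobileNetV2", "EfficientNetB0"]
datasets = ["Set X", "Set Y"]
cv_pairs = ["pair-A", "pair-B", "pair-C", "pair-D"]
iterations = range(6)  # each configuration is retrained six times

# Every (architecture, dataset, fold, iteration) combination is one trained model.
model_grid = list(product(architectures, datasets, cv_pairs, iterations))

# The Set Y subset later forms the 72-model ensemble.
set_y_models = [m for m in model_grid if m[1] == "Set Y"]
```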

Two evaluation methods: The trained models were evaluated in two ways. The first was standard image-by-image evaluation, where each individual frame received a Malignant or Others prediction. For malignant cases, 100 images clearly showing the lesion site were extracted from each patient's video. For benign and normal cases, all frames were used. The second method was the novel continuity analysis, a video-unit evaluation designed specifically for hysteroscopy. In this approach, a video was classified as "Malignant" only if 50 or more consecutive frames were predicted as Malignant.

Why continuity analysis matters: The threshold of 50 consecutive frames was determined through a pre-study analysis. Rather than choosing the threshold where malignant and benign average scores were optimally separated, the researchers deliberately set the threshold at the intersection point of the malignant and non-malignant score distributions. This conservative choice prioritizes reducing missed diagnoses (false negatives) over reducing false positives, which is the clinically appropriate trade-off when screening for cancer. A single misclassified frame should not trigger a cancer diagnosis, but sustained detection across many consecutive frames provides much stronger evidence of true malignancy.
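The continuity rule reduces to a run-length check over the per-frame predictions; this is a sketch of the idea, not the authors' code:

```python
def continuity_malignant(frame_predictions, threshold=50):
    """Video-level decision: classify the video as Malignant only if at
    least `threshold` consecutive frames were predicted Malignant."""
    run = 0
    for pred in frame_predictions:
        run = run + 1 if pred == "Malignant" else 0
        if run >= threshold:
            return True
    return False

# A single stray Malignant frame does not trigger a positive call...
scattered = ["Others"] * 100 + ["Malignant"] + ["Others"] * 100
# ...but a sustained run of 50+ Malignant frames does.
sustained = ["Others"] * 30 + ["Malignant"] * 60 + ["Others"] * 30
```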

TL;DR: Two dataset variants were tested: Set X (all video frames) and Set Y (lesion-only frames). Each of 3 architectures was trained 6 times across 2 datasets and 4 cross-validation folds, yielding 144 models. A novel continuity analysis method required 50+ consecutive "Malignant" frames to classify a video as positive, reducing false positives while prioritizing sensitivity for clinical cancer screening.

5. Image-by-Image Results: Baseline Accuracy of 78.9% to 80.9%

The initial evaluation assessed how well each of the 144 models performed on individual frames. The results were grouped by dataset (Set X vs. Set Y) and by neural network architecture. The average accuracy across all models trained on Set X was 78.91%, while the average for Set Y was 80.93%, a difference of 2.01 percentage points. This confirmed that curating the training data to include only images of the uterine cavity and lesion sites (Set Y) yielded better performance than using all extracted frames indiscriminately (Set X).

Architecture comparison: The differences between the three architectures were remarkably small at the image-by-image level. The minimum average accuracy was 79.69% and the maximum was 80.16%, a spread of just 0.47 percentage points. This suggests that for this particular classification task and dataset size, the choice of architecture mattered much less than the choice of training data curation strategy. All three networks performed comparably despite their different internal structures and parameter counts.

Training time trade-offs: Although all three networks achieved similar accuracy, their training times differed substantially. MobileNetV2 was the fastest to train, consistent with its design goal of efficiency on resource-constrained devices. Xception required approximately three times longer than MobileNetV2, reflecting its more complex Inception-based architecture. EfficientNetB0 fell between the two. For a task where accuracy differences are negligible, training speed becomes an important practical consideration, especially when generating 144 separate models.

The image-by-image results, while encouraging as a baseline, also highlighted the limitations of frame-level classification for a video-based modality like hysteroscopy. Many individual frames contain ambiguous content, partial views of lesions, or transitional images as the hysteroscope moves through the uterine cavity. This motivated the development of the continuity analysis approach, which evaluates sequences of frames rather than isolated images.

TL;DR: Image-by-image accuracy ranged from 78.91% (Set X) to 80.93% (Set Y), confirming that curated lesion-only training data outperformed all-frames data by about 2 percentage points. The three architectures performed nearly identically (spread of only 0.47%), but MobileNetV2 trained 3x faster than Xception. These baseline results motivated the more sophisticated continuity analysis approach.

6. Continuity Analysis and Model Combination: Accuracy Jumps to 90.29%

Continuity analysis results: Applying the video-unit continuity analysis method (requiring 50+ consecutive "Malignant" frames) substantially improved diagnostic accuracy compared to image-by-image evaluation. The average accuracy for Set X rose to 83.94%, and for Set Y it reached 89.13%, an improvement of roughly 5 to 8 percentage points over the frame-level baseline. The gap between datasets X and Y also widened to 5.19 percentage points, further confirming the value of curating training data to focus on lesion-containing frames. Again, the differences between the three network architectures remained small, with the minimum and maximum average accuracies separated by only about half a percentage point (86.22% vs. 86.75%).

Combining 72 models for maximum accuracy: The final and most powerful evaluation combined all 72 models trained on Set Y (6 iterations x 4 cross-validation pairs x 3 architectures) using the continuity analysis method. Ensemble prediction across all 72 models achieved an overall accuracy of 90.29%. The per-condition breakdown revealed: endometrial cancer was correctly classified in 18 of 21 cases (85.71%), AEH was correctly classified in all 15 of 15 cases (100%), myoma in 18 of 21 cases (85.71%), endometrial polyp in 51 of 60 cases (85.00%), and normal endometrium in 57 of 60 cases (95.00%).

Sensitivity, specificity, and F-score: The combined model achieved a sensitivity of 91.66% (95% CI: 77.53%-98.24%) and specificity of 89.36% (95% CI: 83.06%-93.92%). The precision was 68.75% for Malignant predictions and 97.67% for Others predictions. The overall F-score was 0.757. The high sensitivity is particularly important in a screening context, as it means the system correctly flagged the vast majority of malignant cases. The lower precision for the Malignant class (68.75%) reflects false positives, with 15 of 48 Malignant predictions being incorrect, but in a cancer screening setting, false positives that lead to further workup are generally more acceptable than missed cancers.
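These rates can be reproduced from confusion counts reconstructed from the reported per-condition results (33 true positives from 18/21 cancer plus 15/15 AEH; 126 of 141 benign videos correctly called "Others", i.e. 15 false positives); the reconstruction is the explainer's arithmetic, not a table from the paper:

```python
# Confusion counts reconstructed from the reported per-condition results.
tp, fn = 33, 3    # malignant videos: 33 flagged, 3 missed (of 36)
tn, fp = 126, 15  # benign videos: 126 correct, 15 false positives (of 141)

sensitivity = tp / (tp + fn)          # 33/36   ~ 0.9167
specificity = tn / (tn + fp)          # 126/141 ~ 0.8936
precision_malignant = tp / (tp + fp)  # 33/48   = 0.6875
precision_others = tn / (tn + fn)     # 126/129 ~ 0.9767
```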

The perfect detection rate for AEH (100%, 15/15) is clinically notable because AEH is a precancerous condition. Catching every AEH case provides an opportunity for early intervention before progression to frank malignancy. The three misclassified cancer cases (3 of 21 predicted as "Others") represent the system's main clinical weakness, where cancers could be missed. The authors attributed these errors to specific challenging features: flat tumors that lack the typical exophytic appearance, and cases where excessive bleeding obscured the tumor.

TL;DR: Continuity analysis boosted accuracy from about 80% to 83.9%-89.1%. Combining all 72 Set Y models pushed overall accuracy to 90.29%, with sensitivity of 91.66% (95% CI: 77.53%-98.24%) and specificity of 89.36% (95% CI: 83.06%-93.92%). AEH detection was perfect at 100% (15/15). Endometrial cancer detection reached 85.71% (18/21), with 3 cancers missed due to flat morphology or excessive bleeding.

7. Limitations: Small Sample Size, Single Institution, and Computational Cost

Limited patient cohort: The most significant limitation is the small sample size, particularly for the malignant categories. Only 21 patients had endometrial cancer and 15 had AEH, for a total of just 36 malignant cases. While the 411,800 extracted images provide a seemingly large dataset, these all derive from only 177 unique patients, meaning the model's ability to generalize to new, unseen patients is uncertain. The wide confidence interval on sensitivity (77.53%-98.24%) directly reflects this small sample problem. A system intended for clinical deployment would need validation on a much larger, more diverse patient population.

Single-center design: All data came from the University of Tokyo Hospital, which means the model was trained and evaluated on images from a single hysteroscopic setup and patient demographic. Hysteroscopy equipment, technique, and image quality vary between institutions and operators. The lack of external validation makes it impossible to know how well the system would perform in different clinical settings with different equipment, patient populations, or operator techniques. The authors acknowledge this and specifically call for multi-institutional collaborative research as the next step.

Computational burden of the ensemble: Achieving the highest accuracy (90.29%) required combining predictions from 72 separate deep learning models. While this is a valid research strategy for demonstrating proof of concept, deploying 72 models simultaneously in a clinical device is impractical. The computational resources, memory, and inference time required for such an ensemble far exceed what would be feasible in a real-time hysteroscopy support tool. The authors acknowledge the need to develop a more compact system that could handle a larger case volume while maintaining high accuracy.

Error analysis gaps: The study identified two causes of diagnostic error, flat tumors and excessive bleeding, but did not systematically analyze error patterns by patient age, cancer stage, histological subtype, or other clinical variables. The authors note that future work with more cases should include subgroup analyses and head-to-head comparisons with hysteroscopic specialists. Without such comparisons, it is unclear whether the system adds diagnostic value beyond what an experienced hysteroscopist already provides.

TL;DR: Key limitations include a small malignant cohort (only 36 cases of cancer and AEH), single-center data from the University of Tokyo, the impracticality of deploying a 72-model ensemble in a real-time clinical device, and the absence of subgroup analyses or head-to-head comparison with expert hysteroscopists. Wide confidence intervals (sensitivity: 77.53%-98.24%) underscore the need for larger validation studies.

8. Future Directions: From Pilot Study to Multi-Institutional Clinical Tool

Establishing a foundation for scale: The authors position this study explicitly as a pilot, designed to determine whether large-scale research is feasible using the algorithms and methods they developed. The continuity analysis technique and the multi-model ensemble approach represent novel contributions that could be applied beyond hysteroscopy to other medical imaging domains where sample sizes are small. The same framework could potentially benefit other rare-disease imaging tasks where data scarcity is the primary bottleneck.

Multi-institutional collaboration: The most immediate next step is expanding the dataset through multi-facility joint research. Because hysteroscopy-based cancer diagnosis is uncommon and difficult to scale at a single institution, pooling data from multiple hospitals is essential. A larger, multi-center dataset would address several current limitations simultaneously: increasing sample size, introducing equipment and technique variability, diversifying patient demographics, and enabling proper external validation. The authors aim to use the system established in this pilot as the foundation for such collaborative studies.

Model compression and real-time deployment: For the system to become a practical clinical tool, the 72-model ensemble needs to be distilled into a compact, efficient architecture suitable for integration into hysteroscopy hardware. Techniques like knowledge distillation, model pruning, and quantization could reduce the computational footprint while preserving most of the ensemble's accuracy. The trade-off between real-time processing during the hysteroscopy procedure versus post-procedure analysis also needs investigation, as the speed requirements for live feedback during an examination are far more demanding than batch processing afterward.
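As a generic illustration of one of those compression techniques (not something the paper implements), a minimal knowledge-distillation loss in the style of Hinton et al. can be written in a few lines; all names and the temperature value are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T spreads probability mass."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """KL divergence between softened teacher and student distributions,
    the core term used to train a compact student to mimic a large
    teacher (here, the teacher could be the 72-model ensemble)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# For a binary Malignant/Others head: identical logits give zero loss;
# disagreement yields a positive penalty.
agree = distillation_loss([2.0, -1.0], [2.0, -1.0])
disagree = distillation_loss([-1.0, 2.0], [2.0, -1.0])
```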

Toward endometrial cancer screening: The broader vision is to establish hysteroscopy combined with AI as a viable screening method for endometrial cancer, filling a critical gap in gynecologic oncology. If the system can be validated at scale and achieves predictive values approaching 100%, it could transform clinical practice by enabling earlier detection, preserving fertility in young patients with early-stage disease, and reducing reliance on invasive biopsy procedures. The current diagnostic accuracy rate of 90.29%, while not yet sufficient for standalone screening, demonstrates that AI-assisted hysteroscopy has real potential to become a clinically useful tool once the training data and model architecture are further optimized.

TL;DR: This pilot establishes the technical foundation for future multi-institutional studies. Key next steps include expanding the dataset through hospital collaborations, compressing the 72-model ensemble into a deployable single model, comparing real-time vs. post-procedure analysis, and conducting subgroup analyses by age, stage, and histology. The ultimate goal is an AI-assisted hysteroscopy system for endometrial cancer screening, addressing a major unmet need in gynecologic oncology.
Citation: Takahashi Y, Sone K, Noda K, et al. Automated system for diagnosing endometrial cancer by adopting deep-learning technology in hysteroscopy. PLOS ONE, 2021. PMC8011803. DOI: 10.1371/journal.pone.0248526. License: CC BY.