Artificial intelligence algorithms for real-time detection of colorectal polyps during colonoscopy: a review


Plain-English Explanations
Page 1
Why Real-Time Polyp Detection Matters for Preventing Colorectal Cancer

The cancer burden: Colorectal cancer (CRC) accounts for approximately 10% of all cancer diagnoses worldwide, ranking as the third most common malignancy and the second leading cause of cancer-related death globally. The five-year survival rate for advanced-stage CRC is a stark 14%, underscoring the life-or-death importance of catching the disease early. Colorectal polyps are precancerous growths that can develop into full-blown cancer over a span of 5 to 10 years, making their timely detection and removal the single most effective prevention strategy available.

The adenoma detection rate problem: The adenoma detection rate (ADR) is the standard quality metric for colonoscopy. Medical evidence shows that each single percentage-point increase in ADR correlates with a 3% to 6% decrease in interval colorectal cancer incidence. Despite its importance, gastroenterologists currently achieve only about 76% accuracy in detecting small polyps (under 1 cm) during real-time optical examination. Small, flat, and subtle polyps are routinely missed, and these missed lesions are a primary driver of cancers that appear between screening intervals.
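
The ADR relationship above is easy to work through numerically. Here is a minimal sketch (the 3%-6% per-point effect sizes come from the text; the 5-point gain is an invented example, and modeling the effect as compounding multiplicatively is our simplifying assumption):

```python
def interval_cancer_reduction(adr_gain_points, effect_per_point):
    """Relative reduction in interval CRC incidence for a given ADR gain.

    Assumes each 1-point ADR increase multiplies incidence by
    (1 - effect_per_point), i.e. the per-point effect compounds.
    """
    return 1.0 - (1.0 - effect_per_point) ** adr_gain_points

# A hypothetical 5-point ADR improvement (e.g. 25% -> 30%):
low = interval_cancer_reduction(5, 0.03)   # ~14% relative reduction
high = interval_cancer_reduction(5, 0.06)  # ~27% relative reduction
```

Even at the conservative end of the published range, a modest ADR gain translates into a double-digit relative reduction in interval cancers.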

Enter computer-aided detection: Artificial intelligence, specifically computer-aided detection (CADe) systems, has emerged as a tool to close this gap. Repici et al. conducted a multicenter randomized controlled study involving 685 subjects and found that colonoscopies augmented with CADe had a significantly higher ADR than the control group. CADe systems analyze endoscopic video in real time, identifying and localizing polyps to aid physician decision-making. Operating at 25 frames per second (FPS) or higher has become a defining benchmark for clinical viability.

TL;DR: CRC is the third most common cancer globally with only 14% five-year survival at advanced stages. Gastroenterologists miss about 24% of small polyps, but CADe systems significantly improve adenoma detection rates, as shown in a 685-subject multicenter RCT.
Pages 1-2
Machine Learning Approaches: Hand-Crafted Features and Their Limits

Feature engineering era: Before deep learning became dominant, traditional machine-learning algorithms relied on hand-crafted descriptors for feature extraction. Researchers manually designed features based on shape, texture, color, and edge characteristics of polyps, then fed these into classifiers such as Support Vector Machines (SVMs) to separate lesions from the background. This approach required significant domain expertise and was inherently limited by what human engineers could design and encode.

Texture-based detection: Ameling et al. analyzed over four hours of high-resolution colonoscopy videos using four texture extraction methods based on grey-level co-occurrence matrices (GLCMs) and local binary patterns (LBPs), achieving an area under the ROC curve of up to 0.96. Iakovidis et al. compared four texture feature extraction methods for gastric polyp detection, finding that color wavelet covariance (CWC) produced the best results with an AUC of 88.6%. Sevo proposed a model using texture analysis for automatic inflammation detection that achieved over 84% accuracy in real time, reaching above 90% on some video segments.
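
To make the texture-feature idea concrete, here is a minimal pure-Python sketch of the 8-neighbor local binary pattern (LBP) descriptor mentioned above; the resulting fixed-length histogram is the kind of vector that was fed to an SVM. This is an illustration of the general technique, not any cited paper's implementation:

```python
def lbp_code(img, r, c):
    """8-neighbour local binary pattern code for pixel (r, c).

    Each neighbour >= centre contributes one bit, clockwise from
    the top-left, giving a 0-255 texture code.
    """
    centre = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalised histogram of LBP codes over interior pixels --
    a fixed-length texture vector suitable for an SVM classifier."""
    hist = [0] * 256
    n = 0
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
            n += 1
    return [h / n for h in hist]
```

A flat region yields code 255 (all neighbors equal the center), while an isolated bright pixel yields code 0, so the histogram separates smooth mucosa from textured lesions.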

Shape and edge approaches: Hwang proposed an ellipse-shape-based polyp detection method using least-squares fitting that ran at 15 FPS. Wang et al. introduced "Polyp-Alert," a software system using edge cross-section visual features and a rule-based classifier that correctly detected 97.7% (42 out of 43) of polyp shots across 53 randomly selected colonoscopy videos at up to 10 FPS. Kominami et al. developed a real-time image recognition system for narrow-band imaging that used SVM classification to achieve 94.9% accuracy at 20 FPS.

Fundamental limitations: Despite occasionally high accuracy numbers, these methods suffered from critical weaknesses: manual feature extraction was time-consuming and lacked robustness, processing speeds were generally below 20 FPS (insufficient for real-time clinical use), and false-positive rates remained high. The inability to generalize across varying polyp appearances, lighting conditions, and patient anatomies made these approaches unsuitable as standalone clinical tools.

TL;DR: Pre-deep-learning methods used hand-crafted features (texture, shape, color, edges) with classifiers like SVMs. While some achieved AUCs up to 0.96, frame rates remained below 20 FPS, false-positive rates were high, and robustness was poor.
Pages 2-4
Deep Learning for Speed: Anchor-Free, Lightweight, and One-Stage Architectures

The 25 FPS threshold: For clinical utility, real-time colonoscopy algorithms must process at least 25 frames per second. The review organizes speed optimization into three main strategies: anchor-free detection, lightweight network architectures, and one-stage detection methods. Each approach reduces computational overhead in different ways while attempting to maintain detection accuracy.
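
The 25 FPS threshold translates directly into a per-frame latency budget of 1000 / 25 = 40 ms. A small sketch of that budget check, with hypothetical stage latencies:

```python
def meets_realtime(latencies_ms, target_fps=25.0):
    """Check whether per-stage latencies fit the real-time frame budget.

    At 25 FPS every frame must be fully processed within
    1000 / 25 = 40 ms, so the stage latencies must sum to less.
    Returns (total latency, budget, whether the budget is met).
    """
    budget_ms = 1000.0 / target_fps
    total = sum(latencies_ms)
    return total, budget_ms, total <= budget_ms

# Hypothetical pipeline: capture 5 ms, inference 28 ms, overlay 4 ms
total, budget, ok = meets_realtime([5.0, 28.0, 4.0])
# 37 ms against a 40 ms budget -> real-time capable
```

Framed this way, the three speed strategies below are all ways of shrinking the inference term until the sum fits inside 40 ms.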

Anchor-free detection: Traditional anchor-based detectors use predefined bounding boxes generated by K-means clustering, adding computational overhead. Anchor-free methods detect targets by directly predicting the center point of objects. Yang et al. proposed YOLO-OB, which employs an ObjectBox detection head with a center-based anchor-free regression strategy, achieving 39 FPS on an RTX 3090. Wang et al. developed AFP-Net (Anchor-Free Polyp Net), which formulates objects as centroids with a context-enhanced module and feature pyramid design, reaching 52.6 FPS.
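
The center-based idea can be sketched in a few lines: instead of matching predefined anchors, the network outputs a confidence at every grid cell plus a size regressed at that cell, and decoding is just thresholding the heatmap. This is a deliberate simplification of the anchor-free family, not YOLO-OB or AFP-Net code:

```python
def decode_center(heatmap, sizes, threshold=0.5):
    """Decode an anchor-free centre heatmap into bounding boxes.

    heatmap: 2-D list of centre confidences in [0, 1].
    sizes:   2-D list of (w, h) predicted at each cell.
    Returns (x1, y1, x2, y2, score) for every cell above threshold,
    treating the cell index as the box centre in grid units.
    """
    boxes = []
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score >= threshold:
                w, h = sizes[y][x]
                boxes.append((x - w / 2, y - h / 2,
                              x + w / 2, y + h / 2, score))
    return boxes
```

Because no anchor boxes are generated or matched, the decoding step is a single pass over the output grid, which is where the speed advantage comes from.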

Lightweight architectures: These networks reduce model parameters and complexity while preserving accuracy. Ou et al. designed Polyp-YOLOv5-Tiny by reducing convolutional kernels and removing the large target detection head from YOLOv5s, achieving a remarkable 113.6 FPS with only a slight loss in accuracy. Yoo et al. proposed YOLOv5-TST, which replaces the CNN neck with a Token-Sharing Transformer (TST) that fuses local and global features through the attention mechanism. This model achieved 138.3 FPS with a precision of 0.9369 on the Kvasir dataset, outperforming Polyp-YOLOv5-Tiny (precision 0.9072) while being even faster.

One-stage detectors: Unlike two-stage algorithms (such as R-CNN and Faster R-CNN) that first generate region proposals and then classify them, one-stage detectors perform classification and localization in a single forward pass. The original YOLO processes images at 45 FPS with 24 convolutional layers and 2 fully connected layers on 448x448 pixel inputs. SSD (Single Shot MultiBox Detector) adds multiple feature layers for multi-scale detection, achieving 59 FPS. Lee et al. validated YOLOv2 on 8,075 images and detected all 38 polyps plus 7 additional ones at 67.16 FPS. Pacal et al. used YOLOv4 with Cross-Stage-Partial connections on Darknet-53 to reach 122 FPS, and with NVIDIA TensorRT optimization, exceeded 250 FPS with under 4 ms latency.
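
One-stage detectors emit many overlapping candidate boxes in a single forward pass, so they rely on a standard post-step, non-maximum suppression (NMS), to keep one box per object. A minimal sketch of the IoU-based greedy NMS used throughout the YOLO/SSD family:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box overlapping it above iou_thresh, repeat.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Two near-duplicate detections of the same polyp collapse to the higher-scoring one, while a detection elsewhere in the frame survives.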

TL;DR: Three strategies boost speed: anchor-free methods (AFP-Net at 52.6 FPS), lightweight networks (YOLOv5-TST at 138.3 FPS with 0.937 precision), and one-stage detectors (YOLOv4 at 122 FPS, exceeding 250 FPS with TensorRT). All surpass the 25 FPS clinical threshold.
Pages 4-6
Ensemble Models and 3D Methods: Pushing Detection Correctness Higher

Ensemble learning strategies: Ensemble learning combines multiple classification models into a single higher-quality classifier, balancing training speed and accuracy by compensating for the weaknesses of any individual model. Zhao et al. developed the Adaptive Small Object Detection Ensemble (ASODE), combining SSD with YOLOv4, and achieved 92.70% adenoma detection accuracy in video analysis. Ma et al. integrated Swin Transformer blocks into a CNN-based YOLOv5m network, improving accuracy by 5.3% over the baseline to reach 83.6% on the CVC-ClinicalVideoDB dataset. Sharma et al. combined ResNet, GoogLeNet, and Xception into a unified model that achieved 98.6% precision and 98.3% accuracy, with the added ability to distinguish cancerous from non-cancerous polyps.
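
One simple way to combine detectors like these is soft voting: average each model's per-frame confidence so that strong models compensate for an individual model's misses. This sketch illustrates the general ensembling idea only; it is not the ASODE or Sharma et al. implementation:

```python
def soft_vote(model_scores, threshold=0.5):
    """Fuse several models' per-frame polyp confidences by averaging.

    model_scores: list of score lists, one inner list per model,
    aligned by frame index. Returns (fused score, detection flag)
    per frame.
    """
    n_models = len(model_scores)
    fused = [sum(frame) / n_models for frame in zip(*model_scores)]
    return [(s, s >= threshold) for s in fused]

# Three hypothetical detectors scoring four frames:
fused = soft_vote([[0.9, 0.2, 0.6, 0.1],
                   [0.8, 0.3, 0.7, 0.2],
                   [0.7, 0.1, 0.8, 0.3]])
```

Frames where the models agree a polyp is present stay above threshold, while a lone spurious score from one model gets averaged away.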

Three-dimensional deep learning: Most existing research uses two-dimensional (2D) analysis of individual frames, but 3D methods capture richer spatial and temporal information from colonoscopy video sequences. Yu et al. proposed a 3D Fully Convolutional Network (3D-FCN) that converts fully connected layers to convolutional layers, reducing redundant computations and encoding spatiotemporal information. Their approach achieved 88.1% precision. Misawa et al. developed a CADe system using a 3D Convolutional Neural Network (3D-CNN) that detected 94% of polyps tested (47 out of 50) and demonstrated superior performance on video datasets compared to other deep learning methods, with high sensitivity regardless of polyp size and morphology.

CNN-based refinements: Urban et al. utilized CNNs for computer-aided image analysis and achieved 96.4% accuracy for polyp identification, processing one frame in 10 ms. A key challenge with single-frame detection is the "jittery effect" where results fluctuate between consecutive frames. Zheng et al. addressed this with OptCNN, combining a real-time trained CNN with a spatial voting algorithm and optical flow model to achieve 84.58% precision. Livovsky et al. proposed DEEP2, a RetinaNet-based system that uses a temporal logic layer to leverage knowledge from previous frames, achieving 99.8% sensitivity for polyps with appearance times exceeding 30 seconds.

TL;DR: Ensemble models (ResNet + GoogLeNet + Xception) achieve up to 98.3% accuracy. 3D-CNNs detect 94% of polyps by leveraging temporal information. RetinaNet-based DEEP2 reaches 99.8% sensitivity for polyps visible longer than 30 seconds.
Pages 6-7
Detecting What Gets Missed: Specialized Methods for Small-Sized Polyps

Why small polyps are hard: Small-sized polyps are the most likely to be overlooked during real-time colonoscopy. Traditional SSD architectures make predictions primarily from shallow features, which lack semantic information, leaving them poorly equipped to capture both local detail and global context. The result is consistently poor performance on small objects, which is clinically significant because these diminutive polyps represent early-stage precancerous lesions whose removal prevents future cancer development.

Multi-scale fusion approaches: Souaidi et al. developed MP-FSSD (Multiscale Pyramidal Fusion Single-Shot Multibox Detector), which introduces an edge pooling layer, a splicing module, and a downsampling block on top of SSD to generate a new pyramid layer. This multi-scale feature fusion enhances the network's ability to detect small targets, achieving a mean average precision (mAP) of 91.56% at a test speed of 62.5 FPS. Fu et al. proposed D2polyp-Net, which uses a double pyramid structure combining shallow spatial information with deep semantic information for improved polyp localization, reaching 80.1% detection precision with particular effectiveness for small-sized polyps.
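
The core move in these pyramid designs is upsampling a deep, semantically rich feature map and merging it with a shallow, spatially precise one. A toy sketch of that fusion step (nearest-neighbor upsampling plus element-wise addition, our simplification rather than the MP-FSSD or D2polyp-Net architecture):

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(shallow, deep):
    """Add an upsampled deep (semantic) map to a shallow (spatial)
    map of twice the resolution -- the essence of pyramid-style
    fusion that helps small-object detection."""
    up = upsample2x(deep)
    return [[s + d for s, d in zip(srow, drow)]
            for srow, drow in zip(shallow, up)]
```

The fused map keeps the shallow layer's fine localization while inheriting the deep layer's evidence of "something lesion-like here," which is exactly what small, low-contrast polyps need.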

Attention mechanisms and focal loss: Wan et al. proposed a YOLOv5 model with a self-attention mechanism that integrates attention into the feature extraction process, enhancing informative feature channels while suppressing irrelevant ones. This yielded high accuracy on small polyps and polyps with low contrast. Livovsky et al.'s RetinaNet-based DEEP2 system leverages Focal Loss, a loss function specifically designed to address category imbalance, which performs especially well in detecting small targets by down-weighting easy negatives and focusing training on hard-to-classify examples.
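
Focal Loss has a standard published form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A minimal binary sketch shows the down-weighting behavior described above:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (polyp) class.
    y: 1 for polyp, 0 for background.
    The (1 - p_t)**gamma factor shrinks the loss on easy,
    well-classified examples so training focuses on hard ones.
    """
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy negative (p=0.1, y=0) contributes far less loss than a
# hard positive (p=0.1, y=1), so rare small polyps dominate training:
easy = focal_loss(0.1, 0)
hard = focal_loss(0.1, 1)
```

With gamma = 0 the expression reduces to alpha-weighted cross-entropy; raising gamma progressively mutes the flood of easy background frames that dominates colonoscopy video.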

TL;DR: Small polyps are the most commonly missed lesions. Multi-scale fusion (MP-FSSD, mAP 91.56% at 62.5 FPS), double pyramid structures (D2polyp-Net, 80.1% precision), and attention mechanisms in YOLOv5 specifically improve small-polyp detection.
Pages 7-8
Reducing False Positives: Post-Processing and Temporal Filtering Techniques

Why post-processing matters: Raw detection outputs from deep learning models inevitably contain false positives that can distract endoscopists and erode trust in AI systems. Post-processing methods improve detection accuracy by eliminating spurious detections and reducing misses. The review highlights several approaches that apply temporal and spatial filtering after the initial neural network prediction to clean up results before displaying them to the clinician.

Median filtering: Lee et al. employed a median filter as a post-processing step to minimize false alarms. The median filter removes impulse noise from the detection signal while retaining edge information. Their algorithm achieved 96.7% and 90.2% sensitivity on two image datasets and 87.7% on a video dataset. Critically, the false-positive rate was reduced from 12.5% to 6.3% after applying the median filter, demonstrating that a simple statistical approach can halve the number of incorrect alerts.
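
The intuition is easy to demonstrate: a sliding median over the per-frame confidence signal removes one-frame spikes (impulse noise) while leaving sustained detections intact. This is a generic sketch of the technique, not Lee et al.'s exact filter:

```python
from statistics import median

def median_filter(signal, window=3):
    """Sliding-window median over a per-frame confidence signal.

    Isolated single-frame spikes (likely false alarms) are removed,
    while detections sustained across several frames survive.
    Window is truncated at the sequence boundaries.
    """
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(median(signal[lo:hi]))
    return out

# A one-frame spike at index 2 disappears; the sustained run at
# indices 4-6 survives:
smoothed = median_filter([0.0, 0.1, 0.9, 0.1, 0.8, 0.9, 0.9, 0.2])
```

Unlike a moving average, the median never invents intermediate values, so a genuine detection is not blurred into its neighbors.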

Target tracking algorithms: Nogueira et al. built a real-time polyp detection model on the pre-trained YOLOv3 architecture with Darknet-53 for improved feature extraction, then reduced false positives through a post-processing step based on a target tracking algorithm, achieving approximately 24 FPS. Krenzer et al. developed ENDOMIND-Advanced, which features a real-time post-processing method based on Robust and Efficient Post-Processing (REPP). This system connects bounding boxes across different frames using linking scores and discards those that fail to meet linking and prediction thresholds. ENDOMIND-Advanced achieved 99.06% precision on the CVC-VideoClinicDB dataset while maintaining real-time detection speeds.
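
The linking idea can be sketched as greedy IoU-based chaining: boxes in consecutive frames are joined when they overlap strongly, and chains too short to be plausible polyps are discarded. This is our simplified illustration of the concept, not the REPP or ENDOMIND-Advanced implementation:

```python
def overlap(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def link_tracks(frames, iou_thresh=0.3, min_len=2):
    """Greedily chain boxes across consecutive frames by IoU.

    frames: list (one entry per frame) of lists of boxes.
    Boxes that never link into a chain of at least min_len frames
    are dropped as probable false positives.
    """
    tracks = [[b] for b in frames[0]] if frames else []
    for boxes in frames[1:]:
        boxes = list(boxes)
        for track in tracks:
            prev = track[-1]
            for b in boxes:
                if overlap(prev, b) >= iou_thresh:
                    track.append(b)
                    boxes.remove(b)
                    break
        tracks.extend([b] for b in boxes)  # unmatched boxes start new tracks
    return [t for t in tracks if len(t) >= min_len]
```

A box that drifts slowly across three frames forms one track and survives, while a detection that appears in a single frame and vanishes is filtered out.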

Spatiotemporal integration: Zhang et al. introduced a post-processing method within the YOLOv4 prediction phase that uses neighboring frames to assess detection accuracy of the current frame and integrates single-frame results with spatiotemporal information. This approach reached 96.1% precision on the CVC-ClinicVideoDB dataset while fulfilling real-time requirements. The consistent theme across these methods is that leveraging temporal continuity in video, rather than treating each frame in isolation, substantially reduces false detections.

TL;DR: Post-processing slashes false positives: median filtering cut the false-positive rate from 12.5% to 6.3%, ENDOMIND-Advanced (REPP-based) achieved 99.06% precision, and spatiotemporal integration in YOLOv4 reached 96.1% precision, all at real-time speeds.
Pages 9-10
Obstacles to Clinical Deployment: Generalization, Data Quality, and Video Gaps

Image versus video performance gap: Among the many models reviewed, sensitivity for image-based analysis is consistently higher than for video analysis. The authors identify two key reasons: first, most studies are trained on static image datasets, resulting in algorithms that do not transfer well to continuous video streams; second, certain frames in real-time colonoscopy videos are of lower quality than curated static images due to motion blur, poor insufflation, and transient occlusion. This gap highlights a fundamental disconnect between how models are trained and how they are actually deployed.

Polyp diversity and generalization: Polyps observed during colonoscopy exhibit enormous variation in size, shape, texture, color, and orientation. This diversity makes it difficult for any single model to generalize across all polyp types. Furthermore, generalizing across different hospitals, endoscopy devices, and patient populations introduces additional variability that can degrade performance. A model trained on data from one institution may fail when deployed at another with different equipment or patient demographics.

Annotation bottleneck: Endoscopic video annotation is time-consuming and error-prone, requiring expert clinicians to manually label large volumes of data. This creates a fundamental bottleneck in training data availability. Semi-supervised learning algorithms, which can learn from a combination of labeled and unlabeled data, are highlighted as a growing trend to address this limitation. Data augmentation techniques such as rotation, scaling, and flipping can also help generate more diverse training examples from limited annotated datasets.
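
Of the augmentations listed, flipping is the simplest to show end to end, because the bounding-box labels must be transformed along with the pixels. A minimal sketch (pure Python, image as a 2-D list of pixel rows, boxes as (x1, y1, x2, y2) in pixel coordinates; these conventions are our assumptions):

```python
def hflip(image, boxes):
    """Horizontally flip an image and its bounding boxes, yielding a
    second labelled training example from one annotation.

    After mirroring, a box's new x1 is width - old x2 (and vice
    versa); the y coordinates are unchanged.
    """
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2)
                 for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes
```

Rotation and scaling follow the same pattern: transform the pixels, then apply the matching coordinate transform to every annotation, so each expert-labelled frame yields several training examples for free.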

Transfer learning as a partial solution: The authors highlight transfer learning, which uses models pre-trained on large-scale general datasets and then fine-tunes them on specific polyp detection data, as a practical strategy to compensate for limited training data. However, even with transfer learning, the challenge of domain shift between general images and endoscopic imagery remains, and careful validation on clinical datasets is necessary before deployment.

TL;DR: Key challenges include an image-to-video performance gap, poor generalization across polyp types and institutions, time-consuming annotation requirements, and limited training datasets. Transfer learning and semi-supervised approaches offer partial solutions.
Pages 10-11
The Road Ahead: From Benchmarks to Bedside Deployment

Bridging the video quality gap: The authors outline several strategies for improving algorithm performance on real-world video data. Exploiting similarity between consecutive frames can improve the quality of low-quality frames when neighboring high-quality frames are available. Video frame interpolation, which uses high-quality reference images to synthesize intermediate frames, can generate high-frame-rate video from low-frame-rate input. Both approaches aim to bring video-based detection performance closer to the levels achieved on curated image datasets.
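
In its simplest form, frame interpolation is a weighted blend of two neighboring frames. Real interpolation methods are motion-aware, but this linear sketch conveys the idea of synthesizing an intermediate frame from two references:

```python
def interpolate(frame_a, frame_b, t=0.5):
    """Linearly blend two frames (2-D lists of pixel intensities)
    to synthesize an intermediate frame at fractional time t.

    t = 0 returns frame_a, t = 1 returns frame_b, t = 0.5 the midpoint.
    """
    return [[(1 - t) * a + t * b for a, b in zip(ra, rb)]
            for ra, rb in zip(frame_a, frame_b)]
```

Applied between high-quality neighboring frames, such synthesis can stand in for a blurred or occluded frame, nudging video performance toward what models achieve on curated stills.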

Architectural evolution: The review documents a clear trajectory from traditional machine learning (below 20 FPS, AUC up to 0.96) through early CNNs (below 28 FPS) to modern architectures like YOLOv5-TST (138.3 FPS) and TensorRT-optimized YOLOv4 (over 250 FPS). This progression demonstrates that the speed problem has been largely solved. The remaining frontier is detection correctness, where ensemble models combining architectures like ResNet, GoogLeNet, and Xception have reached 98.3% accuracy, and temporal post-processing methods like ENDOMIND-Advanced have achieved 99.06% precision.

Semi-supervised and data-efficient learning: With annotation being a major bottleneck, semi-supervised learning algorithms that leverage unlabeled data alongside limited labeled examples represent an increasingly important research direction. Combined with data augmentation and transfer learning from large-scale pre-trained models, these approaches could dramatically reduce the amount of expert annotation required to build clinically viable systems.

Clinical integration outlook: The paper concludes that deep learning algorithms have demonstrated clear superiority over traditional methods in both speed and accuracy for real-time colorectal polyp detection. The next phase requires validating these algorithms in diverse clinical settings across different patient populations, devices, and operator skill levels. Successful integration will depend not only on algorithmic performance but also on practical considerations such as computational requirements, workflow compatibility, and clinician trust in AI-assisted decision-making.

TL;DR: Speed is largely solved (over 250 FPS achievable). Future priorities are closing the image-to-video gap, improving detection correctness beyond 99%, reducing annotation burdens through semi-supervised learning, and validating systems across diverse clinical environments.
Citation: Nie MY, An XW, Xing YC, Wang Z, Wang YQ, Lü JQ. Open access, 2024. Available at PMC11626263. DOI: 10.62347/bziz6358.