Deep Neural Network Models for Colon Cancer Screening

Plain-English Explanations
Pages 1-2
Why Colorectal Cancer Screening Needs Deep Learning

Colorectal cancer is the third most common cancer worldwide and was the second most common cause of cancer-related deaths in 2018. Endoscopic removal of precancerous lesions remains the best way to prevent it, and colonoscopy is the gold standard for screening. However, the rate of missed polyp detection during colonoscopy varies significantly depending on the endoscopist's experience and skill level, which creates a clear opportunity for AI-assisted tools.

Convolutional neural network (CNN)-based architectures have been used extensively to segment and classify colon lesions. Traditional computer-aided diagnostic systems relied on manually set parameters for feature extraction, which introduces bias and inconsistency: these older approaches required hand-crafted features and a separate feature selection module before a neural network could be applied, adding complexity and user dependence to the diagnostic pipeline.

This review, published in Cancers (2022), surveys the latest deep learning approaches for colorectal cancer detection. The authors organize AI methods into five categories: hybrid learning, end-to-end learning, transfer learning, explainable AI (XAI), and sampling methods. They emphasize that while many models achieve high accuracy, the lack of transparency and interpretability in most deep networks is a major barrier to clinical adoption.

TL;DR: Colorectal cancer is the third most common cancer globally. Colonoscopy misses polyps at rates that depend on endoscopist skill, so AI tools based on CNNs are being developed. This review categorizes deep learning methods into five approaches: hybrid, end-to-end, transfer, explainable, and sampling-based learning.
Pages 2-3
Imaging Modalities: From Endoscopic Images to Whole Slide Images

The studies reviewed used a range of imaging inputs. One study worked with 200 normal tissue samples and 200 tumor samples, using 256 x 256 pixel RGB images resized to the network's 224 x 224 input and a sliding-window technique to break them into smaller patches. Others used endoscopic images and whole slide images (WSIs) for colon cancer detection. A key finding was that larger image sizes, such as 768 x 768 pixels, better preserved tissue architecture information, while smaller patch sizes such as 384 x 384 pixels produced similar results at higher computational cost.
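The sliding-window technique mentioned above can be sketched as a simple tiling of patch coordinates. This is a minimal illustration of the idea, not code from any of the reviewed studies; the patch and stride values are placeholders.

```python
def sliding_window_patches(width, height, patch, stride):
    """Return (x, y) top-left corners of patches covering a width x height image.

    Hypothetical helper illustrating the sliding-window patching described
    in the review; patch/stride settings are illustrative, not the studies'.
    """
    coords = []
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            coords.append((x, y))
    return coords

# A 768 x 768 image tiled into non-overlapping 384 x 384 patches -> 4 patches.
corners = sliding_window_patches(768, 768, 384, 384)
```

With an overlapping stride (stride < patch), the same function yields more, partially redundant patches, which is one reason smaller patch sizes raise computational cost.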

For studies using WSIs of cytokeratin immunohistochemistry obtained from a digital slide scanner, images were standardized to 1 micrometer = 1 pixel and saved as non-layered JPEG images. These were then converted into binary images after deletion of non-cancerous areas. Another study utilized an automatic cropping approach that removed black margins and produced a square image with a 1:1 ratio.
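The 1 micrometer = 1 pixel standardization amounts to resampling each scan by the ratio of its native resolution to the target. A minimal sketch, where the 0.25 um/px native scanner resolution is an assumed example value, not taken from the reviewed studies:

```python
def standardized_width(width_px, native_um_per_px, target_um_per_px=1.0):
    """New width (in pixels) after resampling a slide so each pixel covers
    target_um_per_px micrometers, as in the WSI standardization above.

    The native resolution passed in is an assumption for illustration.
    """
    return round(width_px * native_um_per_px / target_um_per_px)

# A 40,000 px wide scan at an assumed 0.25 um/px becomes 10,000 px wide
# once standardized to 1 um = 1 pixel.
new_width = standardized_width(40000, 0.25)
```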

Some studies used H&E-stained histology slides (hematoxylin and eosin staining), which are the standard staining method in pathology for visualizing tissue structure. Others used magnifying narrow-band imaging (M-NBI), a specialized endoscopic technique that enhances surface vessel patterns. The diversity of imaging modalities reflects the multiple points in the clinical workflow where AI can assist, from live endoscopy to post-procedure histopathological analysis.

TL;DR: Studies used diverse imaging inputs including endoscopic images, H&E-stained histology slides, WSIs, and magnifying narrow-band imaging. Image sizes ranged from 224 x 224 to 768 x 768 pixels, with larger patches preserving tissue architecture better but at higher computational cost.
Pages 3-5
Hybrid Learning: Combining Algorithms for Better Colon Cancer Detection

Hybrid learning methods combine multiple algorithms or processes from different domains. Ghosh et al. developed a model combining supervised and unsupervised learning techniques, using K-means clustering, the Girvan-Newman algorithm, and Mahalanobis distance-based clustering, followed by principal component analysis (PCA) for dimensionality reduction. The data was then fed into an artificial neural network (ANN) for classification, achieving the highest classification accuracy of 98.60% among all tested classifiers (range: 88.71% to 98.40%).
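The Mahalanobis distance used in Ghosh et al.'s clustering stage measures how far a point lies from a distribution, scaled by that distribution's covariance. A minimal two-dimensional sketch with made-up numbers (the actual pipeline operates on higher-dimensional gene-expression-style features):

```python
def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution.

    Toy illustration of the distance measure named in the text;
    cov is a 2x2 covariance matrix given as nested tuples.
    """
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # closed-form inverse of the 2x2 covariance matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return q ** 0.5

# With an identity covariance it reduces to Euclidean distance -> 5.0
d = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

A non-identity covariance stretches or shrinks distances along correlated directions, which is what makes the measure useful for cluster membership.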

A modified VGG-based CNN (ConvNet from the Visual Geometry Group) was evaluated in five configurations; the deepest configuration, with the most weight layers, performed best. Yamada et al. trained a Faster R-CNN model first on ImageNet, then on colonoscopy images, achieving a sensitivity of 97.30% and specificity of 99.00% for AI-based detection, compared to 87.40% sensitivity and 96.40% specificity for endoscopists. The AI processing time was 0.022 seconds per image, compared to 2.4 seconds per image for endoscopists.

Ho et al. applied a classical machine learning classifier combined with a Faster R-CNN with ResNet-101 for glandular segmentation and achieved an AUC of 0.917 with 97.4% sensitivity in detecting high-risk features of dysplasia and malignancy. Another hybrid approach using Inception V3 pre-trained on ImageNet, combined with segmentation from digitized H&E-stained histology slides, demonstrated a median accuracy of 99.9% for healthy tissue and 94.8% for cancer slides.

Additional studies showed that a modified ZF-Net achieved 98.0% accuracy, 98.1% sensitivity, and 96.3% specificity for polyp detection with high interpretability through saliency maps. Urban et al. demonstrated real-time polyp detection at 10 ms per frame (roughly 100 frames per second) with 96.4% accuracy and an AUC of 0.991 using a CNN pre-trained on ImageNet. The ARA-CNN (Accurate, Reliable, and Active CNN), a Bayesian deep learning model inspired by ResNet and DarkNet 19, outperformed other models by 18.78% on the same dataset.

TL;DR: Hybrid models combining multiple algorithms achieved top results: 98.60% classification accuracy (Ghosh et al.), 97.30% sensitivity with 0.022s processing (Yamada et al.), AUC of 0.917 (Ho et al.), and 96.4% real-time polyp detection accuracy with AUC 0.991 (Urban et al.).
Pages 5-6
End-to-End Learning: Training Entire Pipelines Without Manual Feature Engineering

End-to-end (e2e) learning trains a complex system by applying gradient-based learning across all stages simultaneously, so the model learns every step between initial input and final output. This approach eliminates the need for manual feature engineering. However, the authors note several limitations, including poor local optima, vanishing gradients, ill-conditioned problems, and slow convergence.
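The core idea of e2e learning, one loss backpropagated through every stage at once, can be shown with a toy two-stage pipeline. This is a minimal sketch with an assumed toy target (y = 6x) and learning rate; it is not any reviewed model, just the gradient-flow mechanism.

```python
def train_end_to_end(steps=500, lr=0.01):
    """Jointly train two pipeline stages (scale, then map to output)
    by backpropagating a single squared-error loss through both.

    Toy data and hyperparameters are illustrative assumptions.
    """
    w1, w2 = 1.0, 1.0          # parameters of stage 1 and stage 2
    x, y = 1.0, 6.0            # single toy training example: target y = 6x
    for _ in range(steps):
        h = w1 * x             # stage 1 output
        y_hat = w2 * h         # stage 2 output
        err = y_hat - y
        # gradients flow from the final loss back through every stage
        grad_w2 = 2 * err * h
        grad_w1 = 2 * err * w2 * x
        w2 -= lr * grad_w2
        w1 -= lr * grad_w1
    return w1, w2

w1, w2 = train_end_to_end()   # the product w1 * w2 converges toward 6
```

Neither stage is ever trained against a hand-designed intermediate target; the pipeline as a whole is fitted to the final output, which is exactly what removes manual feature engineering.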

Buendgens et al. applied e2e learning to a non-annotated routine database without manual labels, using a ResNet-18 model with image preprocessing performed in MATLAB R2021a. The model diagnosed inflammatory, degenerative, infectious, and neoplastic diseases from raw gastroscopy and colonoscopy images. The AUC for diagnosing 13 diseases ranged from 0.7 to 0.8, and the model predicted colorectal cancer with an AUC greater than 0.76. This demonstrates the potential of weakly supervised AI, which does not require extensive manual annotations.

Iizuka et al. trained CNNs and recurrent neural networks (RNNs), including Inception v3, to classify WSIs of biopsy specimens from the stomach and colon into adenocarcinoma, adenoma, and non-neoplastic tissue. When tested on datasets from The Cancer Genome Atlas (TCGA) with a mix of formalin-fixed paraffin-embedded (FFPE) and flash-frozen tissues, the model achieved high AUC values of 0.96 to 0.99 for adenoma detection, despite being largely trained on biopsies. Pinckaers and Litjens developed U-Node (a U-Net variant with ordinary differential equation blocks), which used fewer parameters and improved gland segmentation compared to baseline U-Net.

TL;DR: End-to-end learning removes manual feature engineering. A weakly supervised ResNet-18 model predicted colorectal cancer with AUC greater than 0.76 without manual annotations. Iizuka et al. achieved AUC values of 0.96 to 0.99 for adenoma detection using Inception v3 and RNNs on TCGA WSI data.
Pages 6-8
Transfer Learning: Leveraging Pre-Trained Models for Colorectal Cancer Tasks

Transfer learning transfers knowledge from a model trained on one problem to another related problem. It offers key advantages: a better starting model, higher accuracy, and faster training compared to training from scratch. The review identifies two main approaches: (1) using a pre-trained model and adapting its features to a target task, and (2) developing a new model from scratch for knowledge transfer and training it with available data.
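The first approach, adapting a pre-trained model, typically means freezing the early pre-trained layers and training only the task-specific head; fine-tuning unfreezes more of the network. A framework-agnostic sketch of that partial-freezing idea (real frameworks expose an equivalent per-layer trainability flag; the layer names here are hypothetical):

```python
class Layer:
    """Stand-in for a network layer with a trainability flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_backbone(layers, n_frozen):
    """Partial freezing: keep the first n_frozen pre-trained layers fixed
    and fine-tune only the remaining (task-specific) layers.

    Illustrative structure only, not a specific framework's API.
    """
    for layer in layers[:n_frozen]:
        layer.trainable = False
    return [l.name for l in layers if l.trainable]

# Assumed toy architecture: four conv layers plus a classifier head.
model = [Layer(f"conv{i}") for i in range(1, 5)] + [Layer("fc")]
trainable = freeze_backbone(model, 4)   # only the classifier head trains
```

Setting `n_frozen=0` corresponds to full fine-tuning, and freezing everything but the head corresponds to pure feature extraction, the two regimes compared by Gessert et al. below.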

Hamida et al. tested several CNN architectures for patch-level classification of colon cancer WSIs, including AlexNet, VGG, ResNet, DenseNet, and Inception. After fine-tuning, these models achieved accuracy rates of 89.42% (AlexNet), 95.25% (ResNet), 96.98% (DenseNet), 95.86% (VGG), and 92.43% (Inception). Notably, ResNet presented the highest accuracy for classification, while SegNet outperformed U-Net for pixel-wise segmentation, though SegNet had a higher computational cost. Models trained from scratch showed low accuracy, while fine-tuning produced the best performance overall.

Kather et al. trained VGG19, AlexNet, SqueezeNet 1.1, GoogLeNet, and ResNet50 to identify tissue types in histological images of colorectal cancer, including non-tumorous tissue. VGG19 performed best, recreated morphological features from the datasets, visualized tissue structures via the DeepDream approach, and showed classification accuracy on par with human vision. A modified VGG-based CNN accurately classified 294 out of 309 normal tissue images and 667 out of 719 tumor tissue images.

Gessert et al. compared learning from scratch, partial freezing, and fine-tuning strategies in VGG16 and Inception v3 models on small datasets. Training from scratch performed extremely poorly, while partial freezing and fine-tuning showed no significant differences, though the optimal strategy differed across models and tasks. In contrast, Malik et al. found that training a CNN from scratch yielded 94.5% detection accuracy, 3.85% higher than the best-performing transfer-learned CNN, with specificity 16.81% higher than that of the other models.

TL;DR: Transfer learning generally outperforms training from scratch. Fine-tuned DenseNet achieved 96.98% accuracy on WSI classification. VGG19 performed best for tissue type identification. However, results vary by model and task, with one study finding training from scratch achieved 94.5% accuracy, outperforming transfer-learned alternatives.
Pages 8-9
Explainable AI: Making Deep Learning Decisions Transparent for Clinicians

Explainable AI (XAI) involves methods that allow users to understand the results produced by machine learning models, their impacts, and potential biases. While most colorectal cancer models focus on accuracy, very few provide evidence for which factors contribute to their decision outcomes. Korbar et al. developed a deep ResNet-101 visualization network for colorectal polyp detection that could project classification back to the input pixel space, indicating which parts of the input H&E-stained WSI were key to the classification decision.

Sabol et al. developed the Explainable Cumulative Fuzzy Class Membership Criterion (X-CFCMC) model, which complemented its classification decisions with three types of information: visualization of the most important regions, visualization of unwanted regions, and semantic explanation of possibilities. When pathologists evaluated this model against a plain CNN model, they preferred the X-CFCMC model because it was more useful and reliable for clinical decision-making.

Hägele et al. used layer-wise relevance propagation (LRP) with a GoogLeNet model from the Caffe Model Zoo to identify tumor entities. The resulting explainable heat maps assisted in detecting biases that could affect model generalization, including biases affecting the entire dataset, biases correlated to specific class labels by chance, and sampling biases. These visualizations provided transparency that is essential for clinical trust.
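The basic LRP rule can be shown for a single linear unit: each input receives relevance in proportion to its contribution z_i = x_i * w_i to the output. This is a toy sketch of the LRP-0 rule with made-up numbers; real LRP propagates relevance layer by layer through the whole network.

```python
def lrp_linear(x, w, eps=1e-9):
    """LRP-0 rule for one linear unit: relevance of input i is its
    share z_i = x_i * w_i of the unit's total pre-activation.

    Toy illustration of the principle behind LRP heat maps.
    """
    z = [xi * wi for xi, wi in zip(x, w)]
    total = sum(z) + eps
    return [zi / total for zi in z]

# contributions z = [0.5, 0.5, 0.0] -> relevances [0.5, 0.5, 0.0]
relevance = lrp_linear([1.0, 2.0, 0.0], [0.5, 0.25, 0.9])
```

Relevance is conservative (it sums to 1 here), so mapping it back to input pixels yields a heat map whose mass accounts for the full prediction, which is what makes dataset biases visible.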

Yao et al. developed Deep Attention Multiple Instance Survival Learning (DeepAttnMISL) with a pretrained VGG model using ImageNet features and K-means clustering to aggregate interpretable features of colorectal cancer patterns. This approach was found to be more effective for large datasets and showed better interpretability in locating important patterns that contributed to accurate survival prediction in cancer patients.

TL;DR: Explainable AI methods like ResNet-101 visualization, X-CFCMC (preferred by pathologists over plain CNNs), layer-wise relevance propagation heat maps, and attention-based survival models provide transparency in classification decisions, which is critical for clinical trust and adoption.
Pages 9-10
Handling Imbalanced Data: Sampling Strategies for Reliable Cancer Diagnosis

Data imbalance poses a significant challenge for deep learning in cancer diagnosis. When trained on imbalanced data, AI systems become biased toward majority classes, producing high-precision but low-recall predictions. This is especially dangerous in cancer diagnosis because false negatives (missed cancers) are more clinically important than false positives. Most AI techniques assume balanced class distributions, and their overall performance degrades substantially when this assumption is violated.
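The high-precision, low-recall failure mode described above is easy to demonstrate. In this sketch the class counts and predictions are invented for illustration: a majority-biased model on a 95:5 dataset finds only one of five cancers, yet its precision still looks perfect.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive/cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Assumed toy data: 95 negatives, 5 positives; the model flags only 1 positive.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4
precision, recall = precision_recall(y_true, y_pred)  # 1.0 precision, 0.2 recall
```

Every flagged case is correct (precision 1.0), but four of five cancers are missed (recall 0.2), exactly the clinically dangerous false-negative pattern the review warns about.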

Koziarski et al. addressed this by using oversampling techniques in the image space to expand training data for a MobileNet CNN, combined with sampling in the feature space to fine-tune the network's final layers. Their study revealed that higher levels of class imbalance significantly degraded classification performance, and that data imbalance itself, not just reduced training sample size, was a primary driver of performance decline.

Hong et al. developed a novel algorithmic-level loss function combining cross-entropy with asymmetric loss in EfficientNet B4, EfficientNet B5, and U-Net models. This approach classified each pixel individually by comparing class predictions, producing a better balance between precision and recall for colon cancer polyp segmentation. Shapcott et al. used systematic random sampling and adaptive sampling in a CNN architecture trained on 142 H&E-stained, 40x magnification colorectal cancer images from the TCGA COAD dataset, achieving significant improvements in diagnostic performance.
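One plausible shape for an asymmetric per-pixel loss is cross-entropy with a heavier weight on the positive (polyp) class, so that missing a polyp pixel costs more than a false alarm. This is a hedged sketch of the general idea only; Hong et al.'s exact formulation differs, and the weights here are assumed values.

```python
import math

def asymmetric_ce(p, y, w_pos=2.0, w_neg=1.0, eps=1e-12):
    """Cross-entropy with asymmetric class weights: errors on polyp
    pixels (y = 1) are penalized w_pos/w_neg times more heavily.

    Illustrative only, not Hong et al.'s published loss.
    """
    if y == 1:
        return -w_pos * math.log(p + eps)
    return -w_neg * math.log(1.0 - p + eps)

def mask_loss(probs, labels):
    """Average per-pixel loss over a predicted segmentation mask."""
    return sum(asymmetric_ce(p, y) for p, y in zip(probs, labels)) / len(labels)
```

At equal confidence (p = 0.5), a polyp pixel contributes twice the loss of a background pixel, nudging the optimizer toward higher recall on the minority class.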

TL;DR: Class imbalance causes dangerous false negatives in cancer AI. Solutions include oversampling (MobileNet), asymmetric loss functions (EfficientNet + U-Net for polyp segmentation), and adaptive sampling strategies on H&E-stained images. Higher imbalance levels directly degrade classification accuracy.
Pages 10-12
Current Limitations and the Path Toward Clinically Deployable AI

The review identifies several critical limitations across the surveyed deep learning models. Most AI models for predicting invasive cancer are prone to over-detection, generating false positives that increase clinician workload. The user-dependent and non-transparent nature of complex deep networks does not provide appropriate evidence for the key factors used in classification, which is the primary reason for the slow adoption of these techniques in clinical practice.

Studies used relatively small datasets that limit generalization. For example, the modified VGG model was trained on only 200 normal and 200 tumor samples, and several transfer learning studies acknowledged that small dataset sizes constrained their findings. The review notes that weak learning procedures do not always provide clear diagnostic support, and nontransparent high prediction accuracy in complex architectures is insufficient without understanding why a specific output was produced.

The authors emphasize that it is imperative for the connection between features and predictions to be comprehensible. If an AI algorithm contributes to a clinical decision, clinicians must understand how the output was characterized. The review found that only a small number of studies employed explainable or transparent approaches, despite their importance for clinical trust. The explainable models that do exist, such as X-CFCMC and gradient-based visualization, were strongly preferred by pathologists in evaluation studies.

The authors propose that AI using visualization methods for classification outcomes could significantly reduce clinician burden and improve diagnostic accuracy. They recommend developing cost-effective AI frameworks that combine high accuracy with interpretability, addressing data imbalance through advanced sampling, and validating models on larger, more diverse datasets to ensure generalization across clinical settings. Supporting evidence for AI-based diagnosis remains strongly required for practice-level validation.

TL;DR: Most deep learning models achieve high accuracy but lack transparency, which slows clinical adoption. Over-detection, small datasets, and non-interpretable architectures remain key barriers. The authors advocate for explainable AI with visualization methods, larger validation datasets, and cost-effective diagnostic frameworks.
Citation: Kavitha MS, Gangadaran P, Jackson A, Venmathi Maran BA, Kurita T, Ahn BC. Cancers, 2022. Open access (CC BY). DOI: 10.3390/cancers14153707. Available at PMC9367621.