Melanoma Detection Using Deep Learning-Based Classifications


Plain-English Explanations
Pages 1-2
Why Automated Melanoma Detection Matters

Skin cancer is one of the most prevalent cancers worldwide, and its incidence continues to rise as populations age. Among skin cancer subtypes, melanoma is particularly dangerous because malignant melanocyte cells proliferate, invade, and spread rapidly. Basal cell carcinoma (BCC) accounts for roughly 70% of skin cancer cases, while melanoma represents about 10% and squamous cell carcinoma (SCC) about 17%. Despite being less common than BCC, melanoma carries far higher mortality risk, making early detection critical to patient survival.

Clinical diagnostic methods: Dermatologists traditionally rely on dermoscopy and epiluminescence microscopy (ELM) to distinguish benign from malignant lesions. Established rule-based frameworks include the ABCD rule (Asymmetry, Border irregularity, Color variation, and Diameter; the ABCDE variant adds Evolution), the 7-point checklist, and pattern analysis. Non-professional dermoscopic images have a predictive value of only 75% to 80% for melanoma, and interpretation is highly subjective, varying significantly with dermatologist experience.

The case for deep learning: Computer-Aided Diagnosis (CAD) systems have advanced considerably thanks to deep learning. In rural areas, dermatologists and diagnostic labs are in short supply, making automated screening tools especially valuable. This study proposes a pipeline that combines Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) for image preprocessing, segmentation for isolating regions of interest (ROI), data augmentation to correct class imbalance, and classification via a custom CNN and a modified ResNet-50. The system is tested on the HAM10000 dataset, which contains 10,015 dermoscopic images spanning seven diagnostic categories.

TL;DR: Melanoma accounts for about 10% of skin cancers but carries high mortality. Manual dermoscopic interpretation has only 75-80% predictive value and is subjective. This study proposes a deep learning pipeline (ESRGAN + segmentation + CNN/modified ResNet-50) tested on the 10,015-image HAM10000 dataset, aiming to automate and improve skin lesion classification across 7 categories.
Pages 3-4
Prior Deep Learning Approaches to Skin Cancer Classification

The authors survey a range of prior studies to contextualize their approach. Haenssle et al. tested a Google Inception V4 deep learning model against 58 dermatologists using 100 patient images, finding competitive performance. Brinker et al. (Albahar, 2019) developed an improved deep learning model and compared it to diagnoses from 145 dermatologists across 12 German hospitals. Li et al. reviewed CNN models and reported that residual learning and separable convolution achieved up to 99.5% accuracy, though only on binary classification problems (malignant vs. benign).

Multi-class approaches: Pacheco et al. built a smartphone app for skin lesion identification using clinical data from 1,641 patients across 6 cancer types. They compared a three-layer CNN, GoogleNet, ResNet, VGGNet, and MobileNet, achieving 0.69 accuracy with images alone and 0.764 when clinical metadata was added. Kadampur and Riyaee used the HAM10000 dataset with several deep learning models and achieved an AUC of 0.99 for malignant vs. benign classification. Jinnai et al. used two CNN models that outperformed dermatologists in classification accuracy.

Transfer learning and segmentation: Kassem et al. applied a GoogleNet pre-trained model for 8 skin cancer categories and achieved 0.949 accuracy. Panja et al. used CNN-based feature extraction after image segmentation for melanoma vs. benign classification. ResNet-50 was used in the ISIC 2019 challenge for 8-class classification using transfer training, while other researchers explored MASK-RCNN, DenseNet121, and AlexNet on the HAM10000 dataset. Across these studies, dataset sizes ranged from 300 to 10,015 images, and the number of classification categories ranged from 2 to 12.

TL;DR: Prior work includes Inception V4 vs. 58 dermatologists, CNN models achieving 99.5% binary accuracy, smartphone apps reaching 0.764 accuracy with clinical data, and GoogleNet achieving 0.949 on 8-class problems. Most used HAM10000 or ISIC datasets with 2-12 classes and architectures like ResNet-50, DenseNet121, MobileNet, and AlexNet.
Pages 5-6
The HAM10000 Dataset and Its Class Imbalance Problem

The study uses the HAM10000 (Human Against Machine with 10,000 training images) dataset, a publicly available benchmark licensed under CC-BY-NC-SA-4.0 and sourced from Kaggle's Imaging Archive. It contains 10,015 JPEG dermoscopic images compiled from two clinical sites: one in Vienna, Austria, and one in Queensland, Australia. The Austrian site had images dating back to the pre-digital-camera era, preserved in multiple formats, while the Australian site stored images in PowerPoint files and Excel databases.

Seven diagnostic categories: The dataset covers actinic keratosis (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (vasc). The class distribution is severely imbalanced: melanocytic nevi dominate at 67% of all images, while dermatofibroma accounts for only 1%, vascular lesions for 2%, and akiec for 3%. Melanoma and bkl each represent about 11%, and bcc about 5%.

Train/test split: The data was divided into 90% training (9,016 images) and 10% testing (984 images), with 10% of the training set (992 images) reserved for validation. The testing set reflected the original class distribution, meaning the nv class had 795 test images while df had only 7 and vasc had only 10. This extreme imbalance in both training and testing is a central challenge the paper attempts to address through augmentation.

TL;DR: HAM10000 contains 10,015 dermoscopic images across 7 skin lesion categories from two clinical sites. Class imbalance is severe: nv at 67% vs. df at 1%. The split is 90/10 train/test (9,016/984 images), with 992 images for validation. The testing set has 795 nv images but only 7 df and 10 vasc images.
Pages 6-7
ESRGAN Image Enhancement and Segmentation

The first step in the pipeline is image quality improvement using Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN). ESRGAN is an improved version of SRGAN that removes batch normalization layers, which allows the generated images to better preserve sharp edges and fine details in lesion borders. Instead of a standard discriminator that simply judges whether an image is real or generated, ESRGAN uses a relativistic discriminator, which estimates how much more realistic a real image is than a generated one. The loss function combines this relativistic average loss with a pixelwise absolute-difference (L1) term.

Two-stage training: The generator is trained in two phases. First, the pixelwise L1 distance between source and target high-resolution images is minimized to avoid local minima. Second, fine details and small artifacts are refined through adversarial training. The final model interpolates between the adversarially trained weights and the L1-optimized weights for photo-realistic reconstruction. An adaptive contrast enhancement technique further improves border visibility and image contrast in the lesion regions.
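The final weight interpolation described above can be sketched directly: the deployed generator's parameters are a per-parameter blend of the L1-optimized (PSNR-oriented) weights and the adversarially trained weights. A minimal NumPy sketch, where the blending factor `alpha` is an assumption (the paper does not report its value):

```python
import numpy as np

def interpolate_weights(psnr_weights, gan_weights, alpha=0.8):
    """Blend two generators' parameters:
    theta = (1 - alpha) * theta_PSNR + alpha * theta_GAN.
    alpha = 0 keeps the L1-optimized model, alpha = 1 the adversarial one;
    intermediate values trade artifact suppression against fine detail."""
    return {name: (1.0 - alpha) * psnr_weights[name] + alpha * gan_weights[name]
            for name in psnr_weights}

# Toy example with a single parameter tensor per "model"
psnr = {"conv1": np.zeros((3, 3))}
gan = {"conv1": np.ones((3, 3))}
blended = interpolate_weights(psnr, gan, alpha=0.8)
```

Interpolating in weight space (rather than averaging output images) is what lets a single model produce photo-realistic reconstructions without the artifacts of a purely adversarial generator.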

Segmentation: After ESRGAN preprocessing, regions of interest (ROI) are extracted from each dermoscopic image. The HAM10000 dataset provides ground-truth segmentation masks, which are applied to the enhanced images to isolate the lesion area from surrounding healthy skin, producing segmented ROIs that serve as input to the classification models. All images are then resized to 224 x 224 x 3 pixels and normalized.
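The mask-then-resize step can be illustrated with a small NumPy sketch. The nearest-neighbor resize and the [0, 1] scaling are assumptions for illustration only; the paper does not specify its resize filter or exact normalization:

```python
import numpy as np

def apply_mask_and_normalize(image, mask, size=224):
    """Isolate the lesion ROI with a binary mask, nearest-neighbor resize
    to size x size, and scale pixel values to [0, 1]."""
    roi = image * mask[..., None]          # zero out non-lesion pixels
    h, w = roi.shape[:2]
    rows = np.arange(size) * h // size     # nearest-neighbor index maps
    cols = np.arange(size) * w // size
    resized = roi[rows][:, cols]
    return resized.astype(np.float32) / 255.0

img = np.full((450, 600, 3), 128, dtype=np.uint8)    # dummy dermoscopic image
msk = np.zeros((450, 600), dtype=np.uint8)
msk[100:300, 150:450] = 1                            # dummy lesion mask
x = apply_mask_and_normalize(img, msk)
```

The key point is the ordering: masking happens on the enhanced full-resolution image, and only the masked result is resized and normalized for the classifiers.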

TL;DR: ESRGAN enhances image quality by removing batch normalization and using a Relativistic Discriminator with two-stage training (L1 loss then adversarial refinement). Ground truth masks from HAM10000 segment ROIs from enhanced images. All images are resized to 224 x 224 x 3 and normalized before classification.
Pages 8-9
Data Augmentation to Address Class Imbalance

To address the severe class imbalance in the HAM10000 dataset, the authors applied oversampling through data augmentation before training. The augmentation transforms exploit an inherent property of dermatological images: horizontal flips, vertical flips, rotations, and magnification do not alter the diagnostic meaning. Specific transforms include horizontal shift augmentation (moving pixels horizontally by a fraction between 0 and 1 of the image width), random rotation between 0 and 180 degrees, a zoom range of 0.1, and a rescale factor of 1.0/255. The recommended input size was 224 x 224 x 3.
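These label-preserving transforms can be sketched without a deep learning framework. This is an illustrative NumPy version, not the authors' code: arbitrary-angle rotation is approximated here with 90-degree steps to stay dependency-free, and the horizontal shift wraps around rather than padding:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Apply one random combination of label-preserving transforms,
    echoing the paper's flip / rotation / shift / rescale scheme."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                   # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                   # vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotation, 90-degree steps
    shift = int(rng.random() * out.shape[1] * 0.1)  # up to 10% of the width
    out = np.roll(out, shift, axis=1)          # horizontal shift (wrap-around)
    return out.astype(np.float32) / 255.0      # rescale by 1/255

batch = [augment(np.full((224, 224, 3), 255, dtype=np.uint8)) for _ in range(4)]
```

Each augmented copy counts as a new training sample for the minority classes, which is how the oversampling rebalances the label distribution.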

Before augmentation: The original training distribution was dominated by melanocytic nevi (nv) at 67% of all images, with dermatofibroma (df) at just 1%, vascular lesions (vasc) at 2%, akiec at 3%, bcc at 5%, and melanoma (mel) and benign keratosis (bkl) each around 11%. This distribution means a naive classifier could achieve 67% accuracy simply by predicting nv for every image.

After augmentation: The oversampling strategy expanded the training set from 9,016 to 39,430 images, with each class now roughly balanced: akiec at 5,684 (15%), bcc at 5,668 (14%), mel at 5,886 (15%), vasc at 5,570 (14%), nv at 5,979 (15%), df at 4,747 (12%), and bkl at 5,896 (15%). This represents a roughly 4.4x increase in total training data. The augmented segmented images were included as part of the data expansion process.

TL;DR: Oversampling augmentation expanded training data from 9,016 to 39,430 images (4.4x increase). Class distribution shifted from 67% nv dominance to roughly balanced 12-15% per class. Transforms included rotation (0-180 degrees), horizontal shifts, zoom (0.1 range), and rescaling (1.0/255).
Pages 10-11
Custom CNN and Modified ResNet-50 Architectures

Proposed CNN: The custom CNN architecture has four main layers plus an output layer. Each layer consists of three convolution sublayers: the first two use a kernel size of 3 with stride 1, and the third uses a kernel size of 5 with stride 2. All layers use the ReLU activation function. Three max-pooling layers with pool size 3 and stride 1 are used for down-sampling. The fully connected layer at the end acts as a standard multilayer perceptron that outputs the 7-class classification probabilities. Segmented ROIs feed directly into the first convolution layer.
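The down-sampling behavior of one main layer can be traced with the standard convolution output-size formula, out = (in - k) / s + 1. Padding is not reported in the summary, so 'valid' (no padding) is assumed in this sketch:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a convolution or pooling layer
    with no padding ('valid')."""
    return (size - kernel) // stride + 1

# Trace the spatial size through one main layer of the proposed CNN:
# two 3x3 / stride-1 convolutions, then one 5x5 / stride-2 convolution.
size = 224
for kernel, stride in [(3, 1), (3, 1), (5, 2)]:
    size = conv_out(size, kernel, stride)
# size is now 108: the stride-2 sublayer does most of the down-sampling
```

This shows why each main layer's third sublayer uses stride 2: the 3x3 stride-1 convolutions barely shrink the feature map, while the 5x5 stride-2 convolution roughly halves it.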

Modified ResNet-50: The second model builds on the pre-trained ResNet-50 architecture, originally developed by He et al. in 2015 and winner of the LSVRC2015 competition with a sub-3.6% error rate on ImageNet. ResNet's key innovation is skip connections that allow input signals to pass through residual units, enabling gradient flow in very deep networks. The authors modified ResNet-50 by removing the original fully connected (FC) and softmax layers and replacing them with a new FC layer of size 512, followed by another FC layer of size 3, and a new softmax layer. The first two layers were pre-trained on ImageNet, with additional layer weights initialized randomly.
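The replacement head can be sketched as two matrix multiplications followed by a softmax. Several details here are assumptions: the 2,048-dimensional pooled features come from standard ResNet-50, the ReLU in the first new FC layer is not stated in the summary, and the final layer is given 7 units to match the seven diagnostic categories (the text reports a size-3 FC layer, which would not fit a 7-class output):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(features, w1, b1, w2, b2):
    """New classification head: FC-512 (with assumed ReLU), then a final
    FC layer whose logits are turned into class probabilities by softmax."""
    h = np.maximum(features @ w1 + b1, 0.0)   # FC-512 + ReLU
    return softmax(h @ w2 + b2)               # final FC + softmax

rng = np.random.default_rng(0)
feats = rng.normal(size=(1, 2048))                        # pooled backbone features
w1, b1 = rng.normal(size=(2048, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.normal(size=(512, 7)) * 0.01, np.zeros(7)    # 7 lesion classes
probs = head(feats, w1, b1, w2, b2)
```

Only these new head weights start from random initialization; the convolutional backbone keeps its ImageNet pre-training.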

Training strategy: Batch normalization was applied to combat overfitting, which is especially problematic when the training dataset is small. The authors also employed a "many-runs ensemble" approach, running multiple training iterations of the same architecture and retaining only the best-performing run. This accounts for the inherent variability in DNN results caused by random weight initialization. The new fully connected layers feed the softmax classifier, which computes a probability distribution across the n output classes.
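The many-runs ensemble reduces to a simple keep-the-best loop. `train_once` below is a hypothetical stand-in for a full training run that returns a model and its validation accuracy:

```python
def many_runs_ensemble(train_once, n_runs=5):
    """Train the same architecture n_runs times and keep only the
    best-performing run, absorbing the result variability caused by
    random weight initialization."""
    best_model, best_acc = None, -1.0
    for _ in range(n_runs):
        model, acc = train_once()
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc

# Toy stand-in: each "training run" yields a different validation accuracy.
accs = iter([0.81, 0.86, 0.84, 0.79, 0.85])
model, acc = many_runs_ensemble(lambda: ("weights", next(accs)), n_runs=5)
```

Despite the name, this is selection rather than a conventional ensemble: only one model survives, so inference cost stays the same as a single network.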

TL;DR: The custom CNN uses four main layers with 3x3 and 5x5 kernels, ReLU activation, and max-pooling. The modified ResNet-50 replaces the original FC/softmax layers with FC-512, FC-3, and a new softmax layer, pre-trained on ImageNet. Both models use batch normalization and a many-runs ensemble strategy to combat overfitting and random initialization variability.
Pages 12-15
Classification Performance and Comparative Analysis

Training configuration: All experiments ran on a Linux PC with an RTX 3060 GPU and 8 GB of RAM using TensorFlow Keras. The Adam optimizer was used with a learning-rate schedule that reduces the rate when validation performance stagnates. Hyperparameters included batch sizes from 2 to 64 (doubling each step), 50 training epochs, a patience of 10, momentum of 0.9, and candidate learning rates of 1 x 10^-4, 1 x 10^-5, and 1 x 10^-6 for the CNN and ResNet-50. Images were scaled to 227 x 227 x 3. ResNet-50 was further fine-tuned by freezing varying numbers of layers.
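The stagnation-triggered schedule can be sketched in a few lines. The reduction factor of 0.1 is an assumption (the summary only says the rate is reduced on stagnation); the patience of 10 matches the reported hyperparameters:

```python
def reduce_on_plateau(val_losses, lr=1e-4, patience=10, factor=0.1):
    """Reduce the learning rate by `factor` whenever the validation loss
    fails to improve for `patience` consecutive epochs; return the rate
    in effect at each epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:          # plateau: cut the rate
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# One initial improvement followed by 11 stagnating epochs triggers one cut.
hist = reduce_on_plateau([1.0] + [1.0] * 11, lr=1e-4, patience=10)
```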

CNN results: The proposed CNN achieved 86% accuracy (0.8598), 0.84 precision, 0.86 recall, and an F-score of 0.8598. Top-2 accuracy reached 94%, and top-3 accuracy reached 97.26%. In per-class performance, the nv class had the highest recall at 97% (770 out of 790 correctly classified), with precision of 91% and F-score of 94%. The melanoma class performed poorly, with precision of 0.39, recall of 0.27, and F-score of just 0.32. Dermatofibroma was the weakest class overall, with precision of 0.25, recall of 0.14, and F-score of 0.18.
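The reported per-class numbers are internally consistent and can be checked with the standard F-score formula, F = 2PR / (P + R):

```python
def f_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Melanoma (CNN): precision 0.39, recall 0.27 -> F about 0.32, as reported.
mel_f = f_score(0.39, 0.27)

# nv recall from the reported counts: 770 of 790 correctly classified.
nv_recall = 770 / 790
```

The harmonic mean explains why melanoma scores so poorly overall: F1 is dragged toward the worse of the two components, so the 0.27 recall dominates.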

Modified ResNet-50 results: The modified ResNet-50 achieved 85.3% accuracy (0.8526), 0.86 precision, 0.85 recall, and 0.85 F-score. Top-2 accuracy was 93.29%, and top-3 accuracy was 96.95%. For nv, precision, recall, and F-score were all 94% (749 out of 790 correctly classified). The vascular class showed perfect precision (1.00) but only 0.70 recall. Melanoma had 0.37 precision, 0.59 recall, and 0.45 F-score, slightly better than the CNN for melanoma detection.

Comparison with prior methods: The proposed CNN at 86% outperformed RegNetY-3.2GF (85.8%), AlexNet (84%), MobileNet (83.9%), standalone CNN and ResNet-50 from prior references (77% and 78%), ensemble methods using SVM/RF/AdaBoost (74.75%), and MobileNet+LSTM (85%). All comparisons used the HAM10000 dataset. The authors attribute their improvement to three factors: ESRGAN preprocessing for general resolution enhancement, the use of diverse architectures with different generalization capabilities, and fine-tuning of network weights.

TL;DR: The proposed CNN achieved 86% accuracy (precision 0.84, recall 0.86, F-score 0.86), while modified ResNet-50 reached 85.3% accuracy. Top-2 accuracy hit 94% and 93.3%, respectively. Nv class was classified at 97% recall (CNN) and 94% (ResNet-50), but melanoma recall was only 0.27 (CNN) and 0.59 (ResNet-50). The 86% accuracy outperformed prior methods including AlexNet (84%), MobileNet (83.9%), and ensemble approaches (74.75%).
Pages 15-16
Key Limitations and Paths Forward

Melanoma detection gap: Despite the headline 86% overall accuracy, the model's performance on the most clinically critical class, melanoma, was concerning. The CNN achieved only 0.27 recall and 0.39 precision for melanoma, meaning roughly 73% of melanoma cases would be missed. The modified ResNet-50 was somewhat better at 0.59 recall, but a melanoma miss rate of 41% is far from clinically acceptable. This weakness likely stems from the visual similarity between melanoma and other pigmented lesions, combined with the relatively small number of melanoma test images (41 out of 984).

Single-dataset limitation: All training and testing was conducted on the HAM10000 dataset from just two clinical sites (Vienna and Queensland). There was no external validation on independent datasets such as ISIC 2018, ISIC 2019, PH-2, or PAD-UFES-20. Without multi-site external validation, the model's generalizability to different patient populations, camera equipment, lighting conditions, and skin tones remains unproven. The authors themselves note that experiments on larger and more complicated datasets, including future cancer cases, are needed to demonstrate efficacy.

Confounding factors: The study did not account for the fact that lesion-less skin and lesioned skin are not always caused by skin cancer. Non-cancerous skin conditions can mimic cancerous lesions visually, which is a known confounding factor in clinical diagnosis. The authors acknowledge this and plan to incorporate non-cancerous conditions into future datasets to test the model's ability to distinguish cancer from benign mimics.

Future work: The authors propose evaluating additional architectures such as DenseNet, VGG, and AlexNet on the cancer dataset. They also suggest expanding the dataset to include non-cancerous skin conditions that visually resemble malignant lesions. Transfer learning from natural images showed limited benefit for medical imaging because features learned from natural images lack semantic relevance to dermoscopic patterns. The observation that CNN outperformed ResNet-50 on medical images suggests that shallower, more generalizable networks may be more suitable for this domain than very deep architectures trained primarily on natural image datasets like ImageNet.

TL;DR: Melanoma recall was only 0.27 (CNN) and 0.59 (ResNet-50), far below clinical requirements. The study was limited to a single dataset (HAM10000) from two sites with no external validation. Non-cancerous skin conditions were not included. Future work includes testing DenseNet/VGG/AlexNet, adding non-cancerous mimics to the dataset, and addressing the finding that shallower CNNs outperformed deeper ResNet-50 on medical images.
Citation: Alwakid G, Gouda W, Humayun M, Sama NU. Melanoma Detection Using Deep Learning-Based Classifications. Open access, 2022. Available at: PMC9777935. DOI: 10.3390/healthcare10122481. License: CC BY.