Convolutional neural networks for the detection of malignant melanoma in dermoscopy images

Computers in Biology and Medicine, 2021

Plain-English Explanations
Page 1
Why Automated Melanoma Detection in Dermoscopy Matters

Malignant melanoma is one of the most aggressive skin cancers, and early detection is the single most important factor in patient survival. The 5-year survival rate for patients with stage IA melanoma (tumors less than 1 mm thick, as measured by Breslow thickness) exceeds 90%. For tumors thicker than 4 mm, outcomes are far worse. Despite advances in anti-cancer immunotherapies that have improved survival for patients with advanced melanoma, a large number of patients still die from the disease. Many early melanomas are asymptomatic and do not arouse suspicion, even though they are visible to the naked eye.

The dermoscopy challenge: Dermoscopy is a non-invasive imaging technique that can aid early melanoma detection, but its accuracy depends heavily on the physician's skill and experience. This creates a natural bottleneck: not all clinicians have the training to reliably distinguish early melanoma from benign pigmented lesions. Convolutional neural networks (CNNs) have demonstrated strong performance in computer vision tasks including classification and object detection, sometimes even outperforming humans. This study set out to evaluate whether CNNs could serve as a reliable aid in classifying malignant melanoma from dermoscopy images.

Scope of the study: The authors trained and compared four CNN architectures on the publicly available HAM10000 dataset (part of the ISIC 2018 challenge), which contains 10,015 dermoscopy images spanning seven categories of skin lesions. All diagnoses in the dataset were confirmed by histological examination. The study measured precision, sensitivity, F1 score, specificity, and area under the receiver operating characteristic curve (AUC) for each model, and also tested whether combining all four models into an ensemble could improve melanoma classification further.

TL;DR: Stage IA melanoma has over 90% five-year survival, but late detection is common. Dermoscopy accuracy depends on physician skill. This study trained four CNN architectures on 10,015 histologically confirmed dermoscopy images from HAM10000 to evaluate automated melanoma classification.
Page 2
The HAM10000 Dataset: 10,015 Images Across Seven Lesion Types

The dataset was extracted from the "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection" grand challenge. It contained 10,015 dermoscopy images distributed across seven diagnostic categories: melanoma (1,113 images, 11.1%), melanocytic nevus (6,705 images, 66.9%), basal cell carcinoma (514, 5.1%), actinic keratosis/Bowen's disease (327, 3.3%), benign keratosis (1,099, 11.0%), dermatofibroma (115, 1.1%), and vascular lesion (142, 1.4%). The dataset is notably imbalanced, with melanocytic nevi accounting for roughly two-thirds of all images.

Data splitting: The authors used unique identifiers provided with the HAM10000 dataset to split images into training (8,123 images), validation (886 images), and test (1,006 images) sets. This identifier-based splitting was critical for avoiding training-test data leakage, where the same lesion photographed at different angles or times could appear in both training and test sets, artificially inflating performance metrics. The proportional distribution of each lesion type was maintained across all three splits.
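Identifier-based splitting can be sketched with scikit-learn's GroupShuffleSplit, which guarantees that all images sharing a lesion identifier land on the same side of each split. This is a minimal illustration of the idea, not the authors' code; the synthetic `lesion_ids` array stands in for HAM10000's lesion-ID column, and the split fractions are illustrative.

```python
# Leakage-free splitting: group images by lesion identifier so the same
# lesion never appears in both training and test sets (illustrative sketch).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 50
lesion_ids = rng.integers(0, 20, size=n_images)  # several images per lesion
labels = rng.integers(0, 7, size=n_images)       # 7 lesion categories

# First carve off ~10% of lesions for the test set, then ~10% for validation.
gss = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
trainval_idx, test_idx = next(gss.split(np.zeros(n_images), labels, groups=lesion_ids))
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=1)
tr, va = next(gss2.split(trainval_idx, groups=lesion_ids[trainval_idx]))
train_idx, val_idx = trainval_idx[tr], trainval_idx[va]

# No lesion appears on both sides of any split boundary.
assert not set(lesion_ids[train_idx]) & set(lesion_ids[test_idx])
assert not set(lesion_ids[train_idx]) & set(lesion_ids[val_idx])
```

A plain random split over images would scatter repeat photographs of one lesion across train and test, which is exactly the leakage the paragraph above describes.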

Data augmentation: Because most images had lesions centered in the frame, the authors applied random rotation of up to 180 degrees during training, then center-cropped images from 600 x 450 pixels to 300 x 400 pixels to remove black borders and background skin. All images were then resized to 224 x 224 pixels, the standard input size for the CNN architectures used. To address the class imbalance problem, a weighted cross-entropy loss function was employed, with each class weighted by the inverse of its cardinality (image count) in the training set. This ensured the network did not simply learn to predict the majority class (melanocytic nevus) for every input.

TL;DR: HAM10000 contains 10,015 dermoscopy images across 7 categories, with melanocytic nevi comprising 66.9% of the data. Split into 8,123 training, 886 validation, and 1,006 test images using unique identifiers to prevent data leakage. Weighted cross-entropy loss addressed class imbalance, and augmentation included 180-degree rotation and center cropping to 300 x 400 before resizing to 224 x 224.
Pages 2-3
Four CNN Architectures: ResNet-101, ResNeXt, SE-ResNet, and SE-ResNeXt

The study compared four deep CNN architectures, all based on the ResNet-101 backbone. ResNet-101 consists of 33 residual blocks and 100 convolution operations total. The key innovation of residual networks is the shortcut (skip) connection, which adds the input of a residual block directly to its output. This identity mapping enables training of much deeper networks by mitigating the vanishing gradient problem. After each convolutional layer, the authors inserted batch normalization followed by a rectified linear unit (ReLU) activation function, a standard configuration for improving training speed and stability.

ResNeXt: This variant introduces a new hyperparameter called cardinality, implemented through grouped convolution. Instead of a single large convolution operation, the input is divided into 32 groups, and 32 smaller convolutions are performed in parallel. This architecture captures richer feature representations without significantly increasing computational cost, as the grouped convolution is equivalent in parameters to the standard approach but learns more diverse filters.

Squeeze-and-Excitation (SE) networks: SE-ResNet and SE-ResNeXt add an additional block to the standard residual or ResNeXt architecture. The SE block performs global average pooling, passes the result through two fully connected layers with a sigmoid activation, and then scales the original feature maps by the resulting channel-wise weights. This mechanism allows the network to recalibrate its feature responses, increasing sensitivity to the most descriptive features while suppressing less useful ones. The SE block adds minimal computational overhead but has been shown to improve classification accuracy across many benchmarks.
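The SE block described above is compact enough to show in full. This is the standard PyTorch formulation (squeeze via global average pooling, two fully connected layers with a ReLU between them and a sigmoid gate at the end), not the authors' code; the reduction ratio of 16 is the usual default, assumed here.

```python
# Minimal squeeze-and-excitation block: pool, two FC layers, sigmoid gating,
# then channel-wise rescaling of the original feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # squeeze
        self.fc2 = nn.Linear(channels // reduction, channels)  # excite
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))          # global average pool -> (b, c)
        s = self.relu(self.fc1(s))
        w = torch.sigmoid(self.fc2(s))  # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)   # recalibrate the feature maps

feats = torch.randn(2, 64, 28, 28)
out = SEBlock(64)(feats)
assert out.shape == feats.shape
```

Because the output has the same shape as the input, the block drops into any residual or ResNeXt branch without altering the rest of the architecture, which is why its overhead is so small.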

Transfer learning: All four models were pre-trained on ImageNet, a large-scale dataset with 1,000 object classes. The final fully connected layer (1,000 output nodes) was replaced with a new layer containing 7 output neurons corresponding to the 7 skin lesion categories, followed by a softmax activation function. Models were trained with the Adam optimizer for up to 20 epochs, with early stopping applied when overfitting was detected on the validation set.

TL;DR: Four architectures tested: ResNet-101 (33 residual blocks, 100 convolutions), ResNeXt (32-group cardinality), SE-ResNet, and SE-ResNeXt (channel recalibration via squeeze-and-excitation blocks). All pre-trained on ImageNet, fine-tuned with Adam optimizer for up to 20 epochs with early stopping. Final layer replaced with 7-class softmax output.
Pages 3-4
How Performance Was Measured: Precision, Sensitivity, Specificity, F1, and AUC

The authors evaluated each CNN on the held-out test dataset of 1,006 images using five standard classification metrics. Precision measures the proportion of predicted melanomas that were actually melanoma (minimizing false positives). Sensitivity (recall) measures the proportion of actual melanomas correctly identified (minimizing false negatives). Specificity measures how well the model identifies non-melanoma cases. The F1 score is the harmonic mean of precision and sensitivity, calculated as 2TP / (2TP + FP + FN), providing a single balanced metric. Area under the receiver operating characteristic curve (AUC) quantifies overall discriminative ability across all classification thresholds.
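The four count-based metrics follow directly from the confusion-matrix entries for the class treated as "positive". A from-scratch sketch (AUC is omitted because it needs per-image scores, not just counts); the counts below are illustrative, chosen so the results land near ResNeXt's melanoma row, and are not taken from the paper.

```python
# Precision, sensitivity, specificity, and F1 from confusion-matrix counts.
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)        # predicted positives that are real
    sensitivity = tp / (tp + fn)      # real positives that are found
    specificity = tn / (tn + fp)      # real negatives correctly rejected
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision/sensitivity
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Illustrative counts (hypothetical, not from the paper):
m = binary_metrics(tp=80, fp=24, fn=31, tn=871)
assert round(m["precision"], 2) == 0.77
assert round(m["sensitivity"], 2) == 0.72
assert round(m["f1"], 2) == 0.74
assert round(m["specificity"], 2) == 0.97
```

Note how specificity can be very high while sensitivity lags: with a rare positive class, the true-negative count dwarfs the false positives, which is exactly the pattern seen in the melanoma results below.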

Per-class and average results: All four CNNs achieved their highest precision, sensitivity, and F1 scores on melanocytic nevi, which is expected given that this class comprised 66.9% of the training data. Conversely, specificity for melanocytic nevus detection was the lowest among all classes across every CNN, reflecting the challenge of distinguishing the majority class from visually similar minority classes. For melanoma specifically, ResNeXt achieved the best individual-model results: 0.77 precision, 0.72 sensitivity, 0.74 F1 score, 0.97 specificity, and 0.95 AUC.

Model ranking by average metrics: ResNeXt emerged as the top-performing single model with average precision of 0.88, sensitivity of 0.83, F1 score of 0.85, specificity of 0.99, and AUC of 0.99 across all seven lesion types. ResNet followed closely, matching ResNeXt on average sensitivity (0.83) but trailing slightly on precision (0.85). SE-ResNet achieved 0.84 average precision and 0.79 sensitivity, while SE-ResNeXt scored lowest with 0.82 average precision and 0.78 sensitivity. The SE variants, despite their theoretical advantages in feature recalibration, did not outperform the simpler architectures on this dataset.

TL;DR: ResNeXt was the best single model: 0.88 average precision, 0.83 sensitivity, 0.85 F1, 0.99 specificity, 0.99 AUC. For melanoma specifically, ResNeXt achieved 0.77 precision, 0.72 sensitivity, 0.74 F1, 0.97 specificity, and 0.95 AUC. SE variants underperformed their non-SE counterparts on this dataset.
Pages 4-5
Combining All Four CNNs Into an Ensemble Improved Melanoma Detection

After evaluating each model individually, the authors combined all four CNNs into a stacked ensemble. The ensemble's output was computed by averaging the prediction probabilities from all four networks. This approach leverages the complementary strengths of different architectures: where one model misclassifies an image, others may classify it correctly, and averaging smooths out individual errors. The ensemble method required no additional training, only inference through all four models followed by a simple averaging step.
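Probability averaging is simple enough to show end to end. The numbers below are made up for illustration: model 2 misclassifies the image on its own, but the averaged ensemble still gets it right, which is the error-smoothing behavior described above.

```python
# Probability-averaging ensemble: mean of per-model softmax outputs,
# then argmax over classes (illustrative numbers, not the paper's outputs).
import numpy as np

# (n_models, n_images, n_classes) softmax outputs from four models,
# reduced to 3 classes here for readability.
probs = np.array([
    [[0.70, 0.25, 0.05]],   # model 1 predicts class 0
    [[0.30, 0.60, 0.10]],   # model 2 predicts class 1 (the lone error)
    [[0.50, 0.45, 0.05]],   # model 3 predicts class 0
    [[0.48, 0.47, 0.05]],   # model 4 predicts class 0
])
ensemble = probs.mean(axis=0)   # average over models
pred = ensemble.argmax(axis=1)  # ensemble prediction: class 0
```

Unlike majority voting, averaging also uses each model's confidence: a single very confident model can still sway the ensemble, while lukewarm errors like model 2's get diluted.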

Ensemble vs. best single model: For melanoma classification specifically, the ensemble surpassed ResNeXt (the best individual model) in precision, F1 score, and specificity. However, sensitivity remained similar to ResNeXt at 0.72. The normalized confusion matrix for the ensemble revealed that 72% of melanoma test images were correctly classified, while 19% were misclassified as melanocytic nevi, the most common error type. For melanocytic nevi, the ensemble achieved 96% correct classification. Basal cell carcinoma and vascular lesion classes also showed strong performance.

Error analysis: A detailed examination of ResNeXt's melanoma misclassifications showed that 28% of melanoma images were incorrectly classified, with half of those errors also made by the other three CNNs and the other half correctly classified by at least one other network. The most frequent misclassification pattern was melanoma predicted as melanocytic nevus or benign keratosis. This finding highlights both the difficulty of distinguishing melanoma from visually similar benign lesions and the value of ensemble methods in capturing predictions that individual models miss.

The confusion matrix also showed that dermatofibroma had a notable false-negative rate, with 18% of dermatofibroma images misclassified as melanocytic nevi and 9% as melanoma. Actinic keratosis had 73% correct classification but 17% were confused with benign keratosis, reflecting the visual overlap between these keratotic lesion types. These cross-class confusion patterns align with known clinical challenges in dermoscopy, where even experienced dermatologists struggle with the same differential diagnoses.

TL;DR: The ensemble of all four CNNs outperformed ResNeXt in melanoma precision, F1, and specificity, though sensitivity stayed at 0.72. The confusion matrix showed 72% correct melanoma classification, with 19% misclassified as melanocytic nevi. ResNeXt misclassified 28% of melanoma images; half of those errors were shared by all four CNNs, while the other half were correctly handled by at least one other network in the ensemble.
Pages 5-6
Grad-CAM Heatmaps Reveal What the CNN Focuses On

To address the "black box" problem inherent in deep learning, the authors applied Gradient-weighted Class Activation Mapping (Grad-CAM) to the best-performing single model, ResNeXt. Grad-CAM produces a heatmap overlay on the original image, where red regions indicate the areas most important for the model's prediction and light blue regions indicate the least important areas. This provides clinicians with a visual explanation of why the network classified a lesion the way it did, rather than simply outputting a label with no reasoning.
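The Grad-CAM recipe can be sketched compactly: gradients of the target-class score with respect to the last convolutional feature maps are global-average-pooled into channel weights, and the weighted sum of those maps, passed through a ReLU, gives the coarse heatmap. This is the standard formulation in PyTorch, not the authors' code, and the tiny toy network below stands in for ResNeXt.

```python
# Minimal Grad-CAM: hook a conv layer, backprop the class score, and turn
# pooled gradients into a channel-weighted heatmap (standard recipe sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, conv_layer: nn.Module,
             image: torch.Tensor, target_class: int) -> torch.Tensor:
    feats, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    try:
        score = model(image)[0, target_class]
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel importance
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # (1, H, W) heatmap
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]

# Toy CNN standing in for ResNeXt; 7 outputs match the lesion categories.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 7))
cam = grad_cam(net, net[0], torch.randn(1, 3, 224, 224), target_class=0)
assert cam.shape == (1, 224, 224)
```

The resulting map is then upsampled to the input resolution and overlaid on the dermoscopy image as the red-to-blue heatmap described above.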

Correctly classified examples: For correctly identified melanocytic nevi, the Grad-CAM heatmaps showed that the CNN focused primarily on the lesion itself, with high activation concentrated on the pigmented area and low activation on the surrounding normal skin. Similarly, for correctly classified melanomas, the model's attention was centered on the lesion with particular emphasis on regions showing irregular features. The Guided Grad-CAM visualizations, which show fine-grained feature details, confirmed that the CNN was responding to clinically meaningful texture and color patterns within the lesions.

Misclassification insights: When ResNeXt misclassified a melanoma as a melanocytic nevus, the Grad-CAM heatmaps suggested that the spatial size of the lesion and its overall color had a strong influence on the incorrect assignment. The authors also demonstrated that image acquisition factors, specifically zoom level, lighting conditions, and camera angle, significantly affected classification accuracy. In one illustrative example, three dermoscopy images of the same melanoma taken at different angles and zoom levels yielded different predictions: only one was correctly classified as melanoma, while the others were misidentified as melanocytic nevus and vascular lesion. This finding underscores the sensitivity of CNNs to image quality and standardization.

The interpretability provided by Grad-CAM is clinically significant because it allows dermatologists to assess whether the CNN is making predictions based on relevant dermoscopic features or on artifacts like image borders, lighting gradients, or background skin. This transparency could increase clinician trust and adoption of AI-assisted diagnostic tools in dermatology practice.

TL;DR: Grad-CAM heatmaps showed ResNeXt focused on the lesion itself for correct predictions. Misclassifications were linked to lesion size, color similarity to nevi, and image acquisition variables (zoom, lighting, angle). The same melanoma photographed at three different angles yielded three different predictions, highlighting CNN sensitivity to image standardization.
Pages 6-7
Class Imbalance, Dataset Size, and Image Quality Constraints

Class imbalance: The most significant limitation was the severe class imbalance in the HAM10000 dataset. Melanocytic nevi represented 66.9% of images, while dermatofibroma (1.1%) and vascular lesion (1.4%) were extreme minorities. Although the authors mitigated this with weighted cross-entropy loss, the imbalance still biased all four CNNs toward predicting the majority class. This is directly reflected in the results: melanocytic nevus achieved the highest sensitivity across all models, while minority classes like dermatofibroma and actinic keratosis had markedly lower sensitivity (as low as 0.53 for actinic keratosis with SE-ResNeXt).

Dataset size and scope: The HAM10000 dataset, while valuable as a publicly available benchmark, is relatively small compared to datasets commonly used in general computer vision (ImageNet has over 14 million images, MS COCO has 330,000). With only 1,113 melanoma images for training and testing combined, the models had limited examples to learn the full spectrum of melanoma presentations. The authors note that gathering more images could help overcome issues related to camera angle, zoom, and lighting variability, and could decrease the bias toward overrepresented classes.

Image acquisition variability: As demonstrated by the Grad-CAM analysis, differences in dermoscopy image capture (camera angle, zoom, lighting) directly affected classification outcomes. The dataset did not control for these variables, and the augmentation strategy (rotation and center crop) addressed only a subset of possible image variations. Real-world dermoscopy images exhibit even greater variability depending on the device manufacturer, imaging protocol, and clinical setting. This limits the generalizability of the reported performance metrics to clinical practice.

No external validation: The study evaluated all models on a single test split from the same HAM10000 dataset. No external validation was performed on independent datasets from different institutions, devices, or patient populations. Without multi-center external validation, it is impossible to confirm that the reported precision, sensitivity, and specificity values would hold in diverse clinical settings. The absence of prospective clinical testing further limits the translational relevance of the findings.

TL;DR: Key limitations include severe class imbalance (66.9% nevi vs. 1.1% dermatofibroma), a relatively small dataset (10,015 images, only 1,113 melanomas), uncontrolled image acquisition variability (zoom, angle, lighting), and no external or multi-center validation. Actinic keratosis sensitivity dropped as low as 0.53 in SE-ResNeXt due to underrepresentation.
Pages 7-9
CNN-Assisted Dermoscopy: Clinical Potential and Next Steps

The authors contextualize their findings against a landmark study by Haenssle et al. (2018), in which a CNN achieved 0.86 AUC for melanoma recognition on dermoscopy images, compared to 0.79 for 58 dermatologists (including 30 experts) who had access only to dermoscopic images (p < 0.01). Even when dermatologists were given additional clinical information (age, sex, body site, close-up images) in a second-level study, their mean AUC rose only to 0.82, still below the CNN. This evidence supports the position that even experienced clinicians can benefit from CNN-based decision support in melanoma diagnosis.

Multimodal approaches: Yap et al. (2018) demonstrated that combining a CNN trained on dermoscopic images with another CNN trained on macroscopic (clinical) images improved classification accuracy from 0.707 (dermoscopy alone) to 0.721. While macroscopic CNN accuracy alone was lower at 0.647, the fusion provided complementary information. This multimodal strategy represents a promising direction, though it requires paired macroscopic and dermoscopic image datasets that are not yet widely available.

The black box problem: The authors raise an important concern about clinical adoption: deep learning models have low interpretability, and clinicians may be reluctant to trust recommendations from opaque algorithms. The Grad-CAM visualization used in this study offers a partial solution by showing which regions of the image drove the classification decision. The authors suggest that integrating such explainability methods into clinical AI tools could give dermatologists meaningful insight into the model's reasoning, improving trust and usability. However, Grad-CAM provides only a coarse spatial explanation and does not fully capture the complex feature interactions learned by deep networks.

Future directions: Several concrete next steps emerge from this work. First, larger and more balanced datasets are needed, both to improve CNN performance on underrepresented lesion types and to enable robust external validation. Second, standardization of dermoscopy image acquisition (consistent zoom, lighting, and angle protocols) could significantly reduce classification variability. Third, ensemble methods and multimodal fusion (combining dermoscopic, macroscopic, and clinical data) should be explored further, as they consistently outperform single-model approaches. Fourth, prospective clinical trials comparing CNN-assisted diagnosis to standard dermatologist assessment are essential before any AI tool can be recommended for routine clinical use. The authors conclude that CNNs should serve as a supportive tool for clinicians, not a replacement for skilled dermatologists.

TL;DR: Prior work showed CNNs achieved 0.86 AUC vs. 0.79 for 58 dermatologists (p < 0.01). Multimodal fusion (dermoscopy + macroscopy) improved accuracy from 0.707 to 0.721. Next steps include larger balanced datasets, standardized image capture protocols, multimodal ensemble methods, and prospective clinical trials. CNNs should augment, not replace, dermatologist expertise.
Citation: Kwiatkowska D, Kluska P, Reich A. Convolutional neural networks for the detection of malignant melanoma in dermoscopy images. 2021. Open access: PMC8330874. DOI: 10.5114/ada.2021.107927. License: CC BY-NC-ND.