Malignant melanoma is one of the most aggressive skin cancers, and early detection is the single most important factor in patient survival. The 5-year survival rate for patients with stage IA melanoma (tumors less than 1 mm thick, as measured by Breslow thickness) exceeds 90%; for tumors thicker than 4 mm, outcomes are far worse. Despite advances in immunotherapies that have improved survival for patients with advanced melanoma, many patients still die from the disease. Many early melanomas are asymptomatic and arouse no suspicion, even though they are visible to the naked eye.
The dermoscopy challenge: Dermoscopy is a non-invasive imaging technique that can aid early melanoma detection, but its accuracy depends heavily on the physician's skill and experience. This creates a natural bottleneck: not all clinicians have the training to reliably distinguish early melanoma from benign pigmented lesions. Convolutional neural networks (CNNs) have demonstrated strong performance in computer vision tasks including classification and object detection, sometimes even outperforming humans. This study set out to evaluate whether CNNs could serve as a reliable aid in classifying malignant melanoma from dermoscopy images.
Scope of the study: The authors trained and compared four CNN architectures on the publicly available HAM10000 dataset (part of the ISIC 2018 challenge), which contains 10,015 dermoscopy images spanning seven categories of skin lesions. All diagnoses in the dataset were confirmed by histological examination. The study measured precision, sensitivity, F1 score, specificity, and area under the receiver operating characteristic curve (AUC) for each model, and also tested whether combining all four models into an ensemble could improve melanoma classification further.
The dataset was extracted from the "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection" grand challenge. It contained 10,015 dermoscopy images distributed across seven diagnostic categories: melanoma (1,113 images, 11.1%), melanocytic nevus (6,705, 66.9%), basal cell carcinoma (514, 5.1%), actinic keratosis/Bowen's disease (327, 3.3%), benign keratosis (1,099, 11.0%), dermatofibroma (115, 1.1%), and vascular lesion (142, 1.4%). The dataset is notably imbalanced, with melanocytic nevi accounting for roughly two-thirds of all images.
Data splitting: The authors used unique identifiers provided with the HAM10000 dataset to split images into training (8,123 images), validation (886 images), and test (1,006 images) sets. This identifier-based splitting was critical for avoiding training-test data leakage, where the same lesion photographed at different angles or times could appear in both training and test sets, artificially inflating performance metrics. The proportional distribution of each lesion type was maintained across all three splits.
Data augmentation: Because most images had lesions centered in the frame, the authors applied random rotation (up to 180 degrees) during training, then center-cropped the 600 x 450 pixel images to 300 x 400 pixels to remove black borders and background skin. All images were then resized to 224 x 224 pixels, the standard input size for the CNN architectures used. To address the class imbalance, a weighted cross-entropy loss function was employed, with each class weight set to the inverse of that class's cardinality (its frequency in the training set). This ensured the network did not simply learn to predict the majority class (melanocytic nevus) for every input.
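The inverse-frequency weighting can be sketched in plain Python. The class counts below are the full-dataset counts reported above (the exact training-split counts differ slightly), and the single-sample loss function is a minimal illustration, not the authors' training code:

```python
import math

# Full-dataset class counts (illustrative; the true weights would use
# the training-split counts).
counts = {
    "melanoma": 1113, "nevus": 6705, "bcc": 514, "akiec": 327,
    "bkl": 1099, "df": 115, "vasc": 142,
}

# Inverse-frequency weights: rare classes get proportionally large weights.
total = sum(counts.values())
weights = {c: total / n for c, n in counts.items()}

def weighted_cross_entropy(probs, target, weights):
    """Weighted CE for a single sample: -w[target] * log p(target)."""
    return -weights[target] * math.log(probs[target])

# An error on rare dermatofibroma costs ~58x more than one on nevus,
# so the network cannot profit from always predicting the majority class.
loss = weighted_cross_entropy({"df": 0.9}, "df", weights)
```

In a framework such as PyTorch, the same weights would simply be passed as the `weight` argument of the cross-entropy loss rather than applied by hand.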
The study compared four deep CNN architectures, all based on the ResNet-101 backbone. ResNet-101 consists of 33 residual blocks and 100 convolution operations total. The key innovation of residual networks is the shortcut (skip) connection, which adds the input of a residual block directly to its output. This identity mapping enables training of much deeper networks by mitigating the vanishing gradient problem. After each convolutional layer, the authors inserted batch normalization followed by a rectified linear unit (ReLU) activation function, a standard configuration for improving training speed and stability.
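The shortcut connection can be illustrated with a toy NumPy sketch in which dense layers stand in for the convolutions of a real bottleneck block (an assumption made purely for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): the shortcut adds the block's input to its
    output. Dense layers stand in for the convolutions of ResNet-101."""
    h = relu(w1 @ x)        # first transform + ReLU
    f = w2 @ h              # second transform (same width as x)
    return relu(f + x)      # identity shortcut, then activation

d = 8
x = rng.normal(size=d)
w1, w2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
y = residual_block(x, w1, w2)
# Even with w1 = w2 = 0 the block reduces to relu(x): the gradient can
# always flow through the shortcut, which mitigates vanishing gradients.
```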
ResNeXt: This variant introduces a new hyperparameter called cardinality, implemented through grouped convolution. Instead of a single large convolution operation, the input channels are divided into 32 groups, and 32 smaller convolutions are performed in parallel. Because the block widths are chosen so that total parameter count and computational cost remain comparable to the standard ResNet block, the architecture captures richer feature representations, learning a more diverse set of filters at essentially no extra cost.
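The parameter accounting behind grouped convolution is simple arithmetic. The sketch below uses hypothetical channel counts (128 in/out, 3x3 kernels), not the actual ResNeXt-101 layer sizes:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution: each of the c_out filters sees
    only c_in/groups input channels when the convolution is grouped."""
    return c_out * (c_in // groups) * k * k

# A hypothetical 3x3 layer with 128 input and output channels.
standard = conv_params(128, 128, 3)            # one big convolution
grouped = conv_params(128, 128, 3, groups=32)  # 32 parallel small ones

# Grouped convolution uses 1/32 of the weights for the same channel
# counts; ResNeXt spends the savings on wider blocks so the total cost
# matches a plain ResNet of the same depth.
```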
Squeeze-and-Excitation (SE) networks: SE-ResNet and SE-ResNeXt add an additional block to the standard residual or ResNeXt architecture. The SE block performs global average pooling over each feature map, passes the result through two fully connected layers (a ReLU after the first, a sigmoid after the second), and then scales the original feature maps by the resulting channel-wise weights. This mechanism allows the network to recalibrate its feature responses, increasing sensitivity to the most descriptive features while suppressing less useful ones. The SE block adds minimal computational overhead but has been shown to improve classification accuracy across many benchmarks.
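The squeeze-excite-rescale pipeline fits in a few lines of NumPy. The channel count, spatial size, and reduction ratio below are arbitrary choices for illustration, and the random weights stand in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def se_block(feature_maps, w1, w2):
    """Squeeze-and-Excitation: squeeze (global average pool per channel),
    excite (FC -> ReLU -> FC -> sigmoid), then rescale each channel."""
    z = feature_maps.mean(axis=(1, 2))            # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)                   # FC + ReLU (bottleneck)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))           # FC + sigmoid: (C,)
    return feature_maps * s[:, None, None]        # channel-wise rescaling

c, h, w, r = 16, 7, 7, 4                          # r = reduction ratio
x = rng.normal(size=(c, h, w))
w1 = rng.normal(size=(c // r, c))                 # squeeze to C/r units
w2 = rng.normal(size=(c, c // r))                 # restore to C units
y = se_block(x, w1, w2)                           # same shape as x
```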
Transfer learning: All four models were pre-trained on ImageNet, a large-scale dataset with 1,000 object classes. The final fully connected layer (1,000 output nodes) was replaced with a new layer containing 7 output neurons corresponding to the 7 skin lesion categories, followed by a softmax activation function. Models were trained with the Adam optimizer for up to 20 epochs, with early stopping applied when overfitting was detected on the validation set.
The authors evaluated each CNN on the held-out test dataset of 1,006 images using five standard classification metrics. Precision measures the proportion of predicted melanomas that were actually melanoma (minimizing false positives). Sensitivity (recall) measures the proportion of actual melanomas correctly identified (minimizing false negatives). Specificity measures how well the model identifies non-melanoma cases. The F1 score is the harmonic mean of precision and sensitivity, calculated as 2TP / (2TP + FP + FN), providing a single balanced metric. Area under the receiver operating characteristic curve (AUC) quantifies overall discriminative ability across all classification thresholds.
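These one-vs-rest definitions reduce to simple ratios over the four confusion-matrix counts. The melanoma counts below are hypothetical (they sum to a 1,006-image test set but are not the study's actual confusion matrix):

```python
def binary_metrics(tp, fp, fn, tn):
    """Per-class metrics, treating the class one-vs-rest."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of prec./sens.
    return precision, sensitivity, specificity, f1

# Hypothetical melanoma counts on a 1,006-image test set.
p, sens, spec, f1 = binary_metrics(tp=80, fp=24, fn=31, tn=871)
```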
Per-class and average results: All four CNNs achieved their highest precision, sensitivity, and F1 scores on melanocytic nevi, which is expected given that this class comprised 66.9% of the training data. Conversely, specificity for melanocytic nevus detection was the lowest among all classes across every CNN, reflecting the challenge of distinguishing the majority class from visually similar minority classes. For melanoma specifically, ResNeXt achieved the best individual-model results: 0.77 precision, 0.72 sensitivity, 0.74 F1 score, 0.97 specificity, and 0.95 AUC.
Model ranking by average metrics: ResNeXt emerged as the top-performing single model with average precision of 0.88, sensitivity of 0.83, F1 score of 0.85, specificity of 0.99, and AUC of 0.99 across all seven lesion types. ResNet followed closely, matching ResNeXt on average sensitivity (0.83) but trailing slightly on precision (0.85). SE-ResNet achieved 0.84 average precision and 0.79 sensitivity, while SE-ResNeXt scored lowest with 0.82 average precision and 0.78 sensitivity. The SE variants, despite their theoretical advantages in feature recalibration, did not outperform the simpler architectures on this dataset.
After evaluating each model individually, the authors combined all four CNNs into an ensemble. The ensemble's output was computed by averaging the prediction probabilities from all four networks. This approach leverages the complementary strengths of different architectures: where one model misclassifies an image, others may classify it correctly, and averaging smooths out individual errors. Unlike a stacked ensemble with a trained meta-learner, this averaging scheme required no additional training, only inference through all four models followed by a simple averaging step.
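The averaging step amounts to a mean over the four softmax vectors. The probabilities below are invented for illustration (each row sums to 1); they are not the models' actual outputs:

```python
import numpy as np

# Hypothetical softmax outputs from the four CNNs for one image,
# over the 7 lesion classes.
preds = np.array([
    [0.60, 0.25, 0.05, 0.02, 0.05, 0.01, 0.02],  # ResNet
    [0.55, 0.30, 0.05, 0.03, 0.04, 0.01, 0.02],  # ResNeXt
    [0.40, 0.45, 0.05, 0.03, 0.04, 0.01, 0.02],  # SE-ResNet
    [0.65, 0.20, 0.05, 0.03, 0.04, 0.01, 0.02],  # SE-ResNeXt
])

ensemble = preds.mean(axis=0)      # average the four probability vectors
label = int(ensemble.argmax())     # final class: index 0 here
# SE-ResNet alone would have voted for class 1; averaging lets the other
# three models override the single dissenting network.
```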
Ensemble vs. best single model: For melanoma classification specifically, the ensemble surpassed ResNeXt (the best individual model) in precision, F1 score, and specificity. However, sensitivity remained similar to ResNeXt at 0.72. The normalized confusion matrix for the ensemble revealed that 72% of melanoma test images were correctly classified, while 19% were misclassified as melanocytic nevi, the most common error type. For melanocytic nevi, the ensemble achieved 96% correct classification. Basal cell carcinoma and vascular lesion classes also showed strong performance.
Error analysis: A detailed examination of ResNeXt's melanoma misclassifications showed that 28% of melanoma images were incorrectly classified, with half of those errors also made by the other three CNNs and the other half correctly classified by at least one other network. The most frequent misclassification pattern was melanoma predicted as melanocytic nevus or benign keratosis. This finding highlights both the difficulty of distinguishing melanoma from visually similar benign lesions and the value of ensemble methods in capturing predictions that individual models miss.
The confusion matrix also showed that dermatofibroma had a notable false-negative rate, with 18% of dermatofibroma images misclassified as melanocytic nevi and 9% as melanoma. Actinic keratosis had 73% correct classification but 17% were confused with benign keratosis, reflecting the visual overlap between these keratotic lesion types. These cross-class confusion patterns align with known clinical challenges in dermoscopy, where even experienced dermatologists struggle with the same differential diagnoses.
To address the "black box" problem inherent in deep learning, the authors applied Gradient-weighted Class Activation Mapping (Grad-CAM) to the best-performing single model, ResNeXt. Grad-CAM produces a heatmap overlay on the original image, where red regions indicate the areas most important for the model's prediction and light blue regions indicate the least important areas. This provides clinicians with a visual explanation of why the network classified a lesion the way it did, rather than simply outputting a label with no reasoning.
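The core Grad-CAM computation is a gradient-weighted sum of the last convolutional layer's feature maps, followed by a ReLU. The sketch below uses synthetic activations and gradients; in practice both come from a forward and backward pass through the trained ResNeXt:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_cam(activations, gradients):
    """Grad-CAM heatmap: alpha_k = global average of the target-class
    gradient over channel k; CAM = ReLU(sum_k alpha_k * A_k)."""
    alpha = gradients.mean(axis=(1, 2))               # one weight per channel
    cam = np.tensordot(alpha, activations, axes=1)    # weighted sum: (H, W)
    return np.maximum(cam, 0.0)                       # keep positive evidence

# Synthetic stand-ins for a (C, H, W) feature volume and its gradients.
acts = rng.normal(size=(32, 7, 7))
grads = rng.normal(size=(32, 7, 7))
heatmap = grad_cam(acts, grads)   # coarse map, upsampled onto the image
```

The coarse heatmap is then resized to the input resolution and overlaid on the dermoscopy image to produce the red-to-blue visualization described above.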
Correctly classified examples: For correctly identified melanocytic nevi, the Grad-CAM heatmaps showed that the CNN focused primarily on the lesion itself, with high activation concentrated on the pigmented area and low activation on the surrounding normal skin. Similarly, for correctly classified melanomas, the model's attention was centered on the lesion with particular emphasis on regions showing irregular features. The Guided Grad-CAM visualizations, which show fine-grained feature details, confirmed that the CNN was responding to clinically meaningful texture and color patterns within the lesions.
Misclassification insights: When ResNeXt misclassified a melanoma as a melanocytic nevus, the Grad-CAM heatmaps suggested that the spatial size of the lesion and its overall color had a strong influence on the incorrect assignment. The authors also demonstrated that image acquisition factors, specifically zoom level, lighting conditions, and camera angle, significantly affected classification accuracy. In one illustrative example, three dermoscopy images of the same melanoma taken at different angles and zoom levels yielded different predictions: only one was correctly classified as melanoma, while the others were misidentified as melanocytic nevus and vascular lesion. This finding underscores the sensitivity of CNNs to image quality and standardization.
The interpretability provided by Grad-CAM is clinically significant because it allows dermatologists to assess whether the CNN is making predictions based on relevant dermoscopic features or on artifacts like image borders, lighting gradients, or background skin. This transparency could increase clinician trust and adoption of AI-assisted diagnostic tools in dermatology practice.
Class imbalance: The most significant limitation was the severe class imbalance in the HAM10000 dataset. Melanocytic nevi represented 66.9% of images, while dermatofibroma (1.1%) and vascular lesion (1.4%) were extreme minorities. Although the authors mitigated this with weighted cross-entropy loss, the imbalance still biased all four CNNs toward predicting the majority class. This is directly reflected in the results: melanocytic nevus achieved the highest sensitivity across all models, while minority classes like dermatofibroma and actinic keratosis had markedly lower sensitivity (as low as 0.53 for actinic keratosis with SE-ResNeXt).
Dataset size and scope: The HAM10000 dataset, while valuable as a publicly available benchmark, is relatively small compared to datasets commonly used in general computer vision (ImageNet has over 14 million images, MS COCO has 330,000 images). With only 1,113 melanoma images for training and testing combined, the models had limited examples to learn the full spectrum of melanoma presentations. The authors note that gathering more images could help overcome issues related to camera angle, zoom, and lighting variability, and could decrease the bias toward overrepresented classes.
Image acquisition variability: As demonstrated by the Grad-CAM analysis, differences in dermoscopy image capture (camera angle, zoom, lighting) directly affected classification outcomes. The dataset did not control for these variables, and the augmentation strategy (rotation and center crop) addressed only a subset of possible image variations. Real-world dermoscopy images exhibit even greater variability depending on the device manufacturer, imaging protocol, and clinical setting. This limits the generalizability of the reported performance metrics to clinical practice.
No external validation: The study evaluated all models on a single test split from the same HAM10000 dataset. No external validation was performed on independent datasets from different institutions, devices, or patient populations. Without multi-center external validation, it is impossible to confirm that the reported precision, sensitivity, and specificity values would hold in diverse clinical settings. The absence of prospective clinical testing further limits the translational relevance of the findings.
The authors contextualize their findings against a landmark study by Haenssle et al. (2018), in which a CNN achieved 0.86 AUC for melanoma recognition on dermoscopy images, compared to 0.79 for 58 dermatologists (including 30 experts) who had access only to dermoscopic images (p < 0.01). Even when dermatologists were given additional clinical information (age, sex, body site, close-up images) in a second-level study, their mean AUC rose only to 0.82, still below the CNN. This evidence supports the position that even experienced clinicians can benefit from CNN-based decision support in melanoma diagnosis.
Multimodal approaches: Yap et al. (2018) demonstrated that combining a CNN trained on dermoscopic images with another CNN trained on macroscopic (clinical) images improved classification accuracy from 0.707 (dermoscopy alone) to 0.721. While macroscopic CNN accuracy alone was lower at 0.647, the fusion provided complementary information. This multimodal strategy represents a promising direction, though it requires paired macroscopic and dermoscopic image datasets that are not yet widely available.
The black box problem: The authors raise an important concern about clinical adoption: deep learning models have low interpretability, and clinicians may be reluctant to trust recommendations from opaque algorithms. The Grad-CAM visualization used in this study offers a partial solution by showing which regions of the image drove the classification decision. The authors suggest that integrating such explainability methods into clinical AI tools could give dermatologists meaningful insight into the model's reasoning, improving trust and usability. However, Grad-CAM provides only a coarse spatial explanation and does not fully capture the complex feature interactions learned by deep networks.
Future directions: Several concrete next steps emerge from this work. First, larger and more balanced datasets are needed, both to improve CNN performance on underrepresented lesion types and to enable robust external validation. Second, standardization of dermoscopy image acquisition (consistent zoom, lighting, and angle protocols) could significantly reduce classification variability. Third, ensemble methods and multimodal fusion (combining dermoscopic, macroscopic, and clinical data) should be explored further, as they consistently outperform single-model approaches. Fourth, prospective clinical trials comparing CNN-assisted diagnosis to standard dermatologist assessment are essential before any AI tool can be recommended for routine clinical use. The authors conclude that CNNs should serve as a supportive tool for clinicians, not a replacement for skilled dermatologists.