A Non-Invasive Interpretable Diagnosis of Melanoma Skin Cancer Using Deep Learning and Ensemble Stacking of Machine Learning Models

Diagnostics, 2022

Plain-English Explanations
Pages 1-3
Why Early Melanoma Detection Matters, and Why AI Can Help

Malignant melanoma is one of the most dangerous forms of skin cancer, caused by UV-induced mutations in melanocytes. In the United States, 6% of all cancers diagnosed in 2021 were melanoma, affecting roughly 1 in 27 males and 1 in 40 females. Australia reports among the highest incidence globally, with 16,878 new melanoma cases predicted in 2021 alone. The 5-year survival rate is 98% when detected early, but plummets to just 18% once the cancer has spread to distant organs. This enormous gap makes early, accurate diagnosis a matter of life and death.

Current clinical practice: Dermatologists use the ABCDE criteria (Asymmetry, Border irregularity, Color variation, Diameter over 6 mm, Evolving shape) to evaluate suspicious moles. However, this manual approach is subjective and error-prone. Research has shown that the ABCDE criteria are not always reliable, and accurate application requires highly trained specialists, leading to elevated rates of both false positives and false negatives.

The role of AI: Deep learning models can automatically extract disease-related features from dermoscopic images, bypassing the need for manual feature engineering. However, most prior approaches lacked interpretability, functioning as "black boxes" that clinicians are reluctant to trust. This paper addresses that gap by combining ensemble methods with SHAP-based explainability to produce results that dermatologists can actually understand and verify.

The authors set out four specific objectives: (1) build a stacked ensemble of classical machine learning models trained on hand-crafted features, (2) create an ensemble of fine-tuned deep learning models using automated features, (3) evaluate both approaches on a publicly available ISIC dataset, and (4) generate interpretable heatmaps using SHAP values to highlight the image regions most indicative of melanoma.

TL;DR: Melanoma survival drops from 98% to 18% once metastasized. Current ABCDE-based clinical screening is subjective and error-prone. This study proposes an interpretable AI pipeline combining classical ML stacking and deep learning ensembles with SHAP explainability to improve early, non-invasive melanoma detection.
Pages 4-6
Prior Approaches to Automated Melanoma Classification

The literature on automated melanoma detection spans a range of classical and deep learning techniques, each with distinct strengths and limitations. Devansh et al. tackled two persistent problems with public datasets, namely small size and class imbalance, plus the presence of occlusions in dermoscopic images. They used de-coupled DCGANs for data augmentation and achieved an ROC-AUC of 0.880 on the melanoma class from the ISIC 2017 dataset, a 4% improvement over the challenge winners. Esteva et al. separately demonstrated that a deep learning model could reach 81.6% accuracy, surpassing two dermatologists who scored 65.56% and 66%.

Hybrid strategies: Daghrir et al. combined CNN predictions with KNN and SVM classifiers trained on texture, boundary, and color features, then fused them via majority voting. Working with only 640 images from the ISIC repository, they achieved 88.4% accuracy after the vote, up from 85.5% for the standalone CNN. Filali et al. explored fusing hand-crafted features (shape, texture, color) with automated CNN features and found stronger performance on the smaller PH2 dataset. Mahbod et al. extracted deep features from three pre-trained CNN models and trained SVM classifiers, reaching 83% accuracy on the ISIC challenge dataset.

Segmentation-based methods: Yuan and Lo presented deep fully convolutional-deconvolutional neural networks (CDNNs) with 29 layers for pixel-level skin lesion segmentation, achieving 93.4% accuracy and a Jaccard index of 0.765 on the ISBI 2017 dataset. Bi et al. proposed multi-scale lesion-biased representation (MLR) with joint reverse classification (JRC), yielding 92.0% accuracy on the PH2 dataset. These segmentation approaches show high performance but do not focus on classification interpretability.

Key gap: While accuracy has steadily improved, the majority of prior work lacks adequate explainability. Without interpretability, clinicians cannot verify why a model classified a lesion as malignant, which limits real-world adoption. The authors position their work as addressing this specific shortcoming by integrating SHAP-based heatmap explanations into the classification pipeline.

TL;DR: Prior work ranges from 81% to 93% accuracy using CNNs, SVMs, majority voting, and segmentation networks on various ISIC and PH2 datasets. Most methods lack interpretability, which this paper explicitly aims to fix using SHAP-based explanations.
Pages 7-8
ISIC 2018 Subset: Balanced Binary Classification of Skin Moles

The study used a curated miniature version of the ISIC 2018 challenge dataset, filtered to include only melanoma-related images and balanced across two classes: benign and malignant skin moles. The full dataset contains 3,297 dermoscopic images at 224 x 224 resolution. The training set comprises 2,637 images (1,440 benign, 1,197 malignant), while the test set contains 660 images (360 benign, 300 malignant). All non-melanoma disease categories were excluded during curation.

Normalization: Input images were scaled to a range of -1 to +1 before being fed into the models. Proper scaling is critical for stable and efficient convergence during deep learning training. Without normalization, gradient updates can become erratic, slowing or destabilizing the learning process.
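
The paper does not reproduce its preprocessing code; a minimal numpy sketch of the described scaling, mapping 8-bit pixel values from [0, 255] onto [-1, +1] (the `normalize_to_pm1` name is illustrative), might look like:

```python
import numpy as np

def normalize_to_pm1(img_uint8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values [0, 255] linearly onto [-1.0, +1.0]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

# A synthetic 224 x 224 RGB image stands in for a dermoscopic scan.
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
scaled = normalize_to_pm1(img)
```

Dividing by 127.5 and subtracting 1 sends 0 to exactly -1 and 255 to exactly +1, which is the convention several ImageNet-pretrained Keras models expect.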

Data augmentation: To expand the effective training set size and improve generalization, several augmentation techniques were applied. These included a rotation range of 90 degrees, shear and zoom ranges of 0.1, horizontal and vertical flips, and shuffling of the training data. These transformations help the model become invariant to common variations in how dermoscopic images are captured, such as orientation and slight scale differences.
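
The listed settings match what common augmentation libraries expose as configuration options; as a dependency-free illustration (not the authors' pipeline), here is a numpy sketch of the rotate-and-flip part, with shear and zoom omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Randomly rotate by a multiple of 90 degrees and flip, mimicking
    the rotation / horizontal-flip / vertical-flip settings described
    above (shear and zoom are left out of this sketch)."""
    out = np.rot90(img, k=int(rng.integers(0, 4)))  # 0, 90, 180, or 270 degrees
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                        # vertical flip
    return out

img = np.arange(64 * 64 * 3, dtype=np.float64).reshape(64, 64, 3)
aug = augment(img)
```

Because these transforms only rearrange pixels, every augmented image contains exactly the same values as the original, just in a new spatial arrangement.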

Although the curated dataset is relatively small (only 2,637 training images), the balanced class distribution addresses a major issue in dermatology AI, namely class imbalance. The original ISIC 2018 dataset is heavily skewed toward certain lesion types, which can cause models to become biased toward the majority class. By using a balanced subset, the authors ensure that both classes receive equal representation during training.

TL;DR: The dataset contains 3,297 balanced dermoscopic images (2,637 train, 660 test) from the ISIC 2018 challenge, filtered to melanoma only. Images were normalized to [-1, +1] and augmented with 90-degree rotation, flips, shear, and zoom to improve generalization.
Pages 8-9
Two-Track Architecture: Hand-Crafted ML Stacking and Deep Learning Ensembles

The proposed system follows a dual-track architecture. The first track uses classical machine learning models trained on hand-crafted image features. Three types of features were extracted from each image: Hu moments (shape descriptors that are invariant to translation, scaling, and rotation), Haralick texture features (computed from the gray-level co-occurrence matrix, or GLCM, which captures spatial relationships between pixel intensities), and color histograms (capturing the color distribution). The concatenated feature vector for each image was then fed into 49 base classifiers spanning five model families: logistic regression, SVM, random forest, KNN, and gradient boosting machine (GBM), each instantiated with several hyperparameter configurations.
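
The feature extractors themselves are standard; as a rough, dependency-free illustration (not the authors' code), here is a per-channel color histogram plus one Haralick-style statistic, contrast, computed from a horizontal-offset GLCM:

```python
import numpy as np

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    """Per-channel intensity histogram, each channel normalized to sum to 1."""
    feats = []
    for c in range(img.shape[-1]):
        h, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
        feats.append(h / h.sum())
    return np.concatenate(feats)

def glcm_contrast(gray: np.ndarray, levels: int = 8) -> float:
    """Contrast derived from a gray-level co-occurrence matrix over
    horizontally adjacent pixel pairs (one Haralick texture statistic)."""
    q = (gray.astype(np.float64) / 256 * levels).astype(int)  # quantize levels
    glcm = np.zeros((levels, levels))
    for i, j in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[i, j] += 1
    glcm /= glcm.sum()
    idx = np.arange(levels)
    return float(((idx[:, None] - idx[None, :]) ** 2 * glcm).sum())

img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
gray = img.mean(axis=-1)
features = np.append(color_histogram(img), glcm_contrast(gray))
```

In practice these features are usually taken from libraries such as OpenCV (Hu moments) and mahotas (the full 13 Haralick statistics) rather than written by hand.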

Level-one stacking: The predictions from all 49 classifiers (produced with different hyperparameter configurations) were concatenated into a new feature vector. A level-one meta-learner was then trained on this vector using 5-fold cross-validation, with the final prediction computed as the average across all folds. This stacking approach leverages the diversity of the base models to produce a more robust overall prediction.
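
This setup maps naturally onto scikit-learn's stacking API; the sketch below uses synthetic data and only five base models rather than the paper's 49, so it illustrates the structure, not the authors' exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the hand-crafted feature vectors.
X, y = make_classification(n_samples=400, n_features=25, random_state=0)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]

# cv=5 produces the out-of-fold level-zero predictions that the
# level-one meta-learner is trained on.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
acc = stack.score(X, y)
```

Note one difference from the paper's description: scikit-learn refits the base estimators on the full training set after the meta-learner is trained, whereas the paper averages the final prediction across folds.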

Deep learning track: Five CNN architectures were used for transfer learning: ResNet50, ResNet50V2, MobileNet, Xception, and DenseNet121, all pre-trained on ImageNet. The top fully connected layers of each pre-trained model were removed and replaced with a global average pooling layer followed by a new classification head with softmax activation. Training proceeded in two phases: first, only the classification head was trained while all backbone weights remained frozen, and then all layers were unfrozen for full fine-tuning.

Ensembling: After individual evaluation, the best-performing deep learning models were combined in different groupings (five, four, three, and two models) to find the optimal ensemble. The ensemble predictions were averaged across the constituent models. Model training used SGD with Nesterov momentum of 0.9, an initial learning rate of 0.001 with decay of 1e-6, binary cross-entropy loss, batch size of 16, and up to 100 epochs per training phase. Images were processed at both 299 x 299 and 224 x 224 resolutions depending on the architecture.
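
Averaging class-probability outputs across constituent models is a one-liner once the predictions are stacked into an array; a numpy sketch with fake softmax outputs (the shapes mirror the 660-image test set, but the values are random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake logits: 3 models x 660 test images x 2 classes.
logits = rng.normal(size=(3, 660, 2))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax

ensemble = probs.mean(axis=0)        # average over the model axis
pred = ensemble.argmax(axis=-1)      # final benign/malignant call per image
```

Averaging probabilities (soft voting) lets a confident model outweigh an uncertain one, unlike the majority voting used in some of the prior work surveyed earlier.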

TL;DR: Two parallel pipelines: (1) 49 ML classifiers (LR, SVM, RF, KNN, GBM) trained on Hu moments, Haralick, and color features, combined via level-one stacking with 5-fold CV; (2) five ImageNet-pretrained CNNs (ResNet50, ResNet50V2, MobileNet, Xception, DenseNet121) fine-tuned in two phases, then ensembled by averaging predictions.
Pages 10-12
Classical ML Results: Stacking Boosts Accuracy to 88%

Among the five classical machine learning base models, gradient boosting machine (GBM) performed best with 87% accuracy, 87% F1-score, and a Cohen's kappa of 0.74. SVM followed at 85% accuracy and 0.69 kappa, while logistic regression and random forest each achieved 84% accuracy. KNN came in last at 82% accuracy and 0.64 kappa, the lowest agreement score among the base models.

Stacking improvement: After level-one stacking combined the predictions of all 49 classifiers, accuracy increased to 88%, F1-score to 88%, and Cohen's kappa to 0.76. The confusion matrix for the stacked model shows 305 true positives, 274 true negatives, 55 false positives, and 26 false negatives out of 660 test images. The stacking approach provided a consistent 1-6 percentage point improvement over every individual base model.

Interpretation of kappa scores: Cohen's kappa measures agreement between predicted and true labels, correcting for chance. A kappa of 0.76 (stacking) indicates "substantial agreement," while the base models ranged from 0.64 (moderate) to 0.74 (substantial). These scores confirm that the hand-crafted feature approach, while useful, has a ceiling. The limited feature representation (shape, texture, color histograms) cannot capture the full complexity of dermoscopic image patterns.

TL;DR: Classical ML base models ranged from 82% (KNN) to 87% (GBM) accuracy. Level-one stacking pushed accuracy to 88%, F1 to 88%, and kappa to 0.76. The stacking ensemble outperformed every individual base classifier by 1-6 percentage points.
Pages 12-14
Deep Learning Results: Three-Model Ensemble Achieves 92% Accuracy and 0.97 AUC

All five deep learning models outperformed the classical ML approaches. Among individual models, ResNet50 led with 91% accuracy, 91% F1-score, 0.82 kappa, and an AUC of 0.96. DenseNet121 matched ResNet50 on accuracy (91%) and achieved the highest individual AUC of 0.97. MobileNet and ResNet50V2 both reached 90% accuracy with kappa scores around 0.80. Xception was the weakest deep learning model at 88% accuracy and 0.95 AUC, though it still matched the stacked ML ensemble.

Ensemble combinations: Ensembling all five models yielded 91% accuracy and 0.97 AUC, identical to the four-model ensemble. However, the three-model ensemble (combining the top three performers) achieved the best results: 92% accuracy, 92% F1-score, 0.83 kappa, and 0.97 AUC. The confusion matrix for this best ensemble shows 331 true positives, 274 true negatives, 29 false positives, and 26 false negatives. All ensemble configurations produced an AUC of 0.97, indicating excellent discriminative ability.
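
The reported figures can be checked directly from the confusion matrix above; the standard two-class kappa formula reproduces both the 92% accuracy and the 0.83 kappa:

```python
# Confusion matrix of the best three-model ensemble, as reported.
tp, tn, fp, fn = 331, 274, 29, 26
n = tp + tn + fp + fn                 # 660 test images

accuracy = (tp + tn) / n              # observed agreement p_o
# Chance agreement p_e from the row/column marginals.
p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - p_e) / (1 - p_e)  # Cohen's kappa
```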

Statistical significance: The authors used corrected paired Student's t-tests to compare the best ensemble against each base CNN model. The p-values for accuracy were all below 0.05: MobileNet (p = 0.014), Xception (p = 0.006), ResNet50 (p = 0.016), and DenseNet121 (p = 0.016). For kappa, all comparisons also reached significance, with Xception showing the strongest difference (p = 4 × 10⁻⁴). These results confirm that the ensemble's performance gain is statistically meaningful and not attributable to chance.
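
The "corrected" variant usually refers to Nadeau and Bengio's corrected resampled t-test, which inflates the variance term to account for overlapping training sets across folds. Assuming that standard form (the paper does not spell out the formula), a sketch with illustrative per-fold accuracies:

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled paired t-test (Nadeau & Bengio).
    scores_a / scores_b are per-fold metric values for the two models."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(d)
    var = d.var(ddof=1)
    # The (1/k + n_test/n_train) factor is the variance correction.
    t = d.mean() / np.sqrt((1 / k + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p

# Illustrative per-fold accuracies (NOT the paper's numbers).
a = [0.92, 0.93, 0.91, 0.92, 0.93]
b = [0.90, 0.91, 0.89, 0.90, 0.90]
t, p = corrected_paired_ttest(a, b, n_train=2637, n_test=660)
```

Without the correction, the ordinary paired t-test tends to be overconfident on cross-validated scores because the folds' training sets are not independent.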

Comparison with prior work: The ISIC 2018 challenge winner, MetaOptima, achieved 88.5% accuracy on their top-10 model average. The proposed three-model ensemble surpasses this by roughly 4 percentage points. Compared to other published methods using the ISIC 2018 dataset, the model achieves the best accuracy (92.0%), precision (91.0%), and sensitivity (92.0%).

TL;DR: The three-model deep learning ensemble achieved the best results: 92% accuracy, 92% F1, 0.83 kappa, and 0.97 AUC, outperforming the ISIC 2018 challenge winner by roughly 4%. All ensemble-vs-base comparisons were statistically significant (p < 0.05).
Pages 15-16
SHAP-Based Explainability: Making the Black Box Transparent

A key differentiator of this study is its use of SHAP (SHapley Additive exPlanations) values to generate heatmaps that highlight which regions of a dermoscopic image contributed most to the model's prediction. The authors applied SHAP analysis to the best-performing individual model, DenseNet121, to visualize the decision-making process for both positive (malignant) and negative (benign) predictions.

How SHAP works in this context: SHAP assigns each pixel region a contribution score indicating how much it pushed the prediction toward malignant or benign. In the resulting heatmaps, red regions indicate features that increase the probability of a malignant melanoma classification, while blue regions indicate features that decrease it. The saliency of a region of interest is measured by summing the SHAP contributions of its pixels for a given class.
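
SHAP contributions are Shapley values from cooperative game theory. The paper relies on the shap library's image explainers, but the underlying idea can be computed exactly for a toy three-feature model (everything below is illustrative, not the authors' setup):

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values for model f over len(x) features.
    'Absent' features are replaced by their background value."""
    n = len(x)
    phi = np.zeros(n)
    for order in itertools.permutations(range(n)):
        z = background.copy()
        prev = f(z)
        for i in order:
            z[i] = x[i]              # reveal feature i
            phi[i] += f(z) - prev    # its marginal contribution
            prev = f(z)
    return phi / math.factorial(n)   # average over all orderings

# Toy "model": an interaction term plus a linear term.
f = lambda v: v[0] * v[1] + 2 * v[2]
x = np.array([1.0, 2.0, 3.0])
background = np.zeros(3)
phi = shapley_values(f, x, background)
# Positive phi pushes the prediction up (red in a SHAP heatmap),
# negative phi pushes it down (blue).
```

The key property ("efficiency") is that the attributions sum exactly to the gap between the model's output on the input and on the background, so a heatmap is a complete accounting of the prediction. For images, exact enumeration is intractable and shap approximates these values over superpixels.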

Clinical relevance: This interpretability layer is not just an academic exercise. Dermatologists are unlikely to adopt a classification system they cannot understand or verify. By showing which parts of a lesion image drove the prediction, SHAP heatmaps allow clinicians to cross-reference the AI's reasoning against their own clinical judgment. If the model highlights the same irregular border or color variation that a dermatologist would flag, confidence in the result increases. Conversely, if the model focuses on irrelevant artifacts (such as hair or ruler markings), clinicians can identify unreliable predictions.

The authors emphasize that this AI system is not intended to replace dermatologists. Instead, the combination of high accuracy (92%) and transparent reasoning positions it as a clinical decision-support tool. Health care providers can benefit from automated screening to identify suspicious cases while awaiting more comprehensive testing, particularly in settings where specialist access is limited.

TL;DR: SHAP heatmaps applied to DenseNet121 show which image regions drive malignant vs. benign predictions (red = increases malignant probability, blue = decreases it). This interpretability layer enables clinicians to verify the AI's reasoning, bridging the gap between model accuracy and clinical trust.
Pages 16-18
Dataset Constraints, Occlusions, and the Path Forward

Small dataset size: The study used only 3,297 images (2,637 for training), which is modest by deep learning standards. While the balanced class distribution is a strength, the limited sample size raises questions about generalization to larger, more diverse patient populations. The models were trained and tested on a single curated subset of ISIC 2018, so performance on other datasets or in real-world clinical workflows remains unvalidated.

Binary classification only: The dataset includes only two classes (benign and malignant), whereas the full ISIC 2018 dataset contains seven lesion types. Real-world clinical scenarios require distinguishing among multiple differential diagnoses, not just a binary benign/malignant split. The authors note that their methods are "capable of classifying the dataset with more categories," but this claim is untested in the current study.

Image occlusions: Dermoscopic images frequently contain artifacts such as hair, ruler markings, and glare. While data augmentation helps improve robustness, the study does not implement a dedicated preprocessing step for artifact removal. Prior work by Devansh et al. demonstrated that data purification (removing occlusions) can meaningfully boost performance. The authors acknowledge this as a limitation and propose developing a more efficient data purification method in future work.

Future directions: The authors outline three main avenues for follow-up research. First, they plan to combine the classical ML and deep learning tracks into a unified hybrid system, rather than evaluating them separately. Second, they aim to develop better data purification pipelines to handle image occlusions. Third, they expect that combining improved preprocessing with a larger balanced dataset could push classification accuracy beyond 92%. Additionally, prospective clinical validation and multi-class extension remain important next steps that the paper does not address.

TL;DR: Key limitations include a small dataset (3,297 images), binary-only classification, no dedicated artifact removal, and lack of external validation. Future work aims to merge the ML and DL pipelines, add data purification, and scale to larger multi-class datasets.
Citation: Alfi IA, Rahman MM, Shorfuzzaman M, Nazir A. A Non-Invasive Interpretable Diagnosis of Melanoma Skin Cancer Using Deep Learning and Ensemble Stacking of Machine Learning Models. Diagnostics. 2022. DOI: 10.3390/diagnostics12030726. PMC: PMC8947367. License: CC BY (open access).