Melanoma identification and classification model based on fine-tuned convolutional neural networks


Plain-English Explanations
Pages 1-2
Why Melanoma Detection Needs Better AI Models

Melanoma is the most dangerous form of skin cancer, and early detection is critical for patient survival. Visual inspection by dermatologists remains difficult because benign and malignant skin lesions share a high degree of visual similarity, leading to potential misdiagnosis. Traditional methods relied on handcrafted features such as color, texture, and shape, but these approaches are limited in their ability to capture the complexity of dermoscopic images.

This paper introduces a CNN-based melanoma detection model that leverages deep learning image classification and transfer learning to improve diagnostic accuracy. The authors fine-tune the AlexNet architecture and compare it against DenseNet, using both a Soft-max classification layer and a support vector machine (SVM) classifier to evaluate deep feature performance. The model is designed to support Internet of Medical Things (IoMT) applications, enabling non-invasive melanoma screening via portable devices such as smartphones.

Key contributions: The study proposes a systematic hyperparameter fine-tuning approach for CNN-based melanoma detection, evaluates performance across three benchmark datasets (DermIS, DermQuest, and ISIC2019), and provides formal mathematical modeling of the CNN architecture. The authors report an average 5% accuracy improvement on DermIS, 6% on DermQuest, and 0.81% on ISIC2019 compared to existing state-of-the-art methods.

The paper also reviews prior work extensively, identifying recurring limitations across published studies: small dataset sizes, lack of external validation, limited interpretability of deep learning models, and insufficient comparison with competing techniques. These gaps motivated the authors' comprehensive evaluation across multiple datasets and classifier configurations.

TL;DR: This study presents a fine-tuned CNN model for melanoma detection, achieving 5-6% average accuracy improvements over the state of the art on the DermIS and DermQuest datasets, with evaluations on 621, 1,233, and 25,331 images respectively.
Pages 3-6
Prior Work and Its Limitations

The authors conduct an extensive review of prior melanoma detection methods, spanning both traditional machine learning and deep learning approaches. SVM-based methods: Lingaraj et al. used SVMs to predict melanoma but were limited by small sample sizes and the absence of external validation. SVM models also suffered from poor interpretability, a critical shortcoming in clinical settings where clinicians need to understand and trust model decisions.

Pre-processing and augmentation: Earlier work by Khan et al. demonstrated that pre-processing techniques, particularly Contrast Limited Adaptive Histogram Equalization (CLAHE), could considerably improve classification accuracy. Liang and Wu compared decision trees and neural networks on dermoscopy images, finding that neural networks achieved higher sensitivity but neither study adequately addressed class imbalance between malignant and benign cases.

CNN-based approaches: Ashraf et al. proposed a CNN architecture incorporating global and local skin lesion attributes, while Bukhari et al. introduced multi-parallel depthwise separable and dilated convolutions with Swish activations for melanoma lesion segmentation. Hosny and Kassem combined residual learning with deep CNNs and transfer learning, achieving accuracy between 0.94 and 0.98 and AUC-ROC of 0.97 to 0.99. Olayah et al. reported 96.10% accuracy using fused CNN models with geometric active contour and Random Forest.

The common thread across these studies is a reliance on limited datasets, minimal comparison with competing methods, and a lack of detailed hyperparameter documentation. The authors identify this gap as the primary motivation for their systematic hyperparameter optimization approach across multiple datasets.

TL;DR: Prior melanoma detection studies achieved 83-98% accuracy but were limited by small datasets, poor interpretability, and inadequate benchmarking. This paper addresses those gaps through systematic multi-dataset evaluation and transparent hyperparameter tuning.
Pages 7-10
CNN Architecture Fundamentals and Deep Learning for Image Classification

The paper provides a thorough technical overview of the building blocks used in the proposed model. A CNN consists of three core layer types: convolutional layers (CLs), pooling layers (PLs), and fully connected layers (FCLs). Convolutional layers apply learnable filters (kernels) across the input image, computing dot products at each spatial position to produce feature maps that capture patterns such as edges, textures, and color gradients relevant to melanoma identification.
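
The dot-product computation behind a convolutional layer can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the tiny image, the edge kernel, and all variable names are assumptions chosen for clarity:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution: slide the kernel and take a dot product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out

# A vertical-edge kernel responds where intensity changes left-to-right,
# analogous to how learned filters pick up lesion borders.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1, 1],
                        [-1, 1]], dtype=float)
fmap = conv2d(img, edge_kernel)  # strong response only at the 0->1 boundary
```

In a trained CNN the kernel weights are learned rather than hand-designed, but the sliding dot-product mechanics are the same.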

Pooling layers reduce the spatial dimensions of feature maps while retaining the most salient information. The authors use max pooling, which selects the highest value within non-overlapping rectangular regions. This operation makes the network translation-invariant, meaning it can detect the same feature regardless of its position in the image. After the convolutional and pooling layers, fully connected layers perform the final classification by computing linear combinations of the extracted features.

Deep CNNs (DCNNs) stack multiple convolutional and pooling layers to create hierarchical feature representations, learning progressively more abstract patterns at each level. The authors note that well-known architectures such as ResNet, AlexNet, GoogLeNet, VGGNet, DenseNet, and MobileNet have all been applied to various computer vision tasks, including medical imaging. Transfer learning is highlighted as a key technique for scenarios with limited labeled data, where a model pre-trained on a large dataset like ImageNet is fine-tuned on the target medical imaging task.

The loss function, which measures the discrepancy between predicted and actual labels, is minimized during training via stochastic gradient descent. The authors explain that cross-entropy and mean-squared error are common choices, with the goal of driving the loss function to its minimum value to achieve optimal classification performance.
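
As a minimal sketch (not the authors' implementation), categorical cross-entropy for a batch of one-hot labels can be computed as follows; the example probabilities are illustrative:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch of one-hot labels."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Two lesions with [benign, malignant] one-hot labels, scored by two models.
y_true    = np.array([[1.0, 0.0], [0.0, 1.0]])
confident = np.array([[0.9, 0.1], [0.2, 0.8]])  # mostly correct, confident
poor      = np.array([[0.5, 0.5], [0.5, 0.5]])  # uninformative predictions

# Training drives this loss toward its minimum: better predictions, lower loss.
assert cross_entropy(y_true, confident) < cross_entropy(y_true, poor)
```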

TL;DR: The model builds on standard CNN components (convolutional, pooling, and fully connected layers) combined with transfer learning from ImageNet pre-trained architectures to overcome limited medical training data.
Pages 10-14
AlexNet Backbone and Layer-by-Layer Design

The proposed model is built on the AlexNet architecture, which won the 2012 ImageNet Large Scale Visual Recognition Challenge. AlexNet accepts input images of 227x227x3 pixels (height, width, and three RGB color channels). The first convolutional layer applies 96 kernels of size 11x11 with a stride of 4 pixels. The output size is computed as ((227 - 11 + 0)/4) + 1 = 55, yielding 55x55 feature maps. Subsequent layers further refine these feature maps through additional convolution and max pooling operations.
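
The output-size arithmetic above follows the standard formula ((W - K + 2P) / S) + 1, where W is the input size, K the kernel size, P the padding, and S the stride. A quick sketch verifies the 55x55 figure, and also the 27x27 map produced when AlexNet's first 3x3, stride-2 pooling stage is applied to it:

```python
def conv_output_size(input_size, kernel_size, padding, stride):
    """Spatial output size of a convolution or pooling layer: ((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# AlexNet's first convolutional layer: 227x227 input, 11x11 kernels, no padding, stride 4.
assert conv_output_size(227, 11, 0, 4) == 55
# The 3x3, stride-2 max pooling that follows shrinks 55x55 to 27x27.
assert conv_output_size(55, 3, 0, 2) == 27
```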

ReLU activation: After each convolutional and fully connected layer, the Rectified Linear Unit (ReLU) function is applied, defined as f(i) = max(0, i). This non-linear activation sets all negative activation values to zero while preserving positive values, enabling the network to learn complex, non-linear patterns such as irregular borders and asymmetric edges that are key indicators of malignancy in dermoscopic images.
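
ReLU itself is a one-liner; the sample activation values below are illustrative:

```python
import numpy as np

def relu(x):
    """f(i) = max(0, i), applied element-wise to a layer's activations."""
    return np.maximum(0.0, x)

# Negative activations are zeroed; positive responses pass through unchanged.
acts = np.array([-3.2, -0.1, 0.0, 0.7, 4.5])
assert np.array_equal(relu(acts), np.array([0.0, 0.0, 0.0, 0.7, 4.5]))
```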

Max pooling layers: After the first, second, and fifth convolutional layers, a max pooling layer with a 3x3 filter and stride of 2 is applied. This reduces spatial dimensions while preserving the most discriminative features for melanoma classification. Response normalization is applied after the first two convolutional layers to standardize activations and improve generalization, particularly important when analyzing skin lesion images captured under variable lighting and contrast conditions.

Soft-max and dropout: The Soft-max activation function converts the network output into a probability distribution across classes (benign vs. malignant), assigning each input to the class with the highest probability. The dropout layer randomly sets a fraction of neurons (with probability between 0.2 and 0.5) to zero during training, preventing overfitting by forcing the network to learn more distributed and robust feature representations. The two fully connected layers each contain 4,096 neurons.
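
Both operations are simple to sketch. The logits and the dropout helper below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into a probability distribution summing to 1."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dropout(acts, rate, rng):
    """Training-time dropout: zero each unit with probability `rate`,
    rescaling survivors so the expected activation is unchanged."""
    mask = rng.random(acts.shape) >= rate
    return acts * mask / (1.0 - rate)

# Hypothetical [benign, malignant] scores from the final FC layer.
probs = softmax(np.array([1.2, 3.4]))
assert np.isclose(probs.sum(), 1.0)
assert probs[1] > probs[0]  # higher logit -> predicted malignant
```

At inference time dropout is disabled; the rescaling during training is what keeps the two regimes consistent.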

TL;DR: The model uses AlexNet (227x227x3 input, 96 initial kernels, 5 convolutional layers, two 4,096-neuron FC layers) with ReLU activation, max pooling, response normalization, dropout (0.2-0.5), and Soft-max classification.
Pages 15-18
Proposed Fine-Tuning Pipeline and Hyperparameter Optimization

The proposed model follows a three-phase pipeline. Pre-processing: Each image is resized to fit CNN input specifications (227x227 pixels), grayscale images are converted to color, and data augmentation is applied through image enhancement, random orientation changes (horizontal and vertical), and flipping. Feature extraction: Two configurations are tested: MI-SVM (AlexNet with SVM classifier) and MII-SFMAX (AlexNet with Soft-max layer). The top five layers of the pre-trained AlexNet extract features, and the last three layers are modified for fine-tuning.

Hyperparameter optimization: The training uses stochastic gradient descent with momentum (SGDM) as the optimizer, with a learning rate of 1 x 10^-4, 60 epochs, a validation step of 6, and an 80:20 train-test split. The SGDM algorithm updates weights and biases iteratively in the negative gradient direction using mini-batches to balance computational efficiency and convergence quality.
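
The SGDM update rule can be sketched on a toy one-dimensional loss. The quadratic loss and the 0.9 momentum coefficient are illustrative assumptions; the 1e-4 learning rate matches the value reported above:

```python
def sgdm_step(w, grad, velocity, lr=1e-4, momentum=0.9):
    """One SGD-with-momentum update: v <- momentum*v - lr*grad; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize a toy quadratic loss L(w) = w^2 / 2 (so grad = w): the weight
# moves in the negative gradient direction, smoothed by the momentum term.
w, v = 5.0, 0.0
for _ in range(1000):
    w, v = sgdm_step(w, grad=w, velocity=v)
assert 0.0 < w < 5.0  # the weight has moved toward the minimum at w = 0
```

In the real model the same update is applied to every weight and bias, with gradients computed over mini-batches rather than a closed-form derivative.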

Parameters systematically explored: The authors varied the number of layers and filters, tested activation functions (ReLU, Leaky ReLU, ELU), experimented with batch sizes for balancing training dynamics and generalization, adjusted learning rates from 0.001 to 0.1 (both fixed and adaptive scheduling), tuned dropout rates from 0.2 to 0.5, evaluated loss functions (including binary cross-entropy), tested validation splits (80-20 and 70-30), applied various data augmentation techniques (rotations, flips, brightness adjustments), and explored transfer learning strategies including which layers to freeze.

Additional regularization techniques included L2 regularization and weight decay to control model complexity. A systematic grid search was conducted across all hyperparameter ranges, and early stopping was employed to prevent overfitting while ensuring convergence. The WeightLearnRateFactor and BiasLearnRateFactor were increased for newly added fully connected layers to accelerate learning relative to the transferred layers.
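
A grid search of this kind can be sketched as follows. The grid values mirror ranges reported in the paper, but the `evaluate` callback and its preferences are purely illustrative stand-ins for training a model and measuring validation accuracy:

```python
from itertools import product

# Hypothetical grid mirroring ranges reported in the paper.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "batch_size": [16, 32, 64],
}

def grid_search(evaluate, grid):
    """Try every hyperparameter combination; return the best-scoring one."""
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)  # in practice: validation accuracy of a trained model
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

def fake_eval(cfg):
    """Stand-in evaluator that happens to prefer dropout 0.3 and lr 0.001."""
    return -abs(cfg["dropout"] - 0.3) - abs(cfg["learning_rate"] - 0.001)

best, _ = grid_search(fake_eval, grid)
assert best["dropout"] == 0.3 and best["learning_rate"] == 0.001
```

Exhaustive search is affordable here because the grid is small; early stopping, as the authors note, keeps each individual trial from overfitting.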

TL;DR: The model uses SGDM optimizer with learning rate 1e-4, 60 epochs, 80:20 train-test split, and systematic grid search across layers, activation functions, batch sizes, dropout rates (0.2-0.5), and regularization techniques.
Pages 19-23
Classification Performance Across DermIS and DermQuest Datasets

The model was implemented in Python 3 with the Keras framework, running on Google Colab with an NVIDIA Tesla K80 GPU and an 8th-generation Intel i7 CPU. Three benchmark datasets were used: DermIS (621 images after augmentation from 69 originals), DermQuest (1,233 images after augmentation from 137 originals), and ISIC2019 (25,331 images drawn from the HAM10000 and BCN 20000 datasets).

DermIS results: The proposed model achieved 98.4% accuracy with AlexNet (1.6% error rate) and 98.8% accuracy with DenseNet (1.2% error rate). Compared to eight prior methods, including Khan et al. (96%), Shoieb et al. (94%), Amelard et al. (94%), Almansour and Jaffar (90%), Hosny and Kassem (93.5%), Bukhari et al. (97.4%), Mustafa et al. (95.5%), and Jeyakumar et al. (97.6%), the proposed model showed an average 5% improvement. For DenseNet on DermIS, the F1 score reached 79.9%, precision 78%, and recall 82%. AlexNet achieved an F1 score of 75.5%, precision of 74%, and recall of 77%.
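
The reported F1 scores are consistent with the stated precision and recall figures, since F1 is their harmonic mean. A quick check:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# DenseNet on DermIS: precision 78%, recall 82% -> F1 of about 79.9%, as reported.
assert abs(f1_score(0.78, 0.82) - 0.799) < 0.001
# AlexNet on DermIS: precision 74%, recall 77% -> F1 of about 75.5%, as reported.
assert abs(f1_score(0.74, 0.77) - 0.755) < 0.001
```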

DermQuest results: The model achieved 97.2% accuracy with AlexNet and 98.4% accuracy with DenseNet, representing an average 6% improvement over competing methods. After fine-tuning, DenseNet achieved an F1 score of 85%, precision of 86%, and recall of 83%. AlexNet produced an F1 score of 80%, precision of 81%, and recall of 79%. The competing methods ranged from 92.86% (Arasi et al.) to 97.7% (Hosny et al.) in accuracy.

The consistent superiority of DenseNet over AlexNet across both datasets is attributed to DenseNet's dense connectivity architecture, which enables feature reuse across layers and more efficient gradient flow during training. The performance gap is especially evident in precision and F1 score, suggesting that DenseNet is better at reducing false positives in melanoma detection.

TL;DR: DenseNet achieved 98.8% accuracy on DermIS (F1: 79.9%, precision: 78%, recall: 82%) and 98.4% on DermQuest (F1: 85%, precision: 86%, recall: 83%), outperforming AlexNet and eight prior methods by an average of 5-6%.
Pages 24-25
ISIC2019 Large-Scale Dataset Performance

The proposed model was further evaluated on the ISIC2019 dataset, a significantly larger benchmark containing 25,331 dermoscopic images derived from the HAM10000 and BCN 20000 collections. Images in HAM10000 have 600x450 pixel resolution, while BCN 20000 images are 1024x1024 pixels. This dataset represents a substantially more challenging evaluation than DermIS or DermQuest due to its scale and diversity.

DenseNet performance: On ISIC2019, DenseNet achieved 97% accuracy, 98.7% F1 score, 90% precision, and 90.4% recall. AlexNet performance: AlexNet achieved 96.8% accuracy, 98.4% F1 score, 89.5% precision, and 91% recall. The reference model by Olayah et al. achieved 96.1% accuracy, 96.9% F1 score, 88.69% precision, and 89.5% recall.

DenseNet outperformed the reference model by 0.83% in accuracy, 1.8% in F1 score, 1.11% in precision, and 1.34% in recall. Compared to AlexNet, DenseNet showed improvements of 0.73% in accuracy, 0.65% in F1 score, and 0.91% in precision. Notably, AlexNet had slightly higher recall (91% vs. 90.4%), representing a 1.68% difference in favor of AlexNet on this metric alone.

The improvements on ISIC2019 are smaller than those observed on DermIS and DermQuest (0.81% average vs. 5-6%), which reflects the more competitive baseline performance on this larger, better-curated dataset. The dense connectivity and feature reuse inherent to DenseNet contributed to its consistent edge in accuracy, precision, and F1 score, confirming its suitability as the preferred backbone for melanoma classification tasks at scale.

TL;DR: On the 25,331-image ISIC2019 dataset, DenseNet reached 97% accuracy, 98.7% F1 score, and 90% precision, outperforming AlexNet and the reference model by Olayah et al. (96.1% accuracy) by 0.83%.
Pages 25-26
Constraints, Clinical Translation, and Next Steps

Backbone architecture limitations: The authors acknowledge that AlexNet, while historically significant, is now considered outdated compared to modern architectures such as image transformers and more advanced CNNs. They recognize the need to transition to more performant backbone architectures in future work. The relatively modest improvements on ISIC2019 (0.81% average) suggest that the model may be approaching a performance ceiling with older architectures on larger, well-curated datasets.

Dataset and generalization concerns: The DermIS and DermQuest datasets are relatively small, with only 69 and 137 original images respectively before augmentation. Although data augmentation expanded these to 621 and 1,233 images, the underlying diversity of skin lesion presentations remains limited. The model's generalization to clinical scenarios requires validation across heterogeneous, real-world datasets that include variations in image quality, lighting conditions, and patient demographics across different skin tones.

Interpretability gap: The study does not include gradient-based class activation maps (Grad-CAM) or other explainability techniques that would allow dermatologists to understand which image regions influenced the model's classification decisions. Building trust among medical professionals requires transparency into model predictions, and the authors identify this as a promising avenue for future work.

Future directions: The authors plan to apply the model to larger datasets and incorporate more advanced deep learning techniques to improve classification accuracy. Comprehensive clinical investigations are needed to evaluate the model's effectiveness in practical settings and measure its impact on patient outcomes. The authors also highlight the need for careful examination of transfer learning-induced biases to ensure equitable predictions across diverse patient populations, along with exploration of alternative architectures to further improve performance.

TL;DR: Key limitations include the use of an outdated AlexNet backbone, small original dataset sizes (69 and 137 images before augmentation), lack of model interpretability, and absence of clinical validation. Future work targets more modern architectures, larger datasets, explainability techniques, and real-world clinical trials.
Citation: Almufareh MF, Tariq N, Humayun M, Khan FA. Open Access, 2024. Available at: PMC11119457. DOI: 10.1177/20552076241253757. License: CC BY-NC.