Malignant melanoma is the deadliest form of skin cancer, and its incidence rate has been rising rapidly worldwide. The most effective path to successful treatment is early diagnosis, because melanoma caught early can often be cured with a simple excision. However, the visual similarity between melanoma and benign skin lesions (such as nevi) makes manual classification unreliable, even for experienced clinicians. This paper introduces a framework that combines deep transfer learning with ensemble classification to automatically distinguish melanoma from non-melanoma lesions in clinical images.
The core challenge: Melanoma classification from images is complicated by several factors: melanoma and benign lesions are highly similar in appearance; segmentation of the affected area is difficult due to variations in texture, size, color, shape, and location; and noise from hair, veins, and inconsistent image-capture conditions further degrades performance. Traditional hand-crafted features (such as color histograms or texture descriptors) and standard segmentation approaches often produce unsatisfactory results because they lack the representational power to handle this variability.
Deep learning as a solution: In recent years, convolutional neural networks (CNNs) have emerged as an effective tool for medical image analysis. Deep learning automates the feature extraction process, replacing hand-crafted descriptors with learned representations that capture far more nuanced patterns. However, two persistent problems remain: the high variation among melanoma subtypes and the class imbalance present in most datasets, where benign lesions vastly outnumber melanoma cases. These issues lead to overfitting and poor generalization.
This paper's contribution: The authors propose a three-stage framework. First, preprocessing handles image resizing and data balancing. Second, transfer learning extracts features from pretrained deep neural networks (AlexNet, GoogLeNet, ResNet18, ResNet50). Third, an ensemble learning layer combines multiple classifiers (SVM, Logistic Label Propagation, KNN) with different feature sets to make a final melanoma or non-melanoma decision. The key insight is that using multiple feature representations and multiple classifiers together outperforms any single model approach.
The paper surveys a broad range of prior methods for melanoma detection, spanning hand-crafted feature descriptors, segmentation-based approaches, and deep learning solutions. Early methods relied on custom-designed features. For example, color pigment boundary descriptors used standard camera images with textural and morphological features fed into a multilayer perceptron. High-level intuitive features (HLIFs) simulated human-observable characteristics such as color asymmetry, structural asymmetry (via Fourier descriptors), and border irregularity (via morphological opening and closing). The BIBS (boundary intersection-based signature) descriptor was designed specifically for evaluating concave contours in lesion boundaries, with SVM used for classification.
Texture and segmentation methods: Local Binary Patterns (LBP) combined with block difference of inverse probabilities were explored for texture analysis, compared against raw pixel intensity values, and classified using both CNNs and SVMs. Other approaches used non-dermoscopic clinical images with illumination correction algorithms and extracted color and texture descriptors, with final predictions made via majority voting. The PECK (Predict-Evaluate-Correct K-fold) algorithm merged deep CNNs with SVM and random forest classifiers, and introduced SCIDOG segmentation for detecting lesion contours even in the presence of significant noise and hair.
Deep learning approaches: Several studies applied CNNs directly. One approach proposed an objective feature extraction function for CNNs using principal component analysis (PCA) during training to increase variance between images and make features more discriminative. Another built a computer-aided diagnosis system that combined CNN features with statistical and contrast location features from segmented raw images. DeepPCA used a novel objective function maximizing variation separability rather than categorical cross-entropy.
Multiple instance learning (MIL): MIL-based approaches received attention for melanoma detection, where sets of instances (bags) are labeled collectively. One MIL approach used spherical separation surfaces to discriminate melanoma from dysplastic nevi. A preliminary comparison between SVM and MIL highlighted the key role of feature selection using color and texture features for dermoscopic image classification.
The framework operates in three integrated stages. The first stage handles preprocessing: data balancing and image resizing. Class imbalance is a major problem in melanoma datasets, where benign lesions typically outnumber melanoma cases significantly. To address this, the authors augment the minority class (melanoma) by adding images altered through K-Means color segmentation. This approach generates new training samples by isolating color segments of the image that could contain melanoma, providing a balance between the two classes without simply duplicating existing images.
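The paper does not spell out the exact augmentation procedure, but the idea of generating new minority-class samples via K-Means color segmentation can be sketched as color quantization: each pixel is replaced by its nearest cluster centroid, isolating the dominant color segments that may contain the lesion. A minimal sketch using scikit-learn (function name and cluster count are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_color_segment(image, n_colors=4, seed=0):
    """Quantize an RGB image to its n_colors dominant colors via K-Means.

    Each pixel is replaced by its cluster centroid, isolating color
    segments that may correspond to the lesion. The result can be added
    to the minority (melanoma) class as an augmented training sample.
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=seed).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]  # map each pixel to its center
    return quantized.reshape(h, w, c).astype(image.dtype)

# A synthetic 32x32 RGB array stands in for a melanoma photograph.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
aug = kmeans_color_segment(img, n_colors=4)
```

The augmented image has the same dimensions as the original but at most `n_colors` distinct colors, so it adds a genuinely altered sample rather than a duplicate.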
Image resizing: Each pretrained network requires a specific input dimension: AlexNet takes 227 x 227 pixels, while GoogLeNet, ResNet18, and ResNet50 all take 224 x 224 pixels. All images are resized to match these dimensions before feature extraction. The authors note that this normalization step does not alter the information content enough to degrade performance, which is notable because resizing can otherwise cause quality loss and detail degradation.
Transfer learning and feature extraction: The second stage uses pretrained CNNs originally trained on ImageNet (1 million images across 1,000 classes) as feature extractors. Rather than training networks from scratch, the authors leverage the rich feature representations already learned by these networks. Features are extracted from specific layers chosen for each architecture: AlexNet uses the fc7 layer (4,096 neurons), GoogLeNet uses the pool5-7x7_s1 layer (1,024 neurons), ResNet18 uses the pool5 layer (512 neurons), and ResNet50 uses the avg_pool layer (2,048 neurons). The last two layers of each network are replaced with new layers matching the binary classification task (melanoma vs. not-melanoma).
Network adaptation: The pretrained networks are fine-tuned with specific hyperparameters: mini-batch size of 5, maximum epochs of 10, initial learning rate of 3 x 10^-4, and the SGDM (stochastic gradient descent with momentum) optimizer. Learning rate factors are increased in the new layers relative to transferred layers to speed up learning. Optionally, weights of earlier layers can be frozen (learning rate set to zero) to prevent overfitting, which is especially important for small datasets where the risk of overfitting is high.
The third and most novel stage of the framework is the ensemble learning layer, which combines the outputs of multiple classifiers trained on different feature sets. Given a set of n classifiers (C) and m vectors of transferred learning features (F), the framework constructs an n x m matrix of all possible classifier-feature combinations. Each combination produces a binary decision (melanoma = 1, not-melanoma = -1) for a given image. This creates a decision matrix D where each entry represents the prediction from a specific classifier using a specific feature set.
Score aggregation: Alongside the decision matrix, the framework also tracks a score matrix S, where each entry is the posterior probability P(x|i) that image i belongs to class x. This probability adds nuance beyond the binary decision, capturing how confident each classifier-feature pair is in its prediction. Combining hard decisions with soft probability scores allows the ensemble to make more informed final predictions.
Mode-based fusion: The columns of the decision matrix (corresponding to each feature type) are analyzed using the statistical mode, which identifies the most frequently occurring decision across all classifiers for a given feature set. This step determines which classifiers agree most strongly when using the same features. For each modal value (the most frequent decision), the corresponding posterior probabilities from the score matrix are extracted and averaged, creating a decision-score vector DS. The final classification selects the decision from the modal value vector DM that corresponds to the position of the highest average score in DS. In essence, the framework selects the feature representation that produces the most confident and consistent classification across all classifiers.
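The fusion rule described above can be sketched in a few lines of NumPy. Matrix shapes and the tie-breaking choice (ties in the binary mode default to +1) are assumptions where the paper is silent:

```python
import numpy as np

def ensemble_fuse(D, S):
    """Mode-based fusion over an n-classifier x m-feature decision matrix.

    D[i, j] in {+1, -1}: decision of classifier i using feature set j.
    S[i, j] in [0, 1]:   posterior probability behind that decision.
    Returns +1 (melanoma) or -1 (not-melanoma).
    """
    n, m = D.shape
    DM = np.empty(m, dtype=int)   # modal decision per feature set
    DS = np.empty(m)              # mean score behind each modal decision
    for j in range(m):
        # Mode over {+1, -1} is the sign of the column sum (ties -> +1).
        DM[j] = 1 if D[:, j].sum() >= 0 else -1
        agree = D[:, j] == DM[j]
        DS[j] = S[agree, j].mean()
    # Pick the feature set whose modal decision is backed by the highest
    # average posterior probability, and return that decision.
    return DM[np.argmax(DS)]

# 5 classifiers x 3 feature sets (e.g. AlexNet, GoogLeNet, ResNet50).
D = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [ 1, -1, -1],
              [-1, -1, -1],
              [ 1, -1,  1]])
S = np.array([[0.9, 0.6, 0.7],
              [0.8, 0.7, 0.9],
              [0.7, 0.8, 0.8],
              [0.6, 0.9, 0.7],
              [0.9, 0.6, 0.6]])
print(ensemble_fuse(D, S))  # the first feature set wins with mean score 0.825
```

Here the first column's modal decision (+1, melanoma) is backed by the highest average confidence, so it becomes the final prediction.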
Classifier configurations: The ensemble includes five classifier configurations: SVM with a polynomial kernel (auto kernel scale); SVM with a Gaussian kernel (auto kernel scale); Logistic Label Propagation with an RBF kernel (regularization parameter 1, maximum 1,000 iterations); KNN with k=3 using Spearman distance; and KNN with k=4 using correlation distance. This diversity of classifiers ensures that the ensemble captures different decision boundaries and distance metrics.
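A scikit-learn approximation of these five configurations might look as follows. Note two substitutions: scikit-learn does not ship Logistic Label Propagation, so its graph-based LabelPropagation with an RBF kernel stands in for it here, and the Spearman distance is supplied as a custom metric via SciPy:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import correlation  # 1 - Pearson correlation
from sklearn.svm import SVC
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import KNeighborsClassifier

def spearman_dist(u, v):
    # Spearman distance = 1 - Spearman rank correlation.
    return 1.0 - spearmanr(u, v).correlation

classifiers = {
    "svm_poly": SVC(kernel="poly", gamma="scale", probability=True),
    "svm_rbf": SVC(kernel="rbf", gamma="scale", probability=True),
    # Stand-in for Logistic Label Propagation (not available in sklearn).
    "label_prop": LabelPropagation(kernel="rbf", max_iter=1000),
    "knn3_spearman": KNeighborsClassifier(n_neighbors=3, metric=spearman_dist),
    "knn4_corr": KNeighborsClassifier(n_neighbors=4, metric=correlation),
}

# Smoke test on synthetic 8-dimensional "features".
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = np.array([0] * 10 + [1] * 10)
for clf in classifiers.values():
    clf.fit(X, y)
```

With `probability=True` on the SVMs (and `predict_proba` on the others), each classifier can supply the posterior scores the ensemble's score matrix requires.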
The framework uses bootstrapping as its train-and-test strategy. Bootstrapping is a statistical technique that creates samples of size B by randomly sampling with replacement from a dataset of size N. The resulting bootstrap samples are approximately independent and identically distributed (iid), provided the original dataset is large enough to capture the underlying distribution and the bootstrap sample size B is smaller than N to avoid excessive correlation. In the proposed framework, bootstrapping is applied to the feature set F to generate training and testing splits for the classifiers.
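One plausible reading of this scheme, sketched below, is to draw the training set with replacement and test on the indices left out of the draw; the paper does not specify how the held-out 20% is formed, so the out-of-sample test set here is an assumption:

```python
import numpy as np

def bootstrap_split(n_samples, train_frac=0.8, seed=0):
    """One bootstrap iteration: draw the training set with replacement,
    then test on the examples that were never drawn."""
    rng = np.random.default_rng(seed)
    b = int(train_frac * n_samples)  # B < N to limit correlation
    train_idx = rng.choice(n_samples, size=b, replace=True)
    test_idx = np.setdiff1d(np.arange(n_samples), train_idx)
    return train_idx, test_idx

# 10 iterations over a 170-image dataset (the MED-NODE size).
splits = [bootstrap_split(170, seed=i) for i in range(10)]
train, test = splits[0]
```

Each of the 10 iterations yields a fresh 136-sample training draw, and the ensemble fusion step is evaluated on each split.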
Training protocol: For both experimental procedures, the data is split into 80% training and 20% testing sets. This split is repeated for 10 iterations, with each iteration drawing a new bootstrap sample. At each iteration, the framework evaluates all classifier-feature combinations and selects the best-performing one via the ensemble fusion strategy. This repeated sampling approach creates a competitive environment where different classifiers and features are tested across multiple data splits, helping to identify the most robust combination.
MED-NODE dataset: The first dataset was created by the Department of Dermatology at the University Medical Center Groningen (UMCG). It contains 170 non-dermoscopic images: 70 melanoma and 100 nevi. Image dimensions vary significantly, ranging from 201 x 257 to 3,177 x 1,333 pixels. For this dataset, the framework uses AlexNet, GoogLeNet, and ResNet50 for feature extraction.
Skin-lesion dataset: The second dataset contains 206 images of skin lesions captured with standard consumer-grade cameras under varying and unconstrained environmental conditions. These images were sourced from the Dermatology Information System (DermIS) and DermQuest online databases. Of these, 119 are melanomas and 87 are non-melanoma. For this dataset, the framework uses ResNet50 and ResNet18. The choice of network combinations was not random but based on a study of network characteristics most suitable for each dataset, with alternative combinations not yielding expected results.
The framework was evaluated using 7 standard metrics: true positive rate (sensitivity), true negative rate (specificity), positive predictive value (precision), negative predictive value, accuracy, F1-score, and Matthews correlation coefficient (MCC). On the MED-NODE dataset, the proposed approach achieved the best accuracy among all compared methods. The ResNet50+GoogLeNet+AlexNet combination achieved 93% accuracy, outperforming all 24 competing methods, including the previous best of 92% from Benjamin Albert's PECK algorithm.
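All seven metrics follow directly from the binary confusion matrix. The counts below are hypothetical, chosen only to be consistent with the reported sensitivity (0.90) and specificity (0.97) given MED-NODE's 70 melanoma and 100 nevi images:

```python
import numpy as np

def binary_metrics(tp, fn, tn, fp):
    """The seven evaluation metrics from a binary confusion matrix."""
    sens = tp / (tp + fn)                      # true positive rate
    spec = tn / (tn + fp)                      # true negative rate
    ppv = tp / (tp + fp)                       # precision
    npv = tn / (tn + fn)
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * ppv * sens / (ppv + sens)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(sensitivity=sens, specificity=spec, ppv=ppv,
                npv=npv, accuracy=acc, f1=f1, mcc=mcc)

# Hypothetical counts: 63 of 70 melanomas and 97 of 100 nevi correct.
m = binary_metrics(tp=63, fn=7, tn=97, fp=3)
print(round(m["sensitivity"], 2), round(m["specificity"], 2))  # 0.9 0.97
```

MCC is often preferred over accuracy for imbalanced data because it only rewards predictions that correlate with the labels across both classes.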
Sensitivity and specificity: The ResNet50+GoogLeNet+AlexNet combination achieved a true positive rate (sensitivity) of 0.90 and a true negative rate (specificity) of 0.97, meaning it correctly identified 90% of melanoma cases and 97% of non-melanoma cases. The specificity of 0.97 was the highest among all methods tested. High specificity is clinically important because it reduces false positives, meaning fewer unnecessary biopsies and less patient anxiety. The ResNet50+ResNet18 combination also performed well, with sensitivity of 0.81 and specificity of 1.00 (perfect), though with slightly lower overall accuracy at 90%.
Precision and predictive values: The positive predictive value for the three-network combination reached 0.97, and the negative predictive value was 0.90. These values indicate that when the framework predicts melanoma, it is correct 97% of the time, and when it predicts non-melanoma, it is correct 90% of the time. This balance between precision and recall is important for clinical deployment, where both missed melanomas and false alarms have significant consequences.
MCC and F1 scores: The MCC reached 0.87 for the three-network combination, which is considered excellent for a binary classification task and indicates strong correlation between predictions and actual labels. The F1-positive score was 0.93 and F1-negative was 0.94, demonstrating well-balanced performance across both classes. For comparison, the previous best MCC was 0.83 from the PECK algorithm, and many competing methods scored below 0.50.
On the Skin-lesion dataset, the proposed framework achieved competitive results against 10 comparison methods. The ResNet50+ResNet18 combination achieved an accuracy of 88%, which was the second-highest result, surpassed only by the BIBS descriptor method at 90%. The ResNet50+GoogLeNet+AlexNet combination achieved 76% accuracy on this dataset, which was notably lower than its MED-NODE performance, suggesting that the three-network combination may not generalize as well to consumer-grade camera images with unconstrained conditions.
Sensitivity and specificity trade-offs: The ResNet50+ResNet18 combination achieved a sensitivity of 0.84 and a specificity of 0.92 on the Skin-lesion dataset. The specificity of 0.92 was the highest among all methods, consistent with the strong specificity performance observed on MED-NODE. The ResNet50+GoogLeNet+AlexNet combination showed higher sensitivity (0.87) but lower specificity (0.65), revealing a trade-off: the three-network ensemble captures more true melanomas but at the cost of more false positives. The best sensitivity overall was 0.96 from HLIFs, though that method had lower specificity at 0.73.
Negative predictive values: The ResNet50+ResNet18 combination achieved the highest negative predictive value of 0.88 on the Skin-lesion dataset, meaning that when the framework predicts a lesion is not melanoma, it is correct 88% of the time. This is clinically significant because a high NPV provides confidence that negative predictions are reliable, reducing the risk of missed melanomas. The positive predictive value reached 0.91 for HLIFs and 0.92 for BIBS, while the proposed method's ResNet50+ResNet18 combination achieved a competitive PPV.
Cross-dataset insights: The different performance patterns across the two datasets highlight important practical considerations. MED-NODE images have highly variable dimensions (201 x 257 to 3,177 x 1,333 pixels) but are non-dermoscopic clinical images. The Skin-lesion dataset uses consumer-grade camera images under uncontrolled conditions. The framework's stronger performance on MED-NODE suggests that the deep transfer learning features are particularly effective for clinical-quality non-dermoscopic images, while consumer-grade images with uncontrolled lighting and angles present additional challenges.
Computational complexity: The authors acknowledge that the main weakness of the proposed framework is its computational load. Even though the pretrained networks include layers with already-tuned weights (reducing training time compared to training from scratch), the feature extraction phase remains expensive. The fully connected layers in each network make the architecture extremely dense, and running feature extraction through 4 separate deep networks multiplies this cost. For clinical deployment, where real-time or near-real-time predictions may be needed, this could be a significant barrier.
Small dataset limitations: Both evaluation datasets are small by deep learning standards. MED-NODE contains only 170 images and Skin-lesion contains 206 images. While bootstrapping and data balancing via K-Means segmentation help mitigate this limitation, the results may not generalize to larger, more diverse patient populations. The framework was not tested on larger benchmark datasets like ISIC (International Skin Imaging Collaboration), which contains tens of thousands of dermoscopic images and would provide a more rigorous evaluation of scalability and generalization.
Dermoscopic vs. non-dermoscopic images: The framework was evaluated exclusively on non-dermoscopic images captured with standard or consumer-grade cameras. While this is practically useful (since dermoscopes are not available in all clinical settings), dermoscopic images are the clinical standard for melanoma assessment. The absence of dermoscopic image evaluation means the framework's performance on the most common clinical imaging modality remains unknown. Future work should include datasets like PH2, which the authors specifically mention as a target.
Future directions: The authors outline several directions for future research. First, exploring additional CNN architectures beyond AlexNet, GoogLeNet, ResNet18, and ResNet50, as newer architectures like DenseNet, EfficientNet, and Vision Transformers have since shown strong performance on medical imaging tasks. Second, applying the framework to additional datasets, including PH2 and larger collections. Third, extending the approach beyond melanoma detection to other skin lesion classification tasks. Finally, the authors suggest investigating alternative dataset balancing approaches, such as adding zero-mean Gaussian noise with variance of 0.0001 to duplicated melanoma images, as an alternative to the K-Means segmentation augmentation used in this study.