Skin cancer is one of the most common cancers worldwide. Melanoma, its most aggressive subtype, accounted for an estimated 97,160 new U.S. diagnoses in 2023 (5.0% of all new cancer cases) and 7,990 deaths (1.3% of all cancer deaths). Between 2016 and 2020, approximately 21 per 100,000 people were diagnosed with melanoma in the U.S. each year, and 1,413,976 individuals were living with melanoma as of 2020. The five-year survival rate is 93.5% overall, climbing to 99.6% when melanoma is caught early at the localized stage. The problem is that only 77.6% of melanomas are diagnosed at this favorable local stage, meaning nearly one in four cases is caught after the cancer has begun to spread.
Current diagnostic accuracy: Visual inspection by dermatologists, the most common screening method, achieves only about 60% accuracy. Dermoscopy improves this to approximately 89% overall, with a sensitivity of 82.6% for melanocytic lesions, 98.6% for basal cell carcinoma, and 86.5% for squamous cell carcinoma. However, dermoscopy still struggles with featureless and early-stage melanomas that lack distinctive dermoscopic patterns.
Computer-aided detection pipeline: Automated skin cancer diagnosis typically follows five steps: image acquisition, pre-processing, segmentation, feature extraction, and classification. A key challenge is that artifacts in dermoscopic images, including hairs, dark corners, water bubbles, marker signs, ink marks, and ruler marks, can cause misclassification and inaccurate segmentation. Deep learning methods have shown promising results in handling these complexities because they can learn to extract relevant features directly from raw image data.
This 2023 review from Naqvi et al. surveys the most recent deep learning research (primarily 2021 and 2022 publications) on skin cancer classification. It builds on earlier reviews by Pacheco and Krohling (2019), Lucieri et al. (2021), Adegun and Viriri (2021), Dildar et al. (2021), and Gilani and Marques (2023, focused on GANs), adding coverage of the latest architectures, datasets, and performance benchmarks.
The review catalogs the foundational convolutional neural network (CNN) architectures that underpin nearly every skin cancer classification study. CNNs learn directly from image data through stacked layers of convolution (applying filters to detect features), activation (ReLU, which zeroes out negative values to speed training), and pooling (downsampling to reduce parameters). Across tens to hundreds of layers, these operations progress from detecting basic edges and brightness to identifying complex, diagnosis-relevant features.
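As a toy illustration of these three operations, here is a pure-Python sketch; the tiny image, the vertical-edge kernel, and all sizes are invented for the example, and real pipelines use optimized libraries rather than nested loops:

```python
# Toy sketch of the three core CNN operations on a tiny grayscale "image".

def conv2d(image, kernel):
    """Valid 2D convolution (technically cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu(fmap):
    """Zero out negative activations."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Downsample by taking the max of each non-overlapping 2x2 window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 0],
         [2, 1, 0, 1, 1],
         [0, 2, 1, 0, 2]]
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]   # simple vertical-edge detector

features = max_pool2x2(relu(conv2d(image, edge_kernel)))
```

Stacking many such convolution/activation/pooling stages is what lets deeper layers respond to progressively larger and more abstract patterns in the lesion image.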
AlexNet (2012): Proposed by Krizhevsky et al., AlexNet contains 5 convolutional and 3 fully connected layers, totaling roughly 60 million trainable parameters. It achieved top-1 and top-5 error rates of 37.5% and 17.0% on ImageNet LSVRC-2010, and mitigated overfitting with dropout layers. VGG (2014): From the Visual Geometry Group at Oxford, VGG-16 uses 13 convolutional layers with small 3x3 kernels in place of larger filters, achieving 92.7% top-5 accuracy on ImageNet's 14 million images across 1,000 classes. VGG-19 adds further depth.
ResNet (2016): He et al. introduced skip connections to solve the vanishing and exploding gradient problems that plagued very deep networks. These residual connections allow layers that hinder training to be bypassed, enabling networks with 50, 101, or even 152 layers. DenseNet (2017): Huang et al. connected each layer to all subsequent layers within a dense block, so every layer receives the feature maps of all preceding layers and passes its own forward. This design strengthened feature propagation, enabled feature reuse, and reduced the total parameter count.
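The residual idea can be sketched in a few lines of plain Python; the two-layer transform, the weights, and the input values below are entirely hypothetical:

```python
# Minimal sketch of a residual (skip) connection: the block outputs
# F(x) + x, so if the learned transform F contributes nothing useful,
# the block can fall back to the identity instead of degrading the signal.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    out = linear(relu(linear(x, w1, b1)), w2, b2)  # F(x)
    return [f + xi for f, xi in zip(out, x)]       # F(x) + x

# With all-zero weights, F(x) = 0 and the block reduces to the identity:
x = [0.5, -1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]
assert residual_block(x, zeros, [0.0] * 3, zeros, [0.0] * 3) == x
```

Because the identity path always exists, gradients can flow directly through the shortcut, which is what makes 100+ layer networks trainable.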
MobileNet (2017): Howard et al. designed a lightweight architecture for mobile and edge deployment by replacing standard 3x3 convolutions with depthwise separable convolutions (a 3x3 depthwise convolution followed by a 1x1 pointwise convolution), dramatically reducing the parameter count. This architecture is particularly relevant for real-time skin cancer screening on mobile devices and IoT hardware like Raspberry Pi boards.
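The parameter saving from depthwise separable convolutions is easy to verify with a little arithmetic; the channel counts below are chosen arbitrarily for illustration:

```python
# Parameter-count comparison behind MobileNet's depthwise separable
# convolutions (biases ignored for simplicity).

def standard_conv_params(c_in, c_out, k=3):
    """A standard k x k conv mixes space and channels in one step."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv then mixes channels
    return depthwise + pointwise

# e.g. a 256 -> 256 channel layer with 3x3 kernels:
std = standard_conv_params(256, 256)   # 589,824 parameters
sep = separable_conv_params(256, 256)  # 67,840 parameters
print(f"reduction: {std / sep:.1f}x")  # roughly 8.7x fewer parameters
```

This roughly order-of-magnitude reduction is what makes real-time inference feasible on hardware like a Raspberry Pi.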
The review surveys over 25 studies and reports a wide range of classification performance. Among the strongest results, Gajera et al. evaluated eight pre-trained CNNs (AlexNet, VGG-16, VGG-19, Inception v3, ResNet-50, MobileNet, EfficientNet B0, DenseNet-121) across four datasets (ISIC 2016, ISIC 2017, PH2, HAM10000). DenseNet-121 as a feature extractor with an MLP classifier achieved the best accuracy of 98.33% and an F1 score of 0.96, though the training sets were small (200 to 2,000 images), raising overfitting concerns.
Artifact removal approaches: Alenezi et al. used wavelet transform and pooling to remove hair artifacts from dermoscopic images before feeding them to a deep residual network, achieving 96.91% accuracy and an F1 score of 0.95 on ISIC 2017/HAM10000 with ReLU activation. Shinde et al. combined a Squeeze algorithm with MobileNet (Squeeze-MNet) for IoT deployment on a Raspberry Pi 4 board, reaching 99.36% accuracy on ISIC, though with lower sensitivity and specificity than baselines, and more parameters than MobileNetV2.
Transfer learning and NASNet: Abbas and Gul used NASNet with geometric data augmentation on ISIC 2020 to achieve 97.7% accuracy and an F1 score of 0.97. Alenezi et al. separately used ResNet-101 features with an SVM classifier, reaching 96.15% on ISIC 2019 and 97.15% on ISIC 2020, though the first dataset contained only 1,168 images. Kousis et al. benchmarked eleven CNN architectures on HAM10000 for seven-class classification and found DenseNet169 achieved the best accuracy of 92.25% with an F1 score of 0.932.
Ensemble and stacked methods: Shorfuzzaman proposed an explainable stacked ensemble combining DenseNet121, Xception, and EfficientNetB0, achieving 95.76% accuracy and AUC of 0.957 on 3,297 ISIC images for binary (melanoma vs. non-melanoma) classification. Bassel et al. used a Stacked CV method combining deep learning, SVM, random forest, neural networks, KNN, and logistic regression in three levels, with Xception features yielding 90.9% accuracy on only 2,637 training images. Kausar et al. ensembled ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19 with majority voting to achieve 98.6% on the ISIC archive.
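The hard majority-voting step in ensembles like Kausar et al.'s reduces to a few lines; the model outputs below are invented, and ties are broken by first-seen label here, which is a simplification:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model labels for one image."""
    return Counter(predictions).most_common(1)[0][0]

# Five hypothetical models voting on one lesion:
votes = ["melanoma", "nevus", "melanoma", "melanoma", "nevus"]
assert majority_vote(votes) == "melanoma"
```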
Several studies integrated segmentation as a pre-processing step before classification, aiming to isolate the lesion region of interest (ROI) for more accurate diagnosis. Alam et al. proposed S2C-DeLeNet, which replaced the encoder of U-Net with EfficientNet-B4 for segmentation, then used an encoder-decoder network for feature extraction and classification. Trained on HAM10000, it achieved a mean Dice score of 0.9494 for segmentation and mean accuracy of 91.03% for classification across seven lesion classes.
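For reference, the Dice score used to evaluate segmentation quality (e.g. the 0.9494 figure above) is straightforward to compute from binary masks; the toy masks below are invented:

```python
# Dice = 2|A ∩ B| / (|A| + |B|) for binary masks A (prediction) and B (ground truth).

def dice_score(pred, truth):
    """pred, truth: flat lists of 0/1 pixels of equal length."""
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * intersection / total if total else 1.0  # two empty masks agree perfectly

pred  = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
assert abs(dice_score(pred, truth) - 0.75) < 1e-9   # 2*3 / (4 + 4)
```

A score of 1.0 means the predicted lesion boundary matches the ground truth exactly; 0 means no overlap at all.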
InSiNet architecture: Reis et al. developed InSiNet (Inception Block Skin Network), combining a custom CNN with U-Net segmentation. InSiNet outperformed GoogleNet, DenseNet-201, ResNet152V2, EfficientNetB0, and machine learning baselines (SVM, logistic regression, random forest), achieving 94.59% accuracy on ISIC 2018, 91.89% on ISIC 2019, and 90.54% on ISIC 2020. However, the melanoma vs. non-melanoma model was trained on only 1,323 images, limiting generalizability.
Feature selection and fusion: Khan et al. proposed a multi-stage pipeline using ResNet-101 and DenseNet-201 for feature extraction, an improved moth flame optimization (IMFO) algorithm for feature selection, multiset maximum correlation analysis (MMCA) for feature fusion, and Kernel Extreme Learning Machine (KELM) for classification. This achieved 98.70% segmentation accuracy on PH2 and 98.70% classification accuracy on HAM10000. Khan et al. separately used a 20-layer and 17-layer CNN for segmentation with joint probability distribution (JPD) and marginal distribution function (MDF) for fusion, reaching 92.70% segmentation accuracy on ISIC 2018 and 87.02% classification accuracy on HAM10000.
Lightweight segmentation: Adegun et al. built a fully convolutional encoder-decoder integrated with a probabilistic model using Gaussian kernels to refine lesion borders, achieving 98% accuracy on ISBI 2016 and PH2 with only 6.97 million parameters, compared to the next-lowest 10 million (DSNet). Malibari et al. combined Wiener filtering for noise removal, U-Net for segmentation, and SqueezeNet for feature extraction, feeding into a DNN classifier that achieved 99.90% accuracy and an F1 score of 0.990 on ISIC 2019's 25,331 images.
A recurring theme across the reviewed studies is the scarcity of labeled dermoscopic images, which drives the use of data augmentation and synthetic image generation. Gouda et al. used ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) to generate synthetic skin lesion images, expanding the training set for a CNN trained on ISIC 2018 (3,533 images). The approach achieved 83.2% accuracy, comparable to more complex architectures like ResNet-50, InceptionV3, and Inception ResNet. However, this accuracy remains below dermoscopy's diagnostic threshold, limiting its clinical utility.
ESRGAN for image enhancement: Alwakid et al. took a different approach, using ESRGAN not just for augmentation but for enhancing image quality as a pre-processing step, combined with segmentation to extract ROIs. Their CNN achieved an F1 score of 0.859, while ResNet-50 reached 0.852 on HAM10000. Although the improvement was modest, the combination of enhancement and segmentation demonstrated that image quality matters as much as model architecture.
Addressing class imbalance: Rashid et al. used geometric data augmentation with MobileNetV2-based transfer learning on ISIC 2020 to tackle the severe class imbalance between benign and malignant samples, achieving 92.8% average accuracy for binary classification. Abbas and Gul similarly relied on geometric transformations (rotation, flipping, scaling) with NASNet to reach 97.7% accuracy on ISIC 2020.
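These geometric transformations are simple, label-preserving array operations; a pure-Python sketch on an invented 2x2 toy "image":

```python
# Geometric augmentations used to expand small skin-lesion datasets.

def rotate90(img):
    """Rotate 90 degrees clockwise: transpose, then reverse each row."""
    return [list(row)[::-1] for row in zip(*img)]

def flip_horizontal(img):
    return [row[::-1] for row in img]

def flip_vertical(img):
    return img[::-1]

img = [[1, 2],
       [3, 4]]
assert rotate90(img) == [[3, 1], [4, 2]]
assert flip_horizontal(img) == [[2, 1], [4, 3]]
assert flip_vertical(img) == [[3, 4], [1, 2]]
```

Because a rotated or flipped lesion keeps its diagnosis, each original image can yield several distinct training samples at essentially no cost.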
Domain-specific augmentation: Bian et al. addressed a different bias problem entirely, using WGAN (Wasserstein GAN) to augment images collected from Asian populations. Most existing skin lesion datasets are curated from Western countries with predominantly fair-skinned patients, creating a dataset bias that degrades performance on darker skin tones. Their YOLOv3-based approach optimized with Dynamic Convolution Kernel (YoDyCK) achieved 96.2% accuracy on ISBI 2016, demonstrating that diversity-aware augmentation can partially mitigate racial bias in skin cancer AI.
The review catalogues 12 major datasets used across the surveyed studies. HAM10000 is the most widely used, containing 10,015 dermoscopic images across seven classes: 6,705 melanocytic nevi, 1,113 melanomas, 1,099 benign keratoses, 514 basal cell carcinomas, 327 actinic keratoses, 142 vascular lesions, and 115 dermatofibromas. This severe class imbalance (67% nevi vs. 1.1% dermatofibromas) is a persistent challenge for model training and evaluation.
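One standard way to counter such imbalance during training is inverse-frequency class weighting, which scales each class's loss contribution so rare classes are not drowned out by nevi. A sketch using the HAM10000 counts above (the weighting scheme is a common technique, not something this review prescribes):

```python
# Inverse-frequency class weights from the HAM10000 class counts.
ham10000_counts = {
    "melanocytic nevus": 6705, "melanoma": 1113, "benign keratosis": 1099,
    "basal cell carcinoma": 514, "actinic keratosis": 327,
    "vascular lesion": 142, "dermatofibroma": 115,
}

total = sum(ham10000_counts.values())   # 10,015 images
n_classes = len(ham10000_counts)

# weight = N / (K * n_c): classes at the average frequency get weight 1.0.
weights = {cls: total / (n_classes * n) for cls, n in ham10000_counts.items()}

# The rarest class is weighted far more heavily than the most common one:
ratio = weights["dermatofibroma"] / weights["melanocytic nevus"]   # ~58x
```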
ISIC challenge datasets: The International Skin Imaging Collaboration (ISIC) has released progressively larger datasets: ISIC 2016 (900 training images, binary), ISIC 2017 (2,000 training images, 3 classes), ISIC 2018 (12,500+ training images), ISIC 2019 (25,331 images, 8 classes), and ISIC 2020 (33,126 images from 2,000+ patients). Other notable datasets include PH2 (200 manually segmented dermoscopic images), BCN20000 (19,424 images from Hospital Clinic Barcelona, 2010-2016), Dermofit (1,300 images in 10 classes), PAD-UFES-20 (3,939 images), and Atlas of Dermoscopy (1,024 images).
Skin cancer subtypes: The paper reviews five major types. Melanoma is the most dangerous, capable of rapid metastasis, with a death rate of 2.1 per 100,000 people. Basal cell carcinoma (BCC) is the most prevalent form of skin cancer, typically appearing as flesh-colored growths or pearl-shaped bumps on sun-exposed areas. Squamous cell carcinoma (SCC) can penetrate deep into the skin if untreated. Dysplastic nevi are atypical moles that resemble melanoma features. Actinic keratoses (AKs) are pre-malignant lesions that can progress to SCC if left untreated.
A critical observation from the dataset analysis is that most benchmark datasets are curated from Western populations with predominantly fair skin. This creates a systematic bias: models trained on these datasets may underperform on patients with darker skin tones, a gap that only Bian et al. explicitly attempted to address by training on images from Asian populations.
The review provides a rare and valuable comparison of the computational resources used across the surveyed studies. The hardware ranged from modest consumer setups to dedicated research workstations. At the lower end, Reis et al. used an Intel i5 processor with 6 GB RAM and a GTX 940MX GPU (2 GB VRAM), while Bassel et al. trained on an Intel Core i4 processor with 12 GB RAM and no dedicated GPU mentioned. At the higher end, Fraiwan and Faouri used an HP OMEN with 64 GB RAM, an Intel Core i7-10700K, and an NVIDIA RTX 3080 GPU.
GPU specifications: The most common GPUs were consumer-grade NVIDIA cards: GTX 1050Ti (Gajera et al., Malibari et al.), RTX 3060 (Gouda et al., Alwakid et al.), RTX 3060Ti (Alam et al.), and GTX 1060 6GB (Kousis et al.). Professional GPUs included the Quadro P4000 (Alenezi et al., with 32 GB RAM), Tesla P100 with 16 GB (Shorfuzzaman), and P40 (Mazoure et al.). Bian et al. used dual 1080Ti GPUs. RAM allocations ranged from 8 GB to 64 GB across studies.
Edge deployment: Shinde et al. specifically targeted edge deployment on a Raspberry Pi 4 single-board computer with a 64 GB SD card, spy camera, and NeoPixel ring, using an Intel Core i5-7500 with 32 GB RAM and GTX 1050Ti for training before deploying the Squeeze-MNet model. This is one of only a handful of studies to consider real-world deployment constraints rather than purely optimizing benchmark accuracy.
The wide variation in hardware underscores an important point: many of the highest-accuracy models (using deep architectures like DenseNet-201 or ResNet-152V2) require substantial GPU resources that are not available in typical dermatology clinics. The disconnect between benchmark performance and practical deployability remains a significant barrier to clinical adoption.
Small dataset problem: The most pervasive limitation is that most models were trained and tested on very small datasets. Gajera et al. used only 200 to 2,000 training images. Maniraj and Maran tested on just 200 images. Reis et al. trained on 1,323 images. Bassel et al. used 2,637 training and 660 test images. Using deep architectures like DenseNet-121 or ResNet-101 with millions of parameters on such small datasets creates a high risk of overfitting, and the reported accuracy numbers likely do not generalize to larger, more diverse patient populations.
Binary vs. multi-class gap: Many of the highest-performing models were evaluated only on binary classification (melanoma vs. benign), which is a much simpler task than the clinically relevant multi-class problem. Mazoure et al., Ghosh et al., Shorfuzzaman, and Rashid et al. all reported results exclusively on binary tasks. When models were evaluated on multi-class datasets like HAM10000 (7 classes) or ISIC 2019 (8 classes), performance dropped significantly. Fraiwan and Faouri achieved only 82.9% accuracy and an F1 score of 0.744 on HAM10000, and Bechelli and Delhommelle reported just 70% accuracy and 0.68 precision on the same dataset.
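For reference, the precision and F1 figures quoted throughout follow the standard confusion-matrix definitions; the counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Precision, recall, and F1 from binary confusion-matrix counts.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)              # of flagged lesions, how many were malignant
    recall = tp / (tp + fn)                 # of malignant lesions, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical screen: 80 true positives, 20 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
assert abs(p - 0.8) < 1e-9
assert abs(r - 80 / 90) < 1e-9
```

Because F1 ignores true negatives, it is less flattering than plain accuracy on imbalanced multi-class datasets, which is one reason the HAM10000 F1 scores above lag the headline accuracy numbers.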
Below-dermoscopy accuracy: The stated goal of deep learning in skin cancer is to match or exceed dermatologist-level accuracy, yet several reviewed models actually performed below dermoscopy's 89% baseline. Gouda et al. achieved only 83.2%. Aljohani and Turki reached 76.09%. Bechelli and Delhommelle hit 70% on HAM10000. These results suggest that simply applying deeper or more complex architectures does not automatically translate to clinical utility.
Skin-tone bias: Nearly all datasets are curated from Western countries with predominantly fair-skinned patients. Only Bian et al. addressed this by training on images from Asian populations. Models trained on biased data will systematically underperform on darker skin tones, which is a serious equity concern given that melanoma in darker-skinned patients is often diagnosed at later stages.

Lack of real-time deployment: Very few studies considered hardware constraints or clinical workflow integration. The computational cost of deep architectures like DenseNet-169, ResNet-152V2, or multi-model ensembles makes them impractical for point-of-care use without significant optimization.
Larger and more diverse datasets: The most urgent need identified by the authors is the creation of larger skin lesion datasets with representation across skin tones, ethnicities, and geographic regions. Current datasets top out at approximately 33,000 images (ISIC 2020), which is small compared to the millions of images used to train general-purpose models. Multi-institutional data collection efforts, federated learning approaches, and synthetic data generation through GANs could help expand both the volume and diversity of training data.
Hardware implementation for real-time diagnosis: The authors specifically call for more research into deploying deep learning models on embedded hardware to assist dermatologists in real-time clinical settings. The Squeeze-MNet approach by Shinde et al. (targeting Raspberry Pi 4) represents a promising but isolated effort. Model compression techniques such as pruning, quantization, and knowledge distillation could make high-accuracy architectures like DenseNet or ResNet viable for mobile and point-of-care devices without sacrificing diagnostic performance.
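Of these, post-training quantization is the simplest to sketch: symmetric linear quantization maps float weights to 8-bit integers, shrinking model size roughly 4x (float32 to int8) at some cost in precision. The weight values below are invented, and real toolchains handle this (plus calibration) automatically:

```python
# Symmetric linear quantization of a float weight vector to int8.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] via a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.81, -0.42, 0.05, -1.27, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original:
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
assert all(-127 <= v <= 127 for v in q)
```

Storing `q` as int8 plus one float scale per tensor is what delivers the memory savings on devices like the Raspberry Pi.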
Addressing the accuracy gap: Several studies in this review achieved accuracies below the 89% dermoscopy baseline, indicating that model architecture alone is insufficient. Future work should focus on combining improved pre-processing (artifact removal, image enhancement), better augmentation strategies, and architecturally efficient models. Uncertainty quantification, as explored by Abdar et al. using Monte Carlo dropout, ensemble MC dropout, and deep ensembles with three-way decision theory, could help identify cases where the AI system should defer to a human dermatologist rather than making a low-confidence prediction.
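A toy sketch of the Monte Carlo dropout idea, in the spirit of Abdar et al.: keep dropout active at inference, run several stochastic forward passes, and treat high variance across passes as low confidence. The tiny one-layer "model", its weights, and the input are entirely hypothetical:

```python
import math
import random

def forward_with_dropout(x, weights, p=0.5, rng=random):
    """One stochastic pass: randomly drop each weight, then a sigmoid output."""
    kept = [w if rng.random() > p else 0.0 for w in weights]
    z = sum(w * xi for w, xi in zip(kept, x)) / (1 - p)  # inverted-dropout scaling
    return 1 / (1 + math.exp(-z))

def mc_predict(x, weights, passes=100, seed=0):
    """Mean prediction plus variance across stochastic passes."""
    rng = random.Random(seed)
    preds = [forward_with_dropout(x, weights, rng=rng) for _ in range(passes)]
    mean = sum(preds) / passes
    var = sum((v - mean) ** 2 for v in preds) / passes
    return mean, var

mean, var = mc_predict([1.0, 0.5, -0.3], [0.8, -0.4, 1.2], passes=200)
# A high variance would flag this case for review by a human dermatologist.
```

In a deployed system, a variance threshold would route uncertain lesions to the clinician instead of emitting a low-confidence automated diagnosis.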
Explainability and clinical trust: While Shorfuzzaman's explainable CNN-based stacked ensemble is a step in the right direction, most reviewed studies treat their models as black boxes. For clinical adoption, dermatologists need to understand why a model classified a lesion as malignant, not just that it did so. Techniques such as Grad-CAM, SHAP, and attention map visualization should become standard components in skin cancer AI pipelines. The ultimate goal is an AI-powered diagnostic tool that is accurate, explainable, equitable across skin tones, and deployable at the point of care.