Melanoma is one of the deadliest forms of skin cancer because it progresses rapidly and has high mortality rates when caught late. Early detection is the single most important factor in improving patient survival. Traditionally, dermatologists examine skin lesions using a handheld magnification tool called a dermoscope, but this process is subjective and varies significantly from one clinician to another.
This review paper surveys the latest advances in computer vision and deep learning for automating and improving early melanoma detection. The authors evaluate cutting-edge neural network architectures including YOLO, GAN, Mask R-CNN, ResNet, and DenseNet, assessing how each one contributes to better detection, segmentation, and classification of skin lesions in dermoscopic images.
A key finding is that AI systems trained on primary care referral data have achieved a top-3 accuracy of 93% and a specificity of 83%, matching the performance of board-certified dermatologists and exceeding that of primary care physicians and nurse practitioners. This suggests AI can serve as a powerful triage tool, especially in settings where specialist access is limited.
The paper also emphasizes that robust AI depends on high-quality training data. The authors catalog major publicly available datasets, from the 200-image PH2 set to the massive ISIC 2024 challenge dataset containing over 401,000 images, and argue that dataset diversity is essential for building models that generalize across different skin types and clinical conditions.
Building effective melanoma detection AI requires large, well-annotated collections of skin lesion images. The paper provides a comprehensive catalog of the most important publicly available datasets. PH2 from Hospital Pedro Hispano in Portugal contains 200 dermoscopic images (40 melanomas, 160 other lesions) and is commonly used for benchmarking. The HAM10000 dataset is far larger with 10,015 images across multiple lesion types and 1,113 confirmed melanomas, making it one of the most widely used training sets in the field.
The ISIC (International Skin Imaging Collaboration) has released progressively larger challenge datasets each year. ISIC 2016 started with 900 images, ISIC 2017 grew to 2,000, and ISIC 2020 reached over 33,000 images. The latest, ISIC 2024, is a massive dataset of 401,059 images with over 10,000 confirmed melanomas, released for the Kaggle competition. Other important sets include DERMQUEST (126 images), MED-NODE (170 images), DERMNET (22,500 images), and DERMOFIT (1,300 images, purchase required).
These datasets vary not only in size but also in the depth of their annotations and the diversity of lesion types. For example, DERMIS and DERMOFIT feature detailed annotations that are critical for training sophisticated machine learning models, while DERMNET provides one of the largest image collections. The availability of such varied data supports training, testing, and validating everything from standard CNNs to hybrid models combining SVMs with decision trees.
The review provides a detailed breakdown of the proportional use of various AI model architectures for melanoma detection as of 2024. ResNet is the most commonly used architecture, accounting for 14.9% of all applications. It is followed by SVM at 13.1%, DenseNet at 10.0%, and Mask R-CNN at 8.2%. Other architectures such as EfficientNet, GAN, VGG, YOLO, Inception/GoogLeNet, U-Net, AlexNet, MobileNet, Xception, NASNet, and Random Forest each occupy smaller but notable shares of the distribution.
Despite the surge in advanced neural networks, SVM (Support Vector Machine) remains surprisingly popular because it is robust in high-dimensional image feature spaces, requires less data to train effectively, is more interpretable than deep networks, and is less resource-intensive. This makes SVM especially valuable in medical settings where annotated data is scarce or where model transparency is paramount for clinical acceptance.
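To illustrate why SVMs remain practical with scarce annotated data, the sketch below (scikit-learn; not from the paper) trains an RBF-kernel SVM on synthetic stand-in feature vectors, playing the role of hand-crafted lesion descriptors such as color histograms or texture statistics:

```python
# Minimal sketch: an SVM classifier over hand-crafted lesion features.
# The feature values are synthetic stand-ins for descriptors extracted
# from dermoscopic images (colour, texture, border statistics).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 200 lesions x 32 features; melanoma class shifted to mimic separability.
X = rng.normal(size=(200, 32))
y = np.repeat([0, 1], 100)          # 0 = benign, 1 = melanoma
X[y == 1] += 0.8

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(round(clf.score(X, y), 2))    # training accuracy on the toy data
```

Note that an SVM needs only hundreds of labeled examples to fit well here, whereas a deep CNN would typically require orders of magnitude more.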
The paper notes that from 2018 to 2024, ResNet and VGG were consistently the preferred architectures for handling and interpreting complex dermoscopic imagery. Over this period, newer architectures like the YOLO family (YOLOv3 through YOLOv8) were adapted for real-time skin lesion detection, while innovations such as attention mechanisms and capsule networks have been explored to boost model specificity and sensitivity.
Looking toward 2024 and beyond, the field is embracing lightweight network architectures, enhanced transfer learning techniques, federated learning for privacy-preserving model training, and explainable AI principles to increase the transparency of neural network decision-making for clinicians.
YOLO (You Only Look Once) is a family of object detection models originally designed for tasks like identifying cars or pedestrians in video feeds. Its key advantage is speed: it processes an entire image in a single forward pass through the network, rather than scanning it region by region. This makes it uniquely suited for real-time melanoma screening scenarios where a clinician needs instant feedback.
The YOLO architecture used for melanoma detection takes in 448 x 448 pixel images and passes them through a series of convolutional and max-pooling layers. The initial layer applies a 7 x 7 filter, followed by a 2 x 2 max-pool that reduces spatial dimensions while increasing the depth of feature maps. Subsequent layers alternate between convolutions and max-pooling with varying filter sizes, capturing features at different scales. The architecture ends with fully connected layers that consolidate these features to determine the presence and characteristics of melanomas.
The review highlights that versions from YOLOv3 through YOLOv8 (as well as YOLOX) have been progressively adapted for skin lesion work. For example, a hybrid YOLOv5 + ResNet approach by Elshahawy et al. achieved exceptional performance: precision of 99.0%, recall of 98.6%, and a mean average precision (mAP) of 98.7% across IoU thresholds from 0.5 to 0.95. This sets one of the highest benchmarks reported for melanoma detection algorithms.
Other YOLO-based studies have applied the architecture for detecting iris freckles as a potential biomarker for cutaneous melanoma (Slim-YOLO), and for combining YOLO with active contour methods for both lesion detection and segmentation (YOLOv4-DarkNet).
Generative Adversarial Networks (GANs) work through a competition between two neural networks: a generator that creates synthetic dermoscopic images from random noise, and a discriminator that tries to tell real images from fakes. As they train against each other, the generator produces increasingly realistic skin lesion images. This is valuable because melanoma datasets are often small or imbalanced, with far fewer melanoma images than benign lesion images.
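The adversarial loop can be sketched at toy scale in PyTorch (flattened 28 x 28 "images" and a handful of steps; real dermoscopy GANs operate on full-resolution RGB images and train far longer):

```python
# Minimal GAN sketch: generator G maps noise to a fake image vector,
# discriminator D scores real vs. fake; the two train adversarially.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
loss = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(8, 28 * 28) * 2 - 1        # stand-in for real lesion crops
for _ in range(3):                           # a few adversarial steps
    # 1) discriminator step: push real -> 1, fake -> 0
    fake = G(torch.randn(8, 16)).detach()
    d_loss = loss(D(real), torch.ones(8, 1)) + loss(D(fake), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) generator step: try to fool the discriminator (fake -> 1)
    fake = G(torch.randn(8, 16))
    g_loss = loss(D(fake), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(fake.shape)   # generated samples could augment a scarce melanoma class
```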
The review catalogs several GAN variants applied to melanoma detection. StyleGAN + DenseNet201 (Zhao et al.) was used for dermoscopy image classification, achieving enhanced diagnostic accuracy by generating diverse training samples. StyleGANs with decision fusion (Gong et al.) improved classification reliability by combining outputs from multiple generated image sets. MelanoGANs (Baur et al.) specialize in high-resolution skin lesion synthesis, producing detailed training data that smaller datasets cannot provide on their own.
Other approaches include progressive transfer learning with adversarial domain adaptation (Gu et al.), which helps GAN-trained models generalize across different clinical environments and imaging conditions. Yi et al. explored unsupervised and semi-supervised learning using categorical GANs assisted by Wasserstein distance, enabling dermoscopy image categorization with minimal labeled data. Additional variants explored in the literature include SPGGAN, DCGAN, DDGAN, LAPGAN, PGAN, and Conditional GAN.
Mask R-CNN extends the Faster R-CNN object detection framework by adding a branch that predicts pixel-level segmentation masks for each detected region of interest. For melanoma detection, this means the model does not just identify a lesion; it precisely outlines its boundaries, separating melanoma from surrounding healthy skin. The architecture uses a backbone network for feature extraction, a Region Proposal Network (RPN) to identify candidate regions, and then applies fully connected and convolutional layers for classification, bounding box regression, and mask prediction simultaneously.
Mask R-CNN has been paired with Feature Pyramid Networks (FPN) to detect lesions at multiple scales. Studies by Khan et al. combined Mask R-CNN with transfer learning for attribute-based skin lesion detection and recognition, while Bagheri et al. compared Mask R-CNN segmentation with RetinaDeeplab and graph-based methods, finding that Mask R-CNN provided competitive segmentation quality.
ResNet (Residual Network) is the single most popular architecture in melanoma detection (14.9% of all applications). Its key innovation is the use of skip connections (residual connections) that allow certain layers to bypass others, enabling the construction of much deeper networks (ResNet-34, ResNet-50, ResNet-101, ResNet-152) without suffering from vanishing gradient problems. The ResNet-50 variant is particularly prominent, featuring convolutional layers, batch normalization, ReLU activation, and average pooling before the final classification output.
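The residual connection at the heart of ResNet can be sketched in a few lines of PyTorch (a basic two-convolution block; bottleneck blocks as in ResNet-50 add a 1 x 1 squeeze and expand, but the skip-and-add idea is the same):

```python
# Sketch of a residual block: the block learns F(x) and outputs F(x) + x,
# so gradients can flow through the identity path in very deep stacks.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # skip connection: add input back

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)   # shape is preserved, so blocks stack freely
```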
Multiple studies demonstrated ResNet's effectiveness: it has been used for ensemble learning on dermoscopy images, for multi-class skin lesion classification, and for discriminative feature learning. Variants such as SE-ResNet-50 (with squeeze-and-excitation attention) and FCRN (Fully Convolutional Residual Networks) have further extended its capabilities for segmentation tasks.
DenseNet (Densely Connected Convolutional Networks) takes a different approach to depth than ResNet. Instead of adding skip connections, DenseNet connects every layer to every other layer in a feed-forward fashion. Each layer receives feature maps from all preceding layers through depth concatenation. This ensures maximum information flow between layers, enhances feature retention, and makes the network both deep and parameter-efficient. Variants used in melanoma work include DenseNet-121, DenseNet-161, DenseNet-169, DenseNet-201, and DenseNet-264.
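The depth-concatenation pattern can be sketched directly; the channel counts and growth rate below are illustrative, not DenseNet's published configuration:

```python
# Sketch of dense connectivity: each layer concatenates the feature maps
# of all preceding layers along the channel axis, so channel depth grows
# by a fixed "growth rate" per layer.
import torch
import torch.nn as nn

growth = 12
layers = nn.ModuleList(
    nn.Conv2d(16 + i * growth, growth, 3, padding=1) for i in range(4)
)

x = torch.randn(1, 16, 32, 32)
for conv in layers:
    new_maps = torch.relu(conv(x))
    x = torch.cat([x, new_maps], dim=1)   # depth concatenation
print(x.shape)                            # channels: 16 + 4 * 12 = 64
```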
The DenseNet architecture has been applied across multiple melanoma-related tasks. DenseNet-II (Girdhar et al.) is an improved version specifically designed for melanoma cancer detection. Zhang and Wang applied DenseNet to the SIIM-ISIC melanoma classification challenge, while Nawaz et al. developed an improved DenseNet-77 combined with U-Net for melanoma segmentation. An FCN-based DenseNet framework by Adegun and Viriri demonstrated strong results for automated detection and classification in dermoscopy images.
U-Net was originally designed for biomedical image segmentation and has become a cornerstone for delineating melanoma boundaries. Its encoder-decoder architecture with skip connections between corresponding layers enables precise segmentation even with limited training data. Variants such as U-Net++ have been applied for multi-task lesion attribute segmentation, and a study on VGG-UNet by Rajinikanth et al. compared Adam vs. SGD optimizers for melanoma segmentation, finding that optimizer choice significantly impacts convergence and accuracy.
Other segmentation-focused work includes transfer learning approaches combining U-Net with DCNN-SVM for simultaneous segmentation and classification, and multiscale attention U-Net variants that adaptively focus on the most diagnostically relevant regions of dermoscopic images.
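The encoder-decoder-with-skips idea behind all of these U-Net variants can be shown at miniature scale (one encoder level instead of U-Net's four; layer sizes here are illustrative):

```python
# Minimal one-level U-Net sketch: the encoder downsamples, the decoder
# upsamples, and a skip connection concatenates matching encoder features
# so fine boundary detail survives into the segmentation mask.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, 1, 1)   # one channel: lesion probability map

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        u = self.up(m)
        u = torch.cat([u, e], dim=1)      # skip connection from the encoder
        return self.head(self.dec(u))

net = TinyUNet()
mask = net(torch.randn(1, 3, 64, 64))
print(mask.shape)   # the predicted mask matches the input resolution
```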
The paper presents the standardized evaluation metrics used across the field: Precision (TP / (TP + FP)), Recall (TP / (TP + FN)), Accuracy ((TP + TN) / (TP + TN + FP + FN)), and mean Average Precision (mAP). These metrics are essential for comparing models objectively and for assessing whether AI systems are reliable enough for clinical deployment, where both false positives (unnecessary biopsies) and false negatives (missed cancers) have serious consequences.
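These metrics are plain confusion-matrix arithmetic; the counts below are hypothetical, chosen only to make the computation concrete:

```python
# The standard metrics computed from a toy confusion matrix. In melanoma
# screening, FN (missed cancers) and FP (unnecessary biopsies) carry very
# different clinical costs, which is why both precision and recall matter.
tp, fp, fn, tn = 42, 8, 6, 44   # hypothetical counts from a test set of 100

precision = tp / (tp + fp)                  # of flagged lesions, how many were melanoma
recall = tp / (tp + fn)                     # of true melanomas, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct

print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```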
Several standout results are highlighted. Daniel Kvak developed a melanoma detector based on the CoAtNet architecture, a hybrid that combines CNNs and Vision Transformers. It achieved precision of 0.901, recall of 0.895, and average precision of 0.923, demonstrating robust performance by blending the local feature extraction of CNNs with the global attention capabilities of Transformers.
Iyatomi et al. applied a computer-based method for extracting melanoma tumors from dermoscopy images, achieving precision of 94.1% and recall of 95.3%, demonstrating that the system could effectively match the performance of experienced dermatologists. The highest-performing result reported in the review comes from Elshahawy et al., whose hybrid YOLOv5 + ResNet model reached precision of 99.0%, recall of 98.6%, and mAP of 98.7% across IoU thresholds from 0.5 to 0.95.
Additional architectures with strong track records include EfficientNet variants (B5, B6, B7), which optimize the scaling of network width, depth, and resolution for different computational budgets, and VGG-16/VGG-19, whose straightforward stacked convolutional layer design remains widely used for melanoma classification baseline comparisons and for melanoma thickness prediction.
Despite impressive accuracy numbers, the review identifies several barriers to clinical adoption. Data bias is a major concern: most datasets are heavily skewed toward lighter skin tones, meaning AI models may perform poorly on darker skin. The need for extensive data annotation by dermatopathologists also limits the speed at which training sets can be expanded. Additionally, some detection algorithms that rely on standard RGB camera photography still do not match the diagnostic precision of skilled dermatologists using dermoscopy.
The authors call for multimodal data integration, combining dermoscopic images with clinical patient data and genetic markers to enhance predictive accuracy and enable personalized melanoma diagnostics. They also emphasize the importance of explainable AI (XAI) principles, noting that clinicians are more likely to trust and adopt AI tools when they can understand why a model made a particular prediction, not just what it predicted.
Emerging technical directions include federated learning, which allows multiple hospitals to collaboratively train AI models without sharing patient data (addressing privacy concerns), lightweight network architectures optimized for deployment on mobile and edge devices for point-of-care screening, and Vision Transformers and hybrid CNN-Transformer architectures like CoAtNet that capture both local texture details and global structural patterns in lesion images.
The paper concludes that continuous advancements in AI and machine learning are redefining medical imaging standards for melanoma detection. The practical deployment of CNNs in dermatological clinics has already been shown to reduce false negative rates compared to traditional visual inspection. As these technologies mature, the integration of AI into routine clinical workflows promises to make melanoma screening more accessible, accurate, and cost-effective, ultimately improving early detection rates and patient survival.