Melanoma is the deadliest form of skin cancer, with a propensity to spread beyond the skin to the eyes, limbs, and internal organs. The World Health Organization projects that cancer will claim 13.1 million lives annually by 2030, and early detection is critical: skin cancers caught early have a roughly 90% cure rate, while late detection drops that figure to about 50%. In the United States alone, an estimated 20% of the population will develop some form of skin cancer during their lifetime.
The diagnostic challenge: Dermatologists traditionally rely on the ABCD-E rule (Asymmetry, Border, Color, Diameter, and Evolving) established by Stolz et al. in 1994, but this clinical framework has notable blind spots. Melanomas can occasionally be smaller than 6 mm, falling outside the diameter criterion entirely. Additionally, diagnostic accuracy varies widely based on a physician's training and experience, and misdiagnosis carries serious consequences. False-negative diagnoses delay treatment for malignant lesions, while false-positive diagnoses lead to unnecessary excision of benign tumors, driving up healthcare costs.
Prior approaches: Earlier computer-aided methods relied on manually coded feature extraction. Codella et al. combined edge and color histograms with local binary patterns. Barata et al. used local and global image elements with Laplacian pyramids and gradient histograms, achieving 96% sensitivity and 80% specificity on dermoscopy images from Hospital Pedro Hispano. However, these handcrafted feature methods struggled with the sheer variability of melanoma presentation, particularly the way lesions evolve over time in size, shape, and color.
This paper proposes a hybrid deep learning framework that integrates three distinct components: U-Net for image segmentation, Inception-ResNet-v2 for deep feature extraction, and a Vision Transformer (ViT) with multi-head self-attention for final classification. The authors tested this pipeline on the ISIC 2020 challenge dataset (33,126 dermoscopic images) and achieved 98.65% accuracy, 99.20% sensitivity, and 98.03% specificity.
The most frequently used imaging modalities for skin cancer diagnosis include dermatoscopy, ultrasound, optical coherence tomography, reflectance confocal microscopy, and hyperspectral imaging. However, melanoma lesions cannot be reliably identified as tumors using any single imaging method alone. This reality has driven the adoption of increasingly sophisticated computational approaches over the past decade.
Custom feature-based era: Biopsy remains the diagnostic gold standard but is invasive, uncomfortable, and time-consuming. Computer-aided diagnosis (CAD) systems were developed as non-invasive alternatives, built on a four-step image-processing workflow: preprocessing, segmentation, feature extraction, and classification. Early CAD systems relied on handcrafted features; later, Al-Masni et al. (2018) proposed a full-resolution convolutional network (FrCN) for skin lesion segmentation, achieving 90.78% accuracy.
Deep learning transition: The shift from handcrafted features to deep learning brought significant accuracy improvements. Kassem et al. (2020) used deep CNNs with transfer learning to reach 94.92%, while Mousannif et al. (2020) achieved only 86% with standard CNNs. Duggani and Nath (2021) combined deep CNNs with YOLO (You Only Look Once) object detection and achieved 97.49%. More recent work by Gouda et al. (2022) tested ResNet50, InceptionV3, and Inception-ResNet architectures, but only reached 85.7% accuracy.
State of the art in 2023-2024: The closest competing methods include Singh et al. (2023) using YOLO with L-Fuzzy Logic at 98% accuracy, Gamage et al. (2024) using CNN architectures with transfer learning at 98.37%, and Din et al. (2024) using LSCS-Net and U-Net at 98.62%. The proposed hybrid framework in this paper surpasses all of these with 98.65% accuracy, representing the highest reported result on these benchmarks.
The study used dermoscopic images from the International Skin Imaging Collaboration (ISIC) 2020 challenge dataset, a large-scale publicly available resource totaling 48 GB of data. The dataset contains 33,126 dermatoscopic images stored in DICOM format, including both malignant and benign skin lesions. Each image is linked to a unique patient ID, ensuring traceability to individual patients. DICOM is the standard format for medical image data exchange, providing comprehensive patient and image-related metadata alongside the image data itself.
Preprocessing pipeline: The authors applied a two-stage preprocessing procedure. First, images were resized and centered to standardize dimensions. Second, background elimination was performed using the k-means algorithm, along with saliency detection and convolution/deconvolution networks to isolate the lesion from surrounding skin and artifacts. Noise reduction and contrast enhancement were also applied to improve image quality before feeding data into the model.
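The paper names k-means among its background-elimination tools. A minimal sketch of the idea, clustering pixel colors into two groups and keeping the cluster that contains the image center, might look like the following. The two-cluster setup, the brightest/darkest initialization, and the center-pixel heuristic are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def kmeans_background_mask(img, iters=10):
    """Cluster pixel colors into 2 groups with plain k-means and treat the
    cluster containing the central pixel as the lesion (foreground)."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)
    # initialize with the darkest and brightest pixels so the two
    # starting centers are guaranteed to differ
    centers = np.stack([pixels[pixels.sum(axis=1).argmin()],
                        pixels[pixels.sum(axis=1).argmax()]])
    for _ in range(iters):
        # assign each pixel to its nearest center, then recompute centers
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    labels = labels.reshape(h, w)
    # heuristic: dermoscopic lesions are roughly centered in the frame
    return labels == labels[h // 2, w // 2]

# toy image: dark square (lesion) on a bright background
img = np.full((32, 32, 3), 200, dtype=np.uint8)
img[8:24, 8:24] = 40
mask = kmeans_background_mask(img)
```

In practice the paper combines this with saliency detection and convolution/deconvolution networks, which a color-only clustering cannot replicate; the sketch only shows the k-means component.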
Data augmentation: To address the class imbalance between benign and malignant samples, the authors generated approximately 6,562 additional malignant images through augmentation. Techniques included horizontal and vertical flips, rescaling of pixel values to the [0, 1] range, rotations of up to 25 degrees, zooming of up to 0.20, and a further 20-degree geometric transformation. These transformations expanded the malignant class to balance it against the far more numerous benign samples.
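Two of the listed augmentations, the flips and the zoom, can be sketched with NumPy alone (the rotations and the 20-degree transformation would need an image library and are omitted; the center-crop stand-in for zoom is an assumption, not the authors' implementation):

```python
import numpy as np

def augment(img, rng):
    """Generate simple augmented variants of one image: horizontal and
    vertical flips plus a center zoom-in of up to 20%."""
    out = [np.flip(img, axis=1),      # horizontal inversion
           np.flip(img, axis=0)]      # vertical inversion
    h, w = img.shape[:2]
    z = rng.uniform(0.0, 0.20)        # zoom factor up to 0.20
    dh, dw = int(h * z / 2), int(w * z / 2)
    out.append(img[dh:h - dh, dw:w - dw])  # zoom as center crop (resize omitted)
    return out

rng = np.random.default_rng(0)
img = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
variants = augment(img, rng)
```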
Data splitting: The dataset was divided into three partitions: 60% for training, 10% for validation, and 30% for testing. This 60/10/30 split gave the model a substantial training set while reserving nearly a third of the data for unbiased performance evaluation.
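The 60/10/30 partition can be reproduced with a simple index shuffle; this is a generic sketch of such a split, not the authors' code:

```python
import numpy as np

def split_indices(n, train=0.60, val=0.10, seed=42):
    """Shuffle sample indices and cut them into 60/10/30
    train/validation/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return (idx[:n_train],                    # 60% training
            idx[n_train:n_train + n_val],     # 10% validation
            idx[n_train + n_val:])            # remaining 30% testing

tr, va, te = split_indices(33126)  # ISIC 2020 image count
```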
The first stage of the proposed framework uses U-Net, an architecture originally proposed by Ronneberger et al. for biomedical image segmentation. U-Net has a symmetric encoder-decoder structure. The encoder generates high-level feature maps through 2D convolution and max pooling layers, progressively reducing spatial resolution while increasing the depth of learned representations. The decoder then regenerates feature maps to match the original image dimensions, recovering spatial detail through up-sampling operators.
Architecture details: Each layer applies 3x3 convolutions with ReLU activation and batch normalization. The 2D max pooling layers halve the spatial resolution of the feature maps at each stage, helping the network handle large input images while keeping the number of trainable parameters manageable. Skip connections between corresponding encoder and decoder layers preserve fine-grained spatial information. The network was trained with the Adam and Adamax optimizers, with loss and accuracy tracked throughout training. The result is the characteristic U-shaped design with a contracting path and an expanding path.
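The contract-then-expand bookkeeping can be illustrated without a deep-learning framework: max pooling halves the spatial size on the way down, up-sampling doubles it on the way up, and a skip connection concatenates encoder features channel-wise. This is a toy sketch of the data flow only, not a trainable U-Net:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling: (H, W, C) -> (H/2, W/2, C)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour up-sampling: doubles height and width."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(64, 64, 16)          # encoder feature map
down = max_pool2x2(x)                   # contracting path: (32, 32, 16)
up = upsample2x(down)                   # expanding path: (64, 64, 16)
skip = np.concatenate([x, up], axis=2)  # skip connection: (64, 64, 32)
```

The channel doubling at the concatenation step is why U-Net decoders see both coarse semantic features (from below) and fine spatial detail (from the encoder) at every resolution.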
Role in the pipeline: The U-Net model's primary task was to identify affected areas of skin cancer by generating segmented masks. These masks isolate the region of interest (the lesion) from surrounding healthy skin, hair, and other artifacts. The model was trained using pairs of dermoscopic images and their corresponding ground-truth segmentation masks. By producing clean, targeted masks, the U-Net stage reduces noise in downstream classification and prevents the feature extraction model from being distracted by irrelevant background information.
The authors emphasize that U-Net's reduced overhead compared to fully connected architectures, combined with its automatic border recognition capability, makes it particularly well-suited for medical image datasets where pixel-level precision matters. Its symmetrical extended path enables precise localization with fewer parameters than standard feed-forward networks.
Inception-ResNet-v2 for feature extraction: After U-Net produces segmented masks, both the masks and original images are passed to Inception-ResNet-v2 for optimal feature extraction. This architecture combines the multi-scale feature capture of the Inception family with residual connections that address the vanishing gradient problem in deep networks. The residual connections replace the Inception architecture's filter concatenation stage, enabling the network to learn residual characteristics while shortening training time. The architecture uses factorized convolutions to reduce the total number of parameters and decreases input resolution within modules before applying convolutions, keeping computational cost relatively low despite its depth.
Hyperparameter configuration: The Inception-ResNet-v2 model was fine-tuned with a learning rate of 0.001, batch size of 20, 15 epochs, momentum of 0.99, and dropout regularization of 0.5. These hyperparameters were tuned iteratively until optimal values were found. The output feature maps from Inception-ResNet-v2 are flattened into a sequence of patches, which become the input to the Vision Transformer.
Vision Transformer (ViT) classification: The multi-head Vision Transformer processes the high-dimensional feature vectors through its self-attention mechanism. Unlike CNNs, which concentrate on local pixels within receptive fields and struggle with remote pixel connections, the ViT's self-attention mechanism captures relationships between all patches in the input sequence simultaneously. The scaled dot-product attention is computed as Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k))V, where Q, K, and V represent query, key, and value vectors and d_k is the key dimension. Multi-head attention extends this by projecting queries, keys, and values h times through different learned linear projections.
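The attention formula above translates directly into a few lines of NumPy. This is a generic single-head sketch with random placeholder data, not the paper's model; the patch counts and dimension are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # each output row is a weighted mix of value rows

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query patches, d_k = 8
K = rng.standard_normal((6, 8))   # 6 key patches
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention simply runs h copies of this with different learned projections of Q, K, and V, then concatenates the h outputs.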
Transformer architecture components: The proposed transformer encoder includes a normalization layer, a multi-head self-attention (MSA) layer, two dense layers with a classification head, and a Softmax regression function. The MLP block within the transformer contains dropout and normalization layers with a 50% dropout rate, plus a non-linear layer of 1,024 neurons with Gaussian Error Linear Unit (GeLU) activation. The 2D patches from Inception-ResNet-v2 are flattened into a 1D sequence before passing through the transformer encoder's MSA and MLP blocks for the final malignant vs. benign classification.
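The MLP block's dense-GeLU-dropout structure can be sketched in NumPy. The 1,024-unit width and 50% dropout rate come from the paper; the input width, weights, and tanh approximation of GeLU are illustrative assumptions:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def mlp_block(x, w1, w2, rng, drop=0.5, train=True):
    """Dense(1024) -> GeLU -> dropout -> Dense, as in the MLP block."""
    h = gelu(x @ w1)                        # non-linear layer, 1,024 neurons
    if train:
        mask = rng.random(h.shape) >= drop  # 50% dropout
        h = h * mask / (1.0 - drop)         # inverted-dropout scaling
    return h @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))           # 4 patch embeddings (width assumed)
w1 = rng.standard_normal((256, 1024)) * 0.02
w2 = rng.standard_normal((1024, 256)) * 0.02
y = mlp_block(x, w1, w2, rng)
```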
The proposed framework was trained on an NVIDIA GeForce RTX 3060 GPU with an Intel Core i7-10700KF CPU at 3.80 GHz and 64 GB of RAM. Training accuracy started at 86.03% at epoch 1 and climbed steadily to 98.65% by epoch 15. The loss started high, dropped to 0.3746 by epoch 6, and reached 0.1556 at epoch 15, when peak accuracy was achieved. Validation loss stabilized at 0.2773 after 12 epochs, with training and validation loss converging at 0.1851 at epoch 12 and remaining stable thereafter.
Classification performance: The final model achieved 98.65% accuracy, 99.20% sensitivity, and 98.03% specificity on the ISIC 2020 test set. The high sensitivity (99.20%) is particularly important in melanoma screening because it means the model correctly identifies 99.20% of true malignant cases, minimizing dangerous false negatives. The 98.03% specificity indicates strong performance in correctly ruling out benign lesions, reducing unnecessary biopsies.
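Sensitivity and specificity follow directly from confusion-matrix counts. The counts below are illustrative values chosen to roughly reproduce the reported rates; the paper reports percentages, not raw counts:

```python
def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on malignant), and specificity
    (true-negative rate on benign) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of malignant cases caught
    specificity = tn / (tn + fp)   # fraction of benign cases cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# hypothetical counts approximating the reported 99.20% / 98.04% rates
acc, sens, spec = screening_metrics(tp=992, fn=8, tn=4902, fp=98)
```

In screening, the asymmetry matters: a false negative (fn) is a missed cancer, which is why the paper emphasizes sensitivity over specificity.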
Head-to-head comparisons: The proposed method outperformed 13 existing approaches. Al-Masni et al.'s FrCN (2018) achieved 90.78%. Kassem et al.'s transfer learning approach (2020) reached 94.92%. Duggani and Nath's DCNN + YOLO combination (2021) hit 97.49%. Among 2023-2024 methods, Patel et al.'s CNN reached 95%, Tembhurne's CNN with contourlet transform achieved 93%, Singh et al.'s YOLO + L-Fuzzy Logic reached 98%, Gamage et al.'s multi-architecture transfer learning reached 98.37%, and Din et al.'s LSCS-Net + U-Net reached 98.62%. The proposed framework's 98.65% surpasses all of these.
To mitigate overfitting, the authors applied regularization, adding a penalty term to the model's loss function, and tuned hyperparameters. The convergence behavior, with validation and training metrics meeting at epoch 6 for accuracy and epoch 12 for loss, suggests the model generalized well rather than simply memorizing the training data.
Segmentation ablation: The authors conducted ablation experiments to quantify the contribution of U-Net segmentation to overall performance. Removing the segmentation step measurably degraded classification. With segmentation included, the benign class recorded a precision of 0.99, recall of 0.80, and F1-score of 0.89. Without segmentation, the malignant class saw the largest degradation: precision dropped by 0.83%, recall by 2.38%, and F1-score by 1.6%. These results confirm that isolating the lesion region before feature extraction has a measurable impact on diagnostic accuracy, particularly for detecting malignant cases.
Cross-dataset validation on HAM10000: To test generalizability, the authors evaluated their trained model on the HAM10000 dataset, a separate publicly available collection of 10,015 dermoscopic images spanning seven diagnostic categories: basal cell carcinoma (514 images), dermatofibroma (115 images), actinic keratosis (327 images), benign keratosis (1,099 images), nevi (6,705 images), vascular skin lesions (142 images), and melanoma (1,113 images). The model was specifically tested on the melanoma subset and achieved 98.44% accuracy on this unseen data.
The near-consistent accuracy between the ISIC 2020 results (98.65%) and the HAM10000 results (98.44%), a difference of only 0.21 percentage points, suggests strong generalizability. The model was not retrained on HAM10000, so the 98.44% figure represents true out-of-distribution performance. This is particularly encouraging because the two datasets were collected independently, with different imaging equipment, patient populations, and annotation protocols.
Single-task binary classification: The framework classifies lesions only as malignant or benign. It does not differentiate among the various subtypes of skin cancer (basal cell carcinoma, squamous cell carcinoma, melanoma in situ, etc.) or the seven diagnostic categories present in the HAM10000 dataset. A real clinical deployment would likely require multi-class classification to guide treatment decisions, since different cancer types demand different therapeutic approaches.
Dataset and hardware constraints: Although the ISIC 2020 dataset is large (33,126 images, 48 GB), it still represents a limited demographic and geographic slice of the global patient population. The model was trained on a single consumer-grade GPU (an NVIDIA RTX 3060, with 64 GB of system RAM), and the paper does not report inference time or throughput metrics, which would be critical for any real-time clinical application. The reliance on DICOM-format dermoscopic images also means the model may not generalize to images captured by consumer cameras or smartphones without additional adaptation.
Validation design: The study relies on a single train/validation/test split (60/10/30) rather than k-fold cross-validation, which would provide more robust performance estimates. The ablation study, while informative, only compares the effect of including vs. excluding segmentation. It does not isolate the individual contributions of Inception-ResNet-v2 vs. Vision Transformer, or compare against using a different backbone entirely. External validation was limited to a single additional dataset (HAM10000).
Future directions: The authors plan to develop a smartphone application integrated with their trained model to provide early and accurate melanoma diagnosis in real-world settings. This would be a significant step toward clinical deployment, but it would require addressing the gap between high-resolution dermoscopic imaging and lower-quality smartphone captures. Additional validation on multi-center, prospective datasets and comparison with dermatologist performance in head-to-head studies would strengthen the clinical case for adoption.