Melanoma is the most aggressive form of skin cancer, responsible for over 75% of skin cancer deaths despite representing only 1% of all skin cancer cases. Its incidence has been rising steadily, increasing 3-7% per year among fair-skinned populations. The global melanoma burden grew by 41% between 2012 and 2020, from 230,000 to 325,000 cases worldwide. In the United States alone, an estimated 99,780 new cases and 7,650 deaths occurred in 2022.
Why AI matters here: Early detection dramatically improves outcomes. The five-year survival rate for melanoma overall is 93.5%, driven largely by cases caught at early, localized stages. However, survival drops sharply for advanced disease: 73.9% for Stage III and just 35.1% for Stage IV. This review by Kalidindi (2024) examines how artificial intelligence, particularly convolutional neural networks (CNNs), is being used to improve the speed and accuracy of melanoma diagnosis across dermoscopy, histopathology, and smartphone-based screening.
Disparities in diagnosis: Men are 1.6 times more likely to develop melanoma than women. Lifetime risk varies starkly by race: 2.6% for white individuals, 0.6% for Hispanic individuals, and 0.1% for Black individuals. African American patients often present with more advanced disease and have poorer prognoses, underscoring the need for tools that work equitably across skin types.
Clinical examination: The standard bedside approach uses the ABCDE criteria (Asymmetry, Border irregularity, Color variability, Diameter of 6 mm or more, and Evolution over time). Naked-eye examination using these criteria achieves diagnostic accuracy of approximately 65%. Dermoscopy, which uses a handheld device with 10x magnification and polarized light, improves diagnostic accuracy by revealing subsurface structures invisible to the naked eye. Algorithms like the seven-point checklist, Menzies method, and CASH criteria (Color, Architecture, Symmetry, Homogeneity) have improved melanoma identification sensitivity and specificity by up to 18% and 10%, respectively.
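The ABCDE criteria amount to a simple rule count. As an illustration only, here is a toy scoring function with hypothetical feature inputs and thresholds; in practice the criteria are applied visually by a clinician, not computed from pre-extracted numbers.

```python
def abcde_score(asymmetry, border_irregularity, n_colors, diameter_mm, evolving):
    """Count how many ABCDE criteria a lesion meets (0-5).

    Inputs are hypothetical pre-assessed features, for illustration.
    """
    score = 0
    score += asymmetry            # True if the lesion halves do not match
    score += border_irregularity  # True if edges are ragged or notched
    score += n_colors >= 3        # color variability within the lesion
    score += diameter_mm >= 6.0   # the classic 6 mm cutoff
    score += evolving             # changing in size, shape, or color
    return score

# A lesion meeting several criteria would typically warrant referral.
print(abcde_score(True, False, 3, 7.2, False))  # -> 3
```

The dermoscopic algorithms named above (seven-point checklist, Menzies method, CASH) follow the same pattern of summing weighted visual features, just with subsurface structures a handheld dermoscope makes visible.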
Histopathology as gold standard: Tissue biopsy examined under a microscope remains the definitive diagnostic method, but it is inherently subjective. Interobserver variability is a well-documented problem, especially for smaller and thinner lesions. The WHO now classifies melanoma into nine subtypes (up from the original four: lentigo maligna, superficial spreading, acral lentiginous, and nodular), incorporating genomic and epidemiologic data.
Molecular diagnostics: Advanced techniques like comparative genomic hybridization (CGH) detect chromosomal copy number variations with over 95% sensitivity. Fluorescence in situ hybridization (FISH) provides high sensitivity and specificity for targeted genomic segments. Gene expression profiling (GEP) tests such as myPath use quantitative reverse transcription PCR to distinguish melanoma from benign lesions. The pigmented lesion assay, a noninvasive tape-strip test measuring RNA markers, has demonstrated over 99% negative predictive value and over 91% sensitivity.
Two main AI approaches: The review identifies two primary methodologies for AI-based skin lesion diagnosis. The first uses machine learning (ML) and deep learning (DL), particularly convolutional neural networks (CNNs), which learn directly from large image datasets. The second uses expert systems built on information ontologies with explicit rules encoding dermatologist knowledge. CNNs have become dominant since 2016, replacing classical ML techniques that were used in pigmented lesion classification since the 1990s.
CNN performance benchmarks: Nasr-Esfahani et al. trained a CNN on a dataset expanded from 170 to 6,120 images (using cropping, scaling, and rotation augmentation), achieving 81% sensitivity, 80% specificity, and 81% accuracy for melanoma detection. Other studies reported sensitivities up to 90% and accuracies between 82% and 94%. The landmark study by Esteva et al. used a pre-trained CNN on 129,450 clinical images (including 3,374 dermoscopy images) and achieved an AUC of 0.96 for both carcinomas and melanomas, matching the performance of 21 board-certified dermatologists.
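The dataset expansion Nasr-Esfahani et al. used is standard geometric augmentation. Their exact pipeline is not reproduced here; this is a minimal NumPy sketch using rotations and flips (their cropping and scaling variants are omitted for brevity) to show how each source image yields multiple training samples.

```python
import numpy as np

def augment(image):
    """Yield simple geometric variants of one image: the original,
    three 90-degree rotations, and horizontal/vertical flips."""
    yield image
    for k in (1, 2, 3):
        yield np.rot90(image, k)  # rotate by k * 90 degrees
    yield np.fliplr(image)        # horizontal flip
    yield np.flipud(image)        # vertical flip

# Each source image becomes 6 training samples under this scheme.
img = np.arange(9).reshape(3, 3)
variants = list(augment(img))
print(len(variants))  # -> 6
```

Scaling a 170-image set to 6,120 implies 36 variants per image, so the original study combined more transforms than shown here.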
Wide-field and clinical images: Soenksen et al. demonstrated 90% sensitivity and specificity for classifying suspicious lesions from wide-field photographs, with 83% agreement between CNN-generated and dermatologist saliency rankings. Notably, only 12 of 51 studies reviewed used gross clinical images rather than dermoscopic images, a significant research gap given that non-dermatologists typically lack access to dermoscopy.

The "black-box" challenge: Deep learning models process vast amounts of data and generate predictions without providing clear explanations of how inputs lead to outcomes. This lack of interpretability limits trust and applicability in healthcare, where clinicians need to understand and validate AI decisions.
AI as a clinical decision aid: Marchetti et al. demonstrated that AI support increased correct lesion classification rates from 73.4% to 75.4% for dermatologists and from 69.4% to 72.6% for residents. This is significant because even small percentage improvements translate to thousands of patients receiving correct diagnoses. However, Maron et al. showed that CNN models are sensitive to minor image changes (such as rotation or color shifts) that do not affect human examiners, raising concerns about robustness.
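The robustness concern Maron et al. raised can be probed with a simple invariance check: apply transformations that should not change the diagnosis and measure how often the model's prediction flips. A minimal sketch, using a stand-in predictor (any real CNN would slot in via the `predict` callable):

```python
import numpy as np

def rotation_robustness(predict, images):
    """Fraction of images whose predicted label is unchanged under
    all three 90-degree rotations. `predict` maps an image array
    to a discrete label."""
    stable = 0
    for img in images:
        base = predict(img)
        if all(predict(np.rot90(img, k)) == base for k in (1, 2, 3)):
            stable += 1
    return stable / len(images)

# Stand-in "model": a mean-intensity threshold, which is
# rotation-invariant by construction, so robustness is 1.0.
predict = lambda img: int(img.mean() > 0.5)
imgs = [np.random.rand(8, 8) for _ in range(10)]
print(rotation_robustness(predict, imgs))  # -> 1.0
```

Maron et al.'s finding is that real CNNs score well below 1.0 on checks like this, while human examiners are unaffected by such changes.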
Performance by melanoma subtype: Winkler et al. tested CNNs on different melanoma subtypes and locations, finding highly variable results. The system achieved strong AUCs for superficial spreading melanoma (0.989), acral melanoma (0.928), and lentigo maligna melanoma (0.926). Performance dropped substantially for mucosal melanoma (AUC 0.754) and nail unit melanoma (AUC 0.621), revealing that rare subtypes and unusual locations remain challenging for AI.
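For readers less familiar with the AUC values quoted throughout, the metric has a direct probabilistic reading: the chance that a randomly chosen melanoma receives a higher model score than a randomly chosen benign lesion (ties count half). A small self-contained computation of that Mann-Whitney formulation:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case outscores
    a random negative case, with ties counted as 0.5 (Mann-Whitney U)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation gives 1.0; overlapping score distributions,
# as with the rarer subtypes above, pull the value toward 0.5.
print(auc([0.9, 0.8, 0.7], [0.4, 0.3]))  # -> 1.0
print(auc([0.9, 0.4], [0.5, 0.3]))       # -> 0.75
```

An AUC of 0.621 for nail unit melanoma therefore means the model ranks a true melanoma above a benign lesion only about 62% of the time, barely better than chance.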
Acral lesion detection: Lee et al. showed that CNN-generated diagnoses enhanced clinicians' accuracy in evaluating acral pigmented lesions specifically, improving concordance among different physician groups and reducing performance disparities. This is clinically important because acral melanoma disproportionately affects darker-skinned patients and is often diagnosed late.
AI chatbots for dermoscopy: A recent survey found that AI chatbots performed well at generating differential diagnoses for basal cell carcinoma but showed lower accuracy for squamous cell carcinoma and inflammatory dermatoses. Participants were generally satisfied with the diagnostic and educational capabilities of these tools.
Early computer-aided diagnosis: Computer-assisted histopathology began in 1987 with TEGUMENT, a decision tree system that saw limited use due to oversimplification of dermatopathology knowledge. The field was revitalized by whole-slide image (WSI) scanners and modern CNNs, which can analyze entire tissue slides at scale.
CNN versus pathologists: Hekler et al. (2019) used CNNs to classify melanocytic lesions and achieved a 19% discordance rate with dermatopathologists, which was comparable to the pathologists' own interobserver variability. Brinker et al. achieved 92% accuracy with annotated slides and 88% accuracy with unannotated slides when comparing a CNN against 18 international expert pathologists.
The critical role of image curation: Hart et al. demonstrated the importance of training data quality in a study classifying Spitz nevi versus conventional melanocytic nevi. The CNN achieved 92% accuracy with carefully curated images but only 52% accuracy with non-curated images, because the model misclassified based on predominant but irrelevant features. This 40-percentage-point gap highlights that AI performance is only as good as the data it learns from.
Educational AI tools: The review describes intelligent tutoring systems like SlideTutor (which teaches algorithmic problem-solving in visual classification using virtual microscopy and WSIs) and ReportTutor (which trains dermatopathology trainees in preparing standardized diagnostic reports for melanoma). Both immediate and delayed feedback led to significant learning gains in these systems.
Mobile screening tools: Smartphones with advanced cameras are increasingly being used for skin self-examination and teledermatology. Devices like DermLite and MoleScope attach to phones to improve image quality, allowing patients to photograph lesions and send them for professional analysis.
App performance metrics: Early apps used ABCD feature extraction with support vector machine (SVM) classifiers. Iowa State University developed an app with a detachable 10x lens that achieved 88% accuracy using an SVM classifier with a radial basis function (RBF) kernel. A CNN-based app achieved 78.8% accuracy, 91.3% sensitivity, and 73% specificity on 8,000 images. The PAD-UFES-20 dataset app, using a modified ResNet50 CNN that incorporated clinical patient data alongside images, achieved 85% accuracy and 96% reproducibility.
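The classical pipeline these early apps used, hand-crafted ABCD-style features fed to an RBF-kernel SVM, can be sketched in a few lines. Everything here is synthetic: the actual apps' feature extractors and training data are not public, so this only illustrates the classifier stage.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Fake 4-dimensional "ABCD" feature vectors: a benign cluster and a
# melanoma cluster, deliberately well separated for this toy example.
benign = rng.normal(loc=0.2, scale=0.1, size=(100, 4))
malignant = rng.normal(loc=0.8, scale=0.1, size=(100, 4))
X = np.vstack([benign, malignant])
y = np.array([0] * 100 + [1] * 100)

# RBF kernel lets the SVM draw a nonlinear decision boundary
# in the feature space, as in the Iowa State app's classifier.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # near 1.0 on this cleanly separated toy data
```

Real lesion features overlap far more than these synthetic clusters, which is why reported accuracies land in the 78-88% range rather than near 100%.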
Commercialized solutions: SkinScreener, now a certified medical device, demonstrated 96.4% sensitivity and 94.85% specificity for melanoma risk assessment. SkinVision, one of the most widely reviewed apps, uses ML algorithms trained on over 130,000 images and reports 95% sensitivity and 78% specificity for triaging skin lesions. These numbers suggest smartphone apps are approaching clinically useful thresholds for screening, though not yet for definitive diagnosis.
Training data bias: AI models learn from their training datasets, and biases in those datasets directly affect diagnostic accuracy. Early AI models underperformed on darker skin types due to a lack of diverse data representation. Confounders such as surgical pen markers on images can mislead models by creating spurious associations with malignancy. Heterogeneous datasets from diverse sources and clinical settings create inconsistencies that further degrade performance.
Regulatory and integration barriers: Most AI models have been validated using retrospective data without thorough prospective clinical trials. Integrating AI into clinical workflows raises medical-legal concerns about liability for misdiagnosis or delayed diagnosis. Regulatory bodies like the FDA have been cautious, and both clinicians and patients may distrust AI-driven recommendations. Building trust requires continuous validation across multiple clinical sites and transparent decision-making processes.
Emerging solutions: Vision-language models (VLMs) like Skin-GPT4 can interpret clinical photographs, generate descriptions and diagnoses, and serve as patient-facing chatbots or triage tools. These multimodal models integrate patient demographics, visual data, and genetic information for more comprehensive analysis. Federated learning (FL) addresses privacy concerns by allowing AI models to train on datasets across multiple institutions without transferring raw data, which also helps reduce performance disparities for underrepresented skin types. Foundation models (FMs) can be fine-tuned locally for institution-specific demographics.
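The core mechanic of federated learning is that institutions exchange model parameters rather than patient images. A minimal sketch of the federated averaging step (the aggregation rule of the FedAvg algorithm; hospital names and sizes below are invented):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: combine per-institution model weights,
    weighted by local dataset size, without sharing raw data."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two hypothetical hospitals contribute locally trained weight vectors.
w_a = np.array([1.0, 2.0])   # hospital A, trained on 300 images
w_b = np.array([3.0, 4.0])   # hospital B, trained on 100 images
print(fed_avg([w_a, w_b], [300, 100]))  # -> [1.5 2.5]
```

Because a hospital serving mostly darker-skinned patients contributes updates in proportion to its data, the shared model sees skin-type diversity no single institution's dataset provides, which is the mechanism behind the disparity reduction mentioned above.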
What needs to happen next: The review calls for standardized image quality protocols, comprehensive public benchmark datasets, equitable data representation across skin types, prospective clinical validation trials, clear regulatory frameworks for AI liability, and improved model transparency through techniques like saliency maps and content-based image retrieval.