Artificial Intelligence in the Non-Invasive Detection of Melanoma

Published in Diagnostics (2024)

Plain-English Explanations
Pages 1-2
What This Review Covers and Why Non-Invasive Detection Matters

Skin cancer is the most commonly diagnosed cancer among fair-skinned populations worldwide, with incidence rising steadily. Melanoma, though representing only about 1% of all skin cancer cases, is the deadliest form, with roughly 8,300 Americans expected to die from it annually. The uncontrolled proliferation of melanocytes drives melanoma, and the American Cancer Society reports that its death rates far exceed those of other skin cancer subtypes.

The biopsy problem: Skin biopsies with histopathological evaluation remain the gold standard for diagnosing melanoma. However, confirming every suspicious lesion with a biopsy is impractical for several reasons: scar formation from excisions, time constraints in clinical practice, and financial burdens that disproportionately affect lower-income patients. This creates a significant gap between what is medically ideal and what is clinically feasible.

Non-invasive imaging technologies such as dermoscopy (epiluminescence microscopy using magnification and polarized light to reveal subsurface skin features) and confocal microscopy (which captures images at cellular resolution comparable to histology) are widely used to triage suspicious lesions. These tools reduce unnecessary biopsies and increase sensitivity, but their success depends heavily on provider skill level. Notably, no standardized training for these devices exists in dermatology residency programs, resulting in high inter-user variability.

This review by Ismail Mendi et al. (2024) examines how artificial intelligence algorithms are being applied to non-invasive melanoma detection, focusing on AI in clinical imaging, dermoscopic evaluation, algorithms distinguishing melanoma from non-melanoma cancers, and in vivo skin imaging devices such as reflectance confocal microscopy (RCM) and optical coherence tomography (OCT).

TL;DR: Melanoma is rare among skin cancers but causes the most deaths. Biopsies are the gold standard but impractical for mass screening. AI applied to non-invasive imaging (dermoscopy, confocal microscopy, OCT) could bridge this gap by making accurate, accessible diagnosis possible without tissue removal.
Pages 2-4
How AI Works in Dermatology: From Machine Learning to Vision Transformers

Machine learning (ML) is a subfield of AI that makes predictions from input data, and it presents an excellent opportunity for automating medical image analysis. Supervised models are currently the most prevalent form of ML in dermatology, where each training sample is paired with a diagnostic label. The three primary tasks are classification (assigning a label like "melanoma" to an image), detection (identifying whether a structure like atypical networks is present), and segmentation (delineating the exact borders of a lesion within an image).
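The three task types differ mainly in the shape of the label each training image carries. A toy illustration (not from the paper; the image values and threshold are made up):

```python
import numpy as np

# Toy 4x4 grayscale "image" (values are illustrative, not real pixel data)
image = np.array([
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.7, 0.9, 0.8],
    [0.2, 0.8, 0.7, 0.2],
    [0.1, 0.2, 0.1, 0.1],
])

# Classification: one label for the whole image
classification_label = "melanoma"

# Detection: a flag for whether a structure (e.g., atypical network) is present
detection_label = 1  # 1 = present, 0 = absent

# Segmentation: one label per pixel, delineating the lesion border
segmentation_mask = (image > 0.5).astype(int)  # crude threshold stand-in
```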

Convolutional neural networks (CNNs) dominate the field. In their basic form, CNNs consist of multiple cascading nonlinear layers that filter input data by removing redundant information, finding correlations, and summarizing critical features. These features are then mapped to diagnostic labels through classification layers. Unlike traditional ML approaches that rely on hand-crafted features (texture, color, border information), deep learning (DL) models extract higher-order representations that capture complex patterns not easily visible to humans.
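The cascading filter-then-summarize operations inside a single CNN layer can be sketched in plain numpy (convolution, a nonlinearity, then pooling). This is a minimal sketch: real models stack many such layers, and the kernel weights are learned rather than hand-set as here.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode sliding-window filter, the core CNN operation."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Nonlinearity that keeps positive filter responses only."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Summarize each size x size block by its strongest response."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Hand-crafted vertical-edge kernel; a trained CNN learns such weights itself
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])
img = np.zeros((6, 6))
img[:, 3:] = 1.0  # synthetic image with a vertical edge at column 3
features = max_pool(relu(conv2d(img, edge_kernel)))  # small summary map
```

The pooled map responds only where the edge sits, which is exactly the "removing redundant information, summarizing critical features" behavior described above.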

More recently, CNNs have been supplemented by Vision Transformers, which leverage a mechanism called self-attention to capture long-range dependencies across an entire image. Vision Transformers divide images into smaller patches (analogous to words in text) and encode relationships between all patches simultaneously. This architecture has achieved state-of-the-art performance on multiple computer vision benchmarks and is increasingly being adapted for dermatological tasks.
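The patching step can be sketched in a few lines of numpy. With the standard ViT settings assumed here for illustration (a 224x224 RGB image split into 16x16 patches), it yields 196 patch vectors of length 768, which self-attention then compares all-against-all:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an HxWxC image into non-overlapping flattened patches,
    analogous to tokenizing text into words for a Transformer."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group pixels by patch
    return patches.reshape(ph * pw, patch_size * patch_size * C)

tokens = patchify(np.zeros((224, 224, 3)))  # 196 "visual words" of length 768
```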

Evaluation metrics are crucial for assessing AI performance. The Area Under the Receiver Operating Characteristic curve (AUROC or AUC) is the dominant metric, where 1.0 indicates perfect discrimination and 0.5 means random guessing. Sensitivity measures the ability to detect true positives (catching melanomas), while specificity measures accuracy in ruling out negatives (avoiding false alarms). For segmentation tasks, the DICE coefficient and Jaccard index quantify overlap between predicted and actual lesion boundaries.
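These metrics reduce to simple ratios over the confusion counts; a small worked example with toy numbers (not figures from the review):

```python
# Toy screening results: 100 lesions, 20 of them true melanomas
TP, FN = 18, 2   # melanomas caught vs. missed
TN, FP = 70, 10  # benign lesions correctly cleared vs. false alarms

sensitivity = TP / (TP + FN)  # 0.90  - ability to catch melanomas
specificity = TN / (TN + FP)  # 0.875 - ability to avoid false alarms

# Segmentation overlap between predicted (P) and true (T) pixel sets:
overlap, pred_size, true_size = 80, 100, 110
dice = 2 * overlap / (pred_size + true_size)  # 2|P&T| / (|P|+|T|)
union = pred_size + true_size - overlap
jaccard = overlap / union                     # |P&T| / |P|T union|
```

Dice and Jaccard always move together (Dice = 2J / (1 + J)), which is why papers often report only one of the two.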

TL;DR: AI in dermatology relies primarily on CNNs that learn image features automatically, supplemented by newer Vision Transformers. Models are evaluated using AUC (discrimination ability), sensitivity (catching melanomas), and specificity (avoiding false positives), with DICE scores used for segmentation accuracy.
Pages 4-8
The Datasets Powering Melanoma AI: ISIC, HAM10000, and Beyond

Why datasets matter: Small datasets restrict the learning and generalizability of AI algorithms. The availability of large, demographically expansive, and standardized datasets is essential for building models that work reliably across diverse patient populations. This review catalogs 15 major datasets that researchers use to train and evaluate melanoma detection algorithms.

The ISIC Archive is the most widely used resource, developed by the International Skin Imaging Collaboration. It has grown from 900 images in its first challenge (ISIC 2016, two classes) to over 400,000 images in the most recent SLICE-3D dataset, which contains standardized lesion crops from 3D Total Body Photography collected across seven international clinics. The ISIC 2024 Kaggle challenge attracted roughly 3,500 participants worldwide with about 80,000 submissions. The HAM10000 dataset provides 10,015 dermoscopic images across seven diagnostic categories, sourced from clinics in Australia and Austria. The PH2 dataset from Portugal includes 200 images with detailed medical annotations covering dermoscopic criteria such as asymmetry, pigment network, dots/globules, and blue-whitish veil.

Critically, most datasets are dominated by images from fair-skinned individuals. The Fitzpatrick 17k dataset attempted to address this by labeling 16,577 images across all six Fitzpatrick skin types, though images of the darkest skin types (5 and 6) represent only 2,168 of 16,577 total images. The SCIN dataset by Google and Stanford collected over 10,000 images directly from Internet users through advertisements, including metadata on ethnicity and Monk skin tone. The Diverse Dermatology Images dataset from Stanford was specifically structured to compare dark skin tones (Fitzpatrick 5-6) against light skin tones (Fitzpatrick 1-2) with pathologically validated diagnoses.

Other notable datasets include BCN20000 (19,424 dermoscopic images from Barcelona covering challenging sites like nails and mucosa), PAD-UFES-20 (smartphone-captured clinical images with extensive patient metadata from Brazil), and the Asan dataset (17,125 clinical images with over 99% of subjects being of Asian descent). Each dataset brings different strengths and limitations in terms of imaging modality, population diversity, and diagnostic confirmation method.

TL;DR: AI for melanoma depends on large, well-labeled datasets. The ISIC Archive now contains over 400,000 images, and HAM10000 provides 10,015 dermoscopic images. However, most datasets overrepresent fair-skinned patients, and newer efforts like Fitzpatrick 17k and SCIN aim to improve skin tone diversity.
Pages 8-11
AI Applied to Clinical (Non-Dermoscopic) Images of Melanoma

Visual assessment with the ABCDE criteria (Asymmetry, Border irregularity, Color variation, Diameter over 6 mm, and Evolving changes) remains the standard bedside screening method. AI has been applied to enhance the accuracy of these assessments using standard clinical photographs rather than specialized dermoscopic images. Nasr-Esfahani et al. trained a simple two-layer CNN on just 170 clinical images, augmented to 6,120 through cropping, scaling, and rotation, achieving 81% accuracy, 80% specificity, and 81% sensitivity.
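The 170-to-6,120 expansion works out to exactly 36 variants per image. The paper does not list the precise transform grid, but a plausible sketch (4 rotations x 9 crop positions, a hypothetical combination standing in for the cropping, scaling, and rotation described) looks like this:

```python
import numpy as np

def augment(img, crop=24):
    """Generate 36 variants of one image: 4 right-angle rotations
    combined with 9 crop positions (a hypothetical transform grid)."""
    variants = []
    for k in range(4):                       # 0/90/180/270 degree rotations
        rotated = np.rot90(img, k)
        max_off = rotated.shape[0] - crop
        for dy in (0, max_off // 2, max_off):
            for dx in (0, max_off // 2, max_off):
                variants.append(rotated[dy:dy + crop, dx:dx + crop])
    return variants

images = [np.random.rand(32, 32) for _ in range(170)]  # synthetic stand-ins
augmented = [v for img in images for v in augment(img)]
# len(augmented) == 170 * 36 == 6120
```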

More sophisticated architectures have improved on these results. Dorj et al. used a pre-trained AlexNet with an ECOC-SVM classifier on 3,753 images of four skin cancer types, achieving 94.2% accuracy for melanoma with 97.8% sensitivity. Soenksen et al. applied VGG16 transfer learning to a dataset of 33,980 images and achieved an AUC of 0.935 for suspicious pigmented lesions, with 96.3% agreement with the consensus of 10 dermatologists when using a saliency-based "ugly duckling" detection approach. Han et al. used ResNet-152 to classify 12 skin diseases across the Asan and Edinburgh datasets, achieving AUCs of 0.96 and 0.88 respectively, with the performance drop on Edinburgh highlighting the impact of demographic differences on algorithm effectiveness.

Liu et al. built a deep learning system (DLS) with Inception-v4 modules that processed both images and patient metadata to identify 26 common skin conditions. For top-three diagnoses, the DLS achieved 90% accuracy compared to 75% for dermatologists. Sangers et al. conducted a prospective multicenter study of a CNN-based smartphone app (RD-174) on 785 lesions from dermatology outpatient clinics, reporting overall sensitivity of 86.9% and specificity of 70.4%, though the study was limited by having over 80% of participants with Fitzpatrick skin types 1 or 2.

Importantly, only 12 of the roughly 50 studies reviewed used standard clinical images rather than dermoscopic images, a significant gap given that dermoscopy is often unavailable to non-dermatologists and primary care providers who perform most initial skin screenings.

TL;DR: AI applied to standard clinical photographs achieves 81-94% accuracy depending on the architecture, with top models like VGG16 reaching AUCs of 0.935. However, most AI research still focuses on dermoscopic images, leaving a gap for primary care settings where dermoscopy is unavailable.
Pages 12-20
AI-Assisted Dermoscopy: Distinguishing Melanoma from Benign Lesions and Other Cancers

Melanoma vs. benign lesions: The review analyzed 37 studies using dermoscopic images, of which 26 evaluated AI for melanoma detection specifically. Masood et al. compared three artificial neural network (ANN) algorithms and found that the Scaled Conjugate Gradient (SCG) method achieved 92.6% sensitivity and 91.4% specificity. Chanki Yu et al. showed that a pre-trained VGG-16 CNN for acral melanoma achieved performance comparable to expert dermatologists and significantly outperformed non-experts. Fink et al. demonstrated that Moleanalyzer Pro (based on GoogLeNet Inception v4) achieved 97.1% sensitivity and 78.8% specificity in distinguishing combined nevi from melanoma, outperforming all 11 dermatologists tested.

AI as clinical decision support: Giulini et al. evaluated 64 physicians assessing 100 dermoscopic photographs with and without CNN assistance. With CNN support, mean sensitivity rose from 56.3% to 67.9% and specificity from 69.3% to 73.7%. Hybrid models combining image analysis with patient metadata showed particular promise: Ningrum et al. found that a CNN+ANN model integrating dermoscopic images and clinical data achieved 92.3% accuracy, far exceeding the 73.7% accuracy of the CNN analyzing images alone.
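The fusion idea behind such hybrid models can be sketched as simple feature concatenation feeding a classifier head. All dimensions and metadata fields below are illustrative assumptions, not the architecture of Ningrum et al.:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a CNN's embedding of a dermoscopic image
image_features = rng.normal(size=128)

# Clinical metadata encoded numerically (hypothetical fields)
metadata = np.array([0.62,             # age, scaled to [0, 1]
                     1.0,              # sex (1 = male)
                     0.0, 1.0, 0.0])   # one-hot lesion site (e.g. trunk)

# Hybrid model: concatenate both feature sets before the classifier head
fused = np.concatenate([image_features, metadata])
weights = rng.normal(size=fused.size) * 0.05  # untrained, for illustration
melanoma_prob = 1.0 / (1.0 + np.exp(-fused @ weights))  # sigmoid output
```

The intuition is that a lesion's appearance is read differently for a 25-year-old than for a 70-year-old, and concatenation lets the classifier learn those interactions.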

Melanoma vs. other skin cancers: Esteva et al. applied transfer learning with Google Inception v3 on 127,463 clinical images and achieved an AUC of 0.94 for melanoma from clinical photographs and 0.91 from dermoscopic images, performing on par with 21 board-certified dermatologists. Rezvantalab et al. compared DenseNet 201, ResNet 152, Inception v3, and InceptionResNet v2, finding that all outperformed dermatologists in detecting melanoma and BCC, with ResNet 152 achieving the top melanoma AUC of 94.4%. Tschandl et al. showed that the top three algorithms from the ISIC 2018 challenge achieved 81.9% sensitivity versus 67.8% for human readers in seven-way classification.

Explainability and trust: Chanda et al. developed an explainable AI (XAI) algorithm that explains the basis of its melanoma prediction, increasing clinicians' diagnostic confidence and trust. Correira et al. introduced an interpretable prototypical-part model that uses binary segmentation masks to ensure that learned features relate specifically to important lesion areas rather than irrelevant background artifacts. Barata et al. developed a reinforcement learning (RL) model with a dermatologist-created reward table that achieved 79.5% sensitivity for melanoma, compared to 61.4% for a standard supervised learning model, and improved dermatologists' correct diagnosis scores by 12% when used as an aid.

TL;DR: AI-assisted dermoscopy improves melanoma detection across experience levels. Top CNN models achieve 94-97% sensitivity, and combining image analysis with patient metadata boosts accuracy to over 92%. Newer explainable AI and reinforcement learning approaches are improving both performance and clinical trust.
Pages 20-22
AI for Reflectance Confocal Microscopy: Cellular-Level Non-Invasive Diagnosis

Reflectance confocal microscopy (RCM) allows in vivo imaging of skin lesions at "quasi-histologic" resolution without requiring a biopsy. RCM captures grayscale images in an en face orientation (horizontal slices rather than the vertical cross-sections used in standard pathology) and depends on the natural reflectance contrast of skin tissue. It has been shown to enhance sensitivity and specificity for melanoma diagnosis and reduce unnecessary biopsies. However, RCM produces grayscale images, is susceptible to technical artifacts (air bubbles, motion, nodular convexity), and requires significant reader expertise.

Artifact detection: Kose et al. demonstrated that an automated semantic segmentation method called Multiscale Encoder-Decoder Network (MED-Net) could detect artifacts in RCM images of melanocytic lesions with 83% sensitivity and 92% specificity, helping to pre-filter images before diagnostic analysis. Pattern recognition with MED-Net achieved a pixel-wise mean sensitivity of 70% and specificity of 95% for detecting various patterns of melanocytic lesions at the dermal-epidermal junction (DEJ), with a DICE coefficient of 0.71.

Gerger et al. developed an automated system using Classification and Regression Trees (CARTs) that correctly classified 97.3% of RCM images in the learning set but only 81% in the test set, highlighting generalizability challenges. Wodzinski et al. used a ResNet-based CNN that achieved 87% accuracy in identifying melanoma, BCC, and nevi from in vivo RCM images, slightly surpassing human expert accuracy. D'Alonzo et al. implemented a weakly supervised model based on EfficientNet to analyze RCM mosaics, achieving an AUC of 0.969 and DICE coefficient of 0.778 for distinguishing benign from melanoma-suggestive regions, enabling spatial localization that enhances interpretability for clinicians.

TL;DR: AI applied to RCM imaging achieves up to 97% accuracy in training sets and 87% in real-world tests, with MED-Net and EfficientNet models excelling at pattern segmentation (AUC 0.969). These tools help overcome RCM's key limitations: artifact susceptibility and the steep learning curve required for accurate image interpretation.
Pages 22-23
Optical Coherence Tomography and Emerging Non-Invasive Imaging Technologies

Optical coherence tomography (OCT) is a non-invasive imaging method that captures echo delays and intensity of reflected infrared light, enabling real-time visualization of skin to depths of 1-2 mm with 3-15 micrometer resolution. Several advanced variants have been developed, including full-field OCT (FF-OCT), vibrational OCT (VOCT), and combination devices pairing OCT with near-infrared Raman spectroscopy.

Chou et al. used a multi-directional CNN to successfully predict the dermal-epidermal junction (DEJ) in FF-OCT images, a boundary critical for melanoma staging and Breslow depth assessment. Silver et al. demonstrated that a logistic regression model achieved 83.3% sensitivity and 77.8% specificity in distinguishing melanoma from normal skin using VOCT images. Lee et al. trained an SVM on OCT images and successfully identified pigmented non-malignant lesions.

You et al. developed an integrated OCT-Raman spectroscopy device and tested multiple ML models for distinguishing between BCC, SCC, melanoma, and normal cells. Using the decision tree algorithm on OCT features alone, they achieved 85.9% accuracy. Remarkably, discrimination between melanoma and keratinocytic tumors using Raman spectra reached 98.9% accuracy with the KNN algorithm and 91.6% with decision trees, suggesting that combining OCT structural imaging with Raman molecular fingerprinting could provide highly accurate non-invasive classification.
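KNN, the algorithm behind that 98.9% figure, simply labels each new spectrum by majority vote among its nearest labeled neighbors in feature space. A minimal sketch on toy 2-D "spectral features" (not real Raman data, which is high-dimensional):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Toy feature vectors standing in for reduced Raman spectra
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # "melanoma" cluster
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])  # "keratinocytic" cluster
y = np.array(["melanoma"] * 3 + ["keratinocytic"] * 3)

pred = knn_predict(X, y, np.array([0.15, 0.1]))  # -> "melanoma"
```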

TL;DR: OCT and its variants provide real-time, non-invasive skin imaging at depths up to 2 mm. When combined with Raman spectroscopy and ML, these devices achieve up to 98.9% accuracy distinguishing melanoma from other cancers, pointing toward a future of highly precise non-invasive diagnosis.
Pages 23-27
Dataset Limitations, Bias, and the Generalizability Problem

Skin type diversity: Most publicly available datasets predominantly consist of images from white or fair-skinned individuals, or they lack skin type labels entirely. Only 2.1% of images across evaluated datasets included Fitzpatrick skin type metadata. Algorithms trained primarily on fair skin images show significantly reduced performance when applied to lesions from individuals with darker skin tones. An algorithm trained mainly on East Asian skin images also performed poorly on White American patients, confirming that dataset bias is not limited to any single direction.

Metadata gaps and rare subtypes: Most datasets lack comprehensive metadata such as patient age, gender, ethnicity, lesion location, genetic factors, and environmental exposure history. When clinical information is integrated (as demonstrated by Haenssle et al.), both sensitivity and specificity improve meaningfully. Additionally, rare but aggressive melanoma subtypes like amelanotic melanoma, subungual melanoma, and acral lentiginous melanoma are underrepresented in most datasets, and studies consistently show weaker algorithm performance on these variants.

Generalizability challenges: A comprehensive meta-analysis found that automated systems generally perform worse on independent test sets than on non-independent ones. Models trained at tertiary cancer centers perform best in similar settings and may underperform in primary care. Overfitting to small or homogeneous datasets, class imbalance (benign nevi vastly outnumber melanoma cases), and differences in image acquisition hardware all contribute to the gap between reported performance and real-world reliability. Most reviewed studies did not specify essential patient demographics, raising serious concerns about how their findings would translate across populations.

Image quality and artifacts: Differing camera hardware, zoom levels, lighting, and artifact presence across datasets significantly impact model performance. Surgical skin markings on images have been shown to reduce specificity and AUC. Dark corner artifacts in dermoscopic images lead to significant specificity drops when large. These findings underscore that AI models are learning patterns from images rather than truly understanding skin biology, making them vulnerable to spurious visual correlations.

TL;DR: Major limitations include biased training data (most datasets overrepresent fair skin and lack skin type labels), underrepresentation of rare melanoma subtypes, poor generalizability across populations and clinical settings, and sensitivity to image artifacts like surgical markings and dark corners.
Pages 26-33
Toward Reliable, Equitable, and Clinically Integrated AI for Melanoma

Explainability and reliability: Understanding how black-box models like CNNs and Transformers make decisions remains inherently challenging for clinicians. Kim et al. developed a vision-language model called MONET that correlates dermatological concepts from literature captions with image content in a self-supervised fashion, providing AI transparency throughout the development pipeline. Yan et al. proposed Explanatory Interactive Learning, integrating human users into the ML training process to identify and remove confounding behaviors by transforming feature representations into explainable concept scores.

Standardization and benchmarking: The International Skin Imaging Collaboration (ISIC) has aggregated over 1.1 million images from leading cancer research institutions across five continents and leads standardization efforts within Digital Imaging and Communications in Medicine (DICOM) for dermatology imaging. Through regular ML challenges, ISIC fosters collaboration between AI and medical communities, enabling researchers to benchmark approaches against standardized datasets. This collaborative infrastructure is essential for developing models that perform reliably across diverse populations.

What needs to happen: The review calls for prospective clinical trials that evaluate AI models in real-world settings across diverse populations. Dataset development must prioritize reflecting patient demographics at different levels of care, not just specialized cancer centers. Class balancing techniques (resampling minority classes, downsampling majority classes, class re-weighting) are needed to address the overwhelming imbalance between benign and malignant lesions. Image quality standardization, comprehensive metadata inclusion, and multi-modal dataset matching (combining dermoscopic, clinical, RCM, and total body photography images of the same lesions) would create training environments closer to real clinical workflows.
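Class re-weighting, one of the balancing techniques listed above, can be sketched as inverse-frequency loss weights (the same heuristic scikit-learn calls "balanced"; the class counts here are made up):

```python
from collections import Counter

# Hypothetical training set: benign nevi vastly outnumber melanomas
labels = ["benign"] * 950 + ["melanoma"] * 50
counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Inverse-frequency weights: w_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}
# Each melanoma example now carries ~19x the loss weight of a benign one,
# so the model cannot minimize loss by simply predicting "benign" everywhere.
```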

Despite ongoing challenges, AI in non-invasive melanoma detection holds clear promise. The technology can reduce clinician workload by automating triage of benign lesions in primary care, decrease unnecessary biopsies, and help identify suspicious lesions earlier. However, achieving clinical integration requires continued validation, regulatory clarity, improved diversity in training data, and transparent AI decision-making. Dermatological expertise and clinical correlation remain indispensable, particularly for complex cases requiring contextual judgment.

TL;DR: The path forward requires explainable AI models (like MONET), standardized benchmarks through ISIC, prospective clinical trials across diverse populations, multi-modal datasets, and clear regulatory frameworks. AI will augment, not replace, dermatological expertise in non-invasive melanoma detection.