Performance of Commercial Dermatoscopic Systems That Incorporate Artificial Intelligence for the Identification of Melanoma in General Practice

Journal of the American Academy of Dermatology, 2024

Plain-English Explanations
Pages 1-2
What This Systematic Review Covers and Why It Matters

Malignant melanoma is a growing global health burden, with an annual incidence of 325,000 cases and 57,000 deaths worldwide. Incidence is projected to climb by roughly 50% by 2040, with annual deaths reaching 96,000. Australia and New Zealand remain the two nations with the highest incidence. Major risk factors include a high nevus count, fair skin, cumulative ultraviolet radiation exposure, CDKN2A gene mutations, and a personal or family history of melanoma. Advanced-stage melanoma still carries a poor prognosis, making early detection critical.

Dermoscopy is widely accepted for improving the sensitivity of identifying malignant skin lesions compared to unaided visual inspection. However, emerging technologies like artificial intelligence may further improve detection rates and reduce unnecessary invasive procedures. AI has been proposed as a non-invasive tool that can help clinicians diagnose malignant lesions earlier and more accurately while limiting unnecessary biopsies of benign lesions. Patient attitudes toward AI in melanoma diagnosis have been generally favorable, especially when AI is used as an adjunct to a trained clinician.

While many studies have shown convolutional neural networks (CNNs) achieving "dermatologist-level performance," most of these studies test algorithms on pre-built image databases like HAM10000 and ISIC challenges, which may not replicate real-world clinical settings. This systematic review by Miller et al. (2024) specifically examined the performance of commercially available, market-approved dermatoscopic systems with AI when used on actual patients in clinical practice for classifying melanoma.

TL;DR: Melanoma incidence is rising globally and early detection is essential. While AI tools show promise, most studies test them on curated image databases rather than real patients. This review focused specifically on commercial, market-approved AI systems tested in actual clinical settings.
Pages 2-4
How the Researchers Searched for and Selected Studies

The review followed PRISMA guidelines (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) and was registered in PROSPERO (CRD42023484501). Five electronic databases were searched: CINAHL, Medline, Scopus, ScienceDirect, and Web of Science. The search covered studies published between 2018 and 2023, focusing on keywords related to melanoma, performance metrics, artificial intelligence, detection, and clinical settings.

Studies were eligible if they were peer-reviewed, published in English, and used market-approved AI in clinical settings. Crucially, studies that only tested algorithms on pre-built image databases or "challenges" were excluded. Studies also had to report melanoma performance separately from other cutaneous malignancies. The ground truth for each lesion was determined by histopathology, meaning every suspicious lesion had been biopsied and examined under a microscope to confirm or rule out melanoma.

The primary performance metrics collected were sensitivity (how well the system identifies true melanoma cases), specificity (how well it correctly identifies non-melanoma lesions), accuracy (overall correct classifications), and AUROC (area under the receiver operating characteristic curve, where values above 0.8 are clinically acceptable and 0.9 to 1.0 are highly accurate). Out of 2,772 initially identified studies, 16 met all inclusion criteria.
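To make these metrics concrete, here is a minimal Python sketch (using scikit-learn, with made-up labels and scores rather than data from the review) showing how sensitivity, specificity, accuracy, and AUROC are computed from biopsy-confirmed ground truth and a classifier's output:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical example: 1 = histopathology-confirmed melanoma, 0 = benign lesion
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Binary calls made by an AI system at its operating threshold
y_pred  = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
# Continuous malignancy scores from the same system (used for AUROC)
y_score = [0.91, 0.78, 0.42, 0.35, 0.12, 0.66, 0.08, 0.21, 0.17, 0.05]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                   # melanomas correctly flagged
specificity = tn / (tn + fp)                   # benign lesions correctly cleared
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall correct classifications
auroc       = roc_auc_score(y_true, y_score)   # threshold-independent discrimination

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"accuracy={accuracy:.2f} AUROC={auroc:.2f}")
```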

TL;DR: The researchers systematically searched five major databases and identified 16 studies that tested commercially available AI systems on real patients in clinical settings, with melanoma confirmed by biopsy.
Pages 4-6
The Landscape of Included Studies: Technologies, Locations, and Scale

Across the 16 included studies, there were a total of 1,160 confirmed melanomas and 33,010 benign lesions. The studies investigated several overlapping technology categories (some studies evaluated more than one configuration): eleven examined bedside CNN performance, eight compared CNNs versus clinicians, three evaluated CNNs working alongside clinicians, three assessed mobile applications, three investigated 3D total body photography (TBP), and two reported on 2D TBP.

The studies came from diverse geographic locations. Germany led with five published studies, while Australia, Switzerland, the UK, and the USA each contributed two studies. Canada, Romania/Netherlands, and Spain each contributed one. The number of publications has steadily increased, rising from two per year in 2019 through 2021 to seven in 2023 alone, reflecting the growing clinical interest in this field.

Quality assessment was performed using the AXIS critical appraisal tool, with 15 of the 16 articles rated as "good" quality and one rated as "fair." The average quality score was 83.4%, with strong interrater reliability (Cohen's kappa = 0.977, p < 0.001), indicating near-perfect agreement between the two independent raters.
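Agreement of this kind is typically quantified with Cohen's kappa; the short sketch below (with hypothetical ratings, not the review's actual AXIS scores) shows how it would be computed for two independent raters:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-article quality ratings from two independent reviewers
rater_a = ["good", "good", "fair", "good", "good", "poor", "good", "fair"]
rater_b = ["good", "good", "fair", "good", "fair", "poor", "good", "fair"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.3f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```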

TL;DR: Sixteen high-quality studies spanning seven countries were included, covering 1,160 melanomas and 33,010 benign lesions. The technologies tested ranged from mobile apps to bedside CNNs to total body photography systems.
Pages 6-8
Performance of Mobile Applications and Total Body Photography

Mobile applications: Three studies evaluated smartphone-based AI apps for melanoma classification. The apps tested included a CNN-based triage application and SkinVision. Performance varied considerably: sensitivity ranged from 80.0% to 92.8%, specificity from 60.0% to 95.0%, and accuracy from 62.3% to 92.0%. One study reported an AUROC of 0.717, which falls below the 0.8 threshold typically considered clinically acceptable. Study sizes ranged widely, from just 5 melanoma cases up to 138.

3D total body photography: Three studies examined 3D TBP, all using the Canfield Vectra WB360 system. Two studies with small participant numbers (6 to 10 melanomas) showed high sensitivity (83.3% to 90.0%) but lower specificity (63.6% to 64.6%), with overall accuracy between 65.6% and 68.0%. One larger study (43 melanomas, 22,489 other lesions) reported an AUROC of 0.9399, which is highly accurate. The third study achieved an AUROC of 0.92.

2D total body photography: Two studies using the FotoFinder 2D TBP system reported the weakest results of any technology category. Sensitivity ranged from 70.0% to 83.3%, but specificity was uniformly low at 40.0% for both studies, yielding accuracy of only 44.0% to 44.3%. One study reported an AUROC of just 0.68. These results indicate that 2D TBP in its current form struggles to differentiate melanoma from benign lesions, producing a high rate of false positives.

TL;DR: Mobile AI apps showed mixed results (sensitivity 80-93%, but variable specificity). 3D total body photography performed reasonably well with AUROC up to 0.94, while 2D total body photography performed poorly with only 40% specificity and accuracy around 44%.
Pages 8-10
Performance of Standalone Bedside Convolutional Neural Networks

Eleven studies reported performance metrics for bedside CNN devices used in clinical settings. The commercial systems tested included SkinAnalytics DERM, FotoFinder MoleAnalyzer Pro, MetaOptima, and quantusSKIN. Performance was highly heterogeneous. Sensitivity ranged from a low of 16.4% all the way up to 100.0%. Specificity ranged from 54.4% to 98.3%, and accuracy ranged from 54.2% to 87.7%. Six studies reported AUROC values between 0.54 and 0.969.

The SkinAnalytics DERM system was tested across multiple camera types (iPhone, Galaxy S5, and DSLR). Notably, the newer version of DERM (version B) achieved 100% sensitivity in two validation cohorts, meaning it did not miss any melanomas. However, specificity for this version was around 80%, meaning roughly one in five benign lesions was flagged as suspicious. The older version showed lower sensitivity (71-79%) with AUROC values between 0.823 and 0.879.

FotoFinder MoleAnalyzer Pro was the most frequently tested CNN across multiple studies. Results were strikingly inconsistent: one Australian study reported sensitivity of only 53.3% and accuracy of 54.2%, while a German study with the same device achieved 95.7% sensitivity and accuracy of 86.2%. This extreme variability likely reflects differences in patient populations, lesion characteristics, and the types of melanoma encountered in different geographic settings.

Two other CNN systems showed contrasting performance profiles. MetaOptima tested with two different classification models showed sensitivity of 50.9% for the 7-class model and a very low 16.4% for the ISIC model, but the latter achieved 98.3% specificity. The quantusSKIN system achieved a more balanced 69.1% sensitivity and 80.2% specificity with an AUROC of 0.802.

TL;DR: Standalone bedside CNNs showed wildly variable results, with sensitivity ranging from 16% to 100% depending on the device, software version, and clinical setting. No single commercial system demonstrated consistently reliable performance across all studies.
Pages 9-10
Head-to-Head: How CNNs Compare to Clinicians Working Alone

Eight studies directly compared clinician performance against CNN performance. The clinicians ranged from novice family practitioners to experienced dermatologists, and the results were highly variable for both groups. Clinician sensitivity ranged from 41.8% (novice practitioners) to 96.6% (dermatologists), while CNN sensitivity ranged from 16.4% to 100.0%. Clinician specificity ranged from 32.2% to 92.7%, compared to 54.4% to 98.3% for CNNs.

Experience level played a significant role in clinician performance. Beginner dermatologists (less than 2 years of experience) had lower sensitivity and specificity compared to skilled (2 to 5 years) and expert (more than 5 years) practitioners. For example, in one study, beginners achieved 80.0% sensitivity compared to 100.0% for experts. Interestingly, dermatologists achieved the highest reported clinician AUROC at 0.91, while the best CNN AUROC was 0.969.

A notable geographic pattern emerged. The two Australian studies reported comparatively lower CNN sensitivity values than studies from other countries. The authors hypothesize this may be because Australian general practitioners, working in the nation with the highest melanoma incidence globally, routinely encounter early-stage, subtle melanomas with limited visual clues. These feature-poor lesions may be more challenging for current AI architectures that were not specifically trained on such cases.

TL;DR: Neither clinicians nor AI systems were consistently better at detecting melanoma. Performance depended heavily on clinician experience and the specific CNN used. Geographic differences in melanoma presentation may explain why AI performed worse in Australia, where subtle early-stage lesions are more common.
Pages 10-11
The Most Promising Finding: Clinicians and AI Working Together

Three studies examined what happened when clinicians used AI as a support tool rather than competing against it. This "clinician plus AI" approach produced the most consistent and promising results of any category in the review. Sensitivity ranged from 83.3% to 100.0%, specificity from 83.7% to 87.3%, and accuracy between 86.4% and 86.9%. Two studies reported AUROC values of 0.88 to 0.968.

Compared to every other configuration tested, the clinician-AI partnership showed notably less variability. Standalone CNNs had sensitivity gaps as wide as 84 percentage points (16.4% to 100.0%), and clinicians alone varied by 55 percentage points (41.8% to 96.6%). By contrast, the combined approach had a sensitivity range of only about 17 percentage points. The specificity range was similarly narrow at just 3.6 percentage points (83.7% to 87.3%), suggesting that working together substantially reduced both missed melanomas and unnecessary biopsies.

In one study, Winkler et al. (2023) found that dermatologists supported by the FotoFinder MoleAnalyzer Pro achieved 100.0% sensitivity and 83.7% specificity with an AUROC of 0.968. This compares favorably to the same CNN's standalone performance of 81.6% sensitivity. The collaboration also benefited less-experienced clinicians, with one study showing that even beginners improved their diagnostic performance when aided by AI output.

TL;DR: Clinicians working together with AI produced the most consistent results of any approach tested, with sensitivity of 83-100%, specificity of 84-87%, and dramatically less performance variability than either clinicians or AI working alone.
Pages 11-13
Sensitivity vs. Specificity: The Clinical Trade-Off in Melanoma Detection

The review highlights an important clinical tension between sensitivity and specificity when it comes to melanoma detection. Low sensitivity means a greater likelihood of missing patients with melanoma, which can be devastating given that advanced stages of the disease carry significantly worse prognosis. Low specificity, on the other hand, leads to overtreatment, with patients undergoing unnecessary excisional biopsies for benign lesions. Neither metric should be considered in isolation.

The European Academy of Dermatology and Venereology (EADV) has deliberately avoided stating a minimally acceptable accuracy threshold for mobile applications and web-based skin cancer services, citing risk-benefit considerations. The review authors agree with this position, noting that setting an arbitrary minimum performance figure could potentially cause more harm than good. At present, AI does not replace the gold standard of diagnosis, which remains biopsy of suspect lesions followed by histopathological examination.

The discussion around 2D total body photography is particularly relevant. Low specificity values (40% in both studies) lead to overdiagnosis, which connects to a broader debate about melanoma rates rising dramatically without a corresponding increase in mortality. When AI systems flag too many benign lesions as suspicious, the downstream consequences include patient anxiety, unnecessary procedures, and increased healthcare costs.
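To illustrate that downstream cost, the sketch below works through an assumed screening scenario (numbers chosen for illustration only, in the spirit of the 2D TBP results: roughly 70% sensitivity, 40% specificity, and an assumed 1% melanoma prevalence among examined lesions) to show how many benign lesions would be flagged for every melanoma found:

```python
# Hypothetical screening scenario; all numbers are assumptions for illustration only
n_lesions   = 10_000   # lesions examined
prevalence  = 0.01     # assumed fraction that are melanoma
sensitivity = 0.70     # in line with the review's 2D TBP range
specificity = 0.40

melanomas = n_lesions * prevalence
benign    = n_lesions - melanomas

true_positives  = sensitivity * melanomas      # melanomas caught
false_positives = (1 - specificity) * benign   # benign lesions flagged as suspicious
ppv = true_positives / (true_positives + false_positives)

print(f"True positives:  {true_positives:.0f}")    # 70
print(f"False positives: {false_positives:.0f}")   # 5940
print(f"Positive predictive value: {ppv:.1%}")     # ~1.2%
print(f"Benign lesions flagged per melanoma found: "
      f"{false_positives / true_positives:.0f}")   # ~85
```

Under these assumptions, roughly 85 benign lesions would be flagged for every melanoma detected, which is the scale of overdiagnosis burden that low specificity implies.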

TL;DR: In melanoma detection, missing a cancer (low sensitivity) is dangerous, but flagging too many benign lesions (low specificity) causes unnecessary biopsies and anxiety. The EADV has deliberately avoided setting minimum accuracy thresholds, and biopsy with histopathology remains the gold standard.
Pages 13-15
Limitations, Transparency Challenges, and Where the Field Goes Next

The review acknowledges several important limitations. The 16 included studies were highly varied in methodology, technology used, geographic location, and subject populations. Additionally, 15 other studies were excluded because they classified lesions broadly as "malignant versus benign" without reporting melanoma-specific performance. This means the review captured only a portion of the available evidence on commercial AI dermatoscopic systems. There also remains a persistent lack of testing on individuals with darker skin, particularly Fitzpatrick skin types IV through VI.

A significant transparency problem exists among commercial AI vendors. Of the systems studied, FotoFinder and quantusSKIN have reported using CNN variants based on GoogLeNet Inception algorithms. Canfield shared that it uses two AI applications, one based on EfficientNetV2 and the other a custom CNN. However, other companies, including MetaOptima, SkinAnalytics, SkinVision, and Triage Technologies, did not respond to requests for architectural details. While the authors acknowledge companies' rights to protect proprietary technology, this lack of disclosure limits the ability to compare and evaluate different CNN architectures.

Looking ahead, the authors suggest that market-approved CNNs would benefit from training on images of early evolving melanoma, which may be particularly relevant to high-incidence countries like Australia and New Zealand. Current AI systems receive input only in a 2D plane, while clinicians have access to patient history, family history, whether a lesion is raised, and dermatoscopic color interpretation to assess melanin depth. Future AI systems that integrate clinical metadata alongside images may close this gap. Sequential digital dermoscopy, which monitors lesion changes over time, represents another promising avenue.
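As one illustration of what metadata integration might look like, here is a generic PyTorch sketch (assuming an EfficientNetV2 image backbone and a handful of hypothetical clinical fields; it does not represent any of the reviewed commercial systems) in which image features and tabular clinical inputs are encoded separately and fused before the melanoma prediction:

```python
import torch
import torch.nn as nn
from torchvision import models

class LesionClassifier(nn.Module):
    """Illustrative image + clinical-metadata fusion model (not a reviewed system)."""

    def __init__(self, n_metadata_features: int = 6):
        super().__init__()
        # Pretrained image backbone; most reviewed vendors do not disclose their architectures
        backbone = models.efficientnet_v2_s(weights="DEFAULT")
        backbone.classifier = nn.Identity()           # keep the 1280-dim pooled image features
        self.backbone = backbone
        # Small encoder for tabular metadata (e.g. age, site, lesion raised, family history)
        self.meta_encoder = nn.Sequential(nn.Linear(n_metadata_features, 32), nn.ReLU())
        # Fused head producing a single melanoma logit
        self.head = nn.Sequential(nn.Linear(1280 + 32, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, image: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        img_feat  = self.backbone(image)              # (batch, 1280)
        meta_feat = self.meta_encoder(metadata)       # (batch, 32)
        return self.head(torch.cat([img_feat, meta_feat], dim=1))  # (batch, 1) logit

# Example forward pass with dummy inputs
model = LesionClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 6))
prob_melanoma = torch.sigmoid(logits)
```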

The review concludes that rather than framing the question as "AI versus clinicians," the field should embrace a collaborative model where AI supports clinical decision-making. The most consistent performance metrics emerged when clinicians and AI worked in partnership, suggesting this is the most productive path forward for melanoma detection in general practice.

TL;DR: Key limitations include varied study designs, lack of diversity in skin types tested, and poor transparency from AI vendors about their architectures. The future lies in AI systems that integrate clinical metadata, train on early-stage melanomas, and serve as decision-support tools alongside clinicians rather than replacements.