Deep learning-level melanoma detection by interpretable machine learning and imaging biomarker analysis

Journal of Biomedical Optics, 2020

Plain-English Explanations
Pages 1-3
Why Melanoma Screening Needs Transparent AI, Not Black Boxes

Melanoma is the deadliest form of skin cancer, accounting for over 96,000 new cases annually in the United States and roughly 60,000 deaths worldwide each year. Early detection is critical: melanomas in the vertical growth phase invade at approximately 0.13 mm per month, so a delay of even a few months can push the tumor past the basal layer basement membrane at roughly 0.3 mm depth, significantly worsening prognosis. Despite these stakes, most patients first present to primary care physicians, who lack the dermoscopy expertise needed to reliably distinguish benign nevi from melanoma. The authors note that the ratio of US population to top dermoscopists is approximately 6,480,000 to 1, making universal expert screening infeasible.

The interpretability problem: Deep learning models, particularly convolutional neural networks (CNNs), have shown strong performance in melanoma image classification, but their "black box" nature remains a barrier to clinical adoption. Physicians are reluctant to trust a diagnostic risk score when they cannot understand how the algorithm reached its conclusion. This paper proposes an alternative: an ensemble classifier called Eclass that uses 38 hand-engineered imaging biomarker cues (IBCs) derived from dermoscopy images, rather than raw pixel analysis. Because the IBCs correspond to visual features dermatologists already understand (color presence, structural asymmetry, border sharpness), the diagnostic process is transparent and interpretable.

Study scope: The research was conducted across two international cohorts. The first cohort used alcohol-coupled, nonpolarized dermoscopy images from New York (acquired with an EpiFlash dermatoscope and Nikon D80 camera at 1-5 megapixels). The second cohort used polarized digital dermoscopy images from the Hospital Clinic de Barcelona (acquired with the Dermlite Foto at approximately 5.9 x 2.7 megapixels). After excluding images with hair, surgical ink, incomplete borders, ulceration, or nodular presentations, the final dataset comprised 349 lesions (113 melanomas and 236 nevi, one lesion per patient), drawn from an initial pool of 668 images.

The study also developed a companion iOS/Mac application ("Eclass Imaging Biomarkers," freely available on the Apple App Store) that visualizes the diagnostically significant biomarkers overlaid on the dermoscopy image. Biomarkers flagged red indicate values in the statistically malignant range, while green indicates benign. This interface was designed to support sensory cue integration during clinical screening, providing physicians with an augmented-reality-like diagnostic aid rather than just a numerical score.

TL;DR: Melanoma kills over 60,000 people per year globally, and the US has only about 50 top dermoscopists for 324 million people. This paper proposes Eclass, a transparent ensemble ML classifier using 38 interpretable imaging biomarkers from dermoscopy, as an alternative to opaque deep learning. The study used 349 lesions (113 melanomas, 236 nevi) from New York and Barcelona cohorts.
Pages 4-6
Engineering 38 Imaging Biomarker Cues from Dermoscopy Features

The core innovation of Eclass is its use of imaging biomarker cues (IBCs), which are quantitative features extracted from dermoscopy images that correspond to clinically recognized visual patterns. Unlike a CNN that ingests raw pixels and learns its own abstract feature representations, the IBCs were hand-engineered to capture specific dermoscopic phenomena. The full set comprised 130 candidate biomarkers: 7 multicolor IBCs (computed from all RGB channels simultaneously) and 41 single-channel IBCs, each evaluated independently in the red, green, and blue channels (7 + 41 x 3 = 130).

Multicolor IBCs (MC1-MC4): MC1 counts the number of dermoscopic colors present in a lesion out of six recognized categories (light brown, dark brown, black, red, blue-gray, and white). Melanomas typically show more colors than benign nevi. MC2 measures the normalized size difference between the red and blue channel lesion masks, capturing spectral irregularity. MC3 computes the mean coefficient of variation of lesion radii across color channels, quantifying shape inconsistency between spectral views. MC4 is a binary indicator for the presence of blue-gray or white coloring, which corresponds to the "blue-white veil" seen in dermoscopy and is a statistically significant melanoma discriminant via the Tyndall effect.
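To make the multicolor IBCs concrete, here is a minimal Python sketch of an MC1-style color count. The six reference RGB values and the 5% coverage threshold are illustrative assumptions; the paper does not publish its exact color-matching procedure.

```python
# Sketch of the MC1 "color count" biomarker: assign each lesion pixel to the
# nearest of six dermoscopic reference colors, then count how many colors
# cover a meaningful fraction of the lesion area.

# Illustrative RGB anchors for the six dermoscopic colors (assumed values).
DERMOSCOPIC_COLORS = {
    "light_brown": (181, 134, 84),
    "dark_brown":  (101, 67, 33),
    "black":       (30, 30, 30),
    "red":         (200, 60, 60),
    "blue_gray":   (110, 130, 150),
    "white":       (230, 230, 230),
}

def nearest_color(pixel):
    """Return the name of the reference color closest to an (R, G, B) pixel."""
    return min(
        DERMOSCOPIC_COLORS,
        key=lambda name: sum((p - c) ** 2
                             for p, c in zip(pixel, DERMOSCOPIC_COLORS[name])),
    )

def mc1_color_count(lesion_pixels, coverage_threshold=0.05):
    """MC1: number of dermoscopic colors covering >= threshold of the lesion."""
    counts = {name: 0 for name in DERMOSCOPIC_COLORS}
    for px in lesion_pixels:
        counts[nearest_color(px)] += 1
    n = len(lesion_pixels)
    return sum(1 for c in counts.values() if c / n >= coverage_threshold)
```

A mostly dark-brown lesion with a small blue-gray patch would score MC1 = 2; the same nearest-color assignment could also drive an MC4-style binary flag for blue-gray or white presence.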

Single-channel IBCs (angular sweep analysis): The majority of IBCs were derived from an angular sweep methodology. A radial arm was projected from the geometric center of the lesion to the border, then rotated 360 degrees clockwise. Along each radial position, the algorithm measured pixel brightness statistics (mean, standard deviation, derivative). For example, B1 captured the average absolute brightness shift between adjacent angular positions. B2 measured the variance of radial brightness standard deviations across the angular sweep. Other IBCs quantified edge demarcation (how sharply the lesion border transitions from dark lesion pixels to bright normal skin), asymmetry, and pigmented network branch length variation.
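The angular sweep can be sketched as below. The 10-degree step, the 50-sample radial density, and the exact B1 formula are assumptions based on the description above, not the paper's published code.

```python
import math

# Sketch of the angular-sweep analysis behind the single-channel IBCs:
# sample brightness along a radial arm at each angle, then summarize how
# the radial statistics change around the 360-degree sweep.

def radial_profile(image, center, radius, angle_deg, n_samples=50):
    """Mean brightness along one radial arm from the lesion center."""
    cx, cy = center
    theta = math.radians(angle_deg)
    values = []
    for i in range(1, n_samples + 1):
        r = radius * i / n_samples
        x = int(round(cx + r * math.cos(theta)))
        y = int(round(cy + r * math.sin(theta)))
        if 0 <= y < len(image) and 0 <= x < len(image[0]):
            values.append(image[y][x])
    return sum(values) / len(values)

def b1_brightness_shift(image, center, radius, step_deg=10):
    """B1-like cue: average absolute brightness change between adjacent angles."""
    angles = range(0, 360, step_deg)
    means = [radial_profile(image, center, radius, a) for a in angles]
    diffs = [abs(means[i] - means[i - 1]) for i in range(1, len(means))]
    return sum(diffs) / len(diffs)
```

A perfectly uniform lesion yields B1 = 0; angular irregularity in pigmentation raises it, which is why this family of cues captures asymmetry.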

Of the 130 candidate IBCs, 38 achieved statistical significance (p < 0.05) in discriminating melanoma from nevi via two-sided unpaired t-tests, Wilcoxon-Mann-Whitney tests, and chi-square tests as appropriate for continuous, ordinal, and categorical variables. The final set included 4 multicolor, 9 red-channel, 7 green-channel, and 18 blue-channel IBCs. Notably, more biomarkers reached significance in the blue channel, likely because shorter wavelengths image superficial epidermal structures (basal layer atypia and junctional melanocyte nests) more effectively than longer wavelengths.
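The screening step can be illustrated with a self-contained stand-in. The paper used t-tests, Wilcoxon-Mann-Whitney, and chi-square tests depending on variable type; the sketch below instead uses a two-sided permutation test on the difference of group means (a nonparametric analogue chosen only to avoid external statistics libraries).

```python
import random

# Screen candidate features: keep those whose melanoma and nevus value
# distributions differ significantly (permutation test stand-in for the
# paper's t / Wilcoxon-Mann-Whitney / chi-square tests).

def permutation_p_value(melanoma_vals, nevus_vals, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(melanoma_vals) / len(melanoma_vals)
                   - sum(nevus_vals) / len(nevus_vals))
    pooled = list(melanoma_vals) + list(nevus_vals)
    n_mel = len(melanoma_vals)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_mel]) / n_mel
                   - sum(pooled[n_mel:]) / (len(pooled) - n_mel))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def select_significant(features, alpha=0.05):
    """Keep feature names whose melanoma/nevus values differ at p < alpha.

    `features` maps name -> (melanoma_values, nevus_values)."""
    return [name for name, (mel, nev) in features.items()
            if permutation_p_value(mel, nev) < alpha]
```

Run over all 130 candidates, this kind of filter is what reduced the library to the 38 significant IBCs.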

TL;DR: From 130 candidate features, 38 imaging biomarker cues achieved p < 0.05 significance for melanoma discrimination. These included multicolor features (dermoscopic color count, blue-white veil presence) and single-channel features derived from 360-degree angular sweep analysis of brightness, edge sharpness, and structural asymmetry. Blue-channel IBCs dominated (18 of 38), reflecting the diagnostic value of superficial epidermal imaging.
Pages 5-8
The Eclass Ensemble: 12 Classifiers Combined via Monte Carlo Simulation

Eclass employed a "wisdom of the crowd" strategy by combining 12 fundamentally different classification algorithms. These were chosen to span the broad universe of base classifier architectures: feed-forward neural networks with a single hidden layer (NNET), support vector machines with both linear and radial kernels (SVM), logistic regression via generalized linear models (GLM), elastic-net penalized logistic regression (GLMnet), gradient-boosted logistic regression (GLMboost), random forests (RF), CART decision trees (RP), K-nearest neighbors (KNN), multivariate adaptive regression splines (MARS), C5.0 decision trees (C50), partial least squares (PLS), and linear discriminant analysis (LDA).

Training protocol: The ensemble was trained and validated within a Monte Carlo simulation framework. In each of 1,000 iterations, the 349 lesions were randomly partitioned into a 75% training set and a 25% hold-out test set. For each classifier, model parameters were optimized by maximizing the partial area under the ROC curve (limiting specificity to the 0-40% range), with tuning parameters estimated via 10-fold cross-validation. The final Eclass score for each lesion was the median melanoma probability across all available classifiers: Eclass Score = median{Prob_i(Melanoma | M)}, where i ranges across the k classifiers and M is the set of IBCs.
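The loop structure above can be sketched in a few lines. The classifier objects here are stand-ins: any object with `fit`/`predict_proba`-style methods would slot in where the paper's 12 tuned algorithms sat, and the partial-AUROC tuning is omitted.

```python
import random
from statistics import median

# Sketch of the Monte Carlo protocol and the median-vote Eclass score.

def eclass_score(classifiers, lesion_features):
    """Eclass score: median melanoma probability across trained classifiers."""
    return median(clf.predict_proba(lesion_features) for clf in classifiers)

def monte_carlo_split(lesions, train_frac=0.75, seed=None):
    """One Monte Carlo iteration: random 75%/25% train/hold-out partition."""
    rng = random.Random(seed)
    shuffled = lesions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

The median is the key design choice: it ignores outlying votes, so one badly tuned classifier cannot drag the ensemble score far in either direction.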

Computational efficiency: Eclass trained all 12 algorithms across 1,000 Monte Carlo iterations in approximately 150 seconds. By contrast, the CNN comparison model (based on ResNet-50 with ImageNet-pretrained weights and binary classification output layers) trained for only 10 cross-validation runs in 52 hours on an Nvidia Quadro M5000 GPU. This represents a roughly 1,200-fold speed advantage for Eclass, although the approximately 3 hours required for IBC extraction is not included in that comparison. The CNN used standard augmentation (flip, zoom, rotate), minority class oversampling, pixel normalization (zero mean, unit standard deviation), and test-time augmentation with majority voting across 5 augmented versions.
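The CNN's test-time augmentation with majority voting is simple to sketch; the model and augmentation callables below are placeholders for the paper's ResNet-50 and flip/zoom/rotate pipeline.

```python
from collections import Counter

# Test-time augmentation: classify several augmented copies of one image
# and take the majority-vote label.

def tta_predict(model, image, augmentations):
    """Majority-vote label across augmented versions of one image."""
    votes = Counter(model(aug(image)) for aug in augmentations)
    return votes.most_common(1)[0][0]
```

With 5 augmentations, as in the paper, a lesion is called melanoma whenever at least 3 of the 5 augmented views are classified that way.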

The study design was double-blinded: dermoscopy images were randomized and stripped of all patient identifiers before entering the algorithm pipeline. No clinical metadata (age, sex, anatomical location, sun damage) was used. The gold standard for each lesion was the histopathological diagnosis from the excisional biopsy performed as part of routine clinical care.

TL;DR: Eclass combined 12 different ML algorithms (from KNN to random forests to SVMs) via median probability voting across 1,000 Monte Carlo iterations. Training took 150 seconds versus 52 hours for a ResNet-50 CNN doing only 10 runs. The study was double-blinded with histopathology as the gold standard, and no patient metadata was used.
Pages 9-10
Eclass Outperformed CNN 75% of the Time on Challenging Dysplastic Nevi

On the Barcelona validation set (Validation Set 1), Eclass achieved a mean AUROC of 0.71 +/- 0.07 with a 95% confidence interval of [0.56, 0.85]. The CNN achieved a mean AUROC of 0.67 with a 95% confidence interval of [0.63, 0.71]. In a Monte Carlo comparison that randomly drew ROCs from the 10 CNN runs and the 1,000 Eclass runs, Eclass produced a higher AUROC 74.88% of the time. While neither model reached the 0.91 AUROC published in prior melanoma detection studies using larger datasets, the authors emphasize that their cohort was deliberately more challenging because all nevi in the study were clinically dysplastic (atypical), meaning they had already been flagged as suspicious enough to warrant biopsy.

Why the AUROCs are lower than published benchmarks: The study cohort excluded obviously benign lesions. Every nevus in the dataset was one that a clinician had judged sufficiently suspicious to excise, so the melanoma-versus-nevus discrimination task was inherently harder than studies that include clearly benign moles. The authors note that performance for both models would likely improve with larger training sets that include a wider spectrum of lesion presentations.

The C5.0 decision tree stood out: Although the median-based ensemble approach defined the Eclass score, the C5.0 decision tree alone outperformed the full ensemble at high sensitivity, yielding 98% sensitivity with 44% specificity on the originally published dataset. The C5.0 tree used branching logic based on just 10 IBCs: 4 from the blue channel (B1, B6, B7, B15), 5 from the red channel (R4, R6, R8, R12, R13), and 1 multicolor (MC1). The tree had 10 decision nodes, 11 terminal nodes, and 7 of those terminal nodes were "pure" (100% melanoma or 100% nevi prevalence). Two nodes alone (#4 and #20) contained 59.8% of all lesions and perfectly discriminated nevi from melanoma.
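Operating points like "98% sensitivity / 44% specificity" come from sliding a threshold along the classifier's score distribution. The generic sketch below shows the mechanics (pick the highest threshold that still keeps the target fraction of melanomas above it); it is not the paper's exact tuning procedure.

```python
import math

# Choose a score threshold that achieves a target sensitivity, then report
# the specificity that results at that threshold.

def operating_point(scores, labels, target_sensitivity=0.98):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity on melanomas (label 1) meets the target."""
    mel = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    n_needed = math.ceil(target_sensitivity * len(mel))
    threshold = mel[len(mel) - n_needed]
    sens = sum(s >= threshold for s in mel) / len(mel)
    spec = sum(s < threshold for s in neg) / len(neg)
    return threshold, sens, spec
```

Demanding near-perfect sensitivity forces the threshold low, which is why specificity at the 98% point (44% for C5.0, 36% for the full ensemble) is the number worth comparing across methods.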

Comparison to existing clinical methods: Table 3 in the paper benchmarks Eclass against over a dozen human and machine-augmented diagnostic methods. Expert dermoscopy achieved 90% sensitivity and 90% specificity. The ABCD dermoscopy rule reached 84% sensitivity and 75% specificity. Commercial systems like MelaFind achieved 98% sensitivity but only 10% specificity. Eclass at the 98% sensitivity operating point achieved 36% specificity, which is comparable to MelaFind and the preliminary SIAscopy results, but with the critical advantage of interpretability.

TL;DR: Eclass AUROC = 0.71 vs. CNN AUROC = 0.67. Eclass outperformed CNN in 74.88% of random comparisons. The C5.0 decision tree alone achieved 98% sensitivity / 44% specificity using only 10 IBCs. Performance was lower than published benchmarks because the cohort consisted entirely of clinically dysplastic (atypical) nevi, making discrimination much harder.
Pages 9-11
The IBC App: Clinician Feedback and Minimalist Visualization Strategy

The research team developed a companion iOS/Mac application that visualizes imaging biomarker cues directly on dermoscopy images. Rather than displaying all 38 IBCs (which would overwhelm clinicians), the app implemented a minimalist visualization strategy. An IBC was highlighted in red if its value fell more than 1.5 standard deviations above the population mean (indicating a malignant-range value) or in green if it fell more than 1.5 standard deviations below the mean (indicating a benign-range value). IBCs within normal range were simply not displayed, reducing cognitive load.
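The display rule as described reduces to a three-way classification of each IBC value against its population statistics; a minimal sketch:

```python
from statistics import mean, stdev

# The app's minimalist display rule: flag an IBC red when its value lies more
# than 1.5 SD above the population mean, green when more than 1.5 SD below,
# and hide it otherwise.

def flag_ibc(value, population, k=1.5):
    """Return 'red', 'green', or None for one IBC value vs. its population."""
    mu, sigma = mean(population), stdev(population)
    if value > mu + k * sigma:
        return "red"
    if value < mu - k * sigma:
        return "green"
    return None  # within normal range: not displayed
```

Only the red and green flags reach the overlay, which is what keeps the on-screen annotation count low enough for a clinician to absorb.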

Human subjects study: Ten clinicians (IRB-approved, RU DGA-0923), aged 26 to 64 years, evaluated the app and scored it an average of 2.3 out of 4 for clinical utility. Scores ranged from 2 to 4, with younger clinicians rating it more favorably (linear regression: score = -0.037 x Age + 4.17, R-squared = 0.41). The diagnostic score itself (the numerical Eclass/CNN probability) was the least favorite feature among clinicians. This finding is telling: dermatologists preferred the visual biomarker overlays to a simple risk number, confirming the paper's thesis that interpretability matters more than a summary statistic.

Bandwidth mismatch: A key insight from the human subjects research was that dermatologists have a "very small bandwidth" to process IBCs compared to the analytical capacity of the computer. The algorithm can evaluate 38 features simultaneously, but clinicians can only absorb a handful during a real-time screening encounter. This informed the design decision to show only the most extreme biomarker values. The app workflow proceeds from dermoscopy image to IBC overlay to Eclass/CNN scores, and finally reveals the gold-standard biopsy diagnosis, simulating a clinical decision pipeline.

The clinical workflow envisioned by the authors spans the full expertise hierarchy. At high-specificity operating points, a patient-facing version of the app could help individuals determine whether a lesion requires professional evaluation. At high-sensitivity operating points, the app could help trained dermoscopists ensure they are not missing subtle or atypical lesions. This tiered deployment model addresses the workforce bottleneck: general practitioners (ratio 1:379 in the US) could screen with expert-level precision if the technology reliably translates pattern recognition from top dermoscopists down the expertise hierarchy.

TL;DR: The companion app shows only extreme-value IBCs (greater than 1.5 SD from mean) in red (malignant) or green (benign) to avoid overwhelming clinicians. Ten clinicians rated it 2.3/4 for utility, with younger clinicians scoring it higher (R-squared = 0.41). Dermatologists preferred visual biomarker overlays over raw numerical scores.
Pages 12-13
Spectral Properties of IBCs: Why Blue, Green, and Red Channels Tell Different Stories

One of the more biologically grounded findings in this paper is that imaging biomarkers exhibit spectrally variant diagnostic significance. Across the validation datasets, the number of statistically significant IBCs was consistent (35, 38, and 35 across the three sets), though which color channel dominated shifted with the imaging system and patient population. In the Barcelona cohort, blue-channel IBCs were the most numerous (18 of 38), while the originally published New York cohort showed a different distribution. The IBC that was most significant in the blue channel in the original publication (B1) shifted to greater significance in the green channel in the Barcelona data.

Biological basis for spectral dependence: The authors propose a depth-dependent mechanism. Shorter wavelengths (blue light) penetrate only the superficial epidermis and are therefore sensitive to basal layer atypia and junctional nests of melanocytes, both hallmarks of melanoma. Longer wavelengths (red light) penetrate deeper into the dermis and can visualize the three-dimensional tissue characteristics of invading melanoma cells. Green-channel features capture polymorphic vasculature and metabolic activity through hemoglobin saturation and desaturation contrast, which is relevant because active tumor growth alters local blood supply. These spectral relationships are consistent across both melanoma and non-melanoma skin cancers (basal cell carcinoma and squamous cell carcinoma).

Clinical relevance of specific IBCs: The "steeper edge slope" biomarker (R5), which measures how sharply the lesion border transitions from dark interior pixels to bright surrounding skin, was more pronounced in melanomas. The authors hypothesize this reflects melanocyte nest growth at the dermal-epidermal junction at the edge of a melanoma, whereas atypical nevi tend to have individually dispersed junctional melanocytes that taper off more gradually at the border. This is an example of how an interpretable IBC can potentially "teach back" to dermatologists, revealing visual patterns that may not be obvious during manual examination.

The authors note that this spectral analysis motivates their ongoing hyperspectral imaging research, which would expand from three RGB channels to dozens of narrow spectral bands. Within a hyperspectral image, one could include direct measures of hemoglobin oxygenation, potentially identifying metabolically active tumor regions with greater precision than standard dermoscopy RGB analysis.

TL;DR: Blue-channel IBCs dominated (18/38) because short wavelengths image superficial melanocyte nests. Red-channel IBCs capture deeper dermal invasion. Green-channel features reflect vascular and metabolic changes. The consistency of significant IBC counts across datasets (35, 38, 35) suggests the biomarker framework generalizes, even as channel-specific rankings shift.
Pages 13-14
Image Exclusions, Population Retraining, and the Small Dataset Constraint

High exclusion rate: Of the original 668 images, 319 (47.8%) were discarded because one or more IBCs failed to compute. The Eclass method requires images that show the complete lesion with full borders and some adjacent normal skin, and cannot process images with hair, surgical ink markings, air bubbles in immersion media, or lesions extending beyond the field of view. Nodular, ulcerated, or extremely atypical presentations were also excluded. The CNN, by contrast, could still analyze these "defective" images because it operates on raw pixels without requiring successful feature extraction. The authors acknowledge this exclusion rate must be reduced to below 10% for practical clinical use, though they note that the clinical "when in doubt, cut it out" rule partially mitigates the impact.

Population-specific retraining: The Eclass model required retraining for each population. The New York and Barcelona cohorts differed in imaging systems, patient populations, and skin types. While the number of significant IBCs was similar across datasets, the specific channel rankings shifted, meaning a model trained on one population may not generalize directly to another. This limits deployment in diverse clinical settings without population-specific calibration.

Small dataset and underparameterization: The dataset of 349 lesions is small by modern deep learning standards. The authors frame this as a strength of Eclass: the model had roughly 30 free parameters (the 38 IBCs reduced through selection) versus approximately 20 million for the ResNet-50 CNN, placing Eclass in the "underparameterized classical regime" where the training data (349 images) exceeded 10 times the parameter count. The CNN, however, was severely data-starved relative to its parameter count. The AUROC of 0.71 is lower than the 0.91 published by Esteva et al. using approximately 130,000 images, and head-to-head comparisons on larger datasets are needed to determine whether Eclass maintains its advantage as data availability scales.

Ethical and representation concerns: The authors briefly note that training data does not always represent all skin types, raising ethical concerns about diagnostic equity. This is particularly relevant for melanoma detection, where lesion appearance can vary significantly with skin pigmentation, and underrepresentation of darker skin tones in dermatology training datasets is a well-documented problem.

TL;DR: 47.8% of images (319/668) were excluded due to IBC computation failures (hair, incomplete borders, atypical presentations). The model requires population-specific retraining (New York vs. Barcelona). The dataset of 349 lesions is small, and the 0.71 AUROC falls short of the 0.91 achieved by deep learning on 130,000 images. Skin type representation in training data remains an ethical concern.
Pages 13-15
Toward Hybrid Models, Hyperspectral Imaging, and Collaborative IBC Libraries

Automatic image defect correction: The most immediate priority is developing imaging biomarkers that can recognize and correct for image defects (hair, bubbles, incomplete borders) to reduce the current 47.8% exclusion rate. If the algorithm could automatically flag and compensate for defective regions, more IBCs would compute successfully, and the system could provide diagnostic confidence that all relevant features in a lesion have been analyzed. This could involve either traditional image processing or targeted deep learning for preprocessing.

Hybrid CNN-Eclass models: The authors propose incorporating the CNN risk score as an additional imaging biomarker within the Eclass ensemble, or even adding the CNN as a 13th member of the classifier collection. The computational cost would increase, but the deep learning component could contribute an additional capability: retrieving visually similar lesions with known diagnoses from a reference database, allowing clinicians to confirm visual similarity and infer probable diagnosis by analogy. This hybrid approach would combine the interpretability of Eclass with the pattern recognition power of deep learning.

Collaborative IBC repository: The authors call for a shared repository of executable MATLAB functions for computing imaging biomarkers. Different research groups could contribute complementary IBC sets, expanding the feature library beyond what any single lab could develop. More IBCs are needed to cover the full spectrum of clinical presentations, and new biomarkers should be explicitly related to underlying tissue structure, including proliferative and invasion patterns of melanoma cells and molecular pathways impacting pigment distribution.

Decision tree translation to visual screening: Perhaps the most intriguing direction is the potential for the C5.0 decision tree to "teach back" to dermatologists. Because the tree uses branching logic based on specific, visualizable IBCs, it could reveal new dermoscopic features and new ways to combine visual evaluations sequentially. The authors suggest that the full 12-algorithm ensemble might be reduced to just 1 or 2 classifiers (such as C5.0) for clinical translation, simplifying both computation and interpretation. This opens the possibility of automated, unconstrained visual-aided screening that could eventually work without computer vision, with dermatologists internalizing the IBC decision logic.

Hyperspectral imaging: The team's ongoing hyperspectral imaging research aims to move beyond three-channel RGB analysis to dozens of narrow spectral bands. This would enable direct measurement of hemoglobin saturation states within lesions, potentially identifying metabolically active regions with higher precision and further enriching the IBC library with biologically grounded spectral features.

TL;DR: Key future directions include automatic image defect correction to reduce the 47.8% exclusion rate, hybrid CNN-Eclass models that combine interpretability with deep learning power, a shared MATLAB IBC repository for collaborative development, C5.0 decision tree "teach-back" to train dermatologists on new visual patterns, and hyperspectral imaging to expand beyond RGB into dozens of spectral bands.
Citation: Gareau DS, Browning J, Correa Da Rosa J, et al. Deep learning-level melanoma detection by interpretable machine learning and imaging biomarker analysis. Journal of Biomedical Optics, 2020 (open access, CC BY). DOI: 10.1117/1.JBO.25.11.112906. PMC7702097.