Melanoma is the fastest-growing and most lethal form of skin cancer, with mortality rates climbing steadily each year. Early detection dramatically improves survival, but distinguishing melanoma lesions from benign skin features is extremely difficult, even for experienced dermatologists. Differences of opinion among clinicians are common, which has driven demand for automated computer-aided diagnosis systems that can deliver both accurate and rapid assessments from dermoscopy images.
This paper, published in Diagnostics in 2023 by researchers at Jouf University (Saudi Arabia) and Bolu Abant Izzet Baysal University (Turkey), proposes a multi-task learning network that handles both segmentation (identifying the lesion boundary) and classification (determining whether the lesion is melanoma or not) in a single pipeline. The key innovation is linking these two tasks so that the output of segmentation feeds directly into classification, rather than treating them as independent problems.
Prior work and gaps: Previous approaches typically tackled segmentation or classification in isolation. For example, Seeja and Suresh (2019) combined UNet segmentation with SVM classification but achieved only 85.19% classification accuracy. Ding et al. (2022) reached 90.9% accuracy on ISIC 2017 using a two-stage deep neural network. Jojoa Acosta et al. (2021) paired Mask R-CNN with ResNet152 for 90.4% accuracy. While Malibari et al. (2022) claimed 99% accuracy on ISIC 2019, most prior systems suffered from resolution loss when cropping lesions and did not jointly optimize both tasks.
The proposed pipeline: The system works in sequential stages. First, a pre-processing step removes hair artifacts and enhances image contrast. Second, a VGGNet-based FCNLayer architecture segments the lesion region. Third, cropped lesions are upscaled using a Very Deep Super-Resolution (VDSR) neural network to prevent quality loss. Finally, a classifier built on three combined pre-trained CNNs (DenseNet201, GoogleNet, MobileNetv2) performs melanoma classification. The authors tested this pipeline on the publicly available HAM10000 dataset from ISIC.
Dermoscopy images frequently contain hair artifacts that obscure lesion boundaries and confuse automated systems. The authors addressed this with a dedicated pre-processing pipeline applied before any segmentation or classification. This step is critical because hair strands can create false edges that cause segmentation models to misidentify lesion boundaries, directly degrading downstream classification accuracy.
Maximum pooling filter: The first operation applies a max pooling filter to the dermoscopy image. This filter slides across the image and retains the maximum pixel intensity value within each local neighborhood, effectively suppressing thin, dark hair strands by replacing them with the dominant surrounding skin tone. The pooling window size was selected to target hair-width artifacts without blurring the lesion edges.
Contrast and sharpening filters: After hair removal, the pipeline applies contrast enhancement to improve the visual distinction between the lesion region and the surrounding healthy skin. A sharpening filter then restores edge definition that may have been softened by the pooling step. Together, these three operations (max pooling, contrast adjustment, and sharpening) produce enhanced images where the lesion boundary is more clearly delineated.
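The three-step enhancement can be sketched with standard array filters. This is a minimal illustration, not the authors' implementation: the pooling window size, the contrast-stretch range, and the unsharp-mask strength below are assumed values (the paper does not report exact filter parameters), and the image is treated as single-channel grayscale for simplicity.

```python
import numpy as np
from scipy.ndimage import maximum_filter, gaussian_filter

def preprocess(gray, pool_size=5, sharpen_amount=1.0):
    """Hair suppression + contrast stretch + unsharp sharpening.

    pool_size and sharpen_amount are illustrative, not the paper's
    settings.
    """
    img = gray.astype(np.float64)

    # 1) Max pooling filter: a thin dark hair strand is replaced by the
    #    brighter surrounding skin intensity within each local window,
    #    while regions wider than the window (the lesion) survive.
    img = maximum_filter(img, size=pool_size)

    # 2) Contrast stretch to the full [0, 255] range.
    lo, hi = img.min(), img.max()
    img = (img - lo) / max(hi - lo, 1e-8) * 255.0

    # 3) Unsharp masking: add back high-frequency edge detail softened
    #    by the pooling step.
    blurred = gaussian_filter(img, sigma=1.0)
    img = img + sharpen_amount * (img - blurred)
    return np.clip(img, 0, 255).astype(np.uint8)
```

On a synthetic image with a one-pixel-wide dark "hair" line over bright skin, the line is absorbed into the skin tone, while a thick dark blob (standing in for the lesion) keeps its dark interior.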
The authors note that this pre-processing pipeline is simpler than some alternatives (such as deep learning-based hair removal networks like the one proposed by Li et al., 2021) but proved effective enough to support high segmentation performance. The enhanced images serve as the direct input to the VGGNet-based FCNLayer segmentation architecture.
The segmentation stage uses a Fully Convolutional Network (FCN) architecture built on VGGNet (specifically the pre-trained VGG16 backbone). FCNs are designed for pixel-level prediction, meaning they classify every pixel in the image as either "lesion" or "non-lesion." The authors evaluated three variants of this architecture: FCN-32s, FCN-16s, and FCN-8s, each differing in how they upsample predictions back to the original image resolution.
FCN-32s: The simplest variant, which upsamples the final prediction map directly by a factor of 32 in a single step. This approach is fast but can produce coarse boundaries.

FCN-16s: Combines predictions from the final layer with those from the fourth pooling layer (via a 1 x 1 convolution), then upsamples by 16. The fusion of two feature levels provides finer boundary detail.

FCN-8s: Further adds predictions from the third pooling layer, combining three levels of features before upsampling by 8. In theory, this should produce the sharpest boundaries, but it also introduces more noise.
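The shape arithmetic behind the three variants can be sketched with single-channel score maps. This is a toy illustration, not the paper's network: nearest-neighbour upsampling via `np.kron` stands in for the learned transposed convolutions, and the 1 x 1 scoring convolutions are assumed to have already produced the pool3/pool4 score maps.

```python
import numpy as np

def upsample(scores, factor):
    """Nearest-neighbour upsampling stand-in for the learned
    transposed convolutions used in real FCNs."""
    return np.kron(scores, np.ones((factor, factor)))

def fcn32s(final_scores):
    # Single 32x upsampling of the coarsest prediction map.
    return upsample(final_scores, 32)

def fcn16s(final_scores, pool4_scores):
    # Fuse stride-32 predictions with stride-16 pool4 predictions,
    # then upsample the sum by 16.
    fused = upsample(final_scores, 2) + pool4_scores
    return upsample(fused, 16)

def fcn8s(final_scores, pool4_scores, pool3_scores):
    # Add a third, stride-8 feature level from pool3 before the
    # final 8x upsampling.
    fused = upsample(upsample(final_scores, 2) + pool4_scores, 2) + pool3_scores
    return upsample(fused, 8)
```

For a 224 x 224 input, VGG16 yields a 7 x 7 final score map, a 14 x 14 pool4 map, and a 28 x 28 pool3 map; all three variants recover a 224 x 224 pixel-level prediction.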
Training setup: All three models were trained with an epoch size of 200, a batch size of 1, and the Adam optimizer. The training used the pre-trained VGG16 weights as a starting point (transfer learning), which significantly reduces the amount of dermoscopy-specific training data needed to achieve good performance.
Surprisingly, FCN-16s outperformed FCN-8s despite the latter having access to more feature levels. FCN-16s achieved 96.99% accuracy, 97.65% precision, and 98.41% sensitivity at the pixel level. FCN-8s reached only 93.61% accuracy and 92.59% precision, though its sensitivity was slightly higher at 98.99%. The authors attribute this to FCN-8s detecting non-lesion regions as lesions (higher false positive rate), which was confirmed by visual inspection of the prediction maps.
Once the lesion region is segmented, the next step is to crop it from the original image for classification. However, cropped lesions vary widely in size (some as small as 48 x 64 pixels), while all three classifier backbones (DenseNet201, GoogleNet, and MobileNetv2) require fixed 224 x 224 inputs. Simply resizing with standard bilinear interpolation introduces blurring and loss of fine diagnostic features such as pigment network patterns and lesion border irregularities.
Very Deep Super-Resolution (VDSR): To address this, the authors employed the VDSR neural network architecture. VDSR uses a cascade of convolutional layers (each 3 x 3 x 64 filters) with a skip connection that adds the input directly to the output. The input patch size is 41 x 41 pixels. The skip connection is key: it allows the network to learn only the residual (the difference between the low-resolution and high-resolution versions), which is easier to optimize than learning the full high-resolution output from scratch.
Combining low-level and high-level features: The VDSR architecture fuses shallow features (edges, textures) with deeper semantic features through the skip link. This ensures that fine-grained details lost during resizing are re-inserted into the upscaled image. The authors demonstrate this visually with an example where a 48 x 64 lesion image was upscaled to 224 x 224 using both bilinear interpolation and the VDSR approach. The VDSR output showed noticeably sharper lesion boundaries and better-preserved internal structures.
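The inference pattern described above, interpolate first, then add a learned residual through the skip connection, can be sketched as follows. This is a schematic, not VDSR itself: `residual_net` is a hypothetical callable standing in for the cascade of 3 x 3 x 64 convolutional layers, and a simple nearest-neighbour resize stands in for the interpolation step.

```python
import numpy as np

def coarse_upscale(img, out_h, out_w):
    """Nearest-neighbour resize standing in for the interpolation
    VDSR applies before its convolutional layers."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

def vdsr_forward(lr_img, residual_net, out_h=224, out_w=224):
    """VDSR-style inference: the network predicts only the residual
    (high-frequency detail), which the skip connection adds back to
    the interpolated input."""
    coarse = coarse_upscale(lr_img, out_h, out_w)
    residual = residual_net(coarse)   # learns HR - interpolated(LR)
    return coarse + residual          # skip connection: input + residual
```

With a residual network that outputs all zeros, the pipeline degenerates to plain interpolation, which makes the role of the skip connection explicit: the convolutional cascade only has to recover the detail the interpolation lost.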
This super-resolution step is a distinctive feature of the proposed pipeline. Most prior melanoma classification systems simply resize cropped lesions using interpolation, accepting the resolution loss. By inserting VDSR between segmentation and classification, the authors ensure the classifier receives higher-quality inputs, which they argue directly contributes to the improved classification accuracy of 97.73%.
The classification stage combines three pre-trained convolutional neural network architectures, each with fundamentally different internal structures. Rather than training a CNN from scratch on dermoscopy images (which would require far more data), the authors leveraged transfer learning, using weights pre-trained on ImageNet and fine-tuning them for melanoma detection.
DenseNet201: A densely connected architecture where every layer receives feature maps from all preceding layers as input. This dense connectivity enables feature reuse across the network, improves gradient flow during training, and reduces the total number of parameters needed. DenseNet201 uses transition layers with pooling and bottleneck operations to manage computational complexity. As a standalone classifier, it achieved 95.51% accuracy, 97.05% specificity, 97.02% precision, and 94.01% sensitivity.
GoogleNet: Built around inception modules that apply multiple convolution filter sizes (1 x 1, 3 x 3, and 5 x 5) in parallel, then concatenate the results. This multi-scale approach captures features at different spatial resolutions within a single layer. GoogleNet uses average pooling at the end instead of fully connected layers, reducing overfitting risk. Alone, it produced 93.07% accuracy, 98.01% specificity, 97.85% precision, and 88.24% sensitivity.

MobileNetv2: Designed for efficiency using depthwise separable convolutions (DSC) and inverted residual blocks with linear bottlenecks. It separates the standard convolution into a depthwise step (per-channel) and a pointwise step (1 x 1 combining channels), dramatically reducing parameters. On its own, it achieved 95.06% accuracy, 97.67% specificity, 97.60% precision, and 92.51% sensitivity.
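The parameter savings from depthwise separable convolutions follow from simple counting. A quick sketch (biases ignored; the 3 x 3 kernel and 64-to-128-channel case are illustrative numbers, not taken from MobileNetv2's actual layer shapes):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    """Depthwise separable convolution: one k x k filter per input
    channel, then 1 x 1 pointwise filters mixing channels."""
    depthwise = k * k * c_in      # per-channel spatial filtering
    pointwise = c_in * c_out      # 1 x 1 channel combination
    return depthwise + pointwise

# 3x3 conv, 64 -> 128 channels:
# standard: 3*3*64*128 = 73,728 weights
# DSC:      3*3*64 + 64*128 = 8,768 weights (roughly 8.4x fewer)
```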
Feature fusion strategy: From each architecture's fully connected layer, 1,000 deep features were extracted per image. These three 1,000-dimensional feature vectors were then combined using a global average pooling layer, producing a single 1,000-dimensional representation per image. This fused vector was passed through a feature layer, two fully connected layers with ReLU activation, and a final softmax layer for binary classification (melanoma vs. non-melanoma). The training used stochastic gradient descent with momentum (SGDM), 100 epochs, and a batch size of 32.
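One plausible reading of this fusion step, collapsing the three 1,000-dimensional vectors into one by element-wise averaging, can be sketched as follows. The paper names a global average pooling layer but does not spell out its exact wiring, so treat this as an interpretation rather than the authors' implementation.

```python
import numpy as np

def fuse_features(dense_f, goog_f, mobile_f):
    """Element-wise average of the three per-image 1,000-d deep
    feature vectors (one interpretation of the paper's global
    average pooling fusion).

    Each input has shape (n_images, 1000); so does the output.
    """
    stacked = np.stack([dense_f, goog_f, mobile_f], axis=0)  # (3, N, 1000)
    return stacked.mean(axis=0)                              # (N, 1000)
```

The fused vector would then pass through the fully connected layers and softmax described above; averaging (rather than concatenating to 3,000 dimensions) is what keeps the representation at 1,000 dimensions as the paper states.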
All experiments used the HAM10000 dataset, a widely used, publicly available collection of 10,015 dermoscopic images from the International Skin Imaging Collaboration (ISIC). The full dataset spans seven diagnostic classes: benign keratosis, melanoma, basal cell carcinoma, vascular lesion, dermatofibroma, melanocytic nevi, and actinic keratosis. For this study, the authors simplified the task to binary classification: 1,113 melanoma images versus 8,902 non-melanoma images.
Class imbalance problem: The roughly 8:1 ratio of non-melanoma to melanoma images poses a serious risk of majority-class bias, where a model could achieve high overall accuracy simply by predicting "non-melanoma" for every input. To address this, the authors applied data augmentation specifically to the melanoma class using rotation, flipping, contrast adjustment, and brightness modification. This increased the melanoma image count from 1,113 to 8,904, yielding a balanced dataset of 17,806 total images.
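The arithmetic works out if each original melanoma image contributes seven augmented variants (1,113 x 8 = 8,904). A sketch of the transform families the authors name, with illustrative parameter values (rotation angles, brightness offsets) that are assumptions, not the paper's settings:

```python
import numpy as np

def augment(img):
    """Label-preserving variants of one lesion image: rotations,
    flips, and brightness shifts (contrast adjustment omitted for
    brevity).  Exact parameters are illustrative."""
    as_i16 = img.astype(np.int16)
    return [
        np.rot90(img, 1),                                   # 90-degree rotation
        np.rot90(img, 2),                                   # 180-degree rotation
        np.rot90(img, 3),                                   # 270-degree rotation
        np.fliplr(img),                                     # horizontal flip
        np.flipud(img),                                     # vertical flip
        np.clip(as_i16 + 20, 0, 255).astype(np.uint8),      # brighter
        np.clip(as_i16 - 20, 0, 255).astype(np.uint8),      # darker
    ]

# Each original plus its 7 variants: 1,113 * 8 = 8,904 melanoma images.
```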
Train/test split protocol: The data was split 80% for training and 20% for testing, performed once on the raw (pre-augmentation) data to prevent data leakage. Augmentation was then applied separately to the training and test subsets. This is an important methodological detail because applying augmentation before splitting would allow augmented copies of the same original image to appear in both sets, artificially inflating performance metrics.
Performance was evaluated using a confusion matrix and derived metrics: accuracy, precision (positive predictive value), sensitivity (recall/true positive rate), and specificity (true negative rate). For segmentation, these metrics were calculated at the pixel level across all test images. For classification, they were calculated at the image level.
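All four metrics derive directly from the confusion-matrix cells (true/false positives and negatives). A minimal implementation of the standard definitions, with arbitrary example counts rather than the paper's actual confusion matrix:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, and specificity from the
    four confusion-matrix cells."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),   # positive predictive value
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Example with made-up counts: metrics(tp=90, fp=10, tn=95, fn=5)
# gives accuracy 0.925 and precision 0.90.
```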
The authors tested all possible combinations of the three CNN architectures. Two-model combinations showed progressive improvement: DenseNet + GoogleNet achieved 95.84% accuracy, GoogleNet + MobileNet reached 96.35%, and MobileNet + DenseNet hit 97.16%. The full triple combination (DenseNet + GoogleNet + MobileNet) achieved the best results across all metrics: 97.73% accuracy, 99.83% specificity, 99.83% precision, and 95.67% sensitivity.
Segmentation results in context: The VGGNet-FCN16s segmentation model achieved 96.99% pixel-level accuracy, 97.65% precision, and 98.41% sensitivity. These results exceeded those of the FCN-32s variant (96.11% accuracy, 96.71% precision, 98.17% sensitivity) and the FCN-8s variant (93.61% accuracy, 92.59% precision, 98.99% sensitivity). The high sensitivity across all variants indicates that the models rarely missed actual lesion pixels, which is clinically desirable since missing melanoma tissue is more dangerous than including some healthy skin.
Comparison with prior work: When benchmarked against previous studies on the same HAM10000 dataset, the proposed approach outperformed the field. Alam et al. (2022) achieved 91% classification accuracy. Srinivasu et al. (2021) reached 90.21% with MobileNet V2 and LSTM. Dhivyaa et al. (2020) reported 97.3% accuracy using decision trees and random forests, but without segmentation. Bibi et al. (2022) achieved 96.7% accuracy. For studies that performed both segmentation and classification, Khan et al. (2021) reached 92.69% segmentation and 90.67% classification accuracy in one study, and 92.25% and 88.39% in another. The proposed model's combined 96.99% segmentation and 97.73% classification accuracy represents a clear improvement over these multi-task baselines.
The ROC curve for the proposed classifier demonstrated strong discriminative ability, with the area under the curve visually approaching 1.0. The confusion matrix showed that the model's errors were predominantly false negatives (melanoma cases classified as non-melanoma), consistent with the 95.67% sensitivity being the lowest of the four metrics. This pattern suggests the model is slightly conservative, which is a trade-off worth examining in clinical deployment where missing melanoma carries severe consequences.
Single-dataset evaluation: All experiments were conducted exclusively on the HAM10000 dataset. While this is a standard benchmark, it represents a specific image acquisition protocol, patient demographics, and lesion distribution. The model's performance on images from different dermatoscopes, lighting conditions, or patient populations remains unknown. Cross-dataset validation (for example, testing on ISIC 2017 or PH2 datasets) would be necessary to establish generalizability.
Binary classification only: The study reduced the seven-class HAM10000 dataset to a binary melanoma vs. non-melanoma task. In clinical practice, distinguishing between all seven lesion types (including basal cell carcinoma, which also requires treatment) is essential. A multi-class extension of the proposed approach would be more clinically relevant but also substantially more challenging, particularly given the severe class imbalance in the original dataset (melanocytic nevi alone account for over 6,700 of the 10,015 images).
Computational complexity: The pipeline uses three separate deep learning models for classification (DenseNet201, GoogleNet, MobileNetv2) plus a VGG16-based segmentation model and a VDSR super-resolution network. Running five neural networks in sequence introduces significant computational overhead that could limit real-time clinical deployment. The authors do not report inference time or hardware requirements, which are critical for practical implementation in dermatology clinics.
Data augmentation concerns: While augmentation balanced the class distribution, the melanoma class was expanded roughly 8-fold from 1,113 to 8,904 images. Heavy augmentation can lead to the model overfitting to augmented variants of a relatively small set of original melanoma images. The single 80/20 split (rather than k-fold cross-validation) makes it harder to assess the robustness of the reported metrics.
Future directions: The authors plan to optimize the hyperparameters and design variables that most influence the approach's performance. They also intend to investigate transformer architectures (such as Vision Transformers), which have shown strong results in medical image analysis and could potentially replace or augment the CNN-based components. Adapting the proposed multi-task framework to incorporate transformers for either the segmentation or classification stage represents a natural evolution of this work.