DenseNet-II: an improved deep convolutional neural network for melanoma cancer detection

Computers in Biology and Medicine, 2023

Plain-English Explanations
Pages 1-2
Why Melanoma Detection Needs Better Deep Learning Models

Melanoma is one of the deadliest forms of skin cancer, driven by uncontrolled growth of melanocytic cells. Average skin exposure to UV radiation has risen by 53%, and melanoma accounts for roughly 6% of all reported cancer cases worldwide. In 2018, approximately 9.5 million people died of cancer globally, with an estimated 1,806,590 new cancer cases expected in the United States in 2020 alone. These numbers underscore the urgency of developing automated, accurate detection tools that can flag melanoma at early stages, when treatment outcomes are most favorable.

The visual nature of melanoma: Unlike many other cancers, melanoma often presents visible changes early on, such as newly appeared moles with irregular borders or sudden shifts in skin color. This visual signature makes melanoma uniquely suited to image-based classification using convolutional neural networks (CNNs). However, melanoma also bears a striking visual similarity to other benign skin conditions, making it difficult for both patients and clinicians to distinguish cancerous from non-cancerous lesions without computational assistance.

Limitations of prior work: Earlier machine learning models for melanoma detection were typically trained on high-quality, professionally captured images taken at correct angles and lighting. When deployed in the real world, these models received images taken by patients using mobile phone cameras in poorly lit environments with inconsistent zoom and angles. The authors argue that data-driven approaches, using large and varied training datasets, can overcome this quality gap and produce models that generalize to real-world conditions.

The authors propose DenseNet-II, an enhanced deep learning CNN framework that customizes the number of layers, activation functions, and input dimensions to improve melanoma detection accuracy. Unlike most prior studies that limited classification to two categories (malignant or benign), this work classifies seven distinct lesion types using the HAM10000 dataset, which contains approximately 10,000 dermatological images.

TL;DR: Melanoma accounts for 6% of cancer cases, and UV exposure has risen 53%. DenseNet-II is a custom CNN designed to classify 7 lesion types (not just malignant vs. benign) on the HAM10000 dataset of ~10,000 images, addressing real-world image quality variability.
Pages 3-4
Machine Learning and Deep Learning Baselines for Skin Cancer

The authors survey prior work across two broad categories: traditional machine learning approaches and deep learning approaches. On the ML side, Cruz and Wishart (2006) conducted an early survey identifying three fundamental foci of cancer prognosis: prediction of cancer susceptibility, survivability, and recurrence. Their framework laid the groundwork for subsequent classification-based approaches. Other notable ML contributions include Ontiveros-Robles et al. (2021), who built a supervised fuzzy classifier using Type-1 membership functions, statistical quartiles, and nature-inspired optimizations to handle data uncertainties and noise.

Classical ML methods: Vijayalakshmi (2019) proposed an automated system for classifying malignant and benign skin lesions using a three-phase pipeline: image augmentation, model design, and final classification. The work also addressed real-world image noise by using MATLAB filters to remove hair, shades, and glare during pre-processing. Richter and Khoshgoftaar (2019) tackled the absence of well-structured data by proposing a distributed, cloud-based big-data solution for structured collection of data points across multiple regions, feeding them into a melanoma risk prediction model.

Deep learning advances: On the deep learning side, Esteva et al. (2017) used a data-driven approach with 129,450 dermatological images organized in a tree taxonomy of 2,032 diseases with three root nodes classifying lesions as benign, malignant, or non-neoplastic. Guo et al. (2020) proposed a 3D CNN enhanced with pyramidal attribute structures for contextual learning. Poma et al. (2020) optimized conventional CNNs by reducing the number of trainable parameters. Neethu et al. (2020) demonstrated CNN superiority over naive Bayes, SVM, Markov models, and KNN, achieving 96.2% accuracy in their domain.

Clinical deep learning for melanoma: Kammer et al. (2017) developed two deep convolutional neural network (DCNN) classifiers, a segmentation classifier and a responsive classifier, for classifying whole-slide images of metastatic melanoma tissue. These were combined with clinical demographic data to produce more robust prognostic models. The authors also cite Albu et al. (2019), who described the broad applications of artificial neural networks in medicine, including skin disease diagnosis and hepatitis-B prediction, further motivating the development of DenseNet-II.

TL;DR: Prior ML work used fuzzy classifiers and cloud-based pipelines. Deep learning models like Esteva et al. (2017) trained on 129,450 images across 2,032 disease classes. DCNN classifiers on whole-slide melanoma tissue and 3D CNNs with pyramidal structures set the stage for DenseNet-II.
Pages 5-6
Architecture of the Proposed DenseNet-II Model

DenseNet-II builds on several established deep learning architectures, specifically DenseNet, VGG-16, InceptionV3, and ResNet. It extracts key features from each and combines them into a unified classifier. The model architecture consists of three main blocks of layers, each containing two 2D convolutional layers with 3x3 kernels. Each convolutional block is activated with a rectified linear unit (ReLU) and followed by a max-pooling layer.

Layer structure: The model uses four primary layer types. The Conv2D layer produces tensor outputs by convolving feature maps with varying kernel sizes. The MaxPool2D layer condenses recognized features by taking the maximum value within each window of the feature matrix. The Flatten layer compresses multi-dimensional feature maps into a single vector for downstream processing. Finally, the Dense layer applies an activation function across fully connected neurons to produce nonlinear outputs. The DenseNet function itself combines convolution and batch normalization with ReLU activation.

Output and optimization: The outputs from the convolutional blocks feed into a network of 4 dense layers activated by ReLU, with a softmax function applied in the final layer to produce multi-class predictions across all 7 lesion types. The learning rate is dynamically adjusted using the ReduceLROnPlateau() function. The model uses sparse categorical cross-entropy loss with the Adam optimizer. Performance was evaluated at 10, 15, and 20 epochs to assess how training duration affects accuracy.
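The block structure above can be traced in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the 64x64 input resolution, the "same" convolution padding, and the 16/32/64 per-block filter counts are assumptions, since this summary does not specify them.

```python
# Trace feature-map shapes through the described DenseNet-II stack:
# 3 blocks of (two 3x3 Conv2D + ReLU, then 2x2 max-pool), a Flatten
# layer, 4 dense layers, and a 7-way softmax head.

def conv2d_shape(h, w, c, filters, kernel=3, padding="same"):
    """Output shape of a stride-1 2D convolution."""
    if padding == "same":
        return h, w, filters
    return h - kernel + 1, w - kernel + 1, filters

def maxpool2d_shape(h, w, c, pool=2):
    """Output shape of a 2x2 max-pooling layer."""
    return h // pool, w // pool, c

def trace_shapes(h=64, w=64, c=3, block_filters=(16, 32, 64)):
    shapes = [(h, w, c)]
    for f in block_filters:            # three convolutional blocks
        for _ in range(2):             # two 3x3 conv layers per block
            h, w, c = conv2d_shape(h, w, c, f)
            shapes.append((h, w, c))
        h, w, c = maxpool2d_shape(h, w, c)
        shapes.append((h, w, c))
    flat = h * w * c                   # Flatten before the dense head
    return shapes, flat

shapes, flat = trace_shapes()
print(shapes[-1], flat)  # final feature map and flattened vector size
```

Under these assumptions the spatial resolution halves at each of the three pools (64 to 32 to 16 to 8), and the flattened vector feeding the dense layers has 8 x 8 x 64 = 4096 entries.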

Diagnostic scoring: The paper also references the Total Dermoscopy Score (TDS), computed using the ABCD method: (Asymmetry score x 1.3) + (Border score x 0.1) + (Color score x 0.5) + (Diameter score x 0.5). This scoring system provides a clinical baseline for lesion identification that the DenseNet-II model aims to improve upon through automated classification.
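The TDS formula above is simple to compute directly. A minimal sketch; the 0-2, 0-8, 1-6, and 1-5 category ranges in the example are the conventional clinical ones for the ABCD rule, not values taken from this paper:

```python
def total_dermoscopy_score(asymmetry, border, color, diameter):
    """ABCD rule as stated in the paper:
    TDS = 1.3*A + 0.1*B + 0.5*C + 0.5*D."""
    return 1.3 * asymmetry + 0.1 * border + 0.5 * color + 0.5 * diameter

# Maximum score in each category (conventional ranges:
# A: 0-2, B: 0-8, C: 1-6, D: 1-5):
print(total_dermoscopy_score(2, 8, 6, 5))
```

In clinical use, higher TDS values indicate a more suspicious lesion; the model aims to replace this hand-scored baseline with automated classification.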

TL;DR: DenseNet-II uses 3 convolutional blocks (each with two 2D conv layers and 3x3 kernels), max-pooling, ReLU activation, 4 dense layers, and softmax output for 7-class classification. It uses Adam optimizer, sparse categorical cross-entropy loss, and ReduceLROnPlateau for dynamic learning rate adjustment.
Pages 7-8
The HAM10000 Dataset and Handling Class Imbalance

The study uses the HAM10000 dataset, a widely used benchmark containing approximately 10,015 dermatological images across seven lesion classes: melanocytic nevi (nv), melanoma (mel), benign keratosis-like lesions (bkl), basal cell carcinoma (bcc), actinic keratoses (akiec), vascular lesions (vasc), and dermatofibroma (df). The dataset metadata includes lesion ID, image ID, mole location, patient age and gender, diagnosis type, and diagnostic method. The images vary in quality, ranging from professional DSLR captures to lower-quality mobile phone images, which is by design to simulate real-world deployment conditions.

Data pre-processing steps: The original dataset required restructuring because images were not directly linked to lesion types but instead connected through an image ID mapping. The authors added three columns to the data frame: cell type, cell type index (for classification), and the image file path. Duplicate images mapping to two or more lesion IDs were filtered out. The data was normalized by computing the mean and standard deviation of all red, green, and blue color matrices, scaling pixel values from the 0-255 range down to 0-1. The dataset was split 80/20 for training and testing, with both sets augmented using Keras.
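The normalization and split described above can be sketched in plain Python. This is an illustration, not the authors' code; in practice the per-channel mean and standard deviation would be computed from the full training set:

```python
import random

def normalize_channel(values, mean, std):
    """Scale 0-255 pixel values to 0-1, then standardize with the
    channel's mean and standard deviation (given on the 0-1 scale)."""
    return [((v / 255.0) - mean) / std for v in values]

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffled 80/20 split, as used in the paper."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Splitting the 10,015 HAM10000 image IDs 80/20:
train, test = train_test_split(list(range(10015)))
print(len(train), len(test))
```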

Addressing class imbalance: The authors found a significant class imbalance, with the melanocytic nevi (nv) class having far more samples than the other lesion types. Since multi-class classification assumes a roughly balanced class distribution, this imbalance would inflate F1 scores for the majority class while suppressing scores for minority classes. The authors addressed this through a three-step process: (1) augmenting training images using Keras ImageDataGenerator() to rebalance classes, (2) applying equalization sampling to distribute generated data across classes, and (3) implementing focal loss, a cross-entropy variant that assigns lower weight to easily classified majority-class examples.
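Step (3) can be illustrated with the per-sample focal loss formula from Lin et al. (2017). This is a sketch: the paper does not report which gamma and alpha values it used, so the defaults below are assumptions.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Per-sample focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    gamma=2.0 and alpha=0.25 are the defaults from Lin et al. (2017)."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes almost nothing, while a
# misclassified example keeps a much larger loss -- this is what keeps
# the abundant, easy majority-class samples from dominating training:
easy = focal_loss(0.95)   # well-classified example
hard = focal_loss(0.10)   # misclassified example
print(easy, hard)
```

With gamma set to 0 and alpha to 1, the formula reduces to ordinary cross-entropy, which is the sense in which it is a "cross-entropy variant."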

Demographic findings: Exploratory data analysis revealed that men above 50 years of age are significantly more likely to be diagnosed with melanoma compared to women of the same age group or younger individuals of either gender. At certain age points, men were nearly twice as vulnerable to melanoma. These demographic patterns matched established epidemiological findings, confirming the dataset's consistency and integrity.

TL;DR: HAM10000 contains 10,015 images across 7 lesion classes. Class imbalance (nv overrepresented) was addressed via Keras ImageDataGenerator augmentation, equalization sampling, and focal loss. Data was normalized (pixel values scaled to 0-1) and split 80/20 for training and testing.
Pages 9-10
Comparative Architecture Breakdown: VGG-16, DenseNet, ResNet, and InceptionV3

The authors benchmarked DenseNet-II against four established deep learning architectures, all pre-trained on large image classification tasks spanning 1,000 categories. VGG-16 uses 16-19 layers of 3x3 convolution filters reduced progressively with max-pooling, where the outputs of earlier layers feed forward into deeper layers before a final softmax prediction. The authors note VGG-16's key weaknesses: it is slow to train and assigns large weight values to its parameters.

DenseNet improves upon ResNet's residual approach by connecting each layer to every preceding layer, rather than only to the one immediately before it. Instead of adding residuals, DenseNet concatenates feature maps to improve feature reuse and learning. Each transition layer incorporates normalization, convolution, and pooling, and the model is further augmented with bottleneck layers and compression.
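The add-versus-concatenate distinction can be made concrete by tracking channel counts. A toy sketch; the base width of 64 and growth rate of 12 are illustrative values, not taken from the paper:

```python
def resnet_channels(k0, num_layers):
    """Residual blocks ADD feature maps, so the channel count is fixed."""
    return [k0] * (num_layers + 1)

def densenet_channels(k0, num_layers, growth_rate=12):
    """Dense blocks CONCATENATE every earlier layer's maps, so the input
    to layer l has k0 + l * growth_rate channels."""
    return [k0 + l * growth_rate for l in range(num_layers + 1)]

print(resnet_channels(64, 4))    # channel count stays constant
print(densenet_channels(64, 4))  # channel count grows each layer
```

This channel growth is why DenseNet needs the transition layers and compression mentioned above: without them, the concatenated feature maps would become prohibitively wide.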

ResNet was designed to solve the vanishing gradient problem in deep networks. It divides the full architecture into micro CNN units where pooling, convolution, and activation are applied independently. These micro units combine into a macro output. The key innovation is the residual function, which uses identity mapping to minimize the residual value added to outputs from previous layers, ensuring optimally mapped features across the network.

InceptionV3 computes 1x1, 3x3, and 5x5 convolutions within micro-networks, acting as a multi-level feature extractor. It uses symmetric and asymmetric networks powered by max-pooling, convolution, dropouts, and concatenation, with batch normalization and softmax loss. Its compact size of 96 MB gives it a practical advantage over ResNet and VGG for deployment scenarios. A standard CNN model using 2D convolutional layers with 16, 32, and 64 filters also served as an additional baseline.

TL;DR: DenseNet-II was benchmarked against VGG-16 (16-19 layers, slow), DenseNet (concatenated feature maps), ResNet (residual functions to avoid vanishing gradients), InceptionV3 (multi-scale feature extraction, 96 MB), and a standard CNN with 16/32/64 filters.
Pages 11-13
DenseNet-II Achieves 96.27% Accuracy, Outperforming All Baselines

The experimental results demonstrate clear superiority of DenseNet-II over all comparative models. When trained for 15 epochs, the accuracy rankings were: DenseNet-II at 96.27%, ResNet + DenseNet combined at 92.00%, DenseNet at 87.30%, DenseNet-161 at 87.12%, ResNet at 86.90%, and VGG-16 at 75.27%. The performance gap between DenseNet-II and the next best model (ResNet + DenseNet) was over 4 percentage points, a meaningful margin in multi-class skin lesion classification.

Epoch sensitivity: DenseNet-II's accuracy varied with training duration. At 10 epochs, the model achieved 92.704% accuracy (reported elsewhere as 93.8%). At 15 epochs, it reached approximately 95.7%. At 20 epochs, it peaked at 97.351%. The authors settled on 20 epochs as the final configuration since no significant improvement was observed beyond that point. The accuracy and loss curves for training and validation phases were visualized at all three epoch settings.

Precision, recall, and F1-score: At 15 epochs, DenseNet-II achieved the highest precision (96%) and recall (96%) among all models, while VGG-16 showed the lowest precision (75.09%) and recall (73.5%). The F1-score, computed as the harmonic mean of precision and recall, followed the same pattern: DenseNet-II reached 95.7% and VGG-16 fell to 74%. At 10 epochs, DenseNet-II still led with a recall of 93.7%. The confusion matrix at 20 epochs confirmed strong classification performance across all seven lesion categories.
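The harmonic-mean definition reproduces the reported numbers. For instance, VGG-16's precision (75.09%) and recall (73.5%) combine to the roughly 74% F1 quoted above:

```python
def f1_score(precision, recall):
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# VGG-16's reported precision and recall at 15 epochs:
print(round(f1_score(0.7509, 0.735), 3))
```

Note that when precision and recall are equal, as for DenseN­et-II's 96%/96%, the harmonic mean equals that common value.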

Statistical validation: The authors validated these results using the Friedman non-parametric statistical test, which ranks algorithms based on relative performance without assumptions about data distribution. The test compared all models across precision, recall, and F1-score at 10, 15, and 20 epochs. The obtained p-value was 0.01353, which is below the significance threshold of 0.05 (alpha = 0.05), allowing the null hypothesis to be rejected. This confirms a statistically significant difference in performance between the models, with DenseNet-II consistently ranked highest.
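The Friedman statistic itself is straightforward to compute. A minimal, tie-free sketch over a hypothetical score table (not the paper's actual data):

```python
def friedman_statistic(scores):
    """Friedman chi-square over a blocks-by-algorithms score table
    (higher score = better; no ties assumed in this sketch):
    chi2 = 12n / (k(k+1)) * sum_j (R_j - (k+1)/2)^2,
    where R_j is algorithm j's mean rank across the n blocks."""
    n = len(scores)        # blocks, e.g. metric/epoch combinations
    k = len(scores[0])     # algorithms compared
    mean_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # rank 1 = worst
        for rank, j in enumerate(order, start=1):
            mean_ranks[j] += rank / n
    return 12 * n / (k * (k + 1)) * sum(
        (r - (k + 1) / 2) ** 2 for r in mean_ranks)

# Three hypothetical blocks where the same ordering always holds
# (a consistent winner drives the statistic up, as in the paper):
stat = friedman_statistic([[0.75, 0.87, 0.96],
                           [0.73, 0.86, 0.96],
                           [0.74, 0.87, 0.95]])
print(stat)
```

The statistic is then compared against a chi-square distribution with k-1 degrees of freedom to obtain the p-value; a large value (consistent rankings) yields a small p, which is how the paper's p = 0.01353 rejects the null hypothesis of equal performance.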

TL;DR: DenseNet-II achieved 96.27% accuracy at 15 epochs and 97.35% at 20 epochs, beating ResNet+DenseNet (92%), DenseNet (87.3%), ResNet (86.9%), and VGG-16 (75.27%). Precision, recall, and F1-score all topped 95.7%. Friedman test p-value of 0.01353 confirmed statistical significance.
Pages 14-15
Single-Dataset Evaluation and Generalizability Concerns

Single dataset dependency: The entire evaluation relies on the HAM10000 dataset alone. While HAM10000 is a well-established benchmark with 10,015 images, training and testing on a single dataset raises concerns about generalizability. The model has not been validated on external datasets from different clinical settings, imaging devices, or patient populations. Performance on HAM10000 may not translate to real-world deployment where image acquisition conditions vary more dramatically than the dataset captures.

Limited clinical validation: The study is purely computational, with no prospective clinical trial or comparison against dermatologist performance. While the model achieves 96-97% accuracy on the test set, it remains unclear how it would perform in a clinical workflow where factors like patient history, lesion evolution over time, and dermoscopic context play significant roles in diagnosis. The absence of clinician benchmarking makes it difficult to assess whether DenseNet-II offers a genuine diagnostic advantage.

Class imbalance and augmentation artifacts: Although the authors addressed the class imbalance using augmentation (ImageDataGenerator), equalization sampling, and focal loss, heavy augmentation of minority classes can introduce artificial patterns that inflate test-set metrics. The paper does not report per-class accuracy breakdowns at the final 20-epoch configuration, making it difficult to assess whether the model performs equally well across all seven lesion types or if some rare classes still lag behind.

Architecture transparency: While the paper describes the high-level layer structure, specific details about the exact number of parameters, training time, and computational requirements are not thoroughly reported. For a model intended to be deployable on consumer devices (processing mobile phone images), these practical considerations are critical. The paper also does not compare against more recent lightweight architectures like EfficientNet or MobileNet that are specifically designed for resource-constrained environments.

TL;DR: Key limitations include single-dataset evaluation (HAM10000 only), no clinical validation against dermatologists, potential augmentation artifacts from heavy minority-class oversampling, missing per-class accuracy breakdowns, and no comparison against lightweight architectures like EfficientNet or MobileNet.
Pages 15-16
Multi-Dataset Training and Illumination-Aware Classification

The authors outline several directions for extending DenseNet-II. The primary goal is to merge multiple dermoscopic datasets and train the model on a wider, more diverse set of image inputs. This would directly address the single-dataset limitation and help assess whether the model's 96-97% accuracy holds when exposed to images from different clinical institutions, imaging devices, and patient demographics. Cross-dataset validation is an essential step before any clinical deployment can be considered.

Accuracy and efficiency improvements: The authors plan to improve both accuracy and computational efficiency by incorporating datasets beyond HAM10000, making the models more robust and flexible. This could involve training on datasets like ISIC (International Skin Imaging Collaboration) archives, which contain larger and more diverse image collections. Expanding the training data would also help validate the model's performance on underrepresented lesion types where class imbalance posed challenges in the current study.

Illumination techniques: An interesting future direction involves incorporating illumination techniques that can aid in the detection and classification of unclear images. Many real-world dermoscopic images suffer from inconsistent lighting, shadows, and reflections that can obscure lesion features. Illumination normalization or enhancement methods could pre-process images to reduce these artifacts before they enter the classification pipeline, potentially improving accuracy on the types of lower-quality images that patients are most likely to capture themselves.
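As one concrete example of such a technique, global histogram equalization stretches a dim image's intensity range. This is a standard method offered as an illustration of the idea; the authors do not specify which illumination technique they intend to use:

```python
def equalize_histogram(pixels, levels=256):
    """Global histogram equalization for a flat list of 8-bit gray values.
    Remaps intensities through the normalized cumulative histogram so a
    dim, low-contrast lesion image uses the full 0-255 range."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0           # cumulative distribution of intensities
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = min(c for c in cdf if c > 0)   # darkest occupied bin -> 0
    n = len(pixels)
    lut = [round((c - cdf_min) / max(n - cdf_min, 1) * (levels - 1))
           for c in cdf]
    return [lut[p] for p in pixels]

# A murky image crammed into the [100, 140] range is stretched out:
dim = [100, 110, 120, 130, 140] * 4
bright = equalize_histogram(dim)
print(min(bright), max(bright))
```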

Beyond what the authors explicitly state, the broader field would benefit from head-to-head comparisons between DenseNet-II and newer architectures such as Vision Transformers (ViT), EfficientNet, and attention-based models that have shown strong performance on medical imaging tasks since 2020. Prospective clinical studies comparing DenseNet-II against board-certified dermatologists on blinded image sets would also be necessary to establish clinical utility.

TL;DR: Future work includes multi-dataset training (beyond HAM10000), illumination normalization for low-quality images, and computational efficiency improvements. The field would also benefit from comparisons against Vision Transformers and prospective clinical validation against dermatologists.
Citation: Girdhar N, Sinha A, Gupta S. DenseNet-II: an improved deep convolutional neural network for melanoma cancer detection. Open Access, 2022. Available at: PMC9400005. DOI: 10.1007/s00500-022-07406-z.