Melanoma remains one of the most aggressive forms of skin cancer, with rising incidence rates worldwide. Early and accurate detection is essential for effective treatment, yet traditional diagnostic methods face significant challenges when applied to dermoscopic images. Classical feature extraction techniques, such as entropy, energy, and momentum-based methods, often perform poorly when images contain blur, noise, or visual artifacts, leading to unreliable classification results.
This paper proposes a hybrid framework called DL-AO (Deep Learning with Aquila Optimizer) that combines Convolutional Neural Networks (CNNs) for automatic feature extraction with the Aquila Optimizer (AO) for feature dimensionality reduction. CNNs use hierarchical convolutional layers to learn discriminative features from dermoscopic images, effectively filtering out noise such as blur and spot artifacts. The AO, a nature-inspired metaheuristic optimization algorithm introduced in 2021, then selectively prunes the extracted feature space to retain only the most informative features while discarding redundant ones.
Key motivation: While CNNs excel at feature extraction, the resulting high-dimensional feature spaces create computational bottlenecks, especially for real-time or resource-constrained applications such as mobile devices and edge computing platforms. By integrating the AO, the authors aim to produce a compact feature representation that maintains classification accuracy while drastically cutting computational cost.
The approach was evaluated across three benchmark datasets (ISIC 2019, ISBI 2016, and ISBI 2017) as well as the PH2 dataset from Hospital Pedro Hispano in Portugal. The authors claim improvements of 4.2% in accuracy, 6.2% in sensitivity, and 5.8% in specificity over existing state-of-the-art methods, with a 37.5% reduction in computational complexity.
The literature review covers two key domains: CNN architectures for melanoma recognition and metaheuristic algorithms for feature optimization. Several CNN architectures have been explored for dermoscopic image classification, including AlexNet, VGGNet, ResNet, and Inception, each offering different trade-offs between model complexity, accuracy, and computational efficiency. These networks use hierarchical convolutional filters to learn increasingly abstract image representations.
Dimensionality reduction techniques: When CNN feature spaces grow large, they become computationally expensive and prone to overfitting. Established methods for reducing feature dimensions include Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and auto-encoders. The AO represents a newer, biologically inspired alternative that adaptively prunes features while preserving discriminative information.
Metaheuristic landscape: The paper surveys a broad range of nature-inspired optimization algorithms that have been applied to image segmentation and feature selection. These include Artificial Bee Colony (ABC), Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Honey Bee Mating Optimization (HBMO), Bat Algorithm (BA), Bacterial Colony Optimization (BCO), Firefly Algorithm (FA), and Artificial Flora (AF). Each draws inspiration from biological behaviors, from honeybee foraging to firefly light displays. However, these methods often suffer from high computational complexity that limits their practical deployment.
The authors position the Aquila Optimizer as a more advanced alternative that can outperform traditional metaheuristic methods like PSO and GA in accuracy, while maintaining efficient search, exploration, and exploitation mechanisms for thorough feature selection.
Datasets: The study used four publicly available dermoscopic image datasets. ISIC 2019 contained 2,000 images (1,000 benign, 1,000 malignant). ISBI 2016 had 400 images (200 benign, 200 malignant). ISBI 2017 included 600 images (300 benign, 300 malignant). The PH2 dataset from Hospital Pedro Hispano in Portugal contained 100 dermoscopic images: 35 typical melanocytic nevi, 25 dysplastic nevi, and 30 melanomas, captured at 768 x 560 pixel resolution in 24-bit RGB color using the Tuebinger Mole Analyzer system.
Preprocessing: All dermoscopic images were resized to a standardized resolution, normalized to enhance contrast and reduce illumination variation, and augmented via rotation, flipping, and scaling to increase training diversity and prevent overfitting.
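The normalization and augmentation steps can be sketched in NumPy (a minimal illustration; the paper does not specify its exact transforms beyond rotation, flipping, and scaling, and rotation here is limited to 90-degree steps):

```python
import numpy as np

def normalize(image):
    """Min-max normalize pixel intensities to [0, 1], reducing
    illumination variation between images."""
    img = image.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def augment(image):
    """Generate simple augmented variants of an (H, W, C) image."""
    return [
        np.rot90(image, k=1),   # 90-degree rotation
        np.rot90(image, k=2),   # 180-degree rotation
        image[:, ::-1],         # horizontal flip
        image[::-1, :],         # vertical flip
    ]

# Toy 2x3 RGB "image" standing in for a dermoscopic sample
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
aug = augment(normalize(img))
print(len(aug))  # 4
```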
CNN architecture: The deep CNN used multiple convolutional layers followed by max-pooling layers to capture hierarchical features at different spatial scales. Batch normalization and ReLU activation functions were applied to improve convergence and mitigate the vanishing gradient problem. Training used a supervised approach with cross-entropy loss, optimized via stochastic gradient descent with the Adam optimizer and an appropriate learning rate with momentum. The implementation leveraged pretrained networks including GoogleNet, ResNet, and SqueezeNet.
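The convolution, ReLU, and max-pooling stages described above can be illustrated on a single channel with plain NumPy (a toy forward pass showing one stage of the hierarchy, not the pretrained GoogleNet/ResNet/SqueezeNet models the paper used):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit, which helps mitigate vanishing gradients."""
    return np.maximum(x, 0.0)

def maxpool(x, size=2):
    """Non-overlapping max pooling; truncates odd borders."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# One conv -> ReLU -> pool stage on a toy 6x6 "image"
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[-1.0, -1.0], [1.0, 1.0]])  # crude horizontal-edge filter
feature_map = maxpool(relu(conv2d(image, edge_kernel)))
print(feature_map.shape)  # (2, 2)
```

Stacking such stages shrinks spatial resolution while deepening the channel dimension, which is how the hierarchical, increasingly abstract features arise.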
Aquila Optimizer for feature reduction: After the CNN extracted 1,024 features from each dataset, the AO reduced them dramatically. For ISIC 2019, features dropped from 1,024 to 256 (a 75% reduction). For ISBI 2016, they went from 1,024 to 300 (70.7% reduction). For ISBI 2017, the reduction was from 1,024 to 280 (72.7% reduction). The AO used an iterative optimization process, evaluating feature subsets based on their contribution to classification accuracy and computational efficiency. The model employed k-fold cross-validation to ensure reliability and generalization.
The Aquila Optimizer models the hunting behavior of the Aquila eagle and operates through four distinct search strategies. Each feature vector in the population is represented as a binary row in a matrix, where a value of zero indicates that a feature is selected and one indicates it is not. The algorithm evaluates each feature vector using an objective function that balances the number of selected features against classification performance.
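The objective function can be sketched as a weighted trade-off between classification error and subset size (a generic wrapper-style form; the paper's exact weighting is not given, so `alpha` is an assumed parameter). Note the paper's convention that 0 marks a selected feature:

```python
import numpy as np

def fitness(mask, error_rate, alpha=0.99):
    """Objective value for a candidate feature subset (lower is better).

    mask: binary vector; per the paper's convention, 0 means the
          feature is selected and 1 means it is dropped.
    error_rate: classification error of a model evaluated on the
          selected features (assumed supplied by a wrapper evaluation).
    alpha: assumed weight favoring accuracy over compactness.
    """
    mask = np.asarray(mask)
    n_selected = np.count_nonzero(mask == 0)
    ratio = n_selected / mask.size
    return alpha * error_rate + (1.0 - alpha) * ratio

# A subset keeping 256 of 1,024 features (as reported for ISIC 2019)
mask = np.ones(1024, dtype=int)
mask[:256] = 0
print(round(fitness(mask, error_rate=0.075), 5))  # 0.07675
```

With this formulation, two subsets achieving the same error rate are ranked by how few features they keep, which is what drives the 70-75% reductions reported.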
Expanded exploration: The first strategy updates feature vectors by combining the current best-known solution with the population mean, introducing randomness to explore a broad search space. This is governed by a time-decay factor (t/T) that gradually reduces exploration as iterations progress. Focused exploration: The second strategy combines Levy flight dynamics with a spiral (contour-flight) movement toward the prey, enabling more targeted search within promising regions of the feature space.
Expanded exploitation: The third strategy drives solutions directly toward the prey without spiral dynamics, controlled by parameters alpha and delta (ranging from 0 to 0.1) that govern local search intensity. Narrowed exploitation: The fourth strategy employs a quality function (QF) to balance search strategies, with parameter G2 decreasing from 2 to 0 over the course of optimization, representing the flight slope as the algorithm tracks and closes in on optimal solutions.
The algorithm alternates between exploration (when the iteration counter t is less than or equal to two-thirds of the total iterations T) and exploitation (in the remaining iterations). Throughout the process, the best solution is continuously updated. The Levy flight function uses a scaling factor and the Gamma function to generate step sizes, enabling occasional large jumps that help the algorithm escape local optima.
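The Levy flight step described above is commonly generated with Mantegna's algorithm, which uses the Gamma function to set the step scale. A sketch, with beta = 1.5 as a typical choice (the paper's exact constants are not stated), together with the two-thirds phase-switch rule:

```python
import math
import random

def levy_step(beta=1.5, rng=random):
    """Draw one Levy-flight step via Mantegna's algorithm.

    Heavy-tailed steps yield mostly small moves with occasional
    large jumps, helping the search escape local optima.
    """
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
             ) ** (1 / beta)
    u = rng.gauss(0.0, sigma)   # numerator: wider Gaussian
    v = rng.gauss(0.0, 1.0)     # denominator: standard Gaussian
    return u / abs(v) ** (1 / beta)

def in_exploration_phase(t, T):
    """AO explores for the first two-thirds of iterations, then exploits."""
    return t <= (2 / 3) * T

random.seed(0)
steps = [levy_step() for _ in range(5)]
print(all(math.isfinite(s) for s in steps))                       # True
print(in_exploration_phase(t=40, T=100), in_exploration_phase(t=80, T=100))
```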
To isolate the contribution of the Aquila Optimizer, the authors conducted an ablation study comparing classification performance with and without the AO applied. The results consistently showed that removing the AO led to a higher-dimensional feature space, increased overfitting risk, and worse performance across all metrics.
ISIC 2019: Without AO, accuracy was 88.40%, sensitivity 89.10%, specificity 86.80%, and AUC 0.91. With AO applied, these improved to 92.50% accuracy (+4.1%), 93.20% sensitivity (+4.1%), 91.80% specificity (+5.0%), and AUC 0.96 (+0.05). ISBI 2016: Accuracy jumped from 85.70% to 90.00% (+4.3%), sensitivity from 87.30% to 91.50% (+4.2%), specificity from 84.10% to 88.60% (+4.5%), and AUC from 0.89 to 0.94 (+0.05).
ISBI 2017: Without AO, accuracy was 87.20%, rising to 91.20% with AO (+4.0%). Sensitivity improved from 88.00% to 92.10% (+4.1%), specificity from 85.50% to 89.40% (+3.9%), and AUC from 0.90 to 0.95 (+0.05). PH2 Dataset: The largest gains appeared here, with accuracy going from 89.10% to 93.00% (+3.9%), sensitivity from 90.50% to 94.40% (+3.9%), specificity from 87.80% to 91.90% (+4.1%), and AUC from 0.92 to 0.97 (+0.05).
Overall, the AO delivered a 3.9% to 4.3% increase in accuracy across the four datasets, along with a consistent AUC improvement of 0.05. Both sensitivity and specificity benefited significantly, indicating that the AO preserves the most discriminative features while stripping away noise that contributes to misclassification.
The final DL-AO model achieved strong results across all three benchmark datasets. On ISIC 2019, the model reached 97.46% sensitivity, 98.89% specificity, 98.42% accuracy, 97.91% precision, 97.68% F1-score, and 99.12% AUC-ROC. On ISBI 2016, it achieved 98.45% sensitivity, 98.24% specificity, 97.22% accuracy, 97.84% precision, 97.62% F1-score, and 98.97% AUC-ROC. On ISBI 2017, the results were 98.44% sensitivity, 98.86% specificity, 97.96% accuracy, 98.12% precision, 97.88% F1-score, and 99.03% AUC-ROC.
ISIC 2019 comparison: DL-AO was benchmarked against Al-Masni et al. (93.72% sensitivity, 95.65% specificity, 95.08% accuracy), Barata et al. (92.5% sensitivity, 76.3% specificity, 84.3% accuracy), and Xie et al. (83.3% sensitivity, 95% specificity, 94.66% accuracy). DL-AO outperformed all three, achieving 97.45% sensitivity, 97.98% specificity, and 97.87% accuracy.
ISBI 2016 comparison: Against Menegola et al. (47.6% sensitivity, 88.1% specificity, 79.2% accuracy), Vasconcelos et al. (74.6% sensitivity, 84.5% specificity, 82.5% accuracy), and Oliveira et al. (91.8% sensitivity, 96.7% specificity, 95.4% accuracy), DL-AO delivered 94.34% sensitivity, 97.99% specificity, and 97.24% accuracy. ISBI 2017 comparison: Evaluated against Bi et al. (42.7% sensitivity, 96.3% specificity, 85.8% accuracy), Li and Shen (82% sensitivity, 97.8% specificity, 93.2% accuracy), and Guo et al. (97.5% sensitivity, 88.8% specificity, 95.3% accuracy), DL-AO achieved 97.98% sensitivity, 98.02% specificity, and 98.99% accuracy.
A confusion matrix analysis on 100 test samples revealed 42 correctly classified benign cases and 50 correctly classified malignant cases. There were 3 false positives (benign classified as malignant) and 5 false negatives (malignant classified as benign), yielding an overall accuracy of 92%, precision of 94.3% on malignant cases, and recall of 90.9%.
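The reported figures follow directly from the confusion matrix; recomputing them makes the definitions explicit (malignant is taken as the positive class, as in the precision and recall quoted above):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics, treating malignant as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted malignant, how many truly are
    recall = tp / (tp + fn)      # of true malignant, how many were caught
    return accuracy, precision, recall

# Counts from the 100-sample test set: 50 malignant and 42 benign
# correctly classified, 3 false positives, 5 false negatives
acc, prec, rec = classification_metrics(tp=50, tn=42, fp=3, fn=5)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%}")
# accuracy=92.0% precision=94.3% recall=90.9%
```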
Dataset bias: The ISIC 2019, ISBI 2016, ISBI 2017, and PH2 datasets do not cover the full spectrum of melanoma types or other skin lesions. They focus on specific lesion categories, which introduces selection bias if the training data do not adequately represent the diversity encountered in real-world clinical practice. The datasets are also relatively small, with PH2 containing only 100 images and ISBI 2016 containing 400. Future studies would benefit from incorporating additional, more diverse datasets.
Overfitting risk: Despite the dimensionality reduction achieved through the Aquila Optimizer, the model may still be vulnerable to overfitting, particularly on the smaller datasets. While data augmentation and careful feature selection help mitigate this, the authors acknowledge that external validation on independent datasets is necessary to confirm generalization ability. The confusion matrix analysis, which used only 100 test samples, is too small to draw strong conclusions about real-world deployment reliability.
Computational demands: Although the AO reduces feature dimensionality by 70-75%, training large CNN models combined with metaheuristic optimization still requires significant computational resources. The paper proposes suitability for mobile and edge computing, but does not present actual deployment benchmarks on such platforms. The gap between theoretical computational savings and practical deployment remains unaddressed.
Algorithmic constraints: The performance of the AO algorithm is sensitive to its initial configuration and optimization parameters. The method may not yield optimal results in every scenario, and the authors suggest that hybrid strategies or further parameter refinement could improve robustness. Additionally, the paper does not explore how the AO compares to other modern feature selection methods beyond the traditional metaheuristics surveyed in the literature review.
The authors identify several avenues for future research. First, they suggest incorporating multimodal data sources beyond dermoscopic images to build more robust and comprehensive melanoma identification systems. Combining dermoscopy with clinical metadata, patient history, or other imaging modalities could improve diagnostic accuracy in complex cases where visual features alone are insufficient.
Advanced optimization strategies: The paper proposes exploring more sophisticated optimization algorithms or hybrid approaches that combine the strengths of multiple metaheuristic methods. This could address the AO's sensitivity to initial configuration and potentially yield better feature selection performance across diverse datasets and clinical settings. Hybrid strategies might combine the AO with other optimizers like PSO or GA to leverage their complementary search behaviors.
Reducing false negatives: The confusion matrix revealed 5 false negatives out of 100 test samples, where malignant cases were misclassified as benign. In clinical practice, this is the most dangerous type of error because it can lead to delayed treatment. The authors note that further model tuning is needed to minimize false negatives and improve early melanoma detection rates. This is particularly critical given that the recall on malignant cases was 90.9%, meaning roughly 1 in 11 melanoma cases was missed.
Finally, the framework needs validation in real-world clinical settings with larger, more diverse patient populations. Moving from benchmark datasets to prospective clinical trials would be essential to demonstrate that the reported accuracy metrics translate to actual diagnostic benefit. Deployment studies on mobile and edge computing platforms would also validate the practical computational efficiency claims made in the paper.