Breast cancer is the most frequently diagnosed malignant neoplasm in women and the second leading cause of cancer-related deaths in females worldwide. Detecting the disease at an early stage is vital for better therapeutic outcomes and improved survival rates. Imaging techniques such as mammography, MRI, and ultrasonography are fundamental tools for identifying irregularities in breast tissue. Deep learning, particularly convolutional neural networks (CNNs), has significantly advanced the analysis of these images by enabling automated identification of subtle patterns. However, significant challenges persist: limited availability of large annotated datasets, heterogeneity in medical image features, and restricted model generalizability across different imaging protocols and patient populations.
The feature extraction challenge: Radiomics analysis is a feature engineering approach that extracts quantitative characteristics from medical images, including the shape, texture, size, and intensity of tumors. These handcrafted features provide interpretable, clinically relevant information. Deep learning models, by contrast, automatically derive hierarchical, abstract representations from imaging data. Neither approach alone is sufficient: radiomics features offer reproducible, standardized metrics that reduce subjectivity, while deep features capture complex, nonlinear patterns that handcrafted descriptors may miss. The authors argue that combining these complementary approaches yields a more comprehensive feature representation for breast cancer classification.
Transfer learning to the rescue: Pre-trained deep neural networks such as VGG, ResNet, and Inception have been trained on vast generic datasets like ImageNet. Transfer learning lets these models carry their learned feature extractors into the medical imaging domain, which is critical because labeled medical imaging data is typically scarce. By fine-tuning pre-trained models on breast cancer images, researchers can leverage pattern-recognition capabilities learned from millions of natural images while adapting the models to the specific characteristics of mammographic imaging.
Gaps in existing work: Prior studies have made notable progress, with methods achieving up to 89.5% accuracy (Khourdifi et al.) and 97% accuracy (Chugh et al.) on various datasets, but they have generally treated radiomics and deep learning features in isolation. Very few studies have systematically combined both feature types into a unified multimodal feature space for classification. The authors also note that most existing work lacks rigorous comparison of feature selection methods for radiomics data, despite the high-dimensional nature of radiomics features being a known source of overfitting.
Source data: The study uses the well-known CBIS-DDSM (Curated Breast Imaging Subset of DDSM) dataset, which contains 2,620 mammography studies encompassing normal, benign, and malignant cases. The dataset includes updated ROI (region of interest) segmentation and bounding boxes, along with pathological assessments for training data. The original dataset was imbalanced, containing 1,930 benign images and only 1,354 malignant images. This class imbalance is problematic for cancer detection models because they may develop a bias toward predicting the majority class (benign), resulting in more missed malignant cases.
Image preprocessing pipeline: Before any analysis, each image was rescaled to a standardized resolution of 224 x 224 pixels to ensure uniformity and compatibility with pre-trained models (InceptionV3 used 299 x 299). Pixel intensity values were normalized by subtracting the mean and dividing by the standard deviation, reducing variations caused by lighting or contrast differences. A Gaussian filter was applied for noise reduction, minimizing background artifacts that could interfere with accurate feature extraction. Finally, anatomical alignment oriented all images to a common reference point, ensuring comparable regions were captured across the dataset.
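The pipeline above can be sketched in a few lines of numpy/scipy. This is an illustrative reconstruction, not the paper's code: the Gaussian sigma, the bilinear resizing via scipy, and the anatomical-alignment step being omitted are all my assumptions.

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def preprocess(image: np.ndarray, target=(224, 224), sigma=1.0) -> np.ndarray:
    """Rescale, z-score normalize, and denoise a single-channel mammogram.

    Minimal sketch of the described pipeline; sigma and the scipy-based
    resize are illustrative assumptions, not the paper's exact tooling.
    """
    img = image.astype(np.float32)
    # 1. Rescale to a standardized resolution (224 x 224 for most backbones).
    factors = (target[0] / img.shape[0], target[1] / img.shape[1])
    img = zoom(img, factors, order=1)          # bilinear interpolation
    # 2. Z-score intensity normalization: subtract mean, divide by std.
    img = (img - img.mean()) / (img.std() + 1e-8)
    # 3. Gaussian filter for noise reduction of background artifacts.
    img = gaussian_filter(img, sigma=sigma)
    return img

# Hypothetical 12-bit mammogram patch at an arbitrary native resolution.
raw = np.random.default_rng(0).integers(0, 4096, size=(512, 384)).astype(np.float32)
out = preprocess(raw)
print(out.shape)  # (224, 224)
```

Note that normalization precedes filtering here; the paper lists the steps in this order, though in practice the two commute up to a constant scale.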
Data augmentation strategy: To address the class imbalance, a comprehensive augmentation pipeline was applied. Transformations included random horizontal and vertical flips, random adjustments to brightness (factor of 0.3), contrast (0.8 to 1.2), saturation (0.8 to 1.2), and hue (0.02). Additional augmentations included Gaussian blur with random standard deviation (0.1 to 2.0) to simulate real-world imaging variations, random cropping followed by resizing to simulate zoom-level changes, and Gaussian noise injection (mean 0.0, standard deviation 0.05). After augmentation, the dataset grew to 8,498 images per label, totaling 16,996 images. Augmentation was applied consistently across all three image types collected from the dataset: full images, cropped images, and segmented ROI images.
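A few of these transformations can be sketched in plain numpy. The paper's augmentation framework is not specified here, so treat this as a hedged illustration; the parameter values mirror the text (brightness factor 0.3, noise sigma 0.05).

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random subset of the augmentations described above (sketch)."""
    img = image.copy()
    if rng.random() < 0.5:                        # random horizontal flip
        img = np.flip(img, axis=1)
    if rng.random() < 0.5:                        # random vertical flip
        img = np.flip(img, axis=0)
    img = img * rng.uniform(0.7, 1.3)             # brightness jitter (factor 0.3)
    img = img + rng.normal(0.0, 0.05, img.shape)  # Gaussian noise, sigma = 0.05
    return np.clip(img, 0.0, 1.0)                 # keep intensities in [0, 1]

rng = np.random.default_rng(42)
base = rng.random((224, 224))                     # stand-in normalized image
augmented = [augment(base, rng) for _ in range(4)]
```

Each call draws fresh random parameters, so one source image yields many distinct training examples, which is how the minority (malignant) class was brought up to parity.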
Data splitting: The augmented dataset was split into three subsets in an 80/10/10 ratio for training, validation, and testing. Because the data was served in mini-batches (batch sizes of 32 and 64), the final allocation was 13,600 training images (425 batches of 32), 1,700 validation images (53 batches of 32), and 1,700 testing images (53 batches of 32). The dataset was shuffled before splitting with a fixed random seed to ensure fairness and reproducibility.
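The seeded shuffle-then-slice procedure looks like the sketch below. A straight 80/10/10 slice of 16,996 images gives 13,596/1,699/1,701; the paper's reported counts (13,600/1,700/1,700) reflect rounding to whole batches, which this sketch does not reproduce.

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Shuffle sample indices with a fixed seed, then slice 80/10/10."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                     # reproducible shuffle
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (idx[:n_train],                        # training set
            idx[n_train:n_train + n_val],         # validation set
            idx[n_train + n_val:])                # test set

train, val, test = split_indices(16996, seed=0)
print(len(train), len(val), len(test))  # 13596 1699 1701
```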
What radiomics captures: Radiomics is the science of extracting large numbers of quantitative features from medical images. In this study, the PyRadiomics library was used to extract seven categories of radiomic features from segmented regions of interest (ROIs) in the mammograms. A total of 1,040 features were extracted per image. These included 198 first-order statistical features (capturing voxel intensity distributions such as percentiles, energy, and entropy), 264 gray-level co-occurrence matrix (GLCM) features, 176 gray-level run length matrix (GLRLM) features, 176 gray-level size zone matrix (GLSZM) features, 154 gray-level dependence matrix (GLDM) features, 55 neighborhood gray-tone difference matrix (NGTDM) features, 9 shape-based features characterizing lesion geometry, and 5 diagnostic metadata features.
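To make the first-order category concrete, here are numpy versions of a few of the intensity features PyRadiomics computes (energy, histogram entropy, percentiles). This is an illustrative re-implementation, not PyRadiomics itself, which applies its own binning and normalization rules.

```python
import numpy as np

def first_order_features(roi: np.ndarray, bins: int = 32) -> dict:
    """Compute a handful of first-order radiomic features from an ROI (sketch)."""
    x = roi.ravel().astype(np.float64)
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                                    # drop empty bins for log
    return {
        "energy": float(np.sum(x ** 2)),            # sum of squared intensities
        "entropy": float(-np.sum(p * np.log2(p))),  # Shannon entropy of histogram
        "p10": float(np.percentile(x, 10)),         # 10th intensity percentile
        "p90": float(np.percentile(x, 90)),         # 90th intensity percentile
    }

roi = np.random.default_rng(1).random((64, 64))     # stand-in segmented ROI
feats = first_order_features(roi)
```

The texture families (GLCM, GLRLM, GLSZM, GLDM, NGTDM) follow the same pattern but operate on matrices of gray-level co-occurrences rather than raw intensities.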
Why feature selection matters: With 1,040 features, the risk of overfitting is significant, especially given the relatively limited sample sizes typical of medical imaging studies. High-dimensional radiomics data can cause models to perform well on training data but fail on new data. Feature selection mitigates this "curse of dimensionality" by identifying the most discriminative subset of features while discarding redundant or noisy ones. The authors systematically compared a broad range of supervised feature selection techniques with varying subset sizes to find the optimal balance between predictive performance and model robustness.
Feature selection methods compared: The study evaluated multiple approaches. Recursive Feature Elimination (RFE) was applied using both random forest and logistic regression classifiers, selecting 10, 20, 50, and 100 features. RFECV (with cross-validation) automatically determined optimal feature counts: 74 for random forest and 647 for logistic regression. Univariate selection via ANOVA F-statistic (SelectKBest) was tested with 10, 20, 50, and 100 features. LASSO (LassoCV) yielded 90 and 157 selected features on fivefold cross-validation. Mutual information ranking retained the top 50, 100, and 200 features. Embedded methods using GPU-accelerated tree-based models (XGBoost, LightGBM, and CatBoost) were also explored with subset sizes of 50, 100, and 200.
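Two of the compared configurations can be sketched with scikit-learn on synthetic data standing in for the radiomics matrix. The dataset dimensions and estimator settings below are my assumptions for illustration, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Synthetic stand-in for the high-dimensional radiomics feature matrix.
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           random_state=0)

# RFE with a random forest, keeping 100 features (the winning configuration).
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=100, step=0.2)
rfe.fit(X, y)

# Univariate ANOVA F-test selection (SelectKBest), keeping 20 features.
kbest = SelectKBest(f_classif, k=20).fit(X, y)

print(rfe.support_.sum(), kbest.get_support().sum())  # 100 20
```

RFE is far more expensive (it refits the estimator as features are pruned), which is part of the trade-off against the nearly free univariate F-test.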
Winning method: RFE with random forest using 100 features achieved the highest accuracy (83.7%), F1 score (83%), precision (86.2%), and recall (80.0%), with a moderate stability score of 0.485. Interestingly, SelectKBest with ANOVA (n=20) offered the highest stability (0.897) despite having a slightly lower F1 score (81.6%), suggesting a trade-off between predictive performance and reproducibility. The Pearson correlation between stability and F1 score was near zero (r = 0.03), indicating these two properties are largely independent and both should be reported in radiomics pipelines.
Deep feature extraction: Beyond radiomics, the authors used GlobalAveragePooling2D to extract deep learning features from pre-trained transfer learning models. This technique condenses the spatial dimensions of a model's final convolutional feature maps by computing the average of all spatial positions for each feature map channel. For a feature map with dimensions J (height) x E (width) x X (number of channels), GlobalAveragePooling2D produces a single scalar value per channel, creating a compact 1D feature vector. This approach reduces dimensionality significantly while retaining global spatial information and reducing the risk of overfitting compared to fully connected flattening approaches.
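GlobalAveragePooling2D is just a spatial mean per channel, as this numpy equivalent shows (the 7 x 7 x 2048 shape matches ResNet-style backbones at 224 x 224 input):

```python
import numpy as np

def global_average_pool(feature_maps: np.ndarray) -> np.ndarray:
    """Collapse an H x W x C feature map to a length-C vector by averaging
    over the two spatial axes, mirroring Keras's GlobalAveragePooling2D."""
    return feature_maps.mean(axis=(0, 1))

fmap = np.random.default_rng(0).random((7, 7, 2048))  # final conv feature maps
vec = global_average_pool(fmap)
print(vec.shape)  # (2048,)
```

Compared with flattening (7 * 7 * 2048 = 100,352 values), pooling yields a 49x smaller vector with no trainable parameters, which is the overfitting advantage the authors cite.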
Multimodal feature fusion: After independently extracting both radiomics and deep features, the study applied several data harmonization steps before combining them. Missing values in both feature sets were addressed using a KNN imputer with k=5, which leverages neighboring samples to estimate missing entries. Both feature sets then underwent z-score standardization (StandardScaler) to ensure comparable scales. This normalization step is critical because radiomics features (often measured in pixel intensity units) and deep features (unitless activation patterns) can have substantially different value distributions. After standardization, the two feature matrices were concatenated horizontally to form a unified multimodal feature vector for each sample.
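The harmonization-and-concatenation sequence maps directly onto scikit-learn primitives. The matrix sizes below are placeholders; only the steps (KNN imputation with k=5, z-score standardization, horizontal concatenation) come from the text.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
radiomics = rng.random((50, 100))                   # stand-in selected radiomics
radiomics[rng.random((50, 100)) < 0.02] = np.nan    # sprinkle missing entries
deep = rng.random((50, 2048))                       # stand-in GAP deep features

# 1. KNN imputation (k = 5) fills missing radiomics values from neighbors.
radiomics = KNNImputer(n_neighbors=5).fit_transform(radiomics)
# 2. Z-score standardization puts both modalities on a comparable scale.
radiomics = StandardScaler().fit_transform(radiomics)
deep = StandardScaler().fit_transform(deep)
# 3. Horizontal concatenation yields one multimodal vector per sample.
fused = np.hstack([radiomics, deep])
print(fused.shape)  # (50, 2148)
```

In deployment the imputer and scalers must be fit on training data only and reused on validation/test data to avoid leakage.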
Why fusion helps: Radiomics features provide interpretable, clinically meaningful descriptors of tumor shape, texture, and intensity. Deep features capture complex, nonlinear, and hierarchical patterns that handcrafted radiomics cannot represent. By fusing both into a single feature space, each sample is described by complementary information: the structured, reproducible quantitative measurements from radiomics and the abstract learned representations from deep neural networks. This combination creates a richer, more comprehensive input for downstream classification models.
The ResNet152 architecture for fusion: The highest-performing model, ResNet152, was specifically enhanced with custom layers to integrate the fused features. The base ResNet152 architecture starts with a 7 x 7 convolutional layer with 64 filters and stride 2, followed by max-pooling and residual blocks with progressively increasing filter sizes (128, 256, 512, 1024). After global average pooling compresses the deep feature maps into a 1D vector, custom layers concatenate these deep features with the selected radiomics features. Several fully connected layers with ReLU activation, L2 regularization (lambda = 0.01), and 30% dropout then process this combined representation before a final Softmax layer for binary classification.
Model lineup: The study evaluated 13 transfer learning models, all initialized with ImageNet pre-trained weights: VGG16, VGG19, ResNet50, ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2, DenseNet121, DenseNet169, DenseNet201, MobileNet, and InceptionV3. Input dimensions were set to 224 x 224 x 3 for all models except InceptionV3, which used 299 x 299 x 3 to accommodate its multi-scale inception modules. Each architecture brings distinct strengths: VGG models use simple stacked 3x3 convolutions, ResNet models employ skip connections through residual blocks, DenseNet models use dense connections where every layer receives input from all preceding layers, and MobileNet uses lightweight depthwise separable convolutions.
Fine-tuning strategy: For all 13 models, the top 30% of layers were unfrozen for fine-tuning while the remaining 70% were kept frozen. This approach preserves the general-purpose feature extraction capabilities learned from ImageNet in the lower layers while allowing the upper layers to adapt to the specific characteristics of mammographic images. Freezing the majority of layers also serves as a regularization mechanism that prevents catastrophic forgetting (where the model loses useful pre-trained knowledge during fine-tuning) and reduces the risk of overfitting on the limited medical imaging data.
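The 70/30 freeze rule reduces to a one-pass index cutoff over the layer list. The sketch below uses a minimal stand-in for a framework layer object; any Keras-style model whose layers expose a `trainable` flag would work the same way.

```python
def freeze_lower_layers(layers, trainable_fraction=0.30):
    """Freeze the bottom (1 - trainable_fraction) of layers in order;
    leave the top fraction trainable. Generic sketch of the strategy,
    not the paper's exact code."""
    cutoff = int(len(layers) * (1 - trainable_fraction))
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff   # only layers past the cutoff train
    return cutoff

class Layer:                            # minimal stand-in for a framework layer
    def __init__(self):
        self.trainable = True

model_layers = [Layer() for _ in range(100)]
cutoff = freeze_lower_layers(model_layers)
print(cutoff, sum(l.trainable for l in model_layers))  # 70 30
```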
Training configuration: All models were trained using the Adam optimizer with a learning rate of 1 x 10^-5, chosen specifically for the fine-tuning context because it enables precise weight updates without destabilizing the pre-trained features. The batch size was 32 for all models. Categorical cross-entropy served as the loss function. Regularization included L2 weight decay (lambda = 0.01) and 30% dropout applied to dense layers. Early stopping with a patience of 5 epochs halted training when validation performance stopped improving, preventing overfitting and conserving computational resources. The best model checkpoint for each architecture was saved in .h5 format.
Evaluation framework: An internal validation procedure used the 80/10/10 split for training, validation, and testing, maintaining class distributions across all subsets. Model performance was assessed using accuracy, precision, recall, specificity, F1 score, and area under the curve (AUC). Additionally, training and validation accuracy/loss curves, confusion matrices, and classification reports were generated for each model to provide a complete picture of learning behavior and classification effectiveness.
Top performer: Among all 13 transfer learning models, ResNet152 achieved the best results across every metric: 97% accuracy, 97% precision, 98% recall, 96% specificity, 97% F1 score, and 99.30% AUC. The 98% recall is particularly important for clinical applications because it means the model identifies nearly all malignant cases, minimizing the risk of missed cancers (false negatives). ResNet152 also required fewer training epochs (40) than most other models (which trained for 50 epochs), indicating more efficient convergence. Its confusion matrix showed 431 true negatives and 387 true positives, with only 19 false positives and 15 false negatives out of 852 test samples.
Runner-up models: VGG19 followed closely with 96% across accuracy, precision, recall, specificity, and F1 score, along with a 99.0% AUC. ResNet101 and ResNet101V2 also achieved 96% accuracy and 99.1-99.2% AUC. DenseNet169 and DenseNet201 showed solid performance at 94-95% accuracy with AUC values between 98.6% and 98.9%. The ResNet family consistently outperformed other architectures, underscoring the value of residual connections for capturing complex hierarchical patterns in mammographic images.
Weaker performers: InceptionV3 achieved 89% accuracy with 97.5% AUC, and MobileNet was the lowest at 88% accuracy with 97.0% AUC. These lightweight architectures, while designed for computational efficiency, lacked the depth and capacity needed to capture the subtle pathological features in mammographic images. The Friedman test confirmed significant stratification in model performance (chi-squared = 8907.405, p < 0.001), and the Nemenyi post-hoc test showed that top models (VGG16, ResNet152, ResNet101) did not differ significantly from each other but significantly outperformed MobileNet and InceptionV3.
Bootstrap confidence intervals: Performance was validated using 95% bootstrap confidence intervals. VGG16 and ResNet152 both achieved 96% accuracy (95% CI: 94-97%) and 96% F1 score (95% CI: 94-97%) in the bootstrap analysis. A TOST equivalence test confirmed that ResNet152 and VGG16 were equivalent within a +/-2% accuracy margin (p < 0.001). However, McNemar's test revealed significant differences in their prediction patterns (chi-squared = 12.552, p < 0.001), suggesting divergent error profiles that could support ensemble strategies to further improve robustness.
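A percentile bootstrap over test predictions is straightforward to sketch; the resample count, seed, and toy labels below are illustrative, not the study's settings.

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for accuracy: resample test cases with
    replacement and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    accs = [correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Toy predictions agreeing with ground truth ~96% of the time on 852 cases.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 852)
y_pred = np.where(rng.random(852) < 0.96, y_true, 1 - y_true)
low, high = bootstrap_accuracy_ci(y_true, y_pred)
```

Because the bootstrap resamples cases rather than comparing paired predictions, it complements rather than replaces McNemar's test, which is what detects the divergent error profiles noted above.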
Benchmarking against the literature: The study compared its results with recent work published between 2022 and 2025. The improvements are substantial. Yu et al. achieved only 82% accuracy with ResNet50 and 71% with VGG16 on their dataset. Wei et al. reported 83% with VGG19, 72% with ResNet50, and 72% with InceptionV3. Gao et al. reached 82% with ResNet, and Yang et al. achieved 74% with 3DResNet. In contrast, the proposed method reached 97% with ResNet152, 96% with ResNet101 and VGG19, and 94% with ResNet50, representing improvements of 12 to 26 percentage points over equivalent architectures in earlier studies.
Closest competitors: The most competitive prior results came from Alexandru et al., who achieved 99.6% accuracy with DenseNet121 (though on a different dataset and evaluation protocol), Sharmin et al. with 95% using ResNet50V2, and Wang et al. with 96.4% using a custom 17-layer CNN. The proposed method's ResNet152 at 97% surpasses Sharmin et al. and closely matches Wang et al., while the DenseNet models in this study (93-95%) fall short of Alexandru et al.'s result, which was obtained using attention mechanisms on a different breast cancer dataset (BreakHis) rather than mammographic data from CBIS-DDSM.
What drives the improvement: The performance gains are attributed to the multimodal feature fusion approach. By combining selected radiomics features with deep features and feeding this enriched representation into fine-tuned transfer learning models, the method captures both the interpretable quantitative descriptors of tumor characteristics and the complex nonlinear patterns learned by deep networks. The comprehensive data augmentation pipeline, which balanced the dataset from 3,284 to 16,996 images, also played a key role by providing sufficient diverse training examples for the models to generalize effectively.
ROC and precision-recall analysis: The ROC curves showed that ResNet152, VGG19, and DenseNet169 rose most steeply at low false-positive rates and sustained high true-positive rates as false-positive rates increased, indicating exceptional discriminative ability. In the precision-recall curves, ResNet152 and VGG19 maintained high precision across a broad range of recall values, confirming strong effectiveness at various classification thresholds. MobileNet and InceptionV3 showed flatter ROC curves and pronounced drops in precision as recall increased, further confirming their limitations for this task.
Why ResNet dominates: The consistent superiority of the ResNet family across all experiments underscores the benefit of residual connections in modeling complex hierarchical patterns in medical images. Skip connections allow gradients to flow more easily through very deep networks, preventing the vanishing gradient problem that plagued earlier architectures. For mammographic analysis, where subtle textural and morphological differences between benign and malignant lesions must be captured across multiple spatial scales, the depth enabled by residual learning proves especially valuable. VGG19's competitive 96% accuracy also demonstrates that well-optimized, established architectures remain viable, particularly in resource-constrained clinical settings where simpler inference requirements offer practical deployment advantages.
Radiomics feature selection insights: The analysis revealed important practical findings about feature selection in radiomics. RFE with random forest yielded the best F1 score (83%) by leveraging the ensemble method's ability to capture nonlinear correlations. However, SelectKBest demonstrated the highest stability (89.7%), which is critical for reproducibility in clinical applications. The near-zero Pearson correlation (r = 0.03) between stability and F1 score means these two qualities are essentially independent. This finding suggests that radiomics researchers should report both metrics rather than optimizing for one alone, and that the choice of feature selection method should depend on whether the application prioritizes predictive power or reproducibility.
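One common way to quantify selection stability is the mean pairwise Jaccard similarity of the feature subsets chosen across resamples of the data. The paper's exact stability metric is not specified here, so the definition below is an assumed illustration.

```python
import numpy as np
from itertools import combinations

def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity of feature subsets chosen on
    different resamples; 1.0 means identical subsets every time."""
    sims = [len(a & b) / len(a | b) for a, b in combinations(selected_sets, 2)]
    return float(np.mean(sims))

# Feature subsets picked on three hypothetical resamples of the data.
runs = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 7, 8}]
print(round(selection_stability(runs), 3))  # 0.508
```

A method can score high on this metric while selecting weakly predictive features (or vice versa), which is consistent with the near-zero correlation between stability and F1 reported above.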
Known limitations: The CBIS-DDSM dataset, while widely used, is based on a specific demographic and imaging protocol, which may limit generalizability to other populations and clinical settings. Its retrospective nature and curation process may introduce selection bias. Evaluation on a single dataset restricts assessment of cross-domain robustness. Standard accuracy metrics may not fully capture clinical utility, as false negatives (missing a malignant tumor) carry far greater consequences than false positives. Notably, per-class analysis for the malignant class revealed that both VGG16 and ResNet152 had high recall (95-96%) but low precision (29.3%), indicating a high false-positive rate that would need to be addressed before clinical deployment.
Future directions: The authors propose several paths forward. Multi-institutional validation using federated learning frameworks would enable training and evaluation across diverse populations, imaging devices, and acquisition protocols without sharing sensitive patient data. Active learning strategies could optimize expert annotation efforts by targeting high-uncertainty or outlier cases. Vision Transformers (ViTs) represent a promising architectural evolution, as they employ self-attention mechanisms to identify long-range dependencies in images, treating image patches as sequences and learning spatial hierarchies more effectively than traditional CNNs. Integrating multimodal data beyond imaging, such as clinical data, genomic information, and patient history, could further improve predictive performance and model adaptability.