Breast cancer is the most frequently diagnosed malignant neoplasm in women and the second leading cause of cancer-related deaths in females worldwide. Detecting the disease at an early stage is vital for better therapeutic outcomes and improved survival rates. Imaging techniques such as mammography, MRI, and ultrasonography are fundamental tools for identifying irregularities in breast tissue. Deep learning, particularly convolutional neural networks (CNNs), has significantly advanced the analysis of these images by enabling automated identification of subtle patterns. However, significant challenges persist: limited availability of large annotated datasets, heterogeneity in medical image features, and restricted model generalizability across different imaging protocols and patient populations.
The feature extraction challenge: Radiomics analysis is a feature engineering approach that extracts quantitative characteristics from medical images, including the shape, texture, size, and intensity of tumors. These handcrafted features provide interpretable, clinically relevant information. Deep learning models, by contrast, automatically derive hierarchical, abstract representations from imaging data. Neither approach alone is sufficient: radiomics features offer reproducible, standardized metrics that reduce subjectivity, while deep features capture complex, nonlinear patterns that handcrafted descriptors may miss. The authors argue that combining these complementary approaches yields a more comprehensive feature representation for breast cancer classification.
Transfer learning to the rescue: Pre-trained deep neural networks such as VGG, ResNet, and Inception have been trained on vast generic datasets like ImageNet. Transfer learning lets these models carry their learned feature extractors into the medical imaging domain, which is critical because labeled medical imaging data is typically scarce. By fine-tuning pre-trained models on breast cancer images, researchers can leverage pattern-recognition capabilities learned from millions of natural images while adapting the models to the specific characteristics of mammographic imaging.
Gaps in existing work: Prior studies have made notable progress, with methods achieving up to 89.5% accuracy (Khourdifi et al.) and 97% accuracy (Chugh et al.) on various datasets, but they have generally treated radiomics and deep learning features in isolation. Very few studies have systematically combined both feature types into a unified multimodal feature space for classification. The authors also note that most existing work lacks rigorous comparison of feature selection methods for radiomics data, despite the high-dimensional nature of radiomics features being a known source of overfitting.
Source data: The study uses the well-known CBIS-DDSM (Curated Breast Imaging Subset of DDSM) dataset, which contains 2,620 mammography studies encompassing normal, benign, and malignant cases. The dataset includes updated ROI (region of interest) segmentation and bounding boxes, along with pathological assessments for training data. The original dataset was imbalanced, containing 1,930 benign images and only 1,354 malignant images. This class imbalance is problematic for cancer detection models because they may develop a bias toward predicting the majority class (benign), resulting in more missed malignant cases.
Image preprocessing pipeline: Before any analysis, each image was rescaled to a standardized resolution of 224 x 224 pixels to ensure uniformity and compatibility with pre-trained models (InceptionV3 used 299 x 299). Pixel intensity values were normalized by subtracting the mean and dividing by the standard deviation, reducing variations caused by lighting or contrast differences. A Gaussian filter was applied for noise reduction, minimizing background artifacts that could interfere with accurate feature extraction. Finally, anatomical alignment oriented all images to a common reference point, ensuring comparable regions were captured across the dataset.
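The pipeline above can be sketched in a few lines of numpy/scipy. This is an illustrative reconstruction, not the paper's code: the Gaussian sigma, the bilinear resizing via scipy, and the anatomical-alignment step being omitted are all my assumptions.

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def preprocess(image: np.ndarray, target=(224, 224), sigma=1.0) -> np.ndarray:
    """Rescale, z-score normalize, and denoise a single-channel mammogram.

    Minimal sketch of the described pipeline; sigma and the scipy-based
    resize are illustrative assumptions, not the paper's exact tooling.
    """
    img = image.astype(np.float32)
    # 1. Rescale to a standardized resolution (224 x 224 for most backbones).
    factors = (target[0] / img.shape[0], target[1] / img.shape[1])
    img = zoom(img, factors, order=1)          # bilinear interpolation
    # 2. Z-score intensity normalization: subtract mean, divide by std.
    img = (img - img.mean()) / (img.std() + 1e-8)
    # 3. Gaussian filter for noise reduction of background artifacts.
    img = gaussian_filter(img, sigma=sigma)
    return img

# Hypothetical 12-bit mammogram patch at an arbitrary native resolution.
raw = np.random.default_rng(0).integers(0, 4096, size=(512, 384)).astype(np.float32)
out = preprocess(raw)
print(out.shape)  # (224, 224)
```

Note that normalization precedes filtering here; the paper lists the steps in this order, though in practice the two commute up to a constant scale.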
Data augmentation strategy: To address the class imbalance, a comprehensive augmentation pipeline was applied. Transformations included random horizontal and vertical flips, random adjustments to brightness (factor of 0.3), contrast (0.8 to 1.2), saturation (0.8 to 1.2), and hue (0.02). Additional augmentations included Gaussian blur with random standard deviation (0.1 to 2.0) to simulate real-world imaging variations, random cropping followed by resizing to simulate zoom-level changes, and Gaussian noise injection (mean 0.0, standard deviation 0.05). After augmentation, the dataset grew to 8,498 images per label, totaling 16,996 images. Augmentation was applied consistently across all three image types collected from the dataset: full images, cropped images, and segmented ROI images.
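A few of these transformations can be sketched in plain numpy. The paper's augmentation framework is not specified here, so treat this as a hedged illustration; the parameter values mirror the text (brightness factor 0.3, noise sigma 0.05).

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random subset of the augmentations described above (sketch)."""
    img = image.copy()
    if rng.random() < 0.5:                        # random horizontal flip
        img = np.flip(img, axis=1)
    if rng.random() < 0.5:                        # random vertical flip
        img = np.flip(img, axis=0)
    img = img * rng.uniform(0.7, 1.3)             # brightness jitter (factor 0.3)
    img = img + rng.normal(0.0, 0.05, img.shape)  # Gaussian noise, sigma = 0.05
    return np.clip(img, 0.0, 1.0)                 # keep intensities in [0, 1]

rng = np.random.default_rng(42)
base = rng.random((224, 224))                     # stand-in normalized image
augmented = [augment(base, rng) for _ in range(4)]
```

Each call draws fresh random parameters, so one source image yields many distinct training examples, which is how the minority (malignant) class was brought up to parity.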
Data splitting: The augmented dataset was split into three subsets in an 80/10/10 ratio for training, validation, and testing. Because the data was served in mini-batches (batch sizes of 32 and 64), the final allocation was 13,600 training images (425 batches of 32), 1,700 validation images (53 batches of 32), and 1,700 testing images (53 batches of 32). The dataset was shuffled before splitting with a fixed random seed to ensure fairness and reproducibility.
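The seeded shuffle-then-slice procedure looks like the sketch below. A straight 80/10/10 slice of 16,996 images gives 13,596/1,699/1,701; the paper's reported counts (13,600/1,700/1,700) reflect rounding to whole batches, which this sketch does not reproduce.

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Shuffle sample indices with a fixed seed, then slice 80/10/10."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                     # reproducible shuffle
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (idx[:n_train],                        # training set
            idx[n_train:n_train + n_val],         # validation set
            idx[n_train + n_val:])                # test set

train, val, test = split_indices(16996, seed=0)
print(len(train), len(val), len(test))  # 13596 1699 1701
```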
What radiomics captures: Radiomics is the science of extracting large numbers of quantitative features from medical images. In this study, the PyRadiomics library was used to extract seven categories of radiomic features from segmented regions of interest (ROIs) in the mammograms. A total of 1,040 features were extracted per image. These included 198 first-order statistical features (capturing voxel intensity distributions such as percentiles, energy, and entropy), 264 gray-level co-occurrence matrix (GLCM) features, 176 gray-level run length matrix (GLRLM) features, 176 gray-level size zone matrix (GLSZM) features, 154 gray-level dependence matrix (GLDM) features, 55 neighborhood gray-tone difference matrix (NGTDM) features, 9 shape-based features characterizing lesion geometry, and 5 diagnostic metadata features.
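To make the first-order category concrete, here are numpy versions of a few of the intensity features PyRadiomics computes (energy, histogram entropy, percentiles). This is an illustrative re-implementation, not PyRadiomics itself, which applies its own binning and normalization rules.

```python
import numpy as np

def first_order_features(roi: np.ndarray, bins: int = 32) -> dict:
    """Compute a handful of first-order radiomic features from an ROI (sketch)."""
    x = roi.ravel().astype(np.float64)
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                                    # drop empty bins for log
    return {
        "energy": float(np.sum(x ** 2)),            # sum of squared intensities
        "entropy": float(-np.sum(p * np.log2(p))),  # Shannon entropy of histogram
        "p10": float(np.percentile(x, 10)),         # 10th intensity percentile
        "p90": float(np.percentile(x, 90)),         # 90th intensity percentile
    }

roi = np.random.default_rng(1).random((64, 64))     # stand-in segmented ROI
feats = first_order_features(roi)
```

The texture families (GLCM, GLRLM, GLSZM, GLDM, NGTDM) follow the same pattern but operate on matrices of gray-level co-occurrences rather than raw intensities.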
Why feature selection matters: With 1,040 features, the risk of overfitting is significant, especially given the relatively limited sample sizes typical of medical imaging studies. High-dimensional radiomics data can cause models to perform well on training data but fail on new data. Feature selection mitigates this "curse of dimensionality" by identifying the most discriminative subset of features while discarding redundant or noisy ones. The authors systematically compared a broad range of supervised feature selection techniques with varying subset sizes to find the optimal balance between predictive performance and model robustness.
Feature selection methods compared: The study evaluated multiple approaches. Recursive Feature Elimination (RFE) was applied using both random forest and logistic regression classifiers, selecting 10, 20, 50, and 100 features. RFECV (with cross-validation) automatically determined optimal feature counts: 74 for random forest and 647 for logistic regression. Univariate selection via ANOVA F-statistic (SelectKBest) was tested with 10, 20, 50, and 100 features. LASSO (LassoCV) yielded 90 and 157 selected features on fivefold cross-validation. Mutual information ranking retained the top 50, 100, and 200 features. Embedded methods using GPU-accelerated tree-based models (XGBoost, LightGBM, and CatBoost) were also explored with subset sizes of 50, 100, and 200.
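Two of the compared configurations can be sketched with scikit-learn on synthetic data standing in for the radiomics matrix. The dataset dimensions and estimator settings below are my assumptions for illustration, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Synthetic stand-in for the high-dimensional radiomics feature matrix.
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           random_state=0)

# RFE with a random forest, keeping 100 features (the winning configuration).
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=100, step=0.2)
rfe.fit(X, y)

# Univariate ANOVA F-test selection (SelectKBest), keeping 20 features.
kbest = SelectKBest(f_classif, k=20).fit(X, y)

print(rfe.support_.sum(), kbest.get_support().sum())  # 100 20
```

RFE is far more expensive (it refits the estimator as features are pruned), which is part of the trade-off against the nearly free univariate F-test.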
Winning method: RFE with random forest using 100 features achieved the highest accuracy (83.7%), F1 score (83%), precision (86.2%), and recall (80.0%), with a moderate stability score of 0.485. Interestingly, SelectKBest with ANOVA (n=20) offered the highest stability (0.897) despite having a slightly lower F1 score (81.6%), suggesting a trade-off between predictive performance and reproducibility. The Pearson correlation between stability and F1 score was near zero (r = 0.03), indicating these two properties are largely independent and both should be reported in radiomics pipelines.
Deep feature extraction: Beyond radiomics, the authors used GlobalAveragePooling2D to extract deep learning features from pre-trained transfer learning models. This technique condenses the spatial dimensions of a model's final convolutional feature maps by computing the average of all spatial positions for each feature map channel. For a feature map with dimensions J (height) x E (width) x X (number of channels), GlobalAveragePooling2D produces a single scalar value per channel, creating a compact 1D feature vector. This approach reduces dimensionality significantly while retaining global spatial information and reducing the risk of overfitting compared to fully connected flattening approaches.
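GlobalAveragePooling2D is just a spatial mean per channel, as this numpy equivalent shows (the 7 x 7 x 2048 shape matches ResNet-style backbones at 224 x 224 input):

```python
import numpy as np

def global_average_pool(feature_maps: np.ndarray) -> np.ndarray:
    """Collapse an H x W x C feature map to a length-C vector by averaging
    over the two spatial axes, mirroring Keras's GlobalAveragePooling2D."""
    return feature_maps.mean(axis=(0, 1))

fmap = np.random.default_rng(0).random((7, 7, 2048))  # final conv feature maps
vec = global_average_pool(fmap)
print(vec.shape)  # (2048,)
```

Compared with flattening (7 * 7 * 2048 = 100,352 values), pooling yields a 49x smaller vector with no trainable parameters, which is the overfitting advantage the authors cite.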
Multimodal feature fusion: After independently extracting both radiomics and deep features, the study applied several data harmonization steps before combining them. Missing values in both feature sets were addressed using a KNN imputer with k=5, which leverages neighboring samples to estimate missing entries. Both feature sets then underwent z-score standardization (StandardScaler) to ensure comparable scales. This normalization step is critical because radiomics features (often measured in pixel intensity units) and deep features (unitless activation patterns) can have substantially different value distributions. After standardization, the two feature matrices were concatenated horizontally to form a unified multimodal feature vector for each sample.
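The harmonization-and-concatenation sequence maps directly onto scikit-learn primitives. The matrix sizes below are placeholders; only the steps (KNN imputation with k=5, z-score standardization, horizontal concatenation) come from the text.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
radiomics = rng.random((50, 100))                   # stand-in selected radiomics
radiomics[rng.random((50, 100)) < 0.02] = np.nan    # sprinkle missing entries
deep = rng.random((50, 2048))                       # stand-in GAP deep features

# 1. KNN imputation (k = 5) fills missing radiomics values from neighbors.
radiomics = KNNImputer(n_neighbors=5).fit_transform(radiomics)
# 2. Z-score standardization puts both modalities on a comparable scale.
radiomics = StandardScaler().fit_transform(radiomics)
deep = StandardScaler().fit_transform(deep)
# 3. Horizontal concatenation yields one multimodal vector per sample.
fused = np.hstack([radiomics, deep])
print(fused.shape)  # (50, 2148)
```

In deployment the imputer and scalers must be fit on training data only and reused on validation/test data to avoid leakage.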
Why fusion helps: Radiomics features provide interpretable, clinically meaningful descriptors of tumor shape, texture, and intensity. Deep features capture complex, nonlinear, and hierarchical patterns that handcrafted radiomics cannot represent. By fusing both into a single feature space, each sample is described by complementary information: the structured, reproducible quantitative measurements from radiomics and the abstract learned representations from deep neural networks. This combination creates a richer, more comprehensive input for downstream classification models.
The ResNet152 architecture for fusion: The highest-performing model, ResNet152, was specifically enhanced with custom layers to integrate the fused features. The base ResNet152 architecture starts with a 7 x 7 convolutional layer with 64 filters and stride 2, followed by max-pooling and residual blocks with progressively increasing filter sizes (128, 256, 512, 1024). After global average pooling compresses the deep feature maps into a 1D vector, custom layers concatenate these deep features with the selected radiomics features. Several fully connected layers with ReLU activation, L2 regularization (lambda = 0.01), and 30% dropout then process this combined representation before a final Softmax layer for binary classification.
Model lineup: The study evaluated 13 transfer learning models, all initialized with ImageNet pre-trained weights: VGG16, VGG19, ResNet50, ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2, DenseNet121, DenseNet169, DenseNet201, MobileNet, and InceptionV3. Input dimensions were set to 224 x 224 x 3 for all models except InceptionV3, which used 299 x 299 x 3 to accommodate its multi-scale inception modules. Each architecture brings distinct strengths: VGG models use simple stacked 3x3 convolutions, ResNet models employ skip connections through residual blocks, DenseNet models use dense connections where every layer receives input from all preceding layers, and MobileNet uses lightweight depthwise separable convolutions.
Fine-tuning strategy: For all 13 models, the top 30% of layers were unfrozen for fine-tuning while the remaining 70% were kept frozen. This approach preserves the general-purpose feature extraction capabilities learned from ImageNet in the lower layers while allowing the upper layers to adapt to the specific characteristics of mammographic images. Freezing the majority of layers also serves as a regularization mechanism that prevents catastrophic forgetting (where the model loses useful pre-trained knowledge during fine-tuning) and reduces the risk of overfitting on the limited medical imaging data.
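The 70/30 freeze rule reduces to a one-pass index cutoff over the layer list. The sketch below uses a minimal stand-in for a framework layer object; any Keras-style model whose layers expose a `trainable` flag would work the same way.

```python
def freeze_lower_layers(layers, trainable_fraction=0.30):
    """Freeze the bottom (1 - trainable_fraction) of layers in order;
    leave the top fraction trainable. Generic sketch of the strategy,
    not the paper's exact code."""
    cutoff = int(len(layers) * (1 - trainable_fraction))
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff   # only layers past the cutoff train
    return cutoff

class Layer:                            # minimal stand-in for a framework layer
    def __init__(self):
        self.trainable = True

model_layers = [Layer() for _ in range(100)]
cutoff = freeze_lower_layers(model_layers)
print(cutoff, sum(l.trainable for l in model_layers))  # 70 30
```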
Training configuration: All models were trained using the Adam optimizer with a learning rate of 1 x 10^-5, chosen specifically for the fine-tuning context because it enables precise weight updates without destabilizing the pre-trained features. The batch size was 32 for all models. Categorical cross-entropy served as the loss function. Regularization included L2 weight decay (lambda = 0.01) and 30% dropout applied to dense layers. Early stopping with a patience of 5 epochs halted training when validation performance stopped improving, preventing overfitting and conserving computational resources. The best model checkpoint for each architecture was saved in .h5 format.
Evaluation framework: An internal validation procedure used the 80/10/10 split for training, validation, and testing, maintaining class distributions across all subsets. Model performance was assessed using accuracy, precision, recall, specificity, F1 score, and area under the curve (AUC). Additionally, training and validation accuracy/loss curves, confusion matrices, and classification reports were generated for each model to provide a complete picture of learning behavior and classification effectiveness.
Top performer: Among all 13 transfer learning models, ResNet152 achieved the best results across every metric: 97% accuracy, 97% precision, 98% recall, 96% specificity, 97% F1 score, and 99.30% AUC. The 98% recall is particularly important for clinical applications because it means the model identifies nearly all malignant cases, minimizing the risk of missed cancers (false negatives). ResNet152 also required fewer training epochs (40) than most other models (which trained for 50 epochs), indicating more efficient convergence. Its confusion matrix showed 431 true negatives and 387 true positives, with only 19 false positives and 15 false negatives out of 852 test samples.
Runner-up models: VGG19 followed closely with 96% across accuracy, precision, recall, specificity, and F1 score, along with a 99.0% AUC. ResNet101 and ResNet101V2 also achieved 96% accuracy and 99.1-99.2% AUC. DenseNet169 and DenseNet201 showed solid performance at 94-95% accuracy with AUC values between 98.6% and 98.9%. The ResNet family consistently outperformed other architectures, underscoring the value of residual connections for capturing complex hierarchical patterns in mammographic images.
Weaker performers: InceptionV3 achieved 89% accuracy with 97.5% AUC, and MobileNet was the lowest at 88% accuracy with 97.0% AUC. These lightweight architectures, while designed for computational efficiency, lacked the depth and capacity needed to capture the subtle pathological features in mammographic images. The Friedman test confirmed significant stratification in model performance (chi-squared = 8907.405, p < 0.001), and the Nemenyi post-hoc test showed that top models (VGG16, ResNet152, ResNet101) did not differ significantly from each other but significantly outperformed MobileNet and InceptionV3.
Bootstrap confidence intervals: Performance was validated using 95% bootstrap confidence intervals. VGG16 and ResNet152 both achieved 96% accuracy (95% CI: 94-97%) and 96% F1 score (95% CI: 94-97%) in the bootstrap analysis. A TOST equivalence test confirmed that ResNet152 and VGG16 were equivalent within a +/-2% accuracy margin (p < 0.001). However, McNemar's test revealed significant differences in their prediction patterns (chi-squared = 12.552, p < 0.001), suggesting divergent error profiles that could support ensemble strategies to further improve robustness.
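A percentile bootstrap over test predictions is straightforward to sketch; the resample count, seed, and toy labels below are illustrative, not the study's settings.

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for accuracy: resample test cases with
    replacement and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    accs = [correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Toy predictions agreeing with ground truth ~96% of the time on 852 cases.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 852)
y_pred = np.where(rng.random(852) < 0.96, y_true, 1 - y_true)
low, high = bootstrap_accuracy_ci(y_true, y_pred)
```

Because the bootstrap resamples cases rather than comparing paired predictions, it complements rather than replaces McNemar's test, which is what detects the divergent error profiles noted above.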
Benchmarking against the literature: The study compared its results with recent work published between 2022 and 2025. The improvements are substantial. Yu et al. achieved only 82% accuracy with ResNet50 and 71% with VGG16 on their dataset. Wei et al. reported 83% with VGG19, 72% with ResNet50, and 72% with InceptionV3. Gao et al. reached 82% with ResNet, and Yang et al. achieved 74% with 3DResNet. In contrast, the proposed method reached 97% with ResNet152, 96% with ResNet101 and VGG19, and 94% with ResNet50, representing improvements of 12 to 26 percentage points over equivalent architectures in earlier studies.
Closest competitors: The most competitive prior results came from Alexandru et al., who achieved 99.6% accuracy with DenseNet121 (though on a different dataset and evaluation protocol), Sharmin et al. with 95% using ResNet50V2, and Wang et al. with 96.4% using a custom 17-layer CNN. The proposed method's ResNet152 at 97% surpasses Sharmin et al. and closely matches Wang et al., while the DenseNet models in this study (93-95%) fall short of Alexandru et al.'s result, which was obtained using attention mechanisms on a different breast cancer dataset (BreakHis) rather than mammographic data from CBIS-DDSM.
What drives the improvement: The performance gains are attributed to the multimodal feature fusion approach. By combining selected radiomics features with deep features and feeding this enriched representation into fine-tuned transfer learning models, the method captures both the interpretable quantitative descriptors of tumor characteristics and the complex nonlinear patterns learned by deep networks. The comprehensive data augmentation pipeline, which balanced the dataset from 3,284 to 16,996 images, also played a key role by providing sufficient diverse training examples for the models to generalize effectively.
ROC and precision-recall analysis: The ROC curves showed that ResNet152, VGG19, and DenseNet169 rose most steeply at low false-positive rates and sustained high true-positive rates as false-positive rates increased, indicating exceptional discriminative ability. In the precision-recall curves, ResNet152 and VGG19 maintained high precision across a broad range of recall values, confirming strong effectiveness at various classification thresholds. MobileNet and InceptionV3 showed flatter ROC curves and pronounced drops in precision as recall increased, further confirming their limitations for this task.
Why ResNet dominates: The consistent superiority of the ResNet family across all experiments underscores the benefit of residual connections in modeling complex hierarchical patterns in medical images. Skip connections allow gradients to flow more easily through very deep networks, preventing the vanishing gradient problem that plagued earlier architectures. For mammographic analysis, where subtle textural and morphological differences between benign and malignant lesions must be captured across multiple spatial scales, the depth enabled by residual learning proves especially valuable. VGG19's competitive 96% accuracy also demonstrates that well-optimized, established architectures remain viable, particularly in resource-constrained clinical settings where simpler inference requirements offer practical deployment advantages.
Radiomics feature selection insights: The analysis revealed important practical findings about feature selection in radiomics. RFE with random forest yielded the best F1 score (83%) by leveraging the ensemble method's ability to capture nonlinear correlations. However, SelectKBest demonstrated the highest stability (89.7%), which is critical for reproducibility in clinical applications. The near-zero Pearson correlation (r = 0.03) between stability and F1 score means these two qualities are essentially independent. This finding suggests that radiomics researchers should report both metrics rather than optimizing for one alone, and that the choice of feature selection method should depend on whether the application prioritizes predictive power or reproducibility.
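One common way to quantify selection stability is the mean pairwise Jaccard similarity of the feature subsets chosen across resamples of the data. The paper's exact stability metric is not specified here, so the definition below is an assumed illustration.

```python
import numpy as np
from itertools import combinations

def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity of feature subsets chosen on
    different resamples; 1.0 means identical subsets every time."""
    sims = [len(a & b) / len(a | b) for a, b in combinations(selected_sets, 2)]
    return float(np.mean(sims))

# Feature subsets picked on three hypothetical resamples of the data.
runs = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 7, 8}]
print(round(selection_stability(runs), 3))  # 0.508
```

A method can score high on this metric while selecting weakly predictive features (or vice versa), which is consistent with the near-zero correlation between stability and F1 reported above.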
Known limitations: The CBIS-DDSM dataset, while widely used, is based on a specific demographic and imaging protocol, which may limit generalizability to other populations and clinical settings. Its retrospective nature and curation process may introduce selection bias. Evaluation on a single dataset restricts assessment of cross-domain robustness. Standard accuracy metrics may not fully capture clinical utility, as false negatives (missing a malignant tumor) carry far greater consequences than false positives. Notably, per-class analysis for the malignant class revealed that both VGG16 and ResNet152 had high recall (95-96%) but low precision (29.3%), indicating a high false-positive rate that would need to be addressed before clinical deployment.
Future directions: The authors propose several paths forward. Multi-institutional validation using federated learning frameworks would enable training and evaluation across diverse populations, imaging devices, and acquisition protocols without sharing sensitive patient data. Active learning strategies could optimize expert annotation efforts by targeting high-uncertainty or outlier cases. Vision Transformers (ViTs) represent a promising architectural evolution, as they employ self-attention mechanisms to identify long-range dependencies in images, treating image patches as sequences and learning spatial hierarchies more effectively than traditional CNNs. Integrating multimodal data beyond imaging, such as clinical data, genomic information, and patient history, could further improve predictive performance and model adaptability.