Breast cancer is one of the most common cancers in women globally and the second leading cause of cancer-related death in women after lung cancer. The disease develops from breast tissue cells and is classified into multiple subtypes, including invasive ductal carcinoma (IDC), which accounts for approximately 80% of cases, ductal carcinoma in situ (DCIS), which represents 20 to 53% of cases, and invasive lobular carcinoma (ILC), which accounts for 10 to 15% of cases. Additional aggressive subtypes include triple-negative breast cancer (TNBC), HER2-positive breast cancer, and inflammatory breast cancer (IBC). Risk factors include age, family history, BRCA1 and BRCA2 gene mutations, extended use of hormone replacement therapy (HRT), and reproductive factors such as early menarche and late menopause.
Limitations of current screening: Conventional diagnostic methods face significant challenges. Mammography, the gold standard for breast cancer screening, struggles with sensitivity in women with dense breast tissue, producing false negatives, while its ambiguous findings trigger unnecessary follow-up tests. Ultrasound, while useful for dense breasts and for distinguishing cysts from solid masses, suffers from limited specificity and operator dependence. MRI provides high sensitivity and detailed soft-tissue visualization but comes with high cost, longer exam duration, and the need for specialized expertise. Biopsy procedures, though highly accurate, are invasive and carry a small risk of complications. Clinical breast examination is limited by examiner expertise and may miss non-palpable masses entirely.
The promise of AI and ML: Advancements in artificial intelligence and machine learning have created new opportunities for more accurate and reliable diagnostic models. Convolutional neural networks (CNNs) and other deep learning approaches have demonstrated the ability to improve image quality, reduce noise, remove artifacts, and assist in image segmentation and region-of-interest detection. The CNN Improvements for Breast Cancer Classification (CNNI-BCC) model, for example, achieved 90.50% accuracy on data from 221 actual patients. However, these approaches often require significant computing power for image preprocessing, which motivates the search for more efficient alternatives.
Research objective: This paper proposes an efficient deep learning model capable of recognizing breast cancer in computerized mammograms of varying densities while requiring less computational power than existing methods. The study uses craniocaudal (CC) and mediolateral oblique (MLO) views of mammograms from a dataset of 3,002 merged images gathered from 1,501 individuals who underwent digital mammography between February 2007 and May 2015. The researchers applied six different classification models: random forest (RF), decision tree (DT), k-nearest neighbors (KNN), logistic regression (LR), support vector classifier (SVC), and linear SVC.
CNN-based approaches: Several studies have applied machine learning to breast cancer detection with varying degrees of success. A CNN algorithm employed for predicting and diagnosing invasive ductal carcinoma achieved an accuracy of approximately 88%. Deep convolutional neural networks have proven effective for early-stage breast cancer detection, improving outcomes for patients undergoing treatment. However, a recurring limitation identified in the literature is the significant computational power required for preprocessing medical images before they can be fed into these networks. The CNNI-BCC model demonstrated 90.50% accuracy on 221 patients using a trained deep learning neural network system to categorize breast cancer subtypes without human intervention.
Classical ML comparisons: Comparative studies have benchmarked several traditional algorithms. The random forest algorithm achieved the highest accuracy at 99.76% with the least error on the Wisconsin breast cancer dataset. A multilayer perceptron (MLP) neural network trained with back-propagation achieved 97% classification accuracy, comparable to results reported for the radial basis function (RBF) network. K-nearest neighbors (KNN) outperformed both naive Bayes and RBF classifiers with over 94% detection accuracy, also achieving higher precision and F1-score. The ANN classifier outperformed SVM, naive Bayes, and decision tree classifiers when combined with feature selection techniques, producing a reported 51% efficiency gain.
Ensemble and boosting methods: Research comparing extreme gradient boost (XGBoost) and random forest on a small dataset of 275 examples found that RF outperformed XGBoost for breast cancer detection accuracy. A broader study examining nine classification models on the Wisconsin Diagnosis Cancer Dataset concluded that KNN was the most effective method for supervised learning, while logistic regression was most effective for semi-supervised learning. Ensemble learning techniques, including stacking, boosting, and bagging, have been shown to improve classification performance by merging individual classifiers into aggregated models.
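The stacking idea mentioned above can be sketched with scikit-learn's built-in `StackingClassifier`; the base models, meta-learner, and dataset (scikit-learn's copy of the Wisconsin data) are illustrative choices, not the configurations used in the cited studies.

```python
# Illustrative stacking ensemble: base classifiers' predictions are merged
# by a meta-learner. Model choices here are assumptions for the sketch.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),  # aggregates base outputs
)
stack.fit(X_train, y_train)
print(f"stacking accuracy: {stack.score(X_test, y_test):.3f}")
```

Bagging and boosting follow the same pattern with `BaggingClassifier` and `GradientBoostingClassifier` from the same `sklearn.ensemble` module.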
Key research gap: While many studies achieve high accuracy, most rely on a single algorithm or a limited set of comparisons. The literature reveals that different algorithms perform best under different conditions, and the choice of dataset, feature selection method, and preprocessing pipeline all significantly influence results. No single approach has emerged as definitively superior across all contexts, which motivates this study's systematic comparison of six classifiers with a unified feature selection and preprocessing pipeline on a large mammography dataset.
Overall workflow: The proposed methodology follows a structured pipeline. First, the breast cancer dataset is loaded, and data is separated into features (X) and labels (y). Features are normalized or standardized so each has equal influence on the algorithms. The data is then split into 70% for training and 30% for testing. Six classification models are trained and evaluated: random forest (RF), decision tree (DT), k-nearest neighbors (KNN), logistic regression (LR), support vector classifier (SVC), and linear SVC. The most effective algorithm is selected based on performance metrics, and feature importance analysis is performed to determine which attributes contribute most to the classification decision.
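Assuming scikit-learn defaults (the paper does not list hyperparameters), the workflow above can be sketched end to end:

```python
# Sketch of the described pipeline: scale features, split 70/30, train all
# six classifiers, and select the top performer. Defaults are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)        # features (X) and labels (y)
X = StandardScaler().fit_transform(X)             # give each feature equal influence
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)         # 70% training / 30% testing

models = {
    "RF": RandomForestClassifier(random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "LinearSVC": LinearSVC(max_iter=5000),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
best = max(scores, key=scores.get)                # pick the most effective model
print(best, round(scores[best], 4))
```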
Feature selection modules: The study employs three distinct modules for feature selection, which is critical for reducing dataset dimensionality while retaining the most informative attributes. The first module removes low-variance features, eliminating attributes that carry little discriminative power. The second module, univariate feature selection, evaluates each feature independently to assess its relationship with the target variable. The third module, recursive feature elimination, iteratively removes the least important features based on classifier feedback until an optimal subset remains. This three-pronged approach ensures that the final feature set is both compact and highly informative.
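The three modules map directly onto standard scikit-learn transformers; the variance threshold and the choice of ten retained features below are illustrative assumptions, not values given in the paper.

```python
# The three feature-selection modules described above, sketched with
# scikit-learn. Threshold and k are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Module 1: drop near-constant features (little discriminative power).
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Module 2: univariate selection -- score each feature independently
# against the target and keep the ten best.
X_uni = SelectKBest(f_classif, k=10).fit_transform(X_var, y)

# Module 3: recursive feature elimination -- iteratively drop the least
# important feature based on classifier feedback.
X_scaled = StandardScaler().fit_transform(X_var)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X_scaled, y)

print(X.shape[1], X_var.shape[1], X_uni.shape[1], X_rfe.shape[1])
```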
Preprocessing pipeline: Raw breast cancer data is processed using the Standard Scaler module from Python's scikit-learn package, which standardizes features by removing the mean and scaling to unit variance. The preprocessing stage also handles data duplication, where duplicate values are replaced, and data balancing, where imbalanced class distributions are corrected. Image preprocessing includes creating variants through rotation, flipping, and adjusting brightness and contrast. Feature extraction techniques such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), and deep feature extraction using pre-trained CNNs (VGG, ResNet, Inception) are employed for image-based features.
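The Standard Scaler step computes z = (x − mean) / std per column; a minimal check on toy data confirms the standardized output has zero mean and unit variance:

```python
# StandardScaler removes each feature's mean and scales to unit variance,
# column by column. The toy matrix below is purely illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = StandardScaler().fit_transform(X)

print(Z.mean(axis=0))  # per-column means, approximately [0, 0]
print(Z.std(axis=0))   # per-column standard deviations, [1, 1]
```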
Mammographic views: The model incorporates both craniocaudal (top-down) and mediolateral oblique (angled side) views of mammograms, which is standard clinical protocol. Each breast is X-rayed from both angles, producing complementary information about breast tissue structure. Radiologists examine these images for abnormalities including masses, microcalcifications, architectural distortions, asymmetries, spiculated borders, and nodules. The dual-view approach provides a more comprehensive picture than single-view analysis, as some abnormalities may only be visible from certain angles.
Genetic programming optimization: The study also employs genetic programming (GP), an evolutionary algorithm that generalizes the genetic algorithm, to optimize the machine learning pipelines. GP creates solutions based on biological evolution principles (mutation, crossover, and selection), testing candidates and keeping the best of each group of outcomes. It constructs pipelines by varying the order of operators and their hyperparameters, with the top 20 individuals cloned in each generation and a 5% crossover rate. This approach automatically searches for strong model architectures when the optimal pipeline structure and the most important features are not known in advance.
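The paper's exact GP configuration is not reproduced here; the following toy loop, with illustrative operators, population size, and mutation choices, shows the evolutionary idea of scoring a population of pipeline configurations, cloning the fittest, and mutating their offspring:

```python
# Toy GP-style pipeline search (NOT the paper's optimizer): each genome is
# a (scaler, n_trees) pair; fitness is cross-validated accuracy.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

random.seed(0)
X, y = load_breast_cancer(return_X_y=True)
SCALERS = [StandardScaler, MinMaxScaler]

def fitness(genome):
    scaler, n_trees = genome
    pipe = make_pipeline(scaler(), RandomForestClassifier(
        n_estimators=n_trees, random_state=0))
    return cross_val_score(pipe, X, y, cv=3).mean()

def mutate(genome):
    scaler, n_trees = genome
    if random.random() < 0.5:
        scaler = random.choice(SCALERS)        # mutate an operator choice
    else:
        n_trees = max(10, n_trees + random.choice([-20, 20]))  # mutate a hyperparameter
    return (scaler, n_trees)

population = [(random.choice(SCALERS), random.choice([10, 50, 100]))
              for _ in range(6)]
for generation in range(3):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:2]                         # clone the fittest individuals
    population = elite + [mutate(random.choice(elite)) for _ in range(4)]

best = max(population, key=fitness)
print(best, round(fitness(best), 3))
```

Real GP systems additionally apply crossover between individuals and evolve the pipeline's structure, not just its hyperparameters; this sketch keeps only selection, cloning, and mutation for brevity.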
Dataset composition: The study uses a large dataset of 3,002 merged images gathered from 1,501 individuals who underwent digital mammography between February 2007 and May 2015. This dataset includes both craniocaudal (CC) and mediolateral oblique (MLO) views for each patient, providing two complementary perspectives of each breast. The breast cancer data is sourced from the Kaggle platform and processed using Python Jupyter Notebook version 6.4.12. Exploratory data analysis (EDA) is performed using Python extensions including Pandas for data manipulation, Seaborn for statistical visualizations, Plotly for interactive visualizations, and Bokeh for interactive dashboards.
How breast cancer appears on mammograms: Breast cancer on mammograms varies depending on stage, size, and location. Radiologists look for several key abnormalities: masses (solid lumps of tissue), microcalcifications (tiny calcium deposits in breast tissue that can indicate early cancer), architectural distortions (irregularities in the breast's normal structure), asymmetries (differences in appearance between left and right breasts), spiculated borders (jagged or spiky edges around a mass), and nodules (small rounded masses assessed for malignancy). Results are reported using the Breast Imaging Reporting and Data System (BI-RADS), which categorizes findings into levels. A BI-RADS 0 rating indicates that further evaluation such as additional imaging or biopsy is needed.
Handling outliers: The EDA phase revealed anomalies in the dataset's boxplot distributions. To avoid interfering with the detection of dangerous cancers, outliers were removed only from benign tumors, preserving all malignant tumor data points. After this processing, the maximum value of the "worst area" feature decreased from 1,210.0 to 932.7, and the maximum area mean value dropped from 992.1 to 788.5. The authors acknowledge this strategy may lead to some false positives (benign tumors mistakenly labeled as cancerous), but they consider this preferable to the alternative of missing actual cancers, since early detection saves lives when treatment is most effective.
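The one-sided outlier policy can be sketched with a standard 1.5×IQR rule applied only to the benign class; the rule itself and the focus on the "worst area" feature are assumptions for illustration, using scikit-learn's copy of the Wisconsin data (where target 0 is malignant and 1 is benign):

```python
# Sketch of benign-only outlier removal: drop IQR outliers among benign
# tumors while keeping every malignant case. The 1.5*IQR rule is assumed.
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame   # target: 0 = malignant, 1 = benign

benign = df[df["target"] == 1]
malignant = df[df["target"] == 0]

q1, q3 = benign["worst area"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = benign["worst area"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cleaned = pd.concat([benign[mask], malignant])  # malignant rows untouched
print(len(df), "->", len(cleaned))
```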
Data distribution insights: The dataset analysis showed that the vast majority of data is right-skewed, meaning the right-hand tail of the distribution is longer than the left. This indicates that most patients have smaller values for certain features, while a small percentage have larger values. A correlation analysis of the ten best variables found a link of 69% or higher between each of these factors and patient diagnosis. Additionally, malignant tumors consistently showed larger feature values and higher variation compared to benign tumors, making these statistical differences a key signal for the classifiers.
Confusion matrix fundamentals: The study evaluates classifier performance using a confusion matrix, which compares predicted outcomes against actual values. Four key quantities are measured: true positives (TP), where malignant tumors are correctly identified; false positives (FP), where benign tumors are incorrectly classified as malignant; false negatives (FN), where malignant tumors are incorrectly classified as benign; and true negatives (TN), where benign tumors are correctly identified. From these, four standard metrics are derived: accuracy (overall correctness), precision (proportion of positive predictions that are truly positive), recall or sensitivity (proportion of actual positives correctly identified), and F1-score (harmonic mean of precision and recall).
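The four metrics follow directly from the four counts; a worked example with illustrative numbers (not the study's reported matrix) makes the arithmetic concrete:

```python
# Deriving accuracy, precision, recall, and F1 from confusion-matrix
# counts. The counts below are illustrative, not the paper's results.
tp, fp, fn, tn = 90, 5, 3, 102

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
precision = tp / (tp + fp)                    # positives that are truly positive
recall    = tp / (tp + fn)                    # a.k.a. sensitivity
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
```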
K-fold cross-validation: The dataset is partitioned into K equal-sized subgroups. In each iteration, one subgroup serves as the test set while the remaining K-1 subgroups are used for training. The model is trained and assessed K times, and performance indicators from each fold are averaged to produce a comprehensive evaluation. This technique ensures that the assessment is more reliable and less dependent on any single data split, enabling more precise forecasts of the model's performance on new, unseen data. Cross-validation also helps detect overfitting, optimize hyperparameters, and evaluate bias-variance tradeoffs.
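The scheme above maps onto scikit-learn's cross-validation utilities; K=5 and the choice of a random forest are assumptions for this sketch:

```python
# Stratified 5-fold cross-validation: each fold serves once as the test
# set, and the per-fold accuracies are averaged.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

print(scores.mean(), scores.std())   # averaged fold metrics and their spread
```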
Ensemble model considerations: The authors emphasize that the success of an ensemble model depends on the diversity of individual models and their capacity to capture various aspects of the data. Throughout the process, careful model selection, training, and assessment are critical. The ensemble must also be monitored and maintained as new data becomes available or the clinical setting changes. This is particularly important in breast cancer detection, where demographic shifts, new imaging technologies, and evolving clinical protocols can all affect model performance over time.
Correlation analysis: The study computed correlations between all variables and diagnosis, revealing that the top ten features had a correlation of 69% or higher with the target outcome. The diagnostic relationship between features was tested by examining calculated correlations sorted by absolute value. Features such as radius_mean, perimeter_mean, area_mean, concavity_mean, and concave_points_mean showed the strongest associations with whether a tumor was malignant or benign. This correlation analysis guided the feature selection process, ensuring that the most predictive attributes were retained while less informative ones were removed.
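Sorting feature-to-diagnosis correlations by absolute value, as described, takes a few lines with pandas; the sketch below uses scikit-learn's copy of the Wisconsin dataset, where column names (e.g. "mean radius") differ slightly from the paper's underscore style:

```python
# Rank features by the absolute value of their correlation with diagnosis.
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
corr = df.corr()["target"].drop("target")          # correlation with diagnosis
top10 = corr.abs().sort_values(ascending=False).head(10)
print(top10)
```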
Distribution characteristics: The data exhibited right-skewed distributions across most features, meaning that the majority of patients had smaller values for key measurements while a small percentage showed significantly larger values. This skewness is clinically meaningful: it suggests that most tumors are relatively small and benign, while a smaller but critical subset exhibits the extreme measurements associated with malignancy. For the breast cancer dataset specifically, this pattern implies that certain characteristics vary substantially with tumor type, making distribution shape itself a useful diagnostic signal.
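Right skew can be quantified with the Fisher skewness coefficient, which is positive when the right tail is longer; the feature choice below is an illustrative check, not an analysis from the paper:

```python
# A positive skewness coefficient confirms a right-skewed distribution.
from scipy.stats import skew
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)
area_mean = X[:, 3]            # column 3 is "mean area" in this dataset
print(skew(area_mean))         # > 0 indicates a longer right tail
```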
Malignant vs. benign patterns: Graphical analysis of the breast cancer diagnosis data confirmed that malignant tumors consistently exhibited larger values for the same characteristics compared to benign tumors. After outlier removal, data from malignant tumors showed higher variation than data for benign tumors. Malignant tumor outliers were better represented in clusters when extreme cases had not been removed, suggesting that preserving these outliers is important for maintaining detection sensitivity. The color-coded diagnosis graphs showed that most features were strongly correlated with each other, reinforcing the value of the multi-feature approach used in the classification pipeline.
Age and demographic factors: The study analyzed breast cancer prevalence by age group, dividing patients into young adults, middle-aged adults, and elders, with separate tracking of malignant and benign diagnoses. According to WHO data, breast cancer is the most frequent cancer among women worldwide, with case rates differing significantly across geographic locations. The proportion is often greater in affluent nations due to widespread screening programs and public education efforts, while underdeveloped countries face scarcer resources and less awareness. These demographic insights provide important context for understanding the generalizability of the model's predictions across different populations.
Headline result: Among the six classifiers tested, random forest achieved the highest accuracy at 96.49%, establishing it as the best-performing model for this breast cancer detection task. The confusion matrix for the best model showed 50 true positives (malignant tumors correctly identified), 44 true negatives (benign tumors correctly identified), only 2 false positives (benign tumors misclassified as malignant), and 4 false negatives (malignant tumors misclassified as benign). This high ratio of correct predictions to errors demonstrates the model's strong discriminative ability between malignant and benign tumors.
Classifier ranking: Random forest models demonstrated the highest accuracy values overall, indicating superior performance in predicting the target variable compared to all other models. Decision tree and KNN achieved higher accuracy values than logistic regression and SVC, and were comparable to each other, though neither matched random forest's performance. Logistic regression and SVC produced similar values, indicating comparable but moderate performance in predicting the target variable. The linear SVC classifier rounded out the comparison, with all six models providing a comprehensive view of how different algorithmic approaches handle the same breast cancer classification task.
Why random forest excels: Random forest's strong performance can be attributed to several factors. As an ensemble classifier composed of many decision trees working together, it improves both efficiency and prediction accuracy by aggregating diverse perspectives on the data. Its versatility, ease of interpretation, and ability to determine which traits are most crucial to the categorization decision-making process make it particularly well-suited for medical applications. The bagging method that underlies RF reduces variance and helps prevent overfitting, which is especially important when dealing with clinical data where generalization to new patients is essential.
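The interpretability claim rests on the per-feature importances a fitted forest exposes; a quick sketch on scikit-learn's Wisconsin data (an assumption, since the paper's mammography features differ) shows how they are read off:

```python
# Random forest reports which features drive its decisions via
# feature_importances_, which sum to 1 across all features.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(random_state=42).fit(data.data, data.target)
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))  # top contributors
```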
Practical implications: The selection of an algorithm for breast cancer diagnosis in real-world clinical settings should consider the unique needs of the clinical environment and available computational resources. The authors note that the best method should be chosen based on factors including dataset size, hardware capabilities, and whether real-time or batch processing is required. Computational efficiency of selected algorithms may be further increased through parallelization and optimization methods, ensuring their viability and efficacy for clinical applications. While random forest leads in this study, the relative rankings may shift depending on these practical considerations.
Principal findings: The study systematically evaluated six classification models for breast cancer diagnosis using the Breast Cancer Wisconsin (diagnostic) dataset with Standard Scaler preprocessing and scikit-learn-based feature selection. The data was processed using multimodal sets of machine learning algorithms (linear SVC, SVC, KNN, DT, RF, and LR), and performance was assessed through confusion matrices and standard metrics including accuracy, precision, recall (sensitivity), and F1-score. The random forest classifier emerged as the clear winner at 96.49% accuracy, validating its suitability for breast cancer detection tasks that require both high accuracy and interpretability.
Clinical significance: Machine learning has the potential to greatly improve the identification and diagnosis of breast cancer. The study demonstrated that the correlation between key features and diagnosis exceeds 69% for the top variables, and that malignant tumors consistently exhibit larger and more variable measurements than benign tumors. The outlier management strategy of preserving malignant outliers while removing only benign outliers prioritizes cancer detection sensitivity over specificity, an approach that aligns with the clinical imperative to catch cancers early when treatment is most effective. This preference for minimizing false negatives over false positives reflects the medical reality that a missed cancer is far more dangerous than an unnecessary follow-up test.
Broader context: Breast cancer remains a leading cause of female mortality, particularly in developing countries where screening access is limited. Early detection through mammography, combined with AI-assisted analysis, offers a path toward more equitable cancer care. The study notes that advancements in artificial intelligence have made mammography more accurate, and deep learning models continue to improve at recognizing breast cancer in computerized mammograms. Breast MRI has also proven to be a highly sensitive imaging technique, with dynamic contrast-enhanced (DCE) MRI providing both morphological and functional lesion information that complements mammographic findings.
Future directions: The authors call for continued research and cooperation between data scientists, medical experts, and researchers to make substantial progress in breast cancer detection and treatment. Successfully integrating ML technology into clinical practice will require overcoming technological, moral, and legal obstacles. The objective is to develop a more precise, approachable, and patient-centric approach to breast cancer diagnosis and care. Future work may explore larger and more diverse datasets, additional imaging modalities, and optimization techniques such as parallelization to increase computational efficiency, ensuring the viability of these algorithms for real-time clinical applications.