Artificial Intelligence-Powered Mammography: Navigating the Landscape of Deep Learning for Breast Cancer Detection


Plain-English Explanations
Pages 1-3
Why Mammography Needs AI and Why Traditional CAD Falls Short

Breast cancer is one of the most commonly diagnosed malignancies in women worldwide, and early detection remains the single most important factor in improving survival rates and long-term health outcomes. Screening mammography, the method of choice for identifying breast cancer in asymptomatic women, has been shown to reduce breast cancer-related deaths by 40% to 62%. However, current diagnostic criteria for evaluating mammograms, such as the Breast Imaging Reporting and Data System (BI-RADS), are constrained by two types of errors: "detection mistakes" (where pathology is missed entirely) and "interpretation errors" (where pathology is misidentified or mischaracterized). These errors can delay diagnosis and treatment, with potentially fatal consequences.

The limitations of traditional CAD: Computer-aided detection (CAD) is not a new concept in breast imaging. The U.S. Food and Drug Administration approved the first CAD system for mammography in 1998, and insurance reimbursement followed in 2002. Despite this widespread adoption, traditional CAD programs that use prompts to flag potential cancers on mammograms have not meaningfully improved diagnostic precision. Retrospective studies that initially suggested clinical benefits were not corroborated in real-world practice. Traditional CAD systems rely on manually crafted features and rule-based logic, which limits their ability to capture the full complexity of mammographic patterns.

The deep learning revolution: The resurgence of interest in automated mammogram interpretation has been driven by deep learning (DL), a subset of machine learning that uses multilayered convolutional neural networks (CNNs) to learn complex associations directly from data. Unlike traditional CAD, DL methods do not require explicit programming by a human. Instead, the computer processes hundreds of thousands of images and learns on its own how to categorize them, stacking layers of mathematical operations in an architecture loosely inspired by the human brain. This key difference means DL systems can detect subtle patterns in mammograms that would be invisible to rule-based systems.

Scope of this review: This scoping review systematically assessed the literature on AI for breast cancer detection. The authors searched Google Scholar, PubMed, Web of Science, and Scopus for studies published between 2015 and 2022, focusing on mammography as a primary use case. Non-English articles, predatory journal publications, and commercial platforms were excluded. The review covers conventional ML, AI, and DL algorithms across detection, classification, and segmentation tasks.

TL;DR: Screening mammography reduces breast cancer deaths by 40-62%, but human errors and limited traditional CAD systems leave room for improvement. Deep learning, which learns directly from images without explicit programming, has reignited interest in automated mammogram interpretation. This scoping review covers AI studies from 2015-2022 across detection, classification, and segmentation.
Pages 3-4
Breast Cancer Subtypes, BI-RADS, and What Mammography Actually Captures

The BI-RADS framework: In most countries with breast cancer screening programs, BI-RADS (Breast Imaging Reporting and Data System) is the standard communication tool for mammography reports. First introduced in the United States in 1995 by the American College of Radiology, BI-RADS assigns mammograms to one of seven assessment categories. These range from BI-RADS 0 (incomplete, needing additional imaging) through BI-RADS 1 (negative, no abnormalities) and BI-RADS 2 (benign, 0% malignancy probability) up to BI-RADS 5 (highly suggestive of malignancy, greater than 95% probability) and BI-RADS 6 (biopsy-proven malignancy). The intermediate categories carry escalating suspicion: BI-RADS 3 indicates less than 2% malignancy probability, while BI-RADS 4 spans from 2% to 94% and is subdivided into 4A (low, 2-9%), 4B (moderate, 10-49%), and 4C (high, 50-94%).
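The probability thresholds above form a simple decision ladder. As an illustration (a hypothetical helper, not part of the BI-RADS standard or the reviewed paper), they can be encoded directly; note that categories 1-3 cannot be distinguished by probability alone, since they also depend on imaging findings:

```python
# Hypothetical mapper from an estimated malignancy probability to the
# BI-RADS categories quoted above. Illustrative only: real BI-RADS
# assignment is based on imaging findings, not a probability score.

def birads_category(p_malignant: float) -> str:
    """Map a malignancy probability (0.0-1.0) to a BI-RADS label."""
    if p_malignant < 0.02:
        # Covers BI-RADS 1/2/3; separating them requires the imaging
        # findings themselves, so return the most suspicious match.
        return "BI-RADS 3 (probably benign, <2%)"
    if p_malignant < 0.10:
        return "BI-RADS 4A (low suspicion, 2-9%)"
    if p_malignant < 0.50:
        return "BI-RADS 4B (moderate suspicion, 10-49%)"
    if p_malignant < 0.95:
        return "BI-RADS 4C (high suspicion, 50-94%)"
    return "BI-RADS 5 (highly suggestive, >95%)"

print(birads_category(0.07))  # falls in the low-suspicion 4A band
```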

Major histological subtypes: The two primary histological subtypes of breast cancer are invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC). Invasive lobular carcinomas account for approximately 10-15% of all breast cancers and are distinguished by a discohesive, single-file growth pattern of small, round tumor cells. Critically, ILC is harder to detect on both mammography and FDG PET/CT imaging compared to IDC, and it is typically discovered at a more advanced stage in older patients. About 90% of ILCs show E-cadherin (CDH1) protein deficiency, and roughly 90% express estrogen receptors, making them usually luminal A subtype by gene expression profiling. Less than 10% express HER2/ERBB2.

What mammography captures: Mammography uses low-dose X-rays to image internal breast tissues. Two plates compress the breast to reduce ray dispersion and produce clearer images without requiring high radiation doses. The tissue changes associated with cancer typically appear as white zones against a gray contrast background. The most prevalent mammographic signs of cancer are masses and calcifications, and experts traditionally search for areas that differ from surrounding tissue in size, shape, contrast, or border characteristics. The radiation dose for a standard two-view mammogram of each breast is low, approximately 0.4 mSv.

Why subtypes matter for AI: The difference in detectability between IDC and ILC is directly relevant to AI development. Because ILC grows in a diffuse, single-file pattern rather than forming a discrete mass, it produces subtler mammographic findings that are easier for both humans and algorithms to miss. AI systems trained predominantly on IDC-dominant datasets may underperform on ILC cases unless the training data specifically accounts for this subtype's unique imaging characteristics. Patients with ILC also tend to experience relatively late recurrence and lower long-term survival compared to stage-matched IDC patients, making early detection even more critical for this subtype.

TL;DR: BI-RADS categorizes mammograms from 0 (needs more imaging) to 6 (confirmed cancer), with malignancy probability thresholds at each level. The two main breast cancer subtypes, IDC and ILC (10-15% of cases), differ significantly in imaging appearance, with ILC being harder to detect on mammography and typically found at more advanced stages.
Pages 4-5
How Deep Learning Works in Mammography and What Makes It Different From Traditional ML

From conventional ML to deep learning: The performance of deep learning algorithms has significantly improved compared to conventional machine learning and AI methods. DL has seen rapid adoption across image classification, natural language processing, gaming, and especially medical imaging for detecting diseases such as skin cancer, brain tumors, and breast cancer. The core advantage of DL is its ability to learn descriptive feature mappings directly from images using large numbers of training samples, producing highly accurate classification results. For general image classification tasks, networks are typically trained on more than one million photos across more than 1,000 data classes. However, medical imaging presents a unique challenge: annotated training data is scarce, expensive to produce, and subject to interobserver variance, because annotations depend on individual experts' skills and knowledge.

Convolutional neural networks (CNNs): CNNs are the dominant architecture for medical image analysis. Deep convolutional neural networks (dCNNs) became the method of choice for computer vision tasks following the 2012 ImageNet Large Scale Visual Recognition Challenge. When supplied with raw data, dCNNs can develop multiple features connected to specific outcomes without needing human-crafted feature engineering. In breast imaging, CNNs analyze pixel-level patterns in mammograms to identify suspicious regions, classify tissue types, and segment masses. This approach eliminates the need for handcrafted features that traditional CAD systems depend on, allowing the model to discover patterns that human engineers might never explicitly define.

Strategies for limited medical data: Researchers have developed several tactics to address the chronic shortage of annotated medical images. Transfer learning allows models pre-trained on large general-purpose datasets to be fine-tuned on smaller medical datasets, with only the final layers retrained for the new target task. Data augmentation applies affine transformations such as translation, rotation, and flipping to artificially expand the training set. Other approaches include using 2D patches or 3D cubes as input instead of entire images to reduce model parameters and prevent overfitting, and convolutionalizing weights from fully connected layers for more efficient processing.
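The affine augmentations named above (translation, rotation, flipping) are straightforward to sketch. A minimal example, assuming a mammogram patch stored as a 2-D NumPy array (real pipelines would also apply random translations, scaling, and intensity changes):

```python
import numpy as np

# A minimal sketch of the affine-style augmentations described above.
# Assumes a 2-D NumPy array as the image patch; illustrative only.

def augment(patch: np.ndarray) -> list:
    """Return flipped and rotated copies of a 2-D image patch."""
    return [
        patch,                 # original
        np.fliplr(patch),      # horizontal flip
        np.flipud(patch),      # vertical flip
        np.rot90(patch, k=1),  # 90-degree rotation
        np.rot90(patch, k=2),  # 180-degree rotation
        np.rot90(patch, k=3),  # 270-degree rotation
    ]

patch = np.arange(16, dtype=np.float32).reshape(4, 4)
augmented = augment(patch)
print(len(augmented))  # 6 training views from a single labeled patch
```

Each labeled patch thus yields several distinct training views, which is why augmentation is an effective lever when annotated exams are scarce.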

Integration into medical imaging workflows: Computer-extracted characteristics can be embedded into ML algorithms that use complex statistical approaches to learn data patterns and improve at specific tasks. DL sits within this broader ML ecosystem as a subcategory that uses many-layered neural networks to evaluate complicated patterns. The key distinction is that DL models can develop their own feature representations from raw data, while traditional ML models require humans to define and extract features before training. This self-directed feature learning is what gives DL its advantage in mammography, where the visual patterns indicative of cancer can be extraordinarily subtle.

TL;DR: Deep learning uses CNNs to learn directly from mammographic images, eliminating the need for human-crafted features that limited traditional CAD. The main challenge is limited annotated medical data, addressed through transfer learning, data augmentation, and patch-based training. DL's self-directed feature learning is what allows it to detect subtle cancer patterns that rule-based systems miss.
Pages 5-6
How AI Identifies and Classifies Masses, Microcalcifications, and Lesions

Mass identification and classification: Recognizing masses on mammograms can be especially challenging in dense breast tissue, where masses may be obscured by overlapping structures. Several studies have proposed advanced approaches, including a fuzzy clustering method based on crow search optimization (CrSA-IFCM-NA) that effectively separates masses from mammogram images. Other researchers have created integrated CAD systems using the You Only Look Once (YOLO) regional deep learning method, full-resolution deep network (FrCN) models, and dCNNs. Using the INbreast dataset, these approaches achieved detection accuracy of 97.9%, demonstrating their potential to assist radiologists in making precise diagnoses of breast masses.
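Detection systems like the YOLO-based pipelines above are typically scored with intersection-over-union (IoU), the standard overlap measure between a predicted bounding box and the radiologist-annotated one. A generic sketch (not the cited studies' evaluation code), assuming boxes given as `(x_min, y_min, x_max, y_max)` pixel coordinates:

```python
# Intersection-over-union (IoU), the standard overlap score used to judge
# whether a detector has localized a mass correctly. Generic sketch, not
# the cited authors' code. Boxes: (x_min, y_min, x_max, y_max).

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# A prediction shifted halfway off the ground truth:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 5000/15000 ≈ 0.333
```

A detection is usually counted as a true positive only when IoU exceeds a preset threshold (0.5 is a common convention in the detection literature).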

Microcalcification detection: Breast calcifications are small calcium salt deposits that appear as tiny white patches on mammography. While macrocalcifications are generally normal and age-related, microcalcifications (ranging from 0.1 mm to 1 mm in size) may be early signs of breast cancer, with or without visible masses. A CNN model constructed using filtered deep features was shown to outperform handcrafted feature extraction methods for detecting microcalcifications. For distinguishing benign from malignant microcalcifications, an improved Fisher linear discriminant analytical method combined with a support vector machine (SVM) variant achieved 96% average classification accuracy across 288 regions of interest from the Digital Database for Screening Mammography (DDSM), correctly classifying 139 malignant and 149 benign cases.
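To make the Fisher-plus-SVM idea concrete, here is a toy sketch of the Fisher linear discriminant stage only, on synthetic two-feature data. The cited work pairs an improved Fisher LDA with an SVM variant; everything below (data, features, threshold rule) is an illustrative assumption, not the authors' method:

```python
import numpy as np

# Toy Fisher linear discriminant on synthetic benign/malignant feature
# vectors. Illustrative assumption-laden sketch, not the cited pipeline,
# which combines an improved Fisher LDA with an SVM variant.

def fisher_direction(X0: np.ndarray, X1: np.ndarray) -> np.ndarray:
    """Direction w maximizing between-class over within-class scatter."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed covariance of the two classes.
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
benign = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
malignant = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))

w = fisher_direction(benign, malignant)
# Classify by projecting onto w and thresholding at the midpoint of the
# projected class means (malignant projects higher by construction).
threshold = (benign.mean(0) @ w + malignant.mean(0) @ w) / 2
acc = np.mean(np.concatenate([benign @ w < threshold,
                              malignant @ w >= threshold]))
print(round(acc, 2))  # near-perfect on this well-separated toy data
```

In the cited work, the Fisher projection reduces the region-of-interest features to a discriminative axis, and the SVM then draws the final benign/malignant boundary.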

Additional microcalcification methods: Beyond CNN-based approaches, researchers have developed other innovative techniques. Jian et al. built a dual-tree complex wavelet transform-based CAD system specifically for breast microcalcification identification. Guo et al. created a hybrid method combining a non-linking simplified pulse-coupled neural network with contourlet transform for mammographic microcalcification detection. These diverse approaches illustrate that microcalcification detection is an active area with multiple viable algorithmic strategies. Artificial neural networks can now automatically detect, segment, and categorize both masses and microcalcifications, serving as a resource for radiologists and considerably enhancing their accuracy and productivity.

Mass segmentation: Proper segmentation of breast masses directly impacts treatment success. Researchers have used contour maps to automatically segment breast masses on mammograms, achieving a mean true positive rate of 91.12% and precision of 88.08% using the Mini-Mammographic Image Analysis Society (MIAS) database. For the DDSM dataset, an SVM classifier paired with mesh-free radial basis function collocation achieved 97.12% sensitivity and 92.43% specificity in dividing suspicious areas into normal and abnormal categories. Plane fitting and dynamic programming techniques have further improved the accuracy of segmenting breast lesions, demonstrating the applicability of DL in automated medical image analysis systems.
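The segmentation scores quoted above (true positive rate, precision, sensitivity, specificity) all derive from a pixel-wise confusion matrix between a predicted mask and a ground-truth mask. A generic sketch (not the cited authors' code), assuming binary 0/1 masks of equal shape:

```python
import numpy as np

# Pixel-wise segmentation metrics from a predicted vs. ground-truth
# binary mask. Generic sketch of the quoted scores' definitions.

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
    }

truth = np.zeros((8, 8), dtype=int)
truth[2:6, 2:6] = 1                 # a 4x4 "mass"
pred = np.zeros((8, 8), dtype=int)
pred[2:6, 3:7] = 1                  # prediction shifted right by one pixel

m = segmentation_metrics(pred, truth)
print(m)  # sensitivity and precision both 0.75 for this offset
```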

TL;DR: AI systems achieve 97.9% accuracy for mass detection (using YOLO and dCNNs on the INbreast dataset) and 96% accuracy for classifying microcalcifications as benign or malignant (using SVM on DDSM). Mass segmentation reaches 97.12% sensitivity and 92.43% specificity. Multiple algorithmic approaches exist for each task, from fuzzy clustering to wavelet transforms.
Pages 6-7
Why Deep Learning Outperforms Traditional Machine Learning and Where the Evidence Gaps Remain

The variability problem: Breast cancer presents in many forms, from overt masses with spiculated margins to subtle asymmetries or faint microcalcifications, making consistent mammogram interpretation a persistent challenge. One of the most significant issues in screening mammography is interreader performance variability: studies have shown that radiologist sensitivity for detecting breast cancer ranges from 74.5% to 92.3%. This nearly 18-percentage-point spread means that whether cancer is detected can depend heavily on which radiologist reads the mammogram. DL offers a potential solution because its performance is consistent and not subject to the fatigue, distraction, or subjective interpretation that contribute to human variability.

Why DL outperforms conventional ML: For recognition tasks in mammography, deep learning consistently outperforms conventional machine learning methods. The fundamental advantage is that DL learns rich feature representations directly from massive amounts of data, unconstrained by human-designed criteria. Traditional ML approaches require experts to manually define which features (shape, texture, contrast, border characteristics) the algorithm should evaluate, inevitably missing patterns that fall outside these predefined categories. DL-based systems can reliably detect different radiological manifestations of cancer because they discover their own feature hierarchies through training, capturing associations that human feature engineers would never explicitly code.

Systematic review findings: A recent systematic review of AI applications in breast cancer diagnosis found that various models, including CNNs and systems trained on the DDSM dataset, were employed to achieve timely and precise results. CNN was the most frequently utilized algorithm, achieving a notable accuracy rate of 98%, with specificity the next highest reported metric at 99%. Most of the reviewed studies originated in the United States, China, and Japan. However, a 2021 review on the efficacy of AI systems for mammography screening concluded that the quality and quantity of current evidence are far from sufficient for integrating AI systems into routine clinical practice.

The validation gap: While AI is anticipated to aid mammography-based breast cancer screening by enhancing cancer diagnosis and reducing false-positive recalls, viability must still be demonstrated in prospective clinical studies. The existing literature shows promising results but often lacks detailed data descriptions, clarity on accuracy across different finding types, and performance evaluations across different demographics and imaging machines. Prior studies have also revealed discrepancies in how countries approach gathering evidence, synthesizing findings, and formulating policy around AI-assisted screening, further complicating the path to clinical adoption.

TL;DR: Radiologist sensitivity for breast cancer detection varies from 74.5% to 92.3%, and DL can reduce this inconsistency. CNNs achieve up to 98% accuracy and 99% specificity in systematic reviews, but a 2021 review found the evidence still insufficient for routine clinical integration. Key gaps include lack of prospective trials, limited demographic diversity, and inconsistent reporting standards.
Pages 7-8
Data Scarcity, Privacy Barriers, and the Challenges of Building DL-Based CAD Systems

The data problem: Developing new deep learning-based CAD systems requires large, high-quality databases, and building these databases is expensive. Unsupervised learning, where computers must identify intrinsic patterns from unlabeled images, demands high-quality raw data to optimize output. Full-resolution mammographic images generate enormous data files and storage requirements. While techniques such as transfer learning and data augmentation have helped reduce training data requirements, the validation and testing datasets used across studies remain non-uniform. This inconsistency makes it challenging to reproduce published results and contrast the performance of different algorithms on a level playing field.

Privacy and data sharing constraints: The Health Insurance Portability and Accountability Act (HIPAA) imposes strict requirements on both imaging and clinical data, creating a constrained capacity for data sharing and pooling between institutions. This is a significant bottleneck because DL models perform best when trained on diverse datasets from multiple sources. Without cross-institutional data sharing, models risk being trained on narrow, institution-specific populations and imaging equipment, limiting their generalizability. The success of initiatives like the DREAM Challenge demonstrates the value of open science and highlights the significance of collecting sizable, high-quality, anonymous, shareable, and generalizable datasets that can be used across research groups.

Digital breast tomosynthesis (DBT) as a frontier: While DL has shown strong results in standard 2D mammography, digital breast tomosynthesis presents additional challenges and opportunities. DBT creates quasi-3D images by taking multiple low-dose X-ray projections at different angles, providing better visualization of overlapping tissue structures. However, DL model development for DBT requires the assembly of even larger databases than those needed for standard mammography, since each DBT exam produces many more images than a conventional mammogram. DL is expected to eventually play a significant part in DBT, including the creation of synthetic mammographic images from tomosynthesis data, but this area requires substantially more development.

The reproducibility challenge: Beyond data quantity and privacy, the field faces a reproducibility problem. Many published studies report impressive accuracy metrics but use different datasets, preprocessing pipelines, and evaluation protocols, making it nearly impossible to determine which algorithms genuinely perform best. Standardized benchmarking on common datasets with agreed-upon evaluation criteria would significantly advance the field, but achieving this standardization requires coordination across institutions, regulatory bodies, and research groups that may have competing interests and different data governance frameworks.

TL;DR: Building DL-based CAD systems is hindered by the expense of large annotated datasets, HIPAA-driven data sharing restrictions, and non-uniform validation protocols that prevent reproducibility. DBT (3D mammography) requires even larger datasets than standard mammography. Open science initiatives like the DREAM Challenge point toward solutions, but standardized benchmarking remains an unresolved problem.
Pages 8-10
FDA-Approved AI Tools, Regulatory Pathways, and the Road Ahead for AI in Mammography

FDA clearance landscape: The FDA has recognized the potential impact of AI-based diagnostic products on human health and employs the Software as a Medical Device (SaMD) standard to identify software under its regulatory authority. As of the review period, the FDA had cleared nine AI products specifically for breast cancer screening, suspicious lesion identification, and mammogram triage. All nine products gained clearance through the 510(k) pathway, which requires demonstrating substantial equivalence to an existing approved device. Notably, all clearances were based on retrospective data rather than prospective clinical trials, raising questions about real-world performance.

What was measured and what was not: The primary reported outcomes for FDA-cleared devices were standard test performance measures: sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Tissue biopsy was used as the gold standard for evaluating breast cancer screening accuracy in most of these devices. However, other clinically important measures of utility were conspicuously absent. None of the cleared devices reported outcomes related to cancer stage at detection, interval cancer detection rates (cancers found between scheduled screenings), or other clinical endpoints that would demonstrate real-world patient benefit beyond pure diagnostic accuracy.
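Of the performance measures above, AUC is the least intuitive, but it has a simple probabilistic reading: it is the chance that a randomly chosen cancer case receives a higher model score than a randomly chosen negative case. A pure-Python sketch with made-up scores (the Mann-Whitney formulation, not any device vendor's evaluation code):

```python
# AUC via its Mann-Whitney interpretation: the probability that a random
# positive outscores a random negative (ties count as half). Generic
# sketch with invented scores, not any cleared device's evaluation code.

def auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

cancer = [0.9, 0.8, 0.75, 0.6]    # scores on biopsy-proven cancers
healthy = [0.7, 0.4, 0.3, 0.2]    # scores on confirmed negatives
print(auc(cancer, healthy))  # 0.9375: 15 of 16 pairs ranked correctly
```

This pairwise view also makes the review's criticism concrete: a high AUC says the model ranks cases well on a retrospective test set, but says nothing about cancer stage at detection or interval cancer rates in practice.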

The promise ahead: Despite the current evidence gaps, AI is expected to play a significant role in the future of mammography and digital breast tomosynthesis assessments, particularly in screening scenarios. DL-based systems outperform traditional CAD systems that use manually created features, approaching radiologist-level performance in some tasks. The review emphasizes that continuous algorithm updates, external validation across diverse populations, and prospective clinical studies are needed before these systems can be fully integrated into routine clinical workflows. The development of larger, more representative databases will be especially critical for advancing DL in DBT.

The bottom line: This review highlights both the exciting potential and the real limitations of AI in breast cancer screening. The technology has reached a point where DL algorithms can match or approach radiologist performance on specific mammographic tasks, with nine FDA-cleared products already on the market. Yet the field remains constrained by reliance on retrospective validation, limited dataset diversity, privacy barriers to data sharing, and the absence of prospective clinical outcome data. Bridging these gaps through larger clinical studies, standardized evaluation protocols, and open science initiatives will determine whether AI fulfills its promise of saving lives through earlier, more accurate breast cancer detection.

TL;DR: Nine FDA-cleared AI products exist for breast cancer screening, all approved via the 510(k) pathway using retrospective data only. None report clinical outcomes like cancer stage at detection or interval cancer rates. DL systems approach radiologist-level performance, but prospective trials, diverse datasets, and standardized benchmarks are needed before routine clinical integration can become a reality.
Citation: Al Muhaisen S, Safi O, Ulayan A, et al. Artificial Intelligence-Powered Mammography: Navigating the Landscape of Deep Learning for Breast Cancer Detection. Cureus. 2024. PMCID: PMC11044525. DOI: 10.7759/cureus.56945. License: CC BY.