Artificial intelligence for the detection of acute myeloid leukemia from microscopic blood images


Plain-English Explanations
Pages 1-2
Why Automating AML Detection from Blood Smears Matters

Leukemia is the 11th most prevalent cancer worldwide, responsible for approximately 2.5% of all new cancer cases and 3.1% of cancer-related deaths as of 2020. Acute myeloid leukemia (AML) is the most common malignant blood cancer in adults, making rapid and accurate detection essential for timely treatment. The standard diagnostic pipeline starts with microscopic examination of peripheral blood smears (PBS) and bone marrow slides, followed by immunophenotyping and cytogenetic analysis to confirm the diagnosis.

However, traditional visual inspection of blood smears under a microscope is time-consuming, error-prone, and dependent on the hematologist's expertise and physical acuity. More advanced techniques like molecular cytogenetics and Array-based Comparative Genomic Hybridization (aCGH) are expensive and resource-intensive, which means microscopic blood tests remain the most widely used method for leukemia subtype identification. This creates a clear opportunity for automated image analysis.

This systematic review and meta-analysis by Al-Obeidat et al. (2024) set out to evaluate all AI-based approaches for detecting and diagnosing AML from microscopic blood images. The authors searched PubMed, Web of Science, and Scopus through December 2023 and ultimately included 10 studies conducted between 2016 and 2023. The analysis covered 24 distinct AI models across these studies, spanning convolutional neural networks (CNNs), generative adversarial networks (GANs), and support vector machines (SVMs).

The primary outcome measures were accuracy and sensitivity (recall), which together capture how well a model identifies AML cases correctly and how many true-positive cases it catches. The study was registered with PROSPERO (CRD42024501980) and followed PRISMA 2020 guidelines, providing a rigorous framework for evidence synthesis.

TL;DR: AML is the most common malignant blood cancer in adults. This systematic review analyzed 10 studies (2016-2023) covering 24 AI models for AML detection from microscopic blood images, focusing on accuracy and sensitivity as primary outcomes.
Pages 2-3
Search Strategy, Inclusion Criteria, and Quality Assessment

The authors searched PubMed, Web of Science, and Scopus using keywords including "acute myeloid leukemia," "artificial intelligence," "deep learning," and "machine learning," with no timeframe or language restrictions applied. From 2,565 initial records, 655 duplicates were removed. After title/abstract screening, 75 articles advanced to full-text review, and 10 studies ultimately met all inclusion criteria.

Inclusion criteria: Studies had to (1) use human AML peripheral blood smear samples, (2) employ AI techniques for diagnosing or classifying AML, (3) report performance metrics including recall (sensitivity) and accuracy, and (4) provide separate metrics for AML diagnosis rather than an overall model accuracy across all classes. Exclusion criteria: Studies covering irrelevant conditions like acute promyelocytic leukemia (APL) or myelodysplastic syndrome (MDS), those using flow cytometry or microarray gene algorithms, studies focused on image segmentation into blast cells rather than whole-image classification, studies reporting only prognosis or subtype identification (M1, M2, etc.), and incomplete data or non-original articles.

Quality assessment: Methodological quality was evaluated using the QUADAS-2 tool (Quality Assessment of Diagnostic Accuracy Studies 2), which assesses four domains: patient selection, index test, reference standard, and flow/timing. The first three domains were also assessed for applicability concerns. Results showed an overall low risk of bias and low risk of applicability concerns, though some unclarity was noted in the flow and timing domain.

Data extraction was performed independently by two authors using Microsoft Excel, with disagreements resolved by consensus. Extracted variables included patient/sample counts, total images used after augmentation, classification task type (binary or multiclass), datasets and their reference standards, classifier usage, transfer learning application, and validation method.

TL;DR: From 2,565 initial records, 10 studies met strict inclusion criteria requiring human AML blood smear samples, AI-based diagnosis, and AML-specific performance metrics. QUADAS-2 assessment showed overall low risk of bias across included studies.
Pages 3-4
Meta-Analysis Framework and Heterogeneity Assessment

The meta-analysis used the "metafor" and "metagen" libraries in R to pool accuracy values across 24 models from the 10 included studies. Both common-effects and random-effects models were computed. The random-effects model is particularly important here because it accounts for variability in effect sizes between studies, which is expected when different AI architectures, datasets, and preprocessing pipelines are involved.

Heterogeneity measures: The I2 statistic quantified the percentage of total variation across studies attributable to true differences rather than sampling error, with values above 60% indicating high heterogeneity. The H2 statistic estimated the ratio of total variability to sampling variability. The Q-value measured the degree of variability in results across studies, where a high H2 (>1.5) and large Q-value with a low p-value (p < 0.05) suggested significant heterogeneity. The Restricted Maximum Likelihood (REML) method was used to estimate total heterogeneity (tau-squared).
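As a hedged illustration of how these statistics relate to one another (not the authors' code, which used REML in R's "metafor"), the pooling and heterogeneity calculations can be sketched in Python using the simpler DerSimonian-Laird moment estimator for tau-squared:

```python
import math

def random_effects_pool(estimates, variances):
    """Pool study estimates and compute Q, I2, H2, and tau^2.

    Sketch only: the review used REML; this uses the DerSimonian-Laird
    moment estimator, which is close enough to show the mechanics.
    """
    w = [1.0 / v for v in variances]                      # common-effect weights
    sw = sum(w)
    common = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q: weighted squared deviations from the common-effect estimate
    Q = sum(wi * (yi - common) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (Q - df) / c)                         # between-study variance
    I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0   # % variation beyond sampling error
    H2 = Q / df                                           # total vs. sampling variability
    # Random-effects weights fold tau^2 into each study's variance
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, Q, I2, H2, tau2

# Toy accuracies with hypothetical variances (not the paper's data)
pooled, se, Q, I2, H2, tau2 = random_effects_pool(
    [0.95, 0.99, 0.85], [0.001, 0.0005, 0.002]
)
```

Note how tau-squared feeds back into the weights: that downweighting of extreme studies is exactly why the random-effects estimate below diverges from the common-effects one.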

Statistical significance was determined using the z-value and corresponding p-value, with a threshold of p < 0.05. Forest plots were generated for both accuracy and sensitivity to visualize the distribution of effect sizes across studies. Funnel plots were also created and visually inspected to check for publication bias, as asymmetry in these plots can indicate that smaller studies with positive outcomes are disproportionately represented.

TL;DR: The meta-analysis pooled 24 AI models using both common-effects and random-effects models in R, with heterogeneity assessed via I2, H2, Q-value, and tau-squared. Funnel plots were used to detect publication bias.
Pages 5-8
The 10 Studies: AI Architectures, Datasets, and Classification Tasks

The 10 included studies were conducted across Pakistan, Saudi Arabia, the United States, India, Iran, and Egypt. Seven studies used CNNs as their primary architecture, two used GANs, and one used SVMs. Specific architectures included hybrid CNN models (Baig et al., 2022), ResNet-34 and DenseNet-121 (Bibi et al., 2020), an Auxiliary Classifier GAN or AC-GAN (Karar et al., 2022), a Hybrid CNN with Interactive Autodidactic School optimization or HCNN-IAS (Sakthiraj, 2022), SENet-based CNN (Shalini and Viji, 2023), Mayfly-optimized GAN or MayGAN (Veeraiah et al., 2023), 8-layer CNN/AlexNet (Shawly and Alsheikhy, 2022), binary/multi-SVM (Kazemi et al., 2016), pre-trained models including AlexNet, VGG16, GoogleNet, ResNet101, and Inception-v3 (Nagiub et al., 2020), and MobileNet, DenseNet121, ResNet152V2, VGG16, Xception, and InceptionV3 (Abhishek et al., 2023).

Datasets: Five studies relied on publicly available online datasets such as the American Society of Hematology Image Bank (ASH-bank) and the Acute Lymphoblastic Leukemia Image Database for Image Processing (ALL-IDB). Others used local hospital data, for example Kazemi et al. collected images from Shariati Hospital pathology laboratories (17 patients, aged 16-69), and Nagiub et al. built the AML-IDB from Assiut University Hospitals in Egypt (2017-2019). Shawly and Alsheikhy used Kaggle data. Two studies (Kazemi et al. and Nagiub et al.) included both PBS and bone marrow images.

Classification tasks: Seven studies performed multiclass classification, distinguishing AML from ALL, CML, CLL, and/or healthy samples, while three studies performed binary classification (AML vs. ALL or AML vs. normal). Transfer learning was used in four studies, and separate classifiers (SVM, Bagging ensemble, RUSBoost, fine KNN, random forest) were applied in five studies. Data augmentation was employed in roughly half the studies to increase training set diversity.

Sample sizes varied considerably, from 19 patients (Abhishek et al., 2023) to 10,500 blood samples (Shawly and Alsheikhy, 2022). AML image counts ranged from 55 samples before augmentation (Bibi et al.) to 1,016 images for validation (Shawly and Alsheikhy). This variability in study design, dataset origin, and model architecture is a key factor driving the heterogeneity observed in the meta-analysis results.

TL;DR: The 10 studies spanned 6 countries and used CNNs (7 studies), GANs (2), and SVMs (1). Datasets ranged from 19 patients to 10,500 samples. Seven studies used multiclass classification, three used binary. Transfer learning was applied in 4 studies and data augmentation in roughly half.
Pages 4, 10-11
Pooled Accuracy: High Performance with Significant Heterogeneity

The common-effects model yielded a pooled accuracy of 1.0000 (95% CI: 0.9999 to 1.0001), while the more conservative random-effects model produced an accuracy of 0.9557 (95% CI: 0.9312 to 0.9802). The random-effects estimate had a standard error of 0.0125, a z-value of 76.5840, and a p-value of less than 0.0001, confirming that the pooled estimate differed significantly from zero.
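These figures hang together under a standard Wald-style calculation (z = estimate / SE; 95% CI = estimate ± 1.96 × SE), which can be verified directly:

```python
est, se = 0.9557, 0.0125                # random-effects accuracy and SE, as reported
z = est / se                            # Wald z-statistic
ci_lo = est - 1.96 * se                 # lower bound of the 95% CI
ci_hi = est + 1.96 * se                 # upper bound of the 95% CI
# The paper reports z = 76.5840; the small gap here reflects rounding of the inputs
print(round(z, 2), round(ci_lo, 4), round(ci_hi, 4))  # → 76.46 0.9312 0.9802
```

The recovered interval (0.9312 to 0.9802) matches the reported CI exactly, which is a useful internal consistency check on the published numbers.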

Heterogeneity was substantial. The Q-value was 410.1247 with 28 degrees of freedom (p < 0.0001). The I2 statistic reached 100.00%, and the H2 was 94,583.49, both indicating extreme heterogeneity across studies. The tau-squared was 0.0043 (SE = 0.0012) with a tau of 0.0659. This means that while the average accuracy is high, the actual performance of any given AI model varies considerably depending on the study's specific design choices.

Several individual models achieved near-perfect accuracy. The SENet-CNN model from Shalini and Viji (2023) reached 99.98% accuracy. The MayGAN model from Veeraiah et al. (2023) achieved 99.8%. The AlexNet-based CNN from Shawly and Alsheikhy (2022) reported approximately 99% accuracy. On the lower end, Abhishek et al. (2023) reported an accuracy of 81% with pre-trained VGG16 combined with SVM, and 84% when using SVM as a standalone classifier on the combined dataset. This wide range is what drives the extreme I2 value.

The large gap between the common-effects (1.0000) and random-effects (0.9557) models is itself informative. It signals that a few high-sample-size studies with near-perfect accuracy dominate the common-effects estimate, while the random-effects model properly downweights them by accounting for between-study variability. The random-effects accuracy of 95.57% is the more reliable summary statistic for real-world expectations.

TL;DR: Random-effects pooled accuracy was 95.57% (95% CI: 93.12-98.02%), with individual models ranging from 81% to 99.98%. Heterogeneity was extreme (I2 = 100%, Q = 410.12, p < 0.0001), reflecting wide variation in study designs and model architectures.
Pages 10-11
Pooled Sensitivity: Strong True-Positive Detection with Notable Outliers

The common-effects and random-effects models yielded pooled sensitivity values of 1.0000 and 0.8581, respectively; the random-effects estimate had a z-value of 18.33 and a p-value below 0.0001, confirming statistical significance. Sensitivity captures the proportion of true AML cases correctly identified by the model, making it a critical metric for a screening or diagnostic tool, where missing a cancer case has serious consequences.

Several models achieved 100% sensitivity, including those based on KNN, LPboost, Inception, and DenseNet architectures. At the other extreme, the VGG16 combined with Random Forest classifier from Abhishek et al. (2023) had a sensitivity of just 12%, and the fine-tuned VGG16 with Random Forest reached only 20%. The authors attributed this poor performance to a domain mismatch: these pre-trained CNN models were originally trained on the ImageNet dataset (real-life photographs), which differs substantially from microscopic blood smear images, creating the potential for negative transfer learning.

Heterogeneity in sensitivity was also extreme. The Q-value was 3,919.31 with 28 degrees of freedom, and the p-value was effectively 0, indicating highly significant variability. The I2 statistic was 99.3%, and H2 was 11.83. The tau-squared was 0.0633 (SE = 0.0012) with a tau of 0.2516. These values mean that much of the observed variation in sensitivity reflects real differences between AI models and study designs rather than random sampling error.

The gap between the sensitivity (85.81%) and accuracy (95.57%) in the random-effects models is notable. It suggests that while AI models are generally accurate at classifying blood smear images overall, their ability to catch every true AML case is somewhat lower. In a clinical diagnostic context, a sensitivity of 85.81% means that roughly 14 out of every 100 true AML cases could be missed, which underscores the need for AI to serve as a decision-support tool alongside expert hematologist review rather than as a standalone diagnostic.
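To make the accuracy/sensitivity distinction concrete, here is a minimal sketch with made-up counts chosen to mirror the pooled sensitivity (the specificity line also shows what the missing 2x2 contingency tables prevented the review from pooling):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard metrics from a 2x2 diagnostic contingency table."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)     # share of true AML cases caught
    specificity = tn / (tn + fp)     # share of non-AML correctly cleared
    return accuracy, sensitivity, specificity

# Hypothetical cohort: 100 true AML cases, of which 86 are caught (≈ pooled 85.81%)
acc, sens, spec = diagnostic_metrics(tp=86, fp=5, tn=895, fn=14)
print(acc, sens, round(spec, 4))  # → 0.981 0.86 0.9944
```

In this toy cohort the model looks excellent on overall accuracy (98.1%) while still missing 14 of 100 AML patients, which is precisely the gap the random-effects estimates expose.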

TL;DR: Random-effects pooled sensitivity was 85.81% (z = 18.33, p < 0.0001). Individual model sensitivity ranged from 12% (VGG16+RF) to 100% (KNN, LPboost, Inception, DenseNet). Heterogeneity was extreme (I2 = 99.3%, Q = 3,919.31).
Pages 11-13
Transfer Learning, IoMT, and Why Traditional ML Sometimes Wins

Transfer learning trade-offs: Four of the 10 studies used pre-trained models. While transfer learning can speed up training and improve results when the source and target domains are similar, the study by Abhishek et al. (2023) demonstrated the risks of domain mismatch. Their pre-trained CNN models, originally trained on ImageNet (natural photographs), performed poorly on microscopic blood smear images. By contrast, Shalini and Viji (2023) trained a SENet-CNN model from scratch on a hybrid dataset combining the ASH-bank and ALL-IDB, achieving 99.98% accuracy. This finding highlights that training from scratch on domain-specific data can outperform transfer learning when the pre-training domain is too dissimilar.

Traditional ML vs. deep learning: Interestingly, Baig et al. (2022) demonstrated that combining CNN feature extraction with traditional ML classifiers (Bagging ensemble, RUSBoost, SVM, fine KNN) yielded strong results. Their hybrid approach reached 97.04% accuracy using the Bagging ensemble classifier. The rationale is pragmatic: deep learning networks can take hours or days to train, while traditional ML classifiers run in minutes. With limited dataset sizes, training complex deep learning models end-to-end may not always be advantageous, and the nuanced morphological features in leukemia images can sometimes be better captured by combining CNN feature extraction with simpler classifiers.
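The hybrid pattern can be illustrated with a short sketch: a frozen feature extractor feeding a simple classifier. Everything here is hypothetical, with a toy statistical "extractor" standing in for CNN activations and plain KNN standing in for the fine-KNN/Bagging/RUSBoost classifiers the studies actually used:

```python
import math
from collections import Counter

def extract_features(image):
    """Stand-in for a frozen CNN feature extractor.

    In the reviewed studies this would be e.g. AlexNet or VGG16 activations;
    here we reduce a pixel grid to (mean, variance) purely for illustration.
    """
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    return (mean, var)

def knn_predict(train, query, k=3):
    """Classify a feature vector by majority vote of its k nearest neighbors."""
    dists = sorted((math.dist(feat, query), label) for feat, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy "images" (pixel grids) standing in for blood smear crops
aml = [[[0.9, 0.8], [0.85, 0.95]], [[0.8, 0.9], [0.9, 0.8]]]
normal = [[[0.1, 0.2], [0.15, 0.1]], [[0.2, 0.1], [0.1, 0.2]]]
train = [(extract_features(i), "AML") for i in aml] + \
        [(extract_features(i), "normal") for i in normal]
print(knn_predict(train, extract_features([[0.88, 0.9], [0.92, 0.86]])))  # → AML
```

The design point survives the simplification: once features are extracted, the classifier trains and predicts in milliseconds, which is the speed argument the studies make for pairing CNN features with traditional ML.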

Internet of Medical Things (IoMT): Three studies (Bibi et al., Karar et al., and Sakthiraj) integrated their AI models into IoMT frameworks, where smart medical devices with sensors communicate via Wi-Fi and cloud platforms. This architecture allows an IoT-enabled microscope to upload blood smear images to a cloud service, run the AI classification, and deliver results to the physician's computer. The potential for remote diagnosis is particularly valuable during pandemics, as it reduces hospital visits while maintaining diagnostic accuracy. Bibi et al. used ResNet-34 and DenseNet-121 within this framework, while Sakthiraj's HCNN-IAS model achieved approximately 99% accuracy within an IoMT pipeline.

Data augmentation impact: Roughly half the included studies used data augmentation to increase dataset size and diversity. Studies that employed augmentation generally performed better in training, though the authors note a caveat: augmentation can produce misleadingly high accuracy values compared to what would be seen on genuinely new, unaugmented images. The largest augmented dataset expanded from 55 original AML samples to 1,194 images (Bibi et al.), a roughly 22-fold increase.
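A minimal sketch of the kind of geometric augmentation involved (the exact transform sets used by the studies are not detailed here; flips and rotations are a common baseline):

```python
def augment(image):
    """Return the original image plus three geometric variants.

    Chaining more transforms (shifts, zooms, additional rotations) is how a
    set of 55 originals can grow roughly 22-fold, as in Bibi et al.
    """
    h_flip = [row[::-1] for row in image]         # mirror left-right
    v_flip = image[::-1]                          # mirror top-bottom
    rot90 = [list(r) for r in zip(*image[::-1])]  # rotate 90 degrees clockwise
    return [image, h_flip, v_flip, rot90]

variants = augment([[1, 2], [3, 4]])
print(len(variants), variants[3])  # → 4 [[3, 1], [4, 2]]
```

Because every variant is derived from the same source image, a model evaluated on augmented data can look better than it would on genuinely new smears, which is the inflation caveat the authors raise.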

TL;DR: Transfer learning from ImageNet hurt performance on blood smear images (domain mismatch). Hybrid CNN + traditional ML classifiers reached 97.04% accuracy. Three studies embedded AI in IoMT frameworks for remote diagnosis. Data augmentation helped training but may inflate reported accuracy.
Page 14
High Heterogeneity, Reporting Gaps, and the Path Forward

Heterogeneity: The most prominent limitation is the extreme heterogeneity (I2 = 100% for accuracy, 99.3% for sensitivity) across the 10 included studies. This is driven by wide differences in AI architectures, data augmentation strategies, classification tasks (binary vs. multiclass), transfer learning approaches, and feature extraction methods. Such heterogeneity makes it difficult to draw a single definitive conclusion about "AI accuracy for AML detection" since the answer depends heavily on which specific model, dataset, and pipeline is used.

Dataset limitations: Most included studies used the ASH-bank as their primary training dataset, which limits the generalizability of findings to different clinical settings and patient populations. Sample sizes varied enormously, from 19 patients to 10,500 blood samples, and not all studies provided the 2x2 contingency table data needed to reconstruct specificity and other diagnostic metrics. This prevented the authors from pooling specificity alongside accuracy and sensitivity.

Reporting inconsistencies: Different studies reported different metrics, with some using AUC and false positive rate while others used precision and F1 scores. This lack of standardization makes cross-study comparisons difficult. The authors call for adoption of emerging reporting standards like STARD-AI and TRIPOD-AI, which are specifically designed for diagnostic accuracy studies evaluating AI procedures. They also note that QUADAS-2, while useful, was not specifically designed for deep learning diagnostic studies, and the field needs a dedicated quality assessment tool for healthcare AI.

Publication bias: Funnel plot analysis revealed asymmetry, suggesting that smaller studies with positive outcomes are more likely to be published. Several studies fell outside the expected funnel shape, possibly due to small sample sizes or heterogeneous study designs. This means the pooled estimates may overstate the true performance of AI models for AML detection.

Future directions: The authors recommend that future research should unify reporting methods by consistently reporting accuracy, sensitivity, and specificity for each cancer type rather than overall averages. They also highlight the promise of integrating AI diagnostic tools into IoMT platforms for remote diagnosis, which would be especially beneficial during epidemics and for underserved regions. Finally, they note this is the first meta-analysis specifically focused on AI for AML detection from whole PBS images, as opposed to single-cell classification or general leukemia detection, and encourage future models to explore AML subtype classification.

TL;DR: Key limitations include extreme heterogeneity (I2 = 100%), reliance on ASH-bank datasets limiting generalizability, inconsistent metric reporting across studies, and evidence of publication bias. The authors call for standardized AI diagnostic reporting (STARD-AI, TRIPOD-AI) and further development of IoMT-integrated diagnostic tools.
Citation: Al-Obeidat F, Hafez W, Rashid A, et al. Open Access, 2024. Available at: PMC11782132. DOI: 10.3389/fdata.2024.1402926. License: CC BY.