ALL-Net: integrating CNN and explainable-AI for enhanced diagnosis and interpretation of acute lymphoblastic leukemia


Plain-English Explanations
Pages 1-3
Why Detecting Acute Lymphoblastic Leukemia From Blood Smears Is Hard

Acute lymphoblastic leukemia (ALL) is an aggressive cancer of the white blood cells that primarily affects children but can also occur in adults. In 2021, the SEER database projected roughly 61,090 new leukemia cases in the United States, representing 3.2% of all new cancer diagnoses. Diagnosing ALL typically involves a combination of peripheral blood smear (PBS) examination, complete blood counts, bone marrow biopsy, immunophenotyping, cytogenetic testing, and imaging studies. Many of these procedures are invasive, expensive, and time-consuming, and bone marrow biopsy in particular can be painful and poorly tolerated by children.

The hematogone problem: A major diagnostic challenge lies in distinguishing ALL cells from hematogones, which are benign precursor B cells that look remarkably similar to ALL blasts under the microscope. Both involve an abrupt rise in lymphocyte counts and share overlapping morphological features in PBS images. However, hematogones are harmless and typically resolve on their own in children, while ALL requires immediate treatment. This visual similarity makes manual microscopic analysis prone to subjective interpretation and interobserver variability.

The case for automation: Manual PBS examination requires skilled hematopathologists to inspect microscopic images, leading to significant delays and potential errors. Many diseases can cause elevated lymphocyte counts that should not be confused with leukemia. These challenges create a clear need for accurate, low-cost, time-efficient computer-aided diagnosis (CAD) systems. This study introduces ALL-Net, a custom convolutional neural network (CNN) paired with explainable AI (XAI) to classify PBS images into four categories: benign (hematogones), Early-B ALL, Pre-B ALL, and Pro-B ALL.

TL;DR: ALL is a fast-moving leukemia with ~61,090 new U.S. cases projected in 2021. Its PBS images closely resemble benign hematogones, making manual diagnosis slow, subjective, and error-prone. ALL-Net is a custom CNN with explainable AI designed to automate four-class classification of PBS images.
Pages 4-5
Prior Deep Learning Approaches to Leukemia Classification

The authors survey a growing body of research applying CNN-based architectures to leukemia detection from blood smear images, DNA sequence data, and patient symptom profiles. Publicly available datasets that have driven this research include the American Society of Hematology (ASH) Image Bank, the ALL-IDB (Acute Lymphoblastic Leukemia Image Database), and the ALL Challenge dataset from the IEEE International Symposium on Biomedical Imaging (ISBI) 2019.

Binary classification benchmarks: Arivuselvam and Sudha (2022) used a Deep CNN classifier alongside traditional machine learning models (SVM, decision tree, naive Bayes, random forest) and achieved 99.2% accuracy on the ASH Image Bank and 98.4% on ALL-IDB, but only for binary ALL vs. non-ALL classification. Al-Bashir et al. (2024) compared AlexNet, DenseNet, ResNet, and VGG16 on the same datasets, reporting 94% accuracy.

Multiclass approaches: Rahman et al. (2023) tackled multiclass blood cancer classification with VGG19, ResNet50, InceptionV3, and Xception, reaching 99.84% on ALL-IDB1 and ALL-IDB2. Shafique and Tehsin (2018) used an ensemble of ResNet50, VGG16, and InceptionV3, achieving 99.8%.

Custom architectures and XAI: Sampathila et al. (2022) proposed a customized deep learning classifier called ALLNET (different from this paper's ALL-Net) that achieved 95.54% on the ISBI 2019 dataset. In the XAI domain, Islam et al. (2024) developed a symptom-based decision tree model with 97.45% accuracy and an AUC of 0.783, but it used tabular data rather than images. Van der Velden et al. (2022) reviewed XAI techniques across medical imaging tasks broadly but did not focus on leukemia-specific image classification with explainability.

The authors identify two key gaps in the literature: most prior work used relatively small datasets (a few hundred to a few thousand images), and many studies relied on pre-trained models without customization. Few combined CNN-based image classification with XAI techniques to provide interpretable outputs for clinicians. ALL-Net aims to fill both gaps by using a larger, diverse dataset of 3,256 images and integrating LIME-based explainability.

TL;DR: Prior work achieved up to 99.84% accuracy for binary or multiclass leukemia classification using pre-trained CNNs, but few studies combined custom architectures with explainable AI. ALL-Net addresses this gap with a purpose-built CNN and LIME integration on a 3,256-image dataset.
Pages 6-9
Dataset, Preprocessing, and Data Augmentation Strategy

The study uses a dataset of 3,256 PBS images from 89 suspected ALL patients, sourced from Kaggle and originally produced at Taleqani Hospital's bone marrow laboratory in Tehran, Iran. The images span four classes: benign/hematogones (504 images), Early-B ALL (985 images), Pre-B ALL (963 images), and Pro-B ALL (804 images). This distribution creates a notable class imbalance, with the benign class containing roughly half the images of the largest malignant class.

Image preprocessing: All images were resized to 224 x 224 pixels with three RGB color channels, yielding an input shape of 224 x 224 x 3. Pixel values were normalized to the 0-1 range by dividing by 255, which improves convergence speed and training stability. An 80:20 train-test split was applied, giving the model sufficient data for learning while reserving a meaningful portion for evaluation.
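The normalize-and-split step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' actual pipeline: the helper names are ours, and the random arrays stand in for real PBS images loaded from disk. Only the shapes (224 x 224 x 3), the /255 scaling, and the 80:20 ratio come from the paper.

```python
import numpy as np

def preprocess(images: np.ndarray) -> np.ndarray:
    """Scale uint8 pixel values into the 0-1 range (divide by 255)."""
    return images.astype(np.float32) / 255.0

def split_80_20(x, y, test_frac=0.2, seed=0):
    """Shuffle and split arrays into train/test at the paper's 80:20 ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return x[train], y[train], x[test], y[test]

# Stand-ins for PBS images already resized to 224 x 224 x 3
x = np.random.randint(0, 256, size=(100, 224, 224, 3), dtype=np.uint8)
y = np.random.randint(0, 4, size=100)  # four diagnostic classes

x_norm = preprocess(x)
x_tr, y_tr, x_te, y_te = split_80_20(x_norm, y)
print(x_tr.shape, x_te.shape)  # (80, 224, 224, 3) (20, 224, 224, 3)
```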

Augmentation of the benign class: To address the class imbalance, data augmentation was applied exclusively to the underrepresented benign class. The augmentation parameters included rotation (10 degrees), horizontal flipping, height and width shift ranges of 0.1, shear range of 0.2, zoom range of 0.2, and nearest-fill mode. Vertical flipping was not used. After augmentation, all four classes had image counts in a similar range, eliminating the imbalance that had caused misclassification of benign images in preliminary experiments.

The authors emphasize that augmentation targeted only the benign class rather than the entire dataset. This selective approach was chosen because benign hematogones were the most frequently misclassified category, and boosting their representation helped the model learn the subtle morphological differences between hematogones and ALL subtypes.
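The selective-augmentation idea can be illustrated with a toy numpy sketch. The paper's actual pipeline used Keras-style rotation, shift, shear, and zoom transforms; here only a horizontal flip and a 10% height/width shift are implemented, and `augment_once`/`balance_class` are illustrative names, not functions from the paper. The point is the loop structure: new samples are generated only for the benign class until it matches the largest malignant class.

```python
import numpy as np

def augment_once(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random horizontal flip plus a small spatial shift
    (a crude stand-in for the paper's rotation/shear/zoom pipeline)."""
    out = img.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)  # horizontal flip (enabled in the paper)
    # shift by up to 10% of each spatial dimension (paper: shift range 0.1)
    dy = int(rng.integers(-img.shape[0] // 10, img.shape[0] // 10 + 1))
    dx = int(rng.integers(-img.shape[1] // 10, img.shape[1] // 10 + 1))
    return np.roll(out, (dy, dx), axis=(0, 1))

def balance_class(minority: list, target: int, seed=0) -> list:
    """Augment only the minority (benign) class until it reaches `target`."""
    rng = np.random.default_rng(seed)
    augmented = list(minority)
    while len(augmented) < target:
        src = minority[int(rng.integers(len(minority)))]
        augmented.append(augment_once(src, rng))
    return augmented

benign = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(504)]
balanced = balance_class(benign, target=985)  # match the largest class
print(len(balanced))  # 985
```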

TL;DR: 3,256 PBS images from 89 patients across four classes (benign: 504, Early-B: 985, Pre-B: 963, Pro-B: 804). Images resized to 224x224x3 and normalized. Data augmentation (rotation, flipping, shifting, shearing, zoom) was applied only to the benign class to fix class imbalance.
Pages 10-12
The ALL-Net CNN Architecture: Three Convolutional Blocks Plus Dense Layers

ALL-Net is a custom CNN designed specifically for four-class ALL classification, intentionally kept simpler than large pre-trained models like DenseNet201. The architecture begins with an input layer accepting 224 x 224 x 3 images, followed by three sequential convolutional blocks. Block 1 contains a convolutional layer with 64 filters (3x3 kernel) and ReLU activation, followed by 2x2 max pooling. Block 2 increases to 128 filters (3x3 kernel) with ReLU and 2x2 max pooling. Block 3 uses 256 filters (3x3 kernel) with ReLU and 2x2 max pooling. Each block progressively extracts more abstract features from the input images.

Global pooling and dense layers: After the three convolutional blocks, a global average pooling layer aggregates spatial information across feature maps, followed by a flatten layer converting the 2D feature maps into a 1D vector. The flattened vector passes through three fully connected (dense) layers with 256, 512, and 1,024 neurons respectively, each using ReLU activation. A dropout layer with a rate of 0.2 (dropping 20% of neurons during training) is applied after the dense layers to prevent overfitting.

Output and optimization: The final layer uses softmax activation across four output neurons, one per class (benign, Early-B, Pre-B, Pro-B), producing a probability distribution. The model is optimized with the Adam optimizer and trained using sparse categorical cross-entropy loss. Training ran for 50 epochs with a batch size of 32, on both augmented and unaugmented versions of the dataset. The entire project was implemented in Python using Jupyter, running on a Dell XPS 13 with an Intel i5 processor and 8GB RAM.

The deliberate simplicity of ALL-Net is a design choice. With only three convolutional blocks and three dense layers, it has far fewer parameters than architectures like DenseNet201 (which has over 200 layers). This means faster inference times, which matters in clinical settings where rapid turnaround supports timely treatment decisions.
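The "lightweight" claim can be made concrete by counting parameters for the layers as described in the text. This sketch assumes standard 3x3 convolutions with one bias per filter and takes the layer widths from the paper; the published ALL-Net code may differ in padding or ordering details, so treat the total as an estimate.

```python
def conv_params(k, c_in, c_out):
    """Weights (k*k*c_in*c_out) plus one bias per output filter."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out

layers = [
    ("conv 64",    conv_params(3, 3, 64)),
    ("conv 128",   conv_params(3, 64, 128)),
    ("conv 256",   conv_params(3, 128, 256)),
    # global average pooling reduces the 256 feature maps to a 256-vector
    ("dense 256",  dense_params(256, 256)),
    ("dense 512",  dense_params(256, 512)),
    ("dense 1024", dense_params(512, 1024)),
    ("softmax 4",  dense_params(1024, 4)),
]
total = sum(p for _, p in layers)
for name, p in layers:
    print(f"{name:>10}: {p:,}")
print(f"Total: {total:,}")  # about 1.1M parameters, vs roughly 20M for DenseNet201
```

At roughly 1.1 million parameters under these assumptions, ALL-Net is more than an order of magnitude smaller than DenseNet201, which is what buys the faster inference the authors emphasize.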

TL;DR: ALL-Net uses three convolutional blocks (64, 128, 256 filters), global average pooling, three dense layers (256, 512, 1,024 neurons), 0.2 dropout, and softmax output for four classes. Trained with Adam optimizer, sparse categorical cross-entropy, batch size 32, for 50 epochs. Deliberately simpler than large pre-trained models for faster inference.
Pages 13-17
Classification Results: From 97.85% to 99.32% After Augmentation

Unaugmented dataset: On the original dataset without augmentation, ALL-Net achieved a training accuracy of 98.42% and a testing accuracy of 97.85%. The per-class testing metrics revealed a weakness: the benign class had lower precision (93%) and recall (97%) compared to the malignant subtypes, which scored 98-100% across all metrics. The confusion matrix confirmed that a significant number of benign images were misclassified, directly attributable to the class imbalance (only 504 benign images vs. 804-985 for the malignant classes).

Augmented dataset: After applying data augmentation to the benign class, ALL-Net's testing accuracy rose to 99.32%, with training accuracy reaching 99.59%. The improvement was most dramatic for the benign class, which achieved 100% precision, 100% recall, and 100% F1 score on the test set, compared to 93%, 97%, and 95% before augmentation. Across all classes, the augmented model reached an average precision of 99.25%, recall of 99.50%, and F1 score of 99.50%. The confusion matrix showed zero misclassifications for the benign class after augmentation.

Comparison with pre-trained models: ALL-Net outperformed eight well-known pre-trained CNNs fine-tuned on the same augmented dataset. EfficientNet scored only 28.22%, MobileNetV3 reached 50.15%, VGG-19 hit 96.32%, Xception achieved 96.70%, InceptionV3 scored 96.93%, ResNet50V2 reached 97.85%, VGG-16 achieved 98.01%, and NASNetLarge scored 98.16%. Only DenseNet201 exceeded ALL-Net at 99.85%, a margin of just 0.53 percentage points.

Despite DenseNet201's slightly higher accuracy, the authors argue that ALL-Net's simpler architecture provides a meaningful advantage. DenseNet201 has over 200 layers and many more parameters, making it significantly slower at inference. For a clinical tool where speed and interpretability matter, ALL-Net's 99.32% accuracy with a lightweight architecture represents a better practical tradeoff. The training and loss curves for both augmented and unaugmented datasets showed consistent upward accuracy trends and downward loss trends, confirming the model was fitting correctly without significant overfitting.

TL;DR: Without augmentation: 97.85% test accuracy, benign class precision only 93%. With augmentation: 99.32% test accuracy, benign class hits 100% precision/recall. ALL-Net outperformed 8 pre-trained models (VGG-16 at 98.01%, NASNetLarge at 98.16%) and trailed DenseNet201 (99.85%) by only 0.53 percentage points while being much simpler.
Pages 18-21
LIME: Making the Model's Decisions Transparent to Clinicians

A key contribution of this study is the integration of Local Interpretable Model-Agnostic Explanations (LIME) to interpret ALL-Net's predictions. LIME is a model-agnostic XAI technique that explains individual predictions by approximating the model's behavior locally around each input. For image classification, LIME first segments the image into superpixels (contiguous regions with similar characteristics), then generates perturbed versions by randomly toggling superpixels on and off. Each perturbation is fed through the model, and the resulting predictions are used to train a simple linear surrogate model that approximates the CNN's local decision boundary.

How LIME was applied: For each PBS image, LIME generated 1,000 perturbed versions (the default value) and observed how the perturbations affected ALL-Net's classification probabilities. The surrogate model's coefficients reveal which superpixels had the greatest influence on the prediction. The authors visualized this as a mask overlaid on the original image, highlighting the top five most influential regions. A separate heatmap view color-codes regions by their contribution: red for positive contributions to the predicted class and blue for negative contributions.
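The perturb-predict-fit loop described above can be sketched from scratch to show the mechanics. This is not the paper's code: a fixed grid of cells stands in for proper superpixel segmentation, a toy brightness-scoring function stands in for ALL-Net, and a ridge-regularized weighted linear fit plays the surrogate role. The 1,000-sample count mirrors the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(img):
    """Stand-in classifier: scores an image by the brightness of its centre."""
    return img[8:16, 8:16].mean()

def lime_sketch(img, n_cells=4, n_samples=1000, kernel_width=0.25):
    """Rank grid 'superpixels' by their influence on the toy model's score."""
    h, w = img.shape
    ch, cw = h // n_cells, w // n_cells
    masks = rng.integers(0, 2, size=(n_samples, n_cells * n_cells))
    preds, weights = [], []
    for m in masks:
        pert = img.copy()
        for i, on in enumerate(m):
            if not on:  # toggle this superpixel off (zero it out)
                r, c = divmod(i, n_cells)
                pert[r*ch:(r+1)*ch, c*cw:(c+1)*cw] = 0
        preds.append(toy_model(pert))
        # weight each perturbation by its similarity to the original image
        dist = 1 - m.mean()
        weights.append(np.exp(-(dist ** 2) / kernel_width ** 2))
    X, y, w_ = masks.astype(float), np.array(preds), np.array(weights)
    # weighted ridge regression: the interpretable linear surrogate
    A = X.T @ (X * w_[:, None]) + 1e-3 * np.eye(X.shape[1])
    coef = np.linalg.solve(A, X.T @ (y * w_))
    return coef  # one influence score per superpixel

img = np.zeros((32, 32)); img[8:16, 8:16] = 1.0  # one bright centre cell
coef = lime_sketch(img)
print(np.argmax(coef))  # the bright centre superpixel dominates
```

The surrogate coefficients are exactly what the paper visualizes: the top-scoring regions become the mask overlay, and positive vs. negative coefficients become the red/blue heatmap.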

Class-specific insights: The LIME visualizations for each of the four classes (benign, Early-B, Pre-B, Pro-B) showed distinct patterns. For the malignant subtypes, the model consistently focused on cell morphology features such as chromatin patterns, nucleoli visibility, and cytoplasmic characteristics. For benign class images, an interesting finding emerged: while ALL-Net correctly classified them, LIME could not generate meaningful masks or heatmaps because there were no regions in the image that indicated malignancy. This negative finding actually supports clinical intuition, as benign hematogones lack the morphological markers that define ALL subtypes.

The mathematical formulation behind LIME involves minimizing a weighted loss function that balances fidelity to the original model's predictions with simplicity of the surrogate model. The weight assigned to each perturbation is based on its similarity to the original image, ensuring that the explanation focuses on the local neighborhood of the prediction. This regularized optimization produces interpretable coefficients that map directly to image regions, giving clinicians a visual explanation of why the model made a particular classification.
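In the standard notation of Ribeiro et al.'s original LIME formulation (the paper's exact symbols may differ), the explanation is the solution to:

```latex
\xi(x) \;=\; \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}\bigl(f, g, \pi_x\bigr) \;+\; \Omega(g)
```

Here f is the CNN being explained (ALL-Net), g is a candidate surrogate from the interpretable family G (linear models over superpixel indicators), the kernel pi_x weights each perturbation by its similarity to the original image x, L is the locality-weighted squared loss between the predictions of f and g, and Omega(g) penalizes surrogate complexity, such as the number of superpixels used in the explanation.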

TL;DR: LIME generated 1,000 perturbations per image, identified the top 5 most influential superpixel regions, and produced heatmaps showing positive (red) and negative (blue) contributions. Malignant subtypes showed clear morphological focus areas, while benign images produced no malignancy-indicating regions, aligning with clinical expectations.
Page 22
Single-Dataset Evaluation and Missing External Validation

Single data source: The entire study relies on one dataset of 3,256 PBS images from a single hospital (Taleqani Hospital, Tehran, Iran). While the dataset covers 89 patients and four diagnostic classes, there is no external validation on images from other hospitals, imaging equipment, or patient populations. Staining protocols, camera settings, and slide preparation techniques vary across institutions, and the model's 99.32% accuracy may not generalize to images acquired under different conditions.

Limited augmentation strategy: Data augmentation was applied only to the benign class, and only with simple geometric transformations (rotation, flipping, shifting, shearing, zoom). More advanced augmentation strategies such as generative adversarial networks (GANs), CutMix, or style transfer were not explored. The authors acknowledge that further research could optimize hyperparameters and explore advanced augmentation approaches, but the current study does not benchmark these alternatives.

No clinical deployment or prospective testing: The model was evaluated entirely in a retrospective, offline setting using a predefined train-test split. There was no prospective clinical trial, no comparison with pathologist performance on the same images, and no assessment of how LIME explanations affected clinical decision-making in practice. The study also ran on modest hardware (Dell XPS 13, Intel i5, 8GB RAM), which limits scalability discussions for real-time clinical deployment.

Narrow XAI scope: Only LIME was used for explainability. Other XAI techniques such as SHAP, Grad-CAM, or attention-based visualization were not compared. The benign class's inability to produce LIME explanations, while interpretively interesting, also raises questions about whether LIME is the most suitable XAI approach for all diagnostic classes in this context.

TL;DR: Key limitations include single-hospital dataset (3,256 images from Taleqani Hospital only), no external validation, no prospective clinical testing, no pathologist-vs-model comparison, limited augmentation techniques, and only LIME tested for explainability with no Grad-CAM or SHAP comparison.
Pages 22-23
Scaling Up: More Data, Better Security, and Advanced Architectures

The authors outline several directions for future research. First, they highlight the need for larger and more diverse datasets. Medical image data is inherently scarce due to privacy concerns, and patients must trust that their information will not be misused before sharing it. The authors suggest that incorporating a security feature into the model, potentially through federated learning or differential privacy, could encourage broader data sharing while protecting patient information.

Architecture and hyperparameter optimization: Future work could explore fine-tuning the ALL-Net architecture itself, such as experimenting with different numbers of convolutional blocks, filter sizes, or dense layer configurations. Hyperparameter optimization techniques like grid search, random search, or Bayesian optimization could be applied systematically to find the best combination of learning rate, batch size, dropout rate, and augmentation parameters. The 0.53-percentage-point gap between ALL-Net (99.32%) and DenseNet201 (99.85%) suggests room for architectural improvements that might close this gap without adding excessive complexity.
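The systematic search mentioned above can be as simple as random sampling over the relevant grids. The grids and the `evaluate` stub below are purely illustrative: a real run would train ALL-Net with each configuration and return its test accuracy, which is far too expensive to inline here.

```python
import random

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "dropout_rate": [0.1, 0.2, 0.3, 0.5],
}

def evaluate(config):
    """Placeholder objective: a real run would train ALL-Net with this
    config and return its test accuracy."""
    return 0.99 - abs(config["dropout_rate"] - 0.2)

def random_search(space, n_trials=20, seed=0):
    """Draw random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search(search_space)
print(best_cfg, round(best_score, 3))
```

Grid search exhausts every combination instead of sampling, and Bayesian optimization replaces the random draw with a model of the score surface; the surrounding loop stays the same.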

Multi-center validation and clinical integration: The most impactful next step would be validating ALL-Net on images from multiple hospitals across different countries, staining protocols, and imaging equipment. A prospective clinical study comparing ALL-Net's diagnostic accuracy with that of experienced hematopathologists would establish whether the model adds genuine clinical value. Integrating LIME visualizations into a clinical workflow, where pathologists can see which image regions drove the model's decision, could build trust and facilitate adoption.

The open-source availability of ALL-Net's code on GitHub and Zenodo (DOI: 10.5281/zenodo.14349780) provides a foundation for the research community to build upon. Researchers can reproduce the results, test the model on new datasets, and extend the architecture with additional XAI methods like Grad-CAM or SHAP for richer interpretability.

TL;DR: Future work should focus on multi-center validation, larger datasets with privacy-preserving frameworks, hyperparameter optimization to close the 0.53% gap with DenseNet201, prospective clinical trials with pathologist comparisons, and adding Grad-CAM/SHAP alongside LIME. Source code is publicly available on GitHub and Zenodo.
Citation: Thiriveedhi A, Ghanta S, Biswas S, Pradhan AK. ALL-Net: integrating CNN and explainable-AI for enhanced diagnosis and interpretation of acute lymphoblastic leukemia. PeerJ Computer Science, 2025. Open Access (CC BY). PMC11888852. DOI: 10.7717/peerj-cs.2600.