Automatic Detection of Acute Leukemia (ALL and AML) Utilizing Customized Deep Graph Convolutional Neural Networks


Plain-English Explanations
Pages 1-3
Why Automated Leukemia Classification Matters, and What This Study Proposes

Leukemia is a hematologic malignancy originating in the bone marrow, characterized by the uncontrolled proliferation of abnormal white blood cells. Based on 2018 data, the United States alone reported over 60,000 new leukemia cases, accounting for roughly 3.5% of all cancer diagnoses. The disease is most frequently detected in patients under 15 and over 55 years of age. Acute leukemia is subdivided into acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML), both of which require rapid diagnosis to prevent clinical deterioration.

Traditional diagnosis relies on a pathologist examining blood smears under a microscope, assessing the morphology of lymphocyte and monocyte cells. This process is time-consuming, subjective, and heavily dependent on the pathologist's expertise and fatigue level. Visual distinction between ALL and AML is particularly challenging because healthy and diseased blood samples can appear morphologically similar, resulting in low diagnostic accuracy when performed manually.

Prior automated approaches have used deep learning architectures including CNNs, pre-trained networks (VGG16, ResNet50, Inception V3), and thresholding-based methods, achieving accuracies ranging from 84% to 98.51%. However, these methods suffered from limitations such as high computational complexity, lack of end-to-end functionality, reliance on manual feature extraction, and small or non-standardized databases. Zhou et al. achieved only 85% accuracy with fully connected neural networks, while Abhishek et al. reached just 84% with a modified VGG16.

This study introduces a novel end-to-end approach that fuses graph theory with deep convolutional neural networks (CNNs) for binary classification of ALL and AML. The key contributions include a new standardized database of blood sample images from 44 patients, a customized deep graph convolutional neural network (GCN) architecture with six graph convolutional layers, and a reported classification accuracy of 99.4%.

TL;DR: Leukemia affects over 60,000 new patients annually in the US. Manual microscopy-based diagnosis is slow and error-prone. Prior deep learning methods achieved 84-98.5% accuracy but had computational and dataset limitations. This study proposes a graph CNN approach that achieves 99.4% accuracy on a new 44-patient dataset for ALL vs. AML classification.
Pages 3-5
GANs and Graph Convolutional Networks: The Two Core Technologies

Generative Adversarial Networks (GANs): First introduced by Goodfellow et al. in 2014, GANs consist of two competing neural networks: a generator and a discriminator. The generator takes a random noise vector as input and produces synthetic images, while the discriminator learns to distinguish between real and generated data. Through this adversarial training process, the generator progressively improves its ability to create realistic samples. In this study, GANs were used specifically for data augmentation, addressing the class imbalance between ALL and AML images in the original dataset.

Graph Convolutional Networks (GCNs): Developed in the form used here by Michael Defferrard and colleagues, GCNs extend the concept of convolution from regular grid-structured data (like images in standard CNNs) to graph-structured data. The mathematical foundation is spectral graph theory: the Laplacian matrix (L = D - W, where D is the degree matrix and W is the adjacency matrix) is decomposed into its eigenvectors and eigenvalues (for the symmetric Laplacian this decomposition coincides with the SVD), and the eigenvectors serve as graph basis functions. These basis functions define a Fourier transform on the graph domain, which in turn allows convolution operations to be applied to graph-structured inputs.

Filtering a signal z on a graph is computed via the spectral decomposition of the Laplacian: y = g(L)z = U g(Λ) Uᵀ z, where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. This spectral approach lets the network learn spatial relationships between graph nodes (image regions) in a mathematically principled way, capturing structural patterns that standard CNNs operating on pixel grids may miss.
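The filtering equation can be sketched in a few lines of NumPy. The 4-node graph and the filter function g below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy 4-node graph: W is the adjacency matrix, D the degree matrix,
# and L = D - W the graph Laplacian.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W

# L is symmetric, so L = U diag(lam) U^T (eigendecomposition; equal to the
# SVD up to signs).
lam, U = np.linalg.eigh(L)

def spectral_filter(z, g):
    """y = U g(Lambda) U^T z: transform z to the graph Fourier domain,
    scale each frequency component by g, then transform back."""
    return U @ np.diag(g(lam)) @ U.T @ z

z = np.array([1.0, 2.0, 0.5, -1.0])
y = spectral_filter(z, lambda lm: np.exp(-0.5 * lm))  # example low-pass filter
```

With g equal to 1 at every frequency the filter is the identity, which is a quick sanity check on the decomposition.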

The use of Chebyshev polynomials to approximate the graph filter function is a critical efficiency optimization. Rather than computing the full eigendecomposition of the Laplacian (which scales poorly with graph size), the Chebyshev approximation performs localized filtering with controllable computational cost. The polynomial orders (P1 through P6, one per layer) were set to 1 in the final model after trial-and-error optimization.
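The Chebyshev trick can be sketched as follows: the filter is expanded in Chebyshev polynomials of a rescaled Laplacian and evaluated with the standard three-term recurrence, so only matrix-vector products are needed. The toy Laplacian and coefficients theta are illustrative assumptions:

```python
import numpy as np

def cheb_filter(L, z, theta):
    """Approximate y = g(L) z by sum_k theta_k T_k(L_t) z, where T_k are
    Chebyshev polynomials of the rescaled Laplacian -- no eigendecomposition
    of L is required."""
    lmax = np.linalg.eigvalsh(L).max()
    L_t = 2.0 * L / lmax - np.eye(L.shape[0])   # rescale spectrum into [-1, 1]
    Tz = [z, L_t @ z]                           # T_0 z and T_1 z
    for _ in range(2, len(theta)):
        Tz.append(2.0 * L_t @ Tz[-1] - Tz[-2])  # T_k = 2 L_t T_{k-1} - T_{k-2}
    return sum(t * v for t, v in zip(theta, Tz))

# Laplacian of a 3-node path graph; coefficients are illustrative.
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
z = np.array([1.0, 0.0, -1.0])
y = cheb_filter(L, z, theta=[0.5, 0.3, 0.1])
```

With a single coefficient theta = [1.0] the filter reduces to the identity, matching the order-1 setting the authors converged on.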

TL;DR: The study combines two deep learning techniques: GANs for data augmentation and graph convolutional networks (GCNs) for classification. GCNs use spectral graph theory and Chebyshev polynomial approximation to perform convolutions on graph-structured representations of blood cell images, capturing spatial relationships between image regions.
Pages 5-7
Building the Dataset: 44 Patients, GAN Augmentation, and Graph Construction

Data collection: The dataset was acquired from Ghazi Tabriz Medical Sciences Center under ethical code IR.1401.1.15. It comprised 44 patients (12 males, 32 females) aged 12 to 70 years. Each patient was diagnosed with a specific form of acute leukemia, and an oncologist verified all diagnoses. Five to seven blood smear images were collected per individual, yielding 190 total images of ALL and AML. The diagnostic pipeline involved three sequential steps: clinical evaluation with blood tests, quantification of blast cells in peripheral blood and bone marrow smears, and expert labeling of the leukemia subtype based on morphological analysis of lymphocyte and monocyte cells.

Pre-processing pipeline: All collected images were first resized to 226 x 226 pixels and converted to grayscale to reduce computational volume. Because there was a class imbalance between ALL and AML image counts, the authors employed a GAN network for data augmentation and class balancing. The generator network accepted a 1 x 100 input vector (uniform distribution) and produced 226 x 226 output images through six convolutional layers with dimensions of 512, 1,024, 2,048, 4,096, 8,192, and 51,076. The GAN used ReLU and hyperbolic tangent activation functions, a learning rate of 0.001, and 100 training iterations. This augmentation expanded the dataset from 190 to 500 images with equal class representation. A final Min-Max normalization step scaled all pixel values to the 0-1 range.
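The generator's shape can be sketched in NumPy. To keep the example light, the hidden widths below are scaled down from the reported 512-8,192; only the 100-dimensional uniform input and the 51,076 = 226 x 226 output match the description, and the random weights stand in for a trained GAN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-in for the generator: 100-dim uniform noise in,
# 51,076 = 226 * 226 values out (the paper's hidden widths run 512..8,192).
widths = [100, 256, 226 * 226]
Ws = [rng.normal(0.0, 0.02, (a, b)) for a, b in zip(widths[:-1], widths[1:])]

def generate(z):
    h = z
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)                    # ReLU in hidden layers
    return np.tanh(h @ Ws[-1]).reshape(226, 226)      # tanh output as an image
```

The tanh output lies in [-1, 1]; a final Min-Max rescaling would bring generated images into the same 0-1 range as the real data.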

Graph construction: To convert images into graph-structured data, the authors applied a superpixel clustering process. Each image was segmented into approximately 100 superpixel regions (the count selected through trial and error). The mean pixel intensity within each region became the feature vector for that graph node. Edges were assigned based on spatial adjacency between regions, creating a graph adjacency matrix in which neighboring regions are connected and non-neighboring regions are not. This graph representation preserved spatial relationships while reducing input dimensionality compared to raw pixel data.
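A rough sketch of the image-to-graph step, with a uniform 10 x 10 block grid standing in for SLIC-style superpixels (an assumption made to keep the example self-contained): each block becomes a node whose feature is the mean intensity, and edges link spatially adjacent blocks, mirroring the adjacency rule described above.

```python
import numpy as np

def image_to_graph(img, grid=10):
    """Uniform-grid stand-in for superpixel clustering: node features are
    block-mean intensities; A connects horizontally/vertically adjacent
    blocks (edge rows/columns beyond grid*block are ignored)."""
    h, w = img.shape
    bh, bw = h // grid, w // grid
    feats = np.array([[img[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
                       for j in range(grid)] for i in range(grid)]).ravel()
    n = grid * grid
    A = np.zeros((n, n))
    for i in range(grid):
        for j in range(grid):
            u = i * grid + j
            if i + 1 < grid:
                A[u, u + grid] = A[u + grid, u] = 1   # vertical neighbor
            if j + 1 < grid:
                A[u, u + 1] = A[u + 1, u] = 1         # horizontal neighbor
    return feats, A

img = np.random.default_rng(1).random((226, 226))
x, A = image_to_graph(img, grid=10)   # 100 nodes, matching the tuned region count
```

Real superpixels follow intensity boundaries rather than a fixed grid, but the resulting node-feature vector and adjacency matrix have the same form.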

TL;DR: 44 patients (12 male, 32 female, ages 12-70) provided 190 blood smear images. GAN augmentation expanded this to 500 balanced images. Images were resized to 226 x 226 grayscale, then converted into graphs via superpixel clustering with approximately 100 regions per image, where nodes carry region-averaged pixel intensities and edges encode spatial adjacency.
Pages 7-9
Six Graph Convolutional Layers Plus Softmax: The Proposed Architecture

The proposed deep model architecture consisted of six graph convolutional layers, each followed by batch normalization and a dropout layer to prevent overfitting. After the sixth graph convolutional layer, the output was flattened and passed through a fully connected layer with a Softmax activation function to produce the final ALL/AML classification. Each graph convolutional layer used the ReLU activation function.

Layer details: The first five graph convolutional layers each had weight tensors of shape (P, 32, 32), where P is the Chebyshev polynomial order (set to 1 for all layers). Each of these layers produced 32 features per graph node and added 32 bias terms, giving 32 x 32 x P + 32 = 1,024P + 32 parameters per layer. The sixth graph convolutional layer had a weight tensor of shape (P6, 32, 2), reducing the feature dimension from 32 to 2 (one per output class). The Softmax layer then converted these two-dimensional scores into probability distributions over the ALL and AML classes.
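The layer stack can be sketched as below. Everything is illustrative: random weights, a zero placeholder for the scaled Laplacian, and a mean-pooling readout standing in for the paper's flatten + fully connected head.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(X, L_t, W, b):
    """One graph conv layer: sum_k T_k(L_t) X W[k] + b, then ReLU.
    W has shape (P, in_feats, out_feats), matching the paper's description."""
    Tx = [X, L_t @ X]
    out = np.zeros((X.shape[0], W.shape[2]))
    for k in range(W.shape[0]):
        if k >= 2:
            Tx.append(2.0 * L_t @ Tx[-1] - Tx[-2])
        out += Tx[k] @ W[k]
    return np.maximum(out + b, 0.0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

n, P = 100, 1                              # 100 graph nodes, polynomial order 1
L_t = np.zeros((n, n))                     # placeholder; use the scaled Laplacian
X = rng.random((n, 1))                     # one intensity feature per node
Ws = ([rng.normal(0, 0.1, (P, 1, 32))] +   # input layer: 1 -> 32 features
      [rng.normal(0, 0.1, (P, 32, 32)) for _ in range(4)] +
      [rng.normal(0, 0.1, (P, 32, 2))])    # sixth layer: 32 -> 2 classes
for W in Ws:
    X = gcn_layer(X, L_t, W, np.zeros(W.shape[2]))
probs = softmax(X.mean(axis=0))            # mean readout in place of flatten + FC
```

Batch normalization and dropout, which sit between the layers in the actual model, are omitted here for brevity.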

Hyperparameter optimization: All hyperparameters were selected through systematic trial and error. The GAN used a batch size of 12 and the Adamax optimizer. The GCN used a batch size of 32, the SGD optimizer, a learning rate of 0.0001, a dropout rate of 0.2, a weight decay of 4 x 10^-4, and cross-entropy as the loss function. The authors tested multiple values for each parameter, including batch sizes of 4-12 for the GAN, 8-32 for the GCN, learning rates from 0.1 to 0.00001, and optimizers including Adam, SGD, Adadelta, and Adamax.

Data split and validation: The dataset was divided into 70% training, 20% validation, and 10% test sets. Additionally, 5-fold cross-validation was used to ensure all data participated in both training and testing, providing a more robust estimate of model performance. The model was implemented in Python on Google Colab Premium with a T60 GPU and 64 GB of RAM, while data preparation was conducted in MATLAB 2019a.
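The split scheme can be sketched as follows, under the assumptions of random shuffling and interleaved fold assignment (the paper does not specify either):

```python
import numpy as np

def make_splits(n, seed=0):
    """70/20/10 train/val/test split plus five disjoint folds for 5-fold CV."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(0.7 * n), int(0.2 * n)
    train, val, test = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    folds = [idx[k::5] for k in range(5)]   # each sample lands in exactly one fold
    return train, val, test, folds

train, val, test, folds = make_splits(500)  # 500 images after GAN augmentation
```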

TL;DR: The architecture uses six graph convolutional layers (32 features each, Chebyshev polynomial order 1) with batch normalization, dropout (0.2), and a Softmax output. Trained with SGD optimizer, learning rate 0.0001, cross-entropy loss. Data split: 70% train, 20% validation, 10% test, plus 5-fold cross-validation.
Pages 10-12
How Layer Count, Polynomial Order, and Region Size Were Tuned

The authors systematically evaluated the impact of three key architectural decisions on model performance: the number of graph convolutional layers, the Chebyshev polynomial coefficient values, and the superpixel region count used during graph construction. Each variable was tested independently while holding others constant, balancing accuracy against computational speed.

Number of layers: The authors tested architectures with 2 through 7 graph convolutional layers. Six layers provided the best trade-off between classification accuracy and training speed. Adding a seventh layer did not improve accuracy and increased computation time, while using fewer than six layers resulted in lower classification performance. This suggests that six layers provide sufficient depth to capture the hierarchical structural features needed to distinguish ALL from AML blood cell morphology.

Chebyshev polynomial orders: The values P1 through P6 set the order of the Chebyshev approximation in each layer's graph convolution filter. Setting all of them to 1 yielded the highest accuracy of 99%. Higher orders did not improve performance, likely because the local graph structure (based on superpixel adjacency) does not require long-range spectral filtering to discriminate between the two leukemia subtypes.

Clustering region size: The superpixel region count significantly affected model accuracy. Testing with 50, 100, 150, and 200 regions revealed that 100 regions achieved the highest accuracy of 99.4%. Performance dropped to 94.1% at 50 regions, 91% at 150 regions, and 82% at 200 regions. The 100-region setting appears to strike the optimal balance between preserving morphological detail and reducing graph complexity, as too few regions lose fine-grained cell features while too many create an overly sparse graph representation.

TL;DR: Six graph convolutional layers with Chebyshev polynomial order 1 and 100 superpixel regions per image yielded optimal performance. Accuracy by region count: 50 regions = 94.1%, 100 regions = 99.4%, 150 regions = 91%, 200 regions = 82%. The architecture was optimized for both speed and accuracy.
Pages 12-14
99.4% Accuracy With Strong Performance Across All Evaluation Metrics

The proposed deep graph convolutional network achieved convergence after approximately 120 training iterations (out of 150 total), at which point both training and validation accuracy stabilized and loss reached its minimum. The final model reported the following evaluation metrics for the binary ALL/AML classification: accuracy of 99.4%, sensitivity of 99.2%, precision of 98.1%, specificity of 97.3%, and a kappa coefficient of 0.85.

Confusion matrix analysis: The confusion matrix revealed that the model misclassified only two AML samples across the entire test set; no ALL samples were misclassified. The area under the receiver operating characteristic (ROC) curve fell within the 0.9-1.0 range, confirming strong discriminative ability and the absence of significant overfitting.
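All five metrics can be reproduced from any 2 x 2 confusion matrix. The counts below are a hypothetical illustration (the paper does not report exact per-class test counts), not the study's matrix:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, precision, specificity, and Cohen's kappa
    from a 2x2 confusion matrix (here, positive = ALL, negative = AML)."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    sens = tp / (tp + fn)                # recall on the positive class
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)      # agreement corrected for chance
    return acc, sens, prec, spec, kappa

# Hypothetical counts: 50 ALL all correct, 2 of 50 AML misclassified as ALL.
acc, sens, prec, spec, kappa = binary_metrics(tp=50, fn=0, fp=2, tn=48)
```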

Cross-validation: The 5-fold cross-validation results demonstrated classification accuracy above 95% across all five folds, further confirming that the model generalized well and did not overfit to a particular data split. The T-SNE visualization of raw data versus the output of the final fully connected layer showed clear separation between ALL and AML clusters, indicating that the graph convolutional layers effectively learned discriminative feature representations.

These results represent a substantial improvement over most prior methods for leukemia classification. Among directly comparable studies, Zhou et al. achieved 85% with fully connected neural networks, Bhute et al. reached 90% with pre-trained networks (VGG16, ResNet50, Inception V3), Rastogi et al. obtained 96.15% with LeuFeatx, and Ansari et al. reported 98% using a Type-2 fuzzy approach combined with a CNN. Only Awais et al. achieved a comparable 99.15% using standard CNNs, though on a different private dataset.

TL;DR: The model achieved 99.4% accuracy, 99.2% sensitivity, 98.1% precision, 97.3% specificity, and a kappa of 0.85 for ALL vs. AML classification. Only 2 AML samples were misclassified. All 5 cross-validation folds exceeded 95% accuracy. ROC curve values ranged from 0.9 to 1.0.
Pages 15-16
The Model Holds Up Under Noise and Outperforms CNN, ResNet50, and VGG16

Clinical blood smear images frequently contain ambient noise from acquisition conditions such as lighting variation, camera quality, and staining inconsistencies. To test the model's resilience, the authors added white Gaussian noise at varying signal-to-noise ratios (SNRs) and measured classification accuracy under each condition. The proposed graph CNN maintained accuracy above 90% even at an SNR of 0 dB, which represents equal noise and signal power. This is a notably harsh noise condition that would severely degrade most standard image classifiers.
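Noise at a prescribed SNR is straightforward to reproduce; a minimal sketch (scaling assumed from the standard SNR definition, since the paper does not spell out its noise-injection code):

```python
import numpy as np

def add_awgn(img, snr_db, rng=None):
    """Add white Gaussian noise at a target SNR in dB.
    snr_db = 0 means noise power equals signal power (the harshest setting
    tested in the paper)."""
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(img ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return img + rng.normal(0.0, np.sqrt(p_noise), img.shape)

clean = np.random.default_rng(1).random((226, 226))
noisy = add_awgn(clean, snr_db=0.0)   # equal signal and noise power
```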

Head-to-head benchmarking: To ensure a fair comparison with existing methods, the authors retrained three widely used architectures (a CNN without graph layers, ResNet50, and VGG16) on the same proposed database for 150 iterations each. The CNN architecture was identical to the proposed model but without the graph convolutional layers, isolating the specific contribution of graph theory to classification performance. Across all tested conditions, the proposed graph CNN outperformed all three baselines in both clean and noisy environments.

The graph-based architecture's noise resilience likely stems from two properties of the graph representation. First, the superpixel clustering step averages pixel intensities within regions, which inherently smooths out local noise before the data enters the network. Second, the graph convolutional layers operate on relationships between regions rather than individual pixel values, making the learned features more robust to pixel-level perturbations. These structural advantages explain why the graph CNN degraded more gracefully under noise than pixel-based architectures like ResNet50 and VGG16.

TL;DR: The graph CNN maintained over 90% accuracy even at 0 dB SNR (equal noise and signal). When benchmarked against a CNN (without graph layers), ResNet50, and VGG16 on the same dataset, the proposed model outperformed all three in both clean and noisy conditions. Superpixel averaging and region-level convolutions provide built-in noise smoothing.
Pages 16-17
Small Dataset, Binary Classification, and the Path Forward

Binary classification constraint: The most significant limitation of this study is that it only addresses binary classification between ALL and AML. Clinical leukemia diagnosis involves four major subtypes: ALL, AML, chronic lymphocytic leukemia (CLL), and chronic myeloid leukemia (CML). A practical clinical tool would need to distinguish among all four, plus potentially identify healthy samples. The authors acknowledge this and propose expanding to multi-class classification in future work.

Small dataset: Despite achieving 99.4% accuracy, the model was trained and evaluated on only 500 images (augmented from 190 originals) derived from 44 patients. This is a small sample by deep learning standards, and the GAN-generated images may not fully capture the true variability of leukemia morphology across diverse patient populations. Although the data availability statement indicates the dataset is private and restricted by the University Ethics Committee, future work would benefit from validation on larger, publicly available datasets such as ALL-IDB or CNMC to confirm generalizability.

Single-center design: All data came from Ghazi Tabriz Medical Sciences Center in Iran. Single-center studies carry inherent risks of selection bias and may not generalize to different patient demographics, staining protocols, or imaging equipment. Multi-center validation would be essential before any clinical deployment of this model.

GAN augmentation vs. traditional methods: The authors used GANs exclusively for data augmentation but did not compare this approach against traditional augmentation strategies such as rotation, flipping, scaling, and color jittering. Future studies could evaluate whether GAN-generated samples provide measurably better augmentation than simpler geometric transformations, or whether a hybrid approach combining both methods yields superior results. Additionally, the potential for transfer learning combined with graph convolutional architectures remains unexplored and could further improve performance on limited datasets.

TL;DR: Key limitations include binary-only classification (ALL vs. AML, not all four leukemia types), a small dataset of 500 images from just 44 patients at a single center, and no comparison of GAN augmentation against traditional augmentation methods. Future work should expand to multi-class classification, validate on larger multi-center datasets, and explore transfer learning with graph architectures.
Citation: Zare L, Rahmani M, Khaleghi N, Sheykhivand S, Danishvar S. Automatic Detection of Acute Leukemia (ALL and AML) Utilizing Customized Deep Graph Convolutional Neural Networks. Open access (CC BY), 2024. Available at: PMC11273433. DOI: 10.3390/bioengineering11070644.