Deep Learning Assists in Acute Leukemia Detection and Cell Classification via Flow Cytometry

Scientific Reports, 2024

Plain-English Explanations
Page 1
What This Study Tackles and Why It Matters

Flow cytometry is a laboratory technique that rapidly analyzes individual cells by passing them through a laser beam, measuring their physical and chemical properties. It is the workhorse technology for diagnosing acute leukemia, the umbrella term for fast-growing blood cancers that include acute myeloid leukemia (AML) and B-acute lymphoblastic leukemia (B-ALL). Each cell is tagged with fluorescent antibodies against specific CD markers (such as CD34, CD45, CD19, and CD7), and the instrument records how brightly each marker lights up.

The EuroFlow consortium, founded in 2006, standardized these protocols across European laboratories to improve accuracy and reproducibility. One of its most widely used panels is the acute leukemia orientation tube (ALOT), which screens for AML, B-ALL, and T-ALL in a single tube by combining eight antibodies. Despite this standardization, interpreting flow cytometry results remains slow and subjective: technicians manually draw boundaries (called "gating") around cell populations on scatter plots, a process that takes five to ten minutes per sample and varies between readers.

This retrospective study from China Medical University Hospital set out to determine whether deep learning could automate two tasks: screening patients for acute leukemia and classifying the individual cell types present in each sample. The researchers collected data from 241 patients who underwent ALOT-based flow cytometry between 2017 and 2022, then trained AI models using ResNet-50 and a custom architecture called EverFlow.

TL;DR: Manual analysis of flow cytometry data for acute leukemia is slow and subjective. This study trained deep learning models (ResNet-50 and EverFlow) on 241 patients' ALOT flow cytometry data to automate leukemia screening and cell classification, achieving 94.6% sensitivity for AML and 98.2% for B-ALL.
Pages 2-3
Patient Cohort, ALOT Protocol, and Training Strategy

Patient population: The dataset comprised 241 patients from China Medical University Hospital spanning 2017 to 2022. Patients were grouped into five diagnostic categories: 41 with AML, 43 with B-ALL, 60 with complex conditions (potentially myelodysplastic syndrome or inconclusive findings), 64 with normal flow cytometry results, and 34 with other diseases including T-ALL, B-cell lymphoma, T-cell lymphoma, myeloma, and hemophagocytic lymphohistiocytosis. Each patient's FCS file contained up to 250,000 events (individual cells), with each event described by 12 channels: the fluorescence intensity of eight antibodies plus forward scatter and side scatter measurements.
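As a rough picture of the data layout, each FCS file can be thought of as an events-by-channels matrix. The sketch below builds such a matrix with synthetic values; the generic channel names are placeholders, not the actual ALOT panel layout.

```python
import numpy as np

# One FCS file is essentially a matrix of events (cells) x channels.
# The study records 12 channels per event; the names below are
# placeholders, not the real ALOT panel assignments.
channel_names = [f"channel_{i}" for i in range(12)]

rng = np.random.default_rng(0)
n_events = 10_000  # kept small here; files held up to 250,000 events
events = rng.lognormal(mean=3.0, sigma=1.0, size=(n_events, len(channel_names)))

print(events.shape)  # (10000, 12)
```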

Three-phase training design: The AI training was organized into three distinct phases. In Phase I, raw 12-channel data from complete FCS files were fed into ResNet-50 to predict the patient's disease directly from the unprocessed cell data. In Phase II, FCS files were manually separated by cell type, and a custom model called EverFlow was trained to recognize individual cell populations. In Phase III, the cell compositions generated by the Phase II model were used as input, either alone or combined with the original 12-channel data, and ResNet-50 was again used to predict disease status from this enriched input.

The data were split 80/20 for training and testing, and five-fold cross-validation was applied to the training set to guard against overfitting. This three-phase approach allowed the researchers to compare disease prediction from raw data alone (Phase I), cell-type classification accuracy (Phase II), and disease prediction from AI-derived cell compositions (Phase III).
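The split described above can be sketched in a few lines. The helper below is a generic illustration of a hold-out split followed by five-fold partitioning of the training set, not the authors' code:

```python
import random

def split_patients(patient_ids, n_test, n_folds=5, seed=42):
    """Shuffle patients, hold out a test set, and partition the rest
    into cross-validation folds (illustrative, not the authors' code)."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    test, train = ids[:n_test], ids[n_test:]
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds

# The study's split: 241 patients -> 185 training / 56 testing, 5-fold CV.
train, test, folds = split_patients(range(241), n_test=56)
print(len(train), len(test), [len(f) for f in folds])  # 185 56 [37, 37, 37, 37, 37]
```

Splitting at the patient level (rather than the cell level) matters here: cells from one patient are highly correlated, so letting them leak across the train/test boundary would inflate the reported accuracy.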

TL;DR: The study used 241 patients across five diagnostic groups. Training was split into three phases: Phase I used raw flow cytometry data with ResNet-50, Phase II trained a custom EverFlow model to classify cell types, and Phase III combined AI-derived cell compositions with channel data for final disease prediction.
Pages 2-3
ResNet-50 and the Custom EverFlow Architecture

ResNet-50 is a 50-layer deep residual neural network originally designed for image classification. It uses "skip connections" that allow information to bypass layers, preventing the vanishing-gradient problem that plagues very deep networks. In this study, the authors repurposed ResNet-50 to process 12-channel flow cytometry data rather than images. Several alternative architectures were tested during Phase I, including a standard CNN, SE-ResNet-50, SE-ResNeXt-50, ResNeXt-50 (32×4d), and EfficientNet-B4. All performed well, but ResNet-50 offered the best balance of speed and accuracy, so it was selected for Phases I and III.

EverFlow is a custom multi-level network architecture the authors designed specifically for flow cytometry data analysis. It uses three Conv1d (one-dimensional convolutional) layers, three BatchNorm1d (batch normalization) layers, and two MaxPool1d (max-pooling) layers, combined with the ReLU activation function and an Adaptive Average Pooling layer. A key component is the "Flow Block," which bundles Conv1d, BatchNorm1d, MaxPool1d, and ReLU into a reusable unit. By stacking multiple Flow Blocks, EverFlow builds hierarchical representations of the FCS data.
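The layer inventory above (three Conv1d, three BatchNorm1d, two MaxPool1d, ReLU, adaptive average pooling, and reusable "Flow Blocks") can be sketched in PyTorch. Kernel sizes, channel widths, and the output class count below are assumptions, since the summary does not specify them; this is a structural sketch, not the published model.

```python
import torch
import torch.nn as nn

class FlowBlock(nn.Module):
    """Conv1d -> BatchNorm1d -> ReLU, optionally followed by MaxPool1d."""
    def __init__(self, c_in, c_out, pool=True):
        super().__init__()
        layers = [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                  nn.BatchNorm1d(c_out),
                  nn.ReLU()]
        if pool:
            layers.append(nn.MaxPool1d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class EverFlowSketch(nn.Module):
    """Stacked Flow Blocks: three Conv1d/BatchNorm1d layers and two
    MaxPool1d layers in total, then adaptive average pooling and a
    classifier head. Widths and class count are assumptions."""
    def __init__(self, n_classes=16):
        super().__init__()
        self.features = nn.Sequential(
            FlowBlock(1, 16),               # conv + bn + relu + pool
            FlowBlock(16, 32),              # conv + bn + relu + pool
            FlowBlock(32, 64, pool=False),  # conv + bn + relu
        )
        self.gap = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, 12) -- one event's 12 channels
        x = self.gap(self.features(x)).squeeze(-1)
        return self.fc(x)

model = EverFlowSketch().eval()
out = model(torch.randn(8, 1, 12))
print(tuple(out.shape))  # (8, 16)
```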

ResNet-50 proved unsuitable for Phase II (cell-type classification) because it caused overfitting on this more granular task. The simpler EverFlow architecture was better matched to the problem of distinguishing individual cell populations within a patient's sample. Training used the Ranger optimizer with a learning rate of 5E-3, and convergence typically occurred within 75 epochs.

TL;DR: ResNet-50, a 50-layer deep residual network, handled patient-level disease prediction in Phases I and III. For cell-type classification in Phase II, the authors created EverFlow, a lighter CNN architecture with stacked "Flow Blocks" purpose-built for flow cytometry data. ResNet-50 overfitted on cell classification, but EverFlow handled it well.
Pages 3-4
Disease Detection from Raw Flow Cytometry Channels

Testing setup: In Phase I, 185 patients were used for training and validation via five-fold cross-validation, while the remaining 56 patients formed the independent test set. The AI received only the raw 12-channel data from each patient's FCS file and attempted to predict the diagnosis without any manual gating or cell-type labeling.

AML detection: The model achieved 91.1% accuracy for recognizing AML, with 80.0% sensitivity (correctly identifying 8 of 10 AML patients) and 93.5% specificity (correctly ruling out AML in 43 of 46 non-AML patients). The F1 score was 76.2%. While the sensitivity was promising, the 80% figure meant that two AML patients were missed in this phase.

B-ALL detection: Performance was stronger for B-ALL, with 94.6% accuracy, 90.9% sensitivity (10 of 11 B-ALL patients correctly identified), and 95.6% specificity (43 of 45 non-B-ALL patients correctly excluded). The F1 score reached 87.0%. These results established that deep learning could extract diagnostically useful patterns directly from raw flow cytometry channels without manual cell gating.
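The reported figures can be reproduced from the confusion-matrix counts given above (8 of 10 AML and 10 of 11 B-ALL patients caught, with 3 and 2 false positives respectively). A quick check:

```python
def metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, accuracy, f1

# AML in Phase I: 8 of 10 patients caught, 43 of 46 non-AML excluded
aml = metrics(tp=8, fn=2, fp=3, tn=43)
# B-ALL in Phase I: 10 of 11 caught, 43 of 45 excluded
ball = metrics(tp=10, fn=1, fp=2, tn=43)

print([f"{v:.1%}" for v in aml])   # ['80.0%', '93.5%', '91.1%', '76.2%']
print([f"{v:.1%}" for v in ball])  # ['90.9%', '95.6%', '94.6%', '87.0%']
```

Both rows match the percentages reported in the study, which confirms the test set of 56 patients contained 10 AML and 11 B-ALL cases.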

Challenging categories: Performance on the "complex" group (possible myelodysplastic syndrome) was notably lower, with only 42.9% sensitivity and an F1 of 44.4%. Similarly, the "other disease" category showed just 37.5% sensitivity. These weaker results were expected because the ALOT tube was designed specifically for acute leukemia screening, not for differentiating these less common or overlapping conditions.

TL;DR: Using raw 12-channel flow cytometry data alone, ResNet-50 achieved 91.1% accuracy for AML (80% sensitivity) and 94.6% accuracy for B-ALL (90.9% sensitivity). Complex and rare disease categories were harder, confirming that the ALOT tube is optimized for acute leukemia rather than broader diagnoses.
Pages 4-5
Cell-Type Classification with EverFlow

Training data preparation: For Phase II, FCS files from 70 patients across seven diagnostic groups were manually gated by experts to isolate individual cell populations. This produced separate FCS files for each cell type: B lymphocytes, T lymphocytes, neutrophils, monocytes, erythrocytes, eosinophils, NK cells, CD34+ myeloid cells, CD34+ B-precursors, debris, and several pathological populations including AML cells (CD34+ and CD34-), B-ALL cells (CD34+ and CD34-), T-ALL cells, and B-cell chronic lymphoproliferative disorder (BCLPD). The training set ranged from about 4,000 to over 1.1 million events per cell type.

Normal cell identification: The EverFlow model identified physiological (normal) cells with accuracy exceeding 80% for most types. This is clinically significant because reliable identification of normal cell populations is the foundation for detecting abnormal ones. The model's ability to consistently recognize the baseline cell landscape meant it could flag deviations that suggest pathology.

Pathological cell identification: Performance on pathological cells varied substantially. CD34-positive AML cells were correctly identified 64.4% of the time, and CD34-positive B-ALL cells 62.6%. Identification dropped to only 32.0% for CD34-negative AML cells and 18.5% for CD34-negative B-ALL cells, and only 14.6% of B-cell lymphoma cells were properly recognized. In contrast, T-ALL cells were identified at 97.7%: because they are CD45-negative, they are readily distinguishable from normal T cells, which are always CD45-positive.

Despite the lower per-cell sensitivity for some pathological types, the overall patient-level diagnosis was not severely impacted. The AI needed to detect only a subset of abnormal cells to flag the patient correctly, demonstrating that high per-cell sensitivity is not strictly necessary for accurate disease-level screening.
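The point that imperfect per-cell sensitivity can still yield a correct patient-level call can be illustrated with a simple flagging rule. The threshold, label names, and counts below are hypothetical, not taken from the paper:

```python
def flag_patient(cell_predictions, abnormal_labels, threshold=0.01):
    """Flag a patient if the predicted fraction of abnormal cells
    exceeds a (hypothetical) decision threshold."""
    n_abnormal = sum(p in abnormal_labels for p in cell_predictions)
    return n_abnormal / len(cell_predictions) >= threshold

# Suppose 20% of a patient's cells are truly leukemic but the model only
# recognizes about a third of them (as with CD34-negative AML). Roughly
# 6.4% of all cells are still labeled abnormal, well above the threshold.
leukemic = ["aml_blast"] * 64 + ["normal"] * 936   # 6.4% flagged abnormal
healthy = ["normal"] * 1000

print(flag_patient(leukemic, {"aml_blast"}))  # True
print(flag_patient(healthy, {"aml_blast"}))   # False
```

In practice the decision was made by a downstream classifier rather than a fixed cutoff, but the arithmetic is the same: even a model that misses most abnormal cells individually can surface enough of them to change the patient-level picture.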

TL;DR: EverFlow classified normal cell types at over 80% accuracy. Pathological cell identification varied: CD34-positive AML and B-ALL cells were recognized at 62-64%, while CD34-negative variants dropped to 18-32%. T-ALL cells were easiest to spot at 97.7%. Importantly, even partial cell detection was sufficient for correct patient-level diagnosis.
Pages 5-6
Combining Cell Composition with Channel Data for Final Diagnosis

Two configurations tested: Phase III used the same 185 training and 56 testing patient split as Phase I, but now the input included the cell compositions generated by the Phase II EverFlow model. Two configurations were evaluated: one using cell composition labels alone, and one combining cell labels with the original 12-channel data. ResNet-50 served as the classifier in both configurations.
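A minimal sketch of how the two Phase III input configurations might be assembled, assuming cell-type fractions from Phase II and a per-patient summary of the 12 channels. The exact input encoding is not spelled out in this summary, so the representation (and the cell-type names) below are illustrative:

```python
import numpy as np

def phase3_features(cell_fractions, channel_data, include_channels):
    """Build a per-patient feature vector: cell-type fractions,
    optionally concatenated with summarized 12-channel data."""
    comp = np.asarray(list(cell_fractions.values()))
    if not include_channels:
        return comp
    # Summarize per-event channel data (events x 12) as per-channel medians.
    summary = np.median(channel_data, axis=0)
    return np.concatenate([comp, summary])

fractions = {"T_cells": 0.40, "B_cells": 0.10, "neutrophils": 0.35,
             "monocytes": 0.05, "blasts_cd34_pos": 0.10}  # names illustrative
events = np.random.default_rng(1).lognormal(size=(1000, 12))

print(phase3_features(fractions, events, include_channels=False).shape)  # (5,)
print(phase3_features(fractions, events, include_channels=True).shape)   # (17,)
```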

Cell labels only: Using only the AI-derived cell compositions, AML detection reached 89.3% accuracy with 90.0% sensitivity and 89.1% specificity. B-ALL detection was even stronger at 98.2% accuracy, 90.9% sensitivity, and 100% specificity. The improvement in AML sensitivity from 80.0% in Phase I to 90.0% here indicates that the cell-composition information produced by EverFlow carries meaningful diagnostic signal in its own right.

Cell labels plus 12 channels: When both data types were combined, AML accuracy rose to 94.6% with 90.0% sensitivity and 95.7% specificity. B-ALL maintained 98.2% accuracy, 90.9% sensitivity, and 100% specificity. This combined approach produced the best overall performance, confirming that raw channel features and AI-derived cell compositions capture complementary information. The "other disease" category also improved to 75.0% sensitivity (up from 37.5% in Phase I), and "normal" identification reached 69.2% sensitivity.

The results demonstrate a clear trend: the AI performed optimally when given both the cell composition analysis from EverFlow and the underlying 12-channel measurements. Compared to manual gating, which takes five to ten minutes per sample, the AI completed its full analysis in under one minute, offering a substantial time savings for clinical laboratories.

TL;DR: Combining EverFlow-derived cell compositions with raw 12-channel data yielded the best results: 94.6% accuracy for AML (90% sensitivity) and 98.2% for B-ALL (90.9% sensitivity, 100% specificity). Analysis time dropped from 5-10 minutes with manual gating to under one minute with AI.
Pages 6-7
CD34-Negative Challenges, Small Sample Size, and Next Steps

The CD34-negative problem: The most significant limitation was the AI's difficulty with CD34-negative pathological cells. CD34 is a stem cell marker commonly expressed on blast cells in acute leukemia. When AML or B-ALL cells lack this marker, they become much harder to distinguish from normal mature cells, even for human experts performing manual gating. The one AML patient missed by the AI had CD34-negative disease, and the one missed B-ALL patient also had CD34-negative cells that were misclassified as B-cell lymphoma. These are recognized pitfalls in clinical flow cytometry interpretation as well.

Myelodysplastic syndrome screening: The AI struggled with patients who potentially had myelodysplastic syndrome (sometimes called preleukemia), achieving only 42.9% sensitivity at best. This is not surprising, because the ALOT tube was designed specifically for acute leukemia screening, not for myelodysplastic syndromes. A separate EuroFlow protocol, the AML/MDS tube, which uses a more complex antibody panel, exists for this purpose. The authors noted that training AI on the AML/MDS protocol is a current priority.

Sample size and T-ALL: With only 241 patients total and just 11 T-ALL patients in the cohort, the dataset was relatively small by deep learning standards. The limited T-ALL representation made it impossible to adequately train or test the AI for this subtype at the patient level, despite the model's excellent 97.7% cell-level sensitivity. Additionally, Phase II's approach of pooling identical cell types across patients may have caused loss of within-patient contextual information, though the impact of this remains uncertain.

What distinguishes this work: The authors highlight three features that set their approach apart from prior studies. First, they used deep learning rather than the traditional machine learning that most earlier flow cytometry AI studies relied on. Second, their training focused on the EuroFlow standardized protocol, which is highly reproducible across institutions. Third, the AI was purpose-built for acute leukemia detection rather than attempting to identify a broad range of diseases. Future work should expand the dataset, incorporate additional EuroFlow protocols, and pursue prospective validation at multiple centers.

TL;DR: The main limitations were difficulty with CD34-negative leukemia cells, poor sensitivity for myelodysplastic syndrome (which ALOT was not designed to detect), and a small dataset of only 241 patients. The study's strengths include applying deep learning to EuroFlow-standardized flow cytometry, with plans to expand to additional protocols and larger cohorts.
Citation: Cheng FM, Lo SC, Lin CC, et al. Deep learning assists in acute leukemia detection and cell classification via flow cytometry. Scientific Reports, 2024. Open access (CC BY). DOI: 10.1038/s41598-024-58580-z. PMCID: PMC11004172.