Non-Hodgkin lymphoma (NHL) is a heterogeneous group of hematological cancers and ranks among the 10 most common cancer types worldwide. For 2020, the Surveillance, Epidemiology, and End Results (SEER) program estimated 77,240 new cases and 19,940 deaths from NHL in the United States alone. The subtypes within NHL vary dramatically in clinical behavior and prognosis, making accurate subtyping essential for treatment decisions. Diagnosis typically requires histopathological examination of lymph node resection specimens, followed by expensive immunohistochemical and molecular analyses directed by an experienced hematopathologist.
The pathologist bottleneck: Accurate NHL subtyping demands deep expertise, extensive training, and access to costly laboratory methods. The number of pathologists is declining in many countries (notably in Germany, where this study was conducted), while the knowledge requirements keep growing. Pathologists outside major academic centers may lack both the specialized hematopathology experience and the molecular equipment needed for liberal use of ancillary tests. This creates a clear need for supplemental tools that can support morphological decision-making.
Prior work in digital pathology: Deep learning methods have already demonstrated high accuracy in classifying carcinoma subtypes and detecting lymph node metastases from solid tumors. However, studies specifically addressing NHL subtype classification on histopathological images remain limited. The few existing reports typically included only small cohorts (34 to 259 cases per entity) and used heterogeneous methods, making direct comparison difficult. This study set out to test whether an EfficientNet convolutional neural network (CNN) could reliably distinguish tumor-free lymph nodes from two common NHL subtypes: small lymphocytic lymphoma/chronic lymphocytic leukemia (SLL/CLL) and diffuse large B-cell lymphoma (DLBCL).
Cohort composition: The authors assembled 629 patients from the Institute of Pathology at Heidelberg University, supported by the National Center for Tumor Diseases (NCT) tissue biobank. The cohort included 129 SLL/CLL cases, 119 DLBCL cases, and 381 tumor-free control lymph nodes harvested from resection specimens of lung, colon, and pancreas surgeries. All lymphoma diagnoses followed the 2016 WHO Classification of Tumors of Hematopoietic and Lymphoid Tissue, with standard hematoxylin and eosin (H&E) staining and immunohistochemistry per current best-practice recommendations.
Tissue microarray and scanning: Rather than using full whole-slide images, the team constructed tissue microarrays (TMAs) and scanned them at 400x magnification using an Aperio SC2 slide scanner (Leica Biosystems). Scanned slides were imported into QuPath (v0.1.2), where a pathologist annotated tumor areas for each of the three classes. From these annotations, image patches of 100 x 100 micrometers (395 x 395 pixels) were extracted. Care was taken to avoid annotating beyond tissue core borders, preventing the algorithm from learning edge artifacts rather than cellular morphology.
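As a quick sanity check on the stated patch geometry, the physical resolution implied by those numbers can be computed directly (a minimal sketch; the comparison value of roughly 0.25 µm/px, typical of high-magnification slide scanners, is mine, not from the source):

```python
# Patch geometry reported in the study: 100 x 100 micrometers at 395 x 395 pixels.
PATCH_MICRONS = 100.0
PATCH_PIXELS = 395

microns_per_pixel = PATCH_MICRONS / PATCH_PIXELS
print(f"{microns_per_pixel:.4f} um/px")  # ~0.2532 um/px
```

The implied ~0.25 µm/px is consistent with the high-magnification scan described above.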
Patch yield: A total of 84,139 image patches were extracted across all patients. The target was a minimum of 10 patches per patient, which was achieved in all but 7 cases. For tumor-free control lymph nodes, the authors retained the detailed sub-classification (lung, colon, pancreas origin) during training, anticipating that this granularity might improve classification accuracy. At evaluation time, these sub-classes were aggregated into a single "tumor-free reference LN" category.
Data splitting: Patients were randomly divided into training (60%), validation (20%), and test (20%) subsets. Critically, all patches from a given patient were assigned to the same subset, preventing data leakage. A checkpoint in the code verified that no patient appeared in more than one subset. These splits remained fixed throughout all experiments.
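The patient-level split and leakage checkpoint described above can be sketched as follows (a minimal illustration with hypothetical patient IDs and my own function names, not the authors' code):

```python
import random

def split_by_patient(patient_ids, seed=42):
    """Randomly assign whole patients to train/val/test (60/20/20), so that
    all patches from one patient land in the same subset (no data leakage)."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    # Checkpoint analogous to the one in the study: no patient in two subsets.
    assert not (train & val) and not (train & test) and not (val & test)
    return train, val, test

train, val, test = split_by_patient([f"patient_{i}" for i in range(629)])
print(len(train), len(val), len(test))  # 377 125 127
```

Splitting at the patient level rather than the patch level is the key design choice: patches from one patient are highly correlated, so mixing them across subsets would inflate test performance.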
Why EfficientNet: The authors chose the EfficientNet family of CNNs because it achieved a high top-1 accuracy of 84.3% on the ImageNet benchmark while being smaller and significantly faster than competing architectures with comparable accuracy. EfficientNet uses a compound scaling method that systematically increases network width, depth, and image resolution using fixed scaling coefficients, rather than scaling only one dimension at a time. This balanced approach exploits the inherent relationship between these three dimensions to maximize performance per parameter.
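The compound scaling rule can be made concrete with the coefficients published in the EfficientNet paper (Tan & Le, 2019); the helper below is my own sketch of that rule, not code from this study:

```python
# Compound scaling as described in the EfficientNet paper: for a compound
# coefficient phi, depth scales by alpha**phi, width by beta**phi, and input
# resolution by gamma**phi, under the constraint alpha * beta**2 * gamma**2 ~= 2
# so that FLOPs roughly double with each unit increase in phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients found by grid search in the paper

def compound_scale(phi):
    return {
        "depth_multiplier": ALPHA ** phi,
        "width_multiplier": BETA ** phi,
        "resolution_multiplier": GAMMA ** phi,
    }

print(round(ALPHA * BETA**2 * GAMMA**2, 2))  # ~1.92, close to the target of 2
print(compound_scale(3))  # multipliers for phi = 3 (illustrative)
```

Note that the released B1 through B7 checkpoints were tuned individually, so their exact width/depth/resolution settings deviate somewhat from the idealized formula above.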
Model selection (B0 through B4): The team trained and optimized five EfficientNet variants (B0, B1, B2, B3, B4). Each model was trained for 50 epochs using the Adam optimizer, with learning rates explored in the range of 1e-6 to 1e-5. The best learning rate for each variant was identified, and training continued until no further performance gain was observed. The final tuned hyperparameters were: B0 (learning rate 1e-6, batch size 256), B1 (1e-5, batch 128), B2 (9e-6, batch 128), B3 (8e-6, batch 64), and B4 (6e-6, batch 16). Batch size decreased with model scale because larger models and higher input resolutions consume more GPU memory.
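The reported settings can be collected into a single lookup table (a restatement of the values above as a config fragment; the structure and names are mine):

```python
# Final tuned hyperparameters reported for each EfficientNet variant.
HYPERPARAMS = {
    "B0": {"learning_rate": 1e-6, "batch_size": 256},
    "B1": {"learning_rate": 1e-5, "batch_size": 128},
    "B2": {"learning_rate": 9e-6, "batch_size": 128},
    "B3": {"learning_rate": 8e-6, "batch_size": 64},
    "B4": {"learning_rate": 6e-6, "batch_size": 16},
}

# Batch size is non-increasing with model scale: larger variants and higher
# input resolutions consume more GPU memory per sample.
sizes = [HYPERPARAMS[k]["batch_size"] for k in ("B0", "B1", "B2", "B3", "B4")]
assert all(a >= b for a, b in zip(sizes, sizes[1:]))
```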
Performance comparison: The B3 and B2 models achieved nearly identical overall accuracy, but B3 produced slightly fewer misclassifications in the validation-set confusion matrix. Since B4 did not outperform B3, the authors did not pursue the even larger B5 through B7 variants. The B3 model was selected as the final classifier for the independent test set. Training was performed on bwForCluster MLS&WISO production nodes using Nvidia Tesla K80 GPUs (for B0 through B3, with a mirrored strategy across two GPUs) and an Nvidia GeForce RTX 2080 Ti (for B4, single GPU), running TensorFlow 2.3.1 inside Singularity containers.
Patch-level and case-level evaluation: The selected EfficientNet B3 model was evaluated on the independent test set containing 16,960 image patches from 125 patients. At the patch level, each patch was assigned the class with the highest predicted probability. At the case (patient) level, the final classification was determined by majority vote across all patches belonging to that patient. The authors used balanced accuracy (BACC) rather than plain accuracy to account for class imbalance across the three categories.
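The two-stage evaluation can be sketched in a few lines (a minimal illustration with toy probabilities and my own class labels and function names, not the authors' code):

```python
from collections import Counter

def case_label(patch_probs, classes):
    """Patch level: argmax per patch; case level: majority vote over patches."""
    patch_preds = [classes[max(range(len(p)), key=p.__getitem__)] for p in patch_probs]
    return Counter(patch_preds).most_common(1)[0][0]

def balanced_accuracy(y_true, y_pred, classes):
    """Mean of per-class recalls, so each class counts equally despite imbalance."""
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        if idx:
            recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

CLASSES = ["tumor-free", "SLL/CLL", "DLBCL"]  # label names are illustrative
# Toy case: three patches for one hypothetical patient.
probs = [[0.1, 0.7, 0.2], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
print(case_label(probs, CLASSES))  # SLL/CLL (2 of 3 patches vote for it)
```

With 381 tumor-free controls against 129 and 119 lymphoma cases, plain accuracy would reward a model biased toward the majority class, which is why the mean-of-recalls formulation matters here.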
Baseline performance: Without any quality control filtering, the model achieved a high BACC for DLBCL and tumor-free reference lymph nodes, with only a single missed case in each category. However, SLL/CLL predictions showed a lower BACC with multiple misclassifications. This asymmetry likely reflects the morphological similarity between SLL/CLL cells and normal lymphocytes, since neoplastic SLL/CLL cells can represent only a fraction of the overall image area, making them harder to distinguish from reactive lymphoid tissue.
DLBCL detection: The algorithm achieved 100% sensitivity and specificity for detecting DLBCL on the test set, correctly classifying every DLBCL case. This strong performance is consistent with the distinct large-cell morphology of DLBCL, which presents a visually more obvious pattern compared to the small, mature-appearing lymphocytes of SLL/CLL. The clear morphological contrast between DLBCL and both normal lymph nodes and SLL/CLL made it a comparatively easier classification target for the CNN.
Two-tiered quality filtering: The authors implemented a dual quality control system to improve classification reliability. The patch-based quality control (PQC) threshold filtered out any image patch whose highest predicted probability fell below a set cutoff, removing low-confidence predictions. The case-based quality control (CQC) threshold then filtered out entire patient cases where the proportion of patches assigned to the predicted class fell below a second cutoff. Together, these two thresholds acted as a confidence gate, flagging uncertain cases for human review rather than forcing a potentially incorrect classification.
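The two thresholds compose naturally into a single gate (a minimal sketch of the described logic with illustrative labels and my own names; not the authors' implementation):

```python
from collections import Counter

def classify_with_qc(patch_probs, classes, pqc=0.9, cqc=0.9):
    """Two-tiered quality control: drop low-confidence patches (PQC), then
    reject the whole case if the winning class's share of the remaining
    patches falls below the CQC threshold."""
    # Patch-based QC: keep only patches whose top probability clears the cutoff.
    kept = [p for p in patch_probs if max(p) >= pqc]
    if not kept:
        return None  # flag case for pathologist review
    votes = Counter(classes[max(range(len(p)), key=p.__getitem__)] for p in kept)
    winner, count = votes.most_common(1)[0]
    # Case-based QC: require the predicted class to dominate the kept patches.
    if count / len(kept) < cqc:
        return None  # flag case for pathologist review
    return winner

CLASSES = ["tumor-free", "SLL/CLL", "DLBCL"]  # illustrative labels
patches = [[0.05, 0.02, 0.93], [0.10, 0.05, 0.85], [0.02, 0.03, 0.95]]
print(classify_with_qc(patches, CLASSES))  # DLBCL (middle patch fails PQC, rest agree)
```

Returning `None` rather than a forced label is the crux: the gate converts low-confidence predictions into review requests instead of errors.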
Impact on balanced accuracy: The authors systematically tested PQC and CQC thresholds ranging from 50% to 90%. Increasing the CQC threshold steadily improved overall BACC. At the most stringent combination (PQC = 90%, CQC = 90%), the model achieved a BACC of 95.56% on the test set, with only 3 of the 102 remaining patients misclassified. However, this came at the cost of filtering out approximately 24.8% of cases as "not meeting threshold," meaning those cases would be flagged for manual pathologist review.
Clinical relevance of quality controls: The authors argue that quality control limits are essential for any real-world diagnostic deployment. In a routine setting, the algorithm would classify straightforward cases automatically while routing ambiguous cases (those failing the quality thresholds) to a pathologist for additional workup. This mirrors existing clinical practices where algorithmic confidence scores trigger escalation. Only one prior lymphoma classification study had used quality control limits at the patch level to generate heatmaps, but none had applied them systematically to the final case-level classification as done here.
SmoothGrad method: To verify that the CNN was basing its predictions on biologically meaningful features rather than artifacts, the authors applied SmoothGrad, a gradient-based explainability technique. SmoothGrad works by adding small amounts of noise to the input image (in this case, a noise level of 0.5%), computing the gradient of the predicted class with respect to the input pixels, and averaging these gradients across 50 samples. The resulting heatmap highlights which pixel regions most strongly influenced the model's prediction. The heatmaps were normalized to a [0, 1) scale using the maximum gradient per channel.
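The averaging procedure is model-agnostic and can be sketched without the CNN itself (a minimal NumPy illustration; the toy `grad_fn` below stands in for backpropagation through the trained network, and all names are mine):

```python
import numpy as np

def smoothgrad(grad_fn, image, n_samples=50, noise_frac=0.005, seed=0):
    """SmoothGrad sketch: average input gradients over noisy copies of the
    image. noise_frac=0.005 mirrors the 0.5% noise level used in the study;
    grad_fn stands in for the gradient of the predicted class score with
    respect to the input pixels."""
    rng = np.random.default_rng(seed)
    sigma = noise_frac * (image.max() - image.min())
    grads = np.zeros_like(image, dtype=float)
    for _ in range(n_samples):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        grads += grad_fn(noisy)
    grads /= n_samples
    # Normalize by the maximum absolute gradient per channel, as described.
    saliency = np.abs(grads)
    saliency /= saliency.max(axis=(0, 1), keepdims=True) + 1e-12
    return saliency

# Toy stand-in model: class score = sum of squared pixels, so gradient = 2 * x.
image = np.random.default_rng(1).random((8, 8, 3))
heatmap = smoothgrad(lambda x: 2.0 * x, image)
print(heatmap.shape)  # (8, 8, 3), values in [0, 1)
```

In the actual study the gradient would come from the EfficientNet B3 model (e.g., via automatic differentiation in TensorFlow); the noise-and-average wrapper shown here is the whole of the SmoothGrad idea.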
Key findings: The SmoothGrad heatmaps showed high activation in areas overlapping with individual cells, confirming that the algorithm was focusing on cell morphology rather than on background tissue, staining artifacts, or edge effects. For lung-origin control lymph nodes, the heatmaps also highlighted extracellular anthracosis (carbon pigment deposits), which is a known histological feature of lung-associated lymph nodes. This finding provided additional confidence that the model had learned clinically relevant morphological patterns, not spurious correlations.
Significance for trust: Explainability is a critical requirement for any AI system intended for clinical deployment. By demonstrating that the EfficientNet B3 model attends to cellular and extracellular morphological structures, the authors provide evidence that the learned representations are biologically interpretable. This is particularly important in pathology, where the visual features driving a diagnosis must be traceable and auditable by the reviewing pathologist.
Benchmarking setup: To estimate real-world inference latency, the authors classified a random image patch 1,000 times using a TensorFlow tf.data pipeline with a batch size of 1. They tested on two hardware configurations: a single thread of an Intel Core i9-9880H CPU (2.3 GHz) and an Nvidia Quadro T2000 GPU. On the CPU, the model processed each patch in 203 ms. On the GPU, this dropped to 107 ms per patch.
Scaling considerations: A typical patient in this study had approximately 100 to 150 image patches. At 107 ms per patch on the Quadro T2000, classifying all patches for a single patient would take roughly 10 to 16 seconds on a GPU. On a CPU, the same task would require 20 to 30 seconds. While these times are acceptable for a supplementary diagnostic tool, real-time whole-slide analysis (which can generate thousands of patches) would require more powerful hardware or optimized inference pipelines such as batched prediction and mixed-precision computation.
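The per-patient estimates above follow directly from the measured per-patch latencies (a back-of-envelope calculation restating the reported numbers; the helper name is mine):

```python
# Per-patch inference times measured in the study's benchmark.
MS_PER_PATCH = {"cpu": 203, "gpu": 107}

def per_patient_seconds(n_patches, device):
    """Naive sequential estimate: total time = patches x per-patch latency."""
    return n_patches * MS_PER_PATCH[device] / 1000.0

for device in ("gpu", "cpu"):
    lo, hi = per_patient_seconds(100, device), per_patient_seconds(150, device)
    print(f"{device}: {lo:.1f}-{hi:.1f} s for 100-150 patches")
```

This assumes batch size 1 as in the benchmark; batched prediction would amortize per-call overhead and bring the totals down considerably.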
CPU accessibility matters: The authors highlight that most pathology labs worldwide are not equipped with GPUs, making CPU inference performance a practical consideration. At 203 ms per patch on a standard laptop CPU, the model remains usable in resource-limited settings, though not at interactive speeds. Faster CPU inference methods would therefore be beneficial for routine diagnostic adoption, particularly in low-resource environments where GPU infrastructure is unavailable.
Limited entity coverage: The model was trained on only two B-cell NHL subtypes (SLL/CLL and DLBCL) plus tumor-free controls. It cannot be expected to classify other B-cell NHLs (such as follicular lymphoma, mantle cell lymphoma, or marginal zone lymphoma), T-cell lymphomas, or Hodgkin lymphoma. In clinical practice, a pathologist encounters a far broader spectrum of lymphoid neoplasms, so the current algorithm serves as a proof of concept rather than a comprehensive diagnostic tool.
Morphological heterogeneity within subtypes: Both SLL/CLL and DLBCL encompass significant morphological variation. SLL/CLL can show extensive plasmacytoid differentiation, large confluent proliferation centers (not representing Richter transformation), and variable proliferative activity linked to IGHV gene mutation status. DLBCL includes activated B-cell-like (ABC) and germinal center B-cell-like (GCB) molecular subtypes with distinct morphological, immunohistological, and genetic features. The training set of 378 cases (after the 60% split) can only capture a fraction of this morphological spectrum, and rare variants may be underrepresented.
SLL/CLL sensitivity gap: The lower sensitivity for SLL/CLL, even after quality control filtering, is a meaningful limitation for screening applications where false negatives must be minimized. The authors note that neoplastic SLL/CLL cells may represent only a small fraction of the overall image area in some patches, causing the model to misclassify them as normal lymph nodes. Increasing the training dataset size, incorporating multiple magnifications, and adding immunohistochemistry-stained sections could help address this gap.
TMA vs. whole-slide images: The study used tissue microarrays rather than full whole-slide images. Although the authors expect comparable performance on whole slides (since individual patches from TMAs and whole slides are similar), this has not been explicitly validated. Whole-slide analysis introduces additional challenges such as tissue folding, variable staining quality, and the need for automated region-of-interest selection. Future studies should validate the algorithm on whole-slide images from multiple institutions to assess generalizability.
Clinical integration vision: The authors envision a workflow where the deep learning algorithm provides an initial classification, which is then confirmed by a pathologist who orders a small, targeted panel of confirmatory immunohistochemical or molecular tests. This would be especially useful in settings where lymph nodes are reviewed for carcinoma metastasis by pathologists with limited hematopathology expertise, since the algorithm could raise alertness for an unsuspected hematological neoplasm such as SLL/CLL. Expanding the model to include more NHL subtypes, validating on multi-center whole-slide data, and integrating with laboratory information systems are key next steps.