The fundamental question facing any pathologist examining a lymph node biopsy is whether the process is benign or malignant. Follicular lymphoma (FL) is the most common indolent B-cell lymphoma, representing roughly 40% of adult lymphomas in Western countries and 20% worldwide. The median age at presentation is 65 years; patients typically present with generalized lymphadenopathy, and bone marrow infiltration, which is often asymptomatic, occurs in approximately 40% of cases. Under the microscope, FL shows enlarged lymph nodes with architectural effacement by uniform, closely packed neoplastic follicles.
The diagnostic overlap: Reactive lymphoid hyperplasia (follicular hyperplasia) can mimic FL on hematoxylin and eosin (H&E) staining. Key histological features that favor FL include follicles with a predominance of centrocytes, interfollicular centrocytes, absence of mantle zones, absence of starry-sky patterns, and close packing of follicles. In contrast, reactive hyperplasia tends to show preserved germinal center polarization, tingible body macrophages, well-defined mantle zones, and heterogeneous follicle distribution. Definitive diagnosis often requires immunohistochemistry (CD20, CD10, BCL6, BCL2) and sometimes molecular analysis for the characteristic t(14;18)(q32;q21)/IGH::BCL2 translocation.
Prior AI work in lymphoma: While convolutional neural networks (CNNs) have been applied to lymphoma since 2016, most research has focused on radiological images. Only a handful of studies have used H&E histological images. Hashimoto et al. classified 262 malignant lymphoma cases, El Achi et al. classified 128 cases into four diagnostic categories, and Li et al. developed the GOTDP-MP-CNNs platform. However, none of these studies tackled the specific FL vs. reactive tissue differential in a large case series. This study aimed to fill that gap using 221 cases (177 FL and 44 reactive lymphoid tissue) with a ResNet-18 transfer learning approach.
Whole-tissue H&E-stained glass slides from 177 follicular lymphoma and 44 reactive lymphoid tissue cases (tonsils and lymph nodes) were digitized using a NanoZoomer S360 digital slide scanner (Hamamatsu Photonics). The scanned images were visualized using NDP.view2 software and exported as JPEG files at 200x magnification and 150 dpi. These full-resolution images were then split into 224 x 224 x 3 pixel image patches using PhotoScape v3.7.
Manual curation: After splitting, a specialist pathologist (J.C., MD, PhD) manually curated all patches. Patches that were not exactly 224 x 224 pixels, contained less than 20-30% tissue, or showed artifacts such as broken, folded, or otherwise nondiagnostic tissue were excluded. This quality control step is critical because, as the authors emphasize, neural networks are only as good as the data they are trained on. The final dataset comprised 1,495,014 image patches totaling 64.9 GB, with 1,004,508 patches from FL cases (42.0 GB) and 490,506 patches from reactive tissue (22.9 GB).
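The automatic parts of this curation step can be sketched as a simple filter. The Python below is illustrative, not the authors' pipeline: `tissue_fraction`, the near-white cutoff, and the 20% threshold are assumptions standing in for whatever criteria the pathologist actually applied.

```python
def tissue_fraction(patch, white_cutoff=220):
    """Fraction of pixels that are not near-white background.

    `patch` is a nested list of (R, G, B) tuples; a pixel counts as
    tissue when any channel falls below `white_cutoff` (assumed value).
    """
    total = tissue = 0
    for row in patch:
        for r, g, b in row:
            total += 1
            if min(r, g, b) < white_cutoff:
                tissue += 1
    return tissue / total if total else 0.0


def keep_patch(patch, size=224, min_tissue=0.2):
    """Apply the two automatic checks: exact 224 x 224 size, enough tissue."""
    if len(patch) != size or any(len(row) != size for row in patch):
        return False
    return tissue_fraction(patch) >= min_tissue
```

Artifact detection (folds, broken tissue) is the part that genuinely needs the pathologist; thresholds like these only remove the easy rejects.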
Transfer learning with ResNet-18: The study used a transfer learning approach, loading the pretrained ResNet-18 network and replacing the final layers to adapt it for the two-class FL vs. reactive tissue classification. ResNet-18 is a directed acyclic graph (DAG) network with 71 layers and 78 connections that uses residual (shortcut) connections to improve gradient flow and mitigate the vanishing gradient problem. The network accepts 224 x 224 x 3 input images, matching the prepared patch dimensions exactly.
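The role of the shortcut connections can be seen in a few lines. This is a toy pure-Python sketch, not the actual ResNet-18 layers: `residual_block` and its `transform` argument are illustrative stand-ins for a block's learned convolutions.

```python
def residual_block(x, transform):
    """Toy residual (shortcut) connection: y = transform(x) + x.

    The identity term gives gradients a direct path back through the
    block, which is how ResNet architectures mitigate the vanishing
    gradient problem in deep networks.
    """
    return [fx_i + x_i for fx_i, x_i in zip(transform(x), x)]
```

Because the identity term always passes through, a block whose transform outputs zeros simply acts as the identity, so stacking many blocks cannot make the network worse than a shallower one.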
Training configuration: Training was conducted on a workstation with an AMD Ryzen 9 5900X processor, 48 GB RAM, and an NVIDIA GeForce RTX 4080 SUPER (16 GB) GPU using MATLAB R2023b. The training options included stochastic gradient descent with momentum (sgdm) as the solver, an initial learning rate of 0.001, minibatch size of 128, a maximum of 5 epochs, and a validation frequency of 50 iterations. Image normalization was applied using z-score normalization with mean and standard deviation derived from the training data.
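In practice the z-score normalization is applied per channel over the image patches; the sketch below, using a flat list of values, shows the essential pattern. The key point from the text is that the mean and standard deviation are computed on the training split and then reused unchanged for validation and test data.

```python
from statistics import mean, pstdev


def zscore_params(training_values):
    """Fit normalization constants on the training split only."""
    return mean(training_values), pstdev(training_values)


def zscore(values, mu, sigma):
    """Standardize any split (train/val/test) with the training stats."""
    return [(v - mu) / sigma for v in values]
```

Reusing the training statistics on the test split keeps the test data untouched by any statistic it contributed to, which matters for honest evaluation.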
The authors implemented two distinct data partitioning strategies, recognizing a critical methodological concern in patch-level histopathology classification. In the first strategy (Type 1), all image patches were pooled and randomly split: 70% for training (919,153 patches), 10% for validation (131,308 patches), and 20% for testing (262,615 patches). This patch-level random split carries a risk of information leakage because patches from the same patient can appear in both training and testing sets.
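The Type 1 split can be sketched as follows; the 70/10/20 fractions come from the text, while the function name and the `(case_id, patch)` layout are illustrative. Carrying the case identifier with each patch is what makes the leakage visible: after shuffling, patches from one case end up scattered across all three splits.

```python
import random


def patch_level_split(patches, seed=0):
    """Type 1: pool all patches, then split 70/10/20 at random.

    `patches` is a list of (case_id, patch) pairs. Because the shuffle
    ignores case_id, patches from the same case can land in both the
    training and test sets, the information-leakage risk noted above.
    """
    rng = random.Random(seed)
    shuffled = patches[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```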
Patient-level validation (Type 2): To address this, the authors designed a hybrid partitioning approach. Before the main training, 10 FL cases and 10 reactive tissue cases were set aside as an entirely independent test set (Set 2). These 20 cases contributed 190,880 patches (82,263 FL patches and 108,617 reactive patches) that were never seen during training or validation. The remaining cases were used for training/validation and a patch-level test (Set 1). Crucially, Set 2 was evaluated at the case (patient) level rather than the patch level, meaning the system had to correctly classify the majority of patches from each patient to get the overall case diagnosis right.
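The case-level aggregation used for Set 2 amounts to a majority vote over each patient's patches. A minimal sketch, with the function name, data layout, and 50% threshold as assumptions about how the aggregation was done:

```python
def case_level_diagnosis(patch_predictions, threshold=0.5):
    """Aggregate patch-level calls into one call per case (Set 2 style).

    `patch_predictions` maps case_id -> list of per-patch labels
    ('FL' or 'reactive'); a case is called FL when the fraction of
    FL-labeled patches exceeds `threshold`.
    """
    results = {}
    for case_id, labels in patch_predictions.items():
        fl_fraction = labels.count('FL') / len(labels)
        call = 'FL' if fl_fraction > threshold else 'reactive'
        results[case_id] = (call, fl_fraction)
    return results
```

Under this scheme a reactive case with a slim majority of FL-labeled patches is miscalled at the case level, which is exactly the failure mode described in the Set 2 results.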
This two-tier approach is a significant methodological strength. Many deep learning studies in digital pathology only report patch-level metrics, which can be inflated by information leakage. By including an independent patient-level test set, the authors provided a more clinically realistic evaluation of how the model would perform when encountering entirely new patients.
The training process converged rapidly, reaching a stable state during the first epoch. The authors attribute this fast convergence to two factors: the use of transfer learning from a pretrained ResNet-18 (which already contains useful low-level image features) and the high quality of the manually curated dataset. The full training run completed in 19.5 hours across 5 epochs, with 40,875 total iterations (8,175 iterations per epoch). The validation accuracy reached 99.81%.
Test set performance: On the held-out 20% test set (Type 1), the model achieved an accuracy of 99.80% at the image-patch level for FL classification. The full performance metrics were: precision 99.8%, recall/sensitivity 99.8%, false positive rate 0.35%, specificity 99.7%, and F1 score 99.8% (the harmonic mean of the precision and recall). These numbers indicate extremely strong discriminative ability between FL and reactive lymphoid tissue at the patch level, with fewer than 4 in every 1,000 reactive patches incorrectly flagged as FL.
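All of these patch-level numbers derive from a single 2x2 confusion matrix, with FL as the positive class. A small helper makes the relationships explicit; note that the false positive rate is 1 minus specificity, and that when precision and recall are equal their harmonic mean (F1) equals them too.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix (FL = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)               # = 1 - specificity
    f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall,
            'specificity': specificity, 'fpr': fpr, 'f1': f1}
```

The counts used below are illustrative, not the study's actual confusion matrix.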
Type 2 partitioning results: When the same training approach was repeated with the hybrid partitioning (excluding the 20 independent cases), the training took 14.3 hours across 35,900 iterations. The validation accuracy was 99.83%, and the test Set 1 accuracy was also 99.83%. These results closely mirror the Type 1 findings, confirming that the model's performance was not inflated by information leakage.
The independent patient-level test (Set 2) provided a more clinically realistic evaluation. Each of the 20 withheld cases was analyzed individually, with the percentage of patches classified as FL or reactive used to determine the overall case-level diagnosis. All 10 follicular lymphoma cases were correctly predicted, with FL patch percentages ranging from 81.1% (FL6) to 100% (FL3 and FL8). Most FL cases had very high FL patch rates above 98%.
Reactive tissue results and misclassifications: Among the 10 reactive lymphoid tissue cases, 8 were correctly classified with reactive patch percentages ranging from 74.1% (R5) to 100% (R3). However, 2 out of 10 reactive cases (20%) were incorrectly diagnosed as FL. Case R7 had only 36.7% of patches classified as reactive (63.3% classified as FL), and case R9 had only 48.5% of patches classified as reactive (51.5% classified as FL). The authors examined these discordant cases and found that both showed a nodular pattern with slightly homogeneous follicles, features that closely mimic FL morphology.
This finding is clinically important. The two misclassified reactive cases represent exactly the kind of diagnostically challenging tissue that also trips up human pathologists. The fact that the model struggled with these ambiguous cases suggests it is learning the same morphological features that pathologists rely on, rather than exploiting artifacts. It also underscores that AI classification should be viewed as a screening aid rather than a standalone diagnostic tool, particularly for borderline cases where immunohistochemistry remains essential.
Deep learning models are often described as "black boxes" because understanding how they reach a classification decision is not straightforward. This study applied three explainable AI (XAI) techniques to investigate what the trained ResNet-18 was "looking at" when making its predictions. Each method takes a different approach to generating visual explanations.
Grad-CAM (Gradient-weighted Class Activation Mapping): This technique uses the classification score gradients with respect to the final convolutional feature map to create a coarse localization map. Red regions in the Grad-CAM overlay indicate areas that most strongly influenced the network's prediction for a given class. The authors found that Grad-CAM was the easiest to interpret, as it consistently highlighted lymphocyte-rich areas that are diagnostically relevant. However, Grad-CAM has lower spatial resolution and can miss fine details.
LIME (Local Interpretable Model-Agnostic Explanations): LIME explains a single prediction by sampling the model's output at perturbed versions of the input and fitting a simple, interpretable surrogate model to those samples. It produces a feature importance map overlaid on the original image with transparency, showing which image regions most affected the classification score.
Occlusion sensitivity: This perturbation-based method systematically occludes small regions of the input image and measures the effect on the prediction probability. The brightest regions in the resulting heat map indicate locations where occlusion had the biggest impact. While more computationally expensive than Grad-CAM, occlusion sensitivity provides higher spatial resolution and can reveal fine-grained diagnostic features.
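Occlusion sensitivity is simple enough to sketch end to end. The toy below uses a 2D list as the "image" and a stand-in `score_fn` in place of the trained network; in the real method the occluder is typically a grey patch swept across the input with a chosen stride, and the score is the class probability.

```python
import copy


def occlusion_map(image, score_fn, size=2, fill=0.0):
    """Toy occlusion sensitivity: blank out each size x size square and
    record how much the model's score drops.

    Larger drops mean the occluded region mattered more to the
    prediction, so it appears brighter in the heat map.
    """
    base = score_fn(image)
    h, w = len(image), len(image[0])
    heat = [[0.0] * w for _ in range(h)]
    for i in range(0, h, size):
        for j in range(0, w, size):
            occluded = copy.deepcopy(image)
            for di in range(i, min(i + size, h)):
                for dj in range(j, min(j + size, w)):
                    occluded[di][dj] = fill
            drop = base - score_fn(occluded)
            for di in range(i, min(i + size, h)):
                for dj in range(j, min(j + size, w)):
                    heat[di][dj] = drop
    return heat
```

The computational expense mentioned above is visible in the structure: one full forward pass of the model is needed per occluded position.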
Together, these three XAI methods confirmed that the model's decisions were driven by histologically meaningful tissue features rather than image artifacts or background patterns. This is a necessary step toward clinical trust, as pathologists need to understand why an AI system reaches a particular diagnosis before they can rely on it in practice.
Task specificity: The authors explicitly characterize their system as "narrow AI," meaning it can only distinguish between two categories: follicular lymphoma and reactive lymphoid tissue. In clinical reality, there are over 200 types of lymphoma, and a pathologist must consider many other differential diagnoses including mantle cell lymphoma, marginal zone lymphoma, small lymphocytic lymphoma, and Hodgkin lymphoma. The model cannot handle any of these, and misapplying it outside its trained scope would produce unreliable results.
Single-center data and imaging platform: All cases came from Tokai University, and all slides were scanned on a single NanoZoomer S360 scanner at a fixed magnification (200x) and resolution (150 dpi). Differences in tissue processing, staining protocols, scanning hardware, and image resolution across institutions could significantly degrade performance. The authors acknowledge this as a limitation for generalization. No external validation cohort from a different institution was included.
Imbalanced dataset: The dataset contained 177 FL cases but only 44 reactive tissue cases, creating a roughly 4:1 imbalance. At the patch level, FL contributed over 1 million patches compared to roughly 490,000 for reactive tissue. While the model still achieved high accuracy, the 20% case-level misclassification rate for reactive tissue in the independent test set may partly reflect this imbalance. The model may be biased toward predicting FL in ambiguous cases.
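The authors do not report a specific mitigation for this imbalance. One standard option for a future retraining would be inverse-frequency class weights in the loss function, sketched here with the patch counts from the paper; the function name is illustrative.

```python
def inverse_frequency_weights(counts):
    """Weight each class's loss inversely to its patch count so errors
    on the minority (reactive) class are not under-penalized.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Patch counts reported in the study: ~2:1 at the patch level.
weights = inverse_frequency_weights({'FL': 1_004_508, 'reactive': 490_506})
```

With these counts the reactive class ends up weighted roughly twice as heavily as FL, directly counteracting the patch-level imbalance.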
No integration with immunohistochemistry: The study used only H&E-stained images. In clinical practice, the differential between FL and reactive hyperplasia often requires BCL2, CD10, CD20, and BCL6 immunostaining, as well as molecular analysis for t(14;18). Integrating multi-modal data (H&E plus IHC) could potentially improve performance, but the authors note that standardizing the immunohistochemical panel across all samples would be a prerequisite.
The authors position this trained ResNet-18 model as a foundation for future work rather than a final clinical product. The primary proposed next step is to use the current model as a pretrained network for transfer learning, retraining it on larger, more diverse datasets that include multiple lymphoma subtypes. This approach would leverage the FL-specific features the network has already learned while expanding its diagnostic capabilities.
Multi-magnification analysis: The study used patches exclusively at 200x magnification. The authors cite work by Miyoshi et al. that used image patches at multiple magnifications (5x, 20x, and 40x), and suggest this multi-scale approach could be valuable because it captures both architectural patterns (visible at low magnification) and cytological details (visible at high magnification). Integrating these scales might improve the model's ability to handle borderline cases like the two reactive tissue misclassifications observed in this study.
Multimodal integration: Combining H&E image analysis with immunohistochemical data and molecular characteristics could substantially boost diagnostic accuracy, particularly for ambiguous cases. The authors also note that the dataset and trained model have been made publicly available through the OpenAIRE Zenodo repository, which should facilitate external validation studies and collaborative development by other research groups.
Scaling this work to a clinically deployable tool would require multi-center training data, prospective validation, and regulatory approval. Nonetheless, the 99.80% patch-level accuracy and successful patient-level classification demonstrated here provide a strong proof of concept that deep learning can meaningfully assist pathologists in one of the most common differential diagnoses in lymphoma pathology.