Lymphomas are cancers derived from lymphocytes, and accurate diagnosis requires distinguishing among many subtypes that can look remarkably similar under the microscope. A definitive diagnosis typically depends not just on hematoxylin and eosin (H&E) stained tissue but also on immunohistochemical (IHC) stains, flow cytometry, and molecular studies. Unlike many areas of pathology where H&E alone suffices, lymphoma diagnosis almost always requires identifying the cell of origin (B-cell, T-cell, or NK cell), which cannot be reliably determined from H&E sections alone.
The cost barrier: While H&E staining is inexpensive and widely available, IHC stains and flow cytometry require costly equipment, expensive reagents, and specially trained personnel. The global shortage of pathologists makes this problem even more acute, particularly in low- and middle-income countries. Strategies that help general pathologists reduce the number of ancillary studies needed could meaningfully lower the cost of lymphoma diagnosis.
Prior AI work: Machine learning tools applied to H&E-stained lymphoma images have achieved accuracies of 94% to 100%, but only when classifying among 2 to 4 diagnostic categories (for example, DLBCL vs. non-DLBCL, or DLBCL vs. Burkitt lymphoma). These narrow classification tasks do not reflect the full complexity of real-world pathology, where a pathologist must distinguish among many more subtypes simultaneously.
This study introduces LymphoML, an interpretable machine learning approach that classifies lymphomas into eight diagnostic categories using only H&E-stained tissue microarray (TMA) cores. The eight categories are: aggressive B-cell lymphoma (Agg BCL), diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), classic Hodgkin lymphoma (CHL), mantle cell lymphoma (MCL), marginal zone lymphoma (MZL), natural killer T-cell lymphoma (NKTCL), and mature T-cell lymphoma (TCL). These groupings are therapeutically driven, meaning subtypes that require similar treatments are binned together.
The dataset consisted of 670 formalin-fixed, paraffin-embedded (FFPE) biopsy specimens collected at the Instituto de Cancerologia y Hospital Dr. Bernardo Del Valle (INCAN) in Guatemala between 2006 and 2018. Half of each FFPE block was shipped to Stanford University for whole-slide image generation. Two hematopathologists reviewed the slides, selected regions of interest, and constructed tissue microarrays (TMAs) with two cores per sample. The TMAs were scanned at 40x magnification (0.25 microns per pixel) on an Aperio AT2 scanner.
Diagnostic categories: Diagnoses were established using the WHO classification and binned into the eight categories above. The dataset distribution was heavily imbalanced: DLBCL dominated with 272 cases, followed by CHL (97), MCL (63), FL (53), NKTCL (46), TCL (36), MZL (25), and Agg BCL (10). Of the 670 specimens, 68 failed quality control (insufficient tissue per core or missing ground-truth diagnoses) and were excluded, leaving 602 samples.
Data splits: The remaining 602 samples were split at the core level into training (70%), validation (10%), and test (20%) sets, with stratified sampling ensuring proportional representation of all eight categories in each split. Importantly, all cores, and therefore all patches, from the same patient were kept in the same split to prevent data leakage. All TMA blocks were also stained for 46 different IHC markers to establish ground-truth diagnoses.
The study population from Guatemala is notable because it represents a cohort not typically included in digital pathology datasets, making this work particularly relevant for building diverse, globally representative training data for computational pathology tools.
LymphoML follows an interpretable, feature-engineering pipeline rather than an end-to-end deep learning approach. The process starts with patch extraction: non-overlapping patches are extracted at 40x magnification from each TMA core, with patches that are more than 95% background (defined as pixels with saturation below 0.05 in HSV space) excluded.
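The background filter described above can be sketched in a few lines. This is an illustration of the thresholding logic, not the authors' code; it assumes patches arrive as RGB pixels scaled to [0, 1].

```python
import colorsys

def is_background_patch(patch, sat_thresh=0.05, frac_thresh=0.95):
    """patch: iterable of (r, g, b) tuples with channels in [0, 1].

    A pixel counts as background when its HSV saturation falls below
    sat_thresh; the patch is discarded when more than frac_thresh of its
    pixels are background, mirroring the thresholds in the text.
    """
    pixels = list(patch)
    n_bg = 0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r, g, b)  # returns (hue, saturation, value)
        if s < sat_thresh:
            n_bg += 1
    return n_bg / len(pixels) > frac_thresh

# A near-white patch (very low saturation) is flagged as background,
# while a strongly stained pink/purple patch is kept.
white = [(0.98, 0.97, 0.98)] * 100
stained = [(0.6, 0.2, 0.5)] * 100
```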
Nuclei segmentation: The authors evaluated two deep learning segmentation models, HoVer-Net (based on a pre-trained ResNet-50 backbone) and StarDist (which predicts a star-convex polygon representation for each nucleus). The two models' segmentations agreed closely, with a mean Intersection over Union (mIoU) of 0.762 between them. StarDist was selected because downstream models using its segmentations achieved marginally higher accuracy (64.3% vs. 61.5% for HoVer-Net).
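As a toy illustration of the mIoU comparison between the two segmenters (a sketch, not the paper's evaluation code), each segmentation can be represented as a set of foreground pixel coordinates and scored pairwise:

```python
def iou(mask_a, mask_b):
    """mask_a, mask_b: sets of (row, col) foreground pixel coordinates.
    IoU = |A ∩ B| / |A ∪ B|."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 1.0

def mean_iou(pairs):
    """pairs: list of (mask_a, mask_b), one pair per patch."""
    scores = [iou(a, b) for a, b in pairs]
    return sum(scores) / len(scores)

# Two toy 'patches': perfect overlap on the first, partial on the second.
a1 = {(0, 0), (0, 1), (1, 0)}
b1 = {(0, 0), (0, 1), (1, 0)}
a2 = {(0, 0), (0, 1)}
b2 = {(0, 1), (1, 1)}
score = mean_iou([(a1, b1), (a2, b2)])  # (1.0 + 1/3) / 2
```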
Feature extraction: Using the StarDist binary masks, the pipeline extracts geometric features for each nucleus including Feret diameters, convex hull area, circularity, elongation, and convexity. CellProfiler was then used to extract a comprehensive set of morphological, color intensity, and textural features from both nuclei and cells. A color deconvolution step first separated hematoxylin and eosin channels. For each measurement, the mean, standard deviation, skew, kurtosis, and percentiles were computed across all nuclei in a patch, yielding a total of 1,595 features per patient.
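The per-patch aggregation step can be sketched for a single measurement, say nucleus area: pool the values across all nuclei and compute the summary statistics named above. The percentile levels here (10th/50th/90th) and the nearest-rank percentile rule are illustrative assumptions; the text does not list the exact set used.

```python
import math

def summarize(values, pcts=(10, 50, 90)):
    """Summary statistics pooled over one measurement across all nuclei
    in a patch: mean, std, skew, kurtosis, and selected percentiles."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Population skewness and (non-excess) kurtosis.
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4) if std else 0.0
    ordered = sorted(values)
    # Simple nearest-rank percentiles (an assumption, for illustration).
    pct_vals = {f"p{p}": ordered[min(n - 1, int(p / 100 * n))] for p in pcts}
    return {"mean": mean, "std": std, "skew": skew, "kurtosis": kurt, **pct_vals}

areas = [30.0, 32.0, 31.0, 60.0, 29.0]  # toy nucleus areas in square microns
stats = summarize(areas)
```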
Spatial features: Architectural features were derived from two sources: CellProfiler's spatial features (bounding box coordinates, centroids) and clustering tendency (CT) features computed using Ripley's K function, which measures the spatial distribution of cell locations at different radii. The optimal radii range was determined by cross-validation.
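Ripley's K can be estimated directly from cell centroids. The minimal sketch below omits the edge corrections a production implementation would apply, and the region area and radii grid are toy values (recall that the paper selects the radii range by cross-validation):

```python
import math

def ripley_k(points, area, radii):
    """Naive Ripley's K estimate (no edge correction):
    K(r) = A / (n * (n - 1)) * sum over ordered pairs i != j of 1[d_ij <= r].
    Values above the Poisson expectation (pi * r^2) indicate clustering at scale r."""
    n = len(points)
    out = []
    for r in radii:
        count = 0
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if i != j and math.hypot(xi - xj, yi - yj) <= r:
                    count += 1
        out.append(area * count / (n * (n - 1)))
    return out

# A tight cluster of 4 cells in a 10x10 region: K is zero below the
# nearest-neighbor distance and large once the radius spans the cluster.
pts = [(1.0, 1.0), (1.2, 1.0), (1.0, 1.2), (1.2, 1.2)]
k_vals = ripley_k(pts, area=100.0, radii=[0.1, 0.5])
```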
Classification models: LightGBM, a gradient-boosted tree algorithm, was used for classification. Focal loss and balanced class weighting (inversely proportional to class frequency) addressed the severe class imbalance. The authors also benchmarked two deep learning models: a ResNet-50 pre-trained on H&E and IHC slides, and a TripletNet pre-trained on the CAMELYON16 breast cancer dataset. These used 224x224 pixel patches with 50% overlap and were fine-tuned with a learning rate of 0.001, Adam optimizer, batch size of 128, and 100 epochs.
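The balanced class-weighting scheme can be illustrated with the cohort's class counts. The normalization below (average sample weight of 1, as in scikit-learn's "balanced" heuristic) is an assumption; the text states only that weights were inversely proportional to class frequency.

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so rare
    classes get proportionally larger weights and the mean sample weight is 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy label list using three of the cohort's class counts.
labels = ["DLBCL"] * 272 + ["CHL"] * 97 + ["AggBCL"] * 10
weights = inverse_freq_weights(labels)
```

With these counts, the 10-case Agg BCL class receives roughly 27 times the weight of the 272-case DLBCL class, which is what lets the loss pay attention to rare subtypes.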
The first set of experiments tested whether nuclear shape features had higher diagnostic yield than nuclear texture or cytoplasmic features. The model using only nuclear morphological features achieved 59.7% top-1 test accuracy (95% CI: 51.2% to 68.2%). Adding nuclear texture or cytoplasmic features improved performance by only 1-2%, suggesting that shape alone carries most of the diagnostic signal.
Per-subtype performance: Nuclear features were most discriminative for DLBCL (F1: 76.2%), classic Hodgkin lymphoma (F1: 65.3%), and mantle cell lymphoma (F1: 51.6%). This aligns with known pathological criteria. DLBCL is defined by sheets of large B-cells with nuclei at least the size of a histiocyte nucleus, so nuclear size features naturally help distinguish it from other subtypes.
SHAP feature importance: Using SHapley Additive exPlanation (SHAP) analysis, the authors identified the most impactful features. The majority of the top 20 nuclear features were area-shape measurements: mean radius, minor axis length, maximum Feret diameter, solidity, orientation, maximum radius length, and nuclei area. A parsimonious model using only the top 8 SHAP-selected features per class achieved 61.2% accuracy (95% CI: 53.4% to 69.0%) using just 10% of all features, demonstrating that a small set of interpretable features captures most of the diagnostic information.
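The "top features by SHAP" selection reduces to ranking features by mean absolute attribution. A minimal sketch, assuming the samples-by-features attribution matrix has already been produced by a SHAP explainer (the values here are toy; the feature names come from the list above):

```python
def top_k_by_mean_abs_shap(shap_values, feature_names, k):
    """shap_values: list of per-sample rows, one attribution per feature.
    Returns the k feature names with the largest mean |SHAP| value."""
    n = len(shap_values)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n
        scores.append((mean_abs, name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]

shap_values = [
    [0.9, -0.1, 0.05],
    [-0.8, 0.2, 0.00],
]
names = ["mean_radius", "solidity", "orientation"]
top = top_k_by_mean_abs_shap(shap_values, names, k=2)
```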
Grouped feature analysis: When related morphological features were grouped and analyzed via SHAP, the nuclear size feature group had the largest mean absolute SHAP value. This confirms that among all nuclear features, size-related measurements were most helpful for classifying DLBCL, CHL, and MCL. The minor axis length, for example, was significantly different between DLBCL and MCL cases, consistent with the WHO definition of DLBCL as a large-cell lymphoma.
Incorporating architectural features (spatial relationships between nuclei) further improved performance. The best model using all H&E features except clustering tendency, labeled the "Best H&E Model," achieved 64.3% top-1 accuracy (95% CI: 55.7% to 72.9%). This was the highest accuracy among all H&E-based models, though the improvement over the nuclear morphological model alone did not reach statistical significance. Notably, the Best H&E Model achieved 71.0% F1 score for MCL (95% CI: 55.0% to 87.0%), a 19.4 percentage point improvement over the nuclear-only model's MCL F1 of 51.6%.
Deep learning comparison: The authors hypothesized that interpretable feature-engineering models would outperform deep learning given the limited labeled examples per subtype. This proved correct. TripletNet achieved only 52.8% test accuracy (95% CI: 44.2% to 61.4%) and ResNet achieved 53.5% (95% CI: 44.8% to 62.2%). ResNet was statistically significantly inferior to the Best H&E Model. Both deep learning approaches underperformed the nuclear morphological model by approximately 5% in both accuracy and F1 score.
The key insight is that with limited labeled data spread across many categories, feature engineering approaches that encode domain-relevant knowledge (nuclear size, shape, spatial arrangement) outperform end-to-end deep learning that must learn these representations from scratch. The authors note that deep learning models would likely match prior published accuracies if restricted to the same small number of classes with sufficient samples per class.
The Best H&E Model was compared against four pathologists: two hematopathologists reviewing H&E TMAs, one hematopathologist reviewing whole-slide images (WSIs), and one general pathologist reviewing WSIs. All pathologists were blinded to IHC results and the final diagnosis. The Best H&E Model's 64.3% accuracy was numerically higher than every pathologist's: Hematopathologist 1 on TMAs achieved 56.1%, Hematopathologist 2 on TMAs achieved 60.1%, Hematopathologist 3 on WSIs reached 63.5%, and the General Pathologist on WSIs scored 56.1%.
Statistical testing: Two-tailed paired t-tests and Two One-Sided Tests (TOST) for equivalence confirmed that the Best H&E Model was non-inferior to the General Pathologist on WSIs and Hematopathologist 1 on TMAs. There was no statistically significant difference between the model and any pathologist. The model's AUROC was 85.9%, with sensitivity of 66.9% and specificity of 88.7%.
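For intuition, multi-class sensitivity and specificity can be computed as macro-averages over one-vs-rest confusion matrices. This is a sketch of the standard definition, under the assumption of macro-averaging; the paper does not publish its exact averaging code.

```python
def macro_sens_spec(cm):
    """cm: square confusion matrix, rows = true class, cols = predicted.
    For each class c: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),
    treating class c as positive and everything else as negative; then average."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    sens, spec = [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
    return sum(sens) / k, sum(spec) / k

# Toy two-class confusion matrix.
cm = [[8, 2],
      [1, 9]]
sens, spec = macro_sens_spec(cm)
```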
Per-subtype strengths and weaknesses: The model excelled at MCL classification with a 71.0% F1 score, surpassing all hematopathologists by more than 18 percentage points and achieving statistical superiority over Hematopathologist 1 on TMAs and Hematopathologist 3 on WSIs. However, the model completely failed on MZL (F1: 0.0%) and TCL (F1: 0.0%), categories where pathologists achieved 30% and 23.5% F1 scores respectively. The model performed consistently well only for DLBCL (F1: 78.7%), CHL (F1: 74.5%), and MCL (F1: 71.0%), the three categories with sufficiently large numbers of cases in the cohort.
A key clinical finding of this study is that combining the H&E-based model with results from just six immunohistochemical stains (CD10, CD20, CD3, EBV-ISH, BCL1/cyclin D1, and CD30) achieved nearly identical diagnostic accuracy to using a full panel of 46 IHC stains. The baseline model using all 46 immunostains (without H&E) achieved 86.1% accuracy (95% CI: 80.0% to 92.2%). Using only the six selected stains without H&E yielded 75.2% accuracy (95% CI: 68.2% to 82.2%), which was statistically inferior to the full 46-stain panel.
The critical combination: When the Best H&E Model was augmented with the six selected immunostains, accuracy jumped to 85.3% (95% CI: 79.9% to 90.7%), showing no statistically significant difference from the 46-stain model. This is the first demonstration that combining computational H&E analysis with a standardized, limited IHC panel can match the diagnostic accuracy of a comprehensive stain panel.
For this analysis, lymphoma subtypes were also grouped into five clinically actionable categories: B-cell lymphomas (DLBCL and Agg BCL), CHL, FL and MZL, MCL, and T-cell lymphomas (NKTCL and TCL). The six candidate stains were selected based on their expected diagnostic yield given the categories in the cohort. Each immunostain result (positive, negative, or "cannot interpret") was included as an additional categorical feature to the Best H&E Model.
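One simple way to realize "each immunostain as an additional categorical feature" is one-hot encoding over the three result levels and concatenating with the H&E feature vector. The encoding below is an assumption for illustration, with stain names taken from the panel above (BCL1 standing in for BCL1/cyclin D1):

```python
STAINS = ["CD10", "CD20", "CD3", "EBV-ISH", "BCL1", "CD30"]
LEVELS = ["positive", "negative", "cannot_interpret"]

def encode_ihc(results):
    """results: dict mapping stain name to one of LEVELS.
    Missing stains default to 'cannot_interpret'. Returns a flat 0/1 vector
    of length len(STAINS) * len(LEVELS)."""
    vec = []
    for stain in STAINS:
        level = results.get(stain, "cannot_interpret")
        vec.extend(1 if level == lvl else 0 for lvl in LEVELS)
    return vec

he_features = [0.42, 7.1, 0.88]  # toy stand-ins for the H&E morphological features
ihc = encode_ihc({"CD20": "positive", "CD3": "negative"})
combined = he_features + ihc  # the augmented feature vector fed to the classifier
```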
The cost implications are significant. H&E-stained slides are cheaper than IHC stains by at least an order of magnitude. If computational tools can extract maximal diagnostic information from H&E and reduce the required IHC panel from 46 stains to 6, the savings in reagent costs, equipment time, and personnel effort could be substantial, especially in resource-limited settings where repeat biopsies to obtain additional tissue for staining may be impractical.
Single-institution data: All specimens were collected and processed at a single institution using a single slide scanner (Aperio AT2). The model's generalizability to cohorts from other institutions, collected with different technical setups and scanned on different machines, remains unvalidated. Scanner-dependent color and resolution variations are a well-known challenge in digital pathology.
TMAs vs. whole-slide images: Tissue microarrays capture only a small portion of the full tumor volume and are much smaller than whole-slide images. The authors acknowledge that training on WSIs would likely yield more powerful models. However, TMAs have a practical advantage: they do not require expensive manual patch-level annotations because the cores are already enriched for lymphoma tissue, making them more cost-effective for computational tool development in low- and middle-income countries.
Severe class imbalance: Several diagnostic categories had very few examples. Agg BCL, MZL, and TCL each had fewer than 10 patients in some splits, which is insufficient to capture the wide morphological variability within these subtypes. This directly explains the model's failure on MZL and TCL (both 0.0% F1). The model achieved consistent performance only for the three best-represented categories: DLBCL, CHL, and MCL.
Clinical implementation gap: To deploy LymphoML in a clinical setting, a pathologist would need to first suspect one of the eight validated diagnostic categories, select a region of interest, and then let the model render a favored diagnosis. Future work would need to expand the number of diagnostic categories, incorporate a more diverse set of background tissues, and potentially enable automated identification of target lesions without requiring manual region selection. Despite these limitations, the study demonstrates that interpretable feature engineering can extract meaningful diagnostic information from limited tissue, and that combining H&E analysis with a small standardized IHC panel could substantially reduce lymphoma diagnostic costs worldwide.