Artificial intelligence performance in detecting lymphoma from medical imaging: a systematic review


Plain-English Explanations
Pages 1-2
Why AI for Lymphoma Detection Matters

Lymphoma is a clonal malignancy of lymphocytes diagnosed in approximately 280,000 people annually worldwide. Non-Hodgkin lymphoma (NHL), derived from mature lymphoid cells, accounts for 90.36% of disability-adjusted life-years (DALYs) among lymphomas, while Hodgkin lymphoma (HL) accounts for 14.81%. Roughly 30% of NHL cases arise in extranodal sites, and some subtypes, such as diffuse large B-cell lymphoma (DLBCL), are highly aggressive, making early detection critical for guiding treatment and improving quality of life.

Diagnostic complexity: Lymphoma classification is inherently difficult because lymphocytes serve diverse physiologic immune functions depending on lineage and differentiation stage. Even experienced hematopathologists can struggle to distinguish subtypes. Diagnosis typically requires growth pattern analysis, cytologic features, immunohistochemistry, molecular pathology, and genomic characterization. Inter-observer variability among experts ranges widely from 14.8% to 27.3% when the same samples are assessed with imaging methods such as CT, MRI, and whole slide imaging (WSI).

The promise of AI: Artificial intelligence offers the potential to extend noninvasive tissue analysis beyond established imaging metrics, enable automatic image classification, and improve diagnostic accuracy. Machine learning (ML) and deep learning (DL), two branches of AI, have shown promising results for malignant lymphoma detection. However, prior to this study, no systematic assessment of AI diagnostic performance in lymphoma had been conducted, making this the first meta-analysis of its kind in this disease area.

TL;DR: Lymphoma affects 280,000 people per year, with NHL responsible for over 90% of disease burden. Expert inter-observer variability ranges from 14.8% to 27.3%, highlighting the need for objective AI-based diagnostic tools. This is the first systematic review and meta-analysis of AI performance in lymphoma detection.
Pages 2-3
Search Strategy, Eligibility, and Quality Assessment

The study protocol was registered with PROSPERO (CRD42022383386) and followed PRISMA 2020 guidelines. The authors searched four databases: Medline, Embase, IEEE, and Cochrane, up to December 2023, with no restrictions on region, language, participant characteristics, imaging modality, AI model type, or publication type. The search strategy was developed collaboratively with experienced clinicians and medical researchers.

Inclusion criteria: Studies were included if they reported diagnostic performance of AI models for lymphoma detection using medical imaging, and provided raw performance data such as sensitivity, specificity, AUC, negative predictive values (NPVs), or positive predictive values (PPVs). Eligibility was assessed independently by two investigators, with disagreements resolved through discussion with a third collaborator.
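
To make these metrics concrete, here is a minimal Python sketch of how each one derives from a study's 2x2 contingency table. The function name and the example counts are illustrative, not taken from any included study.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic accuracy metrics from a 2x2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical study: 90 true positives, 10 false negatives,
# 15 false positives, 185 true negatives.
print(diagnostic_metrics(tp=90, fp=15, fn=10, tn=185))
# {'sensitivity': 0.9, 'specificity': 0.925, 'ppv': 0.857..., 'npv': 0.948...}
```

Studies had to report enough of these raw values (or a full contingency table) to be usable in the pooled analysis.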

Exclusion criteria: Case reports, reviews, editorials, letters, conference abstracts, and studies using waveform data (EEG, ECG, visual field data) were excluded. Studies that evaluated image segmentation rather than disease classification, did not use histopathology or expert consensus as the reference standard, or relied on animal or non-human samples were also removed.

Quality assessment: The authors used QUADAS-AI, an AI-specific extension of QUADAS-2 and QUADAS-C, to evaluate risk of bias and applicability concerns. This is notable because most prior reviews in medical AI used QUADAS-2, which does not address terminology and challenges specific to AI diagnostic studies such as algorithm validation and data preprocessing.

TL;DR: Registered on PROSPERO, followed PRISMA 2020, and searched Medline, Embase, IEEE, and Cochrane with no restrictions. Used QUADAS-AI (not the standard QUADAS-2) for quality assessment, providing a more rigorous, AI-specific framework for evaluating bias and applicability.
Pages 3-4
What the 30 Included Studies Looked Like

The initial search identified 1,155 records. After removing 45 duplicates, 1,110 were screened, and 1,010 were excluded for not meeting inclusion criteria. Of the 100 full-text articles reviewed, 70 were excluded, leaving 30 studies focused on lymphoma. Of these 30, sixteen provided sufficient data (contingency tables) to be included in the quantitative meta-analysis.

Study designs and data sources: Twenty-nine of the 30 studies used retrospective data, with only one prospective study. Six studies used open-access data sources. Only five studies excluded low-quality images, while ten did not report anything about image quality. Six studies performed external validation using out-of-sample data, and fifteen did not report the type of internal validation used.

Lymphoma subtypes and imaging modalities: The studies covered a broad range of subtypes: six focused on primary central nervous system lymphoma (PCNSL), six on DLBCL, four on acute lymphoblastic leukemia (ALL), and two on NHL broadly. Individual studies addressed extranodal NK/T-cell lymphoma (ENKTL), splenic and gastric marginal zone lymphomas, and ocular adnexal lymphoma. Imaging modalities included MRI (6 studies), WSI (4 studies), microscopic blood images (4 studies), PET/CT (3 studies), and histopathology images (2 studies).

Algorithms used: Seven studies employed ML algorithms and twenty-three used DL algorithms. Architectures included CNNs, ResNet-18, ResNet-50, EfficientNet, Inception-v3, VGG-16, VGG-19, DenseNet-121, 3D U-Net/ResNet-18, LASSO, artificial neural networks (ANN), stochastic Bayesian neural networks (BNN), and multiple instance learning (MIL). Six studies applied transfer learning while ten did not.

TL;DR: From 1,155 initial records, 30 studies were included (16 in the meta-analysis). Nearly all were retrospective. Studies spanned PCNSL, DLBCL, ALL, ENKTL, and other subtypes across MRI, WSI, PET/CT, and blood image modalities. DL (23 studies) was far more common than ML (7 studies).
Pages 4, 12
How Well AI Algorithms Performed Overall

Across the 16 studies included in the meta-analysis, the pooled sensitivity was 87% (95% CI: 83-91%), pooled specificity was 94% (95% CI: 92-96%), and the AUC was 0.97 (95% CI: 0.95-0.98). These results were generated using hierarchical summary receiver operating characteristic (SROC) curves, which account for both between-study and within-study variation and provide greater credibility for analyses involving small sample sizes.
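
The hierarchical SROC model itself is involved, but the underlying pooling idea can be illustrated with a much simpler univariate random-effects (DerSimonian-Laird) pool of logit-transformed sensitivities. This sketch is a deliberate simplification, not the authors' bivariate model, and all study values below are made up.

```python
import math

def pool_logit(props, ns):
    """DerSimonian-Laird random-effects pooling of proportions on the logit
    scale. A univariate simplification; the paper's hierarchical SROC model
    jointly models sensitivity, specificity, and their correlation."""
    # Logit transform; within-study variance ~ 1 / (n * p * (1 - p))
    y = [math.log(p / (1 - p)) for p in props]
    v = [1.0 / (n * p * (1 - p)) for p, n in zip(props, ns)]
    w = [1.0 / vi for vi in v]
    # Fixed-effect estimate and Cochran's Q
    y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)           # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]  # random-effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1.0 / (1.0 + math.exp(-y_re))    # back-transform to a proportion

# Hypothetical per-study sensitivities and sample sizes
print(round(pool_logit([0.82, 0.91, 0.88, 0.85], [120, 300, 90, 150]), 3))
```

The logit transform keeps pooled proportions inside (0, 1); the bivariate model used in the paper additionally captures the trade-off between sensitivity and specificity across studies.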

Context against conventional methods: The pooled AUC of 0.97 aligns closely with established diagnostic methods. Whole-body MRI (WB-MRI), an emerging radiation-free technique, achieves an AUC of 0.96 (95% CI: 0.91-1.00). The current reference standard, 18F-FDG PET/CT, has a reported AUC of 0.87 (95% CI: 0.72-0.97). Basic CT alone achieves a sensitivity of 81% and a specificity of just 41%. AI algorithms therefore appear competitive with or superior to several conventional imaging approaches, though direct comparisons were inconsistent across studies due to differences in lymphoma subtypes, modality protocols, and reference standards.

Heterogeneity: Extreme heterogeneity was observed among the included studies: I² was 99.35% for sensitivity and 99.68% for specificity (p < 0.0001). Potential sources explored included differences in sample size, algorithm type, geographic distribution, and AI-assisted versus unassisted clinicians. A funnel plot asymmetry test showed no evidence of publication bias (p = 0.49).
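
I² expresses the share of total variation across studies that reflects genuine between-study heterogeneity rather than chance. A minimal sketch of the computation from Cochran's Q, using invented effect estimates and variances rather than the study's data:

```python
def i_squared(effects, variances):
    """Cochran's Q and the I-squared heterogeneity statistic (percent)."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Illustrative logit-scale effects and variances (not the study's data):
print(round(i_squared([1.2, 2.1, 1.5, 2.8], [0.01, 0.02, 0.015, 0.01]), 1))
```

Values above 75% are conventionally read as high heterogeneity, so I² figures above 99% signal that the pooled estimates average over very different study settings.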

TL;DR: Pooled AI performance: sensitivity 87%, specificity 94%, AUC 0.97. This is comparable to WB-MRI (AUC 0.96) and superior to standard CT (sensitivity 81%, specificity 41%) and 18F-FDG PET/CT (AUC 0.87). Extreme heterogeneity (I² > 99%) was present, but no publication bias was detected (p = 0.49).
Pages 12-13
Machine Learning vs. Deep Learning, Transfer Learning, and Geographic Patterns

ML vs. DL: When broken down by algorithm type, ML models showed a pooled sensitivity of 93% (95% CI: 88-95%) compared to 86% (95% CI: 80-90%) for DL. Specificity was similar: 92% (95% CI: 87-95%) for ML and 94% (95% CI: 92-96%) for DL. The higher sensitivity for ML was somewhat counterintuitive, but the authors noted this may reflect the limited dataset sizes in the included studies, where simpler ML methods can sometimes outperform data-hungry DL approaches.

Transfer learning: Six studies used transfer learning and ten did not. Models with transfer learning achieved a pooled sensitivity of 88% (95% CI: 80-93%) and specificity of 95% (95% CI: 92-97%), compared to 85% (95% CI: 80-89%) sensitivity and 91% (95% CI: 88-93%) specificity without it. Transfer learning, which involves reusing a pre-trained model on a new task, has been shown to accelerate learning speed, reduce data requirements, and enhance diagnostic accuracy. McAvoy et al. reported that transfer learning with a high-performing CNN architecture could classify glioblastoma (GBM) and PCNSL with 91-92% accuracy.
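
As a concrete illustration of the pattern, here is a minimal PyTorch sketch of transfer learning with ResNet-18 (one of the architectures reported among the included studies): load ImageNet weights, freeze the feature extractor, and train only a new binary classification head. The data, labels, and hyperparameters are placeholders, not those of any study in the review.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-18.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a binary task
# (e.g., lymphoma vs. non-lymphoma; labels here are hypothetical).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head is trained, which is what reduces data requirements.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 3x224x224 images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```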

Sample size and geography: Studies with sample sizes under 200 (n=11) achieved a pooled sensitivity of 88% (95% CI: 84-92%) and specificity of 91% (95% CI: 87-94%), while studies with over 200 samples (n=5) reached 86% (95% CI: 78-91%) sensitivity and 95% (95% CI: 92-97%) specificity. Geographically, Asian studies (n=10) had higher sensitivity at 88% (95% CI: 83-91%) versus 83% (95% CI: 72-90%) for non-Asian studies (n=6). Asian studies also showed higher specificity at 94% (95% CI: 92-96%) versus 91% (95% CI: 82-96%). No significant between-subgroup differences were found in meta-regression for these factors.

TL;DR: ML sensitivity (93%) exceeded DL (86%), likely reflecting small dataset effects. Transfer learning improved both sensitivity (88% vs. 85%) and specificity (95% vs. 91%). Asian studies showed higher sensitivity (88% vs. 83%) and specificity (94% vs. 91%) than non-Asian studies.
Pages 13, 16
Direct Comparisons Between AI Algorithms and Physicians

Three studies directly compared the diagnostic accuracy of AI algorithms against human clinicians using the same datasets. The results strongly favored AI: the pooled sensitivity was 91% (95% CI: 86-94%) for AI algorithms versus 70% (95% CI: 65-75%) for human clinicians. Pooled specificity was 96% (95% CI: 93-97%) for AI versus 86% (95% CI: 82-89%) for clinicians. This difference in AI versus clinician performance was the only statistically significant source of between-subgroup heterogeneity in the meta-regression analysis (p = 0.01 for sensitivity).

Caveats and context: While AI demonstrated clear quantitative advantages in sensitivity and specificity, the authors emphasized that AI does not incorporate all the information physicians rely on when evaluating a complex case. AI excels at rapid image processing and can work continuously, but clinical decision-making integrates demographic information, patient history, and contextual factors that current AI models do not fully capture. Only three of the 30 included studies performed this head-to-head comparison, which limits the ability to generalize these findings.

Future of AI-physician collaboration: The authors argued that the AI-versus-physician dichotomy is no longer productive. Instead, an AI-physician combination would drive the field forward and reduce healthcare system burdens. Physicians could combine demographic and clinical workflow data with AI outputs, while AI could serve as a cost-effective initial screening or risk categorization tool to improve workflow efficiency. The establishment of cloud-sharing platforms and standardized annotated datasets would be critical to enabling this collaborative model.

TL;DR: In head-to-head comparisons (3 studies), AI achieved 91% sensitivity and 96% specificity versus 70% sensitivity and 86% specificity for clinicians. This was the only statistically significant source of between-subgroup heterogeneity (p = 0.01). However, only 3 studies made this comparison, and AI-physician collaboration is seen as the most promising path forward.
Pages 14-17
Methodological Concerns and Quality Gaps in the Evidence Base

Retrospective design dominance: Only one of the 30 included studies was prospective, and it did not provide a contingency table for meta-analysis. Twelve studies used open-access databases or non-target medical records, and only eleven were conducted in real clinical environments. Retrospective studies with in-silico data sources may not reflect applicable population characteristics or appropriate minority group proportions. Ground truth labels in open-access databases were often derived from data collected for other purposes, with poorly defined disease criteria.

Validation deficiencies: Only six studies performed external validation. For internal validation, three used random splitting and twelve used cross-validation methods. Performance evaluated on in-sample homogeneous datasets may lead to uncertainty about diagnostic accuracy estimates. Only five studies excluded poor-quality images, and none performed quality control on ground truth labels, leaving AI models vulnerable to unidentified biases.
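
To clarify the distinction the authors draw, here is a small scikit-learn sketch contrasting internal cross-validation with external validation on an unseen cohort. The data are synthetic and the model is a stand-in; this illustrates the two evaluation designs, not any included study's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: a development cohort and a separate
# "external" cohort (all values fabricated for illustration).
X_dev, y_dev = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_ext, y_ext = rng.normal(size=(80, 10)), rng.integers(0, 2, 80)

clf = LogisticRegression(max_iter=1000)

# Internal validation: stratified 5-fold cross-validation on the
# development data (the method 12 of the included studies reported).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
internal = cross_val_score(clf, X_dev, y_dev, cv=cv, scoring="roc_auc")
print("internal AUC (5-fold mean):", internal.mean().round(3))

# External validation: fit once on all development data, then score
# on a cohort the model never saw (only 6 studies did this).
clf.fit(X_dev, y_dev)
external = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
print("external AUC:", round(external, 3))
```

Strong internal scores with no external check are exactly the pattern the authors flag: performance on in-sample, homogeneous data can overstate real-world accuracy.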

Quality assessment and reporting bias: Using QUADAS-AI, fourteen studies had high or unclear risk of bias in subject selection due to unreported training/validation/test set breakdowns or derivation from open-source datasets. Seventeen studies had high or unclear risk for the index test domain due to lack of external verification. Ten studies had unclear risk in the reference standard domain. Researchers tend to selectively report favorable results, which may further inflate accuracy estimates, and current reporting standards (STARD 2015) do not fully address the specifics of AI research.

Scope limitations of the meta-analysis itself: The relatively small number of included studies could have skewed diagnostic performance estimates. The restricted number of studies per subgroup (by lymphoma subtype or imaging modality) prevented comprehensive assessment of heterogeneity sources. The wide range of imaging technologies, patient populations, pathologies, study designs, and AI models may have affected accuracy estimation. Finally, the study only evaluated diagnostic performance and cannot speak to AI's impact on patient treatment and outcomes.

TL;DR: Major quality gaps: 29 of 30 studies were retrospective, only 6 performed external validation, only 5 excluded poor-quality images, and none quality-controlled ground truth labels. QUADAS-AI flagged 14 studies for subject selection bias and 17 for index test bias. Results may overestimate real-world AI performance.
Pages 17-18
What Needs to Happen Next for AI in Lymphoma Detection

Multi-center prospective studies: The authors called for a concerted push toward multi-center prospective studies and expansive open-access databases. These should cover diverse ethnicities, hospital-specific variables, and nuanced population distributions to validate the reproducibility and clinical relevance of AI models. Establishing interconnected networks between medical institutions, with unified standards for data acquisition, labeling procedures, and imaging protocols, would enable meaningful external validation.

Improved transparency and reporting: The authors advocated for prospective registration of diagnostic accuracy studies with a priori analysis plans to improve transparency and objectivity. They also encouraged AI researchers to report studies that do not reject the null hypothesis, which would improve both the impartiality and clarity of the evidence base. Disease-specific AI reporting guidelines for lymphoma and related cancers remain absent and are critically needed.

Domain-specific AI models: While time-consuming and difficult, the development of "customized" AI models tailored to specific lymphoma subtypes is one of the authors' central recommendations. This approach would encompass meticulous feature engineering, optimized AI architecture selection, and procedures such as image segmentation and transfer learning. Such tailored models could yield substantial clinical benefits compared to generic AI approaches trained on heterogeneous datasets.

Standardization of quality assessment: The QUADAS-AI framework used in this review provides a starting point, but the field still faces challenges: incomplete uptake of AI-specific appraisal, the absence of a universally adopted formal quality assessment tool, unclear methodological interpretation (especially regarding validation types and human performance comparisons), unstandardized nomenclature, heterogeneous outcome measures, and applicability issues. Addressing these gaps will be essential as the field matures.

TL;DR: Key next steps include multi-center prospective studies, interconnected institutional networks with unified data standards, prospective study registration, publication of negative results, domain-specific AI models for lymphoma subtypes, and refinement of the QUADAS-AI quality assessment framework.
Citation: Bai A, Si M, Xue P, Qu Y, Jiang Y. Artificial intelligence performance in detecting lymphoma from medical imaging: a systematic review. BMC Med Inform Decis Mak. 2024. Open access (CC BY). PMC10775443. DOI: 10.1186/s12911-023-02397-9.