A deep learning diagnostic platform for diffuse large B-cell lymphoma with high accuracy

Nature Communications, 2020

Plain-English Explanations
Pages 1-2
Why Diagnosing DLBCL Demands Near-Perfect Accuracy

Diffuse large B-cell lymphoma (DLBCL) is the most common aggressive form of non-Hodgkin lymphoma, characterized by a diffuse or sheet-like proliferation of large neoplastic B cells. Accurate diagnosis is critical because treatment regimens differ substantially between DLBCL and morphologically similar conditions. Pathologists must exclude other B-cell lymphomas with large-cell morphology, including mantle cell lymphoma, lymphoblastic lymphoma, and plasmablastic lymphoma, as well as non-hematopoietic tumors such as carcinoma, melanoma, and sarcoma that can mimic DLBCL. Immunohistochemistry (checking for B-cell markers like CD20 and PAX5) and flow cytometry are routinely required to supplement H&E morphology.

The clinical bar: In clinical practice, diagnostic accuracy of 100%, or at minimum greater than 99%, is considered mandatory so that no patient who requires treatment is missed. At the time of this study, no existing deep learning model had reached that threshold for any hematopoietic malignancy based solely on reading H&E-stained pathologic slides. Prior work by Achi et al. (2019) achieved only 95% diagnostic accuracy for DLBCL in an intra-hospital test, leaving a meaningful gap before clinical deployment could be justified.

The data challenge: Medical imaging datasets are inherently small compared to general computer vision benchmarks. Acquiring, annotating, and distributing pathologic images is expensive, requires expert clinicians, and is constrained by patient privacy regulations. Typical hospitals can provide only hundreds of DLBCL cases, not thousands. This places heavy emphasis on data augmentation, regularization, and transfer learning to prevent overfitting while still reaching the required accuracy levels.

TL;DR: DLBCL is the most common aggressive non-Hodgkin lymphoma. Clinical diagnosis demands near-100% accuracy, but prior AI models reached only 95%. Small dataset sizes from individual hospitals make this a particularly difficult deep learning problem.
Pages 2-3
GOTDP-MP-CNNs: An Ensemble of 17 Pretrained Networks

Rather than relying on a single convolutional neural network (CNN), the authors developed the Globally Optimized Transfer Deep-Learning Platform with Multiple Pretrained CNNs (GOTDP-MP-CNNs). This platform combines 17 distinct pretrained architectures: AlexNet, GoogLeNet (ImageNet), GoogLeNet (Places365), ResNet18, ResNet50, ResNet101, VGG16, VGG19, InceptionV3, InceptionResNetV2, SqueezeNet, DenseNet201, MobileNetV2, ShuffleNet, Xception, NASNetMobile, and NASNetLarge. Each network was originally pretrained on a large natural-image corpus (ImageNet's million-plus images across 1,000 object categories for most; Places365 scene images for one GoogLeNet variant) and then fine-tuned on the DLBCL pathology data through transfer learning.

Global optimization: A key innovation is the use of SDL (a global optimization algorithm originally developed for DNA microarray data analysis) to tune training hyperparameters across all 17 networks simultaneously. Rather than manually testing individual networks and hoping to find the best performer by trial and error, the platform treats all 17 models as a pool. Learning rates ranged from 0.01 to 0.0001, mini-batch sizes from 32 to 128, and training ran for up to 30 epochs with validation frequency of 20. Three optimization algorithms (SGDM, RMSProp, and Adam) were used to minimize loss functions across the ensemble.

Ensemble scoring: For classification, no single best-performing model was selected. Instead, each of the 17 retrained networks produced a prediction score for both the DLBCL and non-DLBCL classes. The final diagnosis was determined by averaging these scores across all 17 models: if the mean DLBCL score was greater than or equal to the mean non-DLBCL score, the image was classified as DLBCL. This ensemble approach leverages the complementary strengths of architectures with different design philosophies, from lightweight mobile-optimized networks to deep residual models.
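As a minimal sketch of this decision rule (with made-up scores from three models standing in for the 17 real ones; the actual platform averages per-image softmax outputs in MATLAB):

```python
# Hypothetical per-model prediction scores: each pair is (DLBCL, non-DLBCL).
# A real run would have 17 pairs, one per retrained network.
scores = [(0.91, 0.09), (0.48, 0.52), (0.75, 0.25)]

def ensemble_diagnose(scores):
    """Average class scores across models; ties go to DLBCL (the >= rule)."""
    mean_dlbcl = sum(s[0] for s in scores) / len(scores)
    mean_non_dlbcl = sum(s[1] for s in scores) / len(scores)
    return "DLBCL" if mean_dlbcl >= mean_non_dlbcl else "non-DLBCL"

print(ensemble_diagnose(scores))  # -> DLBCL (mean 0.713 vs 0.287)
```

Note that the tie-breaking direction (>= rather than >) deliberately favors a DLBCL call, consistent with the clinical priority of never missing a lymphoma.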

Software and hardware: The entire platform was built on MATLAB R2019a, using the deep learning toolbox and the image processing toolbox. Remarkably, all experiments were conducted on modest hardware: an Intel Core i5-8250U CPU server with 8 GB RAM and a Microsoft Surface Pro laptop with an Intel Core i7-4650U. No GPU acceleration was used, demonstrating that the approach is feasible without specialized computing infrastructure.

TL;DR: The GOTDP-MP-CNNs platform combines 17 pretrained CNNs (from AlexNet to NASNetLarge) into a single ensemble. SDL global optimization tunes hyperparameters across all networks. Final diagnosis averages prediction scores from all 17 models. Built in MATLAB on consumer-grade hardware with no GPU required.
Pages 2-4
Three Hospitals, Small Datasets, and Deliberate Image Variation

Pathologic samples were collected from three independent hospitals, each with different slide preparation and imaging workflows. Hospital A contributed 500 DLBCL and 505 non-DLBCL cases, each represented by a single photographed image at x400 magnification (2,592 x 1,944 pixels, 14.4 MB per image, pixel size 2.2 um). Hospital C provided 204 DLBCL and 198 non-DLBCL cases, also one photograph per patient (2,048 x 1,536 pixels, 5-8 MB, pixel size 3.45 um, cropped to 1,075 x 1,075). Hospital B used whole-slide image (WSI) scanning, contributing 1,467 DLBCL images from 163 cases and 1,656 non-DLBCL images from 184 cases, with nine randomly selected 945 x 945 pixel regions per patient.

Non-DLBCL categories: The non-DLBCL class was deliberately broad, encompassing reactive/non-neoplastic lymph nodes, metastatic carcinomas, melanomas, and other lymphoma subtypes including small lymphocytic lymphoma/chronic lymphocytic leukemia, mantle cell lymphoma, follicular lymphoma, classical Hodgkin lymphoma, and T-cell lymphomas. This diversity forced the AI models to learn robust discriminative features rather than simply distinguishing DLBCL from normal tissue.

Data splitting and no curation: For each hospital, 80% of images were used for training, 10% for validation, and 10% for testing. The authors also tested 0.6/0.2/0.2 and 0.7/0.15/0.15 splits and found no significant difference in results. Critically, the datasets were not curated. Images with poor quality, air bubbles, empty spaces, and tissue-processing artifacts were intentionally retained to test real-world applicability. All diagnoses were confirmed by board-certified hematopathologists and cross-referenced with immunohistochemistry, molecular biology, and clinical data.
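The paper does not publish its splitting code; a minimal sketch of a random 80/10/10 split (ignoring details such as grouping hospital B's nine patches per patient, which would be needed to avoid patient-level leakage) could look like:

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle and split into train/validation/test (test takes the remainder)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(n * train), int(n * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# Hospital A's 1,005 images as illustrative item IDs.
train, val, test = split_dataset(range(1005))
print(len(train), len(val), len(test))  # -> 804 100 101
```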

TL;DR: Hospital A: 1,005 photographed images. Hospital B: 3,123 scanned WSI patches from 347 patients. Hospital C: 402 photographed images. Non-DLBCL included reactive nodes, carcinomas, melanomas, and multiple lymphoma subtypes. Data split 80/10/10 with no image curation, preserving artifacts for real-world testing.
Pages 3-5
Near-Perfect Intra-Hospital Diagnostic Accuracy

Each hospital's AI model was trained and tested exclusively on data from that hospital. The results were striking. Model A (hospital A) achieved 100% diagnostic accuracy on its test set. Model C (hospital C) also achieved 100%. Model B (hospital B, using WSI patches) reached 99.71%. By comparison, the individual CNNs within the ensemble showed average accuracies ranging from 87% to 96% across the three hospitals. For example, in hospital A, individual network accuracies ranged from 86.14% (ResNet50) to 98.02% (Xception). In hospital B, they ranged from 84.57% (DenseNet201) to 96.14% (InceptionResNetV2). The ensemble consistently outperformed every individual network.

CIFAR-10 validation: To independently validate their core algorithms beyond medical imaging, the authors tested the GOTDP-MP-CNNs on the CIFAR-10 benchmark (60,000 32x32 color images in 10 classes). Their platform achieved 96.88% accuracy, exceeding the previously published best result of 96.53% out of 49 independent submissions. This external validation confirmed that the ensemble optimization strategy generalizes beyond the DLBCL domain.

Comparison with pathologists: In a head-to-head test using 531 additional H&E-stained images from hospital B, seven experienced pathologists were invited to read the slides at x400 magnification. The pathologists spent approximately 60 minutes on average and achieved a maximum accuracy of only 74.39%. The AI model completed the same task in under one minute with 100% accuracy. The authors note that pathologists typically rely on multiple magnification levels plus immunohistochemistry and molecular tests, so 74.39% on H&E alone is not unexpected. Nevertheless, the AI model's ability to extract diagnostic features from a single magnification level was clearly superior.

TL;DR: Intra-hospital accuracy: 100% (Hospital A), 99.71% (Hospital B), 100% (Hospital C). Individual CNNs ranged from 84-98%. CIFAR-10 benchmark: 96.88% vs. prior best of 96.53%. AI vs. 7 pathologists on 531 images: 100% vs. 74.39% maximum, completed in under 1 minute vs. approximately 60 minutes.
Pages 5-6
Technical Variability Degrades Cross-Hospital Performance

When Model A (trained on hospital A data) was applied directly to hospital C images without any preprocessing, diagnostic accuracy dropped from 100% to 82.09%. The authors identified a key source of this degradation: image shape mismatch. Hospital A used rectangular images (4:3 aspect ratio), while hospital C images were square. When fed into Model A, the square images were distorted, introducing spurious variation. After normalizing the hospital C images to match hospital A's aspect ratio, accuracy recovered to 90.50%, a substantial improvement but still nearly 10 percentage points below the intra-hospital baseline.
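The paper does not detail the normalization procedure beyond matching hospital A's rectangular shape; center-cropping square images to a 4:3 aspect ratio is one illustrative way to remove the distortion, sketched here purely as dimension arithmetic:

```python
def crop_to_aspect(width, height, target_w=4, target_h=3):
    """Compute center-crop dimensions so width:height matches target_w:target_h.

    One illustrative way to avoid stretching square images into a 4:3
    model input; returns (new_width, new_height).
    """
    if width * target_h > height * target_w:            # too wide: trim width
        return (height * target_w) // target_h, height
    return width, (width * target_h) // target_w        # too tall: trim height

# Hospital C's 1,075 x 1,075 square crop reshaped toward hospital A's 4:3.
print(crop_to_aspect(1075, 1075))   # -> (1075, 806)
print(crop_to_aspect(2592, 1944))   # hospital A images are already 4:3
```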

Eliminating technical variation: Two additional experiments confirmed that the accuracy drop was driven by technical, not biological, differences. First, Model A was tested on 110 new images collected from the same hospital A after the model was built. Accuracy remained at 100%, confirming the model generalizes within its own institution. Second, Model B (trained on hospital B WSIs) was applied to images from a fourth hospital (hospital D), where slide preparation procedures were deliberately matched to hospital B's protocols and the same scanner was used for image collection. This cross-hospital test also achieved 100% accuracy, demonstrating that standardizing preparation and imaging equipment completely eliminates the performance gap.

Practical implications: These results reveal that the AI models are sensitive to staining protocols, tissue processing methods, and imaging equipment rather than to genuine morphological differences between patient populations. The authors conclude that two viable strategies exist: either standardize slide preparation and imaging across all hospitals (challenging but effective), or build hospital-specific "customized" AI models using the relatively small datasets that each institution can provide.

TL;DR: Cross-hospital accuracy dropped from 100% to 82.09% (or 90.50% after shape normalization) due to differences in slide preparation and imaging. Within the same hospital, 100% accuracy was maintained on new data. When preparation and scanning equipment were standardized across hospitals B and D, cross-hospital accuracy returned to 100%.
Pages 5-6
Investigating the Single Discordant Case in Hospital B

The 99.71% accuracy in hospital B implied that at least one case was misclassified. The authors traced this to a single patient who was originally diagnosed as DLBCL by pathologists but was flagged as non-DLBCL by the AI model. Rather than treating this as a model failure, the team conducted a thorough clinical review. The patient had a poor response to conventional DLBCL therapy and lacked typical DLBCL clinical symptoms, raising questions about the original diagnosis.

Diagnostic history: Tracing the patient's records revealed that the original diagnosis was follicular lymphoma or follicular diffuse mixed type, a condition that can progress to DLBCL over time. The pathologic images analyzed by the AI model reflected tissue collected before the patient had fully transitioned to DLBCL. In retrospect, the AI model's classification of this case as non-DLBCL may have been more accurate than the pathologists' diagnosis. If this case were excluded (as the authors argue it should have been, given the ambiguous histology), hospital B's accuracy would also reach 100%.

Clinical significance: In a clinical screening workflow, the AI model would serve as an initial filter, allowing pathologists to skip reading a large number of clearly non-DLBCL slides. False-positive cases (non-DLBCL flagged as DLBCL) are acceptable in this context because pathologists would review those cases anyway. False negatives (missed DLBCL) are not acceptable. The zero false-negative rate across all three hospitals (with the arguable exception of the borderline case above) supports the model's suitability for this screening role. Both false-positive and false-negative rates were zero for hospitals A and C, and the false-positive rate was zero for hospital B as well.

TL;DR: The single "missed" case in hospital B (99.71%) was a patient originally diagnosed with follicular lymphoma who was transitioning to DLBCL. The AI may have been more accurate than the pathologists. Excluding this ambiguous case yields 100% accuracy. False-positive and false-negative rates were zero across hospitals A and C.
Pages 7-8
Transfer Learning, Scoring, and the Ensemble Decision Rule

Transfer learning approach: All 17 pretrained networks were initialized with ImageNet weights and biases. Fine-tuning was performed using learning rates from 0.01 to 0.0001, mini-batch sizes of 32 to 128, and a maximum of 30 training epochs. Three optimization algorithms were tested: Stochastic Gradient Descent with Momentum (SGDM), RMSProp, and Adam. The SDL global optimization algorithm selected the best combination of hyperparameters for each network within the ensemble, eliminating the need for manual hyperparameter tuning.
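The paper names the optimizer pool and the hyperparameter ranges, but not the exact grid points SDL explores or its selection logic. A minimal sketch of such a search space, assuming three values per hyperparameter (an assumption, not the published grid), might look like:

```python
from itertools import product

# Ranges from the paper; the specific grid points are assumed for illustration.
learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [32, 64, 128]
optimizers = ["sgdm", "rmsprop", "adam"]

def candidate_configs():
    """Enumerate one training configuration per grid point for a network."""
    for lr, bs, opt in product(learning_rates, batch_sizes, optimizers):
        yield {"lr": lr, "batch_size": bs, "optimizer": opt, "max_epochs": 30}

configs = list(candidate_configs())
print(len(configs))  # -> 27 combinations per network
```

A global optimizer like SDL would evaluate such candidates against validation loss for each of the 17 networks rather than exhaustively training all of them.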

Ensemble decision rule: Each of the 17 retrained models produces a score matrix S with two rows (DLBCL and non-DLBCL) and N columns (one per model). The final DLBCL score (TS_DLBCL) is the average of all 17 individual DLBCL scores, and similarly for TS_Non-DLBCL. The diagnosis is DLBCL if TS_DLBCL is greater than or equal to TS_Non-DLBCL, and non-DLBCL otherwise. This simple averaging approach avoids complex weighting schemes while still capturing the consensus of diverse architectures.

Evaluation metrics: The authors defined standard metrics: sensitivity (TP / (TP + FN)), specificity (TN / (TN + FP)), and accuracy ((TP + TN) / (TP + TN + FP + FN)). True positives were cases correctly identified as DLBCL, and true negatives were cases correctly identified as non-DLBCL. The emphasis was on eliminating false negatives (missed DLBCL cases) while minimizing false positives, reflecting the clinical priority of never missing a lymphoma diagnosis.
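These definitions translate directly into code. The confusion-matrix counts below are illustrative only, chosen to mirror hospital B's single false negative; the paper reports rates, not raw counts:

```python
def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy as defined in the paper."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts with exactly one false negative and no false positives.
sens, spec, acc = metrics(tp=146, tn=197, fp=0, fn=1)
print(round(acc, 4))  # -> 0.9971
```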

TL;DR: Transfer learning from ImageNet with 3 optimizers (SGDM, RMSProp, Adam), learning rates 0.01-0.0001, and up to 30 epochs. Diagnosis is decided by averaging prediction scores across all 17 networks. Clinical priority: zero false negatives to avoid missed DLBCL diagnoses.
Pages 6-8
Single-Magnification Design, Hospital Specificity, and Path Forward

Cross-hospital generalization: The most significant limitation is that AI models trained on one hospital's data do not reliably transfer to another hospital without standardizing slide preparation and imaging equipment. The accuracy drop of roughly 10 to 18 percentage points in cross-hospital tests (from 100% to 82.09-90.50%) means that each hospital would need to build its own model or adopt unified protocols. In a fragmented healthcare system with thousands of independent pathology labs, this represents a substantial barrier to widespread deployment.

Single magnification: All models were trained and tested exclusively on x400 magnification images. In routine practice, pathologists examine tissue at multiple magnification levels to assess architectural patterns at low power and cellular details at high power. The AI model's reliance on a single magnification level means it may miss features that are only apparent at other scales. Integrating multi-scale analysis could potentially improve both accuracy and diagnostic confidence.

Binary classification only: The current system classifies images as DLBCL or non-DLBCL, which is clinically useful as a screening step but does not address DLBCL subtype classification (germinal center B-cell vs. activated B-cell, for example) or differentiation among the various non-DLBCL entities. Extending the platform to multi-class classification across all hematopoietic malignancies would be a more complete clinical tool. The authors suggest this as a natural next step.

Dataset size and diversity: While the authors demonstrate that high accuracy is achievable with fewer than 1,000 samples per hospital, the total dataset across all three hospitals remains modest. Hospital-specific models were validated only on held-out test sets from the same institution, with no external validation cohort beyond the cross-hospital experiments. Prospective, multi-site clinical trials with standardized imaging would provide stronger evidence for clinical adoption. The data is available upon request but is not deposited in a public repository, limiting independent reproducibility.

Future directions: The authors envision extending the GOTDP-MP-CNNs platform to other hematopoietic malignancies, DLBCL subtype classification, and potentially integration with immunohistochemistry image analysis. Standardizing slide preparation across institutions, or developing domain adaptation techniques that can bridge the technical gap between hospitals, would be critical for scaling this approach beyond single-hospital deployment.

TL;DR: Key limitations include hospital-specific model dependency (a 10-18 point cross-hospital accuracy drop), single x400 magnification, binary classification only (DLBCL vs. non-DLBCL), and small dataset sizes with no public data repository. Future work targets multi-class hematopoietic malignancy classification, multi-scale imaging, and cross-hospital standardization.
Citation: Li D, Bledsoe JR, Zeng Y, et al. A deep learning diagnostic platform for diffuse large B-cell lymphoma with high accuracy. Nature Communications, 2020. Open Access. Available at: PMC7691991. DOI: 10.1038/s41467-020-19817-3. License: CC BY.