Diffuse large B-cell lymphoma (DLBCL) is an aggressive cancer of B lymphocytes and accounts for roughly 30% of all non-Hodgkin lymphoma diagnoses in Western countries. The standard first-line therapy combines rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone (R-CHOP). Clinicians currently use the International Prognostic Index (IPI), which incorporates age, WHO performance status, Ann Arbor stage, serum lactate dehydrogenase level, and number of extranodal disease sites, to stratify patients into risk categories and guide treatment intensity.
Despite improvements in identifying low-risk patients at baseline and during treatment, approximately one-third of DLBCL patients do not respond to first-line treatment or eventually relapse. Early identification of high-risk patients could enable more tailored treatment strategies, potentially improving long-term outcomes. 18F-fluorodeoxyglucose (18F-FDG) PET/CT imaging is the established modality for staging DLBCL, and interim PET-adapted treatment approaches are increasingly integrated into national recommendations.
Quantitative PET parameters, particularly metabolic tumour volume (MTV) and the maximal distance between the largest lesion and any other lesion (Dmaxbulk), have shown promise as prognostic factors. However, extracting these features requires manual or semi-automated tumour segmentation, which is time-consuming and observer-dependent. This motivates the use of convolutional neural networks (CNNs) that can generate predictions directly from PET images without requiring manual segmentation.
The study used baseline 18F-FDG PET/CT scans from two separate prospective multicenter clinical trials. The HOVON-84 trial served as the training dataset. From an initial pool of 373 DLBCL patients, 317 met the DICOM quality and whole-body scan completeness criteria. After excluding 7 patients lost to follow-up within 2 years and 14 who died of unrelated causes, 296 patients remained: 244 classified as TTP0 (time-to-progression longer than 2 years) and 52 classified as TTP1 (progression within 2 years).
The PETAL trial provided the external validation dataset. Starting from 1,098 patients, the authors excluded those with non-DLBCL diagnoses, incomplete or artefact-affected scans, missing DICOM information, different treatment regimens (12 patients), loss to follow-up within 2 years (24 patients), and death without progression (19 patients). This yielded 340 patients: 279 TTP0 and 61 TTP1. Importantly, after correction for IPI, there were no significant survival differences between the PETAL and HOVON-84 cohorts.
Both datasets were previously published by Eertink et al., enabling direct comparison with segmentation-based radiomics approaches. All patients provided written consent, and both studies were approved by their respective institutional review boards. The HOVON-84 study was approved by the Erasmus MC institutional review board, and the PETAL study was approved by the Federal Institute for Drugs and Medical Devices along with the ethics committees of all participating sites.
MIP Generation: Rather than feeding full 3D PET/CT volumes into the CNN (which would be computationally expensive), the authors used maximum intensity projections (MIPs), which compress 3D scans into 2D images by projecting the maximum voxel intensity along a viewing direction. Both coronal and sagittal MIPs were generated at a resolution of 275 × 200 pixels with a 4 × 4 mm pixel size. MIPs were normalized using a fixed maximum SUV of 40 (based on the maximum tumour intensity across all scans), and values above this threshold were truncated to prevent high-uptake organs such as the bladder from dominating the normalization.
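As an illustration, the projection-and-normalization step described above can be sketched in a few lines of NumPy. The function name, axis conventions, and synthetic volume are assumptions for demonstration, not the authors' code:

```python
import numpy as np

SUV_MAX = 40.0  # fixed normalization ceiling reported in the study

def make_mip(suv_volume: np.ndarray, axis: int) -> np.ndarray:
    """Project a 3D SUV volume to a 2D MIP along one axis,
    truncate at SUV_MAX, and scale to [0, 1]."""
    mip = suv_volume.max(axis=axis)      # maximum intensity projection
    mip = np.clip(mip, 0.0, SUV_MAX)     # truncate high-uptake organs (e.g. bladder)
    return mip / SUV_MAX                 # normalize to [0, 1]

# Synthetic (z, y, x) volume; coronal ~ project along y, sagittal ~ along x
vol = np.random.rand(275, 200, 200) * 10.0
coronal = make_mip(vol, axis=1)
sagittal = make_mip(vol, axis=2)
```

Because the ceiling is fixed across the cohort rather than per scan, a patient's overall uptake level remains comparable between images, which matters for a CNN trained on intensity patterns.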
Lesion Segmentation: The ACCURATE tool was used to generate lesion masks via a standardized uptake value (SUV) threshold of 4.0, which Barrington et al. had identified as the preferred segmentation threshold. Physiological uptake adjacent to tumours was manually removed. Three types of MIP inputs were generated: lesion-only MIPs (containing just the tumour segmentation), regular MIPs (full PET scan projections), and brain-removed MIPs (BR-MIPs, where the brain was excluded to ensure consistency, since some scans did not fully include the head).
Data Sampling: Because the training set was heavily imbalanced (244 TTP0 vs. 52 TTP1), the authors divided the TTP0 patients into 5 stratified subsets of approximately equal size (subsets A through E). Three randomly selected TTP0 patients from other subsets were added to each subset to match the 52 TTP1 count, yielding balanced subsets of approximately 104 patients each (50% prevalence per class). Each subset underwent fivefold cross-validation with an 80/20 training-to-internal-validation split.
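A minimal sketch of this balancing scheme, assuming integer patient IDs and NumPy for the random draws (the function name and exact top-up logic are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsets(ttp0_ids, ttp1_ids, n_subsets=5):
    """Split the majority class (TTP0) into n_subsets parts, then top each
    part up with patients borrowed from the other parts until it matches
    the minority-class count, and pair it with all TTP1 patients."""
    ttp0 = rng.permutation(ttp0_ids)
    parts = [list(p) for p in np.array_split(ttp0, n_subsets)]
    target = len(ttp1_ids)
    subsets = []
    for i, part in enumerate(parts):
        others = [pid for j, p in enumerate(parts) if j != i for pid in p]
        need = target - len(part)  # a few extra TTP0 patients per subset
        topped = part + list(rng.choice(others, size=need, replace=False))
        subsets.append(topped + list(ttp1_ids))  # 50% prevalence per class
    return subsets

# 244 TTP0 vs. 52 TTP1 as in HOVON-84 (IDs are placeholders)
subsets = balanced_subsets(list(range(244)), list(range(1000, 1052)))
```

Each resulting subset holds 52 TTP0 and 52 TTP1 patients, matching the ~104-patient balanced subsets described above.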
The CNN architecture uses a dual-branch design: one branch processes the coronal MIP and the other the sagittal MIP. Both branches share an identical architecture and run in parallel, with their outputs merged at the final dense layer. Each branch consists of 4 convolutional layers interleaved with max pooling layers. The number of feature maps doubles at each convolutional layer, from 16 in the first layer to 128 in the fourth. Each convolutional layer uses 3 × 3 filters with the ReLU activation function.
Regularization and Pooling: A spatial dropout of 0.35 is applied after each convolutional layer; unlike standard dropout, spatial dropout randomly drops entire feature maps (rather than individual nodes) during training to prevent overfitting. Three MaxPooling layers with pool sizes of (3,3), (3,3), and (2,2) handle dimensionality reduction. After the final convolutional and dropout layers, a GlobalAveragePooling2D (GAP2D) layer averages each feature map to a single value, collapsing the output into a flat feature vector. The coronal and sagittal outputs are then concatenated at a fully connected layer (FCL) that outputs probabilities for two classes, P(TTP0) and P(TTP1), using a softmax activation.
Training Configuration: The model was compiled using the Adam optimizer with a learning rate of 0.00005 and a decay rate of 0.000001. Three training schemes were explored. The Lesion MIP CNN trained for 200 epochs on lesion-only MIPs. The MIP CNN used a two-step transfer learning approach: 200 epochs on lesion masks followed by 300 epochs on regular MIPs. The BR-MIP CNN followed the same two-step process but used brain-removed MIPs in the second step. All models were implemented in Python 3.9.16 with Keras 2.10.0 and TensorFlow 2.10.0.
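Putting the architecture and training configuration together, a rough Keras sketch might look as follows. The padding mode, pooling placement, single-channel input, and loss function are assumptions not stated in the text, and the paper's decay of 0.000001 corresponds to the older Adam `decay` argument in TF 2.10, omitted here for portability:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_branch(shape=(275, 200, 1)):
    """One MIP branch: 4 conv layers (16 -> 128 feature maps, 3x3 filters,
    ReLU), spatial dropout of 0.35 after each conv layer, three pooling
    stages, and global average pooling at the end."""
    inp = layers.Input(shape=shape)
    x = inp
    pools = [(3, 3), (3, 3), (2, 2), None]  # assumed placement of the 3 pools
    for filters, pool in zip([16, 32, 64, 128], pools):
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.SpatialDropout2D(0.35)(x)
        if pool:
            x = layers.MaxPooling2D(pool)(x)
    x = layers.GlobalAveragePooling2D()(x)
    return inp, x

cor_in, cor_out = make_branch()   # coronal MIP branch
sag_in, sag_out = make_branch()   # sagittal MIP branch
merged = layers.Concatenate()([cor_out, sag_out])
probs = layers.Dense(2, activation="softmax")(merged)  # P(TTP0), P(TTP1)

model = Model([cor_in, sag_in], probs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="categorical_crossentropy",  # assumed; softmax over two classes
)
```

The two-step transfer learning schemes would then call `model.fit` twice: first on lesion-only MIPs, then continuing training on regular (or brain-removed) MIPs.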
Internal Validation (HOVON-84): Among the 5 data subsets, the model trained on subset C consistently performed best across all three CNN variants. For the Lesion MIP CNN, the cross-validated AUC was 0.75 (SD 0.07) with sensitivity of 0.63 (SD 0.19) and specificity of 0.73 (SD 0.08). The MIP CNN achieved a CV-AUC of 0.70 (SD 0.06) with sensitivity 0.75 (SD 0.08) and specificity 0.55 (SD 0.10). The BR-MIP CNN yielded a CV-AUC of 0.72 (SD 0.11) with sensitivity 0.71 (SD 0.14) and specificity 0.63 (SD 0.19). These values were comparable to or better than the IPI prediction model, which had a reported AUC of 0.68 for the same dataset.
External Validation (PETAL): The BR-MIP CNN achieved the highest external AUC of 0.74, with sensitivity of 0.54 and specificity of 0.81. The Lesion MIP CNN produced an AUC of 0.72 (sensitivity 0.59, specificity 0.80), and the MIP CNN reached an AUC of 0.71 (sensitivity 0.62, specificity 0.72). The BR-MIP CNN significantly outperformed the IPI model (AUC 0.67, sensitivity 0.57, specificity 0.68) based on the DeLong test (p-value = 0.035). Statistical significance was also found between the BR-MIP CNN and IPI during internal validation (p-value = 0.015).
The BR-MIP CNN was selected as the primary model for further analysis because it achieved the best external validation AUC, does not require prior tumour segmentation (unlike the Lesion MIP CNN), and uses brain-removed MIPs for consistency across scanners that may or may not include the full head.
To verify that the CNN was making clinically meaningful predictions rather than exploiting imaging artefacts, the authors conducted two plausibility analyses. First, they assessed the association between the CNN-generated P(TTP1) probabilities and two established PET-extracted prognostic features: metabolic tumour volume (MTV) and Dmaxbulk. In the HOVON-84 dataset, a moderate association was found between MTV and P(TTP1), while Dmaxbulk showed a weak association. In the PETAL dataset, both MTV and Dmaxbulk showed moderate associations with P(TTP1). In all cases, higher predicted progression probabilities correlated with higher MTV and Dmaxbulk values.
Tumour Ablation Experiment: The second plausibility analysis involved synthetically removing tumours from the MIP images and re-running the CNN predictions. Tumour voxels were replaced with the average of surrounding non-background voxel intensities. Probabilities were then calibrated using logistic regression coefficients fitted to the original data. The results showed that high P(TTP1) values (above 0.6) in the original MIPs were substantially reduced to below 0.4 after tumour removal, confirming that the CNN is primarily using tumour-related information for its predictions.
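The voxel-replacement step could be approximated in 2D as below; using the mean of all non-tumour, non-background pixels is a simplification of the "surrounding voxels" rule described above, and the function name and threshold are assumptions:

```python
import numpy as np

def ablate_tumour(mip: np.ndarray, tumour_mask: np.ndarray,
                  background_thresh: float = 0.01) -> np.ndarray:
    """Replace tumour pixels with the mean intensity of non-tumour,
    non-background pixels (stand-in for the local surrounding average)."""
    healthy = (~tumour_mask) & (mip > background_thresh)
    filled = mip.copy()
    filled[tumour_mask] = mip[healthy].mean()
    return filled

# Toy MIP: uniform body uptake of 0.2 with a bright "tumour" patch
mip = np.full((8, 8), 0.2)
mask = np.zeros((8, 8), dtype=bool)
mask[2:4, 2:4] = True
mip[mask] = 0.9
ablated = ablate_tumour(mip, mask)
```

Re-running the CNN on such ablated images and seeing P(TTP1) drop is what supports the claim that the predictions are driven by tumour-related signal rather than artefacts.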
Visual inspection of individual CNN outputs showed that patients with fewer tumours and lower dissemination received lower progression probability values, while patients with more tumours and higher dissemination received higher values. These findings suggest the model captures both tumour volume and spatial dissemination patterns, though deep learning methods may also attend to textural features that conventional PET parameters miss.
Brain Removal Challenges: Some DLBCL patients develop lesions near or within the brain, which complicates the brain removal step required for the BR-MIP CNN. In about 1% of patients, lesions adjacent to the brain were truncated by the removal step and could not be fully resolved automatically, highlighting the need for clinician supervision in these edge cases.
Sensitivity and Specificity Cut-off: The cut-off value of 0.5 used to calculate sensitivity and specificity was optimized on the HOVON-84 dataset. Applying it unchanged to the PETAL dataset produced notably different sensitivity and specificity values, so slight adjustment of this threshold may be needed for other external cohorts to achieve comparable performance metrics.
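For concreteness, a small helper shows how sensitivity and specificity follow from a chosen cut-off on the predicted progression probabilities (the labels and probabilities below are illustrative, not the study's data):

```python
import numpy as np

def sens_spec(y_true, p_ttp1, cutoff=0.5):
    """Sensitivity and specificity of predicting progression (TTP1)
    at a given probability cut-off."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(p_ttp1) >= cutoff
    sensitivity = (pred & y_true).sum() / y_true.sum()      # TP / (TP + FN)
    specificity = (~pred & ~y_true).sum() / (~y_true).sum() # TN / (TN + FP)
    return sensitivity, specificity

y = [1, 1, 0, 0, 0]           # 1 = progression within 2 years (TTP1)
p = [0.7, 0.4, 0.3, 0.6, 0.1] # hypothetical P(TTP1) outputs
sens, spec = sens_spec(y, p, cutoff=0.5)
```

Because the probability distribution shifts between cohorts, a cut-off tuned on one dataset can trade sensitivity against specificity quite differently on another, which is exactly the behaviour observed between HOVON-84 and PETAL.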
Patient Selection Bias: The HOVON-84 trial excluded patients with Ann Arbor stage 1 disease and those with central nervous system involvement. The absence of limited-stage patients and those with very poor prognosis represents a potential bias that could affect model performance and generalizability to broader DLBCL populations. A more extensive external validation with a wider patient spectrum is needed.
Model Interpretability: End-to-end CNNs are inherently complex and difficult to interpret. While the authors partially addressed this through association analyses with known PET parameters and tumour ablation experiments, the lack of full explainability remains a barrier for clinical translation. It is not yet clear whether end-to-end CNNs can outperform segmentation-based models or radiomic approaches, and both avenues should continue to be explored.
This study represents one of the first investigations into end-to-end CNN-based treatment outcome prediction in DLBCL using 18F-FDG PET/CT MIP images with proper external validation. The authors note that prior studies using MIPs for DLBCL outcome prediction (such as the multi-task ranker neural network by Rebaud et al. and the multi-task 3D U-Net by Liu et al.) did not validate their models on external datasets as recommended by the RELAINCE guidelines for nuclear medicine AI studies.
Segmentation-Based vs. End-to-End Approaches: The authors plan to investigate segmentation-based CNNs for DLBCL treatment outcome prediction and compare them with the end-to-end approach presented here. Segmentation-based models offer easier interpretability because the lesion segmentation can be visually inspected, and derived features like tumour volume and dissemination can be directly examined. The question of whether end-to-end CNNs can outperform these more transparent models remains open.
Clinical Integration: Recent recommendations by Westin et al. suggest assessing progression within or after 1 year of first-line treatment rather than the 2-year window used in this study. Future work may need to adapt the prediction endpoint accordingly. Additionally, combining CNN-derived features with handcrafted radiomics analysis and clinical parameters could yield more robust prognostic models. The International Metabolic Prognostic Index (IMPI) by Mikhaeel et al., which combines Ann Arbor stage, age, and MTV for 3-year progression-free survival prediction, represents another benchmark for comparison.