Survival Prediction in DLBCL via Multimodal PET/CT

Overview and Background

Pages 1-2

Why Better Prognostic Models Are Needed for DLBCL

Diffuse large B-cell lymphoma (DLBCL) is the most common form of non-Hodgkin lymphoma and is characterized by significant genetic and phenotypic heterogeneity. Despite standard immunochemotherapy with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone), approximately 30-40% of patients experience refractory disease or death. Identifying reliable prognostic factors early in the treatment course could help clinicians stratify patients and tailor therapy more effectively.

Current clinical scoring systems: The International Prognostic Index (IPI), Revised IPI (R-IPI), and NCCN-IPI are the primary tools used to estimate outcomes for DLBCL patients. However, these indices rely on a handful of clinical variables and fall short of accurately identifying patients who will relapse or have poor long-term prognosis. They do not capture the underlying tumor biology visible in imaging data.

The promise of PET/CT imaging: Pre-treatment [18F]-FDG PET/CT imaging has shown value in DLBCL prognosis through semiquantitative metabolic parameters such as total lesion glycolysis (TLG), baseline metabolic tumor volume (MTV), and standardized uptake value (SUV). These metrics help assess tumor heterogeneity and risk stratification but cannot capture the internal structural nuances of the tumor. Radiomics, which extracts quantitative features from medical images, offers a potential solution, but traditional radiomics methods struggle to extract deep, nonlinear features from images.
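As a rough illustration of how these semiquantitative parameters relate to one another, the sketch below computes SUVmax, SUVmean, MTV, and TLG for one segmented lesion, using the common definition TLG = MTV x SUVmean. The function name, the toy SUV array, and the voxel volume are hypothetical and are not taken from the study.

```python
import numpy as np

def metabolic_parameters(suv, mask, voxel_volume_ml):
    """Return SUVmax, SUVmean, MTV (mL) and TLG for one segmented lesion.

    suv: array of SUV values; mask: boolean array of the same shape marking
    lesion voxels; voxel_volume_ml: volume of a single voxel in millilitres.
    """
    suv = np.asarray(suv, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    lesion = suv[mask]
    if lesion.size == 0:
        raise ValueError("mask selects no voxels")
    mtv = lesion.size * voxel_volume_ml      # metabolic tumor volume
    suv_mean = float(lesion.mean())
    return {
        "SUVmax": float(lesion.max()),
        "SUVmean": suv_mean,
        "MTV": mtv,
        "TLG": mtv * suv_mean,               # total lesion glycolysis
    }

# Toy 2 x 2 "image" with two lesion voxels (illustrative values only).
suv = np.array([[1.0, 2.0], [6.0, 10.0]])
mask = np.array([[False, False], [True, True]])
params = metabolic_parameters(suv, mask, voxel_volume_ml=0.5)
print(params)
```

All four values are summary statistics over the lesion mask, so they carry no information about how uptake is arranged inside the tumor.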

Study objective: This study aimed to develop a combined prognostic model for DLBCL patients by constructing a multimodal PET-CT deep features radiomics signature (DFR-signature) using fused PET/CT images and an Automated Machine Learning (AutoML) framework. The goal was to predict both progression-free survival (PFS) and overall survival (OS) more accurately than existing unimodal or traditional radiomics approaches.

TL;DR: Despite R-CHOP, approximately 30-40% of DLBCL patients experience refractory disease or death. Current prognostic tools (IPI, R-IPI, NCCN-IPI) lack precision. This study builds a deep learning radiomics model from fused PET/CT images to better predict survival.

Study Design and Patient Cohort

Pages 2-3

369 Patients from Two Medical Centers

This retrospective study enrolled 369 DLBCL patients from two institutions: 225 patients from Nanjing Drum Tower Hospital (affiliated with Nanjing University) and 144 patients from West China Hospital of Sichuan University. The standard regimen was 6 cycles of R-CHOP followed by 2 cycles of rituximab; patients with IPI 0 and no bulky disease received only 4 cycles of R-CHOP followed by 2 cycles of rituximab, consistent with the FLYER trial protocol. Patients who received radiotherapy for residual PET/CT-positive lesions were excluded, as were those receiving the Pola-R-CHP regimen.

Cohort division: The 369 patients were randomly split into a training cohort (n = 258) and an internal validation cohort (n = 111) at a 7:3 ratio. There was no statistically significant difference between the baseline characteristics of the two cohorts (P > 0.05). The median follow-up was 31 months. In the training cohort, 93 patients experienced disease relapse and 66 died. PFS rates were 80.8%, 42.0%, and 7.6% at 1, 3, and 5 years, respectively. OS rates were 86.7%, 43.1%, and 10.6% at 1, 3, and 5 years.
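A minimal sketch of a random 7:3 split that reproduces the reported cohort sizes is shown below. The seed, the function name, and the use of integer patient IDs are assumptions for illustration; the summary does not say how the authors performed the randomization or whether it was stratified.

```python
import random

def split_cohort(patient_ids, train_fraction=0.7, seed=0):
    """Shuffle patient IDs and split them into training and validation lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

train_ids, val_ids = split_cohort(range(369))
print(len(train_ids), len(val_ids))  # 258 111
```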

PET/CT scanning protocol: Patients fasted for at least 6 hours before scanning to achieve a blood glucose level below 11.1 mmol/L. After fasting, 185-370 MBq of [18F]-FDG (5.18 MBq/kg) was administered intravenously, and scans commenced 60 minutes post-injection. Scans covered the region from the base of the skull to the upper thighs, with 2 minutes per bed position. Different scanners were used at the two sites (Gemini GXL at Nanjing, Gemini GXL and UM780 at West China Hospital).

TL;DR: 369 DLBCL patients from two Chinese hospitals, split 7:3 into training (n=258) and validation (n=111) cohorts. Median follow-up was 31 months. 5-year PFS and OS rates in the training cohort were 7.6% and 10.6%, respectively.

Methodology

Pages 3-4

Multimodal Image Fusion and Deep Feature Extraction

Volume of interest (VOI) delineation: VOIs were delineated semi-automatically on PET images using the Grow-Cut algorithm in 3D Slicer (version 5.2.0), then manually adjusted by two physicians. Discrepancies were resolved by a senior nuclear medicine scientist. Rather than independently delineating VOIs on CT and fused images, the PET-derived VOIs were registered onto the CT images and then propagated to the fused multimodal images, reducing the manual workload significantly.

Image preprocessing and fusion: CT images were windowed at 350 HU width and 50 HU level for optimal contrast. PET images underwent smoothing and denoising. Z-score normalization was applied to pixel values of both modalities. CT images were then registered to PET images using the SimpleITK library (version 2.0.0). A multi-scale feature-weighted fusion module combined the two modalities. PET and CT images were down-sampled and pooled to obtain feature maps at multiple scales, which were concatenated to produce spatial fusion maps. An attention mechanism was applied to capture the semantic (bio-metabolic) information from PET images, generating PET attention maps. These attention maps were concatenated with CT images to form hybrid attentional maps, which were then combined with the spatial fusion maps and up-sampled to produce the final fused multimodal PET-CT images.
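The two simplest preprocessing steps, CT windowing and Z-score normalization, can be sketched as follows. This is an illustrative sketch only: it assumes the window is applied by clipping Hounsfield units to level +/- width/2 and that normalization is computed over the whole array. The registration, attention, and multi-scale fusion steps are not reproduced here.

```python
import numpy as np

def window_ct(hu, level=50.0, width=350.0):
    """Clip CT intensities (in HU) to the window [level - width/2, level + width/2]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return np.clip(np.asarray(hu, dtype=float), lo, hi)

def zscore(image, eps=1e-8):
    """Normalize to zero mean and unit variance; eps guards against a flat image."""
    image = np.asarray(image, dtype=float)
    return (image - image.mean()) / (image.std() + eps)

# Illustrative HU values: air, window floor, window centre, window ceiling, dense bone.
ct = np.array([-1000.0, -125.0, 50.0, 225.0, 1500.0])
windowed = window_ct(ct)          # values now lie in [-125, 225]
normalized = zscore(windowed)
print(windowed, normalized)
```

With a level of 50 HU and a width of 350 HU, the window spans -125 to 225 HU, so everything outside that range is saturated before normalization.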

Deep feature extraction: Seven pre-trained deep learning architectures were evaluated for feature extraction: ResNet-50, VGG-16, VGG-19, DenseNet-121, DenseNet-169, Xception, and NASNet, all initially trained on the ImageNet dataset. Tumor-containing images were cropped to 224 x 224 pixels and converted from grayscale to RGB via cubic spline interpolation. From each model's fully connected layer, 1,000 deep features were extracted per patient. These models were implemented using the Python Keras library.
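ImageNet-pretrained networks expect 224 x 224 three-channel input, so grayscale patches have to be reshaped before feature extraction. The sketch below shows one simple way to do this: crop a fixed-size patch around a tumor center and replicate the single channel three times. Both the helper name and the channel-replication approach are assumptions; the cubic spline interpolation mentioned above and the Keras feature-extraction call are not reproduced.

```python
import numpy as np

def crop_to_rgb(slice_2d, center, size=224):
    """Crop a size x size patch around `center` (row, col) and stack it into 3 channels.

    The image is zero-padded so that patches near the border keep the full size.
    """
    img = np.asarray(slice_2d, dtype=float)
    half = size // 2
    padded = np.pad(img, ((half, size - half), (half, size - half)), mode="constant")
    r0, c0 = center                      # in padded coordinates the patch starts here
    patch = padded[r0:r0 + size, c0:c0 + size]
    return np.stack([patch, patch, patch], axis=-1)

# Toy 10 x 10 slice; the "tumor center" at (5, 5) is hypothetical.
image = np.arange(100, dtype=float).reshape(10, 10)
rgb_patch = crop_to_rgb(image, center=(5, 5))
print(rgb_patch.shape)  # (224, 224, 3)
```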

TL;DR: PET and CT images were fused using a multi-scale attention-based module. Seven CNN architectures (ResNet-50, VGG-16/19, DenseNet-121/169, Xception, NASNet) were tested for deep feature extraction, yielding 1,000 features per patient from the fused images.

Methodology (continued)

Pages 4-5

Feature Selection, AutoML, and Model Construction

Feature selection pipeline: All extracted deep features were normalized via Z-scores. The intraclass correlation coefficient (ICC) was calculated to assess interobserver repeatability, and only features with ICC above 0.8 were retained. The Mutual Information (MI) algorithm was then used to rank features by their significance to survival events. The top 10 features were selected to construct the DFR-signature.
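The ICC filter followed by a top-k ranking can be sketched as below. Several choices here are assumptions rather than details from the study: the ICC form (ICC(2,1), two-way random effects with absolute agreement), the function names, and the use of precomputed relevance scores in place of an actual Mutual Information computation.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an (n_subjects, k_raters) array of one feature's values."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)      # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)      # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))            # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def select_features(reader_a, reader_b, relevance, icc_min=0.8, top_k=10):
    """Keep features with ICC > icc_min, then return the top_k by relevance score."""
    a = np.asarray(reader_a, dtype=float)
    b = np.asarray(reader_b, dtype=float)
    iccs = np.array([icc_2_1(np.column_stack([a[:, j], b[:, j]]))
                     for j in range(a.shape[1])])
    stable = np.flatnonzero(iccs > icc_min)
    order = np.argsort(np.asarray(relevance, dtype=float)[stable])[::-1]
    return stable[order][:top_k].tolist()

# Toy data: features 0 and 2 agree perfectly between readers, feature 1 does not.
rng = np.random.default_rng(0)
reader_a = rng.normal(size=(20, 3))
reader_b = reader_a.copy()
reader_b[:, 1] = -reader_a[:, 1]
selected = select_features(reader_a, reader_b, relevance=[0.1, 0.9, 0.5])
print(selected)  # [2, 0]
```

Feature 1 has the highest relevance score in the toy data but is dropped first because it fails the repeatability filter, which mirrors the order of operations described above.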

AutoGluon framework: The AutoML framework AutoGluon (version 0.7.0) was employed to optimize the DFR-signature construction. AutoGluon automates data preprocessing, optimal algorithm selection, and hyperparameter tuning. It uses a two-layer architecture: the first layer consists of multiple base machine learning models that generate predictions, and the second layer uses multiple stacker models that take the first layer's predictions combined with the original input to produce final outputs. This stacking ensemble approach eliminates the need for manual parameter tuning, which is a common bottleneck in traditional radiomics pipelines.
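The two-layer idea can be illustrated with a deliberately simplified NumPy sketch. It does not use AutoGluon and is not the authors' pipeline: ridge regressions with different penalties stand in for the base models, a single ridge regression stands in for the stacker layer, and all function names and data are hypothetical. The point is only to show the second layer receiving the first layer's out-of-fold predictions concatenated with the original features.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def ridge_predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

def out_of_fold(X, y, alpha, n_folds=5, seed=0):
    """Predictions for each sample from a model that never saw that sample."""
    idx = np.random.default_rng(seed).permutation(len(X))
    oof = np.empty(len(X))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        oof[fold] = ridge_predict(ridge_fit(X[train], y[train], alpha), X[fold])
    return oof

def fit_stack(X, y, alphas=(0.1, 1.0, 10.0), stacker_alpha=1.0):
    # Layer 1: out-of-fold predictions avoid leaking labels into layer 2.
    oof = np.column_stack([out_of_fold(X, y, a) for a in alphas])
    base = [ridge_fit(X, y, a) for a in alphas]   # refit on all data for inference
    # Layer 2: original features concatenated with layer-1 predictions.
    stacker = ridge_fit(np.column_stack([X, oof]), y, stacker_alpha)
    return base, stacker

def predict_stack(model, X):
    base, stacker = model
    layer1 = np.column_stack([ridge_predict(w, X) for w in base])
    return ridge_predict(stacker, np.column_stack([X, layer1]))

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)
stack = fit_stack(X_demo, y_demo)
pred = predict_stack(stack, X_demo)
print(np.corrcoef(pred, y_demo)[0, 1])
```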

Model development and validation: A 10-fold cross-validation approach was used during training to mitigate overfitting risk. Univariate and multivariate Cox regression analyses identified independent clinical factors and metabolic parameters. These prognostic factors were combined with the DFR-signature to build the final combined model. For comparison, the authors also built a PET-only model, a CT-only model, an NCCN-IPI model, and a conventional radiomics model. Performance was evaluated using Harrell's concordance index (C-index) and time-dependent AUC (tdAUC). PFS and OS served as the primary endpoints.
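For readers unfamiliar with the metric, Harrell's C-index is the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. The sketch below is a plain-Python version under simplifying assumptions: pairs with tied survival times are skipped, and tied risk scores count as half-concordant. The toy data are invented for illustration.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher values mean higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the shorter time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at 15 months.
c_index = harrell_c_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.6, 0.1],
)
print(c_index)  # 0.8
```

In the toy cohort there are five comparable pairs and four are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.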

TL;DR: Top 10 features (from 1,000) were selected via ICC filtering and Mutual Information ranking. AutoGluon's two-layer stacking ensemble automated model optimization. The combined model integrated the DFR-signature with clinical and metabolic factors, validated via 10-fold cross-validation.

Key Results

Pages 5-6

ResNet-50 Outperformed All Other Architectures

Among the seven CNN architectures tested, ResNet-50 achieved the highest AUC for predicting overall survival: 0.863 in the training cohort and 0.767 in the internal validation cohort. Ranked by training AUC, the next best performers were DenseNet-169 (training 0.839, validation 0.698) and VGG-16 (training 0.807, validation 0.713). DenseNet-121 had the second-highest validation AUC (0.740) despite a lower training AUC (0.785). VGG-19, Xception, and NASNet all had validation AUCs below 0.71.

Multimodal superiority: When ResNet-50 features were extracted from CT images alone, PET images alone, and fused PET/CT images separately, the fused multimodal approach consistently outperformed the single-modality approaches. The ROC analysis showed the multimodal DFR-signature model yielded better classification performance than both unimodal models and the traditional radiomics signature model.

Risk stratification with Kaplan-Meier analysis: The DFR-signature cutoff values in the training cohort were 0.2839 for PFS and 0.1911 for OS. These cutoffs stratified patients into low-risk and high-risk groups. Combining the DFR-signature risk group with the NCCN-IPI score (threshold at 4) subdivided patients into four risk categories: low/low-intermediate (DFR low-risk + NCCN-IPI < 4), high-intermediate/high (DFR high-risk + NCCN-IPI >= 4), and two intermediate groups in which the two indicators disagree. Log-rank tests confirmed statistically significant survival differences across the four groups (P < 0.05) in both the training and validation cohorts.
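The four-way grouping can be written as a small lookup, sketched below with the reported cutoffs. Several details are assumptions: that a higher DFR-signature score means higher risk, that a score exactly at the cutoff counts as low-risk, and the labels used for the two intermediate groups, which the summary does not name.

```python
DFR_CUTOFF = {"PFS": 0.2839, "OS": 0.1911}   # training-cohort cutoffs from the text
NCCN_IPI_THRESHOLD = 4

def risk_group(dfr_score, nccn_ipi, endpoint="PFS"):
    """Combine the DFR-signature risk group with NCCN-IPI into one of four groups."""
    dfr_high = dfr_score > DFR_CUTOFF[endpoint]    # assumption: strictly above = high
    ipi_high = nccn_ipi >= NCCN_IPI_THRESHOLD
    if dfr_high and ipi_high:
        return "high-intermediate/high"
    if not dfr_high and not ipi_high:
        return "low/low-intermediate"
    if dfr_high:
        return "intermediate (DFR high-risk, NCCN-IPI < 4)"     # hypothetical label
    return "intermediate (DFR low-risk, NCCN-IPI >= 4)"         # hypothetical label

# The same score can fall into different groups depending on the endpoint.
print(risk_group(0.25, 2, endpoint="PFS"))
print(risk_group(0.25, 2, endpoint="OS"))
```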

TL;DR: ResNet-50 achieved the best AUC (0.863 training, 0.767 validation) among seven CNNs. Fused PET/CT features outperformed single-modality features. Combining DFR-signature with NCCN-IPI produced a four-tier risk stratification with significant survival separation (P < 0.05).

Key Results (continued)

Pages 6-9

Combined Model Performance and Cox Regression Findings

Cox regression analysis: Multivariate analysis identified several independent risk factors. For PFS, significant predictors included Ann Arbor stage (HR = 1.703, P = 0.044), NCCN-IPI >= 4 (HR = 2.351, P < 0.001), SUVmax (HR = 1.575, P = 0.032), TMTV (HR = 2.535, P < 0.001), and DFR-signature for PFS (HR = 6.048, P < 0.001). For OS, significant predictors included pathological subtype (HR = 1.986, P = 0.006), NCCN-IPI (HR = 3.608, P < 0.001), SUVmax (HR = 1.931, P = 0.009), TMTV (HR = 2.280, P < 0.001), and DFR-signature for OS (HR = 8.058, P < 0.001). Notably, the DFR-signature had the highest hazard ratios of any variable for both endpoints.
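In a Cox proportional hazards model, each hazard ratio is the exponential of a regression coefficient, and the relative hazard of a patient with several risk factors is the product of the corresponding hazard ratios. The sketch below applies this to the reported PFS values. It assumes every factor is coded as a binary indicator and that there are no interaction terms, neither of which is stated in the summary, so the result is a reading aid rather than a reproduction of the authors' model.

```python
import math

# Multivariate hazard ratios for PFS as reported in the text.
HR_PFS = {
    "Ann Arbor stage": 1.703,
    "NCCN-IPI >= 4": 2.351,
    "SUVmax": 1.575,
    "TMTV": 2.535,
    "DFR-signature": 6.048,
}

def relative_hazard(hazard_ratios, present):
    """Hazard relative to a patient with none of the factors: exp(sum of log HRs)."""
    return math.exp(sum(math.log(hazard_ratios[name]) for name in present))

# Cox coefficients (beta = ln HR) implied by the reported hazard ratios.
coefficients = {name: math.log(hr) for name, hr in HR_PFS.items()}

# Assumed example patient: high NCCN-IPI and high-risk DFR-signature only.
example = relative_hazard(HR_PFS, ["NCCN-IPI >= 4", "DFR-signature"])
print(coefficients)
print(example)  # equals 2.351 * 6.048 under the stated assumptions
```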

C-index comparison across models: The combined model (integrating DFR-signature, clinical features, and metabolic parameters) achieved C-indices of 0.784 (training) and 0.739 (validation) for PFS, and 0.831 (training) and 0.782 (validation) for OS. This consistently surpassed the CT model (PFS: 0.732/0.718, OS: 0.799/0.721), the PET model (PFS: 0.724/0.717, OS: 0.791/0.745), the NCCN-IPI model (PFS: 0.632/0.654, OS: 0.659/0.636), and the conventional radiomics model (PFS: 0.740/0.721, OS: 0.787/0.734). The combined model's C-index exceeded that of the NCCN-IPI model by 0.085 to 0.172 points, and by more than 0.12 points in three of the four comparisons.
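The margin over NCCN-IPI can be checked directly from the reported C-indices. The short script below only restates the numbers given above and subtracts them; the dictionary layout and key names are illustrative.

```python
# Reported C-indices: (endpoint, cohort) -> value.
COMBINED = {
    ("PFS", "training"): 0.784, ("PFS", "validation"): 0.739,
    ("OS", "training"): 0.831, ("OS", "validation"): 0.782,
}
NCCN_IPI = {
    ("PFS", "training"): 0.632, ("PFS", "validation"): 0.654,
    ("OS", "training"): 0.659, ("OS", "validation"): 0.636,
}

# Gain of the combined model over NCCN-IPI, rounded to three decimals.
gains = {key: round(COMBINED[key] - NCCN_IPI[key], 3) for key in COMBINED}
for (endpoint, cohort), gain in gains.items():
    print(f"{endpoint} {cohort}: +{gain:.3f}")

above_012 = sum(gain > 0.12 for gain in gains.values())
print(above_012, "of", len(gains), "comparisons exceed 0.12")
```

The smallest gain is for PFS in the validation cohort (0.085), which is the one comparison that falls below 0.12.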

External validation: The model was further tested using Nanjing Drum Tower Hospital data for training/testing and West China Hospital data as a completely independent external validation cohort. The DFR-signature-based model achieved AUCs of 0.758 (training) and 0.703 (external test) for PFS, and 0.753 (training) and 0.701 (external test) for OS. Calibration curves for 1-year, 3-year, and 5-year survival probabilities showed close alignment between predicted and observed outcomes in both cohorts, supporting clinical utility.

TL;DR: The DFR-signature was the strongest independent predictor for both PFS (HR = 6.048) and OS (HR = 8.058). The combined model achieved C-indices of 0.784/0.739 (PFS) and 0.831/0.782 (OS), outperforming NCCN-IPI by 0.085 to 0.172 points. External validation supported generalizability (AUC 0.703 for PFS, 0.701 for OS).

Limitations

Pages 10-11

Retrospective Design and Gaps in Therapy Coverage

Retrospective single-treatment design: All patients in this study received R-CHOP, which remains the standard first-line regimen for DLBCL. However, the study did not include patients treated with newer therapies such as small molecule drugs, antibody-drug conjugates (ADCs), CD3xCD20 bispecific antibodies, or chimeric antigen receptor (CAR) T-cell therapy. CAR-T cell therapy has fundamentally changed the management of relapsed and refractory large B-cell lymphomas, yet over 50% of patients do not benefit from it due to non-response or further relapse. The model's applicability to patients receiving these newer regimens remains unvalidated.

Multi-scanner variability: Different PET/CT scanners were used at the two institutions (Gemini GXL at Nanjing, Gemini GXL and UM780 at West China). While the study demonstrated generalizability across these two sites, the impact of broader scanner variability on feature extraction and model performance was not systematically evaluated. Radiomics features are known to be sensitive to differences in acquisition protocols, reconstruction algorithms, and scanner hardware.

Sample size and cohort demographics: The study enrolled 369 patients, which is reasonable for a multicenter radiomics study but still limits the statistical power for subgroup analyses. The median follow-up of 31 months may be insufficient to fully capture long-term outcomes, particularly for 5-year survival predictions (the reported 5-year figures in the training cohort were 7.6% for PFS and 10.6% for OS). The geographic concentration of patients in Chinese hospitals raises questions about generalizability to other populations with different genetic backgrounds and disease presentations.

TL;DR: Key limitations include retrospective design, exclusion of novel therapies (CAR-T, ADCs, bispecific antibodies), multi-scanner variability, a 31-month median follow-up that may be too short for 5-year predictions, and a geographically limited patient cohort of 369 from two Chinese hospitals.

Future Directions

Pages 11-12

Expanding to Novel Therapies and CNS Relapse Prediction

Novel therapy integration: The authors highlight the need to expand the model to incorporate biological features associated with novel DLBCL therapies. As CAR-T cell therapy, bispecific antibodies, and ADCs become more widely used, identifying biomarkers that predict which patients will benefit from these treatments becomes critical. Future studies could include imaging and clinical features specific to these treatment contexts, potentially enabling the DFR-signature framework to guide therapy selection beyond first-line R-CHOP.

CNS relapse prediction: The study references prior work showing that radiomics-based deep learning achieved 90.66% accuracy in predicting survival for high-grade gliomas, and that radiomic feature analysis (AUC = 0.944) outperformed conventional genomics (AUC = 0.819) in distinguishing glioblastoma from primary CNS lymphoma. The authors suggest that DFR-based models could serve as a new low-cost tool for predicting CNS relapse in DLBCL patients, improving treatment decisions for CNS involvement and playing a role in routine clinical management.

Prospective validation and broader adoption: While the multicenter design and external validation cohort are strengths of this study, prospective validation across diverse international cohorts would be necessary before clinical deployment. Harmonization of PET/CT acquisition protocols across sites, standardization of the image fusion pipeline, and integration into clinical decision-support systems represent important next steps. The AutoGluon framework's automation of model selection and hyperparameter tuning could help make the approach more accessible to institutions without dedicated machine learning expertise.

TL;DR: Future work should incorporate novel therapies (CAR-T, bispecific antibodies), explore DFR-based CNS relapse prediction, and pursue prospective multicenter validation. The AutoML pipeline may lower the barrier for clinical adoption across institutions.

Survival prediction in diffuse large B-cell lymphoma patients: multimodal PET/CT analysis
