Diffuse Large B-Cell Lymphoma (DLBCL) is the most common type of Non-Hodgkin Lymphoma. Over 60% of patients are cured after first-line R-CHOP chemotherapy (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone). However, up to 15% of patients experience Primary Treatment Failure (PTF), meaning their disease progresses during therapy or fails to achieve a complete response. These refractory patients face a median survival of less than one year, making early identification critical for treatment planning.
The clinical gap: Current tools for predicting PTF are limited. Double-hit status (concurrent BCL2 and MYC translocations) identifies only about 5% of patients unlikely to benefit from R-CHOP. The revised International Prognostic Index (R-IPI) and TP53 mutation status can predict long-term survival but cannot specifically flag patients headed for primary treatment failure. Standard diagnostic imaging is used for staging, but no conventional radiological findings have been correlated with PTF-DLBCL. Emerging therapies like chimeric antigen receptor (CAR) T-cell therapy produce high response rates in relapsed/refractory DLBCL and could benefit patients at risk for PTF if they were identified before starting standard treatment.
The radiomics hypothesis: Radiomics refers to the high-throughput extraction of large numbers of quantitative features from medical images. These features capture information about tumor shape, texture, and intensity patterns that are invisible to the human eye. While quantitative analysis of 18FDG-PET/CT scans has been studied as a prognostic marker in DLBCL, data on the potential value of diagnostic CT-derived quantitative features in lymphoma remains sparse. This study aimed to develop a machine learning model using CT-based radiomic features to predict PTF-DLBCL from the initial diagnostic scan, before any treatment begins.
This single-center retrospective study included adult patients diagnosed with de novo DLBCL between 2009 and 2018 at the Jewish General Hospital, McGill University. All patients received R-CHOP or similar frontline therapy. Patients who had received prior chemotherapy or radiotherapy (for instance, for prior indolent lymphoma) were excluded. The study was approved by the Research Ethics Committee of the CIUSSS of West-Central Montreal with a waiver for informed consent given its retrospective design.
Cohort composition: Twenty-six refractory patients were identified and matched 1:1 to 26 non-refractory patients. Refractory patients (the PTF group) were defined by disease progression during R-CHOP or failure to achieve a complete response after at least 4 cycles, per the Lugano criteria. Non-refractory patients achieved a complete response and did not relapse within 6 months. Matching was performed on sex and R-IPI score (which incorporates age, stage, performance status, LDH level, and extra-nodal sites). The median age was 65 years for refractory and 71 years for non-refractory patients. The sex ratio (M/F) was 1.36 in both groups, and the median R-IPI was 3 (69% had an R-IPI of 3 or higher). In the refractory group, 96% had stage 3 or higher disease and 31% had an ECOG performance status of 2 or greater. Three of 18 tested refractory patients had double-hit DLBCL, compared with none of 11 tested non-refractory patients.
Node selection criteria: On the diagnostic CT scan, lymph nodes measuring 1.5 cm or greater in greatest diameter were evaluated. To mirror the Lugano response criteria, a maximum of 6 nodes at each of 4 nodal sites (abdomen, chest, axilla, and neck) were included per patient. When more than 6 eligible nodes existed at one site, only the 6 largest were kept. This yielded 180 lymph nodes total: 75 non-complete-response nodes from refractory patients (55 refractory nodes and 20 partial response nodes) and 105 complete response nodes from non-refractory patients.
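The selection rule described above (a size threshold, then a per-site cap on the largest nodes) can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the function name select_nodes, and the example diameters are hypothetical and do not come from the study.

```python
from collections import defaultdict

MIN_DIAMETER_CM = 1.5    # size threshold stated in the text
MAX_NODES_PER_SITE = 6   # per-site cap stated in the text


def select_nodes(nodes):
    """Keep nodes >= 1.5 cm, then at most the 6 largest per nodal site.

    `nodes` is a list of (node_id, site, diameter_cm) tuples; this layout
    is a hypothetical convenience, not the study's data format.
    """
    by_site = defaultdict(list)
    for node_id, site, diameter_cm in nodes:
        if diameter_cm >= MIN_DIAMETER_CM:
            by_site[site].append((node_id, site, diameter_cm))

    selected = []
    for site in sorted(by_site):
        largest_first = sorted(by_site[site], key=lambda n: n[2], reverse=True)
        selected.extend(largest_first[:MAX_NODES_PER_SITE])
    return selected


# Hypothetical patient: 8 abdominal nodes (one below threshold) and 2 neck nodes.
example_nodes = [(f"abd{i}", "abdomen", 1.2 + 0.4 * i) for i in range(8)]
example_nodes += [("neck0", "neck", 1.4), ("neck1", "neck", 2.0)]

kept = select_nodes(example_nodes)
print([node_id for node_id, _, _ in kept])
```

In this toy case seven abdominal nodes pass the size threshold, so the smallest eligible one is dropped to respect the cap of six.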
Volume of interest delineation: Each lymph node was manually contoured in 3D using the open-source 3D Slicer software (v4.10.2). Critically, every node was contoured independently by two readers to test inter-observer reproducibility of the model. When conglomerate nodes were present, they were included as a single volume of interest (VOI) unless a clear cleavage line separated distinct nodes. An experienced senior oncologic radiologist reviewed all contours for quality control, ensuring that extra-nodal material was excluded and inclusion/exclusion criteria were respected. This central revision was not intended to flatten inter-individual contouring variation but rather to enforce consistent criteria.
Radiomic feature extraction: A total of 1,218 features were extracted from each VOI using the PyRadiomics open-source software (v2.2.0). The feature set included 18 first-order statistics (pixel-level intensity distributions), 14 3D shape-based features, and 68 gray-level texture features, totaling 100 base features. In addition, 13 spatial filters (5 Laplacian of Gaussian and 8 wavelet filters) were applied to extract fine and coarse textural patterns. Each non-shape feature was re-extracted from all 13 filtered images, adding 1,118 additional features (86 features multiplied by 13 filters). Beyond the radiomic features, nodal anatomical site and the subjective assessment of necrosis were also evaluated as potential additional variables.
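The feature count follows directly from the numbers given above, and the arithmetic is easy to verify. The short check below uses only figures stated in the text; the variable names are illustrative.

```python
# Base feature classes as counted in the text.
FIRST_ORDER = 18
SHAPE_3D = 14
TEXTURE = 68

# Filters applied to the original image.
LOG_FILTERS = 5
WAVELET_FILTERS = 8

base_features = FIRST_ORDER + SHAPE_3D + TEXTURE      # 100
non_shape_features = FIRST_ORDER + TEXTURE            # 86 (shape is not re-extracted)
n_filters = LOG_FILTERS + WAVELET_FILTERS             # 13
filtered_features = non_shape_features * n_filters    # 1,118
total_features = base_features + filtered_features    # 1,218

print(base_features, filtered_features, total_features)
```

Shape features are counted once because they depend only on the contour, not on the filtered intensities.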
Handling artifacts and necrosis: Nodes significantly obscured by imaging artifact were excluded (8 nodes for reader 1, 10 for reader 2). Nodes with central necrosis or cystic changes (defined as areas of low attenuation between -10 and 30 Hounsfield units with greatest diameter of 3 mm or more) were labeled by each reader. Rather than being excluded, these nodes were retained in the analysis, with subjective necrosis included as a candidate variable during model construction.
Initial feature reduction: Starting from 1,218 radiomic features, the authors first eliminated highly correlated features using the training set only. After evaluating multiple correlation thresholds, they manually selected a cutoff of 0.6. This threshold removed redundant features while leaving a candidate set diverse enough for recursive feature elimination yet small enough to keep computation manageable. The 1,218 features were reduced to 66 candidates whose pairwise correlations fell below the cutoff, and these were combined with 2 additional variables: nodal anatomical site and subjective necrosis assessment.
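A correlation filter of this kind can be sketched as follows. The source does not say which correlation coefficient or removal heuristic was used, so the Pearson coefficient and the greedy "keep the first, drop later correlated features" order below are assumptions, and the toy feature values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.6):
    """Greedily keep features whose |r| with every already-kept feature is below threshold.

    `features` maps a feature name to its values across training nodes.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy training data (hypothetical values for three features over five nodes).
toy = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost a multiple of feat_a
    "feat_c": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to feat_a
}
print(drop_correlated(toy))
```

Computing the filter on the training set alone, as the text describes, keeps information from the test nodes out of feature selection.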
Recursive feature elimination: Using the rfeControl function from R's caret package, the authors performed recursive feature elimination optimized for accuracy within a Random Forest framework. Repeated 10-fold cross-validation was used to evaluate candidate feature subsets: the training data were split into 10 folds, the model was trained on 9 folds and evaluated on the held-out fold, the procedure was repeated 10 times, and the average performance was recorded. This process identified the top 10 features for the final model, all of which showed strong inter-reader correlation (median Spearman r = 0.8, all adjusted p-values below 0.0001). Notably, neither nodal site nor subjective necrosis was retained in the final feature set.
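The resampling scheme itself is independent of the classifier and can be illustrated without caret. The sketch below generates the train/validation index pairs for repeated 10-fold cross-validation in plain Python; the function name repeated_kfold, the seed, and the sample count of 126 (70% of 180 nodes, ignoring artifact exclusions) are illustrative assumptions rather than details from the study.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so fold sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            validation = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, validation


splits = list(repeated_kfold(n_samples=126))
print(len(splits))  # 10 folds x 10 repeats
```

In a full pipeline, each candidate feature subset would be scored on every pair and the scores averaged, which is the quantity recursive feature elimination then optimizes.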
Hyperparameter tuning: The Random Forest classifier (R randomForest package, v4.6-14) was tuned for the number of trees using a grid search ranging from 5 to 20,000 trees, evaluated via repeated cross-validation on the training set. The final tuned model was then evaluated on the independent test set. This entire process was repeated across five random seeds for both readers, with average performance compared across runs.
Train-test split strategy: Seventy percent of nodes were randomly assigned to the training set and the remaining 30% to an independent test set. The same node allocation was initially used for both readers to enable fair comparison. However, because each reader independently assessed artifact exclusions, there was a small discrepancy between the exact nodes used for each reader's training and test sets.
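A shared, seeded split followed by per-reader exclusions might look like the sketch below. The node identifiers, the seed, and the specific excluded nodes are hypothetical; only the 70/30 proportion and the exclusion counts (8 and 10 nodes) come from the text, and applying the exclusions after the split is an assumption about the order of operations.

```python
import random


def split_nodes(node_ids, train_fraction=0.7, seed=0):
    """Shuffle node IDs with a fixed seed and cut them into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(node_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


all_nodes = [f"node_{i:03d}" for i in range(180)]  # hypothetical identifiers
train_ids, test_ids = split_nodes(all_nodes)

# Hypothetical artifact exclusions, judged independently by each reader.
excluded_reader1 = set(all_nodes[:8])
excluded_reader2 = set(all_nodes[:10])
train_r1 = [n for n in train_ids if n not in excluded_reader1]
train_r2 = [n for n in train_ids if n not in excluded_reader2]

print(len(train_ids), len(test_ids))
print(len(train_r1), len(train_r2))
```

Because the allocation is shared and only the exclusions differ, the two readers' sets remain nearly identical, which is what makes the comparison fair.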
Overall performance: On the independent test set, the final model achieved a mean accuracy of 73%, a mean sensitivity of 62%, and a mean specificity of 82% for distinguishing nodes of refractory patients from those of non-refractory patients. The mean positive predictive value (PPV) was 77% and the mean negative predictive value (NPV) was 71%. The high specificity matters in this clinical context because a test used for patient selection and treatment decisions must rarely misclassify non-refractory patients as refractory.
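For readers less familiar with these metrics, the sketch below shows how each one is derived from a confusion matrix, taking "refractory" as the positive class. The counts are hypothetical and chosen only for illustration; the study's actual confusion matrices are not given here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics; the positive class is 'refractory'."""
    return {
        "sensitivity": tp / (tp + fn),   # refractory nodes correctly flagged
        "specificity": tn / (tn + fp),   # non-refractory nodes correctly cleared
        "ppv": tp / (tp + fp),           # flagged nodes that are truly refractory
        "npv": tn / (tn + fn),           # cleared nodes that are truly non-refractory
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical test-set counts, not taken from the study.
example = classification_metrics(tp=14, fp=5, tn=26, fn=8)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

A false positive here corresponds to a node from a patient who would respond to R-CHOP being labeled refractory, which is the error that high specificity guards against.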
Per-reader breakdown: Reader 1 achieved a sensitivity of 0.66 (95% CI: 0.57-0.74), specificity of 0.84 (95% CI: 0.74-0.93), PPV of 0.77 (95% CI: 0.75-0.80), NPV of 0.73 (95% CI: 0.69-0.77), and accuracy of 0.76 (95% CI: 0.74-0.79). Reader 2 achieved a sensitivity of 0.59 (95% CI: 0.49-0.69), specificity of 0.81 (95% CI: 0.76-0.85), PPV of 0.76 (95% CI: 0.67-0.86), NPV of 0.69 (95% CI: 0.66-0.73), and accuracy of 0.70 (95% CI: 0.68-0.73). The confidence intervals overlapped for sensitivity, specificity, PPV, and NPV, which is consistent with good inter-reader reproducibility; the reported accuracy intervals, however, narrowly miss each other (lower bound 0.74 for reader 1 versus upper bound 0.73 for reader 2).
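As a quick consistency check, the overall means reported earlier agree, to within rounding, with the simple average of the two readers' point estimates. The snippet below uses only the per-reader values quoted above; treating the overall figures as plain two-reader averages is an assumption about how they were derived.

```python
# Point estimates quoted in the per-reader breakdown.
reader1 = {"sensitivity": 0.66, "specificity": 0.84, "ppv": 0.77, "npv": 0.73, "accuracy": 0.76}
reader2 = {"sensitivity": 0.59, "specificity": 0.81, "ppv": 0.76, "npv": 0.69, "accuracy": 0.70}

# Simple two-reader average for each metric.
mean_metrics = {name: (reader1[name] + reader2[name]) / 2 for name in reader1}

for name, value in mean_metrics.items():
    print(f"{name}: {value:.3f}")
```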
AUC analysis: The area under the ROC curve was 0.83 (95% CI: 0.81-0.84) for reader 1 and 0.79 (95% CI: 0.76-0.81) for reader 2. This is notably superior to the AUC of subjective necrosis as a qualitative radiology marker, which achieved only 0.56 (95% CI: 0.49-0.63) and 0.52 (95% CI: 0.45-0.59) for readers 1 and 2 respectively. For comparison, an interim PET-scan-based study reported an AUC of only 0.677 for response prediction in DLBCL, further underscoring the competitive performance of this CT-based radiomics approach.
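The AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way. All scores and necrosis flags are hypothetical, and the function name roc_auc is illustrative.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical model outputs: predicted probability of "refractory" per node.
refractory_scores = [0.9, 0.8, 0.55, 0.4]
responder_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
model_auc = roc_auc(refractory_scores, responder_scores)

# Hypothetical binary marker (1 = necrosis seen), which produces many ties.
necrosis_refractory = [1, 0, 0, 0]
necrosis_responder = [0, 0, 0, 0, 1]
marker_auc = roc_auc(necrosis_refractory, necrosis_responder)

print(model_auc, marker_auc)
```

The toy marker illustrates why a binary finding present in similar proportions of both groups yields an AUC close to 0.5, the level of chance.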
Univariate feature analysis: Before building the multivariate model, the authors explored basic statistical associations between non-CR and CR nodes using multiple t-tests. They identified 35 features with an absolute log2 fold change of 1 or greater and an adjusted p-value below 0.05 (using the Benjamini-Yekutieli correction method). A hierarchical clustering heatmap using Euclidean distance was generated for these 35 features. The heatmap showed only loose clustering: refractory and non-refractory nodes each tended to group with nodes of the same class, but fully distinct phenotypic clusters did not emerge, which motivated a multivariate machine learning approach.
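The two selection criteria can be sketched as follows. The Benjamini-Yekutieli adjustment below follows the standard step-up definition (each sorted p-value is scaled by m times the harmonic sum, divided by its rank, then made monotone and capped at 1). The raw p-values and fold changes are hypothetical, and the fold-change helper assumes both group means are positive, which will not hold for every radiomic feature.

```python
import math


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means (assumes both means are positive)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_a / mean_b)


def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values in the original order."""
    m = len(p_values)
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * harmonic / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-feature results.
raw_p = [0.001, 0.02, 0.04, 0.5]
fold_changes = [2.0, 1.5, -0.3, 0.1]
adjusted_p = benjamini_yekutieli(raw_p)

selected = [
    i for i in range(len(raw_p))
    if abs(fold_changes[i]) >= 1 and adjusted_p[i] < 0.05
]
print(adjusted_p, selected)
```

Benjamini-Yekutieli is more conservative than the more common Benjamini-Hochberg procedure because it remains valid under arbitrary dependence between tests, which is relevant when many radiomic features are correlated.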
The role of texture and heterogeneity: Two of the 10 final features in the Random Forest model relate to non-uniformity, which points to intratumor heterogeneity as a prognostic marker. This finding is consistent with prior literature showing that intratumor heterogeneity is a strong imaging predictor of poor prognosis in both solid tumors and DLBCL. The predictive power of these texture-based features supports further exploration of contrast-enhanced CT-derived images in future radiomics models. Such features could potentially be extracted either from standalone CT scans or from the CT component of contrast-enhanced 18FDG-PET/CT.
Visual indistinguishability: The study compared two intra-abdominal nodes that appeared visually identical on CT imaging but were correctly separated by the radiomics model, with one classified as refractory and the other as non-refractory. The example illustrates that quantitative image-based features can capture information about treatment response that is not apparent on conventional radiological interpretation of semantic features.
Small sample size: The study included only 52 patients (26 refractory and 26 non-refractory), yielding 180 lymph nodes. DLBCL is characterized by wide molecular heterogeneity and can be subdivided into up to five distinct molecular subgroups. The limited cohort may not adequately represent this molecular variability. To partially mitigate this, the authors matched patients on the most significant clinical outcome predictor, the R-IPI score, helping to flatten differences between the two comparative groups.
Intra-individual spatial heterogeneity: Different tumor sites within a single patient can harbor distinct clonal populations with mutational disparity, even at initial diagnosis before treatment. This spatial diversity can produce mixed responses in which some lesions progress while others respond to therapy. A patient with a mixed response is still categorized as refractory. To account for this, each node was considered independently when predicting treatment response. The authors argue that a radiomics approach has the potential to represent disease complexity better than a single biopsy, which might underestimate spatial heterogeneity.
Manual segmentation burden: Inter-observer variability from manual contouring is a well-known limitation in radiomics research. While this study addressed it through double contouring by two independent readers (with high inter-reader correlation for all 10 final features, Spearman adjusted p-values all below 0.0001), manual segmentation is inherently time-consuming. This limits practical clinical implementation as a real-time biomarker. Semi-automated or fully automated segmentation algorithms would be needed to make this approach scalable for routine clinical use.
Single-center, retrospective design: All data came from a single institution over a 10-year period (2009-2018). The retrospective design and lack of external validation mean the model's generalizability to other centers, scanner types, and patient populations remains unconfirmed. The model was tested across two readers as a form of internal reproducibility, but this does not substitute for true external validation on an independent dataset.
Expanding the dataset: The authors plan to expand the patient dataset to develop more powerful algorithms with improved performance. A larger cohort would better capture the molecular heterogeneity of DLBCL and provide more robust training data for machine learning models. The immediate priority is validating the current model on an external independent dataset from a different institution to test its generalizability across different scanners, imaging protocols, and patient populations.
Automated node identification: Future work will focus on automating the identification of nodal structures on CT scans. This would address the primary bottleneck of manual 3D contouring, which is both time-consuming and subject to inter-observer variability. Semi-automated or fully automated segmentation using deep learning could make the radiomics pipeline practical for routine clinical deployment and enable analysis of larger patient populations in shorter timeframes.
Combining CT with PET-scan biomarkers: The contrast-enhanced CT scan used in this study is the standard method for assessing disease extent and treatment response in clinical trials, alongside PET-CT. The authors suggest combining their quantitative CT-based model with metabolic PET-scan biomarkers (such as SUV data) to potentially improve predictive accuracy. Prior studies have shown that texture analysis of PET scans combined with functional parameters correlates better with survival than SUV analysis alone. Integrating both CT-derived and PET-derived features into a unified radiomics model could capture complementary aspects of tumor biology.
Non-nodal disease analysis: The current study focused exclusively on nodal involvement. DLBCL frequently affects extra-nodal sites, and future work will explore radiomics analysis of non-nodal disease involvement. This could provide a more comprehensive assessment of the total disease burden and potentially improve prediction accuracy by incorporating the full spectrum of tumor locations and their respective imaging characteristics.