Non-Hodgkin lymphoma (NHL) is a diverse group of lymphoid cancers with a rising global incidence, reaching an estimated 545,000 new cases and 260,000 deaths worldwide in 2020. Immunochemotherapy regimens such as R-CHOP have improved survival considerably, but infectious complications, particularly pneumonia, remain among the most frequent and dangerous adverse events. In elderly NHL patients treated with R-CHOP, pulmonary complications have been reported in up to 40% of cases, with roughly 10% experiencing severe infections. A prior study of 229 newly diagnosed NHL patients found that 91 (39.7%) developed bacterial pneumonia.
The critical 90-day window: Pneumonia most commonly occurs within the first 90 days after chemotherapy initiation, a period marked by bone marrow suppression, disrupted mucosal barriers, and compromised immune function. Early identification of high-risk individuals during this window is essential so that clinicians can initiate timely interventions such as empiric antimicrobial therapy, antifungal prophylaxis, vaccination strategies, and chemotherapy dose adjustments.
Gaps in existing tools: Current risk stratification systems like the MASCC Risk Index, the Talcott classification, and the CISNE score were designed for febrile neutropenia management, not organ-specific pneumonia prediction. Similarly, invasive fungal disease scores target a narrow fungal subset rather than the full bacterial, viral, and fungal pneumonia spectrum encountered during NHL chemotherapy. This study addresses that gap by developing an interpretable machine learning model specifically for 90-day pneumonia risk in NHL.
This retrospective study was conducted at Dongyang Hospital affiliated with Wenzhou Medical University, enrolling consecutive patients with pathologically confirmed NHL who initiated systemic chemotherapy between October 2018 and October 2024. Eligibility required at least one chemotherapy cycle and follow-up for at least 90 days (or until pneumonia or death). Patients with pre-existing pneumonia or incomplete data were excluded, yielding a final cohort of 205 patients.
Outcome definition: The primary endpoint was radiographically confirmed pneumonia within 90 days of chemotherapy initiation. Case ascertainment required both a new or progressive pulmonary infiltrate on chest radiography or CT, and at least one clinical criterion: new or worsened cough/sputum, fever, auscultatory findings consistent with consolidation, or an abnormal leukocyte count. Non-infectious mimics (drug-induced pneumonitis, cardiogenic pulmonary edema, pulmonary embolism, tumor progression) were excluded. The definition adapted the 2019 ATS/IDSA community-acquired pneumonia framework for case ascertainment only, since that guideline excludes immunocompromised hosts.
Baseline characteristics: Of the 205 patients, 79 (38.5%) developed pneumonia. Compared with the non-pneumonia group, the pneumonia group was more often male (69.6% vs. 48.4%, p = 0.005), more often had advanced Ann Arbor stage III-IV disease (78.5% vs. 58.7%, p = 0.006), and showed higher rates of smoking (57.0% vs. 16.7%, p < 0.001), alcohol use (58.2% vs. 17.5%, p < 0.001), high-grade malignancy (81.0% vs. 46.0%, p < 0.001), and reduced eGFR below 80 mL/min/1.73 m² (51.9% vs. 9.5%, p < 0.001). Most other laboratory and demographic variables showed no significant differences.
Data preprocessing: A total of 35 clinical variables covering demographics, disease status, comorbidities, treatment factors, and baseline labs were collected. All predictors were dichotomized to binary indicators (0/1). Missing values were imputed using k-nearest neighbors (kNN) with k = 5, fit on training data only to avoid information leakage. Fractional values resulting from kNN on binary fields were post-processed using a 0.5 threshold to restore 0/1 coding.
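The imputation step can be sketched as follows. This is a minimal Python illustration with toy data (the study's own pipeline was in R); the variable names and the injected missingness pattern are assumptions, but the logic matches the description: fit the kNN imputer on the training split only, then re-binarize at 0.5.

```python
# Illustrative sketch of the described imputation: kNN (k = 5) fitted on the
# training split only, then a 0.5 threshold to restore 0/1 coding.
# Data here are toy binary predictors, not the study's variables.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(145, 5)).astype(float)  # toy binary predictors
X_test = rng.integers(0, 2, size=(60, 5)).astype(float)
X_train[rng.random(X_train.shape) < 0.05] = np.nan         # inject missingness
X_test[rng.random(X_test.shape) < 0.05] = np.nan

imputer = KNNImputer(n_neighbors=5)
imputer.fit(X_train)                       # fit on training data only (no leakage)
X_train_imp = (imputer.transform(X_train) >= 0.5).astype(int)  # re-binarize at 0.5
X_test_imp = (imputer.transform(X_test) >= 0.5).astype(int)
```

Transforming the test set with an imputer fitted on the training set is what keeps test-set information out of the preprocessing step.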
The authors used a two-stage feature selection procedure to identify a parsimonious and stable predictor set from the initial 35 candidate variables. In the first step, LASSO logistic regression was applied with 10-fold cross-validation, selecting the lambda.1se solution (the most heavily penalized model whose cross-validated error lies within one standard error of the minimum) to favor sparsity and reduce variance. This retained exactly four variables: high-grade malignancy, drinking (alcohol use), estimated glomerular filtration rate (eGFR), and smoking.
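The one-standard-error rule behind lambda.1se can be sketched in Python on synthetic data (the study used R's glmnet; here scikit-learn's L1 logistic regression stands in, with C as the inverse penalty strength, so small C corresponds to large lambda). All data and grid values below are illustrative.

```python
# Hedged sketch of glmnet's lambda.1se rule: among cross-validated penalties,
# keep the sparsest model whose mean CV AUC is within one standard error of
# the best. Toy data; not the study's cohort.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=145, n_features=35, n_informative=4,
                           random_state=0)
Cs = np.logspace(-2, 1, 20)               # small C = heavy penalty = sparser model
means, ses = [], []
for C in Cs:
    scores = cross_val_score(
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
        X, y, cv=10, scoring="roc_auc")   # 10-fold CV, as in the study
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

best = int(np.argmax(means))
threshold = means[best] - ses[best]       # one-SE band below the best mean AUC
# smallest C (heaviest penalty) still inside the band -> lambda.1se analogue
C_1se = Cs[next(i for i in range(len(Cs)) if means[i] >= threshold)]
```

Because the grid is ordered from heaviest to lightest penalty, the first index inside the one-SE band is the sparsest acceptable model.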
RF-RFE refinement: In the second step, random-forest-based recursive feature elimination (RF-RFE) was applied to the LASSO-retained variables, constrained by an events-per-variable (EPV) threshold of at least 5 and a maximum of 10 predictors (appropriate for the 79 pneumonia events). A pre-specified stopping rule selected the smallest subset on the performance plateau, defined as a change in AUC of 0.01 or less from the maximum cross-validated AUC. The mean CV AUCs were 0.803 for 3 predictors and 0.804 for 4 predictors (difference = 0.001), both satisfying the plateau criterion. A tie-breaker favoring higher mean CV AUC and greater selection stability confirmed the four-predictor model.
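The plateau stopping rule and tie-breaker reduce to a few lines. In the sketch below, the AUCs for the 3- and 4-predictor subsets are the values reported in the study; the values for sizes 1 and 2 are illustrative placeholders.

```python
# Minimal sketch of the pre-specified stopping rule: all subset sizes whose
# mean CV AUC is within 0.01 of the maximum sit on the plateau; the study's
# tie-breaker (higher mean CV AUC, then selection stability) picks among them.
def plateau_candidates(cv_aucs, tol=0.01):
    """Return subset sizes whose mean CV AUC is within `tol` of the maximum."""
    best = max(cv_aucs.values())
    return sorted(k for k, v in cv_aucs.items() if best - v <= tol)

cv_aucs = {1: 0.71, 2: 0.77, 3: 0.803, 4: 0.804}   # sizes 1-2 are illustrative
cands = plateau_candidates(cv_aucs)                # 3 and 4 both on the plateau
chosen = max(cands, key=lambda k: cv_aucs[k])      # tie-break by higher mean AUC
```

With the reported values, both the 3- and 4-predictor models satisfy the plateau criterion (difference 0.001), and the tie-breaker retains the four-predictor model.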
Stability validation: Nested cross-validation (outer 5-fold, 20 repeats) showed consistently elevated selection frequencies for these four predictors. Additionally, 200 bootstrap resamples of the LASSO step produced concordant stability results. Collinearity diagnostics (VIF/GVIF and Pearson correlations) indicated no concerning multicollinearity among the final predictors.
The data were split into a training set (n = 145) and an internal hold-out test set (n = 60) using stratified 70/30 random splitting to preserve the class distribution. All preprocessing, including kNN imputation, feature selection, SMOTE for class imbalance, and hyperparameter tuning, was strictly confined to training-set cross-validation folds. The hold-out test set remained untouched until final evaluation, preventing any information leakage.
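A stratified split with the reported sizes can be reproduced as follows (the feature matrix and random seed are placeholders; only the class counts come from the study):

```python
# Sketch of the stratified 70/30 split: 205 patients with 79 events, split so
# both partitions preserve the event rate. Toy features; seed is arbitrary.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 79 + [0] * 126)        # 79 events, 126 non-events
X = np.arange(len(y)).reshape(-1, 1)      # stand-in feature matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=60, stratify=y, random_state=42)
```

Stratification puts roughly 23 of the 79 events into the 60-patient test set, matching the event count cited in the limitations.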
Algorithms and tuning: Five ML models were trained using the same four predictors: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), gradient boosting machine (GBM), and LightGBM. For SVM, GBM, and KNN, hyperparameters were optimized via Bayesian optimization (rBayesianOptimization, upper confidence bound with kappa = 2.0), each evaluated with 5-fold CV and within-fold SMOTE. For LightGBM, a two-stage procedure used Bayesian search with 4-fold CV and early stopping, followed by confirmatory 5-fold CV to determine the optimal iteration number. Logistic regression required no hyperparameter tuning. SVM and KNN used z-score centering and scaling, while GBM, LightGBM, and logistic regression did not.
SMOTE handling: To address class imbalance (79 events vs. 126 non-events), SMOTE was applied within each resampling fold of cross-validation using caret's trainControl, never on the hold-out test set. Decision thresholds were pre-specified on the training set by maximizing Youden's J index and then fixed for both training and test evaluations.
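The threshold pre-specification step can be sketched as follows. This Python illustration uses toy training scores (the study worked in R, and the within-fold SMOTE step is omitted here); it shows only how a Youden-optimal cutoff is derived once on training predictions and then frozen.

```python
# Sketch of pre-specifying the decision threshold by maximizing Youden's
# J = sensitivity + specificity - 1 on training-set predictions, then fixing
# it for all later evaluation. Scores below are synthetic.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_train = rng.integers(0, 2, 145)
p_train = np.clip(y_train * 0.3 + rng.random(145) * 0.7, 0, 1)  # toy risk scores

fpr, tpr, thresholds = roc_curve(y_train, p_train)
j = tpr - fpr                              # Youden's J at each candidate cutoff
threshold = thresholds[np.argmax(j)]       # fixed here; reused on the test set
```

Freezing the cutoff before touching the test set is what keeps the threshold-based test metrics honest.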
Evaluation metrics: Performance was assessed using AUC (with DeLong 95% CIs), F1 score, Brier score, accuracy, sensitivity, specificity, PPV, and NPV. Threshold-based metrics used class-stratified bootstrap 95% CIs (B = 2,000). Calibration curves and decision-curve analysis (DCA) were also generated to evaluate clinical utility.
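The class-stratified bootstrap for the threshold-based metrics can be sketched like this, using sensitivity as the example statistic. The class sizes (23 events, 37 non-events) match the reported test set; the predictions are synthetic.

```python
# Sketch of a class-stratified bootstrap 95% CI (B = 2,000): positives and
# negatives are resampled separately so every resample keeps the test set's
# class sizes. Toy predictions, not the study's.
import numpy as np

rng = np.random.default_rng(2)
y_true = np.array([1] * 23 + [0] * 37)     # test-set class sizes from the study
y_pred = rng.integers(0, 2, 60)

def sensitivity(yt, yp):
    pos = yt == 1
    return (yp[pos] == 1).mean()

B = 2000
pos_idx, neg_idx = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
stats = []
for _ in range(B):
    # resample each class with replacement, preserving class sizes
    idx = np.concatenate([rng.choice(pos_idx, len(pos_idx)),
                          rng.choice(neg_idx, len(neg_idx))])
    stats.append(sensitivity(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(stats, [2.5, 97.5])
```

Stratifying the resamples prevents bootstrap replicates in which one class nearly vanishes, which matters with only 23 events.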
On the training set, GBM achieved the highest AUC at 0.853 (95% CI 0.789-0.916), followed by SVM (0.844), LightGBM (0.843), logistic regression (0.841), and KNN (0.806). On the internal hold-out test set (n = 60), GBM again led with an AUC of 0.855 (95% CI 0.746-0.964), followed by logistic regression (0.844), LightGBM and SVM (both 0.841), and KNN (0.588). The sharp drop-off for KNN on the test set suggests poor generalization for that algorithm in this context.
Threshold-based metrics for GBM (test set): At the pre-specified threshold of 0.418, GBM achieved accuracy of 0.717 (95% CI 0.600-0.817), sensitivity of 0.783 (95% CI 0.609-0.957), specificity of 0.676 (95% CI 0.514-0.811), PPV of 0.600 (95% CI 0.484-0.731), NPV of 0.833 (95% CI 0.714-0.957), and F1 of 0.679 (95% CI 0.545-0.792). The confusion matrix showed 18 true positives, 25 true negatives, 12 false positives, and 5 false negatives.
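The reported point estimates follow directly from the stated confusion matrix, which is a useful consistency check:

```python
# Reproducing the reported test-set metrics from the confusion matrix
# (TP = 18, TN = 25, FP = 12, FN = 5; n = 60).
TP, TN, FP, FN = 18, 25, 12, 5
accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 43/60 ≈ 0.717
sensitivity = TP / (TP + FN)                    # 18/23 ≈ 0.783
specificity = TN / (TN + FP)                    # 25/37 ≈ 0.676
ppv         = TP / (TP + FP)                    # 18/30 = 0.600
npv         = TN / (TN + FN)                    # 25/30 ≈ 0.833
f1          = 2 * TP / (2 * TP + FP + FN)       # 36/53 ≈ 0.679
```

All six values agree with the reported estimates to three decimals.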
Calibration and clinical utility: Calibration curves showed good agreement between predicted and observed risks for GBM, with Brier scores of 0.151 (training) and 0.155 (internal test). Decision-curve analysis demonstrated comparable net benefit for GBM, logistic regression, SVM, and LightGBM across most clinically relevant thresholds, with no single model uniformly dominating. KNN consistently underperformed. LightGBM showed slightly higher accuracy (0.750) and specificity (0.730) than GBM, but GBM maintained the most robust overall performance across all metrics.
To enhance interpretability, SHAP (SHapley Additive exPlanations) was applied to the final GBM model. The SHAP summary bar plot ranked the four features by mean absolute SHAP value, with eGFR (estimated glomerular filtration rate) showing the strongest overall influence on predictions. Smoking ranked second, followed by drinking (alcohol use) and high-grade malignancy. The beeswarm plot visualized how feature values pushed predictions higher or lower across the test set.
Case-level interpretation: For a representative high-risk patient with concurrent smoking, drinking, and reduced renal function but no high-grade malignancy, the SHAP waterfall plot showed how individual feature contributions shifted the predicted probability from the baseline E[f(x)] = 0.519 to f(x) = 0.988. In contrast, a low-risk case with no smoking, no drinking, preserved renal function, and no high-grade malignancy had a predicted probability of just 0.186. These case-level visualizations demonstrate how the model provides transparent, individualized rationale for each risk estimate.
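With only four features, exact Shapley values can even be computed by brute force over all 2^4 coalitions, which makes the additive decomposition behind these waterfall plots concrete. The model below is a hypothetical logistic stand-in, not the study's GBM, and the coefficients and baseline are invented for illustration; v(S) evaluates the model with the features in coalition S set to the patient's values and the rest held at a baseline of 0.

```python
# Exact Shapley values via coalition enumeration: phi_i is the weighted average
# marginal contribution of feature i over all coalitions of the other features.
from itertools import combinations
from math import factorial, exp

FEATURES = ["high_grade", "drinking", "low_eGFR", "smoking"]

def model(x):                              # hypothetical risk model, not the GBM
    score = -1.0 + 1.2 * x[0] + 0.8 * x[1] + 1.5 * x[2] + 1.0 * x[3]
    return 1 / (1 + exp(-score))

def shapley(x, baseline=(0, 0, 0, 0)):
    n = len(x)
    def v(S):                              # model with features in S "switched on"
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return model(z)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):                 # coalition sizes 0 .. n-1
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi.append(total)
    return phi

x = (0, 1, 1, 1)          # analogous to the high-risk case: drinking, low eGFR,
phi = shapley(x)          # and smoking present, high-grade malignancy absent
contrib = dict(zip(FEATURES, phi))
```

The efficiency property guarantees that the contributions sum to model(x) minus the baseline prediction, which is exactly the gap a waterfall plot decomposes; a feature at its baseline value (here high_grade = 0) receives a contribution of zero.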
Clinical actionability: The identified predictors map to both modifiable and non-modifiable risk factors. Smoking cessation and alcohol avoidance represent actionable targets for prevention. Vigilant monitoring of renal function and awareness of high-grade disease status may guide early, targeted intervention. Importantly, the authors emphasize that SHAP reflects statistical associations rather than causation, and explanations should inform preventive vigilance and shared decision-making rather than being treated as deterministic.
The final GBM model was deployed as an interactive web-based tool using the R Shiny framework. The interface accepts four inputs (high-grade malignancy, drinking, eGFR, and smoking status) and computes a real-time predicted probability of pneumonia within 90 days. The tool integrates SHAP-based interpretive visualizations and is accessible through standard web browsers on desktop and mobile devices. The authors stress that this is a research-only prototype, not intended for clinical decision-making until temporal and multicenter external validation is completed.
Comparison with MASCC and other tools: The MASCC score, one of the most widely used tools for infection risk in oncology, was originally developed for febrile neutropenia and relies on general clinical parameters such as burden of illness and outpatient status. It does not incorporate tumor biology or treatment-specific factors, making it less suited for organ-specific pneumonia prediction in lymphoma. The authors' GBM model, by contrast, captures non-linear relationships and interactions among malignancy severity, renal function, and behavioral factors, achieving an AUC of 0.855 on internal validation.
Context from related ML studies: Sun et al. predicted post-chemotherapy lung infection in lung cancer using 36 predictors with a LASSO-regularized logistic regression model (AUC approximately 0.89), though without restricting to a 90-day pneumonia endpoint. Peng et al. used XGBoost for infection prediction in newly diagnosed multiple myeloma (AUC approximately 0.88) with a composite infection definition. Both used SHAP for interpretability. The current study is differentiated by its narrow, clinically actionable endpoint and its parsimonious four-predictor model specific to NHL.
Single-center limitation: All 205 patients came from one hospital, and validation used only an internal hold-out test set (n = 60). With just 23 pneumonia events in the test set, confidence intervals are necessarily wide (e.g., AUC 95% CI 0.746-0.964). Performance estimates may be optimistic, and generalizability to other centers, patient populations, and treatment protocols remains unconfirmed.
Phenotype ascertainment concerns: Microbiological data (cultures, PCR panels) were not systematically collected, and imaging studies were not centrally reviewed or scored using a standardized severity metric. Case ascertainment relied on clinical radiology reports, meaning some pneumonia cases may have been misclassified or missed. Incorporating pathogen identification and standardized radiographic scoring could enhance both diagnostic accuracy and model performance in future iterations.
Overfitting risk: Despite confining feature selection and hyperparameter tuning to training cross-validation folds, some risk of overfitting remains with 35 initial candidates and 79 events. The EPV constraints, plateau stopping rule, nested CV stability analyses, and LASSO bootstrapping mitigate this risk but cannot eliminate it entirely. Additionally, several laboratory variables (WBC, ANC, PLT, CRP) showed modest train-test imbalance (SMD approximately 0.21-0.39), though a sensitivity analysis excluding these variables produced essentially unchanged results.
Future directions: The authors plan temporal validation within their center and multicenter external validation to assess transportability, with model recalibration and threshold re-specification if dataset shift is detected. Integration with electronic health records for real-time risk scoring is being explored, subject to governance and privacy safeguards. Future model iterations may incorporate longitudinal clinical trajectories, pathogen-specific data, treatment exposure details, and imaging-derived features to improve accuracy while maintaining interpretability. All code is publicly available on GitHub to facilitate transparency and reproducibility.