Interpretable machine learning model to predict 90-day radiographically confirmed pneumonia


Plain-English Explanations
Pages 1-2
Why Pneumonia Prediction Matters in NHL Chemotherapy

Non-Hodgkin lymphoma (NHL) is a diverse group of lymphoid cancers with a rising global incidence, reaching an estimated 545,000 new cases and 260,000 deaths worldwide in 2020. Immunochemotherapy regimens such as R-CHOP have improved survival considerably, but infectious complications, particularly pneumonia, remain among the most frequent and dangerous adverse events. In elderly NHL patients treated with R-CHOP, pulmonary complications have been reported in up to 40% of cases, with roughly 10% experiencing severe infections. A prior study of 229 newly diagnosed NHL patients found that 91 (39.7%) developed bacterial pneumonia.

The critical 90-day window: Pneumonia most commonly occurs within the first 90 days after chemotherapy initiation, a period marked by bone marrow suppression, disrupted mucosal barriers, and compromised immune function. Early identification of high-risk individuals during this window is essential so that clinicians can initiate timely interventions such as empiric antimicrobial therapy, antifungal prophylaxis, vaccination strategies, and chemotherapy dose adjustments.

Gaps in existing tools: Current risk stratification systems like the MASCC Risk Index, the Talcott classification, and the CISNE score were designed for febrile neutropenia management, not organ-specific pneumonia prediction. Similarly, invasive fungal disease scores target a narrow fungal subset rather than the full bacterial, viral, and fungal pneumonia spectrum encountered during NHL chemotherapy. This study addresses that gap by developing an interpretable machine learning model specifically for 90-day pneumonia risk in NHL.

TL;DR: Nearly 40% of NHL patients develop pneumonia, mostly within 90 days of chemotherapy. Existing risk tools (MASCC, CISNE) were not designed for this purpose. This study builds an interpretable ML model to fill that clinical gap.
Pages 2-5
Retrospective Single-Center Cohort of 205 NHL Patients

This retrospective study was conducted at Dongyang Hospital affiliated with Wenzhou Medical University, enrolling consecutive patients with pathologically confirmed NHL who initiated systemic chemotherapy between October 2018 and October 2024. Eligibility required at least one chemotherapy cycle and follow-up for at least 90 days (or until pneumonia or death). Patients with pre-existing pneumonia or incomplete data were excluded, yielding a final cohort of 205 patients.

Outcome definition: The primary endpoint was radiographically confirmed pneumonia within 90 days of chemotherapy initiation. Case ascertainment required both a new or progressive pulmonary infiltrate on chest radiography or CT, and at least one clinical criterion: new or worsened cough/sputum, fever, auscultatory findings consistent with consolidation, or an abnormal leukocyte count. Non-infectious mimics (drug-induced pneumonitis, cardiogenic pulmonary edema, pulmonary embolism, tumor progression) were excluded. The definition adapted the 2019 ATS/IDSA community-acquired pneumonia framework for case ascertainment only, since that guideline excludes immunocompromised hosts.

Baseline characteristics: Of the 205 patients, 79 (38.5%) developed pneumonia. The pneumonia group was more frequently male (69.6% vs. 48.4%, p = 0.005), had advanced Ann Arbor stage III-IV disease (78.5% vs. 58.7%, p = 0.006), higher rates of smoking (57.0% vs. 16.7%, p < 0.001), alcohol use (58.2% vs. 17.5%, p < 0.001), high-grade malignancy (81.0% vs. 46.0%, p < 0.001), and reduced eGFR below 80 mL/min/1.73 m2 (51.9% vs. 9.5%, p < 0.001). Most other laboratory and demographic variables showed no significant differences.

Data preprocessing: A total of 35 clinical variables covering demographics, disease status, comorbidities, treatment factors, and baseline labs were collected. All predictors were dichotomized to binary indicators (0/1). Missing values were imputed using k-nearest neighbors (kNN) with k = 5, fit on training data only to avoid information leakage. Fractional values resulting from kNN on binary fields were post-processed using a 0.5 threshold to restore 0/1 coding.
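The leakage-safe imputation step described above can be sketched as follows. This is an illustrative Python/scikit-learn version (the study itself used R), with hypothetical toy data; the two key points are fitting the imputer on the training split only and re-binarizing kNN's fractional outputs at 0.5.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy binary predictor matrices with some missing entries (np.nan).
X_train = rng.integers(0, 2, size=(145, 4)).astype(float)
X_test = rng.integers(0, 2, size=(60, 4)).astype(float)
X_train[rng.random(X_train.shape) < 0.05] = np.nan
X_test[rng.random(X_test.shape) < 0.05] = np.nan

# Fit on the training data only to avoid information leakage.
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)   # test set is only transformed

# kNN averaging can produce fractions (e.g. 0.4) on binary fields;
# a 0.5 threshold restores the 0/1 coding.
X_train_imp = (X_train_imp >= 0.5).astype(int)
X_test_imp = (X_test_imp >= 0.5).astype(int)
```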

TL;DR: 205 NHL patients from a single Chinese center (2018-2024). 79 (38.5%) developed radiographically confirmed pneumonia within 90 days. Pneumonia patients were more often male, smokers, drinkers, had high-grade disease, and reduced eGFR. 35 variables were collected and dichotomized, with kNN imputation for missing values.
Pages 3, 5-6
Two-Step Feature Selection: LASSO Followed by RF-RFE

The authors used a two-stage feature selection procedure to identify a parsimonious and stable predictor set from the initial 35 candidate variables. In the first step, LASSO logistic regression was applied with 10-fold cross-validation, selecting the lambda.1se solution (the strongest penalty whose cross-validated error lies within one standard error of the minimum) to favor sparsity and reduce variance. This retained exactly four variables: high-grade malignancy, drinking (alcohol use), estimated glomerular filtration rate (eGFR), and smoking.

RF-RFE refinement: In the second step, random-forest-based recursive feature elimination (RF-RFE) was applied to the LASSO-retained variables, constrained by an events-per-variable (EPV) threshold of at least 5 and a maximum of 10 predictors (appropriate for the 79 pneumonia events). A pre-specified stopping rule selected the smallest subset on the performance plateau, defined as a change in AUC of 0.01 or less from the maximum cross-validated AUC. The mean CV AUCs were 0.803 for 3 predictors and 0.804 for 4 predictors (difference = 0.001), both satisfying the plateau criterion. A tie-breaker favoring higher mean CV AUC and greater selection stability confirmed the four-predictor model.
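The lambda.1se rule can be illustrated in Python, although the authors worked in R (glmnet's cv.glmnet exposes lambda.1se directly). This sketch re-creates the rule on synthetic data with scikit-learn's L1-penalized LogisticRegressionCV, where C = 1/lambda, so the strongest qualifying penalty is the smallest C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic stand-in for the 205 x 35 predictor matrix.
X, y = make_classification(n_samples=205, n_features=35,
                           n_informative=4, random_state=0)

Cs = np.logspace(-3, 2, 30)          # ascending; small C = strong penalty
cv_model = LogisticRegressionCV(Cs=Cs, cv=10, penalty="l1",
                                solver="liblinear",
                                scoring="neg_log_loss",
                                random_state=0).fit(X, y)

scores = cv_model.scores_[1]         # (n_folds, n_Cs) CV scores
mean = scores.mean(axis=0)
se = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])

best = int(np.argmax(mean))
# Smallest C (largest lambda) whose mean CV score is within 1 SE of the best.
ok = np.where(mean >= mean[best] - se[best])[0]
C_1se = Cs[ok[0]]

model_1se = LogisticRegression(C=C_1se, penalty="l1",
                               solver="liblinear").fit(X, y)
n_selected = int((model_1se.coef_ != 0).sum())   # sparser than lambda.min
```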

Stability validation: Nested cross-validation (outer 5-fold, 20 repeats) showed consistently elevated selection frequencies for these four predictors. Additionally, 200 bootstrap resamples of the LASSO step produced concordant stability results. Collinearity diagnostics (VIF/GVIF and Pearson correlations) indicated no concerning multicollinearity among the final predictors.

TL;DR: From 35 candidate variables, LASSO (lambda.1se) retained 4 predictors: high-grade malignancy, drinking, eGFR, and smoking. RF-RFE confirmed this set was optimal (CV AUC = 0.804 for 4 features). Nested CV and bootstrap analyses verified selection stability with no multicollinearity concerns.
Pages 6-10
Five Algorithms Compared with Leakage-Safe Preprocessing

The data was split into a training set (n = 145) and an internal hold-out test set (n = 60) using stratified 70/30 random splitting to preserve class distribution. All preprocessing, including kNN imputation, feature selection, SMOTE for class imbalance, and hyperparameter tuning, was strictly confined to training-set cross-validation folds. The hold-out test set remained untouched until final evaluation, preventing any information leakage.

Algorithms and tuning: Five ML models were trained using the same four predictors: logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), gradient boosting machine (GBM), and LightGBM. For SVM, GBM, and KNN, hyperparameters were optimized via Bayesian optimization (rBayesianOptimization, upper confidence bound with kappa = 2.0), each evaluated with 5-fold CV and within-fold SMOTE. For LightGBM, a two-stage procedure used Bayesian search with 4-fold CV and early stopping, followed by confirmatory 5-fold CV to determine the optimal iteration number. Logistic regression required no hyperparameter tuning. SVM and KNN used z-score centering and scaling, while GBM, LightGBM, and logistic regression did not.

SMOTE handling: To address class imbalance (79 events vs. 126 non-events), SMOTE was applied within each resampling fold of cross-validation using caret's trainControl, never on the hold-out test set. Decision thresholds were pre-specified on the training set by maximizing Youden's J index and then fixed for both training and test evaluations.
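Threshold pre-specification via Youden's J can be sketched as below (Python/scikit-learn stand-in for the R workflow; the scores are hypothetical toy values, and the within-fold SMOTE step is omitted here).

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, 145)
# Toy predicted probabilities, mildly informative by construction.
p_train = np.clip(0.35 * y_train + 0.4 * rng.random(145), 0.0, 1.0)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR.
fpr, tpr, thresholds = roc_curve(y_train, p_train)
j = tpr - fpr
threshold = thresholds[np.argmax(j)]

# The threshold is then FIXED and applied unchanged to the test set:
# y_pred_test = (p_test >= threshold).astype(int)
```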

Evaluation metrics: Performance was assessed using AUC (with DeLong 95% CIs), F1 score, Brier score, accuracy, sensitivity, specificity, PPV, and NPV. Threshold-based metrics used class-stratified bootstrap 95% CIs (B = 2,000). Calibration curves and decision-curve analysis (DCA) were also generated to evaluate clinical utility.
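The class-stratified bootstrap for threshold-based metrics can be sketched in plain NumPy (the study used R; labels and predictions below are hypothetical). Resampling within each class keeps every replicate at the observed 23/37 event balance.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([1] * 23 + [0] * 37)     # 23 events, 37 non-events
y_pred = rng.integers(0, 2, 60)            # toy binary predictions

def sensitivity(y, yhat):
    pos = y == 1
    return float((yhat[pos] == 1).mean())

B = 2000
idx_pos = np.where(y_true == 1)[0]
idx_neg = np.where(y_true == 0)[0]
stats = np.empty(B)
for b in range(B):
    # Resample positives and negatives separately (class-stratified).
    samp = np.concatenate([rng.choice(idx_pos, idx_pos.size, replace=True),
                           rng.choice(idx_neg, idx_neg.size, replace=True)])
    stats[b] = sensitivity(y_true[samp], y_pred[samp])

lo, hi = np.percentile(stats, [2.5, 97.5])   # 95% percentile CI
point = sensitivity(y_true, y_pred)
```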

TL;DR: Five models (logistic regression, SVM, KNN, GBM, LightGBM) were trained on n = 145 with stratified 70/30 split. Bayesian optimization tuned hyperparameters within 5-fold CV. SMOTE was applied within folds only. Thresholds were set by Youden's J on training data and fixed for testing.
Pages 10-12
GBM Achieved the Highest Discrimination Across All Metrics

On the training set, GBM achieved the highest AUC at 0.853 (95% CI 0.789-0.916), followed by SVM (0.844), LightGBM (0.843), logistic regression (0.841), and KNN (0.806). On the internal hold-out test set (n = 60), GBM again led with an AUC of 0.855 (95% CI 0.746-0.964), followed by logistic regression (0.844), LightGBM and SVM (both 0.841), and KNN (0.588). The sharp drop-off for KNN on the test set suggests poor generalization for that algorithm in this context.

Threshold-based metrics for GBM (test set): At the pre-specified threshold of 0.418, GBM achieved accuracy of 0.717 (95% CI 0.600-0.817), sensitivity of 0.783 (95% CI 0.609-0.957), specificity of 0.676 (95% CI 0.514-0.811), PPV of 0.600 (95% CI 0.484-0.731), NPV of 0.833 (95% CI 0.714-0.957), and F1 of 0.679 (95% CI 0.545-0.792). The confusion matrix showed 18 true positives, 25 true negatives, 12 false positives, and 5 false negatives.
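All of the reported point estimates follow directly from the confusion matrix, which makes them easy to verify:

```python
# Reported GBM test-set confusion matrix.
TP, TN, FP, FN = 18, 25, 12, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)              # 43/60 ≈ 0.717
sensitivity = TP / (TP + FN)                               # 18/23 ≈ 0.783
specificity = TN / (TN + FP)                               # 25/37 ≈ 0.676
ppv         = TP / (TP + FP)                               # 18/30 = 0.600
npv         = TN / (TN + FN)                               # 25/30 ≈ 0.833
f1          = 2 * ppv * sensitivity / (ppv + sensitivity)  # ≈ 0.679
```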

Calibration and clinical utility: Calibration curves showed good agreement between predicted and observed risks for GBM, with Brier scores of 0.151 (training) and 0.155 (internal test). Decision-curve analysis demonstrated comparable net benefit for GBM, logistic regression, SVM, and LightGBM across most clinically relevant thresholds, with no single model uniformly dominating. KNN consistently underperformed. LightGBM showed slightly higher accuracy (0.750) and specificity (0.730) than GBM, but GBM maintained the most robust overall performance across all metrics.

TL;DR: GBM was the top performer: AUC 0.855 (95% CI 0.746-0.964), F1 0.679, Brier 0.155 on the hold-out test set (n = 60). Sensitivity was 78.3% and NPV was 83.3%. KNN collapsed on the test set (AUC 0.588). Calibration and DCA supported GBM as the most robust choice.
Pages 12-14
SHAP Explanations Reveal eGFR as the Strongest Predictor

To enhance interpretability, SHAP (Shapley Additive Explanations) was applied to the final GBM model. The SHAP summary bar plot ranked the four features by mean absolute SHAP value, with eGFR (estimated glomerular filtration rate) showing the strongest overall influence on predictions. Smoking ranked second, followed by drinking (alcohol use) and high-grade malignancy. The beeswarm plot visualized how feature values pushed predictions higher or lower across the test set.

Case-level interpretation: For a representative high-risk patient with concurrent smoking, drinking, and reduced renal function but no high-grade malignancy, the SHAP waterfall plot showed how individual feature contributions shifted the predicted probability from the baseline E[f(x)] = 0.519 to f(x) = 0.988. In contrast, a low-risk case with no smoking, no drinking, preserved renal function, and no high-grade malignancy had a predicted probability of just 0.186. These case-level visualizations demonstrate how the model provides transparent, individualized rationale for each risk estimate.
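The additivity that makes waterfall plots possible is SHAP's defining property: the per-feature contributions sum exactly from the baseline E[f(x)] to the individual prediction f(x). The sketch below computes exact Shapley values by brute-force coalition enumeration for a toy 4-feature scoring function (hypothetical weights, not the paper's GBM), filling in absent features from a background point.

```python
from itertools import combinations
from math import factorial

def f(x):
    # Toy linear risk score over (high_grade, drinking, low_eGFR, smoking);
    # weights are made up for illustration.
    w = [0.8, 0.9, 1.1, 1.0]
    return sum(wi * xi for wi, xi in zip(w, x))

def shapley(f, x, background):
    """Exact Shapley values by enumerating all coalitions (fine for n=4)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                weight = (factorial(len(S)) * factorial(n - len(S) - 1)
                          / factorial(n))
                with_i = [x[j] if (j in S or j == i) else background[j]
                          for j in range(n)]
                without = [x[j] if j in S else background[j]
                           for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

x = [0, 1, 1, 1]            # e.g. smoker + drinker + reduced eGFR
background = [0.5] * 4      # stand-in for the "average patient"
phi = shapley(f, x, background)
baseline = f(background)
# Additivity: baseline + sum(phi) equals f(x), as in a waterfall plot.
```

In practice the paper's plots come from the shap library's tree explainer, which computes these values efficiently for the fitted GBM rather than by enumeration.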

Clinical actionability: The identified predictors map to both modifiable and non-modifiable risk factors. Smoking cessation and alcohol avoidance represent actionable targets for prevention. Vigilant monitoring of renal function and awareness of high-grade disease status may guide early, targeted intervention. Importantly, the authors emphasize that SHAP reflects statistical associations rather than causation, and explanations should inform preventive vigilance and shared decision-making rather than being treated as deterministic.

TL;DR: SHAP ranked eGFR as the most influential predictor, followed by smoking, drinking, and high-grade malignancy. A high-risk patient (smoking + drinking + low eGFR) had a predicted probability of 0.988 vs. baseline 0.519. A low-risk patient scored 0.186. Smoking and alcohol are modifiable targets for prevention.
Pages 14-15
R Shiny Prototype and Comparison with Existing Scoring Systems

The final GBM model was deployed as an interactive web-based tool using the R Shiny framework. The interface accepts four inputs (high-grade malignancy, drinking, eGFR, and smoking status) and computes a real-time predicted probability of pneumonia within 90 days. The tool integrates SHAP-based interpretive visualizations and is accessible through standard web browsers on desktop and mobile devices. The authors stress that this is a research-only prototype, not intended for clinical decision-making until temporal and multicenter external validation is completed.

Comparison with MASCC and other tools: The MASCC score, one of the most widely used tools for infection risk in oncology, was originally developed for febrile neutropenia and relies on general clinical parameters such as burden of illness and outpatient status. It does not incorporate tumor biology or treatment-specific factors, making it less suited for organ-specific pneumonia prediction in lymphoma. The authors' GBM model, by contrast, captures non-linear relationships and interactions among malignancy severity, renal function, and behavioral factors, achieving an AUC of 0.855 on internal validation.

Context from related ML studies: Sun et al. predicted post-chemotherapy lung infection in lung cancer using 36 predictors with a LASSO-regularized logistic regression model (AUC approximately 0.89), though without restricting to a 90-day pneumonia endpoint. Peng et al. used XGBoost for infection prediction in newly diagnosed multiple myeloma (AUC approximately 0.88) with a composite infection definition. Both used SHAP for interpretability. The current study is differentiated by its narrow, clinically actionable endpoint and its parsimonious four-predictor model specific to NHL.

TL;DR: An R Shiny web tool provides real-time pneumonia risk scores from 4 inputs. Unlike the MASCC score (designed for febrile neutropenia), this model targets NHL-specific pneumonia and captures non-linear interactions. Related ML studies in lung cancer (AUC 0.89) and myeloma (AUC 0.88) used far more predictors.
Pages 15-16
Single-Center Design, Small Test Set, and No External Validation

Single-center limitation: All 205 patients came from one hospital, and validation used only an internal hold-out test set (n = 60). With just 23 pneumonia events in the test set, confidence intervals are necessarily wide (e.g., AUC 95% CI 0.746-0.964). Performance estimates may be optimistic, and generalizability to other centers, patient populations, and treatment protocols remains unconfirmed.

Phenotype ascertainment concerns: Microbiological data (cultures, PCR panels) were not systematically collected, and imaging studies were not centrally reviewed or scored using a standardized severity metric. Case ascertainment relied on clinical radiology reports, meaning some pneumonia cases may have been misclassified or missed. Incorporating pathogen identification and standardized radiographic scoring could enhance both diagnostic accuracy and model performance in future iterations.

Overfitting risk: Despite confining feature selection and hyperparameter tuning to training cross-validation folds, some risk of overfitting remains with 35 initial candidates and 79 events. The EPV constraints, plateau stopping rule, nested CV stability analyses, and LASSO bootstrapping mitigate this risk but cannot eliminate it entirely. Additionally, several laboratory variables (WBC, ANC, PLT, CRP) showed modest train-test imbalance (SMD approximately 0.21-0.39), though a sensitivity analysis excluding these variables produced essentially unchanged results.

Future directions: The authors plan temporal validation within their center and multicenter external validation to assess transportability, with model recalibration and threshold re-specification if dataset shift is detected. Integration with electronic health records for real-time risk scoring is being explored, subject to governance and privacy safeguards. Future model iterations may incorporate longitudinal clinical trajectories, pathogen-specific data, treatment exposure details, and imaging-derived features to improve accuracy while maintaining interpretability. All code is publicly available on GitHub to facilitate transparency and reproducibility.

TL;DR: Key limitations include single-center data (n = 205), a small test set (n = 60, 23 events), no systematic microbiology, no standardized radiology re-reads, and no external validation. Future plans include temporal and multicenter validation, EHR integration, and potential incorporation of pathogen-specific and imaging-derived features. Code is open-source on GitHub.
Citation: Zhang Z, Su M, Jiang P, et al. 2025. Open access (CC BY). Available at: PMC12497835. DOI: 10.3389/fmed.2025.1674896.