Lung cancer is the leading cause of cancer death worldwide, and non-small cell lung cancer (NSCLC) accounts for the majority of cases, carrying a five-year survival rate of only 18%. Most patients present with late-stage disease and are treated with non-surgical approaches such as radiation, chemotherapy, targeted therapy, or immunotherapy. Monitoring how tumors respond to these treatments over time through follow-up imaging is critical, yet current clinical standards such as RECIST (Response Evaluation Criteria in Solid Tumors) rely on simple size-based measurements, typically the longest axial diameter of a lesion.
The radiomics approach: Artificial intelligence enables a quantitative, rather than qualitative, assessment of radiographic tumor characteristics. Traditional machine learning involved hand-engineered features for image quantification, with demonstrated success in response assessment and outcome prediction. More recently, deep learning, specifically convolutional neural networks (CNNs), allows automated feature extraction from images without requiring human-defined features. CNNs pre-trained on millions of photographic images (such as ImageNet) can be applied to medical images through transfer learning, a technique already proven in cancer detection and staging.
The gap this study addresses: Most quantitative imaging studies at the time focused on biomarkers from a single time-point. However, tumors are dynamic biological systems with evolving vascular and stem cell contributions that cannot be fully captured at one moment. Recurrent neural networks (RNNs), already successful in video classification and natural language processing, offered a way to incorporate longitudinal imaging data. Only a few studies had applied these advanced computational approaches in radiology before this work.
The authors from Harvard Medical School (Dana-Farber Cancer Institute and Brigham and Women's Hospital) set out to combine CNNs with RNNs to analyze serial CT images of stage III NSCLC patients, predicting survival and treatment response using pre-treatment and follow-up scans. Crucially, the method required only a single seed-point click to localize the tumor, eliminating the need for time-consuming volumetric segmentations.
Dataset A (training and testing): This cohort contained 179 consecutive stage III NSCLC patients treated at the Dana-Farber/Brigham and Women's Cancer Center between 2003 and 2014 with definitive chemoradiation (Carboplatin/Paclitaxel or Cisplatin/Etoposide). Each patient had at least one follow-up CT scan, yielding a total of 581 CT scans (average 3.2 scans per patient, range 2 to 4). These included pre-treatment scans and follow-ups at 1, 3, and 6 months after radiation therapy. Of the 581 scans, 125 were CTs acquired for attenuation correction during PET/CT and 456 were diagnostic CTs. The cohort was 52.8% female, with a median age of 63 years (range 32 to 93), predominantly stage IIIA (58.9%), and 58.1% adenocarcinoma. The median radiation dose was 66 Gy (range 45 to 70 Gy), with a median follow-up of 31.4 months. Dataset A was randomly split 2:1 into training/tuning (n=107) and test (n=72) sets, with no significant difference in patient characteristics between them (p>0.1).
Dataset B (external validation): A separate cohort of 89 consecutive stage III NSCLC patients from the same institution, treated between 2001 and 2013 with neoadjuvant chemoradiation followed by surgical resection (trimodality therapy). This yielded 178 CT scans at two time points: pre-radiation and post-radiation (prior to surgery). The median radiation dose was lower at 54 Gy (range 50 to 70 Gy), with a median follow-up of 37.1 months. Patients were excluded if they had distant metastasis, a delay exceeding 120 days between chemoradiation and surgery, or lacked survival data.
Endpoints: For Dataset A, the primary endpoint was prediction of overall survival, with secondary endpoints of distant metastasis, disease progression, and locoregional recurrence. For Dataset B, the endpoint was pathological response at surgery, classified as responders (pathological complete response, n=14, or microscopic residual disease, n=28) versus gross residual disease (n=47). No histological exclusions were applied in either cohort.
Image preprocessing: CTs were acquired on GE LightSpeed scanners. Because slice thickness and in-plane resolution varied across scans, all CT volumes were interpolated to a uniform 1 × 1 × 1 mm³ voxel size using linear and nearest-neighbor interpolation. For each time point, the model input consisted of three axial slices of 50 × 50 mm², centered on a manually defined seed point (placed using 3D Slicer 4.8.1) and 5 mm above and below it. This three-slice approach provides spatial context while keeping the feature count manageable and reducing GPU memory usage compared to a full 3D volume.
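The resampling and patch-extraction steps above can be sketched as follows. This is a minimal illustration with a synthetic volume and hypothetical function names, not the authors' actual pipeline; only linear interpolation is shown, and the seed point is an arbitrary example.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_1mm(volume, spacing):
    """Resample a CT volume to isotropic 1 x 1 x 1 mm voxels (linear interpolation).
    spacing is the (z, y, x) voxel size in mm, so it doubles as the zoom factor."""
    return zoom(volume, spacing, order=1)

def extract_input_slices(volume, seed_zyx, size_mm=50, offset_mm=5):
    """Return three 50 x 50 mm axial patches: at the seed slice and 5 mm above/below,
    stacked as channels so a pretrained RGB network can consume them."""
    z, y, x = seed_zyx
    half = size_mm // 2
    patches = []
    for dz in (-offset_mm, 0, offset_mm):
        patches.append(volume[z + dz, y - half:y + half, x - half:x + half])
    return np.stack(patches, axis=-1)  # shape (50, 50, 3)

# toy volume: 2.0 mm slices, 0.8 mm in-plane spacing
vol = np.random.rand(60, 200, 200).astype(np.float32)
iso = resample_to_1mm(vol, (2.0, 0.8, 0.8))          # -> (120, 160, 160)
patches = extract_input_slices(iso, seed_zyx=(60, 80, 80))
```

Stacking the three slices as channels is what lets the downstream ResNet reuse its ImageNet-trained first layer unchanged.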
Transfer learning with ResNet: The network was implemented in Python using Keras with TensorFlow backend (Python 2.7, Keras 2.0.8, TensorFlow 1.3.0). The base architecture was a ResNet CNN pre-trained on the ImageNet database of over 14 million natural images. One separate CNN was defined for each time point, so a patient with scans at three time points would have three parallel CNN branches. The three axial slices were fed into each CNN as if they were RGB color channels, leveraging the ResNet architecture originally designed for color photographs.
Recurrent network for longitudinal analysis: The CNN-extracted features from each time point were then fed into recurrent layers with gated recurrent units (GRU). The GRU architecture was specifically chosen because it contains update and reset gates that control how much information from each time point is passed forward. Critically, the RNN was designed to handle missing scans by masking the time point when a scan was unavailable, avoiding immortal time bias. After the GRU layers, averaging and fully connected layers were applied, with batch normalization and dropout after each fully connected layer to prevent overfitting. The final softmax layer produced a binary classification output.
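To make the gating and masking behavior concrete, here is a minimal pure-NumPy GRU cell. The weights, dimensions, and masking convention are illustrative assumptions, not the authors' implementation (which used Keras recurrent layers); the point is how the update gate z and reset gate r control information flow, and how a masked time point simply carries the state forward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with update (z) and reset (r) gates."""
    def __init__(self, n_in, n_hidden, rng):
        shape = (n_hidden, n_in + n_hidden)
        self.Wz = rng.normal(0, 0.1, shape)   # update gate weights
        self.Wr = rng.normal(0, 0.1, shape)   # reset gate weights
        self.Wh = rng.normal(0, 0.1, shape)   # candidate-state weights

    def step(self, x, h, present=True):
        if not present:
            # masked time point (scan unavailable): pass the state through unchanged
            return h
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                               # how much to update
        r = sigmoid(self.Wr @ xh)                               # how much history to reuse
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h])) # candidate state
        return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
cell = GRUCell(n_in=8, n_hidden=4, rng=rng)
h = np.zeros(4)
# CNN features from 4 time points; suppose the 3-month scan (index 2) is missing
features = [rng.normal(size=8) for _ in range(4)]
mask = [True, True, False, True]
for x, m in zip(features, mask):
    h = cell.step(x, h, present=m)
```

In the study's Keras setup the same effect is achieved with a masking layer ahead of the GRU, which is what avoids penalizing patients who lack a particular follow-up scan.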
Training protocol: Training used Monte Carlo cross-validation with 10 different splits (further 3:2 split of training:tuning) on the 107 discovery patients, with class weight balancing for up to 300 epochs. Image augmentation included flipping, translation, rotation, and small-scale deformation applied consistently across the entire input series for each patient. The same augmentation was applied to pre-treatment and follow-up images to preserve the temporal mapping.
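The key detail, drawing one random transform and applying it identically to every time point, can be sketched like this. The flip/rotation/shift choices are illustrative stand-ins for the paper's augmentation set (np.roll is used as a simplified, wrap-around translation), not the exact operations used.

```python
import numpy as np

def augment_series(series, rng):
    """Draw one random transform and apply it to every time point identically,
    so the temporal correspondence between scans is preserved."""
    flip = rng.random() < 0.5
    k = int(rng.integers(0, 4))              # number of 90-degree rotations
    shift = rng.integers(-3, 4, size=2)      # small in-plane translation (pixels)
    out = []
    for img in series:                       # img: (H, W, C) input for one time point
        a = np.flip(img, axis=1) if flip else img
        a = np.rot90(a, k, axes=(0, 1))
        a = np.roll(a, shift, axis=(0, 1))   # simplification: wraps at the border
        out.append(a)
    return out

rng = np.random.default_rng(0)
series = [np.random.rand(50, 50, 3) for _ in range(4)]   # 4 time points
aug = augment_series(series, rng)
```

Drawing the random parameters once per patient (outside the per-image loop) is what distinguishes this from ordinary per-image augmentation, which would scramble the longitudinal signal the RNN is meant to learn.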
All predictions were evaluated on the independent test set of 72 patients from Dataset A. The primary statistical measures included the area under the receiver operating characteristic curve (AUC) and the Wilcoxon rank-sum test (Mann-Whitney U test) to assess differences between positive and negative survival groups. Clinical endpoints evaluated were one-year and two-year overall survival, distant metastasis, disease progression, and locoregional recurrence.
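An evaluation of this shape takes only a few lines with scikit-learn and SciPy. The labels and probabilities below are synthetic stand-ins for the network's outputs on the 72 test patients, purely to show the two statistics used.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# synthetic ground truth (1 = alive at two years) and predicted probabilities
y_true = rng.integers(0, 2, size=72)
scores = np.clip(0.3 * y_true + rng.normal(0.5, 0.2, size=72), 0.0, 1.0)

auc = roc_auc_score(y_true, scores)
# Wilcoxon rank-sum (Mann-Whitney U): do scores differ between outcome groups?
u_stat, p = mannwhitneyu(scores[y_true == 1], scores[y_true == 0],
                         alternative="two-sided")
```

The AUC summarizes ranking quality across all thresholds, while the rank-sum p-value tests whether the two outcome groups' score distributions differ at all, which is why the paper reports both.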
Survival analysis: Kaplan-Meier estimates were calculated for low and high mortality risk groups, stratified at the median prediction probability from the training set, and compared using the log-rank test. Hazard ratios were derived from Cox proportional-hazards models. These analyses assessed whether the deep learning model could meaningfully separate patients into clinically distinct risk groups, not just achieve statistical significance on aggregate metrics.
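The Kaplan-Meier estimate underlying this stratification is simple to compute directly. Below is a minimal NumPy version with two made-up risk groups; it is illustrative only, not the authors' statistical software, and it omits confidence intervals and the log-rank test.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimator. times: follow-up time; events: 1 = death, 0 = censored.
    Returns step times and the survival probability after each event time."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events)
    out_t, surv = [0.0], [1.0]
    s = 1.0
    for ti in np.unique(t[e == 1]):          # only event times change the estimate
        d = np.sum((t == ti) & (e == 1))     # deaths at ti
        n = np.sum(t >= ti)                  # patients still at risk at ti
        s *= 1.0 - d / n                     # multiply in the conditional survival
        out_t.append(ti)
        surv.append(s)
    return np.array(out_t), np.array(surv)

# two hypothetical groups, stratified at the median predicted probability (months)
low_t, low_s = kaplan_meier([12, 20, 31, 31, 40], [1, 0, 1, 0, 0])
high_t, high_s = kaplan_meier([4, 7, 10, 15, 22], [1, 1, 1, 0, 1])
```

Censored patients (events = 0) leave the risk set without lowering the curve, which is the whole point of using Kaplan-Meier rather than a raw death fraction when follow-up lengths differ.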
Clinical model comparison: A random forest clinical model served as the baseline comparator, incorporating stage, gender, age, tumor grade, performance status, smoking status, and clinical tumor size (primary maximum axial diameter). This comparison was designed to determine whether the deep learning imaging biomarker provided information beyond standard clinical variables.
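A baseline of this form is straightforward to assemble with scikit-learn. The feature matrix below is synthetic and the encodings (binary stage, integer grade, etc.) are assumptions for illustration; the paper does not specify its exact encoding.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 107  # discovery-set size
# hypothetical clinical features: stage, gender, age, grade,
# performance status, smoking status, max axial diameter (mm)
X = np.column_stack([
    rng.integers(0, 2, n),      # stage IIIA vs IIIB
    rng.integers(0, 2, n),      # gender
    rng.normal(63, 10, n),      # age
    rng.integers(1, 4, n),      # tumor grade
    rng.integers(0, 3, n),      # performance status
    rng.integers(0, 2, n),      # smoking status
    rng.normal(40, 15, n),      # clinical tumor size
])
y = rng.integers(0, 2, n)       # synthetic two-year survival label

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X, y)
probs = clf.predict_proba(X)[:, 1]
```

Because the comparator uses exactly the variables a clinician already has, beating it is the minimum bar for claiming the imaging biomarker adds independent information.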
Dataset B validation: The one-year survival model trained on Dataset A (using pre-treatment and first follow-up scans) was applied to the trimodality cohort. The network's prediction probabilities were used to stratify patients by pathological response (responders versus gross residual disease) and compared against tumor volume change as well as the same random forest clinical model. A combined model integrating network probabilities and volume change was also tested.
The central finding of this study is that deep learning model performance improved consistently with each additional follow-up CT scan. For predicting two-year overall survival, the model using only the pre-treatment scan achieved an AUC of 0.58 (p=0.3, not significant). Adding the 1-month follow-up raised the AUC to 0.64 (p=0.04). Including the 3-month follow-up further improved it to 0.69 (p=0.007). With all scans through 6 months, the model reached an AUC of 0.74 (p=0.001). This same stepwise improvement pattern was observed across all clinical endpoints: one-year survival, distant metastasis, progression, and locoregional recurrence.
Risk stratification: The model successfully stratified patients into low and high mortality risk groups using Kaplan-Meier analysis. For two-year overall survival, significant separation was achieved with 2 follow-up scans (p=0.023, log-rank) and 3 follow-up scans (p=0.027, log-rank). The most striking hazard ratios were observed with 3 follow-up time points: one-year overall survival (HR=6.16, 95% CI [2.17, 17.44], p=0.0004), distant metastasis-free survival (HR=3.99, 95% CI [1.31, 12.13], p=0.01), progression-free survival (HR=3.20, 95% CI [1.16, 8.87], p=0.02), and locoregional recurrence-free survival (HR=2.74, 95% CI [1.18, 6.34], p=0.02).
Clinical model failure: By contrast, the random forest clinical model incorporating stage, gender, age, tumor grade, performance status, smoking status, and clinical tumor size was unable to achieve a statistically significant prediction for either survival (two-year survival AUC=0.51, p=0.93) or any of the treatment response endpoints. This underscores that the deep learning imaging features captured biological information that standard clinical variables missed entirely.
The model trained on Dataset A was applied to Dataset B (89 trimodality patients) to test whether deep learning survival predictions could generalize to a different treatment context and predict pathological response at surgery. Using the one-year survival model with two time points (pre-treatment and post-radiation, prior to surgery), the network significantly distinguished between responders and patients with gross residual disease, achieving an AUC of 0.65 (n=89, p=0.016, Wilcoxon test).
Comparison with tumor volume change: The change in primary tumor volume also predicted pathological response with a nearly identical AUC of 0.65 (n=89, p=0.017). However, the CNN probabilities and tumor volume changes were only weakly correlated (Spearman r=0.39, p=0.0002), suggesting that the neural network detected radiographic characteristics beyond simple tumor shrinkage. A combined model integrating both the CNN probabilities and volume change showed slightly improved performance (AUC=0.67, n=89, p=0.006).
Clinical model failure again: The random forest clinical model using standard clinical variables did not achieve a statistically significant prediction of pathological response (p=0.42, Wilcoxon test). The deep learning model, by contrast, also significantly predicted surrogates of survival in Dataset B, including distant metastasis, progression, and locoregional recurrence, though overall survival prediction trended toward significance without reaching it, likely due to the low number of events (30 of 89 patients) and inherent differences between the two cohorts.
The fact that the model was completely blinded to Dataset B during training, and that Dataset B contained younger, healthier surgical patients with different disease burdens, makes this cross-cohort validation particularly meaningful. It demonstrates that the deep learning features learned from definitive chemoradiation patients captured generalizable tumor biology rather than cohort-specific artifacts.
Sample size: The most significant limitation is the relatively small cohort. With 179 patients in Dataset A and 89 in Dataset B, the study is orders of magnitude smaller than typical deep learning applications. Facial recognition models, for example, train on 87,000 images and test on 5,000. Transfer learning from ImageNet (14 million images) partially compensated, but the authors acknowledge that larger cohorts could improve predictive power, particularly for pre-treatment-only models where the single time-point AUC was only 0.58 (not significant).
2D versus 3D input: The model used only three 2D axial slices per time point, dictated by the ResNet architecture's requirement for pre-trained parameters from ImageNet. A full 3D tumor volume would better represent tumor biology and could improve performance, but 3D pre-trained networks with sufficient training data did not exist at the time. Any 3D model trained on the available medical imaging cohort (thousands of images) would likely overfit to the institution, the patient cohort, and the specific prediction task.
Interpretability: Deep learning operates as a "black box" where hidden layers offer no transparent explanation for predictions. Unlike engineered radiomic features built to capture specific image characteristics (texture, shape, intensity), CNN features are abstract and ambiguous. Activation maps and heat maps generated over the final convolutional layer offer partial visualization of what the network weighs, but true interpretability remains an unsolved challenge. The authors note that incorporating domain knowledge into these abstract features is an important open question.
Additional constraints: The survival models were based purely on CT images without incorporating patient-specific parameters such as age, sex, histology, smoking status, or radiation therapy parameters. With a larger cohort, integrating these clinical variables could boost performance. The retrospective design also meant that not all patients had imaging at every follow-up time point, introducing potential selection effects despite the RNN's ability to handle missing data. Potential immortal time bias, where longer-surviving patients are more likely to have more follow-up scans, could confound results even though the model was designed to mitigate it.
Minimal input, fast output: One of the most clinically appealing aspects of this approach is its simplicity. The model requires only a single seed-point click within the tumor and the corresponding CT images. It does not require volumetric segmentation, which is susceptible to inter-reader variability and is time-consuming. Compared to hand-crafted radiomic features that demand accurate tumor delineation, or RECIST measurements that are prone to inter-operator variability, this approach is more efficient and less sensitive to manual input variability. Once the tumor location is identified on follow-up images (potentially automated via existing lung nodule detection algorithms), the trained network can generate prognostic probabilities within seconds.
Integration into clinical workflow: Follow-up CT scans are already part of the standard clinical workflow for these patients. The deep learning model adds no additional imaging burden. Its predictions could be presented alongside other clinical measures such as RECIST criteria to support patient assessment decisions. The probabilities could also serve as quantitative endpoints in clinical trials assessing treatment response, and eventually support dynamically adapting therapy based on longitudinal imaging patterns.
What differentiates this from prior work: Previous radiomics studies in lung cancer focused on single time-point predictions using either engineered features or deep learning. Other deep learning studies, such as Kumar et al. and Hua et al., used manual delineation of lung nodules with feature extraction at one time point. This study's innovation was the incorporation of multiple time points through RNN architecture, combined with the minimal-input seed-point approach, for predicting both survival and pathological response across two independent cohorts.
Path to clinical deployment: The authors envision that after training on a larger, more diverse population and undergoing extensive external validation and benchmarking against current clinical standards, these quantitative prognostic models could be implemented in routine clinical practice. The approach has implications for precision medicine, enabling non-invasive, repeated, low-cost tracking of tumor phenotype that could inform adaptive and personalized therapy decisions for lung cancer patients.