Artificial intelligence for detection and characterization of pulmonary nodules in lung cancer screening

Plain-English Explanations
Pages 1-2
Lung Cancer Screening Works, but Efficiency Is the Bottleneck

Lung cancer is the leading cause of cancer-related death worldwide, with 5-year survival rates that have yet to surpass 20%. The National Lung Screening Trial (NLST), a landmark randomized controlled trial (RCT) enrolling over 53,000 participants, demonstrated in 2011 that three rounds of annual low-dose CT screening reduced lung cancer deaths by 20% compared to chest radiography after seven years of follow-up. The Dutch-Belgian NELSON trial, the second largest RCT with 15,789 participants, confirmed these benefits with a 24% mortality reduction in high-risk men compared to no screening. Smaller trials, such as the German LUSI and the Italian MILD trial, reported consistent evidence but were statistically underpowered.

Despite these proven benefits, several barriers complicate large-scale implementation. The NLST's original definition of a positive screen (any solid nodule greater than 4 mm) produced a 24% false positive rate. The NELSON trial addressed this by introducing growth-rate assessment for indeterminate nodules, reducing the false positive rate to approximately 2%. This improvement spurred the development of standardized CT reporting systems, including the mandatory Lung-RADS system in the United States, alongside guidelines from the British Thoracic Society, National Comprehensive Cancer Network (NCCN), and the European Union Position Statement on Lung Cancer Screening.

The radiologist's task in screening is complex: assess scan quality, search for pulmonary nodules, measure and classify each nodule by size and type, characterize morphological features, estimate malignancy risk, and determine follow-up recommendations. This process is laborious and subject to substantial reader variability, which directly influences screening effectiveness. The review by Schreuder et al. from Radboudumc was published in Translational Lung Cancer Research and examines whether AI algorithms have matured sufficiently to assist or partially replace human readers in this workflow.

TL;DR: The NLST (53,000+ participants) showed a 20% lung cancer mortality reduction with low-dose CT screening; the NELSON trial (15,789 participants) showed 24%. False positive rates dropped from 24% to approximately 2% with growth-rate assessment. This review evaluates whether AI can further improve screening efficiency and accuracy.
Pages 2-3
Deep Learning and the Rise of Convolutional Neural Networks in Medical Imaging

The authors provide context on why AI performance has accelerated so rapidly. Deep learning gained momentum in 2012 when Krizhevsky et al. implemented a convolutional neural network (CNN) that decisively won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition for classifying and detecting objects in natural images. This breakthrough demonstrated that CNNs could learn high-dimensional features directly from large datasets, bypassing the need for hand-crafted feature engineering. The methodology quickly spread to autonomous driving, natural language processing, big data analytics, and medical image interpretation.

CT scan quality and dose optimization: Before AI can assess a scan, the scan must meet minimum quality standards. In screening, keeping radiation dose low is essential. Standard low-dose CT protocols used in most trials, including the NLST, operated at approximately 1.5 mSv. Since 2009, iterative reconstruction algorithms have progressively replaced older filtered back projection methods, improving image quality by revising the reconstructed image over multiple iterations to remove artifacts. This enabled ultra-low-dose CT at approximately 0.5 mSv, approaching the dose of a standard chest X-ray. Deep learning techniques have since been incorporated to further optimize both radiation dose and reconstruction time.
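
To make the iterative principle concrete, here is a minimal Python sketch of a Landweber-style update on a toy linear system. Actual vendor reconstruction algorithms are far more sophisticated; the matrix A below is a hypothetical stand-in for the CT system model.

```python
import numpy as np

def iterative_reconstruction(A, b, n_iters=500, step=0.01):
    """Landweber-style iterative reconstruction: repeatedly compare the
    current image's forward projection with the raw measurements and
    revise the image to shrink the disagreement."""
    x = np.zeros(A.shape[1])               # start from an empty image
    for _ in range(n_iters):
        residual = b - A @ x               # mismatch with the raw data
        x = x + step * A.T @ residual      # revision step; artifacts fade
    return x

# Toy usage: recover a 16-pixel "image" from 32 noisy projections.
rng = np.random.default_rng(0)
A = rng.normal(size=(32, 16))              # hypothetical system model
truth = rng.random(16)
b = A @ truth + rng.normal(scale=0.01, size=32)
print(np.max(np.abs(iterative_reconstruction(A, b) - truth)))  # small error
```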

A pilot study confirmed that all nodules greater than 2 mm visible on standard low-dose CT were also detectable on ultra-low-dose images. Another study showed that two independent observers achieved higher sensitivity on ultra-low-dose CT with iterative reconstruction than on low-dose CT with filtered back projection. These findings are significant because they suggest that AI-compatible low-radiation protocols can maintain diagnostic quality, making large-scale screening more feasible and safer for patients.

TL;DR: CNNs became dominant after the 2012 ImageNet breakthrough. Standard screening CT uses approximately 1.5 mSv; ultra-low-dose CT with iterative reconstruction achieves approximately 0.5 mSv while preserving nodule detection for nodules greater than 2 mm. Deep learning now helps optimize both image reconstruction and radiation dose.
Pages 3-4
AI Challenges and Benchmarks for Finding Pulmonary Nodules

Detecting pulmonary nodules on CT is the critical first step toward lung cancer diagnosis, and it is a task at which radiologists are imperfect. There is considerable disagreement among readers about what constitutes a pulmonary nodule, and the task of searching for small opacities in images cluttered with vessels and airways is inherently difficult, especially under time pressure. The review traces the evolution of AI detection benchmarks that have driven algorithmic progress in this domain.

ANODE09: The Automated Nodule Detection 2009 challenge was the first web-based framework for comparing nodule detection algorithms on lung cancer screening CTs. All submitted algorithms were tested on 50 anonymized scans containing 207 annotated nodules, with reference values kept secret. The study also proposed a method for combining the output of multiple AI algorithms to achieve improved performance. The main limitation was dataset uniformity, as all scans came from a single center using one scanner and protocol.

LUNA16: To address these limitations, the Lung Nodule Analysis 2016 (LUNA16) challenge was established using 888 scans with 1,186 nodule annotations from the LIDC-IDRI database. Reference values were based on annotations from four radiologists to ensure robustness. At the time of publication, the best algorithm achieved a sensitivity of 97.2% at the cost of 1 false positive per scan on average. The LUNA16 challenge was officially closed in January 2018, but the organizers open-sourced the evaluation scripts and data, allowing it to continue serving as a benchmark for newer algorithms.
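
Numbers like "97.2% sensitivity at 1 false positive per scan" come from FROC analysis: a score threshold is swept over all detection candidates and sensitivity is read off at a fixed false-positive budget. Below is a minimal sketch, assuming each candidate carries a confidence score and a hit/miss label; the names are illustrative, not the LUNA16 evaluation script's API.

```python
import numpy as np

def sensitivity_at_fp_rate(scores, is_hit, n_scans, n_nodules, fp_per_scan=1.0):
    """Sensitivity at a fixed false-positive budget, FROC-style.
    Assumes at most one candidate per reference nodule; nodules with no
    candidate at all simply count as misses via n_nodules."""
    order = np.argsort(scores)[::-1]          # strictest threshold first
    hits = np.asarray(is_hit, dtype=bool)[order]
    tp = np.cumsum(hits)                      # nodules found so far
    fp = np.cumsum(~hits)                     # false alarms so far
    within_budget = fp / n_scans <= fp_per_scan
    return tp[within_budget].max() / n_nodules if within_budget.any() else 0.0

# Tiny example: 6 candidates over 2 scans, 4 reference nodules in total.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_hit = [True, True, False, True, False, False]
print(sensitivity_at_fp_rate(scores, is_hit, n_scans=2, n_nodules=4))  # 0.75
```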

AI versus human readers: Most comparative studies between AI algorithms and individual radiologists for nodule detection were performed over a decade ago. These studies found that algorithms showed slightly inferior or equivalent sensitivities compared to radiologists, but with a noticeable increase in the false positive rate. The key insight is that while individual AI performance approached human levels, the real value emerged in combined human-AI systems rather than direct replacement.

TL;DR: ANODE09 tested algorithms on 50 scans (207 nodules) from one center. LUNA16 scaled up to 888 scans with 1,186 nodules annotated by four radiologists, where the best AI achieved 97.2% sensitivity at 1 false positive per scan. Early head-to-head studies showed AI roughly matching radiologists in sensitivity but with more false positives.
Pages 4-5
AI for Nodule Typing, Volumetric Segmentation, and Size Assessment

Once nodules are detected, screening guidelines stratify them into malignancy risk groups based on two primary criteria: size and type. Ciompi et al. developed an AI algorithm capable of differentiating between six nodule types: solid, part-solid, non-solid, perifissural, calcified, and spiculated. When validated on an external dataset assessed by four experienced human readers, the algorithm performed within the inter-observer variability of the experts, effectively achieving equivalent performance to an independent human specialist. This demonstrated that automatic nodule categorization was reliable enough for screening use.

Size measurement challenges: Nodule size is traditionally determined by manually measuring the longest diameter and the diameter perpendicular to it in the transverse plane. This approach is prone to both inter- and intra-radiologist variability, which can directly influence diagnostic recommendations. Volumetric segmentation methods are more reproducible, and although they have been available for over a decade, they were not commonly used in most lung cancer screening trials.

Volume measurement reproducibility: In same-day repeat scan studies, volume differences between measurements were on the order of ±25%, with large variation across different algorithms. For reliable growth assessment over time, the same segmentation algorithm and version must be used consistently. Despite this variability, the NELSON trial's success with volume-based growth assessment led multiple guidelines (Lung-RADS, British Thoracic Society, EU Position Statement, and I-ELCAP) to advocate semi-automatic volumetric segmentation. A recent study found that mean diameter derived from computer-aided detection (CAD) was as predictive of malignancy as CAD-derived volume in a multivariable logistic regression model.
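
Growth assessment in NELSON-style protocols rests on the volume doubling time, computed from two volume measurements under an exponential-growth assumption. A minimal sketch follows; the 25% growth cut-off and 400-day threshold are illustrative of the published protocol, not a clinical recommendation.

```python
import math

def volume_doubling_time(v1_mm3, v2_mm3, days_between):
    """Volume doubling time in days, assuming exponential growth:
    VDT = dt * ln(2) / ln(V2 / V1)."""
    if v2_mm3 <= v1_mm3:
        return math.inf                    # stable or shrinking nodule
    return days_between * math.log(2) / math.log(v2_mm3 / v1_mm3)

# Illustrative thresholds: >= 25% volume increase counts as growth, and a
# VDT under ~400 days would flag the nodule as positive.
v1, v2, dt = 300.0, 420.0, 365.0           # mm^3, mm^3, days between scans
grew = (v2 - v1) / v1 >= 0.25
vdt = volume_doubling_time(v1, v2, dt)
print(grew, round(vdt))                    # True 752 -> growing, but slowly
```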

TL;DR: AI classifies six nodule types at expert-equivalent accuracy. Manual diameter measurement has significant inter-reader variability; volumetric segmentation is more reproducible but shows ±25% variation across algorithms. Multiple screening guidelines now recommend semi-automatic volumetric approaches for growth tracking.
Pages 5-6
From Statistical Risk Models to Deep Learning for Cancer Prediction

The ultimate goal of lung cancer CT screening is predicting whether a participant has cancer. The most established statistical risk model is the Brock model (also known as the PanCan model), which incorporates patient demographics, nodule size, type, and morphology. The Brock model is integrated into the British Thoracic Society guidelines and recommended in Lung-RADS version 1.1. While it has shown good performance across multiple independent datasets, previous studies demonstrated that radiologists can more accurately assess malignancy risk than the model alone, though radiologists themselves show no consensus when asked to characterize specific signs of malignancy.
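
Structurally, the Brock model is a multivariable logistic regression over patient and nodule predictors. The sketch below shows that shape only; the weights are placeholders, NOT the published Brock/PanCan coefficients, which also involve variable transformations.

```python
import math

# Placeholder weights only: these are NOT the published Brock/PanCan
# coefficients; they exist to show the model's logistic-regression shape.
COEFS = {
    "age_years": 0.03, "female": 0.5, "family_history": 0.6,
    "emphysema": 0.3, "size_mm": 0.2, "upper_lobe": 0.4,
    "part_solid": 0.8, "nonsolid": -0.1, "spiculation": 0.9,
    "nodule_count": -0.05,
}
INTERCEPT = -7.0  # placeholder

def brock_style_risk(features):
    """Malignancy probability as a logistic combination of predictors:
    risk = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    logit = INTERCEPT + sum(COEFS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-logit))

# 63-year-old woman, 9 mm spiculated upper-lobe nodule, 2 nodules in total.
print(brock_style_risk({"age_years": 63, "female": 1, "size_mm": 9,
                        "spiculation": 1, "upper_lobe": 1, "nodule_count": 2,
                        "family_history": 0, "emphysema": 0,
                        "part_solid": 0, "nonsolid": 0}))
```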

LUNGx Challenge: This challenge provided scans from The University of Chicago containing 37 benign and 36 size-matched malignant nodules. Algorithm areas under the ROC curve (AUCs) ranged from 0.50 to 0.68, and only three of the 11 participating algorithms performed statistically better than random guessing. In contrast, six participating radiologists achieved AUCs between 0.70 and 0.85, with three of them statistically outperforming the best algorithm. This early challenge underscored how difficult malignancy prediction was for AI at the time.

2017 Kaggle Data Science Bowl: This million-dollar challenge attracted over 2,000 teams to develop algorithms predicting whether a person would receive a lung cancer diagnosis within one year from a CT scan. An observer study of 11 radiologists (seven chest specialists) found that human experts performed only slightly better than the top three algorithms: AUC of 0.90 (95% CI: 0.85-0.94) for radiologists versus 0.86 (95% CI: 0.81-0.91) for algorithms. The top 10 winners made their code publicly available.
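
Confidence intervals like the 0.90 (0.85-0.94) above are commonly obtained by bootstrapping over cases. A minimal sketch, assuming per-case malignancy labels and model scores; the percentile bootstrap shown is one common choice, not necessarily the method used in the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(y_true, y_score, n_boot=2000, seed=0):
    """Point AUC plus a percentile-bootstrap 95% confidence interval."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample cases
        if y_true[idx].min() == y_true[idx].max():
            continue                                     # need both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), (lo, hi)
```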

Google's 2019 study: Ardila et al. published a deep learning network claiming superior performance to six radiologists on single-scan lung cancer risk assessment, with an absolute false positive reduction of 11% and an absolute false negative reduction of 5%. When multiple scans were available, the model performed on par with radiologists. However, the conclusions were criticized: validation used a subset of the training cohort (NLST) plus a small independent cohort, the radiologists used Lung-RADS (a management guideline, not a 1-year risk model), and they were not thoracic specialists. The code was not made publicly available.

TL;DR: LUNGx Challenge: best AI AUC of 0.68 versus radiologist AUCs of 0.70-0.85. Kaggle 2017: top AI AUC of 0.86 versus radiologist AUC of 0.90. Google's 2019 model claimed 11% fewer false positives and 5% fewer false negatives than radiologists on single scans, but faced criticism over study design, validation approach, and non-specialist comparators.
Pages 6-7
Three Reading Paradigms: Second Reader, Concurrent Reader, and First Reader

Rather than viewing AI as a replacement for radiologists, the authors describe three distinct paradigms for human-AI collaboration. As a second reader, AI is enabled only after the radiologist completes an initial unbiased assessment. The radiologist then reviews the AI's findings to check for missed or misinterpreted nodules. As a concurrent reader, the radiologist has immediate access to AI results while interpreting the scan. As a first reader, the AI performs initial detection and only sends flagged areas to the radiologist, enabling the shortest reading times but risking that nodules missed by AI go undetected. Commercial systems to date have only been approved for second or concurrent reader use.

Evidence for second reading: Roos et al. demonstrated that an AI algorithm detected 74% (141 of 190) of nodules, of which 18% (25 of 141) had been missed by all three independent radiologists. Conversely, 14% (27 of 190) of nodules found by at least one radiologist were missed by the software. Liang et al. examined lung cancers from the NLST that had been visible in prior scans but missed by radiologists, finding that four detection systems identified 56% to 70% of these on prior scans and 74% to 82% on subsequent scans.

Quantified complementarity: Wormanns et al. published the first study on a commercial AI system as a second reader in 2004. Individually, the AI had a sensitivity of 0.55 versus 0.51 to 0.55 for three radiologists. Double reading between two radiologists yielded a sensitivity of 0.67 to 0.68, while pairing a radiologist with AI achieved a sensitivity of 0.77 to 0.81, at the cost of a 7% increase in false positive rate. Multiple subsequent publications have confirmed these results.
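
The complementarity gain comes from the union rule of double reading: a nodule counts as detected if either reader flags it, so sensitivities pool while false positives also accumulate (the 7% cost above). A minimal sketch with hypothetical per-nodule detection flags, not study data:

```python
def paired_sensitivity(found_by_reader_a, found_by_reader_b):
    """Sensitivity of double reading under the union rule: a reference
    nodule counts as detected if either reader marks it. Inputs are
    parallel lists of flags, one entry per reference nodule."""
    found = [a or b for a, b in zip(found_by_reader_a, found_by_reader_b)]
    return sum(found) / len(found)

# Hypothetical flags for 10 reference nodules (illustrative only):
radiologist = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # 0.6 alone
ai_reader   = [1, 0, 1, 1, 0, 0, 1, 1, 1, 1]   # 0.7 alone
print(paired_sensitivity(radiologist, ai_reader))  # 0.9: union beats either
```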

Technician-plus-AI triage: Ritchie et al. tested pulmonary nodule detection by a trained technician supported by an AI algorithm. For identifying abnormal CT scans with at least one nodule of 1 mm or larger, the technician-plus-AI system achieved a sensitivity of 0.98 and a specificity of 0.98. For malignant nodules specifically, the technician-plus-AI found 93% (104 of 112) compared to 85% (95 of 112) detected by PanCan radiologists without AI. The average prescreen time was 208 seconds per scan. This approach could make screening more cost-effective and feasible in regions with radiologist shortages, mirroring the workflow used in cervical cancer Pap smear screening where trained technologists triage normal findings.

TL;DR: Radiologist-plus-AI sensitivity reached 0.77-0.81 versus 0.67-0.68 for two-radiologist double reading. A technician-plus-AI system achieved 0.98 sensitivity and 0.98 specificity, detecting 93% of malignant nodules (versus 85% by radiologists alone) in 208 seconds per scan. Commercial AI systems are currently approved only as second or concurrent readers.
Pages 7-9
What the Field Still Needs Before AI Can Take a Larger Role

The authors identify several critical gaps in current evidence. First, most AI studies use a reference standard based on radiologist consensus rather than histopathological proof. The ultimate goal is not to find all nodules but to find all lung cancers. Future studies should measure cancer detection using biopsy-confirmed malignancy or at least 2 years of follow-up imaging showing lesion stability. Unfortunately, no public datasets include a substantial number of pathologically confirmed malignant nodules. The largest public database, the NLST, lacks metadata on which nodules were biopsied.

Decision impact remains untested: No study has yet demonstrated how revealing an AI-generated malignancy risk score affects radiologist decision-making. Key unanswered questions include: Are radiologists' decisions actually altered by this additional information? When do they choose to deviate from the AI's recommendation, and how often are they right to do so? Chung et al. showed that radiologists could appropriately upgrade nodules from lower Lung-RADS categories to the urgent 4X category by recognizing malignancy signs. If AI could similarly flag cancers typically missed by radiologists, it might improve upgrade accuracy, but this has not been tested.

Overdiagnosis concerns: As with every screening program, overdiagnosis is a side-effect that requires monitoring. Although the 5-year death rate from lung cancer is very high, not all detected malignancies lead to morbidity or death. An extended NLST follow-up study reported equal lung cancer incidence in both the CT and control groups after 10 years, suggesting minimal overdiagnosis in that cohort. However, other studies have raised warnings. Currently, no AI algorithms focus on predicting histological subtype, growth rate, or metastatic potential of screening-detected nodules, and there is a lack of data for predicting whether lung cancer progression will be the cause of death.

The case for prospective RCTs: In medicine, new drugs reach the market only after one or multiple prospective multicenter RCTs (phase III studies) demonstrate benefit. The authors argue that AI algorithms should face similar scrutiny. However, RCTs for AI software are not commonly performed and are not mandatory for regulatory approval. Proving effectiveness is further complicated because integration of AI into health systems depends on many interacting factors: workflow integration, extent of information display, physician training, and the constant rate of AI improvement as training data grows. While the RCT remains the gold standard for proving causality, there is currently no consensus on its role for guiding AI deployment in healthcare.

TL;DR: Current studies use radiologist consensus rather than biopsy-confirmed cancer as the reference standard. No study has tested how AI risk scores affect radiologist decision-making. Overdiagnosis prevention lacks AI tools for predicting subtype or growth behavior. Prospective multicenter RCTs are needed but not yet mandatory for AI regulatory approval.
Pages 9-11
AI Is Ready to Assist, Not Yet Ready to Lead

The review concludes that AI performance is approaching or already on par with radiologists for the core tasks in lung cancer screening CT interpretation, including nodule detection, classification, volumetric measurement, and malignancy risk estimation. In its current state, AI is best positioned in a supportive role. Several commercial products have been cleared for use as second or concurrent readers and are ready to be adopted in screening centers. Evaluation studies testing these systems in adequately sized, multi-center datasets will provide more insight into their real-world effects on sensitivity, false positive rate, and interpretation time.

The autonomous AI horizon: The first fully autonomous AI system approved by the FDA, IDx-DR for diabetic retinopathy detection from fundus photographs, provides a model for what may eventually happen in lung screening. The 2017 Kaggle challenge showed that fully automatic algorithms incorporating both nodule detection and malignancy estimation reached promising performance (AUC of 0.86) but remained slightly inferior to expert radiologists (AUC of 0.90). Post-challenge algorithms, including Google's system, have reported superior or equivalent performance, though these claims require more rigorous independent validation.

Explainability and safety requirements: Even if an autonomous AI achieves radiologist-level performance, the authors emphasize that a "black box" system that overrules established clinical guidelines will not be readily accepted. Autonomous systems need to be explainable, highlighting all areas of interest, and should include quality assurance components. For example, IDx-DR includes a scan quality module that returns the image to the operator if it is deemed insufficient for AI analysis. Similar safeguards would be essential for any autonomous lung screening AI.
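
That quality-gate pattern can be sketched as a wrapper in front of the analysis model. Everything below is hypothetical scaffolding to illustrate the control flow, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class ScreenResult:
    status: str            # "result" or "retake"
    detail: str

def quality_gated_read(scan, quality_model, nodule_model, min_quality=0.8):
    """Run the analysis model only if a quality module clears the scan;
    otherwise hand the scan back to the operator, as IDx-DR does for
    fundus photographs. Both models here are hypothetical callables."""
    q = quality_model(scan)                  # e.g., a 0.0-1.0 quality score
    if q < min_quality:
        return ScreenResult("retake", f"image quality {q:.2f} below threshold")
    findings = nodule_model(scan)            # flagged regions + risk scores
    return ScreenResult("result", f"{len(findings)} region(s) flagged for review")
```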

The technician-plus-AI pathway: Triaging screening CTs using trained technicians aided by AI is identified as one of the most promising near-term directions for reducing costs and radiologist workload. Currently, every screening CT in the United States must be signed off by an American College of Radiology board-certified radiologist. If studies demonstrate that trained technicians can safely triage a large portion of normal scans without reducing care quality, policy changes would be needed. The ultimate cost-reduction scenario would involve fully autonomous AI algorithms performing triage, selecting only potentially abnormal CTs for radiologist review, but the requirements for implementing such autonomous algorithms should be substantially more extensive than when a trained human reader remains in the loop.

TL;DR: AI is ready for clinical use as a second or concurrent reader in lung cancer screening. Fully autonomous AI (like the FDA-approved IDx-DR for retinopathy) remains a longer-term goal for lung screening, requiring explainability, quality assurance components, and more rigorous validation. The technician-plus-AI triage model is the most promising near-term strategy for improving cost-effectiveness.
Citation: Schreuder A, Scholten ET, van Ginneken B, Jacobs C. Translational Lung Cancer Research, 2021 (open access). PMC8182724. DOI: 10.21037/tlcr-2020-lcs-06. License: CC BY-NC-ND.