A Systematic Review of AI Performance in Lung Cancer Detection on CT Thorax


Plain-English Explanations
Pages 1-2
Why Lung Cancer Screening Needs AI Assistance

Lung cancer is the leading cause of cancer death worldwide, accounting for almost 2.5 million new cases and an estimated 1.8 million deaths per year according to the World Health Organization. The overall 5-year survival rate is just 17%, ranging from about 70% for stage I disease down to less than 5% for stage IV. Because most patients are diagnosed at an advanced stage, early detection through screening is critical to improving outcomes.

Low-dose computed tomography (LDCT) is the established gold standard for lung cancer screening (LCS). The American Cancer Society recommends LDCT for individuals aged 50 to 80 who currently smoke or formerly smoked and have a 20-or-greater pack-year smoking history. The National Lung Screening Trial (NLST) demonstrated that LDCT screening reduced lung cancer mortality by 20% among current and former heavy smokers. However, LDCT suffers from a high false positive rate of up to 49.3% at baseline screening, because it also detects benign findings such as intrapulmonary lymph nodes and noncalcified granulomas.

Detecting and characterizing pulmonary nodules on CT is a laborious task. Subcentimetre lesions can be especially difficult to distinguish from normal anatomic structures like vessels and airways. If LDCT is adopted broadly as a screening tool, the resulting surge in imaging volumes risks overburdening the limited radiologist workforce. AI augmentation offers the potential to address this added workload by helping to detect and classify nodules without sacrificing diagnostic accuracy.

This systematic review was conducted to analyse and assess the diagnostic performance of existing AI models for detecting and classifying lung cancer on CT scans. The authors aimed to determine whether AI can reliably support radiologists in the interpretation of screening CTs and help overcome challenges in implementing large-scale LCS programmes.

TL;DR: Lung cancer kills 1.8 million people per year and has a 17% five-year survival rate. LDCT screening reduces mortality by 20% but has a false positive rate up to 49.3%. This review evaluates whether AI models can help radiologists handle the growing volume of screening CTs without losing diagnostic accuracy.
Pages 2-3
Search Strategy and Study Selection: The PRISMA Approach

The review followed the Cochrane Handbook for Systematic Reviews and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Six major databases were searched: MEDLINE (Ovid), Embase (Ovid), PubMed, CINAHL, Cochrane Library, and Scopus. The search covered publications from 1 January 2010 to 21 December 2022 and was limited to English-language publications.

The search strategy combined controlled vocabulary (MeSH terms) and free-text keywords using Boolean operators. Terms covered three domains: lung cancer and nodules, computed tomography, and artificial intelligence (including deep learning, machine learning, computer vision, and neural networks). Study selection was performed independently by two reviewers, each with over five years of experience in radiology research, using a two-stage process of title/abstract screening followed by full-text review.

PRISMA flow: The initial search yielded 3,234 articles. After removing 1,658 duplicates and 980 irrelevant records, 596 articles underwent title and abstract screening. Of these, 556 were excluded, leaving 40 articles for full-text review. The final selection included 14 studies for the systematic review. Exclusion criteria removed studies not published in English, studies that did not evaluate AI-based detection or classification on chest CT, studies relying solely on open-source datasets without independent test cohorts, and non-primary publication types such as case reports and guidelines.
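The flow numbers above are easy to sanity-check; a minimal arithmetic sketch using only the counts reported in the review:

```python
# Sanity check of the PRISMA flow counts reported in the review.
initial = 3234                                   # records from the initial search
duplicates = 1658                                # duplicates removed
irrelevant = 980                                 # irrelevant records removed
screened = initial - duplicates - irrelevant     # pool for title/abstract screening
excluded_at_screening = 556                      # excluded at title/abstract stage
full_text_reviewed = screened - excluded_at_screening
included = 14                                    # final systematic-review set

print(screened, full_text_reviewed, included)    # 596 40 14
```

The counts are internally consistent: 3,234 minus 2,638 removals leaves 596 for screening, and 596 minus 556 exclusions leaves the 40 full-text reviews from which 14 studies were selected.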

Why no meta-analysis: The authors did not perform a formal meta-analysis because the included studies were too diverse in terms of study designs, population groups, and outcome measures. Instead, they used descriptive summaries with range, mean, and standard deviation for each metric. Similarly, no formal risk-of-bias tool (such as QUADAS-2) was applied, which limits the ability to assess the quality of individual studies.
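The descriptive-summary approach can be sketched in a few lines. The per-study sensitivity values below are hypothetical placeholders chosen so the endpoints match the reported range; the review publishes only the aggregate range, mean, and SD, not the raw per-study list used here:

```python
# Sketch of the review's descriptive-summary approach (used instead of a
# meta-analysis): characterize a metric across studies by range, mean, and
# sample standard deviation. Input values are hypothetical placeholders.
from statistics import mean, stdev

def summarize(values):
    """Return (min, max, mean, sample SD) for a list of per-study percentages."""
    return min(values), max(values), round(mean(values), 1), round(stdev(values), 2)

sensitivities = [86.0, 92.5, 94.0, 95.2, 96.0, 96.3, 98.1]  # hypothetical
lo, hi, m, sd = summarize(sensitivities)
print(f"range {lo}-{hi}%, mean {m}%, SD {sd}")
```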

TL;DR: Six databases were searched for publications from 2010 through 2022, yielding 3,234 initial results that were narrowed to 14 studies through PRISMA screening. The studies were too heterogeneous for meta-analysis, so results were summarized descriptively using ranges, means, and standard deviations.
Pages 3-4
The 14 Included Studies: Architectures, Datasets, and Design

A total of 10,217 nodules across 14 studies were analysed. The studies were split into two subgroups: seven focused on the detection of pulmonary nodules and eight on the classification of nodules as benign or malignant; one study (Guo et al., 2020) contributed data to both subgroups, which is why the subgroup counts sum to 15. All 14 studies were retrospective, meaning they analysed previously collected CT data rather than prospectively enrolling patients.

Detection models: The seven detection studies employed a variety of convolutional neural network (CNN) architectures. Gao et al. used a ResNet50 network on 330 nodules. Katase et al. used Faster R-CNN on 115 nodules with 2 mm slice thickness. Abadia et al. tested the commercial AI-RAD Companion Chest CT system (Siemens) on 441 nodules at 1 mm slice thickness. Cui et al. used a dual-CNN with VGG-net architecture on 262 nodules. Hsu et al. evaluated the commercial ClearReadCT system on 340 nodules. Guo et al. used the DeepLN deep neural network on 766 nodules. Kozuka et al. tested the CAD InfeRead CT Lung system (also built on Faster R-CNN) on 743 nodules at 1 mm slice thickness.

Classification models: The eight classification studies used different approaches. Qiu et al. applied DenseNet to 254 ground-glass nodules. Marappan et al. used a hybrid 2D/3D DenseNet with Softmax on 195 nodules. Lv et al. developed FGP-NET and tested it on 100 nodules. Heuvelmans et al. used LCP-CNN on 2,106 nodules, one of the largest datasets in the review. Pang et al. combined DenseNet with AdaBoost on 3,940 nodules, the single largest dataset. Du et al. and Diao et al. each used unnamed in-house models on 194 and 431 nodules, respectively. Guo et al. contributed classification data from their DeepLN system as well.

CT acquisition parameters varied considerably across studies. Slice thicknesses ranged from 1 mm to 5 mm, and some studies did not report their CT parameters at all. This variability in imaging protocols is an important consideration because thinner slices generally allow better detection of small nodules, and the lack of standardization makes direct comparison across studies more difficult.

TL;DR: Fourteen retrospective studies covering 10,217 nodules were analysed, split into detection (7 studies) and classification (8 studies) subgroups. Architectures included ResNet50, Faster R-CNN, DenseNet, VGG-net, and several commercial systems. Dataset sizes ranged from 100 to 3,940 nodules.
Pages 4-6
AI Nodule Detection: Higher Sensitivity, Lower Specificity Than Radiologists

Sensitivity (finding real nodules): AI models demonstrated consistently high sensitivity across all seven detection studies, with values ranging from 86.0% to 98.1% (mean 94.0%, SD 3.99). The highest sensitivity of 98.1% was achieved by Katase et al. using Faster R-CNN. In comparison, radiologists achieved sensitivity values ranging from 68% to 76% (mean 73.3%, SD 3.11). In every study that included a radiologist comparison, the AI algorithm outperformed human readers in identifying lung nodules.

Specificity (avoiding false alarms): AI models generally performed worse than radiologists in specificity. AI specificity ranged from 77.5% to 87.0% (mean 82.6%, SD 3.91), while radiologist specificity ranged from 87.0% to 91.7% (mean 89.4%, SD 2.35). This means AI models were more prone to flagging benign findings as suspicious. The problem was particularly evident in Cui et al., where the AI model produced 359 false positives among a population of just 262 nodules, an unusually high false positive burden.

Accuracy: A few detection studies reported high accuracy values for their AI models, ranging from 85.71% (Hsu et al., ClearReadCT) to 99.02% (Guo et al., DeepLN). However, none of the studies provided radiologist accuracy values for direct comparison, so AI accuracy could not be benchmarked against human performance in this subgroup. Hsu et al. was the only study to report a radiologist AUC (81%), and no study reported an AI AUC for the detection task.

The pattern across the detection subgroup is clear: AI excels at finding nodules (high sensitivity) but tends to over-call findings that turn out to be benign (lower specificity). This trade-off is clinically significant because high false positive rates lead to unnecessary follow-up scans, invasive procedures, patient anxiety, and increased healthcare costs.
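The trade-off described above follows directly from how the two metrics are defined; a minimal sketch with hypothetical confusion-matrix counts (none of the reviewed studies' raw counts are used here):

```python
# Sensitivity/specificity trade-off, illustrated with hypothetical counts.
# A detector tuned for high sensitivity can still carry a heavy false
# positive burden if its specificity is modest.

def sensitivity(tp, fn):
    # true positive rate: real nodules found / all real nodules
    return tp / (tp + fn)

def specificity(tn, fp):
    # true negative rate: benign findings correctly cleared / all benign findings
    return tn / (tn + fp)

tp, fn = 94, 6      # hypothetical: misses only 6 of 100 real nodules
tn, fp = 413, 87    # hypothetical: but flags 87 of 500 benign findings

print(sensitivity(tp, fn))   # 0.94
print(specificity(tn, fp))   # 0.826
```

With these placeholder counts, sensitivity lands near the subgroup mean for AI (94.0%) while specificity sits near the AI mean of 82.6%: the 87 false alarms are exactly the kind of burden the review flags as clinically significant.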

TL;DR: AI detection sensitivity (86.0% to 98.1%, mean 94.0%) consistently beat radiologists (68% to 76%, mean 73.3%). However, AI specificity (77.5% to 87.0%) lagged behind radiologists (87.0% to 91.7%), meaning more false positives. One study produced 359 false positives from just 262 nodules.
Pages 6-7
AI Nodule Classification: Distinguishing Benign From Malignant

Sensitivity: In the classification subgroup, AI sensitivity ranged from 60.58% to 93.3% (mean 80.3%, SD 13.7), while radiologist sensitivity ranged from 76.27% to 86.7% (mean 83.8%, SD 5.32). Unlike the detection subgroup where AI clearly outperformed radiologists, the classification results were mixed. The lowest AI sensitivity of 60.58% came from DenseNet (Qiu et al.), which was substantially below the radiologist benchmark of 76.27% in the same study. However, FGP-NET (Lv et al.) achieved the highest AI sensitivity of 93.3%, outperforming the radiologist value of 86.7%.

Specificity: AI specificity for classification ranged from 64.0% to 95.93% (mean 80.0%, SD 13.0), compared to radiologist specificity of 61.67% to 84.0% (mean 70.3%, SD 9.79). The highest AI specificity of 95.93% was achieved by the DeepLN system (Guo et al.). AI models generally outperformed radiologists on this metric for classification, which is the opposite of the pattern seen in the detection subgroup.

Accuracy and AUC: AI accuracy was reported in five classification studies with values ranging from 64.96% to 92.46% (mean 84.8%, SD 10.0). Radiologist accuracy was available in only two studies, ranging from 73.31% to 85.57% (mean 79.5%, SD 6.15). Notably, DenseNet (Qiu et al.) had the lowest AI accuracy at 64.96%, which was below the radiologist accuracy of 73.31% in the same study. For AUC, AI values ranged from 76.8% to 94.5% (mean 85.4%, SD 8.23), with the LCP-CNN model (Heuvelmans et al.) achieving the highest AUC of 94.5% (95% CI: 92.6% to 96.1%). No studies reported radiologist AUC for classification.
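The AUC values quoted above have a simple probabilistic reading: the chance that a randomly chosen malignant nodule receives a higher model score than a randomly chosen benign one. A minimal sketch using the rank-based (Mann-Whitney) formulation, with hypothetical scores:

```python
# AUC as the probability that a malignant nodule outscores a benign one
# (Mann-Whitney U formulation). All scores below are hypothetical.

def auc(pos_scores, neg_scores):
    # Count pairwise wins for the positive class; ties count half.
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

malignant = [0.9, 0.8, 0.75, 0.6]   # hypothetical model scores, malignant nodules
benign    = [0.7, 0.4, 0.3, 0.2]    # hypothetical model scores, benign nodules

print(auc(malignant, benign))       # 0.9375
```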

The classification subgroup shows wider variability in AI performance compared to detection, with standard deviations roughly three to four times larger. This suggests that the task of distinguishing benign from malignant nodules remains more challenging for AI, and performance depends heavily on the specific architecture and dataset used.

TL;DR: Classification results were mixed. AI sensitivity ranged widely (60.58% to 93.3%) with higher variability than radiologists (76.27% to 86.7%). However, AI generally achieved better specificity (up to 95.93%) and accuracy (up to 92.46%) than human readers. LCP-CNN achieved the highest AUC at 94.5%.
Pages 7-8
Putting AI Performance in Clinical Context: NLST, LDCT, and Screening Realities

The NLST demonstrated that LDCT screening reduces lung cancer mortality by 20%, but the accurate interpretation of these scans remains challenging. CT images are three-dimensional and contain enormous amounts of information, and small nodules under one centimetre in diameter can easily be confused with normal vessels or airways. Research comparing experienced radiologists with AI algorithms has shown that AI can identify nodules that even experienced readers overlook, highlighting the potential value of AI as either a concurrent reader (working alongside the radiologist in real time) or a second reader (reviewing cases after the initial human reading).

An important caveat is that most of the included studies used standard diagnostic CT rather than LDCT. This distinction matters because LDCT uses lower radiation doses, which produces noisier images with different technical characteristics. AI models trained or tested on standard CT may not perform identically on LDCT screening data. The generalizability of these findings to actual screening programmes therefore requires further validation.

Studies on LCS programmes in Asian populations have found that including never-smokers in screening programmes revealed more stage I lung cancer diagnoses among never-smokers than among ever-smokers. In the United States, implementation of LDCT screening among smokers increased the proportion of stage I diagnoses while reducing stage IV diagnoses, with overall survival rates improving. These findings reinforce the value of broad screening, and AI tools that can help manage the resulting volume of scans could play a pivotal role in making such programmes feasible.

TL;DR: LDCT screening reduces lung cancer mortality by 20%, but most studies in this review used standard CT rather than LDCT, limiting direct applicability to screening programmes. AI shows promise as a concurrent or second reader to catch nodules that radiologists miss.
Pages 8-10
Limitations of AI Models and the Review Itself

Architectural variability and preprocessing: The included AI models exhibited substantial architectural variability. CNNs such as ResNet, DenseNet, and Faster R-CNN were frequently used for detection due to their strength in spatial feature extraction from volumetric CT data. Classification models often employed ensemble methods (for example, DenseNet combined with AdaBoost) or hybrid 2D/3D CNNs to capture the morphological complexity of nodules. However, few studies reported their preprocessing workflows, such as image normalization or segmentation steps, which are known to significantly affect model performance. The lack of standardized evaluation protocols and the inconsistent use of radiologist comparators make it difficult to benchmark models against each other.

Black-box decision-making: A significant limitation of current AI models is their lack of transparency. When an AI system flags a lesion, it often cannot explain why it reached that conclusion. This opacity makes it difficult for radiologists to validate AI findings, particularly in high-stakes clinical scenarios. Explainable AI (XAI) methods, such as activation maximization and saliency mapping, could help address this by revealing which image features the model focuses on. However, XAI still requires significant development before it can be consistently applied in clinical practice.

False positive burden: Models optimized for high sensitivity frequently flag benign or irrelevant findings as potential lesions. This leads to unnecessary follow-up tests, increased healthcare costs, and added patient stress. The example from Cui et al. (359 false positives from 262 nodules) illustrates how an AI system with excellent sensitivity can still create substantial clinical burden if its specificity is insufficient.

Review-level limitations: The review itself had several constraints. The heterogeneity of study designs prevented formal meta-analysis. Restriction to English-language publications may have excluded relevant non-English studies. The exclusion of studies using only open-source datasets, while reducing bias, may have omitted well-performing algorithms. The relatively small number of included studies (n = 14) reflects strict inclusion criteria but limits generalizability. Finally, the search ended in December 2022, creating a gap of over two years before publication, during which newer AI models and validation studies may have emerged.

TL;DR: Key limitations include lack of preprocessing standardization, black-box AI models that cannot explain their decisions, high false positive rates (up to 359 false positives from 262 nodules), and the inability to perform meta-analysis due to study heterogeneity. Only 14 studies met inclusion criteria, and the search ended in December 2022.
Pages 9-11
Socioeconomic and Environmental Determinants of Lung Cancer Risk

The review highlights that socioeconomic status (SES) is a well-documented determinant of lung cancer incidence and outcomes. Individuals in lower SES communities face disproportionate disease burden due to structural inequalities: limited access to quality healthcare, fewer preventive services, and reduced access to early detection programmes, particularly in rural and underserved urban areas. Financial strain can discourage people from seeking medical care, and lower SES groups are less likely to take medical leave due to job insecurity.

Occupational and environmental exposures: Populations with low SES tend to reside near hazardous industrial sites such as waste disposal facilities, power plants, and Superfund locations, which produce chronic exposure to carcinogens. Workers in manual labor-intensive jobs, including construction, mining, painting, and chimney sweeping, face exposure to asbestos, diesel exhaust, silica, and coal tar. These occupational exposures elevate lung cancer risk in both smokers and non-smokers, though the risk is significantly amplified in individuals who also smoke.

These findings are relevant to AI-based screening because even if AI tools reduce radiology bottlenecks, disadvantaged populations may still face barriers to accessing LDCT screening itself. The authors argue that the cumulative impact of social and environmental factors in heightening lung cancer risk means that technology alone cannot solve the screening gap. Broader public health interventions are needed alongside AI deployment to ensure equitable access to early detection.

TL;DR: Lower socioeconomic status is linked to higher lung cancer incidence through limited healthcare access, occupational carcinogen exposure (asbestos, diesel exhaust, silica), and proximity to industrial pollution. AI can reduce radiology bottlenecks, but barriers to screening access persist for disadvantaged populations.
Pages 10-12
Next Steps: From Research Models to Real-World Screening Tools

The WHO supports early detection programmes and encourages countries to implement screening for high-risk populations. While LDCT is recommended, its implementation faces challenges in low- and middle-income countries (LMICs) due to infrastructure limitations, workforce shortages, and financial constraints. AI models that can partially automate nodule detection and malignancy classification offer a potential solution to scale up screening coverage without proportionally increasing the radiologist workload. However, the WHO has cautioned against premature deployment of AI without proper clinical validation, impact assessments, and regulatory oversight.

Clinical validation needs: The review emphasizes that continuous refinement and testing are crucial to move AI from research settings to widespread clinical use. Key gaps include the need for large, diverse training datasets that represent different populations and imaging protocols, improved model interpretability through explainable AI methods, and consistent external validation across multiple institutions. The authors call for proper standardization of methodology and interdisciplinary collaboration to ensure ethical and effective integration of AI into imaging practice.

Smoking cessation synergy: An interesting finding from the literature is that smoking cessation efforts tend to be more successful when a nodule is detected on a CT scan, regardless of whether the nodule turns out to be benign or malignant. Since the majority of people enrolled in LCS programmes are current smokers, integrating screening with smoking cessation programmes could yield a dual benefit: earlier cancer detection and reduced smoking rates. AI that increases screening throughput could therefore amplify this secondary prevention benefit.

The authors conclude that AI models for pulmonary lesion detection and classification on CT have the potential to augment CT thorax interpretation while maintaining diagnostic accuracy. They position AI as a tool to help overcome challenges in implementing lung cancer screening programmes at scale, provided that the models are properly validated and integrated into clinical workflows with appropriate oversight.

TL;DR: Future priorities include validating AI on diverse populations and LDCT protocols, developing explainable AI for clinical trust, and scaling screening in low-resource settings. An unexpected benefit: nodule detection on CT improves smoking cessation rates, and AI-driven screening throughput could amplify this effect.
Citation: Cheo HM, Ong CYG, Ting Y. Healthcare, 2025. Open access (CC BY). Available at PMC12250385. DOI: 10.3390/healthcare13131510.