Annotation-Efficient Deep Learning Model for Pancreatic Cancer Diagnosis and Classification Using CT Images: A Retrospective Diagnostic Study

Cancers (PMC), 2023

Plain-English Explanations
Pages 1-2
Why Pancreatic Cancer Detection on CT Remains So Difficult

Pancreatic cancer (PC) is among the deadliest cancers, with a 5-year survival rate of approximately 11%. The primary reason for this poor prognosis is that most patients are diagnosed at an advanced stage with metastatic disease. Only about 20% of cases are eligible for surgical resection at the time of diagnosis, and curative surgery remains the only realistic path to long-term survival. These statistics underscore the urgent clinical need for earlier and more accurate detection tools.

CT imaging challenges: Computed tomography (CT) is the most widely used modality for detecting and staging pancreatic carcinoma, with reported sensitivity ranging from 76% to 96%. However, PC is characterized by abundant fibrous stroma and hypovascularity, which cause poor contrast enhancement relative to the surrounding pancreatic parenchyma. This leads to ill-defined masses with indistinct borders on CT, making accurate detection heavily dependent on the radiologist's experience and the imaging protocol. Tumors smaller than 2 cm are particularly problematic, with approximately 40% going undetected at diagnosis and reported sensitivity as low as 58-77%.

The annotation bottleneck: Deep learning (DL) systems require large, high-quality annotated training datasets to generalize effectively across different centers, CT equipment, and patient populations. But for pancreatic cancer, obtaining such annotations is exceptionally difficult. The irregular contours and ill-defined margins of pancreatic tumors make even expert radiologists disagree on precise boundaries. Significant inter-rater variation means that annotation quality cannot be guaranteed, creating a fundamental barrier to building reliable DL-based diagnostic tools.

This study, conducted by researchers from Ewha Womans University and the National Cancer Center (NCC) of Korea, proposes a novel self-supervised learning algorithm called pseudo-lesion segmentation (PS) that eliminates the need for manual expert annotation while still boosting DL model performance for PC classification on CT images.

TL;DR: Pancreatic cancer has an 11% five-year survival rate. CT sensitivity is 76-96% overall but only 58-77% for small tumors. This study introduces a self-supervised "pseudo-lesion segmentation" method that removes the need for radiologist-annotated training data, addressing a key bottleneck in DL-based PC detection.
Pages 2-3
Self-Supervised Learning and the Problem with ImageNet Pre-training

A common strategy in natural image classification is to use pre-trained weights from ImageNet, a large dataset of everyday photographs. However, transferring these learned representations to medical imaging is suboptimal. The visual features learned from color photographs of cats, cars, and landscapes differ fundamentally from the grayscale, high-resolution textures found in CT scans. The mismatch in feature distribution, spatial resolution, and output label structure means that ImageNet pre-training provides limited benefit for medical image classification tasks like pancreatic cancer detection.

Self-supervised learning approaches: Self-supervised learning bridges the gap between supervised and unsupervised methods by creating labels from the data itself through "pretext tasks." These tasks include jigsaw puzzles (rearranging image patches), colorization (predicting color from grayscale), and rotation prediction. By solving these pretext tasks, models learn semantically useful feature representations from unlabeled domain-specific images that can then be transferred to downstream tasks like classification and segmentation.
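To make the pretext-task idea concrete, here is a minimal numpy sketch of the rotation-prediction task mentioned above: labels are manufactured from the data itself, so no human annotation is involved. The function name and toy data are illustrative, not from the paper.

```python
import numpy as np

def make_rotation_pretext_batch(images, rng):
    """Create self-supervised (image, label) pairs by rotating each image
    0, 90, 180, or 270 degrees; the label is the rotation index.
    The label comes from the data itself -- no annotator required."""
    rotated, labels = [], []
    for img in images:
        k = rng.integers(0, 4)            # 0..3 quarter-turns
        rotated.append(np.rot90(img, k))
        labels.append(k)
    return np.stack(rotated), np.array(labels)

rng = np.random.default_rng(0)
images = rng.random((8, 64, 64))          # toy grayscale "CT slices"
x, y = make_rotation_pretext_batch(images, rng)
```

A network trained to predict `y` from `x` must learn orientation-sensitive anatomical features, which is exactly the kind of representation that transfers to downstream classification.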

Limitations of prior work: Li et al. previously demonstrated that tumor classification performance could be improved by pre-training on a brain tumor segmentation pretext task. However, this approach depended on accurately segmented tumor regions annotated by radiologists. When tumors were not precisely segmented, the learned features did not reliably improve classification accuracy. Given the high inter-rater variability in pancreatic lesion annotation, this dependency on manual segmentation is a critical weakness.

The pseudo-lesion concept: Previous attempts at pseudo-lesion generation used simple geometric shapes, which poorly simulated the complex morphology of real tumors. More realistic synthetic tumors had been created for model observer studies in breast, liver, and lung imaging, but these required organ-specific background textures and noise characteristics that limit reproducibility across different organs and scanner systems. The authors' approach uses randomly combined simple shapes to create atypical, complex forms that approximate real tumor appearances without needing organ-specific customization.

TL;DR: ImageNet pre-training is a poor fit for medical CT images. Prior self-supervised methods required radiologist-annotated tumor segmentations, limiting their applicability. This study generates pseudo-lesions from random shape combinations, bypassing the need for any human annotation while still learning useful visual representations.
Pages 3-5
Dataset Construction, PS Algorithm, and DL Model Training

Patient cohort: The study analyzed CT images from 4,287 patients diagnosed with PC, collected between June 2004 and December 2020 from the NCC and seven general tertiary hospitals across South Korea. PC was defined as histologically or cytologically confirmed pancreatic adenocarcinoma. Benign cases included pancreatic cystic lesions and acute or chronic pancreatitis with a 1-year follow-up period, plus normal pancreas cases identified from health checkup participants with negative or unremarkable radiologist reports. All CT scans were reviewed by two experienced radiologists with more than 5 years of pancreatic imaging experience.

Data splits: The 4,287 patients were randomly divided into a training set of 3,010 patients (mean age 58.9 years, SD 13.5) and an internal validation set of 1,277 patients (mean age 58.9 years, SD 13.4). For cross-ethnicity external validation, CT images from 361 patients were drawn from two US-based open-source datasets: the Medical Segmentation Decathlon from Memorial Sloan Kettering Cancer Center (281 PC patients) and the TCIA dataset from the NIH Clinical Center (80 normal pancreas patients). CT images were obtained in either portal venous phase (70 seconds after contrast injection) or pancreatic phase (40 seconds after contrast injection).

The PS self-supervised learning algorithm: PS works in three steps. First, the system automatically generates synthetic annotations called "pseudo-lesions," which are atypical shapes created by randomly combining multiple simple geometric forms to mimic real tumor morphology. These pseudo-lesions are then inserted into pancreatic CT scans. Second, a DL network is trained to segment (identify the boundaries of) these pseudo-lesion regions, learning pancreas-related and tumor-related visual representations in the process. Third, the pre-trained network is fine-tuned for the actual PC classification task using supervised learning. Because the pseudo-lesion annotations are generated automatically, their correctness is guaranteed, unlike human annotations which suffer from inter-rater variability.
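The first step above, generating and inserting pseudo-lesions, can be sketched as follows. This is a simplified illustration under assumptions of our own (random ellipses unioned into one mask, a simple intensity shift for insertion); the paper's exact shape-combination and blending procedure may differ.

```python
import numpy as np

def make_pseudo_lesion_mask(size, n_shapes, rng):
    """Build an atypical binary mask by unioning several random ellipses,
    approximating the irregular morphology of real pancreatic tumors."""
    yy, xx = np.mgrid[0:size, 0:size]
    mask = np.zeros((size, size), dtype=bool)
    for _ in range(n_shapes):
        cy, cx = rng.integers(size // 4, 3 * size // 4, size=2)  # center
        ry, rx = rng.integers(3, size // 6, size=2)              # radii
        mask |= ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
    return mask

def insert_pseudo_lesion(ct_slice, mask, intensity_shift):
    """Insert the pseudo-lesion by shifting intensities inside the mask.
    The (slice, mask) pair is a free, perfectly labeled segmentation
    example -- no inter-rater variability by construction."""
    out = ct_slice.copy()
    out[mask] += intensity_shift
    return out

rng = np.random.default_rng(42)
ct = rng.normal(0.5, 0.1, (64, 64))        # toy CT slice
mask = make_pseudo_lesion_mask(64, n_shapes=4, rng=rng)
augmented = insert_pseudo_lesion(ct, mask, intensity_shift=-0.2)
```

A segmentation network pre-trained on `(augmented, mask)` pairs is then fine-tuned on the real PC classification labels.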

Model selection: The authors tested multiple state-of-the-art architectures. Among CNN-based models, ShuffleNet V2 achieved the highest baseline accuracy at 93.6% (95% CI: 92.1-94.8%). Among transformer-based models, Pyramid Vision Transformer (PVT) led with 90.6% (95% CI: 88.8-92.1%). These two were selected as the baseline architectures for PS integration. Other tested CNN models included ResNet-101 (90.2%), ResNeXt-101 (83.5%), and ResNeSt (84.3%). Transformer alternatives included MiT (85.4%) and PiT (82.8%).

TL;DR: 4,287 patients from South Korea (3,010 training, 1,277 internal validation) plus 361 external patients from the US. The PS algorithm generates synthetic tumor-like shapes, inserts them into CT scans, trains a network to segment them, then fine-tunes for PC classification. ShuffleNet V2 (CNN) and PVT (transformer) were selected as baseline models based on top performance among 7 architectures tested.
Pages 5-7
PS Boosts Both CNN and Transformer Performance on Korean Dataset

CNN-based results (ShuffleNet V2): Adding PS to the CNN-based model improved classification accuracy from 93.6% (95% CI: 92.1-94.8%) to 94.3% (95% CI: 92.8-95.4%). Sensitivity rose from 90.6% to 92.5%, specificity from 95.5% to 95.8%, and AUC from 0.93 to 0.94. While these improvements appear modest in absolute terms, they represent meaningful gains at the high end of performance where incremental improvement is difficult to achieve.

Transformer-based results (PVT): The impact of PS was dramatically larger on the transformer architecture. PVT with PS achieved 95.7% accuracy (95% CI: 94.5-96.7%), a 5.1 percentage point jump over PVT without PS (90.6%). Sensitivity surged from 97.4% to 99.3% (95% CI: 98.4-99.7%), specificity improved from 87.5% to 90.7%, precision increased by 15.4 percentage points (from 78.3% to 93.7%), F1 score rose from 0.83 to 0.98, and AUC climbed from 0.88 to 0.95. The transformer with PS outperformed the CNN with PS across nearly every metric.
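All of the reported metrics derive from the binary confusion matrix. A quick sketch of the standard definitions, using illustrative counts rather than the study's actual data:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # a.k.a. recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, accuracy=accuracy, f1=f1)

m = classification_metrics(tp=90, fp=5, tn=95, fn=10)  # illustrative counts
```

Note how a large precision gain (fewer false positives) feeds directly into F1, which is why the transformer's 15.4-point precision jump produces such a large F1 improvement.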

Grad-CAM visualization: The authors used gradient-weighted class activation maps (Grad-CAM) to visualize which regions of the CT images the DL models focused on when making predictions. Models incorporating PS showed heat maps more closely aligned with actual tumor locations for PC cases and with the pancreatic region for normal cases. In contrast, models without PS often highlighted irrelevant areas. This visual evidence suggests that PS helps the model learn more anatomically meaningful representations rather than relying on spurious correlations in the image data.
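The core Grad-CAM computation behind those heat maps is compact: each feature map is weighted by its global-average-pooled gradient, the weighted maps are summed, and negative values are clipped. The sketch below uses random arrays as placeholders for real network activations and gradients.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map from one convolutional layer.
    feature_maps, gradients: arrays of shape (channels, H, W)."""
    weights = gradients.mean(axis=(1, 2))               # alpha_k: pooled gradients
    cam = np.tensordot(weights, feature_maps, axes=1)   # weighted sum over channels
    cam = np.maximum(cam, 0.0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                                # normalize to [0, 1]
    return cam

rng = np.random.default_rng(1)
feats = rng.random((8, 14, 14))        # placeholder layer activations
grads = rng.normal(size=(8, 14, 14))   # placeholder gradients w.r.t. activations
heatmap = grad_cam(feats, grads)
```

In practice the heat map is upsampled to the CT slice resolution and overlaid; alignment with the true tumor location is the qualitative check the authors report.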

These results demonstrate that PS self-supervised learning improves all evaluation metrics on both CNN-based and transformer-based architectures, making the DL models' predictions more closely aligned with expert radiologist labeling (ground truth).

TL;DR: PVT + PS achieved 95.7% accuracy, 99.3% sensitivity, and 0.95 AUC on internal validation. The transformer benefited most from PS, with a 5.1% accuracy gain and 15.4% precision gain. Grad-CAM heat maps confirmed that PS-trained models focused on actual tumor regions rather than irrelevant areas.
Pages 7-9
Cross-Ethnicity Generalization to US Datasets

A practical DL model must generalize to unseen datasets from different ethnic groups and institutions. The external validation set combined CT images from the Memorial Sloan Kettering Cancer Center (281 PC patients) and the NIH TCIA dataset (80 normal pancreas patients), both from the United States. This represents a significant domain shift from the Korean training data in terms of patient ethnicity, scanner hardware, and institutional imaging protocols.

CNN-based external results: ShuffleNet V2 with PS achieved 82.5% accuracy (95% CI: 78.3-86.1%), 81.7% sensitivity (95% CI: 77.3-85.4%), 100.0% specificity (95% CI: 81.7-100.0%), 100.0% precision (95% CI: 98.6-100.0%), F1 score of 0.90, and AUC of 0.61. This outperformed the CNN without PS, which had 80.9% accuracy, 80.3% sensitivity, and 0.57 AUC.

Transformer-based external results: PVT with PS achieved 87.8% accuracy (95% CI: 84.0-90.8%), 86.5% sensitivity (95% CI: 82.3-89.8%), 100.0% specificity (95% CI: 90.4-100.0%), F1 score of 0.93, and AUC of 0.80. Compared to PVT without PS (83.1% accuracy, 82.3% sensitivity, 0.62 AUC), PS boosted accuracy by 4.7 percentage points, sensitivity by 4.2 points, specificity by 4.8 points, and AUC by 0.18.

Performance gap analysis: The lower accuracy and sensitivity on external data compared to internal validation is expected and attributable to differences in race/ethnicity and scanner diversity. Pancreatic fat content, a known factor that influences CT appearance, varies across ethnic groups. Despite this domain shift, the external performance showed only a modest decrease compared to prior published DL algorithms, and PS consistently improved generalization across both architectures. The authors note that CT images with multicenter technical variations from a large patient cohort reflect real clinical practice, suggesting the PS model has practical generalizability potential.

TL;DR: On US-based external data (361 patients), PVT + PS achieved 87.8% accuracy, 86.5% sensitivity, and 0.80 AUC, outperforming the baseline by 4.7% in accuracy and 0.18 in AUC. PS improved cross-ethnicity generalization for both CNN and transformer architectures despite significant domain shift from the Korean training data.
Pages 9-10
PS Improves Detection of T1 and T2 Stage Pancreatic Cancer

Detecting early stage pancreatic cancer is critically important because patients with T1/T2 stage disease have better prognosis if identified promptly for surgical intervention. However, tumors smaller than 2 cm are frequently unremarkable on CT, and approximately 40% are missed at diagnosis. The authors evaluated their DL models specifically on early stage cancer cases to determine whether PS could help close this diagnostic gap.

Stage T1 results: For the smallest tumors (T1), the CNN-based model with PS achieved 54.0% accuracy (95% CI: 44.8-57.8%), compared to 51.3% without PS. The transformer-based PVT with PS reached 55.3% accuracy (95% CI: 48.8-61.8%), up from 50.4% without PS. While these absolute numbers are modest, they represent performance comparable to reported radiologist sensitivity for small pancreatic lesions (58-77%), and PS consistently improved detection over baseline models.

Stage T2 results: For T2 tumors, improvements were more substantial. The CNN with PS achieved 76.9% accuracy (95% CI: 74.6-79.0%) versus 68.4% without PS, a gain of 8.5 percentage points. The transformer with PS achieved 75.2% (95% CI: 72.7-77.6%) versus 67.1% without PS, an 8.1 percentage point improvement. These results are particularly encouraging because T2-stage detection at this accuracy level could meaningfully reduce the rate of missed diagnoses in clinical practice.

Grad-CAM analysis of early stage cases confirmed that models with PS more accurately focused on tumor regions compared to models without PS. The authors argue that these results suggest the DL model with PS can reduce overlooked or missed diagnoses of early stage PC, potentially improving patient outcomes by enabling timely surgical intervention.

TL;DR: For T1 tumors, PS improved accuracy to 54.0% (CNN) and 55.3% (PVT), comparable to radiologist sensitivity (58-77%). For T2 tumors, PS boosted accuracy by roughly 8 percentage points for both architectures, reaching 76.9% (CNN) and 75.2% (PVT). These gains could reduce missed early stage diagnoses.
Page 10
PS Delivers Dramatic Gains When Training Data Is Scarce

One of the most compelling findings of this study is the impact of PS on small training datasets. The authors randomly sampled 10%, 25%, 50%, and 75% of the full training set and compared PVT performance with and without PS at each level. This experiment directly addresses a common real-world constraint: many institutions lack the large annotated datasets that DL models typically require.
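A patient-level subsampling protocol like the one described can be sketched as below. This is illustrative (the paper does not publish its sampling code); the key design choice is sampling whole patients rather than individual slices, so no patient's images leak across subsets.

```python
import random

def subsample_patients(patient_ids, fraction, seed):
    """Randomly sample a fraction of patients (not individual slices),
    keeping all images from one patient together in the subset."""
    rng = random.Random(seed)
    k = max(1, round(len(patient_ids) * fraction))
    return rng.sample(patient_ids, k)

train_ids = list(range(3010))   # stand-in IDs for the full training cohort
subsets = {f: subsample_patients(train_ids, f, seed=0)
           for f in (0.10, 0.25, 0.50, 0.75)}
```

Each subset would then be used to train PVT with and without PS pre-training, holding the validation set fixed.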

The 10% dataset result: When trained on only 10% of the data (approximately 301 patients), implementing PS increased prediction accuracy by 20.5% and sensitivity by 37.0%. These are not marginal improvements. They represent the difference between a model that is clinically useless and one that provides meaningful diagnostic support. The result suggests that PS effectively compensates for the lack of labeled data by providing the model with a strong prior understanding of pancreas-related visual features through the pseudo-lesion segmentation pretext task.

Scaling behavior: The performance gap between models with and without PS narrowed as the training dataset grew larger. At 25% of the dataset, PS still provided substantial gains. At 50% and above, the benefits were smaller but still positive. This pattern is consistent with self-supervised learning theory: pre-learned representations matter most when labeled data is scarce, and their relative contribution diminishes as more supervised data becomes available.

These findings have important practical implications. Institutions with limited pancreatic cancer case volumes or insufficient resources for large-scale annotation efforts could still build effective DL models by leveraging PS self-supervised learning. This lowers the barrier to deploying AI-based pancreatic cancer detection in resource-constrained settings.

TL;DR: With only 10% of training data, PS boosted accuracy by 20.5% and sensitivity by 37.0%. The benefit decreased as more data became available but remained positive at all levels. This means institutions with small datasets can still build effective PC detection models using PS pre-training.
Pages 11-12
Single-Country Training, Missing Radiologist Benchmarks, and Next Steps

No direct radiologist comparison: The study did not include cancer prediction results from radiologists of varying experience levels. Without this comparison, the model's ability to reduce the number of overlooked lesions in real clinical practice cannot be fully substantiated. A prospective study comparing the DL model's performance against junior and senior radiologists would provide much stronger evidence for clinical utility.

Limited ethnic diversity in training: All training data came from seven tertiary hospitals in South Korea, resulting in an entirely Asian patient cohort. While the external validation on US datasets demonstrated some cross-ethnicity generalizability, the performance drop (from 95.7% internal accuracy to 87.8% external accuracy for PVT + PS) highlights the limitations of single-country training. Pancreatic fat content and other anatomical characteristics vary by ethnicity, and the model would benefit from more diverse training data.

Retrospective design: This was a retrospective diagnostic study, meaning all data came from patients who had already been diagnosed with or cleared of pancreatic cancer. A prospective study using the model in real-time clinical workflow would be necessary to validate its practical impact on diagnostic accuracy and patient outcomes. The study also relied on portal venous and pancreatic phase CT images, and performance on other imaging protocols remains unknown.

Future directions: The authors plan to develop methods that increase the robustness of the model for external datasets with different distributions in terms of scanner settings and patient demographics. Expanding training to include multi-ethnic, multi-center data would be a natural next step. Additionally, evaluating PS self-supervised learning in prospective clinical trials, and potentially extending the technique to other cancer types where annotation is similarly challenging, represent promising avenues for future work.

TL;DR: Key limitations include single-country (South Korea) training data, no head-to-head comparison with radiologists, and retrospective-only design. External accuracy dropped from 95.7% to 87.8% due to ethnic and scanner differences. Future work will focus on multi-ethnic training data and prospective clinical validation.
Citation: Viriyasaranon T, Chun JW, Koh YH, et al. Open Access, 2023. Available at: PMC10340780. DOI: 10.3390/cancers15133392. License: CC BY.