Development of a Deep Learning Model to Assist With Diagnosis of Hepatocellular Carcinoma


Plain-English Explanations
Pages 1-2
Why Pathological Diagnosis of HCC Needs AI Assistance

Hepatocellular carcinoma (HCC) ranks fifth in incidence and second in cancer mortality among males worldwide. Accurate pathological diagnosis is essential because it directly guides treatment decisions and affects overall survival. Key pathological features such as microvascular invasion (MVI) and microsatellite nodules are critical prognostic indicators, but identifying them is time-consuming and depends heavily on the subjective experience of individual pathologists, which varies substantially across practitioners.

Deep learning has already demonstrated breakthroughs in medical imaging, including diabetic retinopathy detection, gastric cancer gene signature prediction, and AI-based identification of cancers of unknown primary origin. In pathology specifically, convolutional neural networks (CNNs) have shown effectiveness for tumor detection in stomach, lung, breast lymph node metastasis, and prostate core needle biopsies. However, prior CNN-based HCC work focused mainly on radiology images (CT, ultrasound, MRI) rather than whole-slide histopathological images (WSIs). Few AI studies have tackled large-sample WSI datasets for HCC.

The core challenge: AI-based pathology requires large numbers of accurate annotations, but manual annotation of histopathological images is extremely time-consuming and inherently imprecise. Larger-scale patches capture more complete cellular features but also introduce more "noisy" patches containing fibrous stroma, blood vessels, or sporadic cancer cells that are difficult to fully mark. These conditions make annotations both scarce and noisy, which is detrimental to machine learning model performance.

This study proposes a noise-specific deep learning model trained on H&E-stained WSIs from 592 HCC patients. The model employs patch screening and dynamic label smoothing to handle noisy annotations, and was validated on an independent cohort of 455 cases plus an external set of 157 slides from The Cancer Genome Atlas (TCGA).

TL;DR: HCC is the fifth most common cancer and second deadliest in males. This study developed a noise-tolerant deep learning model using WSIs from 592 HCC patients, tested on 455 internal cases and 157 TCGA slides, to overcome the challenge of imprecise pathological annotations.
Pages 2-3
592 Patients, Multi-Scale Patches, and the Noisy Label Problem

The study enrolled 592 HCC patients who underwent surgical resection at The First Affiliated Hospital, College of Medicine, Zhejiang University between 2015 and 2020. Patients who had received prior radiotherapy or chemotherapy were excluded, as were those with intrahepatic cholangiocarcinoma or mixed hepatocellular-cholangiocarcinoma. Digital slides were produced at 40x magnification using the 3DHISTECH P250 FLASH scanner, with one representative H&E-stained slide per case containing both HCC tissue and adjacent liver tissue.

Annotation approach: An expert liver pathologist made rough annotations of tumor regions, with 20 slides receiving elaborate pixel-level annotation for detailed evaluation. Over 400 of the 592 cases fell into grade 2 or grade 3 on the Edmondson-Steiner grading system. To address the imbalanced distribution of tumor differentiation grades, the authors selected 137 cases for training, with the remaining 455 slides split into 211 for validation and 244 for testing. An additional 157 TCGA slides served as an external test set.

Multi-scale patch extraction: Hundreds of central points, separated by a minimum distance, were selected on each slide. For each point, three patches were cropped at 5x, 20x, and 50x magnification, then resized to 448 × 448 pixels. This multi-scale approach captures both macro-level structural patterns (at 5x) and micro-level cellular morphology (at 50x). Patch labels were assigned based on each patch's position relative to the annotated tumor region. All patches were pre-processed with stain normalization before training.
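As a rough illustration, the multi-scale cropping can be sketched in NumPy: concentric windows of increasing field of view are cut around each central point and pooled down to a common size. The field-of-view multipliers, the block-average "resize," and all names here are assumptions for illustration, not the authors' code.

```python
import numpy as np

def crop_multiscale(slide, center, base=448, scales=(8, 2, 1)):
    """Crop three concentric fields of view around `center` and pool each
    down to base x base, mimicking 5x / 20x / 50x patches. The `scales`
    multipliers are assumed values, not taken from the paper."""
    cy, cx = center
    patches = []
    for s in scales:
        half = base * s // 2
        window = slide[cy - half:cy + half, cx - half:cx + half]
        # Block-average down to base x base (a stand-in for proper resizing).
        patch = window.reshape(base, s, base, s).mean(axis=(1, 3))
        patches.append(patch)
    return patches
```

Each call yields three 448 × 448 patches for the same tissue location, which is what lets the model see both macro structure and cell-level detail.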

The training set ultimately comprised 22,800 patches from 137 slides. The validation set contained 412 manually screened patches from 211 slides, with only a few patches per slide to maximize diversity. The authors noted that including more slides was more helpful for dataset diversity than extracting more patches from the same slide.

TL;DR: 592 HCC patients, 137 slides for training (22,800 patches), 211 for validation (412 patches), 244 for testing, and 157 external TCGA slides. Patches were extracted at 5x, 20x, and 50x magnification and resized to 448 pixels. Rough annotations from an expert pathologist introduced inherent label noise.
Pages 3-4
Two-Level Screening to Filter Noisy Training Data

Because the training data was generated from rough annotations, many slides contained high levels of label noise, where patches within annotated tumor regions actually contained non-tumor tissue like fibrous stroma or blood vessels. The authors developed a two-stage screening strategy to mitigate this problem. First, a preliminary model was trained on the raw dataset to achieve approximately 85% accuracy on the validation set. This pretrained model was then used to screen the training data through two complementary methods.

Slide-level screening: All patches from a given slide were evaluated together. If the pretrained model achieved less than 70% accuracy on a slide, that entire slide was flagged as containing too many labeling errors and removed from the training set. This approach filters out entire slides with unreliable annotations, eliminating the noisiest samples without needing to assess individual patches.

Patch-level screening: Individual patches were assessed directly. Any patch with a predicted value below 0.7 was filtered out. The 0.7 threshold was determined through an ablation study testing thresholds of 0.6, 0.7, and 0.8, which yielded accuracies of 91.58%, 93.03%, and 92.12%, respectively. The 0.7 threshold achieved the best balance. Patch-level screening reduced dataset noise more directly, but the remaining noisy patches tended to be more ambiguous and difficult to classify.
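The two screening strategies can be sketched as follows, using the paper's thresholds (70% slide accuracy, 0.7 patch confidence). The dictionary-based data layout and function name are assumptions; only the filtering rules come from the text.

```python
import numpy as np

def screen_training_data(probs_by_slide, labels_by_slide,
                         slide_acc_min=0.70, patch_conf_min=0.70):
    """Apply both screening rules given a pretrained model's outputs.
    probs_by_slide:  {slide_id: array of predicted tumor probabilities}
    labels_by_slide: {slide_id: array of 0/1 annotation-derived labels}
    Returns (slide_level_keep, patch_level_keep) as {slide_id: index array}."""
    slide_keep, patch_keep = {}, {}
    for sid, probs in probs_by_slide.items():
        labels = labels_by_slide[sid]
        preds = (probs >= 0.5).astype(int)
        # Slide-level: drop the whole slide if accuracy falls below 70%.
        if (preds == labels).mean() >= slide_acc_min:
            slide_keep[sid] = np.arange(len(labels))
        # Patch-level: keep patches whose predicted value for their
        # annotated label is at least 0.7.
        conf_for_label = np.where(labels == 1, probs, 1.0 - probs)
        patch_keep[sid] = np.flatnonzero(conf_for_label >= patch_conf_min)
    return slide_keep, patch_keep
```

Note the asymmetry this makes explicit: slide-level screening discards good patches along with bad ones on a noisy slide, while patch-level screening salvages the reliable patches from every slide.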

Experiments showed that patch-level screening outperformed slide-level screening across all magnification scales, with an average improvement of 5.10% over slide-level screening and 8.92% over the raw (unscreened) dataset. The highest patch-level accuracy of 93.03% was achieved using the patch-level screened dataset at 20x magnification, which balanced the trade-off between image noise and cellular feature detail.

TL;DR: Two screening methods were tested. Patch-level screening (threshold 0.7) outperformed slide-level screening by 5.10% on average and the raw dataset by 8.92%. The best result was 93.03% patch-level accuracy at 20x magnification using patch-level screening.
Pages 3-4
ResNet18, Label Smoothing, and Dynamic Smoothing Weights

The model used ResNet18 as its backbone architecture, pre-trained on ImageNet to leverage existing feature extraction capabilities for basic image patterns. This transfer learning approach was critical because acquiring tens or hundreds of thousands of medical samples for training from scratch is unrealistic. Even after screening, the dataset retained residual noise, so the authors introduced label smoothing as an additional noise-mitigation strategy.

Label smoothing fundamentals: Standard cross-entropy loss forces predicted probabilities toward the extremes of 0 or 1, which causes over-confidence on individual samples and reduces generalization. Label smoothing replaces hard labels with soft labels defined as y_smooth = (1 - epsilon) * y + epsilon / 2, where epsilon controls the degree of smoothness. This constrains the model's output range and prevents over-fitting to noisy labels. The authors tested epsilon values of 0.1, 0.2, and 0.3 on the screened 20x dataset. At epsilon = 0.2, the model achieved 93.01% accuracy, AUC of 0.9691, and F1-score of 0.9585.
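The smoothing formula from the paper can be written directly; the loss function wrapper below is a generic smoothed binary cross-entropy added for illustration, not the authors' exact implementation.

```python
import numpy as np

def smooth_labels(y, eps=0.2):
    """Binary label smoothing per the paper: y_smooth = (1 - eps) * y + eps / 2."""
    return (1.0 - eps) * np.asarray(y, dtype=float) + eps / 2.0

def smoothed_bce(p, y, eps=0.2):
    """Cross-entropy against the smoothed targets (clipped for stability)."""
    ys = smooth_labels(y, eps)
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-(ys * np.log(p) + (1 - ys) * np.log(1 - p))))
```

With eps = 0.2, a hard label of 1 becomes 0.9 and a hard label of 0 becomes 0.1, so the loss is minimized by moderate probabilities rather than extreme ones — which is exactly how smoothing discourages over-confidence on noisy labels.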

Dynamic smoothing weight: Rather than applying a fixed smoothing weight uniformly, the model dynamically adjusts epsilon for each patch based on its surrounding context. Neighboring patches are combined into a "feature polymer" by a pretrained Autoencoder (Ep), which feeds into fully connected layers. The rationale is that if a patch is surrounded by a mixture of normal and cancerous cells, its annotated label is less credible and a larger smoothing weight should be applied. The final smoothing weight is epsilon' = 0.2 * sigma, where sigma is the sigmoid output of the fully connected layers (between 0 and 1), so epsilon' is bounded to the range (0, 0.2).
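A minimal numeric sketch of the dynamic weight, assuming a 64-dimensional neighborhood embedding and a single fully connected layer standing in for the paper's Autoencoder-plus-FC head (the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical FC head; in the paper this sits on top of the Autoencoder
# encoding of surrounding patches (the "feature polymer").
W = rng.normal(size=(1, 64)) * 0.1
b = np.zeros(1)

def dynamic_epsilon(feature_polymer):
    """Map a neighborhood embedding to a patch-specific smoothing weight:
    eps' = 0.2 * sigmoid(FC(features)), so eps' always lies in (0, 0.2)."""
    s = sigmoid(W @ feature_polymer + b)[0]
    return 0.2 * s
```

Because the sigmoid is bounded, the learned weight can never exceed the best fixed value (0.2); it only decides how much of that budget each patch's context warrants.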

With dynamic label smoothing, the model reached 93.87% patch-level accuracy, AUC of 0.9720, and F1-score of 0.9644, surpassing all fixed-epsilon variants. The dynamic approach effectively adapts to the local annotation reliability of each patch, providing stronger noise tolerance than a one-size-fits-all smoothing parameter.

TL;DR: ResNet18 (pretrained on ImageNet) with dynamic label smoothing achieved 93.87% patch-level accuracy, AUC of 0.9720, and F1-score of 0.9644. Dynamic smoothing outperformed all fixed-epsilon variants (best fixed: 93.01% at epsilon = 0.2) by adapting to each patch's surrounding context.
Pages 5-6
98.77% Slide-Level Accuracy on Internal Testing and 87.90% on TCGA

For slide-level prediction, all patches from each slide were fed through the trained model, and the ratio of patches classified as cancerous was calculated. A threshold of 0.04 was determined to be optimal for both the baseline CNN model and the proposed model. The internal testing set consisted of 244 slides, including 161 HCC slides and 83 paracancerous liver tissue slides, with sample numbers balanced across Edmondson-Steiner grades as much as possible.
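The slide-level rule reduces to a ratio test against the 0.04 threshold reported in the paper; the function below is a plain-Python sketch of that rule (names are my own).

```python
def classify_slide(patch_preds, ratio_threshold=0.04):
    """Slide-level decision from the paper: call the slide HCC when the
    fraction of patches predicted cancerous exceeds the threshold.
    patch_preds is a sequence of 0/1 per-patch predictions."""
    ratio = sum(patch_preds) / len(patch_preds)
    label = "HCC" if ratio > ratio_threshold else "non-tumor"
    return label, ratio
```

The low threshold reflects the task: even a small cluster of confidently cancerous patches is diagnostically decisive, while a handful of false-positive patches on a benign slide stays below 4%.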

Internal test set performance: The proposed model achieved 98.77% slide-level accuracy (241/244 slides correct), compared to 97.54% (238/244) for the baseline CNN. Breaking this down by grade: grade 0 (non-tumor) was 98.80% (82/83), grade 1 was 96.97% (32/33), grade 2 was 97.92% (47/48), grade 3 was 100% (48/48), and grade 4 was 100% (32/32). The baseline model matched the proposed model on grades 2 and 4 but fell short on grades 0, 1, and 3.

External TCGA validation: The model was further tested on 157 HCC slides from The Cancer Genome Atlas database, acquired with different imaging equipment. Here, the proposed model achieved 87.90% slide-level accuracy (138/157), a dramatic improvement over the baseline CNN, which managed only 54.14% (85/157). For G1 and G2 grades, the proposed model reached 83.84% (83/99) versus the baseline's 47.48% (47/99). For G3 and G4, the proposed model hit 93.62% (44/47) compared to 68.09% (32/47).

The massive gap on TCGA data (87.90% vs. 54.14%) highlights the critical advantage of label smoothing for generalization. The baseline model, which over-fits to noisy labels during training, fails to transfer to data from a different institution and scanner. The proposed model's noise-tolerant training strategy results in substantially better cross-dataset robustness.

TL;DR: Internal testing: 98.77% slide-level accuracy (241/244 correct) vs. 97.54% baseline. TCGA external testing: 87.90% (138/157) vs. 54.14% baseline, a 33.76 percentage-point improvement. The gap was largest for well-differentiated tumors (G1/G2: 83.84% vs. 47.48%).
Pages 6-8
89.98% Pixel-Level Accuracy, Plus Detection of Well-Differentiated HCC and MVI

To evaluate fine-grained diagnostic performance, 20 slides were elaborately annotated at the pixel level. The common approach for gigapixel pathological images is to use a sliding window to calculate mean prediction accuracy per pixel. Blank areas around and within the tissue were excluded to prevent artificially inflated scores. Across all 20 precisely annotated slides, the proposed model achieved 89.98% pixel-level accuracy, compared to 82.52% for the baseline CNN and 87.72% for the fixed label smoothing variant.
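The pixel-level metric amounts to a masked comparison: accuracy is averaged only over tissue pixels so that blank background cannot inflate the score. A minimal sketch, assuming the masks are binary arrays (the representation is my assumption):

```python
import numpy as np

def pixel_accuracy(pred_mask, gt_mask, tissue_mask):
    """Mean per-pixel accuracy restricted to tissue pixels, mirroring the
    paper's evaluation scheme of excluding blank areas around and within
    the tissue."""
    valid = tissue_mask.astype(bool)
    return float((pred_mask[valid] == gt_mask[valid]).mean())
```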

Well-differentiated HCC detection: The visualization results revealed that the proposed model excels at identifying well-differentiated HCC, which is one of the most challenging diagnostic tasks in liver pathology. Pathologists sometimes need immunohistochemistry and reticular fiber staining to distinguish well-differentiated HCC from hepatic atypical hyperplastic nodules. The baseline model ignored almost all well-differentiated cancer cells in the visualization examples, while the proposed model accurately predicted most of them.

Microvascular invasion (MVI) recognition: The model also demonstrated the ability to identify MVI and microsatellite nodules, despite never being explicitly trained on MVI samples. MVI is a crucial prognostic indicator for HCC. Prior deep learning studies by Song et al. and Wei et al. predicted MVI from MRI with accuracies of 79.3% and 81.2%, respectively, but no previous study had used deep learning on histopathological images to detect MVI. The model trained on the screened dataset correctly recognized several distinguishable MVI samples, while the model trained without screening misclassified many normal cells.

Handling complex tissue mixtures: In regions with highly dispersed cancer cells, the baseline model struggled to differentiate fibroblasts from cancerous tissue, while the proposed model accurately discriminated these patches. This ability to parse mixed regions of cancer cells, fibroblasts, and erythrocytes in vessels is directly attributable to the dynamic label smoothing approach, which reduces over-confidence in ambiguous boundary zones.

TL;DR: Pixel-level accuracy on 20 elaborately annotated slides: 89.98% (proposed) vs. 82.52% (baseline) vs. 87.72% (fixed smoothing). The model detected well-differentiated HCC and MVI without explicit MVI training data, outperforming prior MRI-based MVI prediction studies (79.3-81.2% accuracy).
Pages 7-8
How This Model Compares to Existing CNN Approaches for Liver Pathology

The authors contextualize their results against both traditional machine learning and deep learning approaches for HCC. Traditional methods such as support vector machines, random forests, and boosting algorithms have achieved reasonable performance on CT and ultrasound image analysis. Liao et al. used random forest on HCC pathological image features for diagnosis and survival prediction. Fehr et al. applied recursive feature selection with support vector machines for binary prostate cancer classification. However, all these approaches depend heavily on hand-designed features, which are time-consuming to engineer and unlikely to fully capture the complexity of pathological images.

Deep learning baselines: Schmauch et al. built a deep learning model for liver lesion differentiation using 367 ultrasound images, achieving 91.6% classification accuracy. Vivanti et al. developed an automated CT-based HCC detection model with 86% accuracy for tumor recurrence. Hamm et al. described an MRI liver lesion classifier with 92% accuracy, 92% sensitivity, and 98% specificity. These studies operated on radiology images rather than histopathological slides, which contain far more detailed tissue-level information but present greater computational and annotation challenges.

Pathology-specific CNN work: Kiani et al. created a CNN to differentiate HCC from cholangiocarcinoma using 25,000 non-overlapping 512x512 patches, yielding 84.2% accuracy on an independent test set. Liao et al. designed a CNN for liver tumor vs. normal tissue classification, achieving 94.9% at 5x magnification and 86.0% at 20x magnification on 256x256 patches. Both approaches ignored macro-view structural information and did not address the noisy annotation problem. The proposed model's 93.03% patch-level accuracy (at 20x) and 98.77% slide-level accuracy represent improvements over these prior benchmarks, particularly because the model was specifically engineered to handle annotation noise.

The most distinctive contribution of this study is not raw accuracy on clean data but robustness to noisy annotations. Prior methods either ignored the noise problem or attempted to add artificial noisy images as negative samples, which oversimplified the real noise patterns found in pathological annotations. The proposed method addresses noise at both the input level (screening) and the output level (dynamic label smoothing), producing a model that generalizes far better to external datasets.

TL;DR: Prior CNN work on HCC pathology achieved 84.2% (Kiani et al.) and 86.0-94.9% (Liao et al.) accuracy on cleaner datasets. This model's 93.03% patch-level and 98.77% slide-level accuracy were achieved despite noisy annotations, and its 87.90% on TCGA far exceeded the baseline's 54.14%, demonstrating superior generalization.
Pages 8-9
Single-Center Training, Grade Imbalance, and the Path Toward Molecular Integration

Single-center training data: All 592 training and internal testing cases came from a single institution (Zhejiang University's First Affiliated Hospital), using the same 3DHISTECH P250 FLASH scanner. While the TCGA external validation partially addresses this concern, the 87.90% accuracy on TCGA (compared to 98.77% internally) reveals a performance drop when generalizing to slides from different scanners and staining protocols. Multi-center training datasets would be needed to close this gap.

Grade imbalance: More than 400 of the 592 cases belonged to grade 2 or grade 3, leaving grades 1 and 4 substantially underrepresented. The authors filtered down to 137 training cases to mitigate this imbalance, but the small training set size limits the model's capacity. Future work with more balanced, larger datasets across all Edmondson-Steiner grades would strengthen performance, particularly for rarer well-differentiated (grade 1) and poorly-differentiated (grade 4) tumors.

Rough annotation reliance: The entire method was designed around the reality of rough annotations, and only 20 slides had elaborate pixel-level ground truth. While this makes the approach practical (expert annotation at scale is infeasible), it also means the pixel-level evaluation was limited to a small subset. The 89.98% pixel-level accuracy, though strong, was measured on just 20 slides and may not fully represent model performance across the entire spectrum of HCC morphology.

Molecular integration potential: The authors highlight that histological phenotypes of HCC are closely linked to gene mutations and molecular tumor subgroups. CTNNB1-mutated HCCs display microtrabecular and pseudoglandular patterns, while macrotrabecular-massive HCC frequently harbors TP53 mutations and FGF19 amplifications. Coudray et al. showed CNNs can predict six frequently mutated genes in lung adenocarcinomas with 73.3-85.6% accuracy from WSIs. Integrating morphological classification with molecular alteration prediction represents a promising future direction that could provide unified diagnostic, therapeutic, and prognostic insight for HCC.

Expanding categories: The current model performs binary classification (tumor vs. non-tumor with grade assignment). The authors propose that, with sufficient balanced data, the model could subdivide tumors into finer categories and extend to liver cirrhosis, hepatitis, and other liver tumors, providing multi-indicator clinical decision support from a single pathological slide.

TL;DR: Limitations include single-center training (592 cases, one scanner), grade imbalance (400+ cases in grades 2-3), and pixel-level evaluation on only 20 slides. Future directions include multi-center datasets, molecular alteration prediction (CTNNB1, TP53, FGF19), and expanding the model to classify liver cirrhosis, hepatitis, and other liver tumors.
Citation: Feng S, Yu X, Liang W, et al. Development of a Deep Learning Model to Assist With Diagnosis of Hepatocellular Carcinoma. Frontiers in Oncology, 2021. PMC8671137. DOI: 10.3389/fonc.2021.762733. Open access, CC BY license.