Hepatocellular carcinoma (HCC) is an aggressive primary liver cancer that develops in the setting of chronic liver disease and ranks among the leading causes of cancer incidence and mortality worldwide. While the burden of HCC related to hepatitis B virus (HBV) and hepatitis C virus (HCV) has been declining with effective antiviral therapy, HCC incidence tied to metabolic syndrome and non-alcoholic fatty liver disease (NAFLD) continues to rise, driven by the sharp increase in obesity and related conditions in the general population.
Despite decades of research yielding screening protocols, non-invasive diagnostic modalities, and various treatment options including surgical, locoregional, and systemic therapies, the overall outcomes for patients with HCC remain poor. There are significant unmet needs in risk prediction, early detection, accurate prognostication, and individualized treatment selection. Patients generate enormous amounts of health data, but turning that data into actionable clinical knowledge remains a major challenge.
This review by Ahn et al. from Mayo Clinic and Cedars-Sinai Medical Center provides a comprehensive overview of how deep learning (DL) algorithms are being applied across the full spectrum of HCC care. The authors organize their review around four key application domains: risk prediction using clinical variables, multi-omics-based biomarker discovery, radiology-based diagnosis and treatment planning, and histopathology-based tumor characterization and prognostication.
The core deep learning technologies covered include convolutional neural networks (CNNs) for image analysis, recurrent neural networks (RNNs) for time-series clinical data, autoencoders for unsupervised multi-omics integration, and novel architectures like G2Vec (adapted from natural language processing) for gene expression analysis. The review emphasizes that these algorithms represent state-of-the-art techniques for handling complex, multimodal healthcare data ranging from routine lab values to high-resolution medical images.
Artificial intelligence (AI) encompasses a broad range of technology that enables machines to perform tasks typically requiring human reasoning and problem-solving. Machine learning (ML) is a branch of AI where computer algorithms train on sample data to build mathematical models that make predictions without being explicitly programmed. ML algorithms divide broadly into supervised learning (training on labeled outcome data) and unsupervised learning (discovering patterns in unlabeled data).
Examples of supervised learning include traditional techniques like logistic regression and more sophisticated methods such as support vector machines, random forests, and gradient boosting. Unsupervised techniques include K-means clustering and principal component analysis. Among ML algorithms, artificial neural networks (ANNs) consist of interconnected mathematical layers that analyze complex non-linear relationships, making them especially powerful for high-dimensional data.
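To make the idea of supervised learning concrete, the following is a minimal sketch of logistic regression fitted by gradient descent. The data, the function names (`fit_logistic`, `sigmoid`), and the learning-rate settings are all illustrative assumptions and do not come from the review.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: one standardized predictor and a labeled binary outcome.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = fit_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(w, b, predictions)
```

The labels `ys` are what make this supervised: the model is told the outcome for every training example and adjusts its weights to reproduce it.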
Deep learning refers to ANN models built from many stacked layers and has emerged as a state-of-the-art technique for analyzing complex healthcare data. The two most commonly used DL architectures are CNNs and RNNs. CNNs have connectivity patterns resembling those of the animal visual cortex and excel at detecting spatial features in high-dimensional images. RNNs form directed graphs along temporal sequences and are well suited to time-series prediction tasks, such as tracking disease progression over time.
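The two building blocks can be sketched in a few lines. The example below is a simplified illustration, not any model from the review: `conv2d_valid` applies a single hand-written filter to a tiny image, as one CNN filter would, and `rnn_states` runs a one-unit recurrent cell over a short sequence. The image, the kernel, the weights, and the "lab series" values are all made-up assumptions.

```python
import math

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products.

    This is the cross-correlation that deep learning libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def rnn_states(sequence, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    h, states = 0.0, []
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states

# Toy 4x5 "image": dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]
# Hand-written vertical-edge filter; a trained CNN would learn such weights.
vertical_edge = [[-1, 0, 1] for _ in range(3)]
feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]

# Hypothetical rising series of a normalized lab value over four visits.
lab_series = [0.1, 0.2, 0.4, 0.8]
hidden = rnn_states(lab_series)
print(hidden)
```

The filter responds only where the window contains the dark-to-bright boundary, which is the sense in which CNNs detect spatial features. The recurrent state carries information from earlier visits into later ones.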
The authors stress that any AI-based ML algorithm requires external validation in an independent dataset, since models can be overfitted and overestimate their performance. Throughout the review, all reported performance metrics come from validation cohorts rather than the original training data, which is an important methodological distinction that strengthens the reliability of the findings.
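The gap between training and validation performance can be illustrated with a deliberately overfitted model. The sketch below uses a 1-nearest-neighbor classifier, which memorizes its training data, on a synthetic one-dimensional marker. The cohorts, their sizes, and the random seed are assumptions. Both cohorts here are drawn from the same distribution, so the sketch shows only the held-out principle; true external validation would use patients from a different source.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the single closest training point (1-nearest neighbor)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the memorizing model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)

def make_cohort(rng, n):
    """Synthetic cohort: a noisy marker whose two classes overlap heavily."""
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(float(y), 1.0) for y in ys]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_cohort(rng, 200)
valid_x, valid_y = make_cohort(rng, 200)

# Scoring the model on its own training data: every point is its own nearest neighbor.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# Scoring the same model on independent data gives a more honest estimate.
valid_acc = accuracy(train_x, train_y, valid_x, valid_y)
print(train_acc, valid_acc)
```

Training accuracy is perfect by construction, while accuracy on the independent cohort is lower, which is why metrics from validation cohorts are the ones worth reporting.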
Despite multiple available risk prediction tools for HCC, none have been rigorously validated or endorsed by major liver societies. Current guidelines recommend HCC surveillance for patients with cirrhosis and high-risk patients with chronic HBV infection, but more precise individual-level risk models are needed to implement targeted screening strategies.
Ioannou et al. trained an RNN to predict HCC development within 3 years using 4 baseline and 27 longitudinal variables from 48,151 patients with HCV-related cirrhosis in the Veterans Health Administration database. The RNN significantly outperformed logistic regression, achieving an AUC of 0.759 overall and 0.806 among patients who achieved sustained virologic response. Separately, Phan et al. transformed disease histories of 1 million patients from Taiwan into a 108 x 998 matrix and applied a CNN, which predicted liver cancer with an AUC of 0.886 and accuracy of 0.980.
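The AUC values quoted here and throughout can be read as the probability that a randomly chosen patient who develops the outcome receives a higher risk score than a randomly chosen patient who does not. A minimal sketch of that pairwise definition follows; the six labels and scores are made-up values for illustration only.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Each (positive, negative) pair counts 1 if the positive scores higher,
    0.5 if the scores tie, and 0 otherwise.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = developed HCC) and model risk scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

result = auc(labels, scores)
print(round(result, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is fine for a sketch; rank-based formulas are used for large cohorts.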
Nam et al. constructed a deep neural network based on ResNet architecture to predict 3-year and 5-year HCC incidence in 424 patients with HBV-related cirrhosis on entecavir therapy. In an external validation cohort of 316 patients, the model achieved a C-index of 0.782 and significantly outperformed six previously reported models built on traditional statistical methods.
The same group developed MoRAL-AI (AI-based Model of Recurrence after Liver Transplantation) to predict HCC recurrence after transplantation using variables including tumor diameter, age, alpha-fetoprotein (AFP), and prothrombin time. MoRAL-AI showed significantly better predictive performance than conventional models including the Milan, UCSF, up-to-seven, and Kyoto criteria (C-index 0.75 vs. 0.64, 0.62, 0.50, and 0.50, respectively; P < 0.001).
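The C-index reported for these models extends the AUC idea to time-to-event data with censoring. The sketch below implements Harrell's concordance index in its basic form, under the assumption that this is the variant meant; the cited studies may have used a different estimator. The follow-up times, event flags, and risk scores are invented for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (basic form).

    A pair (i, j) is comparable when patient i had the event and did so before
    patient j's last follow-up. It is concordant when the model gave patient i
    the higher risk; tied risks count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical cohort: follow-up in years, 1 = recurrence observed, 0 = censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)  # 0.8 (4 of 5 comparable pairs ordered correctly)
```

As with the AUC, 0.5 means the ranking is no better than chance, which puts the values of 0.50 quoted above for some conventional criteria in context.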
Serum AFP has been widely used as a predictive and prognostic biomarker for HCC, but it has limited sensitivity for detecting early-stage disease and its levels do not reliably correlate with disease progression. Recent advances in multi-omics, which integrates data from the genome, epigenome, transcriptome, proteome, metabolome, and microbiome, are expected to address this unmet need for novel biomarkers. These experiments generate enormous amounts of data, making DL techniques essential for computational processing and analysis.
Xie et al. used gene expression profiling of peripheral blood to build an ANN model that distinguishes HCC patients from a control group. Using a nine-gene expression system, the ANN achieved an AUC of 0.943, 98% sensitivity, and 85% specificity. However, the control group consisted of healthy individuals rather than patients with cirrhosis, so these figures may overestimate the model's performance in a realistic clinical scenario.
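Sensitivity and specificity are simple ratios computed separately within the diseased and the control group, which is why the choice of control group matters so much. The sketch below shows the arithmetic. The group sizes and counts are hypothetical, chosen only so that the ratios come out at 98% and 85%; they are not the study's actual numbers.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true cases the model flags
    specificity = tn / (tn + fp)  # share of controls the model clears
    return sensitivity, specificity

# Hypothetical counts: 50 HCC cases (49 flagged) and 100 controls (85 cleared).
labels = [1] * 50 + [0] * 100
predictions = [1] * 49 + [0] * 1 + [0] * 85 + [1] * 15

sens, spec = sensitivity_specificity(labels, predictions)
print(sens, spec)
```

Specificity depends only on how the model treats the controls, so a model that easily clears healthy volunteers may clear far fewer patients with cirrhosis.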
Choi et al. proposed a novel network-based DL method called G2Vec, adapted from the Word2Vec model originally used for natural language processing (NLP). When applied to gene expression data from The Cancer Genome Atlas (TCGA), G2Vec showed superior prediction accuracy for patient outcomes compared to existing gene selection methods and identified two distinct gene modules significantly associated with HCC prognosis.
Chaudhary et al. used RNA sequencing, miRNA, and methylation data from 360 HCC patients in TCGA to build an autoencoder, an unsupervised feed-forward neural network. This DL model was able to distinguish patients with survival differences and identify specific mutations and pathways as predictors of aggressive tumor behavior, demonstrating the power of multi-omics integration through deep learning for prognostic stratification.
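The core idea of an autoencoder is to compress each input through a narrow bottleneck and reconstruct it, learning without outcome labels. The sketch below is a deliberately tiny linear version with a one-number bottleneck, trained by gradient descent on synthetic data. It is not the architecture used by Chaudhary et al., and the three "features", the initial weights, and the training settings are assumptions.

```python
def train_autoencoder(data, lr=0.01, steps=2000):
    """Train a linear autoencoder with a 1-dimensional bottleneck.

    Encoder: z = enc . x        (compress the input to one number)
    Decoder: x_hat = dec * z    (reconstruct the input from that number)
    The loss is the mean squared reconstruction error; no labels are used.
    """
    dim = len(data[0])
    enc = [0.1] * dim
    dec = [0.1] * dim

    def loss():
        total = 0.0
        for x in data:
            z = sum(e * xi for e, xi in zip(enc, x))
            total += sum((d * z - xi) ** 2 for d, xi in zip(dec, x))
        return total / len(data)

    initial_loss = loss()
    for _ in range(steps):
        grad_enc = [0.0] * dim
        grad_dec = [0.0] * dim
        for x in data:
            z = sum(e * xi for e, xi in zip(enc, x))
            residual = [d * z - xi for d, xi in zip(dec, x)]
            # Backpropagate the reconstruction error through decoder and encoder.
            dz = 2.0 * sum(r * d for r, d in zip(residual, dec))
            for k in range(dim):
                grad_dec[k] += 2.0 * residual[k] * z
                grad_enc[k] += dz * x[k]
        for k in range(dim):
            enc[k] -= lr * grad_enc[k] / len(data)
            dec[k] -= lr * grad_dec[k] / len(data)
    return enc, dec, initial_loss, loss()

# Synthetic "patients": three correlated features driven by one hidden factor t.
data = [[t * 1.0, t * 2.0, t * -1.0] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

enc, dec, initial_loss, final_loss = train_autoencoder(data)
print(initial_loss, final_loss)
```

Because the three features are driven by one underlying factor, a single bottleneck value is enough to reconstruct them. In a multi-omics setting the learned bottleneck values, rather than the raw features, are what would be passed to downstream clustering or survival analysis.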
Ultrasound-based deep learning: CNN algorithms trained on ultrasound images have shown excellent performance for liver lesion detection and classification. Bharti et al. built a CNN using echotexture and liver surface roughness on 754 segmented ultrasound images, differentiating between normal liver, chronic liver disease, cirrhosis, and HCC with 96.6% accuracy. Brehar et al. demonstrated that a CNN for HCC detection on ultrasound achieved an AUC of 0.95 with 91.0% accuracy, 94.4% sensitivity, and 88.4% specificity, significantly outperforming conventional ML algorithms including support vector machines and random forests.
CT-based classification and segmentation: Yasaka et al. used CT images from 460 patients to train a CNN that classifies liver lesions into five categories (HCC, other malignancies, indeterminate masses, hemangiomas, and cysts) with a median AUC of 0.92. For segmentation, the 2017 Liver Tumor Segmentation Benchmark (LiTS) challenge encouraged development of automated algorithms using 200 CT scans. Several teams developed DL architectures with promising results, including approaches based on H-DenseUNet (a hybrid DenseNet and U-Net architecture), modified SegNet, and Channel-U-Net for liver and tumor segmentation.
MRI-based deep learning: Hamm et al. used MRI images from 494 patients to train a CNN that classifies hepatic lesions into six categories. The CNN outperformed expert radiologists (90% sensitivity and 98% specificity vs. 82.5% sensitivity and 96.5% specificity), with particularly dramatic improvement for HCC detection (90% vs. 60-70% sensitivity). Wu et al. built a CNN using multiphase MRI that achieved an AUC of 0.95 for distinguishing LI-RADS grade 3 from LI-RADS 4 and 5 lesions for HCC diagnosis.
Beyond detection to risk prediction: Jin et al. performed a DL radiomics analysis on 2D shear wave elastography and B-mode ultrasound images of 434 chronic HBV patients, predicting 5-year HCC development with an AUC of 0.900. Zhen et al. trained a CNN combining unenhanced MRI and clinical variables from 1,210 patients with liver tumors, achieving diagnostic performance on par with three experienced radiologists using enhanced MRI, suggesting that DL could reduce the need for contrast agents in some settings.
Microvascular invasion prediction: Vascular invasion is a key prognostic element in patients with HCC. Recent studies developed CNN models with promising ability to detect microvascular invasion (MVI) on MRI images of HCC patients undergoing surgical resection, with AUCs ranging from 0.72 to 0.79. Jiang et al. achieved an AUC of 0.906 for MVI prediction on CT images from 405 HCC patients, with mean survival significantly better in the group without MVI, confirming the clinical importance of this prediction.
TACE response prediction: Liu et al. developed a DL radiomics model to predict responses to trans-arterial chemoembolization (TACE) using ultrasound images of 130 HCC patients, accurately predicting TACE response with an AUC of 0.93. Peng et al. trained a residual CNN model to predict TACE response using CT images from 562 patients with intermediate-stage HCC, achieving accuracies of 85.1% and 82.8% in two external validation cohorts. These models could help clinicians identify patients most likely to benefit from TACE before initiating treatment.
Survival prediction with TACE and sorafenib: Liu et al. developed a DL score for disease-specific survival using CT images from 243 HCC patients treated with TACE, with a higher score predicting poor prognosis (hazard ratio: 3.01). Zhang et al. built a DL-based model predicting overall survival using CT images from 201 patients with unresectable HCC treated with TACE plus sorafenib, achieving superior predictive performance compared to the clinical nomogram (C-index 0.730 vs. 0.679, P = 0.023).
An et al. used an unsupervised CNN-based deformable image registration technique to assess ablative margins in 141 patients with single HCC who underwent microwave ablation, demonstrating that patients with ablative margins less than 5 mm were at significantly higher risk of local tumor progression. This type of DL application helps guide locoregional therapy planning and post-treatment surveillance strategies.
Automated tumor grading: DL models can effectively replicate and augment pathologists' work in diagnosing and grading HCC. Lin et al. used multiphoton microscopy images from 113 HCC patients to train a CNN that achieved over 90% accuracy for determining HCC differentiation. Kiani et al. developed a CNN-based "Liver Cancer Assistant" that accurately differentiated hematoxylin and eosin (H&E) images of HCC and cholangiocarcinoma and improved the diagnostic performance of nine pathologists. Chen et al. trained a CNN for automatic grading of HCC tumors on H&E images, achieving 96% accuracy for benign vs. malignant classification and 89.6% accuracy for tumor differentiation.
Mutation prediction from histology: Liao et al. used TCGA data to train a CNN that distinguished HCC from adjacent normal tissues with perfect performance (AUC: 1.00) and predicted the presence of specific somatic mutations with AUCs over 0.70. Chen et al. similarly demonstrated that CNN-extracted features from H&E images could predict the presence of specific genetic mutations, connecting morphological patterns visible in tissue to underlying genomic alterations that have treatment implications.
Novel histologic subtyping: Wang et al. trained a CNN for automated segmentation and classification of individual nuclei at the single-cell level on H&E-stained HCC tissue sections, extracting 246 quantitative image features. Unsupervised clustering analysis then identified three distinct histologic subtypes that were independent of previously established genomic clusters and had different prognoses. This demonstrates how DL can surface disease categories that are not apparent to human observers.
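The clustering step can be sketched with k-means, used here as a stand-in because the summary does not state which clustering method the study applied. The example groups toy two-dimensional points into three clusters; real inputs would be feature vectors with hundreds of dimensions. The points and the farthest-point initialization are assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Toy feature vectors for nine "tumors" forming three visible groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11)]

labels, centroids = kmeans(points, k=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

No outcome labels enter the procedure. Whether the resulting groups carry prognostic meaning has to be tested afterwards, for example by comparing survival between clusters.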
Survival and recurrence prediction: Saillard et al. used two DL algorithms on whole-slide digitized histological slides from 194 HCC patients to predict survival after surgical resection, with both models outperforming a score combining all baseline clinical variables. Shi et al. built an interpretable DL framework using pathologic images from 1,445 patients and developed a "tumor risk score" that was superior to clinical staging systems including BCLC. Yamashita et al. developed a histopathology-based DL system that stratified patients with risk scores for postsurgical recurrence of HCC.
The "black box" problem: DL models are traditionally considered "black-box" models, meaning clinicians cannot understand how the models arrive at their predictions. Interpretability is crucial for physicians to accept and trust DL in everyday clinical practice, for troubleshooting model errors, and for improving performance on rare cases. This is being addressed by recent developments in various "explainable AI" techniques, but there is currently no consensus on the best methodology for making DL models transparent.
Generalizability concerns: AI algorithms developed at highly specialized academic medical centers using their own patients' data may over-represent certain populations and not accurately reflect the real-world diversity of patients seen at community hospitals. Models trained predominantly on data from one institution or one ethnic group may perform poorly when deployed in different clinical settings, limiting the practical utility of even high-performing algorithms.
Validation and availability: AI models, like other prediction models, are often not publicly available, limiting independent external validation. The authors emphasize that validation of proposed models and comparison to existing models are as important as deriving new ones. Large-scale, prospective, multi-center studies involving diverse populations with rigorous external validation will be necessary before DL algorithms can be widely accepted and deployed in routine clinical care.
Autonomous robotics: A currently under-explored but highly promising area is the application of DL in autonomous surgical robotics. A DL-based surgical instrument tracking algorithm was able to closely track instruments during robotic surgery and evaluate surgeons' performance, demonstrating that DL can learn the correct steps of robotic procedures. However, significant barriers remain, including technical limitations, patient and provider hesitation, and the critical need for "explainability" so that humans can understand and correct every mistake an autonomous robot makes during surgery.
The evidence table: The review includes a comprehensive table summarizing over 25 studies applying deep learning to HCC. The studies span the full range of clinical applications: risk prediction from clinical variables (using RNN, CNN, and ResNet), multi-omics-based diagnosis and prognostication (using ANN, G2Vec, and autoencoders), radiology-based diagnosis and risk prediction across ultrasound, CT, and MRI (overwhelmingly using CNN architectures), and histopathology-based grading, subtyping, and outcome prediction (again predominantly CNN-based).
Performance patterns: Across the studies, several patterns emerge. DL models consistently outperform traditional statistical methods and conventional ML algorithms for risk prediction tasks. For imaging-based diagnosis, CNN models achieve AUCs of 0.89 to 0.99 across modalities, with several models matching or exceeding expert radiologist performance. In pathology, DL enables discovery of novel disease subtypes and genetic correlations that are invisible to human observers. Treatment response prediction models, particularly for TACE, achieve clinically meaningful accuracy that could inform therapy selection.
Clinical implications: The practical implications of these findings are substantial. DL-based risk prediction could enable targeted screening strategies, reducing unnecessary surveillance in low-risk patients while intensifying monitoring for high-risk individuals. Automated imaging diagnosis could reduce radiologist workload and improve consistency, especially in settings with limited specialist access. DL-driven treatment response prediction could guide the choice between TACE, sorafenib, surgical resection, and other therapeutic options. Together, these tools point toward personalized HCC management aligned with individual tumor biology rather than relying solely on population-level classification systems such as BCLC staging and the Child-Pugh score.
The authors conclude that the application of DL for HCC care is rapidly becoming a reality, with most studies published within the past two years at the time of the review. DL algorithms not only replicate the work of human physicians efficiently and accurately, but more importantly can discover novel biologic pathways and disease subgroups with clinical significance by processing complex high-dimensional data in ways impossible for the human brain.