Deep Learning for Feature Selection in Hepatocellular Carcinoma

BMC Medical Informatics and Decision Making, 2024

Plain-English Explanations
Pages 1-3
Why This Scoping Review Matters: Deep Learning Meets HCC Feature Selection

Hepatocellular carcinoma (HCC) is the most common form of primary liver cancer and one of the leading causes of cancer-related death worldwide. Despite advances in screening, non-invasive diagnostic imaging, and treatment options ranging from surgical resection to systemic therapies, the overall prognosis for HCC patients remains poor. Many patients are diagnosed at advanced stages because early-stage disease is often asymptomatic, and the sheer volume of clinical, imaging, and molecular data generated per patient makes it difficult to extract actionable insights using traditional methods.

This scoping review, published by Mostafa et al. in BMC Medical Informatics and Decision Making (2024), systematically maps how deep learning techniques are being used to simplify and optimize feature selection for HCC. The authors evaluated approximately 420 scholarly articles from Scopus and Web of Science (published between 2013 and 2023) and selected 73 papers that directly address the intersection of deep learning, machine learning, and HCC prediction, diagnosis, prognosis, and treatment planning.

The review covers a broad spectrum of architectures, from deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and ResNet (residual networks) to classical machine learning methods such as support vector machines and various ensembles. It organizes findings by data modality: clinical variables and risk factors, histopathology slides, and radiological imaging (CT, MRI, ultrasound). The central argument is that deep learning can automate the extraction of predictive features from high-dimensional data, reducing the manual labor and subjectivity of traditional feature engineering while improving diagnostic accuracy.

The authors also present a taxonomy of HCC itself, covering both etiological classifications (viral hepatitis, alcohol consumption, NAFLD, autoimmune disorders, genetic factors) and pathological subtypes recognized by the WHO (steatohepatitic, fibrolamellar, scirrhous, and others). This taxonomy is important because different HCC subtypes may respond differently to AI-driven feature selection approaches, and the choice of deep learning model may depend on the specific clinical context.

TL;DR: This scoping review analyzed 73 studies (from 420 candidates) on how deep learning simplifies feature selection for HCC. It covers CNNs, RNNs, ResNet, and ensemble methods applied to clinical data, histopathology, and radiology, arguing that automated feature extraction outperforms traditional manual approaches in accuracy and efficiency.
Pages 3-6
HCC Datasets: Public Resources and Clinical Repositories

Public datasets: The review catalogs key publicly available HCC datasets that researchers use to train and validate deep learning models. The LiTS (Liver Tumor Segmentation) Challenge Dataset provides 131 contrast-enhanced CT scans with annotations for liver and tumor segmentation. TCGA-LIHC (The Cancer Genome Atlas, Liver Hepatocellular Carcinoma) offers a comprehensive multi-omics dataset including genomic, transcriptomic, and clinical data for 377 HCC patients. The DeepLesion dataset contains 32,735 lesions from 10,594 CT studies, while CPTAC-HCC provides proteomic and clinical data for 159 patients.

Specialized imaging datasets: The LiRad Dataset contains CT and MRI images annotated using the LI-RADS classification system for liver lesions, and the HCC-LPC (Hepatocellular Carcinoma Liver Phenotype Classifier) dataset provides 110 contrast-enhanced CT scans with annotations for HCC and liver phenotypes. These standardized datasets enable benchmarking of different deep learning architectures against each other on the same data.

Private and clinical datasets: Several private datasets are available through institutional collaboration, including the BCLC-HCC dataset (Barcelona Clinic Liver Cancer) containing clinical, radiological, and pathology data, along with molecular profiling repositories from Caris Life Sciences, Foundation Medicine, and Tempus. These private datasets often contain richer clinical annotations but are less accessible for independent validation.

The diversity of these datasets underscores both an opportunity and a challenge: while multi-institutional data improves model generalizability, differences in imaging protocols, patient populations, and annotation standards introduce heterogeneity that deep learning models must learn to handle. The review notes that the choice of dataset directly influences which deep learning architectures perform best.

TL;DR: Key public HCC datasets include LiTS (131 CT scans), TCGA-LIHC (377 patients, multi-omics), DeepLesion (32,735 lesions), and CPTAC-HCC (159 patients, proteomics). Private datasets from BCLC, Caris, Foundation Medicine, and Tempus provide richer clinical data but are harder to access, creating trade-offs between generalizability and annotation quality.
Pages 7-10
Deep Learning on Clinical Variables: Predicting HCC Risk and Recurrence

RNN for HCC risk prediction: Ioannou et al. trained a recurrent neural network (RNN) on data from 48,151 patients with hepatitis C-related cirrhosis from the US Veterans Affairs system. The RNN used 4 baseline variables and 27 time-varying features to predict HCC development within 3 years, achieving 75.9% accuracy overall and 80.6% accuracy in patients who achieved sustained virologic response. This significantly outperformed logistic regression, demonstrating the value of temporal modeling for longitudinal clinical data.
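The core idea behind such temporal models — folding a sequence of visits into a single evolving hidden state, then reading a risk score from that state — can be illustrated in a few lines. This is not Ioannou et al.'s actual model; it is a minimal numpy sketch of a single-layer vanilla RNN, with made-up parameter shapes and random weights standing in for trained ones.

```python
import numpy as np

def rnn_risk_score(sequence, W_xh, W_hh, b_h, w_out, b_out):
    """Vanilla RNN forward pass over a patient's longitudinal record.

    sequence: (T, F) array -- T visits, each with F time-varying features.
    Returns a sigmoid risk score in (0, 1).
    """
    h = np.zeros(W_hh.shape[0])
    for x_t in sequence:                      # one update per clinic visit
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    logit = w_out @ h + b_out                 # read risk from the final hidden state
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
T, F, H = 6, 27, 16                           # 6 visits, 27 time-varying features
params = (rng.normal(size=(H, F)) * 0.1,      # W_xh: input-to-hidden
          rng.normal(size=(H, H)) * 0.1,      # W_hh: hidden-to-hidden (the "memory")
          np.zeros(H),                        # b_h
          rng.normal(size=H) * 0.1,           # w_out
          0.0)                                # b_out
score = rnn_risk_score(rng.normal(size=(T, F)), *params)
print(round(float(score), 3))
```

The hidden-to-hidden matrix `W_hh` is what lets the model weigh a lab value differently depending on the trajectory that preceded it — the capability logistic regression, which sees each visit independently, lacks.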

Deep neural networks for cirrhosis and transplant cohorts: Nam et al. developed a deep neural network to predict HCC incidence over 3 and 5 years in 424 patients with hepatitis B-related cirrhosis on entecavir therapy, achieving a Harrell's C-index of 0.782. The same group created MoRAL-AI, a deep learning model analyzing tumor size, patient age, AFP levels, and prothrombin time to identify patients at high risk of recurrence after liver transplantation, achieving a C-index of 0.75 compared to 0.50-0.64 for traditional criteria (Milan, UCSF, up-to-seven, Kyoto).

CNN on electronic health records: Phan et al. converted medical histories from one million Taiwanese health insurance records into 108 x 998 matrices and applied a CNN to predict liver cancer in viral hepatitis patients, achieving an AUC of 0.886 and accuracy of 98.0%. Liang et al. developed a CNN model that predicted HCC risk one year in advance with an AUROC of 0.94 using minimal features from electronic health records, with consistent performance at 6-month, 2-year, and 3-year prediction horizons.
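The trick in the Phan et al. study is representational: a patient's claims history becomes an image-like grid that a CNN can scan for temporal patterns. The sketch below shows one plausible encoding, assuming rows index diagnosis codes and columns index time periods; the specific code/period indices are invented for illustration.

```python
import numpy as np

def history_to_matrix(visits, n_codes, n_periods):
    """Encode a visit history as a (codes x time) occurrence matrix,
    the kind of image-like input a CNN can convolve over.

    visits: iterable of (code_index, period_index) pairs.
    """
    m = np.zeros((n_codes, n_periods), dtype=np.float32)
    for code, period in visits:
        m[code, period] = 1.0        # mark: this diagnosis code appeared in this period
    return m

# toy patient: a hepatitis code (row 5) recurring, then a cirrhosis code (row 17)
visits = [(5, 0), (5, 3), (5, 7), (17, 9)]
mat = history_to_matrix(visits, n_codes=108, n_periods=120)
print(mat.shape, int(mat.sum()))
```

Once histories are matrices, a 2D convolution slides over (code, time) neighborhoods exactly as it would over pixels, so co-occurring diagnoses close in time become learnable local patterns.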

Additional studies explored LASSO Cox regression for gene-based risk models (Cheng et al., achieving 3-year and 5-year overall survival AUCs of 0.783 and 0.828), 3D convolutional neural networks for predicting microvascular invasion from MRI (Zhang et al., AUC of 0.72 on validation), and comparative evaluations showing that CNN and RNN approaches consistently outperformed traditional machine learning models including KNN, decision trees, Naive Bayes, and logistic regression.

TL;DR: RNNs on 48,151 patients achieved 80.6% accuracy for HCC prediction in hepatitis C cirrhosis. CNNs on electronic health records reached AUROC 0.94 for one-year HCC prediction. The MoRAL-AI model achieved C-index 0.75 for post-transplant recurrence, outperforming traditional Milan and UCSF criteria (0.50-0.64), while LASSO Cox regression identified 13 prognostic genes.
Pages 11-14
Deep Learning on Histopathology: From Whole-Slide Images to Prognosis

CNN-based survival prediction: Yamashita et al. developed a CNN pipeline that preprocessed whole-slide images (WSIs) by partitioning them into 299 x 299 pixel tiles, then selected the top 100 tiles with the highest tumor probability to generate risk scores. Applied to the TCGA-HCC (360 WSIs) and Stanford-HCC (198 WSIs) datasets, the CNN risk scores outperformed the TNM staging system at predicting cancer recurrence. Saillard et al. similarly used two deep learning algorithms on digitized histopathology slides from 194 HCC patients and demonstrated superior discriminatory capabilities compared to composite clinical scores, with C-indices of 0.75 to 0.78.
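The tile-then-rank preprocessing in the Yamashita et al. pipeline is easy to sketch. This is a simplified stand-in, not their code: the slide is a plain numpy array, and a mean-intensity scorer substitutes for the tumor-probability CNN.

```python
import numpy as np

def top_tiles(slide, tile=299, k=100, score_fn=None):
    """Partition a slide into non-overlapping tile x tile patches and keep
    the k highest-scoring ones. score_fn stands in for a tumor-probability CNN."""
    if score_fn is None:
        score_fn = lambda t: float(t.mean())  # placeholder score: mean intensity
    h, w = slide.shape[:2]
    tiles = [slide[r:r + tile, c:c + tile]
             for r in range(0, h - tile + 1, tile)
             for c in range(0, w - tile + 1, tile)]
    scores = np.array([score_fn(t) for t in tiles])
    keep = np.argsort(scores)[::-1][:k]       # indices of the top-k tiles
    return [tiles[i] for i in keep], scores[keep]

rng = np.random.default_rng(1)
slide = rng.random((299 * 4, 299 * 5))        # toy slide: a 4 x 5 grid of tiles
tiles, scores = top_tiles(slide, k=3)
print(len(tiles), tiles[0].shape)
```

Ranking tiles before aggregation is itself a form of feature selection: the risk score is computed only from the regions the model considers most tumor-like, rather than from the whole gigapixel slide.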

Automated grading and mutation detection: Chen et al. trained a CNN to automate HCC tumor grading from H&E-stained histopathology images, achieving 96% accuracy for benign vs. malignant classification and 89.6% accuracy for tumor differentiation stage. The model could also predict certain genetic mutations associated with HCC. Liao et al. further demonstrated that CNNs could predict mutation status directly from histopathology images with AUCs exceeding 0.70, using only WSIs as input without any genomic data.

Tissue classification and subtyping: Wang et al. applied a CNN to automate nuclei identification in H&E-stained tissue sections from TCGA (99% classification accuracy for tumor cells, 97% for lymphocytes), extracting 246 quantitative image features. Unsupervised clustering of those features revealed three distinct histologic subtypes, independent of known genomic clusters, each with a different prognosis. Shi et al. constructed a deep learning framework on 1,445 HCC patients and developed a "tumor risk score" that outperformed clinical staging systems, stratifying patients into five prognostic groups.

The review concludes that CNNs are the dominant architecture for pathology-based HCC analysis, with consistent advantages over traditional scoring systems. The ability to predict genetic mutations, survival outcomes, and recurrence risk directly from tissue images represents a significant advance in feature selection, as the deep learning models automatically identify the most predictive visual features without manual annotation of specific histologic patterns.

TL;DR: CNNs on whole-slide images achieved C-indices of 0.75-0.78 for survival prediction, outperforming TNM staging. Automated tumor grading reached 96% accuracy (benign vs. malignant) and 89.6% for differentiation stage. A deep learning framework on 1,445 patients produced a "tumor risk score" superior to clinical staging, while unsupervised clustering discovered three novel histologic subtypes.
Pages 15-19
CNN-Based Radiomics: Imaging Analysis Across CT, MRI, and Ultrasound

Deep learning radiomics for HCC prediction: Jin et al. developed a deep learning-based radiomics model (HCC-R) that predicted HCC development over five years using ultrasound images, achieving AUCs of 0.942 in the validation cohort and 0.900 in the testing cohort. This significantly outperformed traditional risk scores including liver stiffness measurement, GAG-HCC, and CU-HCC. Brehar et al. compared CNN performance against conventional machine learning for HCC detection on ultrasound, with the CNN achieving AUC 0.95, accuracy 91.0%, sensitivity 94.4%, and specificity 88.4%.

CT-based classification and treatment prediction: Shi et al. demonstrated that a CNN on three-phase CT imaging could identify HCC with diagnostic accuracy comparable to a four-phase protocol (AUC 0.925), potentially reducing patient radiation exposure. Multiple studies applied CNNs to predict outcomes of transarterial chemoembolization (TACE): Peng et al. achieved 85.1% and 82.8% accuracy in two validation cohorts, Liu et al. found that higher deep learning scores predicted poor prognosis (hazard ratio 3.01), and Zhang et al. demonstrated a C-index of 0.714 for overall survival prediction in patients receiving TACE plus sorafenib.

MRI-based microvascular invasion prediction: Wang et al. combined deep features from multiple diffusion-weighted MRI sequences (b0, b100, b600, and ADC maps) to predict microvascular invasion (MVI), achieving the best AUC of 0.79 with the combined approach. Jiang et al. compared a 3D-CNN model against a radiomics-radiological-clinical model for MVI prediction, with the 3D-CNN achieving AUROC of 0.980 in training and 0.906 in validation. Wu et al. used a CNN on multiphase MRI to distinguish LI-RADS grade 3 from grade 4/5 lesions with AUC 0.95.

Sun et al. combined LASSO regression for feature selection with deep learning and radiomic features to predict TACE treatment response, achieving AUCs of 0.937 (training) and 0.909 (validation) by incorporating 19 quantitative radiomic features, 10 deep learning features, and 3 clinical factors. Lai et al. used a ResNet-18 CNN on FDG PET-CT images for overall survival prediction before liver transplantation, demonstrating that combined PET-CT models (AUC 0.807) outperformed CT-only models (AUC 0.743).

TL;DR: Deep learning radiomics achieved AUC 0.900-0.942 for five-year HCC prediction on ultrasound. CNNs on CT matched four-phase diagnostic accuracy with three-phase protocols (AUC 0.925). For microvascular invasion prediction, 3D-CNN reached AUROC 0.980 on MRI, while LASSO-driven radiomic models combining 19 radiomic features, 10 deep learning features, and clinical factors achieved AUC 0.937.
Pages 20-22
Feature Selection Fundamentals: Filter, Wrapper, and Embedded Methods

Why feature selection matters: In high-dimensional medical datasets, not all variables contribute equally to predicting HCC outcomes. Irrelevant or redundant features introduce noise, increase overfitting risk, and inflate computational costs. The review provides a thorough taxonomy of feature selection techniques, organized into three main categories: filter methods that evaluate features independently using statistical measures, wrapper methods that use model performance to evaluate feature subsets, and embedded methods that incorporate feature selection within the model training process itself.

Filter methods include Pearson correlation coefficient (measuring linear feature-target correlation), chi-square tests (for categorical variables), information gain (decision tree-based relevance), and mutual information (quantifying shared information between features and target). These methods are computationally efficient and independent of the learning algorithm, but they ignore feature interactions and dependencies that may be clinically important.
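Filter scoring is model-free, so it fits in a few lines. A minimal scikit-learn sketch on synthetic data (the dataset and the "top 5" cutoff are illustrative, not from the review):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# 200 synthetic "patients", 20 features, 5 of them genuinely informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Each feature is scored independently of any downstream model
pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
mi = mutual_info_classif(X, y, random_state=0)

top5_pearson = set(np.argsort(pearson)[::-1][:5])
top5_mi = set(np.argsort(mi)[::-1][:5])
print(sorted(top5_pearson), sorted(top5_mi))
```

Because each feature is scored in isolation, both rankings are cheap to compute but blind to interactions — two features that predict HCC only jointly would each score poorly.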

Wrapper methods evaluate feature subsets by training models on different combinations. Key techniques include Recursive Feature Elimination (RFE), which iteratively removes the least important features; forward selection, which adds features one at a time; backward elimination, which removes features one at a time; and exhaustive search, which evaluates all possible combinations. While wrapper methods capture feature interactions, they are computationally expensive for large feature sets and prone to overfitting.
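The RFE loop above can be demonstrated directly with scikit-learn's implementation — again on synthetic data, with logistic regression as an arbitrary choice of wrapped model:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Wrapper loop: fit the model, drop the weakest feature, repeat until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
selected = [j for j, keep in enumerate(rfe.support_) if keep]
print(selected)
```

Note the cost: with `step=1`, RFE refits the model 15 times here. On a radiomics set with thousands of candidate features, that refitting loop is exactly why wrapper methods become expensive.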

Embedded methods integrate feature selection directly into model training. LASSO (L1 regularization) adds a penalty term that shrinks irrelevant feature coefficients to zero, effectively performing automatic feature selection. Tree-based importance scores from Random Forest or Gradient Boosting assign relevance weights during training. Additional techniques covered include PCA for dimensionality reduction, genetic algorithms for population-based optimization, and Relief methods for class-discriminative feature scoring.
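LASSO's shrink-to-zero behavior is easy to see concretely. A short scikit-learn sketch on synthetic regression data (the penalty strength `alpha=1.0` is an illustrative choice; in practice it would be tuned by cross-validation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=30, n_informative=6,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties assume comparable scales

# Embedded selection: the L1 penalty drives irrelevant coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(len(kept), "of", X.shape[1], "features survive the L1 penalty")
```

Selection and fitting happen in one pass — no separate scoring or refitting loop — which is why LASSO recurs so often in the clinical-variable studies summarized earlier in the review.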

TL;DR: The review categorizes feature selection into filter methods (Pearson correlation, chi-square, information gain), wrapper methods (RFE, forward/backward selection), and embedded methods (LASSO regularization, tree-based importance). LASSO is particularly prominent in HCC research for shrinking irrelevant features to zero, while PCA and genetic algorithms provide complementary dimensionality reduction strategies.
Pages 22-23
Comparing Feature Selection Approaches: Strengths, Limitations, and Practical Guidance

Supervised vs. unsupervised approaches: The review presents a detailed comparison table of feature selection methods. Filter methods and PCA operate in an unsupervised manner, meaning they do not require labeled outcome data, making them suitable for exploratory analysis and high-dimensional datasets where labeling is expensive. Wrapper and embedded methods are supervised, directly optimizing for predictive performance but requiring labeled training data and often more computational resources.

Handling specific data challenges: The review references a systematic evaluation by Bolón-Canedo et al. that rates different feature selection methods on five key criteria: correlation/redundancy handling, nonlinearity, input noise, target noise, and performance when features greatly outnumber samples (common in genomic datasets). ReliefF and mRMR scored highest across most categories, while SVM-RFE with nonlinear kernels excelled at handling input noise and high-dimensional scenarios. LASSO-based approaches and wrapper methods showed varying performance depending on the specific data characteristics.

Practical recommendations: The choice of method depends heavily on the data type and modeling goal. For clinical variable datasets with moderate dimensionality, embedded methods like LASSO offer an efficient balance of accuracy and interpretability. For genomic or multi-omics data where features vastly outnumber samples, filter methods or SVM-RFE provide more stable feature sets. For imaging-based radiomics data, deep learning architectures like CNNs perform implicit feature selection through their learned convolutional filters, often eliminating the need for separate feature engineering entirely.

TL;DR: ReliefF and mRMR perform best across multiple data challenges, while SVM-RFE excels with high-dimensional genomic data. LASSO is recommended for clinical datasets balancing accuracy and interpretability. For imaging data, CNNs perform implicit feature selection through learned filters, often eliminating the need for separate feature engineering steps.
Pages 23-24
Key Findings: CNNs Dominate, But Challenges Remain

CNN supremacy across data modalities: The review's central finding is that convolutional neural networks (CNNs) emerged as the top-performing architecture across all three data modalities examined. For clinical data prediction, CNN and RNN-based approaches achieved the highest accuracy and AUC values. For histopathology analysis, CNNs consistently outperformed conventional scoring systems in predicting recurrence and survival. For radiology, CNNs demonstrated excellent diagnostic performance approaching or exceeding clinician-level accuracy across ultrasound, CT, and MRI modalities.

Data quality and availability: Despite the promising results, the review identifies several critical limitations. Machine learning and deep learning models require large, high-quality datasets, but acquiring comprehensive clinical and imaging data for HCC is challenging due to disease rarity and the need for careful curation. The limited availability of annotated HCC datasets hampers the development and fair evaluation of robust models, particularly for rare subtypes or unusual presentations.

Interpretability and the black-box problem: Deep learning models function as "black boxes," making it difficult for clinicians to understand the reasoning behind predictions. This lack of interpretability raises concerns in medical settings where clinicians need confidence in the decision-making process. The review notes this as a fundamental barrier to clinical adoption, since regulatory approval and physician trust both require understanding of which features drive predictions and why.

Generalizability and bias: Models trained on specific populations or datasets may not perform well when applied to different patient demographics or institutions. The heterogeneity of HCC, including variations in tumor characteristics, genetic profiles, and patient demographics across geographic regions, introduces challenges. Biases from underrepresentation of certain demographic groups or confounding factors in training data can perpetuate or amplify healthcare disparities.

TL;DR: CNNs are the top-performing architecture across clinical, pathology, and radiology data for HCC. However, critical barriers to clinical adoption include limited annotated datasets, the "black box" interpretability problem that undermines clinician trust, and generalizability concerns from geographic and demographic biases in training populations.
Pages 24-27
Future Work: Multimodal Integration, Longitudinal Data, and Clinical Translation

Multimodal data integration: The most promising direction identified by the review is combining clinical data with other modalities, including genetic information, imaging data (MRI, CT scans), and blood-based biomarkers like AFP. Deep learning models are particularly well-suited for handling such diverse data sources, and studies that combined multiple data types (such as the DLRC model integrating 19 radiomic features, 10 deep learning features, and 3 clinical factors) consistently outperformed single-modality approaches.

Longitudinal and temporal modeling: Developing models that incorporate longitudinal data (tracking changes in clinical variables over time) could enable prediction of risk trajectory changes and earlier identification of high-risk patients. The success of RNN-based models on time-series clinical data suggests that temporal deep learning architectures have significant untapped potential for HCC surveillance, particularly in monitoring patients with chronic hepatitis B or C who are at sustained risk.

Geographic diversity and validation: The review emphasizes the need to train and validate models on large, geographically diverse datasets to ensure generalizability and avoid overfitting to specific populations. It is also crucial to account for comorbidities such as diabetes, viral hepatitis, and non-alcoholic fatty liver disease (NAFLD) that influence HCC development differently across regions. Most studies reviewed drew data from single institutions, highlighting the need for multi-center validation studies.

The review concludes by calling for development of more advanced deep learning architectures, integration of multi-modal data sources, and careful exploration of ethical implications. Implementation challenges, including computational resource requirements, integration with existing electronic health record systems, and ensuring robustness and reliability of predictions in real-world clinical settings, must be addressed before these models can transition from research tools to clinical decision-support systems.

TL;DR: Future priorities include multimodal data fusion (imaging + genomics + clinical), longitudinal temporal modeling for risk trajectory tracking, and multi-center validation across geographically diverse populations. Implementation challenges around computational resources, EHR integration, and regulatory approval must be resolved before deep learning feature selection tools can enter routine HCC clinical practice.
Citation: Mostafa G, Mahmoud H, Abd El-Hafeez T, ElAraby ME. Open Access, 2024. Available at: PMC11452940. DOI: 10.1186/s12911-024-02682-1. License: CC BY.