Hepatocellular carcinoma (HCC) is the dominant form of primary liver cancer, accounting for 85-90% of cases. It is the fifth most common malignant tumor worldwide and the third leading cause of cancer-related death. Clinical guidelines consistently emphasize that tumor size is a critical prognostic factor alongside liver function and patient performance status. This means that catching HCC early, especially at small sizes, directly improves survival outcomes after treatment.
The imaging landscape: Magnetic resonance imaging (MRI) provides higher sensitivity for HCC detection than computed tomography (CT). Gadoxetic acid-enhanced liver MRI has become widely adopted because it outperforms MRI with conventional extracellular contrast agents for HCC detection. The improved sensitivity comes primarily from hepatobiliary phase images, in which 80-90% of HCCs appear hypointense (darker than the surrounding liver parenchyma). Even so, a recent meta-analysis found a per-lesion sensitivity of 87% (95% CI: 83-92%) for HCC on gadoxetic acid-enhanced MRI, leaving room for computational tools to close the gap.
The gap this study fills: While deep learning had already shown strong results in medical image recognition and classification, the authors note that at the time of publication there were no deep learning-based HCC detection systems using liver MRI in the English literature. This study set out to build a fully automated deep learning model to detect HCC in hepatobiliary phase MR images from patients who underwent surgical resection, and to benchmark its performance against human radiologists.
The study was approved by the institutional review board of Samsung Medical Center (IRB: 2019-03-101-002) and followed the ethical guidelines of the 1975 Declaration of Helsinki. The authors reviewed hepatobiliary phase images from pre-operative gadoxetic acid-enhanced liver MRI of 549 patients (442 male, 107 female) who underwent surgical resection for HCC between 2010 and 2014. Of these, 94 patients were excluded due to severe motion artifacts (n=31), missing images (n=44), low image quality (n=18), or absent preoperative MR images (n=1), leaving 455 patients (mean age 56 years, SD 9.7), all with Child-Pugh class A liver function.
Image categorization: The resulting 92,645 hepatobiliary phase MR images were split into two classes, no HCC (41,485 images) and HCC (51,160 images), according to whether HCC was visible in each image. The dataset was divided 70/15/15 into training, validation, and test sets. Training images were acquired on Philips scanners, including the Achieva 1.5T (388 patients), Achieva 3.0T (117 patients), and Ingenia 3.0T (25 patients).
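As an illustration of the 70/15/15 partition, a minimal index-splitting helper might look like the following. This is a sketch, not the authors' code; whether their split was stratified by patient or by image is not stated here.

```python
import numpy as np

def split_indices(n_images, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle image indices and cut them into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train = int(n_images * fractions[0])
    n_val = int(n_images * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 92,645 images as in the study
train_idx, val_idx, test_idx = split_indices(92_645)
```

Note that an image-level split like this one is only an illustration; in practice, splitting at the patient level avoids slices from the same patient leaking between sets.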
External validation set: An independent external validation cohort was collected from four external hospitals between 2015 and 2017. After 9 patients were excluded for motion artifact, missing images, or low image quality, 54 patients remained (42 male, 12 female, mean age 57 years, SD 9.6), contributing 502 non-HCC images and 448 HCC images (950 total). Critically, this external dataset used MR scanners from multiple vendors, including Philips (Achieva, Ingenia, Ingenia CX), Siemens (Avanto, Skyra, SMS Avanto, Verio), and GE (Discovery MR750w, Signa Excite, Signa HDxt) at both 1.5T and 3.0T field strengths. This multi-vendor, multi-center design tests real-world generalizability.
MR images in this study came in a range of matrix sizes, from 256x256 up to approximately 400x400 pixels. To standardize input, all images were rescaled to 320x320 pixels using bicubic and area interpolation. This preprocessing step is essential because CNN training requires uniform input dimensions.
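As a sketch of what such rescaling involves, here is a minimal bilinear resize in NumPy. It is a simplified stand-in for the bicubic and area interpolation the study used; in practice one would call a library routine such as OpenCV's cv2.resize.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D image to (out_h, out_w) by bilinear interpolation."""
    in_h, in_w = img.shape
    # Map each output pixel to a (fractional) coordinate in the input image.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)   # clamp at the image border
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]             # vertical interpolation weights
    wx = (xs - x0)[None, :]             # horizontal interpolation weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

resized = resize_bilinear(np.ones((400, 400)), 320, 320)  # e.g. 400x400 -> 320x320
```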
The class imbalance problem: Among the roughly 100 MR images per patient, only 3-10 images typically contained an HCC nodule. Left unaddressed, this severe imbalance would bias the model toward the majority class (no HCC). To correct it, the authors augmented HCC-positive images using four techniques: rotation (within 90 degrees), shift (1-10 pixels in all directions), zoom (0.8x to 1.2x), and a combination of shift and zoom. The augmentation was deliberately conservative to avoid distorting clinical features. After augmentation, the HCC class grew to 44,765 images.
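Of the four augmentation techniques, the pixel shift is the simplest to sketch. The helper below is an illustration under the stated 1-10 pixel range, not the authors' implementation; vacated pixels are filled with zeros.

```python
import numpy as np

def shift_image(img, dy, dx, fill=0.0):
    """Translate a 2-D image by (dy, dx) pixels, filling vacated pixels."""
    h, w = img.shape
    out = np.full_like(img, fill)
    # Source and destination windows that stay inside the image bounds.
    src_y = slice(max(-dy, 0), min(h - dy, h))
    src_x = slice(max(-dx, 0), min(w - dx, w))
    dst_y = slice(max(dy, 0), min(h + dy, h))
    dst_x = slice(max(dx, 0), min(w + dx, w))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out
```

A usage example: `shift_image(slice_320, 5, -3)` moves the slice 5 pixels down and 3 pixels left, within the paper's 1-10 pixel augmentation range.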
Mask-based extraction: Rather than augmenting entire images randomly, the HCC area in each image was first extracted using a mask generated from human-annotated label maps that distinguished the HCC region. This ensured that augmentation was applied to images in a clinically meaningful way, preserving the spatial relationship between the tumor and surrounding liver parenchyma.
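A minimal sketch of locating the annotated region from such a label map; the helper names and the margin parameter are illustrative assumptions, not the authors' code.

```python
import numpy as np

def tumor_bbox(label_map):
    """Bounding box (y0, y1, x0, x1) of the annotated HCC region in a binary mask."""
    ys, xs = np.nonzero(label_map)
    return int(ys.min()), int(ys.max()) + 1, int(xs.min()), int(xs.max()) + 1

def extract_region(image, label_map, margin=0):
    """Crop the image around the masked tumor, optionally with surrounding context."""
    y0, y1, x0, x1 = tumor_bbox(label_map)
    h, w = image.shape
    return image[max(y0 - margin, 0):min(y1 + margin, h),
                 max(x0 - margin, 0):min(x1 + margin, w)]
```

Keeping a margin of surrounding parenchyma in the crop is one way to preserve the tumor-to-liver spatial relationship the authors describe.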
Rather than using an off-the-shelf architecture, the authors designed their own CNN by systematically testing combinations of hyperparameters. They randomly selected 11,117 images (4,902 no-HCC and 6,215 HCC) from the training dataset for architecture optimization, splitting these into a learning set of 9,449 images and a validation set of 1,668 images. Training was terminated when accuracy showed no improvement within 10-20 epochs.
Regularization: The authors compared batch normalization (BN) alone against various combinations of BN with dropout at rates from 0.1 to 0.5, as well as dropout alone. BN without dropout achieved the best accuracy at 93.7%. Adding dropout consistently degraded performance, with BN + dropout(0.1) dropping to 74.6% and dropout(0.5) alone falling to 72.1%. This result is notable because many architectures rely on dropout for regularization.
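For reference, the textbook batch normalization forward pass that this comparison is built on; gamma and beta are the learnable scale and shift (the generic formulation, not the authors' code).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time BN: normalize each feature over the batch dimension.

    x: array of shape (batch, features).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # restore representational capacity
```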
Activation functions: Four activation functions were compared: LeakyReLU (84.2%), PReLU (90.4%), ELU (91.6%), and ReLU (93.7%). ReLU was the clear winner. For optimization, Adam achieved 93.7% accuracy with a validation loss of 0.18, outperforming RMSprop (91.4%), AdaGrad (55.7%), and SGD (58.7%). The kernel size was set to 2x2 after testing sizes from 2x2 to 7x7, with stride fixed at 1 to minimize information loss. The final architecture used a global average pooling layer instead of a fully connected layer to preserve spatial location information and enable class activation map (CAM) visualization.
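The winning optimizer, Adam, maintains exponential moving averages of the gradient and its square, with bias correction. A single textbook update step looks like this (standard defaults shown; this is the generic algorithm, not the authors' code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-parameter scaling by the second-moment estimate is what typically lets Adam converge where plain SGD (58.7% here) stalls without careful learning-rate tuning.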
Training configuration: The batch size was 128, parameters were initialized using the He initializer, and the learning rate was set to 0.001. Cross entropy served as the loss function, and softmax was used for final class prediction. Training was performed on a workstation with a 3.7 GHz 12-core Intel Core i7, 64 GB RAM, and two GeForce GTX 1080Ti GPUs, using TensorFlow V1.8.0.
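The softmax and cross-entropy head described above can be sketched in a few lines; this is a numerically stable textbook formulation over the two classes (0 = no HCC, 1 = HCC), not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max avoids overflow in exp."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class indices."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
```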
On the internal test dataset, the optimized CNN achieved 94% sensitivity, 99% specificity, and an area under the curve (AUC) of 0.97. These numbers dropped on the external validation dataset, as expected when generalizing to unseen scanners and institutions, but remained strong: 87% sensitivity, 93% specificity, and an AUC of 0.90. The overall detection accuracy on the external set was 90%.
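For reference, the reported sensitivity and specificity reduce to simple ratios over the confusion matrix (AUC additionally requires continuous prediction scores, so it is omitted from this sketch):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """y_true, y_pred: arrays of 0/1 labels (1 = HCC present)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }
```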
Comparison with radiologists: Two radiologists participated in the validation study: an expert with 10 years of abdominal imaging experience, and a less experienced trainee with 4 years in radiology. Both were uninvolved in the model's development and were blinded to the MR reports and histopathologic results. The less experienced radiologist achieved 86% sensitivity, 92% specificity, and 91% accuracy. The expert radiologist achieved 98% sensitivity, 93% specificity, and 94% accuracy. The model's overall performance was not significantly different from that of the less experienced radiologist.
Speed advantage: The model classified a single image in 0.03 seconds (30 milliseconds) and processed the roughly 100 images for one patient in 3.4 seconds, running on a commercial PC with a 3.8 GHz Intel Core i5, 16 GB RAM, and a Radeon Pro 580 graphics card; inference did not require a dedicated deep learning GPU. In comparison, both radiologists took 0.18 seconds per image and 18 seconds per 100 images, making the model approximately six times faster than human readers.
Small HCC detection: The model showed a particular strength in detecting very small HCCs that the less experienced radiologist missed. The mean size of HCCs detected by the model but missed by the less experienced radiologist was 1.0 +/- 0.2 cm. The expert radiologist detected these same lesions, but took longer to do so. This suggests the model could serve as a valuable second reader for sub-centimeter HCCs.
The authors benchmarked their custom CNN against four established deep learning architectures: ResNet50, AlexNet, VGG-16, and Inception-ResNetV2. Their custom model achieved 94% validation accuracy with only 1,077,186 trainable parameters. By comparison, ResNet50 reached 93% with 23.5 million parameters, Inception-ResNetV2 reached 92% with 54.3 million parameters, and both AlexNet (57%, 43.7 million parameters) and VGG-16 (56%, 241.2 million parameters) performed poorly.
Parameter efficiency: The custom CNN used roughly 22x fewer parameters than ResNet50 and 224x fewer than VGG-16 while achieving better accuracy. This is a significant practical advantage: smaller models train faster, require less memory, and are easier to deploy on commodity hardware in clinical settings. The fact that the model ran on a standard desktop PC without a dedicated GPU underscores this point.
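The counting rule behind these parameter totals is simple: each convolutional layer contributes kernel x kernel x input-channels (plus one bias) weights per output channel. A small helper illustrates why 2x2 kernels and a GAP head keep the total low; the example layer shapes are hypothetical, since the custom model's exact configuration is not reproduced here.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Trainable parameters in one 2-D convolutional layer."""
    return c_out * (kernel * kernel * c_in + (1 if bias else 0))

# Hypothetical 2x2 conv from 64 to 128 channels, as in this paper's kernel choice:
small = conv2d_params(2, 64, 128)   # 128 * (2*2*64 + 1) = 32,896
# The same layer with a 7x7 kernel would cost 128 * (7*7*64 + 1) = 401,536.
```

A GAP head adds only one weight vector per class over the final channels, whereas the large fully connected layers of VGG-16 account for most of its 241.2 million parameters.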
Class activation maps (CAM): To provide interpretability, the authors applied class activation maps to visualize where the model was focusing when it predicted HCC. By using global average pooling instead of fully connected layers, the architecture preserved spatial information that enabled heat map generation. These CAMs showed that the model's attention aligned with actual tumor locations, providing physicians with intuitive visual confirmation. A physician could use the color-mapped overlay to quickly distinguish true HCC candidates from false positives.
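The CAM construction follows directly from the GAP design: the heat map for a class is a weighted sum of the final convolutional feature maps, using that class's weights from the output layer. A minimal sketch (array shapes are illustrative):

```python
import numpy as np

def class_activation_map(features, class_weights):
    """features: (H, W, C) last-conv activations; class_weights: (C,) output
    weights for the target class (e.g. HCC)."""
    cam = np.tensordot(features, class_weights, axes=([2], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)        # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for overlay
    return cam
```

The resulting map is upsampled to the input resolution and rendered as a color overlay on the MR slice, which is how a physician can confirm or dismiss a flagged candidate at a glance.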
A deliberate design choice in this model was to use whole MR images as input without any cropping or region-of-interest selection. Most deep learning studies in radiologic imaging require preprocessing where human readers first select images containing lesions and then crop regions of interest around the liver mass. This approach adds a manual step that is time-consuming and introduces operator dependence. The authors chose to skip this entirely: the user simply uploads all hepatobiliary phase MR images, and the model scans them automatically.
The false positive trade-off: This whole-image approach came with a cost. The model produced false positive detections in extrahepatic structures, notably the gallbladder, intrahepatic blood vessels, and the heart. Surprisingly, hepatic cysts, which also appear dark (hypointense) on hepatobiliary phase images, were not a frequent source of false positives. The authors argue that extrahepatic false positives are clinically manageable because structures like the heart can be immediately identified and dismissed by any human reader reviewing the CAM overlay.
Workflow advantage: Despite the false positives, the no-cropping approach has a significant practical benefit. In a clinical workflow, the entire process requires only a single click to upload images. The model then checks all images for candidate HCC nodules in 3.4 seconds per patient. Combined with the CAM visualization, this creates a rapid screening pipeline where the physician's role shifts from primary detection to confirming or dismissing flagged candidates. This is faster than prior deep learning studies, one of which required 10 seconds for 100 CT images.
Single MRI phase: The model used only the hepatobiliary phase of gadoxetic acid-enhanced MRI. Arterial enhancement is one of the key diagnostic imaging findings for HCC, but arterial phase images were excluded due to frequent transient severe motion artifacts. Evidence from CT-based studies shows that multi-phase deep learning models yield higher accuracy than single-phase models. A future iteration incorporating arterial, portal venous, and delayed phases could improve diagnostic performance considerably.
Patient selection bias: All training patients had undergone surgical resection, meaning they had relatively preserved liver function (all Child-Pugh A). Additionally, all tumors in the study showed typical low signal intensity on hepatobiliary phase MRI. This means the model may not generalize well to patients with advanced cirrhosis (Child-Pugh B or C) or to atypical HCCs that do not show classic hypointensity. The real-world HCC population includes a broader spectrum of liver function and tumor phenotypes.
Single-vendor training data: The training dataset was acquired exclusively on Philips MR scanners, which may have introduced vendor-specific imaging characteristics into the learned features. While the external validation set used scanners from Philips, Siemens, and GE (demonstrating some cross-vendor robustness), the slightly lower external validation accuracy (90% vs. 94% internal) could partly reflect this vendor bias. Training on multi-vendor data from the outset would likely improve generalizability.
Future directions: The authors suggest expanding to multi-phase MRI input, including arterial phase images despite motion artifact challenges. Increasing the training dataset to include patients with varying degrees of liver function and atypical HCC presentations would broaden clinical applicability. Multi-vendor, multi-institutional training data would further reduce overfitting to scanner-specific features. The authors also note that additional experiments with more cases are needed for clearer conclusions about the model's comparative advantage over established architectures like ResNet50 and Inception-ResNetV2.