A deep learning framework for hepatocellular carcinoma diagnosis using MS1 data

Plain-English Explanations
Pages 1-2
Why Mass Spectrometry Needs a Better Diagnostic Pipeline for Liver Cancer

Hepatocellular carcinoma (HCC) ranks among the top three cancers worldwide in mortality, and even with effective early-stage treatment, the overall 5-year survival rate remains only 50-70%. The current gold standard for diagnosis relies on histopathological observation using haematoxylin and eosin (H&E) or immunohistochemical staining, a process that is both time-consuming and dependent on subjective interpretation by pathology specialists. Mass spectrometry (MS)-based proteomics offers a faster alternative, capable of quantifying thousands of proteins in complex biological samples within several hours.

The peptide identification bottleneck: Traditional proteomics workflows first identify peptides from secondary mass spectrometry (MS2) data using software like MaxQuant or MSGFPlus, then use machine learning to find tumor biomarkers among the identified proteins. For example, prior work identified 5 urine biomarkers for lung cancer using random forest on 231 patients, and deep neural networks pinpointed 19 thyroid cancer biomarkers from 288 tissue samples. However, these approaches introduce errors during peptide and protein identification steps, which can propagate into the diagnostic predictions.

The raw data alternative: To bypass identification errors, some researchers have turned to analyzing raw MS data directly. Giordano et al. used multivariate statistical analysis on ion peak features from 222 full-scan MS samples, feeding them into a random forest classifier to distinguish tumor from non-tumor tissue. Wang et al. built the MSpectraAI platform using public MS data across six tumor types. Zhang et al. converted DIA MS2 data into 3D tensors and fed them into a ResNet-18 model for HCC classification. However, all of these approaches rely on converting raw data into 2D or 3D heatmaps, and batch-to-batch differences in retention time (RT) and intensity make cross-dataset generalization difficult.

This study introduces MS1Former, a deep learning model that bypasses both the peptide identification step and the heatmap conversion step. Instead, it works directly on raw MS1 spectra, treating the mass-to-charge ratio (m/z) sequence as input similar to how a text sequence is processed in natural language tasks. The model combines a 1D convolutional neural network (1D-CNN) with a transformer encoder to classify HCC tumor versus adjacent non-tumor tissue.

TL;DR: HCC has a 5-year survival of only 50-70%. Traditional proteomics workflows introduce errors during peptide/protein identification. MS1Former is a new end-to-end deep learning model that classifies HCC directly from raw MS1 spectra, bypassing peptide identification entirely by combining a 1D-CNN with a transformer encoder.
Pages 2-3
Data Collection, Preprocessing, and the MS1Former Pipeline

Datasets: The study used five HCC datasets. The training set, PXD006512, contained 1,488 raw files from 220 HCC samples (939 malignant, 549 normal) acquired on an Orbitrap Fusion mass spectrometer using data-dependent acquisition (DDA). Four external test sets were used for validation: WL-2023 (30 files, Q Exactive HF-X, DDA), WL-Fast (32 files, Q Exactive HF-X, DDA with a fast 39-minute gradient), PXD002171 (38 files, LTQ Orbitrap Elite, DDA), and PXD021979 (43 files, Orbitrap Fusion Lumos Tribrid, data-independent acquisition or DIA). The WL-2023 and WL-Fast samples were collected from Shulan (Hangzhou) Hospital with ethics approval (KY2023033).

Data preprocessing: Raw data files (.raw/.wiff) were converted to mzXML or mzML format using ThermoRawFileParser. The MS1 spectra (retention time, m/z, intensity) were then extracted. The authors observed that the raw data in the RT dimension showed inconsistent patterns across different instruments, but the m/z dimension exhibited clear differences between tumor and normal tissue, with tumor samples showing higher and more dispersed peak intensities. An adapted noise removal step trimmed interference peaks at the beginning and end of the MS acquisition (using a cutoff value of 3 x 10^7), followed by binning the m/z dimension into equal-width windows and normalizing intensities by dividing by the maximum value across all windows.
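The trimming and normalization steps can be sketched in a few lines of numpy. This is a minimal illustration of the described procedure, not the authors' code; interpreting the 3 x 10^7 cutoff as a per-scan total-intensity threshold is an assumption, and all function names are illustrative.

```python
import numpy as np

def trim_edge_scans(scans, cutoff=3e7):
    """Drop low-signal scans at the start and end of the acquisition.
    Each scan is an (mz_array, intensity_array) pair; a scan is treated
    as interference when its total ion intensity falls below the cutoff.
    (Applying the paper's 3e7 cutoff this way is an assumption.)"""
    totals = [float(inten.sum()) for _, inten in scans]
    start, end = 0, len(scans)
    while start < end and totals[start] < cutoff:
        start += 1
    while end > start and totals[end - 1] < cutoff:
        end -= 1
    return scans[start:end]

def normalize_windows(binned):
    """Divide binned intensities by the maximum value across all windows."""
    return binned / binned.max()
```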

Dimensionality reduction: Rather than working with 2D heatmaps (RT x m/z), the authors accumulated intensity values along the RT dimension to collapse the data into 1D sequences of (m/z, intensity) pairs. This step is critical because it eliminates the problematic RT dimension where batch-to-batch variability is greatest, while preserving the m/z patterns where tumor-versus-normal differences are most pronounced. The binning parameters used mmin = 260, mmax = 1,800, and a bin size (gamma) of 0.1, which the authors found optimal through systematic evaluation.
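Under the stated parameters, the collapse amounts to summing every scan's intensities into fixed m/z bins, producing one vector of (1,800 − 260)/0.1 = 15,400 values per sample. A minimal numpy sketch, with illustrative names:

```python
import numpy as np

M_MIN, M_MAX, GAMMA = 260.0, 1800.0, 0.1      # binning parameters from the paper
N_BINS = int(round((M_MAX - M_MIN) / GAMMA))  # 15,400 bins

def collapse_to_1d(scans):
    """Accumulate intensities along the RT dimension into fixed m/z bins,
    yielding one 1D sequence per sample. `scans` is a list of
    (mz_array, intensity_array) pairs, one per MS1 scan."""
    profile = np.zeros(N_BINS)
    for mz, inten in scans:
        mask = (mz >= M_MIN) & (mz < M_MAX)
        idx = ((mz[mask] - M_MIN) / GAMMA).astype(int)
        np.add.at(profile, idx, inten[mask])  # sum over all scans (all RTs)
    return profile / profile.max()            # normalize by the global maximum
```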

Model architecture: MS1Former consists of three components. First, a 1D-CNN layer captures local feature embeddings for each m/z position (analogous to word embeddings in NLP). Second, a transformer encoder block with relative positional encoding and multi-head attention (8 heads, value size 24, key/query size 48) captures long-range dependencies across the full m/z sequence. Third, a feedforward neural network performs the binary classification. Training used Adam optimizer with a learning rate of 0.001, batch size of 8, 200 epochs, 50% dropout, L2 regularization, early stopping, and cross-entropy loss. The framework was implemented in PyTorch with Python 3.8.
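A hedged PyTorch sketch of this three-component design. Only the figures stated above (8 attention heads, 50% dropout, binary output, a model width matching the stated key/query size of 48) come from the paper; the CNN kernel and stride, the single encoder layer, and the mean-pooling head are illustrative choices, and PyTorch's built-in encoder layer omits the paper's relative positional encoding.

```python
import torch
import torch.nn as nn

class MS1FormerSketch(nn.Module):
    """Illustrative sketch: a 1D-CNN for local m/z embeddings, a
    transformer encoder for long-range dependencies, and a feedforward
    head for binary (tumor vs. normal) classification."""
    def __init__(self, d_model=48, n_heads=8, dropout=0.5):
        super().__init__()
        # 1D-CNN: one intensity channel in, d_model feature channels out;
        # a large stride shortens the sequence the encoder must attend over
        self.cnn = nn.Conv1d(1, d_model, kernel_size=11, stride=10, padding=5)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                  nn.Dropout(dropout), nn.Linear(64, 2))

    def forward(self, x):                       # x: (batch, n_bins)
        h = self.cnn(x.unsqueeze(1))            # (batch, d_model, seq)
        h = self.encoder(h.transpose(1, 2))     # (batch, seq, d_model)
        return self.head(h.mean(dim=1))         # pool over positions -> 2 logits
```

Training as described would pair this with `torch.optim.Adam(model.parameters(), lr=1e-3)` and `nn.CrossEntropyLoss()`.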

TL;DR: Trained on 1,488 raw files (PXD006512) and tested on 4 external datasets across different instruments and acquisition modes. The pipeline converts raw MS1 data to 1D m/z sequences (bin size 0.1, range 260-1,800), then feeds them into a 1D-CNN plus transformer encoder with 8 attention heads, 50% dropout, and Adam optimizer at 0.001 learning rate.
Pages 3-4
Five-Fold Cross-Validation and External Testing Results

On the training dataset PXD006512, the authors performed five-fold cross-validation (80% training, 20% validation per fold). MS1Former achieved a mean accuracy of 0.934, mean precision of 0.926, mean recall of 0.930, and mean F1 score of 0.929. Individual fold results ranged from 0.906 to 0.963 in accuracy, demonstrating reasonable consistency. UMAP visualization of the latent m/z space showed clear separation between the malignant and normal groups.
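The evaluation protocol (stratified five folds, 80/20 splits, per-fold metrics averaged) is straightforward to reproduce with scikit-learn. The snippet below uses synthetic data and a logistic-regression stand-in for MS1Former, so everything here is illustrative rather than the authors' setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # stand-in for 15,400-bin m/z profiles
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

accs, f1s = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # 80/20 per fold
for tr, va in cv.split(X, y):
    clf = LogisticRegression().fit(X[tr], y[tr])
    pred = clf.predict(X[va])
    accs.append(accuracy_score(y[va], pred))
    f1s.append(f1_score(y[va], pred))
mean_acc, mean_f1 = float(np.mean(accs)), float(np.mean(f1s))
```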

External validation: The model was tested on four independent datasets collected on different instruments and using different acquisition protocols. On WL-2023 (Q Exactive HF-X, DDA), the model achieved 0.90 accuracy, 0.90 precision, 0.90 recall, 0.90 F1, and 0.92 AUC. On WL-Fast (Q Exactive HF-X, fast 39-minute gradient full scan), performance was lower at 0.84 accuracy, 0.89 precision, 0.82 recall, 0.86 F1, and 0.93 AUC, likely because the shortened gradient captured less peptide information. On PXD002171 (LTQ Orbitrap Elite, DDA), results were 0.90 accuracy and 0.94 AUC. The best external performance came from PXD021979 (Orbitrap Fusion Lumos Tribrid, DIA) with 0.95 accuracy and 0.97 AUC.

Bin size optimization: The authors systematically tested ten bin sizes from 0.1 to 1.0 on PXD006512. Performance improved steadily as bin size decreased, with 0.1 yielding the highest accuracy, F1-score, and AUC. Smaller bins preserve finer m/z resolution, allowing the model to learn more specific peptide feature distributions. The authors noted that even smaller bin sizes (below 0.1) could theoretically improve performance further but were limited by training machine memory and computation time.

Across all external datasets, the AUC consistently exceeded 0.90, demonstrating that the model generalizes across different Orbitrap-series instruments, different LC conditions (gradient times of 39 to 120 minutes, flow rates of 300 to 400 nl/min), and both DDA and DIA acquisition modes. This cross-instrument, cross-protocol generalization is one of the study's strongest claims.

TL;DR: Five-fold CV on PXD006512: mean accuracy 0.934, F1 0.929. External test AUCs ranged from 0.92 (WL-2023) to 0.97 (PXD021979). Best external accuracy was 0.95 on DIA data. Performance improved with smaller bin sizes, with 0.1 being optimal. The model generalized across four different Orbitrap instruments and both DDA and DIA modes.
Pages 4-5
MS1Former Outperforms MSpectraAI, MaxQuant+RF, and ResNet-18

The authors benchmarked MS1Former against three competing methods, all trained on the same PXD006512 dataset and evaluated on the same four external test sets. MaxQuant+RF is a traditional pipeline that first identifies and quantifies proteins via MaxQuant, then trains a random forest classifier on the identified proteins. MSpectraAI also uses MS1 data but classifies each scan independently, without integrating information across multiple scans from the same sample. ResNet-18 transforms MS2 data into 3D tensors (cycle, m/z, window) and applies a convolutional image classification architecture originally designed for photographs.

Key results: MS1Former outperformed all three competitors on accuracy and AUC across the four test datasets. ResNet-18 showed the worst generalization performance on external data, likely because its reliance on 2D/3D heatmap representations makes it sensitive to batch-specific RT and intensity patterns. MaxQuant+RF could not be applied to the WL-Fast dataset at all because full-scan data lacks MS2 spectra, which MaxQuant requires for protein identification. This highlights a fundamental limitation of identification-dependent pipelines: they cannot process data acquired without MS2.

Why MS1Former wins: The model's advantage comes from three design decisions. First, it works directly on MS1 data without requiring peptide or protein identification, eliminating a major source of error. Second, it collapses the RT dimension into a 1D m/z sequence, removing the batch-dependent variability that hinders 2D/3D approaches. Third, the transformer encoder captures long-range dependencies across the entire m/z sequence, whereas MSpectraAI treats each scan independently and misses cross-scan relationships within the same sample. The result is a model that integrates all scan information from a sample into a single end-to-end diagnostic prediction.

TL;DR: MS1Former beat MSpectraAI, MaxQuant+RF, and ResNet-18 on all four external test sets. ResNet-18 had the worst generalization. MaxQuant+RF could not even process the WL-Fast full-scan data due to its dependence on MS2. MS1Former's edge comes from eliminating peptide identification, collapsing the RT dimension, and using transformer attention to integrate all scans per sample.
Pages 5-6
LIME Analysis Reveals Which m/z Features Drive HCC Predictions

To understand what the model actually learns, the authors applied LIME (Local Interpretable Model-Agnostic Explanations), an ablation-based method that identifies which binned m/z regions contribute most to each prediction. Using the WL-2023 dataset, LIME ranked the top ten m/z binning indices by their contribution to correct HCC predictions, assigning importance values and indicating whether each feature was up-regulated (more intense in tumor) or down-regulated (more intense in normal tissue).
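LIME itself fits a local surrogate model over many perturbed copies of an input; as a rough stand-in for that procedure, the sketch below scores each binned m/z window by how much masking it changes the predicted tumor probability. This is a simplification of LIME, with a toy model and illustrative names:

```python
import numpy as np

def perturbation_importance(predict_proba, x, top_k=10):
    """Score each binned m/z window by the change in the model's tumor
    probability when the window is zeroed out; a positive score means
    the window pushed the prediction toward 'tumor' (up-regulated)."""
    base = predict_proba(x)
    scores = np.zeros_like(x)
    for i in range(len(x)):
        masked = x.copy()
        masked[i] = 0.0
        scores[i] = base - predict_proba(masked)
    order = np.argsort(-np.abs(scores))[:top_k]
    return order, scores[order]

# Toy "model": tumor probability driven mostly by bin 2, suppressed by bin 5
toy = lambda v: 1 / (1 + np.exp(-(3.0 * v[2] - v[5])))
idx, sc = perturbation_importance(toy, np.array([0.1, 0.0, 0.9, 0.2, 0.0, 0.4]),
                                  top_k=2)
```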

Mapping back to peptides: For each important binned m/z index, the authors restored the original m/z information in the RT dimension to enable peptide analysis. For example, the raw m/z values corresponding to binned index 3333 were grouped by similar RT and m/z values, with each group potentially representing a distinct peptide. This approach mirrors the logic of DirectMS1, a method for peptide-level identification from MS1 data alone. However, the authors acknowledged that this mapping remains imprecise because the binning inherently aggregates multiple peptide signals into a single feature.
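Recovering the m/z window behind a binned index is simple arithmetic given the binning parameters (mmin = 260, gamma = 0.1). Worked here for the index 3333 mentioned above; the helper name is illustrative:

```python
M_MIN, GAMMA = 260.0, 0.1  # binning parameters from the paper

def bin_to_mz_window(index):
    """Return the (low, high) m/z boundaries of a binned index."""
    low = M_MIN + index * GAMMA
    return low, low + GAMMA

lo, hi = bin_to_mz_window(3333)  # index 3333 -> m/z window [593.3, 593.4)
```

By the same arithmetic, the binned range 3400-4100 highlighted in this section corresponds to roughly m/z 600-670.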

Beeswarm visualization: Beeswarm plots of important features for the WL-2023 dataset revealed that the number of significant features for malignant tissues was larger than for normal tissues. In the binned m/z range of 3400-4100, there were particularly pronounced differences between tumor and normal tissue, and these differences were confirmed by examining the corresponding raw mass spectra. The model effectively learned to distinguish the more unstable, dispersed proteome peak distributions characteristic of tumor tissue from the more consistent, centralized patterns of normal tissue.

The interpretability analysis also extended to different stages of tumor differentiation. Feature heatmaps corresponding to different HCC differentiation grades showed that the model captured differences visible in pathological tissue images. The characteristic differences in the heatmaps aligned with distinguishable m/z and intensity features on the raw mass spectral peaks, suggesting the model learns biologically meaningful representations rather than just statistical correlations.

TL;DR: LIME analysis identified the top 10 m/z features driving HCC predictions. Malignant tissue showed more significant features than normal tissue, with the 3400-4100 m/z range showing the largest differences. Feature heatmaps aligned with pathological differentiation grades, indicating the model captures biologically meaningful patterns rather than artifacts.
Pages 6-8
How Overfitting Was Managed and Why MS1 Beats MS2 for This Task

Overfitting mitigation: Small-sample classification is a well-known challenge in deep learning, and the authors took several deliberate steps to address it. The training set (PXD006512) was intentionally large for a proteomics study, containing 1,488 raw files from 220 HCC patients. During training, they combined dropout at 50%, L2 regularization, and early stopping to prevent the model from memorizing training data. The five-fold cross-validation results, with accuracy ranging from 0.906 to 0.963 across folds, suggest these measures were effective in controlling variance.

Why MS1 over MS2: The traditional proteomics analysis workflow relies on MS2 data (secondary mass spectrometry) for peptide identification. However, MS1Former deliberately bypasses MS2 entirely. The authors argue that MS1 data contains sufficient discriminative information in the m/z dimension to separate tumor from normal tissue, and that the errors introduced during MS2-based peptide identification actually degrade diagnostic performance. Their results support this claim: MS1Former outperformed both MaxQuant+RF (which requires MS2 for protein identification) and ResNet-18 (which uses MS2-derived 3D tensors). Furthermore, MS1Former could process full-scan data (WL-Fast) where no MS2 information existed at all.

The transformer advantage: The choice of a transformer encoder over purely convolutional architectures reflects the nature of the data. In an m/z sequence, relevant features may be separated by long stretches of uninformative bins. The multi-head attention mechanism (8 heads in this implementation) allows every position in the sequence to attend to every other position, capturing dependencies between distant m/z regions that a CNN's local receptive field would miss. The relative positional encoding further enables the model to learn how the distance between two m/z positions affects their mutual influence, providing a parameterized baseline for pairwise interactions.
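A numpy sketch of single-head attention with an additive relative-position bias illustrates how the distance between two m/z positions can parameterize their interaction; the paper's exact formulation may differ, and all names here are illustrative.

```python
import numpy as np

def attention_with_relative_bias(Q, K, V, bias_table):
    """Scaled dot-product attention where a learned scalar bias, indexed
    by the signed offset i - j between positions, is added to each
    attention logit before the softmax. bias_table has length 2*seq - 1."""
    seq, d_k = Q.shape
    logits = Q @ K.T / np.sqrt(d_k)
    # bias_table[d + seq - 1] holds the bias for relative offset d
    offsets = np.arange(seq)[:, None] - np.arange(seq)[None, :]
    logits = logits + bias_table[offsets + seq - 1]
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
seq, d = 6, 4
out = attention_with_relative_bias(rng.normal(size=(seq, d)),
                                   rng.normal(size=(seq, d)),
                                   rng.normal(size=(seq, d)),
                                   np.zeros(2 * seq - 1))
```

With a zero bias table this reduces to ordinary scaled dot-product attention; a learned table lets nearby and distant m/z positions influence each other differently.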

Handling different acquisition modes: One of the most notable aspects of MS1Former is its ability to handle DDA, DIA, and full-scan data within the same framework. Because the model operates only on MS1-level information (which is acquired regardless of the MS2 acquisition strategy), it is inherently agnostic to whether the instrument used DDA or DIA for fragmentation. This versatility was demonstrated by achieving 0.97 AUC on the DIA dataset PXD021979 and 0.93 AUC on the full-scan WL-Fast dataset, both of which were acquired under fundamentally different protocols than the DDA training data.

TL;DR: Overfitting was controlled using 1,488 training files, 50% dropout, L2 regularization, and early stopping. MS1 data proved sufficient for classification without MS2, and the transformer's multi-head attention (8 heads) captures long-range m/z dependencies that CNNs miss. The model handles DDA, DIA, and full-scan data within a single framework, achieving 0.93-0.97 AUC across acquisition modes.
Pages 8-9
Instrument Dependence, Biological Interpretation Gaps, and Binning Constraints

Instrument generalizability: All training and primary testing data came from Orbitrap-series mass spectrometers. When the authors tested on a non-Orbitrap dataset (PXD004837, collected on a TripleTOF 5600), the AUC dropped to 0.8442, notably lower than the 0.92-0.97 range seen on Orbitrap data. This indicates that the model has learned instrument-specific spectral patterns to some degree, and its performance on other mass spectrometer platforms remains uncertain. Expanding training data to include samples from TripleTOF, QTOF, and other instrument families would be essential before clinical deployment.

Biological interpretation: While LIME identified the most important m/z bins for classification, the biological relevance of these features remains inadequately explored. Because the model works on MS1 data alone (without MS2), there is no direct way to perform peptide sequencing or identify the specific proteins and peptides underlying the significant m/z bins. The authors reference the MonoMS1 method, which trains peptide recognition models using MS1 features with MS2-identified peptides as labels, as a future direction for bridging this gap. Without this integration, the model functions as a "black box" classifier with post-hoc interpretability rather than a tool for biomarker discovery.

Binning resolution limits: The optimal bin size of 0.1 was determined empirically, but the authors acknowledged that even smaller bins could theoretically improve performance by capturing finer peptide-level distinctions. Memory and computation time on available hardware prevented experiments below 0.1. Additionally, the binning process inherently aggregates multiple peptide signals into single features, which limits the precision of downstream interpretability analysis. The mapping from binned indices back to individual peptides remains approximate.

Dataset scope: All datasets in this study came from HCC patients comparing tumor versus adjacent non-tumor tissue. The model has not been tested for multi-class classification (e.g., distinguishing HCC subtypes or differentiation grades as separate classes) or for other cancer types. The authors expressed intent to extend the framework to other tumors, but no cross-cancer evidence was presented. The sample sizes for external validation were also relatively small (30 to 43 files per dataset), which limits the statistical power of the generalizability claims.

TL;DR: AUC dropped to 0.8442 on a non-Orbitrap instrument (TripleTOF 5600), revealing instrument dependence. Biological interpretation is limited because MS1 alone cannot identify specific peptides or proteins. Bin sizes below 0.1 were not tested due to hardware constraints. External test sets were small (30-43 files each), and only binary tumor-vs-normal classification was evaluated.
Pages 9-10
Multi-Cancer Extension, MS2 Integration, and Clinical Translation

Expanding to other tumor types: The MS1Former framework was designed to be generalizable beyond HCC. The authors envision applying it to the classification of other solid tumors where tissue-based proteomics data is available. Since the model operates on raw MS1 spectra without disease-specific preprocessing, retraining on new tumor types would primarily require collecting appropriately labeled datasets. The end-to-end architecture means the entire pipeline from raw data to classification can be reused without modification.

MS2 integration for biomarker discovery: Perhaps the most impactful next step is incorporating MS2 spectral data to bridge the gap between classification performance and biological understanding. By linking the significant m/z bins identified by LIME to specific peptide sequences via MS2-based identification, the framework could transition from a diagnostic classifier to a biomarker discovery tool. The authors specifically cite the MonoMS1 approach as a promising direction, where MS2-identified peptides serve as training labels for an MS1-based peptide recognition model. This integration would allow clinicians to not only receive a tumor/non-tumor prediction but also understand which proteins are driving the classification.

Pathological staging: The observation that MS1Former captures differences between HCC differentiation grades in its feature heatmaps suggests the model could potentially be extended for pathological staging, not just binary classification. Training on graded tissue samples (well-differentiated, moderately differentiated, poorly differentiated) could allow the model to predict disease aggressiveness directly from MS1 data. The authors noted that the differences in m/z feature distributions across differentiation types were already visible in the model's learned representations.

Clinical deployment considerations: For rapid clinical diagnosis, the key advantage of MS1Former is its end-to-end nature: raw data goes in, classification comes out, with no intermediate software dependencies like MaxQuant. However, moving from research validation to clinical deployment would require prospective validation studies, regulatory clearance, and standardized sample preparation protocols. The current reliance on Orbitrap instruments would also need to be addressed, either by expanding training data across platforms or by demonstrating acceptable performance on the specific instruments used in target clinical laboratories.

TL;DR: Future work includes extending MS1Former to other cancer types, integrating MS2 data via approaches like MonoMS1 for biomarker identification, expanding to pathological staging beyond binary classification, and pursuing prospective clinical validation. The end-to-end design makes retraining for new tumors straightforward, but multi-platform instrument coverage and regulatory clearance remain necessary for clinical deployment.
Citation: Xu W, Zhang L, Qian X, et al. A deep learning framework for hepatocellular carcinoma diagnosis using MS1 data. 2024. Open Access. PMCID: PMC11535524. DOI: 10.1038/s41598-024-77494-4. License: CC BY-NC-ND.