
Positive-Unlabeled Learning for Predicting Small Molecule MS2 Identifiability from MS1 Context and Acquisition Parameters

Bekbergenova, M.; Jiang, T.; Nothias, L.-F.; Bittremieux, W.

2026-01-24 bioinformatics
10.64898/2026.01.23.701093 bioRxiv

Motivation: The quality of tandem mass spectra critically determines metabolite identifiability in untargeted metabolomics, yet optimizing MS2 acquisition parameters experimentally is costly, time-consuming, and infeasible across the full diversity of samples and instruments. A key obstacle to computational quality assessment is the absence of reliable negative labels: most MS2 spectra remain unannotated not because they are low quality, but because the corresponding compounds are absent from reference libraries. This label ambiguity fundamentally limits supervised learning approaches and complicates acquisition-time decision making.

Results: We present a deep learning framework that predicts the probability that an MS2 scan will be identifiable using only the preceding MS1 spectrum and instrument acquisition parameters, without inspecting the MS2 spectrum itself. The problem is formulated as positive-unlabeled learning and addressed using a non-negative positive-unlabeled objective, enabling robust training despite missing negative labels. Trained on over eight million MS2 scans from public Orbitrap metabolomics datasets and evaluated on laboratory-disjoint benchmarks, the model recovered 90% of known identifiable spectra in a held-out test set. Predicted probabilities generalized to unseen laboratories and stratified unlabeled spectra in a physicochemically meaningful manner. Independent validation demonstrated that high predicted quality is associated with increased structural explainability, richer fragmentation patterns, canonical precursor charge states, and reduced spectral interference. These results indicate that MS2 identifiability can be anticipated from precursor context and acquisition settings alone.

Availability and implementation: Code is available at https://github.com/bittremieuxlab/pu_ms2_identifiability. Model weights and processed datasets are available at 10.5281/zenodo.18266932.
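To make the non-negative positive-unlabeled objective mentioned in the abstract concrete, the following is a minimal NumPy sketch of the standard non-negative PU risk estimator. It is not taken from the authors' repository: the sigmoid surrogate loss, the function names `sigmoid_loss` and `nnpu_risk`, and the `class_prior` value (the assumed fraction of truly positive spectra among all spectra) are illustrative assumptions.

```python
import numpy as np

def sigmoid_loss(scores, label):
    # Surrogate loss l(z, y) = 1 / (1 + exp(y * z)); small when sign(z) agrees with y.
    return 1.0 / (1.0 + np.exp(label * scores))

def nnpu_risk(scores_pos, scores_unl, class_prior):
    """Non-negative PU risk for raw classifier scores (hypothetical helper).

    scores_pos  -- scores for labeled positive examples
    scores_unl  -- scores for unlabeled examples
    class_prior -- assumed proportion of positives in the data (must be supplied)
    """
    risk_pos_as_pos = sigmoid_loss(scores_pos, +1).mean()
    risk_pos_as_neg = sigmoid_loss(scores_pos, -1).mean()
    risk_unl_as_neg = sigmoid_loss(scores_unl, -1).mean()
    # Estimated risk on the (unobserved) negatives; it can go below zero
    # when the model overfits, so it is clamped at zero.
    neg_risk = risk_unl_as_neg - class_prior * risk_pos_as_neg
    return class_prior * risk_pos_as_pos + max(0.0, neg_risk)

# Toy scores: positives scored high, unlabeled examples mixed.
example_risk = nnpu_risk(np.array([2.0, 3.0]), np.array([-2.0, 2.0]), class_prior=0.5)
```

The clamp is what distinguishes this from the unbiased PU estimator: without it, the negative-class term can become negative during training, which lets a flexible model drive the empirical risk down by overfitting.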

Matching journals

The top 3 journals together account for more than 50% of the predicted probability mass (about 56.9%).

Rank | Journal | Papers in training set | Percentile | Probability
1 | Bioinformatics | 1061 | Top 0.5% | 35.6%
2 | Analytical Chemistry | 205 | Top 0.2% | 10.8%
3 | Nature Communications | 4913 | Top 16% | 10.5%
(50% of probability mass above this line)
4 | Journal of Proteome Research | 215 | Top 0.4% | 8.7%
5 | Molecular & Cellular Proteomics | 158 | Top 0.5% | 5.0%
6 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.1% | 2.8%
7 | Metabolites | 50 | Top 0.4% | 2.0%
8 | PLOS ONE | 4510 | Top 52% | 1.8%
9 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.7%
10 | Nature Machine Intelligence | 61 | Top 2% | 1.5%
11 | BMC Bioinformatics | 383 | Top 5% | 1.5%
12 | Communications Biology | 886 | Top 12% | 1.4%
13 | Briefings in Bioinformatics | 326 | Top 5% | 1.3%
14 | PLOS Computational Biology | 1633 | Top 21% | 0.9%
15 | Nature Methods | 336 | Top 6% | 0.8%
16 | Cell Systems | 167 | Top 11% | 0.8%
17 | Advanced Science | 249 | Top 19% | 0.7%
18 | Scientific Reports | 3102 | Top 75% | 0.7%
19 | PROTEOMICS | 35 | Top 0.9% | 0.7%
20 | GigaScience | 172 | Top 4% | 0.7%
21 | Cancer Research Communications | 46 | Top 2% | 0.5%
22 | mSystems | 361 | Top 8% | 0.5%
23 | Scientific Data | 174 | Top 3% | 0.5%
24 | Proceedings of the National Academy of Sciences | 2130 | Top 48% | 0.5%
25 | Bioinformatics Advances | 184 | Top 5% | 0.5%