A Machine Learning and Benchmarking Approach for Molecular Formula Assignment of Ultra High-Resolution Mass Spectrometry Data from Complex Mixtures

Shabbir, B.; Oliveira, P. B.; Fernandez-Lima, F.; Saeed, F.

2026-02-19 bioinformatics
10.64898/2026.02.17.706479 bioRxiv
A machine learning approach to molecular formula assignment is crucial for unlocking the full potential of ultra-high resolution mass spectrometry (UHRMS) when analyzing complex mixtures. Combining data-driven models with rigorous benchmarking can improve the accuracy, consistency, and speed of identifying plausible molecular formulas from vast spectral datasets. Compared with traditional de novo methods, which rely heavily on rule-based heuristics and manual parameter tuning, machine learning approaches can capture complex patterns in data and adapt more readily to diverse sample types. In this paper, we describe the application of a machine learning method based on the k-nearest neighbors (KNN) algorithm, trained on curated chemical formula datasets from UHRMS analysis of dissolved organic matter (DOM) covering the saline river continuum and tropical wet/dry season variability. The influence of mass accuracy (training sets with 0.15-1 ppm) was evaluated on a blind test set of DOM samples of different geographical origins. A Decision Tree Regressor (DTR) and a Random Forest Regressor (RFR) based on mass accuracy (<1 ppm) were also used. Our ML models annotated 43% more formulas than traditional methods (5,796 vs 4,047); Model-Synthetic achieved a 99.9% assignment rate and annotated about 2x more formulas (8,268 vs 4,047). DTR and RFR achieved formula-level accuracies (FA) of 86.5% and 60.4%, respectively. Overall, the results show an increase in formula assignment compared with traditional methods. This ultimately enables more reliable characterization of complex natural and engineered systems, supporting advances in fields such as environmental science, metabolomics, and petroleomics. Furthermore, the novel dataset produced for this study is made publicly available, establishing an initial benchmark for molecular formula assignment in UHRMS using machine learning.
The dataset and code are publicly available at: https://github.com/pcdslab/dom-formula-assignment-using-ml

CCS CONCEPTS: Computing methodologies → Machine Learning → Learning paradigms → Supervised Learning
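
To make the idea of assigning formulas within a ppm mass-accuracy window concrete, here is a minimal sketch and not the authors' implementation. It assumes a 1-nearest-neighbor lookup on monoisotopic mass alone; the abstract does not state which features the KNN model uses. The reference formulas and the names `mono_mass`, `ppm_error`, `build_reference`, and `assign` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Monoisotopic masses (u) of the most abundant isotopes of C, H, N, O.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def mono_mass(composition):
    """Neutral monoisotopic mass for a dict such as {"C": 6, "H": 12, "O": 6}."""
    return sum(MONO[element] * count for element, count in composition.items())


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def build_reference(formulas):
    """Return a list of (mass, formula) pairs sorted by mass."""
    return sorted((mono_mass(comp), name) for name, comp in formulas.items())


def assign(mass, reference, tol_ppm=1.0):
    """Return the nearest reference formula if it lies within tol_ppm, else None."""
    if not reference:
        return None
    masses = [m for m, _ in reference]
    i = bisect_left(masses, mass)
    # The nearest neighbor is either just below or just above the insertion point.
    candidates = reference[max(i - 1, 0): i + 1]
    best_mass, best_formula = min(candidates, key=lambda r: abs(r[0] - mass))
    if abs(ppm_error(mass, best_mass)) <= tol_ppm:
        return best_formula
    return None


# Hypothetical reference set: three CHO formulas near nominal mass 180.
REF = build_reference({
    "C6H12O6": {"C": 6, "H": 12, "O": 6},
    "C10H12O3": {"C": 10, "H": 12, "O": 3},
    "C7H16O5": {"C": 7, "H": 16, "O": 5},
})

print(assign(180.06345, REF))  # C6H12O6 (about 0.34 ppm from the reference mass)
print(assign(180.0700, REF))   # None (nearest candidate is tens of ppm away)
```

Tightening `tol_ppm` trades assignment rate for confidence, which is the kind of trade-off the 0.15-1 ppm training ranges in the abstract probe.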

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Analytical Chemistry | 205 papers in training set | Top 0.1% | 15.2%
2. PLOS ONE | 4510 papers in training set | Top 20% | 9.4%
3. Journal of Proteome Research | 215 papers in training set | Top 0.4% | 7.0%
4. Journal of the American Society for Mass Spectrometry | 33 papers in training set | Top 0.1% | 6.5%
5. Bioinformatics | 1061 papers in training set | Top 5% | 4.5%
6. Analytica Chimica Acta | 17 papers in training set | Top 0.1% | 3.7%
7. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 3.2%
8. PROTEOMICS | 35 papers in training set | Top 0.2% | 3.2%
50% of probability mass above
9. Frontiers in Plant Science | 240 papers in training set | Top 3% | 1.9%
10. Limnology and Oceanography: Methods | 11 papers in training set | Top 0.1% | 1.9%
11. Analytical and Bioanalytical Chemistry | 17 papers in training set | Top 0.1% | 1.7%
12. Frontiers in Molecular Biosciences | 100 papers in training set | Top 1% | 1.7%
13. Scientific Reports | 3102 papers in training set | Top 56% | 1.7%
14. Water Research | 74 papers in training set | Top 1.0% | 1.5%
15. iScience | 1063 papers in training set | Top 21% | 1.3%
16. Nature Communications | 4913 papers in training set | Top 56% | 1.3%
17. BMC Bioinformatics | 383 papers in training set | Top 6% | 1.0%
18. Metabolites | 50 papers in training set | Top 0.8% | 1.0%
19. Science of The Total Environment | 179 papers in training set | Top 4% | 1.0%
20. Communications Chemistry | 39 papers in training set | Top 0.7% | 0.9%
21. Journal of Chemical Information and Modeling | 207 papers in training set | Top 3% | 0.8%
22. International Journal of Molecular Sciences | 453 papers in training set | Top 14% | 0.8%
23. PLOS Computational Biology | 1633 papers in training set | Top 24% | 0.8%
24. SoftwareX | 15 papers in training set | Top 0.4% | 0.8%
25. Environmental Science & Technology Letters | 22 papers in training set | Top 0.4% | 0.8%
26. Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
27. ACS Omega | 90 papers in training set | Top 4% | 0.7%
28. mSystems | 361 papers in training set | Top 7% | 0.7%
29. Microbiology Spectrum | 435 papers in training set | Top 6% | 0.7%
30. Frontiers in Microbiology | 375 papers in training set | Top 10% | 0.7%