MassID provides near complete annotation of metabolomics data with identification probabilities

Stancliffe, E.; Gandhi, M.; Guzior, D. V.; Mehta, A.; Acharya, S.; Richardson, A. D.; Cho, K.; Cohen, T.; Patti, G. J.

2026-02-14 bioinformatics
10.64898/2026.02.11.704864 bioRxiv
Liquid chromatography coupled to mass spectrometry (LC/MS) is a powerful tool in metabolomics research, generating tens of thousands of signals from a single biological sample. However, current software for unbiased analysis of metabolomics data is limited by complex sources of noise and non-quantitative metabolite identifications, which make results difficult to interpret. Here, we present MassID, a cloud-based untargeted metabolomics pipeline that aims to overcome the innate challenges of unbiased metabolite analysis and performs end-to-end data processing, transforming raw spectra into normalized and identified metabolite profiles. MassID incorporates a suite of software functionalities, including deep learning-based peak detection and comprehensive noise filtering. In addition, with MassID we introduce a novel software module, DecoID2, that enables probabilistic metabolite identification for false discovery rate (FDR)-controlled metabolomics. When applied to a human plasma dataset, MassID yields near-complete signal annotation, identifies >4,000 metabolites (including >1,200 compounds at an FDR <5%) across four complementary LC/MS runs, and enables integrated downstream analyses to understand biochemical dysregulation at both the molecular and pathway levels. Identification probability generally correlated with Metabolomics Standards Initiative (MSI) confidence levels. However, only 356 of 418 MSI Level 1 compounds were identified at an FDR <5%, and the remaining 884 compounds identified at an FDR <5% were MSI Level 2-3 compounds, highlighting the enhanced specificity and discovery potential achieved by MassID.
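To make the idea of FDR-controlled identification concrete, the sketch below shows one common way per-compound identification probabilities can be turned into an FDR-limited list: rank identifications by probability and keep the largest top-ranked set whose mean error probability (1 minus the identification probability) stays at or below the threshold. This is a generic illustration, not the DecoID2 algorithm; the function name `select_at_fdr` and the example probabilities are hypothetical.

```python
from typing import List, Tuple


def select_at_fdr(probs: List[float], max_fdr: float = 0.05) -> Tuple[int, float]:
    """Return (number accepted, estimated FDR) for the largest top-ranked
    set of identifications whose expected FDR is at or below max_fdr.

    Assumption: each value in probs is the probability that the
    corresponding identification is correct.
    """
    ranked = sorted(probs, reverse=True)
    best_n, best_fdr = 0, 0.0
    expected_false = 0.0
    for n, p in enumerate(ranked, start=1):
        # Each identification contributes (1 - p) expected false discoveries.
        expected_false += 1.0 - p
        fdr = expected_false / n
        if fdr <= max_fdr:
            best_n, best_fdr = n, fdr
    return best_n, best_fdr


# Hypothetical identification probabilities for five compounds.
example_probs = [0.99, 0.98, 0.97, 0.90, 0.60]
n_accepted, est_fdr = select_at_fdr(example_probs, max_fdr=0.05)
print(n_accepted, round(est_fdr, 3))
```

With these made-up values, the four highest-probability identifications are kept at an estimated FDR of about 4%; adding the fifth would push the estimate above 5%.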

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 4%, 22.0%
2. Bioinformatics: 1061 papers in training set, Top 2%, 14.0%
3. Metabolites: 50 papers in training set, Top 0.1%, 8.9%
4. Analytical Chemistry: 205 papers in training set, Top 0.6%, 4.7%
5. Journal of Proteome Research: 215 papers in training set, Top 0.6%, 4.2%
(50% of probability mass above)
6. Nature Methods: 336 papers in training set, Top 3%, 3.9%
7. Cell Reports Methods: 141 papers in training set, Top 1.0%, 3.5%
8. Molecular & Cellular Proteomics: 158 papers in training set, Top 0.7%, 3.5%
9. Nature Biotechnology: 147 papers in training set, Top 3%, 3.5%
10. PLOS ONE: 4510 papers in training set, Top 43%, 3.0%
11. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 1.8%
12. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.8%
13. Advanced Science: 249 papers in training set, Top 11%, 1.7%
14. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.6%
15. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 5%, 1.4%
16. Scientific Reports: 3102 papers in training set, Top 65%, 1.3%
17. Genome Biology: 555 papers in training set, Top 6%, 1.2%
18. Genome Medicine: 154 papers in training set, Top 6%, 1.2%
19. Nature Machine Intelligence: 61 papers in training set, Top 3%, 0.9%
20. Communications Biology: 886 papers in training set, Top 18%, 0.9%
21. PLOS Computational Biology: 1633 papers in training set, Top 24%, 0.8%
22. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 6%, 0.8%
23. npj Systems Biology and Applications: 99 papers in training set, Top 3%, 0.7%
24. iScience: 1063 papers in training set, Top 34%, 0.7%
25. Cell Systems: 167 papers in training set, Top 13%, 0.7%
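
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only an arithmetic illustration using the first six values from the list; the function name `journals_to_reach` is hypothetical and nothing here reflects how the predictions themselves were produced.

```python
from typing import List, Tuple


def journals_to_reach(probabilities: List[float], threshold: float = 50.0) -> Tuple[int, float]:
    """Return (number of journals, cumulative percentage) at the point where
    the running sum of ranked percentages first reaches the threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    # Threshold never reached: report everything that was summed.
    return len(probabilities), total


# Predicted percentages for the six highest-ranked journals in the list.
top_six = [22.0, 14.0, 8.9, 4.7, 4.2, 3.9]
count, mass = journals_to_reach(top_six)
print(count, round(mass, 1))
```

The running sum stays below 50% after four journals (about 49.6%) and crosses it at the fifth (about 53.8%), which matches the position of the 50% marker in the list.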