BertMS-enabled molecular networking for unknown compounds dereplication

Luning, Z.; Shuang, W.; Jixing, P.; Xiaofei, H.; Wenxue, W.; Dehai, L.

2026-03-19 microbiology
10.64898/2026.03.19.712370 bioRxiv

Spectral similarity is widely used as a proxy for structural similarity in tandem mass spectrometry (MS/MS) analyses, including library matching and molecular networking. However, the relationship between spectral similarity scores and true structural similarity remains imperfect, limiting compound identification in metabolomics studies. Here, we present BertMS, a spectral similarity framework based on bidirectional encoder representations from transformers (BERT), which learns contextualized representations of fragment ions from large-scale MS/MS data. Using datasets from MoNA and GNPS comprising over 100,000 unique molecules, we systematically evaluate BertMS against existing methods, including cosine similarity and Spec2Vec. BertMS shows improved performance across multiple evaluation metrics, with average gains of approximately 15-25% depending on the task. Notably, improvements are most evident in molecular similarity assessment. We further demonstrate the applicability of BertMS in molecular networking and dereplication of microbial metabolites, where it enables more consistent identification of structurally related compounds. Together, these results demonstrate that transformer-based representations improve spectral similarity estimation and enable more reliable metabolite annotation in complex mixtures.
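
For readers unfamiliar with the cosine baseline named in the abstract, the following is a minimal sketch of a binned cosine similarity between two MS/MS spectra. It is not the BertMS method and not necessarily the exact baseline the authors evaluated: the function names, the 1.0 m/z bin width, and the toy peak lists are all assumptions made for illustration, and production tools typically match peaks within a mass tolerance rather than using fixed bins.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumed value)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned intensity vectors, in [0, 1]."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical (m/z, intensity) peak lists, made up for this example.
spectrum_a = [(101.2, 30.0), (150.7, 100.0), (203.1, 45.0)]
spectrum_b = [(101.4, 25.0), (150.9, 90.0), (250.3, 60.0)]

score = cosine_similarity(spectrum_a, spectrum_b)
```

Because the score depends only on which fragment bins overlap and how intense they are, two structurally related molecules that fragment differently can score low, which is the gap between spectral and structural similarity that the abstract describes.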

Matching journals

The top 8 journals account for 50% of the predicted probability mass.
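
The cutoff can be checked by accumulating the listed probabilities in rank order. The sketch below assumes the threshold is reached at the first rank whose running total meets 50%; the function name is hypothetical, and the inputs are the rounded percentages shown on this page, so the running totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals, as listed on this page.
probabilities = [22.3, 6.8, 6.8, 4.3, 3.6, 3.0, 2.9, 2.7, 2.7, 2.4]


def journals_to_reach(probs, threshold):
    """Return the smallest rank whose cumulative probability meets the threshold, or None."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold - 1e-9:  # small tolerance for floating-point rounding
            return rank
    return None


cutoff_rank = journals_to_reach(probabilities, 50.0)
```

With these values the running total is about 49.7% after seven journals and about 52.4% after eight, consistent with the stated cutoff.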

Rank  Journal                                             Papers in training set  Percentile  Predicted probability
1     Nature Communications                               4913                    Top 3%      22.3%
2     Metabolites                                         50                      Top 0.1%    6.8%
3     Nature Methods                                      336                     Top 2%      6.8%
4     Molecular & Cellular Proteomics                     158                     Top 0.6%    4.3%
5     Journal of Proteome Research                        215                     Top 0.8%    3.6%
6     Cell Systems                                        167                     Top 4%      3.0%
7     Genomics, Proteomics & Bioinformatics               171                     Top 2%      2.9%
8     Advanced Science                                    249                     Top 7%      2.7%
(50% of probability mass above)
9     Genome Biology                                      555                     Top 3%      2.7%
10    Nature Biotechnology                                147                     Top 3%      2.4%
11    Bioinformatics                                      1061                    Top 7%      2.1%
12    Microbiome                                          139                     Top 1%      2.1%
13    Nature Machine Intelligence                         61                      Top 2%      2.1%
14    Briefings in Bioinformatics                         326                     Top 3%      1.9%
15    Proceedings of the National Academy of Sciences     2130                    Top 30%     1.9%
16    Nucleic Acids Research                              1128                    Top 11%     1.7%
17    Communications Biology                              886                     Top 9%      1.7%
18    PLOS ONE                                            4510                    Top 57%     1.5%
19    PLOS Computational Biology                          1633                    Top 18%     1.5%
20    Computational and Structural Biotechnology Journal  216                     Top 6%      1.3%
21    Analytical Chemistry                                205                     Top 2%      1.3%
22    ISME Communications                                 103                     Top 1%      1.3%
23    iScience                                            1063                    Top 20%     1.3%
24    mSystems                                            361                     Top 6%      1.2%
25    Genome Medicine                                     154                     Top 7%      0.9%
26    Bioinformatics Advances                             184                     Top 4%      0.8%
27    Angewandte Chemie International Edition             81                      Top 3%      0.7%
28    Communications Chemistry                            39                      Top 1%      0.7%
29    Nature Microbiology                                 133                     Top 5%      0.7%
30    Genome Research                                     409                     Top 5%      0.7%