Bootstrap resampling of mass spectral pairs with SpecReBoot reveals hidden molecular relationships

Giron, E. C.; Ortega, L. R. T.; Greef, J. M.; Felix, Y. M.; Ortega, N. H. C.; Surup, F.; Medema, M. H.; van der Hooft, J. J. J.

2026-02-05 biochemistry
10.64898/2026.02.03.703446 bioRxiv

Mass spectral molecular networking (MN) has emerged as a key computational approach to organize and analyze the vast volumes of tandem mass spectrometry (MS/MS) data generated in natural product research. MN connections are based on mass spectral similarities derived from cosine-based scores or machine learning-based scores such as Spec2Vec and MS2DeepScore. These similarity scores are single, deterministic values and provide no estimate of the statistical robustness of the inferred mass spectral connections. As a result, molecular networks frequently contain edges arising from noise, missing fragments, or experimental variability, while authentic chemical relationships remain hidden. To address this, we introduce SpecReBoot, a statistical framework that adapts Felsenstein's bootstrap principle from phylogenetics to metabolomics. Within this framework, mass fragmentation peaks are treated as resampling units and drawn with replacement to generate pseudo-replicate spectra. Spectral similarities are recalculated across replicates, and the robustness of each edge between a pair of spectra is quantified by how frequently the two spectra appear as mutual top-k neighbors across bootstrap replicates. This approach generates bootstrap-derived confidence scores for every spectral connection, transforming mass spectral similarity from an absolute score into a distribution-based, confidence-aware measure. Across public GNPS spectral library data and natural product discovery case studies on bioactive metabolites produced by bacteria and fungi, we show that SpecReBoot reliably identifies high-confidence spectral connections, filters unstable or noise-driven edges, and rescues chemically meaningful relationships that conventional metrics systematically miss.
Applying SpecReBoot to the polyketide-lactones produced by the endophytic fungus Diaporthe caliensis revealed previously hidden spectral relationships, leading to the discovery of the novel caliensomycin macrolactone scaffold, which is biosynthetically and biochemically related to phomol, a known bioactive polyketide present in D. caliensis. In conclusion, this study provides the first statistical framework for quantifying uncertainty in MS/MS mass spectral similarity. Due to its context-agnostic nature, we anticipate that our computational metabolomics framework can also be adopted in other disciplines such as clinical and environmental metabolomics. Altogether, SpecReBoot introduces statistical rigor and improves reproducibility, thus enhancing molecular networking-based natural product discovery.
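
The resampling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the SpecReBoot implementation: the function names (binned, cosine, mutual_top_k_edges, edge_support), the toy spectra, the simple binned cosine score, the per-spectrum resampling scheme, and the rule that zero-similarity pairs are never neighbors are all assumptions made for illustration. Each spectrum is a list of (m/z, intensity) peaks; peaks are drawn with replacement, similarities are recomputed per replicate, and an edge's support is the fraction of replicates in which the two spectra are mutual top-k neighbors.

```python
import math
import random
from collections import defaultdict


def binned(peaks, decimals=1):
    """Sum intensities of peaks that fall into the same rounded m/z bin."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[round(mz, decimals)] += intensity
    return vec


def cosine(a, b):
    """Plain binned cosine similarity (a stand-in for any similarity score)."""
    va, vb = binned(a), binned(b)
    dot = sum(v * vb.get(key, 0.0) for key, v in va.items())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def mutual_top_k_edges(sims, k):
    """Return pairs (i, j), i < j, where each is among the other's top-k neighbors.

    Assumption: pairs with zero similarity are never counted as neighbors.
    """
    n = len(sims)
    neighbors = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i and sims[i][j] > 0),
            key=lambda j: sims[i][j],
            reverse=True,
        )
        neighbors.append(set(ranked[:k]))
    return {(i, j) for i in range(n) for j in neighbors[i]
            if i < j and i in neighbors[j]}


def edge_support(spectra, k=1, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each edge is a mutual top-k pair."""
    rng = random.Random(seed)
    n = len(spectra)
    counts = defaultdict(int)
    for _ in range(n_boot):
        # Resample the peaks of every spectrum with replacement; a peak drawn
        # twice contributes twice its intensity once the replicate is binned.
        replicates = [rng.choices(s, k=len(s)) for s in spectra]
        sims = [[cosine(replicates[i], replicates[j]) for j in range(n)]
                for i in range(n)]
        for edge in mutual_top_k_edges(sims, k):
            counts[edge] += 1
    return {edge: c / n_boot for edge, c in counts.items()}


# Hypothetical toy data: spectra 0 and 1 share fragments, as do spectra 2 and 3.
toy_spectra = [
    [(100.0, 1.0), (150.0, 0.5), (200.0, 0.8)],
    [(100.0, 0.9), (150.0, 0.6), (200.0, 0.7)],
    [(300.0, 1.0), (350.0, 0.4), (400.0, 0.6)],
    [(300.0, 0.8), (350.0, 0.5), (400.0, 0.7)],
]
support = edge_support(toy_spectra, k=1, n_boot=200, seed=0)
```

In this sketch, `support` maps each spectrum pair to a value between 0 and 1. Pairs that never share a fragment in any replicate do not appear at all, while pairs whose connection depends on a single peak would be expected to receive lower support than pairs that share many peaks.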

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Communications (4913 papers in training set; Top 10%): 14.7%
2. Metabolites (50 papers in training set; Top 0.1%): 14.3%
3. Analytical Chemistry (205 papers in training set; Top 0.2%): 10.4%
4. Bioinformatics (1061 papers in training set; Top 4%): 6.4%
5. Molecular & Cellular Proteomics (158 papers in training set; Top 0.4%): 6.4%

50% of probability mass above

6. Journal of Proteome Research (215 papers in training set; Top 0.6%): 4.3%
7. Cell Reports Methods (141 papers in training set; Top 0.6%): 4.3%
8. PLOS Computational Biology (1633 papers in training set; Top 10%): 3.6%
9. Computational and Structural Biotechnology Journal (216 papers in training set; Top 3%): 2.4%
10. Chemical Science (71 papers in training set; Top 0.9%): 1.7%
11. PLOS ONE (4510 papers in training set; Top 55%): 1.7%
12. mSystems (361 papers in training set; Top 5%): 1.7%
13. Nucleic Acids Research (1128 papers in training set; Top 12%): 1.5%
14. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 36%): 1.3%
15. Nature Biotechnology (147 papers in training set; Top 5%): 1.3%
16. Communications Biology (886 papers in training set; Top 14%): 1.2%
17. Nature Methods (336 papers in training set; Top 5%): 0.9%
18. Angewandte Chemie International Edition (81 papers in training set; Top 3%): 0.9%
19. Advanced Science (249 papers in training set; Top 16%): 0.9%
20. Frontiers in Molecular Biosciences (100 papers in training set; Top 4%): 0.9%
21. Cell Systems (167 papers in training set; Top 10%): 0.9%
22. Nature Machine Intelligence (61 papers in training set; Top 3%): 0.8%
23. iScience (1063 papers in training set; Top 29%): 0.8%
24. Briefings in Bioinformatics (326 papers in training set; Top 7%): 0.7%
25. Journal of Chemical Information and Modeling (207 papers in training set; Top 3%): 0.7%
26. Journal of the American Chemical Society (199 papers in training set; Top 5%): 0.6%
27. Communications Chemistry (39 papers in training set; Top 2%): 0.6%
28. eLife (5422 papers in training set; Top 61%): 0.6%