Bootstrap resampling of mass spectral pairs with SpecReBoot reveals hidden molecular relationships

Giron, E. C.; Ortega, L. R. T.; Greef, J. M.; Felix, Y. M.; Ortega, N. H. C.; Surup, F.; Medema, M. H.; van der Hooft, J. J. J.

2026-02-05 biochemistry

10.64898/2026.02.03.703446 bioRxiv

Mass spectral molecular networking (MN) has emerged as a key computational approach to organize and analyze the vast volumes of tandem mass spectrometry (MS/MS) data generated in natural product research. MN connections are based on mass spectral similarities derived from cosine-based scores or machine learning-based scores such as Spec2Vec and MS2DeepScore. These similarity scores are single, deterministic values and provide no estimate of the statistical robustness of the inferred mass spectral connections. As a result, molecular networks frequently contain edges arising from noise, missing fragments, or experimental variability, while authentic chemical relationships remain hidden. To address this, we introduce SpecReBoot, a statistical framework that adapts Felsenstein's bootstrap principle from phylogenetics to metabolomics. Within this framework, fragment peaks are treated as resampling units and drawn with replacement to generate pseudo-replicate spectra. Spectral similarities are recalculated across replicates, and the robustness of each edge between a pair of spectra is quantified by how frequently the two spectra appear as mutual top-k neighbors across bootstrap replicates. This approach generates bootstrap-derived confidence scores for every spectral connection, transforming mass spectral similarity from an absolute score into a distribution-based, confidence-aware measure. We show that, across public GNPS spectral library data and natural product discovery case-study datasets on bioactive metabolites produced by bacteria and fungi, SpecReBoot reliably identifies high-confidence spectral connections, filters unstable or noise-driven edges, and rescues chemically meaningful relationships that conventional metrics systematically miss.
Applying SpecReBoot to the polyketide lactones produced by the endophytic fungus Diaporthe caliensis revealed previously hidden spectral relationships, leading to the discovery of the novel caliensomycin macrolactone scaffold, which is biosynthetically and biochemically related to phomol, a known bioactive polyketide present in D. caliensis. In conclusion, this study provides the first statistical framework for quantifying uncertainty in MS/MS mass spectral similarity. Because the framework is context-agnostic, we anticipate that it can also be adopted in other disciplines such as clinical and environmental metabolomics. Altogether, SpecReBoot introduces statistical rigor and improves reproducibility, thus enhancing molecular networking-based natural product discovery.
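
The resampling idea described in the abstract can be illustrated with a minimal sketch. This is not the SpecReBoot implementation or its API: all function names and parameters below (`binned_vector`, `cosine_matrix`, `mutual_topk`, `bootstrap_edge_support`, `bin_width`, `max_mz`, `k`, `n_boot`) are hypothetical, a simple binned cosine stands in for whichever similarity score is actually used, and the sketch assumes that a peak drawn several times in one replicate contributes its intensity once per draw.

```python
import numpy as np

def binned_vector(mz, intensity, bin_width=1.0, max_mz=1000.0):
    """Project a peak list onto fixed-width m/z bins (assumed representation)."""
    n_bins = int(np.ceil(max_mz / bin_width))
    idx = np.minimum((mz / bin_width).astype(int), n_bins - 1)
    vec = np.zeros(n_bins)
    # Unbuffered add: a peak drawn several times is counted once per draw.
    np.add.at(vec, idx, intensity)
    return vec

def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty spectra
    unit = vectors / norms
    return unit @ unit.T

def mutual_topk(sim, k):
    """Boolean matrix: True where i and j are in each other's top-k neighbors."""
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)  # a spectrum is never its own neighbor
    n = s.shape[0]
    topk = np.argsort(-s, axis=1)[:, :min(k, n - 1)]
    is_neighbor = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), topk.shape[1])
    is_neighbor[rows, topk.ravel()] = True
    return is_neighbor & is_neighbor.T

def bootstrap_edge_support(spectra, k=2, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which each pair is a mutual top-k edge."""
    rng = np.random.default_rng(seed)
    n = len(spectra)
    counts = np.zeros((n, n))
    for _ in range(n_boot):
        vecs = []
        for mz, inten in spectra:
            # Resample peak indices with replacement to build a pseudo-replicate.
            draw = rng.integers(0, len(mz), size=len(mz))
            vecs.append(binned_vector(mz[draw], inten[draw]))
        counts += mutual_topk(cosine_matrix(np.array(vecs)), k)
    return counts / n_boot

# Toy data: spectra 0 and 1 share fragment m/z values, as do spectra 2 and 3.
spectra = [
    (np.array([100.0, 150.0, 200.0, 250.0]), np.array([1.0, 0.5, 0.8, 0.3])),
    (np.array([100.0, 150.0, 200.0, 250.0]), np.array([0.9, 0.6, 0.7, 0.2])),
    (np.array([300.0, 350.0, 400.0, 450.0]), np.array([1.0, 0.4, 0.6, 0.5])),
    (np.array([300.0, 350.0, 400.0, 450.0]), np.array([0.8, 0.5, 0.7, 0.4])),
]
support = bootstrap_edge_support(spectra, k=1, n_boot=200, seed=42)
```

In this sketch, `support[i, j]` is the share of replicates in which spectra `i` and `j` are mutual top-k neighbors, so it can be read as a rough edge-confidence value between 0 and 1. Thresholding it to keep or drop network edges would be a separate design choice that the abstract does not specify.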
