
Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights

Karaman, I.; Payne, T.; Vizcaino, J. A.

2026-05-05 bioinformatics
10.64898/2026.05.01.722142 bioRxiv

Public data reuse is a key driver of progress in omics sciences, including, increasingly, metabolomics. In this study, we present a validated analysis of confirmed reuse of datasets from the MetaboLights data repository, one of the leading resources in the field. Candidate publications were collected via dataset identifiers (MTBLS#) using a Python-based retrieval pipeline across major publisher databases. These were then manually validated to distinguish active reuse from citation-only mentions. Overall, 272 unique publications were confirmed to have reused at least one MetaboLights dataset. Reuse is dominated by Method/Tool Development, with smaller contributions from Secondary Biological Analysis and Data Integration/Meta-analysis. LC-MS datasets account for the majority of reuse; NMR and GC-MS datasets also contribute, but at a lower level. Data reuse has increased over time, with a noticeable acceleration in the most recent years. At the dataset level, reuse follows a long-tail distribution, where a small subset of datasets accounts for repeated reuse, mainly as community benchmarks. These results provide a conservative estimate of public metabolomics data reuse and show that public datasets are predominantly used for methodological and computational applications. They also indicate that reuse is under-detected when dataset identifiers are not consistently reported, highlighting the need for standardised dataset citation to improve traceability and recognition of reuse.

Statement of significance of the study

The impact of public metabolomics repositories has been difficult to assess due to the lack of reliable evidence distinguishing true data reuse from simple literature citations. This study addresses that gap by providing a conservative, manually validated baseline for confirmed reuse of datasets from the MetaboLights data repository. The analysis clarifies how MetaboLights datasets are used in practice, showing that reuse is concentrated in a limited number of datasets and is dominated by computational and methodological applications.
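The identifier-based candidate collection described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the pattern, the function name `find_mtbls_ids`, and the sample sentence are assumptions made for illustration, and the code only flags candidate mentions. Separating active reuse from citation-only mentions was done manually in the study.

```python
import re

# Assumed accession format: the prefix "MTBLS" followed by one or more digits.
MTBLS_PATTERN = re.compile(r"\bMTBLS(\d+)\b", re.IGNORECASE)


def find_mtbls_ids(text):
    """Return the unique MTBLS accessions mentioned in `text`, in numeric order."""
    numbers = {int(match.group(1)) for match in MTBLS_PATTERN.finditer(text)}
    return [f"MTBLS{number}" for number in sorted(numbers)]


# Hypothetical sentence from a candidate publication.
sample = "We reanalysed MTBLS1 and (mtbls264); MTBLS1 served as the benchmark."
print(find_mtbls_ids(sample))
```

A publication that reuses a dataset without naming its accession would not be matched by a search of this kind, which is consistent with the abstract's point that reuse is under-detected when identifiers are not reported.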

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of Proteome Research | 215 | Top 0.3% | 13.7% |
| 2 | Analytical Chemistry | 205 | Top 0.2% | 11.9% |
| 3 | Metabolites | 50 | Top 0.1% | 11.8% |
| 4 | GigaScience | 172 | Top 0.2% | 6.1% |
| 5 | Computational and Structural Biotechnology Journal | 216 | Top 0.7% | 6.0% |
| 6 | Bioinformatics | 1061 | Top 4% | 6.0% |
| 7 | BMC Bioinformatics | 383 | Top 2% | 4.0% |
| 8 | Briefings in Bioinformatics | 326 | Top 2% | 3.4% |
| 9 | PLOS Computational Biology | 1633 | Top 11% | 3.1% |
| 10 | Nature Communications | 4913 | Top 46% | 2.3% |
| 11 | Peer Community Journal | 254 | Top 1% | 2.3% |
| 12 | Molecular & Cellular Proteomics | 158 | Top 0.9% | 2.0% |
| 13 | Genome Biology | 555 | Top 5% | 1.6% |
| 14 | PLOS ONE | 4510 | Top 59% | 1.3% |
| 15 | Scientific Reports | 3102 | Top 67% | 1.2% |
| 16 | PROTEOMICS | 35 | Top 0.5% | 1.2% |
| 17 | NAR Genomics and Bioinformatics | 214 | Top 3% | 1.2% |
| 18 | PeerJ | 261 | Top 14% | 0.8% |
| 19 | Nucleic Acids Research | 1128 | Top 19% | 0.7% |
| 20 | Metabolomics | 11 | Top 0.5% | 0.7% |
| 21 | npj Systems Biology and Applications | 99 | Top 3% | 0.7% |
| 22 | mSystems | 361 | Top 8% | 0.6% |
| 23 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.6% | 0.6% |
| 24 | Bioinformatics Advances | 184 | Top 5% | 0.6% |
| 25 | Frontiers in Bioinformatics | 45 | Top 1% | 0.6% |
| 26 | Molecular Omics | 21 | Top 0.5% | 0.6% |
| 27 | Database | 51 | Top 1% | 0.6% |
| 28 | SoftwareX | 15 | Top 0.6% | 0.6% |
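
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked against the probabilities listed above. The sketch below is only an illustration: the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown for the eight highest-ranked journals, so the running totals are approximate.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return how many top-ranked entries are needed for the running total
    to reach `threshold` (in percent), or None if it is never reached."""
    total = 0.0
    for count, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return count
    return None


# Rounded percentages for ranks 1 to 8, copied from the listing above.
top_probabilities = [13.7, 11.9, 11.8, 6.1, 6.0, 6.0, 4.0, 3.4]
print(journals_to_reach(top_probabilities))  # prints 6
```

With these values the first five journals sum to about 49.5% and the first six to about 55.5%, so the sixth journal is the one that crosses the 50% mark.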