Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights
Karaman, I.; Payne, T.; Vizcaino, J. A.
Public data reuse is a key driver of progress in omics sciences, including, increasingly, metabolomics. In this study, we present a manually validated analysis of confirmed reuse of datasets from the MetaboLights data repository, one of the leading resources in the field. Candidate publications were collected by searching for dataset identifiers (MTBLS#) with a Python-based retrieval pipeline across major publisher databases. They were then manually validated to distinguish active reuse from citation-only mentions. Overall, 272 unique publications were confirmed to have reused at least one MetaboLights dataset. Reuse is dominated by Method/Tool Development, with smaller contributions from Secondary Biological Analysis and Data Integration/Meta-analysis. LC-MS datasets account for the majority of reuse; NMR and GC-MS datasets also contribute, but at a lower level. Data reuse has increased over time, with a noticeable acceleration in the most recent years. At the dataset level, reuse follows a long-tail distribution, in which a small subset of datasets accounts for repeated reuse, mainly as community benchmarks. These results provide a conservative estimate of public metabolomics data reuse and show that public datasets are predominantly used for methodological and computational applications. They also indicate that reuse is under-detected when dataset identifiers are not consistently reported, highlighting the need for standardised dataset citation to improve traceability and recognition of reuse.
Statement of significance of the study
The impact of public metabolomics repositories has been difficult to assess due to the lack of reliable evidence distinguishing true data reuse from simple literature citations. This study addresses that gap by providing a conservative, manually validated baseline for confirmed reuse of datasets from the MetaboLights data repository. The analysis clarifies how MetaboLights datasets are used in practice, showing that reuse is concentrated in a limited number of datasets and is dominated by computational and methodological applications.
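The identifier-based candidate search and the per-dataset reuse tally described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `extract_mtbls_ids` and `reuse_counts`, the regular expression, and the example texts are all assumptions made for illustration. The sketch only flags candidate mentions; separating active reuse from citation-only mentions would still require the manual validation step.

```python
import re
from collections import Counter

# Assumption: MetaboLights study identifiers appear as "MTBLS" followed by digits.
MTBLS_PATTERN = re.compile(r"\bMTBLS(\d+)\b")


def extract_mtbls_ids(text):
    """Return the unique MTBLS identifiers mentioned in a text, in numeric order."""
    numbers = {int(n) for n in MTBLS_PATTERN.findall(text)}
    return [f"MTBLS{n}" for n in sorted(numbers)]


def reuse_counts(publications):
    """Count, for each dataset, how many publications mention it at least once.

    `publications` maps a (hypothetical) publication key to its full text.
    """
    counts = Counter()
    for text in publications.values():
        counts.update(extract_mtbls_ids(text))
    return counts


# Hypothetical example texts, not taken from the study.
papers = {
    "paper_a": "We benchmarked our tool on MTBLS1 and MTBLS20 (see MTBLS1).",
    "paper_b": "Data were downloaded from MetaboLights (MTBLS1).",
    "paper_c": "No dataset identifier is reported here.",
}
counts = reuse_counts(papers)
```

Sorting `counts.most_common()` would expose the long-tail pattern described above, where a few datasets collect most of the mentions. A publication such as `paper_c`, which reports no identifier, contributes nothing to the tally, which is the under-detection the abstract points to.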
Matching journals
The top 6 journals account for 50% of the predicted probability mass.