
Supporting Metadata Curation from Public Life Science Databases Using Open-Weight Large Language Models

Shintani, M.; Andrade, D.; Bono, H.

2026-02-18 bioinformatics
10.64898/2026.02.16.706241 bioRxiv

Although the Gene Expression Omnibus and other public repositories are expanding rapidly, curation across these databases has not kept pace. Data reuse is often hindered by unstandardized metadata consisting of unstructured text. To address this, we developed a workflow for automated curation that combines retrieval via an application programming interface with semantic filtering using large language models (LLMs). We benchmarked multiple LLMs on metadata from 150 candidate Arabidopsis RNA sequencing projects, classifying samples treated with exogenous abscisic acid and their controls. Simple keyword searches yielded many false positives (F1 = 0.59), whereas classification using LLMs significantly improved performance. Several open-weight models achieved near-perfect performance (F1 > 0.98), comparable to that of closed models. We also found that LLM confidence scores allow high-confidence cases to be processed automatically. These results suggest that open-weight LLMs can support scalable and reproducible metadata curation in local environments, providing a foundation for accelerating the reuse of public datasets.
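The two quantitative ideas in the abstract, scoring a classifier with F1 and routing only high-confidence predictions to automatic processing, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Prediction` record, the sample identifiers, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sample_id: str      # hypothetical sample identifier
    is_treated: bool    # predicted label: True = treated sample
    confidence: float   # model-reported confidence in [0, 1]


def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall for boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def triage(predictions, threshold=0.9):
    """Split predictions into auto-accepted and manual-review groups.

    The threshold is an assumed value, not one taken from the paper.
    """
    auto = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return auto, review


# Hypothetical predictions for three samples.
preds = [
    Prediction("sample_a", True, 0.99),
    Prediction("sample_b", False, 0.62),
    Prediction("sample_c", True, 0.95),
]
auto, review = triage(preds)
```

With these made-up inputs, `sample_a` and `sample_c` would be accepted automatically and `sample_b` would be left for a human curator. In practice the threshold would have to be tuned against labeled data.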

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.1% | 10.1%
2 | Nucleic Acids Research | 1128 papers in training set | Top 2% | 10.1%
3 | Bioinformatics | 1061 papers in training set | Top 3% | 7.2%
4 | GigaScience | 172 papers in training set | Top 0.2% | 6.4%
5 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 1% | 4.8%
6 | Genome Biology | 555 papers in training set | Top 1% | 4.8%
7 | Nature Biotechnology | 147 papers in training set | Top 2% | 4.3%
8 | PLOS Computational Biology | 1633 papers in training set | Top 8% | 4.3%
50% of probability mass above
9 | Cell Systems | 167 papers in training set | Top 3% | 4.2%
10 | Nature Methods | 336 papers in training set | Top 2% | 4.0%
11 | Briefings in Bioinformatics | 326 papers in training set | Top 2% | 4.0%
12 | PLOS ONE | 4510 papers in training set | Top 38% | 3.7%
13 | Database | 51 papers in training set | Top 0.2% | 2.6%
14 | Nature Communications | 4913 papers in training set | Top 45% | 2.4%
15 | Bioinformatics Advances | 184 papers in training set | Top 2% | 2.1%
16 | BMC Bioinformatics | 383 papers in training set | Top 4% | 2.1%
17 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 5% | 1.5%
18 | Genome Medicine | 154 papers in training set | Top 5% | 1.3%
19 | BMC Genomics | 328 papers in training set | Top 4% | 1.2%
20 | Scientific Data | 174 papers in training set | Top 2% | 0.9%
21 | Genome Research | 409 papers in training set | Top 4% | 0.9%
22 | Cell Reports Methods | 141 papers in training set | Top 5% | 0.8%
23 | Scientific Reports | 3102 papers in training set | Top 73% | 0.8%
24 | Journal of Genetics and Genomics | 36 papers in training set | Top 2% | 0.7%
25 | Cell Genomics | 162 papers in training set | Top 8% | 0.6%
26 | Molecular Plant | 36 papers in training set | Top 2% | 0.6%
27 | Plant Communications | 35 papers in training set | Top 2% | 0.6%
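
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below uses the first ten probabilities from the listing above; the function name `journals_to_reach` is made up for this illustration and is not part of the site.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-10, copied from the listing.
probabilities = [10.1, 10.1, 7.2, 6.4, 4.8, 4.8, 4.3, 4.3, 4.2, 4.0]


def journals_to_reach(probs, target):
    """Return the smallest rank whose cumulative probability reaches target."""
    for rank, cumulative in enumerate(accumulate(probs), start=1):
        if cumulative >= target:
            return rank
    return None  # target not reached with the values given


top_k = journals_to_reach(probabilities, 50.0)  # 8
```

The cumulative total is 47.7% after seven journals and 52.0% after eight, so the 50% mark is first crossed at rank 8.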