Supporting Metadata Curation from Public Life Science Databases Using Open-Weight Large Language Models

Shintani, M.; Andrade, D.; Bono, H.

2026-02-18 bioinformatics

10.64898/2026.02.16.706241 bioRxiv

Although the Gene Expression Omnibus and other public repositories are expanding rapidly, curation across these databases has not kept pace. Data reuse is often hindered by unstandardized metadata consisting of unstructured text. To address this, we developed a workflow for automated curation that combines retrieval via an application programming interface with semantic filtering using large language models (LLMs). We benchmarked multiple LLMs on metadata from 150 candidate Arabidopsis RNA sequencing projects, classifying samples treated with exogenous abscisic acid and their controls. Simple keyword searches yielded many false positives (F1=0.59); classification using LLMs significantly improved performance. Several open-weight models achieved near-perfect performance (F1>0.98), comparable to that of closed models. We also found that using LLM confidence scores allows high-confidence cases to be processed automatically. These results suggest that open-weight LLMs can support scalable and reproducible metadata curation in local environments, providing a foundation for accelerating the reuse of public datasets.
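The evaluation and triage ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword pattern, the `Prediction` record, the 0.9 threshold, and the sample strings are all hypothetical, chosen only to show how a keyword match can produce a false positive, how F1 is computed from predicted and true sample sets, and how a confidence score might split cases into automatic and manual-review queues.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword pattern; the abstract does not specify the search terms.
ABA_PATTERN = re.compile(r"\b(aba|abscisic acid)\b", re.IGNORECASE)


def keyword_match(metadata_text: str) -> bool:
    """Return True if the free-text metadata mentions ABA at all."""
    return ABA_PATTERN.search(metadata_text) is not None


def f1_score(predicted: set, actual: set) -> float:
    """F1 = harmonic mean of precision and recall over sample IDs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


@dataclass
class Prediction:
    sample_id: str
    is_target: bool    # model's classification (assumed output field)
    confidence: float  # assumed model-reported score in [0, 1]


def triage(predictions, threshold=0.9):
    """Split predictions into auto-accepted and manual-review lists."""
    auto = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return auto, review


if __name__ == "__main__":
    # A mutant description mentions ABA without any exogenous treatment,
    # so a keyword search flags it even though it is not a target sample.
    print(keyword_match("abi1 mutant, ABA-insensitive, mock treated"))

    preds = [
        Prediction("S1", True, 0.97),
        Prediction("S2", False, 0.55),
    ]
    auto, review = triage(preds)
    print([p.sample_id for p in auto], [p.sample_id for p in review])
```

In a setup like this, only the `review` list would need a curator's attention; where to place the threshold is a judgment call that depends on how much residual error is acceptable.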
