
Application of large language models to the annotation of cell lines and mouse strains in genomics data

Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.

2026-03-07 bioinformatics
10.64898/2026.03.05.709906 bioRxiv

Accurate, consistent, and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly, and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach asks the model to output supporting context and quotes from the sources; these were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
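The abstract describes combining zero-shot prompting with retrieval-augmented generation to ground the model in ontology terms. A minimal sketch of that pattern, assuming a toy ontology and a hypothetical sample descriptor (the term IDs and retrieval method below are illustrative, not taken from the paper): retrieve candidate ontology labels by string similarity, then embed them in a prompt that also asks the model to quote its supporting evidence.

```python
# Sketch of the retrieval step in a RAG-style annotation workflow: candidate
# ontology terms are retrieved by string similarity and supplied to the LLM
# prompt as domain context. The ontology entries and free-text descriptor are
# hypothetical illustrations only.
from difflib import SequenceMatcher

# Toy ontology of (term ID, label) pairs; a real workflow would load a full
# cell line or mouse strain ontology.
ONTOLOGY = [
    ("CLO:0009348", "HeLa cell"),
    ("CLO:0007050", "HEK293 cell"),
    ("CLO:0003684", "MCF-7 cell"),
]

def retrieve_candidates(free_text, ontology, k=2):
    """Rank ontology labels by similarity to a free-text descriptor."""
    scored = [
        (SequenceMatcher(None, free_text.lower(), label.lower()).ratio(),
         term_id, label)
        for term_id, label in ontology
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

def build_prompt(free_text, candidates):
    """Assemble a zero-shot prompt that embeds the retrieved terms and asks
    for a supporting quote, mirroring the verification output the paper
    requests from the model."""
    lines = [f"- {term_id} ({label})" for _, term_id, label in candidates]
    return (
        f"Map the sample descriptor '{free_text}' to one of these ontology "
        "terms, and quote the supporting text from the record:\n"
        + "\n".join(lines)
    )

candidates = retrieve_candidates("Hela cells", ONTOLOGY)
print(build_prompt("Hela cells", candidates))
```

The retrieval step keeps the prompt small while still exposing the model to the exact term IDs it must choose between; requesting a supporting quote is what makes rapid curator verification possible downstream.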

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                                  Papers in training set  Top percentile  Probability
1     Database                                                 51                      Top 0.1%        28.2%
2     Bioinformatics                                           1061                    Top 2%          14.7%
3     BMC Bioinformatics                                       383                     Top 2%          5.0%
4     NAR Genomics and Bioinformatics                          214                     Top 0.4%        4.4%
5     Nucleic Acids Research                                   1128                    Top 5%          4.3%
6     Bioinformatics Advances                                  184                     Top 1.0%        4.0%
7     PLOS Computational Biology                               1633                    Top 9%          3.7%
8     GigaScience                                              172                     Top 0.7%        2.9%
9     Genome Medicine                                          154                     Top 3%          2.8%
10    Genome Biology                                           555                     Top 3%          2.7%
11    Journal of the American Medical Informatics Association  61                      Top 1%          1.7%
12    Genomics, Proteomics & Bioinformatics                    171                     Top 3%          1.7%
13    Journal of Biomedical Informatics                        45                      Top 0.9%        1.4%
14    Computational and Structural Biotechnology Journal       216                     Top 6%          1.3%
15    Briefings in Bioinformatics                              326                     Top 5%          1.3%
16    PLOS ONE                                                 4510                    Top 61%         1.1%
17    BioData Mining                                           15                      Top 0.5%        1.1%
18    Scientific Reports                                       3102                    Top 69%         1.0%
19    Genome Research                                          409                     Top 3%          1.0%
20    Cell Systems                                             167                     Top 10%         0.9%
21    Nature Methods                                           336                     Top 5%          0.9%
22    Frontiers in Genetics                                    197                     Top 9%          0.8%
23    iScience                                                 1063                    Top 31%         0.8%
24    BMC Genomics                                             328                     Top 6%          0.7%
25    Nature Communications                                    4913                    Top 65%         0.7%
26    Nature Biotechnology                                     147                     Top 8%          0.7%
27    IEEE Journal of Biomedical and Health Informatics        34                      Top 3%          0.5%
28    Nature Machine Intelligence                              61                      Top 4%          0.5%