
Can large language models reliably extract human disease genes from full-text scientific literature?

Yin, D.; Leung, M. K. S.; Pun, D. W. H.; Chen, F. H.; Kwon, J. Y.; Lin, X.; Ho, J. W. K.

2025-07-31 bioinformatics
10.1101/2025.07.27.667022 bioRxiv

Manual extraction of high-fidelity gene-disease-phenotype information from the human genetics literature is a labor-intensive task that requires trained human genetics researchers to read through many primary research papers. This presents a major challenge for maintaining up-to-date human disease genetic databases. Recent exploration of large language models (LLMs) opens new directions for automating this manual process. However, most approaches depend on pre-training, fine-tuning, or specialized generative artificial intelligence (GenAI) tools, and there is a lack of empirical evidence on whether commercially available LLMs can be used directly to reliably extract gene-disease-phenotype information for human genetic diseases. Herein, we benchmark three zero-shot-prompted LLMs, namely GPT-4, DeepSeek, and Claude, without task-specific fine-tuning, on extracting human genetic information directly from the full text of scientific papers. Using known congenital heart disease (CHD) genes from the open-access CHDgene database (https://chdgene.victorchang.edu.au/) as the benchmark data set, GPT-4o achieved an overall extraction accuracy of 88.8% across 23 gene entries spanning over 57 references, with 100% accuracy for gene names and 78.3% and 76.7% for the disease and phenotype fields, respectively. This work introduces a lightweight, easy-to-deploy, yet robust LLM-based agent named GeneAgent, analyzes sources of disagreement, and highlights the feasibility of integrating powerful LLMs into genetic evidence synthesis workflows.

Highlights:
- First systematic benchmark of LLMs for extracting human gene-disease-phenotype relationships from full-text biomedical articles
- GeneAgent: a lightweight, highly accurate prompt-only LLM agent
- A new domain- and task-specific evaluation framework
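The abstract describes a prompt-only, zero-shot extraction setup. The exact prompts and output schema used by GeneAgent are not given; the following is a minimal sketch, assuming a hypothetical JSON schema with gene, disease, and phenotype fields, and using a mocked model reply rather than a real API call. NKX2-5 is a known CHD gene listed in CHDgene.

```python
import json

# Hypothetical field schema; the actual prompts and schema used by
# GeneAgent are not described in the abstract.
FIELDS = ("gene", "disease", "phenotypes")

def build_extraction_prompt(paper_text: str) -> str:
    """Assemble a zero-shot prompt asking an LLM to return a JSON record
    with gene, disease, and phenotype fields extracted from full text."""
    return (
        "You are a human genetics curator. From the paper below, extract "
        "the disease gene it reports, the associated disease, and the "
        "observed phenotypes. Reply with a single JSON object with keys "
        f"{list(FIELDS)} and no other text.\n\n"
        f"PAPER:\n{paper_text}"
    )

def parse_extraction(reply: str) -> dict:
    """Validate the model's reply against the expected schema."""
    record = json.loads(reply)
    missing = [f for f in FIELDS if f not in record]
    if missing:
        raise ValueError(f"model reply missing fields: {missing}")
    return record

# Example with a mocked model reply (no API call is made here):
mock_reply = (
    '{"gene": "NKX2-5", "disease": "congenital heart disease", '
    '"phenotypes": ["atrial septal defect"]}'
)
record = parse_extraction(mock_reply)
print(record["gene"])  # NKX2-5
```

In practice the prompt would be sent to each model (GPT-4, DeepSeek, Claude) through its own API, and the parsed records compared field by field against the curated CHDgene entries.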

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Bioinformatics Advances | 184 | Top 0.1% | 36.1%
2 | Bioinformatics | 1061 | Top 3% | 10.0%
3 | BMC Bioinformatics | 383 | Top 0.9% | 10.0%
-- 50% of probability mass above --
4 | Database | 51 | Top 0.2% | 3.6%
5 | PLOS ONE | 4510 | Top 40% | 3.6%
6 | BioData Mining | 15 | Top 0.1% | 3.6%
7 | Frontiers in Genetics | 197 | Top 3% | 2.6%
8 | NAR Genomics and Bioinformatics | 214 | Top 1% | 2.3%
9 | Scientific Reports | 3102 | Top 51% | 2.1%
10 | PLOS Computational Biology | 1633 | Top 14% | 2.1%
11 | GigaScience | 172 | Top 1% | 1.9%
12 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.9%
13 | Nucleic Acids Research | 1128 | Top 11% | 1.7%
14 | Journal of the American Medical Informatics Association | 61 | Top 2% | 1.1%
15 | European Journal of Human Genetics | 49 | Top 1% | 0.9%
16 | iScience | 1063 | Top 27% | 0.9%
17 | Journal of Biomedical Informatics | 45 | Top 1% | 0.8%
18 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.8%
19 | Genome Medicine | 154 | Top 9% | 0.7%
20 | Biology Methods and Protocols | 53 | Top 3% | 0.6%
21 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.8% | 0.6%
22 | Briefings in Bioinformatics | 326 | Top 7% | 0.6%