Can large language models reliably extract human disease genes from full-text scientific literature?

Yin, D.; Leung, M. K. S.; Pun, D. W. H.; Chen, F. H.; Kwon, J. Y.; Lin, X.; Ho, J. W. K.

2025-07-31 bioinformatics

10.1101/2025.07.27.667022 bioRxiv

Manual extraction of high-fidelity gene-disease-phenotype information from the human genetics literature is labor-intensive: trained human genetics researchers must read through many primary research papers. This presents a major challenge for keeping human disease genetics databases up to date. Recent work on large language models (LLMs) opens new directions for automating this manual process. However, most approaches depend on pre-training, fine-tuning, or specialized generative artificial intelligence (GenAI) tools, and there is a lack of empirical evidence on whether commercially available LLMs can be used directly to reliably extract gene-disease-phenotype relationships for human genetic diseases. Herein, we benchmark three zero-shot prompted LLMs, namely GPT-4, DeepSeek and Claude, without task-specific fine-tuning, on extracting human genetic information directly from the full text of scientific papers. Using known congenital heart disease (CHD) genes from the open-access CHDgene database (https://chdgene.victorchang.edu.au/) as the benchmark data set, GPT-4o achieved 88.8% overall extraction accuracy across 23 gene entries containing over 57 references, with 100% accuracy in the gene name field and 78.3% and 76.7% in the disease and phenotype fields, respectively. This work introduces a lightweight, easy-to-deploy, yet robust LLM-based agent named GeneAgent, analyzes sources of disagreement, and highlights the feasibility of integrating powerful LLMs into genetic evidence synthesis workflows.

Highlights
- First systematic benchmark of LLMs for extracting human gene-disease-phenotype relationships from full-text biomedical articles
- GeneAgent: a lightweight, highly accurate prompt-only LLM agent
- New domain- and task-specific evaluation framework
