
Improving Causal Gene Identification Using Large Language Models

Ofer, D.; Kaufman, H.

2026-03-10 bioinformatics
10.64898/2026.03.08.710344 bioRxiv

Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. Simple proximity-based heuristics are often insufficient because of linkage disequilibrium, gene interactions, and regulatory effects. Recent advances in Large Language Models (LLMs) have shown potential for automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using a smaller model, Qwen2.5, assessing its predictive accuracy on a benchmark dataset from Open Targets. Performance improved when integrating RAG-based literature retrieval (F1 = 0.795) and gene distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.
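
For readers unfamiliar with the proximity heuristic and the F1 metric mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each locus has a single true causal gene and a list of candidate genes with distances to the lead variant. The names `Candidate`, `nearest_gene`, and `f1_score`, and all gene symbols and distances, are hypothetical placeholders; the abstract does not state how the paper computes F1, so the definition used here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene: str         # gene symbol (toy placeholder, not from the paper)
    distance_bp: int  # distance from the lead variant to the gene, in base pairs


def nearest_gene(candidates):
    """Proximity heuristic: return the candidate gene closest to the lead variant."""
    return min(candidates, key=lambda c: c.distance_bp).gene


def f1_score(predicted, truth):
    """F1 over loci, assuming one true causal gene and at most one prediction per locus."""
    tp = sum(1 for locus, gene in predicted.items() if truth.get(locus) == gene)
    fp = len(predicted) - tp
    fn = sum(1 for locus, gene in truth.items() if predicted.get(locus) != gene)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy loci: in the second one the nearest gene is NOT the causal gene,
# which is the failure mode that motivates going beyond proximity alone.
loci = {
    "locus_1": [Candidate("GENE_A", 12_000), Candidate("GENE_B", 85_000)],
    "locus_2": [Candidate("GENE_C", 40_000), Candidate("GENE_D", 3_000)],
}
truth = {"locus_1": "GENE_A", "locus_2": "GENE_C"}

predictions = {locus: nearest_gene(cands) for locus, cands in loci.items()}
print(predictions, f1_score(predictions, truth))
```

In a distance-augmented LLM setup, the same `distance_bp` values could be written into the prompt next to each candidate gene; how the study actually formats that information is not described in the abstract.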

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 21.4% (1061 papers in training set, Top 1%)
2. BMC Bioinformatics: 13.6% (383 papers in training set, Top 0.6%)
3. Bioinformatics Advances: 9.6% (184 papers in training set, Top 0.2%)
4. BioData Mining: 6.0% (15 papers in training set, Top 0.1%)

50% of probability mass above

5. Frontiers in Genetics: 6.0% (197 papers in training set, Top 1.0%)
6. Scientific Reports: 6.0% (3102 papers in training set, Top 21%)
7. PLOS Computational Biology: 4.6% (1633 papers in training set, Top 7%)
8. Computational and Structural Biotechnology Journal: 3.4% (216 papers in training set, Top 2%)
9. PLOS ONE: 3.4% (4510 papers in training set, Top 41%)
10. GigaScience: 2.0% (172 papers in training set, Top 1%)
11. Database: 1.8% (51 papers in training set, Top 0.3%)
12. European Journal of Human Genetics: 1.4% (49 papers in training set, Top 0.7%)
13. NAR Genomics and Bioinformatics: 1.4% (214 papers in training set, Top 2%)
14. Journal of the American Medical Informatics Association: 1.2% (61 papers in training set, Top 2%)
15. Nucleic Acids Research: 1.1% (1128 papers in training set, Top 14%)
16. Journal of Biomedical Informatics: 0.9% (45 papers in training set, Top 1%)
17. iScience: 0.9% (1063 papers in training set, Top 26%)
18. Briefings in Bioinformatics: 0.8% (326 papers in training set, Top 6%)
19. Frontiers in Bioinformatics: 0.7% (45 papers in training set, Top 1%)
20. IEEE Transactions on Computational Biology and Bioinformatics: 0.7% (17 papers in training set, Top 0.8%)
21. PeerJ: 0.7% (261 papers in training set, Top 17%)
22. BMC Medical Genomics: 0.6% (36 papers in training set, Top 2%)