Deterministic retrieval recovers biomedical associations lost by language models

Halder, A.; Singh, M.; Kesarwani, R.; Mathew, B.; Bhattacharya, N.; Chikhaliya, O.; Motwani, D.; Peela, S. C. M.; Samanta, S.; Muddemmanavar, P.; Farooq, M.; Ahuja, G.; Sengupta, D.

2026-04-29 bioinformatics
10.64898/2026.04.25.720782 bioRxiv

Large language model (LLM)-based retrieval systems miss biomedical associations through output truncation, synonym mismatch and run-to-run variability, but the magnitude of this loss remains unclear. We present BioChirp, an open-source framework that uses LLMs for query interpretation and candidate filtering, combining multi-source consensus entity resolution with deterministic graph-based retrieval. Across four major biomedical databases, BioChirp recovered more associations with higher reproducibility than conventional LLM-based retrieval approaches.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 2%; 17.1%
2. Genome Medicine; 154 papers in training set; Top 0.7%; 8.2%
3. Genome Biology; 555 papers in training set; Top 0.8%; 7.0%
4. Nucleic Acids Research; 1128 papers in training set; Top 3%; 6.2%
5. Nature Communications; 4913 papers in training set; Top 31%; 6.2%
6. Database; 51 papers in training set; Top 0.1%; 6.2%

50% of probability mass above

7. Nature Methods; 336 papers in training set; Top 2%; 4.7%
8. Genome Research; 409 papers in training set; Top 1.0%; 3.6%
9. PLOS ONE; 4510 papers in training set; Top 49%; 2.0%
10. Nature Biotechnology; 147 papers in training set; Top 4%; 2.0%
11. Cell Systems; 167 papers in training set; Top 7%; 1.8%
12. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 32%; 1.7%
13. Scientific Reports; 3102 papers in training set; Top 60%; 1.7%
14. Briefings in Bioinformatics; 326 papers in training set; Top 4%; 1.7%
15. PLOS Computational Biology; 1633 papers in training set; Top 17%; 1.6%
16. npj Digital Medicine; 97 papers in training set; Top 2%; 1.4%
17. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 4%; 1.3%
18. GigaScience; 172 papers in training set; Top 2%; 1.3%
19. Journal of the American Medical Informatics Association; 61 papers in training set; Top 2%; 1.2%
20. Science; 429 papers in training set; Top 17%; 1.2%
21. Nature Machine Intelligence; 61 papers in training set; Top 3%; 1.2%
22. Nature Medicine; 117 papers in training set; Top 4%; 0.9%
23. BMC Bioinformatics; 383 papers in training set; Top 6%; 0.9%
24. Bioinformatics Advances; 184 papers in training set; Top 4%; 0.9%
25. Advanced Science; 249 papers in training set; Top 20%; 0.7%
26. Nature Genetics; 240 papers in training set; Top 8%; 0.7%
27. Molecular Systems Biology; 142 papers in training set; Top 2%; 0.7%
28. iScience; 1063 papers in training set; Top 34%; 0.7%
29. Scientific Data; 174 papers in training set; Top 3%; 0.7%
30. NAR Genomics and Bioinformatics; 214 papers in training set; Top 4%; 0.7%
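
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below assumes the cutoff is the smallest rank at which the running total of the listed percentages reaches 50%; the page does not say how the cutoff is actually computed, and the function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) for the eight top-ranked journals,
# copied from the ranked list above.
probs = [17.1, 8.2, 7.0, 6.2, 6.2, 6.2, 4.7, 3.6]

def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    sum to reach `threshold`, or None if it is never reached.

    Assumption: this is only one plausible reading of the page's
    "50% of probability mass" marker, not its documented method.
    """
    for count, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return count
    return None

top_k = journals_to_reach(probs, 50.0)
print(top_k)
```

Under this reading the running totals after ranks 5 and 6 are roughly 44.7% and 50.9%, which is consistent with the marker placed after the sixth journal.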