
RAGulate: Retrieval-Augmented Generation for Post-hoc Literature-Grounded Regulatory Assessment

Zandigohar, M.; Dai, Y.

2026-01-21 bioinformatics
10.64898/2026.01.20.700704 bioRxiv

Motivation: Prioritization of transcription factor (TF)-target relationships predicted by computational models for experimental validation often requires biologists to manually inspect heterogeneous and context-dependent evidence scattered across the biomedical literature. Large Language Models (LLMs) offer a promising solution to streamline this task. However, their reliance on general-purpose knowledge may lead to hallucinations and inaccurate interpretations.

Results: We present RAGulate, a retrieval-augmented generation (RAG) framework for literature-grounded assessment of transcriptional regulation. RAGulate leverages CollecTRI, an external regulatory knowledge base, and integrates alias-aware query expansion, sparse and dense retrieval, maximum-marginal-relevance re-ranking, and LLM-based classification of predictions within a modular pipeline. Using a balanced TF-target-context benchmark from the same resource, we evaluate retrieval, classification, and evidence faithfulness. While CollecTRI provides TF-target links with supporting PubMed Identifiers (PMIDs), RAGulate infers the context of each interaction from the retrieved literature. Results show that alias normalization markedly improves retrieval recall, while hybrid retrieval, which merges lexical and embedding-based candidates, achieves the highest evidence recovery across all cut-offs. Conditioning LLMs on retrieved documents consistently improves AUROC and AUPR for classifying whether a TF-target interaction is supported in the specified context compared with direct prompting. RAGulate reduces hallucinations and improves PMID-level citation correctness, producing explanations that faithfully reflect the supporting literature. RAGulate represents a knowledge-based AI tool that partners with biologists to accelerate the process of TF-target prioritization for experimental validation and foster hypothesis generation.

Availability and implementation: The software and tutorials are available at github.com/YDaiLab/RAGulate.
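The abstract names maximum-marginal-relevance (MMR) re-ranking as one stage of the pipeline but does not describe how it is implemented. As a rough illustration of the general technique only, the sketch below greedily selects documents that are similar to the query while penalizing similarity to documents already selected. The function names `cosine` and `mmr_rerank`, the `lambda_` trade-off parameter, and the toy vectors are all assumptions made for this example; they are not taken from the RAGulate code base.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, doc_vecs, k=2, lambda_=0.5):
    """Greedy MMR: return indices of up to k documents, in selection order.

    lambda_ = 1.0 ranks by relevance alone; lower values favour diversity.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))

    while remaining and len(selected) < k:

        def mmr_score(i):
            # Redundancy is the highest similarity to any already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    # Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
    query = [1.0, 0.0, 0.0]
    docs = [[1.0, 0.2, 0.0], [1.0, 0.25, 0.0], [1.0, 0.0, 0.6]]
    print(mmr_rerank(query, docs, k=2, lambda_=0.5))
    print(mmr_rerank(query, docs, k=2, lambda_=1.0))
```

With the toy vectors, lowering `lambda_` makes the second pick skip the near-duplicate in favour of the less redundant document, which is the behaviour MMR is meant to provide when several retrieved abstracts report the same evidence.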

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 0.6%; 34.5%
2. BMC Bioinformatics: 383 papers in training set; Top 2%; 6.3%
3. Bioinformatics Advances: 184 papers in training set; Top 0.5%; 6.3%
4. Nucleic Acids Research: 1128 papers in training set; Top 4%; 4.9%

50% of probability mass above

5. Database: 51 papers in training set; Top 0.1%; 4.0%
6. Genome Medicine: 154 papers in training set; Top 3%; 3.1%
7. PLOS Computational Biology: 1633 papers in training set; Top 11%; 3.1%
8. Genome Biology: 555 papers in training set; Top 3%; 2.7%
9. Nature Methods: 336 papers in training set; Top 4%; 2.1%
10. GigaScience: 172 papers in training set; Top 1%; 1.9%
11. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 4%; 1.7%
12. NAR Genomics and Bioinformatics: 214 papers in training set; Top 2%; 1.7%
13. Cell Systems: 167 papers in training set; Top 7%; 1.7%
14. Journal of the American Medical Informatics Association: 61 papers in training set; Top 1%; 1.5%
15. PLOS ONE: 4510 papers in training set; Top 58%; 1.3%
16. BioData Mining: 15 papers in training set; Top 0.4%; 1.3%
17. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 36%; 1.3%
18. Nature Communications: 4913 papers in training set; Top 56%; 1.2%
19. Scientific Reports: 3102 papers in training set; Top 71%; 0.9%
20. Nature Machine Intelligence: 61 papers in training set; Top 3%; 0.9%
21. Briefings in Bioinformatics: 326 papers in training set; Top 6%; 0.8%
22. Nature: 575 papers in training set; Top 15%; 0.8%
23. Frontiers in Genetics: 197 papers in training set; Top 10%; 0.7%
24. Molecular Systems Biology: 142 papers in training set; Top 2%; 0.7%
25. Patterns: 70 papers in training set; Top 3%; 0.7%
26. iScience: 1063 papers in training set; Top 34%; 0.7%
27. JCO Clinical Cancer Informatics: 18 papers in training set; Top 1.0%; 0.6%
28. Journal of Biomedical Informatics: 45 papers in training set; Top 2%; 0.6%
29. BMC Biology: 248 papers in training set; Top 7%; 0.5%
30. Development: 440 papers in training set; Top 4%; 0.5%
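
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order until the running total reaches the threshold. The short sketch below does this for the first five values shown above; the function name `journals_to_reach` is a hypothetical helper introduced for this example, not part of the site that produced the list.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative_total) at which the running sum first reaches threshold.

    probabilities must be in rank order, in percent. Returns None if the
    threshold is never reached.
    """
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


if __name__ == "__main__":
    # Predicted probabilities (percent) of the five top-ranked journals listed above.
    top_five = [34.5, 6.3, 6.3, 4.9, 4.0]
    rank, total = journals_to_reach(top_five)
    print(rank, round(total, 1))
```

The running totals are 34.5, 40.8, 47.1 and 52.0, so the threshold is first crossed at the fourth journal, consistent with the summary line.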