
A Multi-Agent RAG Framework for Biomedical Literature Analysis

Palem, R. R.; Chen, H.; Yue, Z.

2026-05-29 bioinformatics
10.64898/2026.05.26.727050 bioRxiv

Background: The biomedical literature is expanding rapidly, with over 4,000 new articles indexed on PubMed each day. Clinicians and researchers frequently lack the time to review this volume before making decisions. Retrieval-Augmented Generation (RAG) systems attempt to bridge this gap by grounding language model responses in relevant documents, but standard implementations rank all retrieved passages solely by semantic similarity, treating a case report and a meta-analysis as equally authoritative.

Objective: We aimed to develop and pilot-evaluate a RAG variant that incorporates evidence quality and publication recency into the retrieval scoring function, and to determine whether these signals improve answer quality on biomedical questions compared with standard cosine similarity RAG and a full-context baseline.

Methods: We developed ET-RAG (Evidence-Temporal RAG), which scores each retrieved chunk using a weighted combination of cosine similarity (50%), evidence quality based on the GRADE hierarchy (30%), and temporal recency (20%). We evaluated ET-RAG alongside two baselines: a full-context agent powered by Gemini 2.0 Flash and a standard cosine RAG agent using GPT-4o-mini. All agents were tested on 40 benchmark questions (10 single-choice, 10 multiple-choice, 10 short-answer, and 10 long-answer) drawn from 10 peer-reviewed Alzheimer's disease papers published between 2021 and 2025.

Results: ET-RAG achieved the highest scores across all four question categories: single choice (0.90), multiple choice (0.74), short answer (0.92), and long answer (0.89), with a combined average of 0.86. Cosine RAG scored 0.80, 0.48, 0.82, and 0.69, respectively (average 0.70), while the full-context agent scored 0.60, 0.59, 0.71, and 0.53 (average 0.61). The full-context agent, despite having access to the entire corpus through Gemini's large context window, struggled with consistent answer extraction and was prone to rate limiting under heavy query loads. A control question on forestry was correctly rejected by all three agents, suggesting no hallucination on this control item.

Conclusions: In this pilot Alzheimer's disease benchmark, incorporating evidence quality and recency into RAG retrieval improved answer quality relative to pure cosine similarity retrieval and full-corpus prompting. The evidence-temporal scoring function is lightweight to implement and adds minimal computational overhead to existing vector search pipelines, but broader validation across domains and evidence levels, together with comparison against stronger retrieval baselines, is required before claims of generalizable biomedical reliability can be made.
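
The weighted scoring described in the Methods can be sketched as follows. The abstract states only the three weights (0.5, 0.3, 0.2). Everything else here is an assumption made for illustration: the mapping from study design to an evidence score, the linear recency decay, the reference year, and all names (`Chunk`, `et_score`, `rank_chunks`). It is not the authors' implementation.

```python
import math
from dataclasses import dataclass

# Weights as stated in the abstract: similarity, evidence quality, recency.
W_SIMILARITY, W_EVIDENCE, W_RECENCY = 0.5, 0.3, 0.2

# ASSUMPTION: illustrative scores ordered by study design. The abstract says
# evidence quality follows the GRADE hierarchy but gives no numeric mapping.
EVIDENCE_SCORE = {
    "meta_analysis": 1.0,
    "randomized_trial": 0.8,
    "cohort_study": 0.6,
    "case_report": 0.3,
}


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    study_type: str
    year: int


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(year, current_year=2025, horizon=10):
    """ASSUMPTION: linear decay from 1.0 (this year) to 0.0 after `horizon` years."""
    age = max(0, current_year - year)
    return max(0.0, 1.0 - age / horizon)


def et_score(query_embedding, chunk, current_year=2025):
    """Weighted sum of similarity, evidence quality, and recency."""
    return (
        W_SIMILARITY * cosine(query_embedding, chunk.embedding)
        + W_EVIDENCE * EVIDENCE_SCORE.get(chunk.study_type, 0.0)
        + W_RECENCY * recency(chunk.year, current_year)
    )


def rank_chunks(query_embedding, chunks, current_year=2025):
    """Return chunks sorted from highest to lowest combined score."""
    return sorted(
        chunks,
        key=lambda c: et_score(query_embedding, c, current_year),
        reverse=True,
    )


# Two chunks with identical similarity to the query: under pure cosine ranking
# they tie, while the combined score prefers the newer, stronger study design.
query = [1.0, 0.0]
chunks = [
    Chunk("case report text", [1.0, 0.0], "case_report", 2021),
    Chunk("meta-analysis text", [1.0, 0.0], "meta_analysis", 2024),
]
ranked = rank_chunks(query, chunks)
```

Under these assumed mappings the meta-analysis chunk ranks first, which is the behavior the abstract contrasts with similarity-only retrieval.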

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.1%; 18.5%
2. Bioinformatics Advances: 184 papers in training set; Top 0.2%; 8.4%
3. Bioinformatics: 1061 papers in training set; Top 3%; 8.4%
4. PLOS ONE: 4510 papers in training set; Top 28%; 6.3%
5. GigaScience: 172 papers in training set; Top 0.3%; 4.8%
6. npj Digital Medicine: 97 papers in training set; Top 1%; 4.3%

50% of probability mass above

7. Journal of Biomedical Informatics: 45 papers in training set; Top 0.4%; 3.9%
8. BMC Bioinformatics: 383 papers in training set; Top 3%; 3.6%
9. Computers in Biology and Medicine: 120 papers in training set; Top 0.9%; 3.6%
10. Database: 51 papers in training set; Top 0.2%; 3.1%
11. BioData Mining: 15 papers in training set; Top 0.1%; 2.6%
12. Scientific Reports: 3102 papers in training set; Top 50%; 2.1%
13. Nucleic Acids Research: 1128 papers in training set; Top 11%; 1.7%
14. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 4%; 1.7%
15. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.3%; 1.7%
16. International Journal of Medical Informatics: 25 papers in training set; Top 1.0%; 1.5%
17. PLOS Computational Biology: 1633 papers in training set; Top 21%; 0.9%
18. iScience: 1063 papers in training set; Top 27%; 0.9%
19. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 2%; 0.9%
20. The Lancet Digital Health: 25 papers in training set; Top 0.9%; 0.9%
21. Journal of Medical Internet Research: 85 papers in training set; Top 4%; 0.9%
22. JMIR Medical Informatics: 17 papers in training set; Top 1%; 0.9%
23. Journal of Translational Medicine: 46 papers in training set; Top 2%; 0.9%
24. Research Synthesis Methods: 20 papers in training set; Top 0.2%; 0.8%
25. Artificial Intelligence in the Life Sciences: 11 papers in training set; Top 0.3%; 0.7%
26. JAMIA Open: 37 papers in training set; Top 2%; 0.7%
27. Scientific Data: 174 papers in training set; Top 2%; 0.7%
28. BMC Medical Research Methodology: 43 papers in training set; Top 1%; 0.7%
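
The 50% cutoff marked in the list above can be reproduced by accumulating the listed percentages in rank order. This is a small sketch that assumes the final percentage on each entry is that journal's predicted probability; the function name `journals_to_reach` is hypothetical.

```python
def journals_to_reach(probabilities, target=50.0):
    """Return (count, cumulative %) for the first prefix whose sum reaches target.

    Returns None if the target is never reached.
    """
    total = 0.0
    for count, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= target:
            return count, total
    return None


# Percentages for ranks 1 to 8, copied from the list above.
top_probabilities = [18.5, 8.4, 8.4, 6.3, 4.8, 4.3, 3.9, 3.6]
result = journals_to_reach(top_probabilities)
```

The first five journals sum to 46.4%, so the sixth is the one that carries the cumulative total past the 50% mark.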