Collaborative large language models (LLMs) are all you need for screening in systematic reviews

Parmar, M.; Naqvi, S. A. A.; Warraich, K.; Saeidi, A.; Rawal, S.; Faisal, K. S.; Kazmi, S. Z.; Fatima, M.; He, H.; Safdar, M.; Liu, W.; Haddad, T.; Wang, Z.; Murad, M. H.; Baral, C.; Riaz, I. B.

2026-02-17 health informatics
10.64898/2026.02.07.26345640 medRxiv
Background: The ability of large language models (LLMs) to work collaboratively and screen studies in a systematic review (SR) is under-explored. We therefore aimed to evaluate the effectiveness of LLMs in automating the screening process in systematic reviews.

Methods: This observational study included labeled data (titles and abstracts) for five SRs. In the original reviews, two reviewers screened the citations independently for eligibility, and a third reviewer cross-checked each citation for quality assurance. GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 were applied using zero-shot chain-of-thought prompting. Collaborative approaches included (i) conflict resolution using the benefit of the doubt, (ii) majority voting using an independent third LLM, and (iii) conflict resolution using an informed third LLM. Performance was assessed using accuracy, precision for exclusion, and recall for inclusion. Work saved over sampling (WSS) was computed to estimate the reduction in manual human effort.

Results: A total of 11,300 articles were included in this study. The individual models GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 exhibited high precision for exclusion (99.7%, 99.7%, and 99.2%, respectively) and high recall for inclusion (95.5%, 96.6%, and 85.7%, respectively). The collaborative approach using the two best-performing models (GPT-4 and Claude-3-Sonnet), however, achieved an average precision of 99.9% and an average recall of 98.5% across all collaborative approaches. Furthermore, the proposed collaborative approach yielded an average WSS of 63.5%, compared with 45.2% for the individual models. Conversational LLM interactions showed a consistent pattern of results.

Limitations: This study was limited by its reliance on proprietary models and its evaluation on oncology datasets.

Conclusion: The evidence shows that collaborative LLMs enable efficient, high-performing screening in systematic reviews, supporting continuous evidence updates.

Primary funding source: NIH (U24CA265879-01-1) and the Carolyn-Ann-Kennedy-Bacon Fund.
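The two mechanics the abstract describes can be sketched briefly: resolving each citation by majority vote across three screeners, then scoring the result with precision for exclusion, recall for inclusion, and WSS. This is a minimal illustration, not the authors' code; the label convention (1 = include, 0 = exclude) and the WSS formulation (WSS = (TN + FN)/N - (1 - recall), the common Cohen et al. definition) are assumptions, as the paper may define these differently.

```python
from collections import Counter

def majority_vote(votes):
    """Label chosen by at least two of the three screeners (1 = include)."""
    return Counter(votes).most_common(1)[0][0]

def screening_metrics(y_true, y_pred):
    """Precision for exclusion, recall for inclusion, and WSS.

    WSS = (TN + FN)/N - (1 - recall): the assumed estimate of the
    fraction of manual screening effort saved at the achieved recall.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)
    recall_incl = tp / (tp + fn) if tp + fn else 0.0
    precision_excl = tn / (tn + fn) if tn + fn else 0.0
    wss = (tn + fn) / n - (1 - recall_incl)
    return precision_excl, recall_incl, wss

# Toy run: three model verdicts per citation, resolved by majority vote.
model_votes = [(1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
truth = [1, 0, 1, 0]
preds = [majority_vote(v) for v in model_votes]
print(preds)                             # [1, 0, 1, 0]
print(screening_metrics(truth, preds))   # (1.0, 1.0, 0.5)
```

On this toy data the vote excludes half the citations without losing any true inclusions, so WSS is 0.5; the paper's reported WSS of 63.5% corresponds to an even larger fraction of citations being screened out automatically.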

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Research Synthesis Methods | 20 | Top 0.1% | 23.6%
2 | Journal of Clinical Epidemiology | 28 | Top 0.1% | 19.5%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 18.3%
(50% of probability mass above this line)
4 | PLOS ONE | 4510 | Top 37% | 3.8%
5 | Journal of Biomedical Informatics | 45 | Top 0.5% | 2.9%
6 | BMJ Open | 554 | Top 7% | 2.6%
7 | Journal of Medical Internet Research | 85 | Top 2% | 2.2%
8 | JAMIA Open | 37 | Top 0.7% | 2.0%
9 | BMC Medicine | 163 | Top 3% | 1.9%
10 | Trials | 25 | Top 0.7% | 1.8%
11 | BMC Medical Research Methodology | 43 | Top 0.6% | 1.7%
12 | BMJ Health & Care Informatics | 13 | Top 0.7% | 1.0%
13 | Bioinformatics | 1061 | Top 8% | 1.0%
14 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
15 | Healthcare | 16 | Top 2% | 0.8%
16 | JAMA | 17 | Top 0.4% | 0.7%
17 | Neuroscience & Biobehavioral Reviews | 43 | Top 1% | 0.5%
18 | Cancer Medicine | 24 | Top 2% | 0.5%
19 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.5%
20 | Scientific Reports | 3102 | Top 79% | 0.5%
21 | Nature Communications | 4913 | Top 66% | 0.5%
22 | PLOS Biology | 408 | Top 24% | 0.5%
23 | International Journal of Medical Informatics | 25 | Top 2% | 0.5%