Collaborative large language models (LLMs) are all you need for screening in systematic reviews

Parmar, M.; Naqvi, S. A. A.; Warraich, K.; Saeidi, A.; Rawal, S.; Faisal, K. S.; Kazmi, S. Z.; Fatima, M.; He, H.; Safdar, M.; Liu, W.; Haddad, T.; Wang, Z.; Murad, M. H.; Baral, C.; Riaz, I. B.

2026-02-17 health informatics
10.64898/2026.02.07.26345640 medRxiv
Background: The ability of large language models (LLMs) to work collaboratively and screen studies in a systematic review (SR) is under-explored. We therefore aimed to evaluate the effectiveness of LLMs in automating the screening process in systematic reviews.

Methods: This observational study included labeled data (titles and abstracts) for five SRs. In the original reviews, two reviewers screened the citations independently for eligibility, and a third reviewer cross-checked each citation for quality assurance. GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 were applied using zero-shot chain-of-thought prompting. Collaborative approaches included (i) conflict resolution using the benefit of the doubt, (ii) majority voting using an independent third LLM, and (iii) conflict resolution using an informed third LLM. Performance was assessed using accuracy, precision for exclusion, and recall for inclusion. Work saved over sampling (WSS) was computed to estimate the reduction in manual human effort.

Results: A total of 11,300 articles were included in this study. The individual models GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 exhibited high precision for exclusion (99.7%, 99.7%, and 99.2%, respectively) and high recall for inclusion (95.5%, 96.6%, and 85.7%, respectively). The collaborative approach using the two best-performing models (GPT-4 and Claude-3-Sonnet), however, achieved an average precision of 99.9% and an average recall of 98.5% across all collaborative approaches. Furthermore, the proposed collaborative approach yielded an average WSS of 63.5%, compared with 45.2% for the individual models. Conversational LLM interactions showed a consistent pattern of results.

Limitations: This study was limited by its reliance on proprietary models and its evaluation on oncology datasets.

Conclusion: The evidence shows that collaborative LLMs enable efficient, high-performing screening in systematic reviews, supporting continuous evidence updates.

Primary funding source: NIH (U24CA265879-01-1) and the Carolyn-Ann-Kennedy-Bacon Fund.
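The two mechanics the abstract describes can be sketched briefly: resolving each citation by majority vote across three screeners, then scoring the result with precision for exclusion, recall for inclusion, and WSS. This is a minimal illustration, not the authors' code; the label convention (1 = include, 0 = exclude) and the WSS formulation (WSS = (TN + FN)/N - (1 - recall), the common Cohen et al. definition) are assumptions, as the paper may define these differently.

```python
from collections import Counter

def majority_vote(votes):
    """Label chosen by at least two of the three screeners (1 = include)."""
    return Counter(votes).most_common(1)[0][0]

def screening_metrics(y_true, y_pred):
    """Precision for exclusion, recall for inclusion, and WSS.

    WSS = (TN + FN)/N - (1 - recall): the assumed estimate of the
    fraction of manual screening effort saved at the achieved recall.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)
    recall_incl = tp / (tp + fn) if tp + fn else 0.0
    precision_excl = tn / (tn + fn) if tn + fn else 0.0
    wss = (tn + fn) / n - (1 - recall_incl)
    return precision_excl, recall_incl, wss

# Toy run: three model verdicts per citation, resolved by majority vote.
model_votes = [(1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
truth = [1, 0, 1, 0]
preds = [majority_vote(v) for v in model_votes]
print(preds)                             # [1, 0, 1, 0]
print(screening_metrics(truth, preds))   # (1.0, 1.0, 0.5)
```

On this toy data the vote excludes half the citations without losing any true inclusions, so WSS is 0.5; the paper's reported WSS of 63.5% corresponds to an even larger fraction of citations being screened out automatically.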

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Research Synthesis Methods | 20 | Top 0.1% | 23.6%
2 | Journal of Clinical Epidemiology | 28 | Top 0.1% | 19.5%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 18.3%
(50% of probability mass above this line)
4 | PLOS ONE | 4510 | Top 37% | 3.8%
5 | Journal of Biomedical Informatics | 45 | Top 0.5% | 2.9%
6 | BMJ Open | 554 | Top 7% | 2.6%
7 | Journal of Medical Internet Research | 85 | Top 2% | 2.2%
8 | JAMIA Open | 37 | Top 0.7% | 2.0%
9 | BMC Medicine | 163 | Top 3% | 1.9%
10 | Trials | 25 | Top 0.7% | 1.8%
11 | BMC Medical Research Methodology | 43 | Top 0.6% | 1.7%
12 | BMJ Health & Care Informatics | 13 | Top 0.7% | 1.0%
13 | Bioinformatics | 1061 | Top 8% | 1.0%
14 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
15 | Healthcare | 16 | Top 2% | 0.8%
16 | JAMA | 17 | Top 0.4% | 0.7%
17 | Neuroscience & Biobehavioral Reviews | 43 | Top 1% | 0.5%
18 | Cancer Medicine | 24 | Top 2% | 0.5%
19 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.5%
20 | Scientific Reports | 3102 | Top 79% | 0.5%
21 | Nature Communications | 4913 | Top 66% | 0.5%
22 | PLOS Biology | 408 | Top 24% | 0.5%
23 | International Journal of Medical Informatics | 25 | Top 2% | 0.5%