Performance of Large Language Models in Automated Medical Literature Screening: A Systematic Review and Meta-analysis

Chenggong, X.; Weichang, K.; Liuting, P.; Diaoxin, Q.; Yuxuan, Y.; Bin, W.; Liang, H.

2026-03-19 epidemiology

10.64898/2026.03.17.26348656 medRxiv


Objective: To systematically evaluate the diagnostic performance of large language models (LLMs) in automated medical literature screening and to determine their potential role in supporting evidence synthesis workflows.

Methods: A systematic review and meta-analysis was conducted according to PRISMA-DTA guidance. PubMed, Web of Science, Embase, the Cochrane Library and Google Scholar were searched from 1 January 2022 to 17 November 2025. Studies assessing LLMs for automated title and abstract screening or full-text eligibility assessment in medical literature were included. Diagnostic accuracy metrics were extracted and pooled using a bivariate random-effects model and hierarchical summary receiver operating characteristic (HSROC) analysis. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity.

Results: Eighteen studies published between 2023 and 2025 were included. In title and abstract screening, the pooled sensitivity was 0.92 and the pooled specificity was 0.94, with an SROC area under the curve (AUC) of 0.98. In full-text screening, pooled sensitivity and specificity both reached 0.99, and the AUC was 0.99. Prompt strategies incorporating examples or chain-of-thought reasoning significantly improved sensitivity. Across studies, most models were deployed without task-specific fine-tuning and still achieved strong performance. Subgroup analyses and meta-regression did not identify significant sources of heterogeneity. Many studies also reported substantial efficiency gains, including large reductions in screening workload, time and cost.

Conclusion: LLMs demonstrate high diagnostic accuracy for automated medical literature screening, particularly in full-text assessment. These models show strong potential as high-sensitivity assistive tools that can substantially reduce manual screening burden while supporting evidence synthesis. Further methodological optimization and validation in large-scale real-world settings are required to establish their long-term role in evidence-based medicine.
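For readers less familiar with diagnostic accuracy metrics, the sketch below shows how sensitivity and specificity are conventionally computed for a single screening study from a 2x2 table of LLM decisions against the human reference standard. It is an illustration only, not the authors' analysis code: the `ScreeningCounts` type and the counts are hypothetical, and the bivariate random-effects pooling described in the abstract is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """2x2 table for one study (hypothetical structure, not from the paper)."""
    tp: int  # relevant records the LLM included
    fn: int  # relevant records the LLM excluded
    fp: int  # irrelevant records the LLM included
    tn: int  # irrelevant records the LLM excluded


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly relevant records that the LLM kept."""
    relevant = c.tp + c.fn
    if relevant == 0:
        raise ValueError("no relevant records in the reference standard")
    return c.tp / relevant


def specificity(c: ScreeningCounts) -> float:
    """Share of truly irrelevant records that the LLM excluded."""
    irrelevant = c.tn + c.fp
    if irrelevant == 0:
        raise ValueError("no irrelevant records in the reference standard")
    return c.tn / irrelevant


# Invented counts for illustration; they are not data from any included study.
example = ScreeningCounts(tp=45, fn=5, fp=95, tn=855)

if __name__ == "__main__":
    print(f"sensitivity = {sensitivity(example):.2f}")
    print(f"specificity = {specificity(example):.2f}")
```

In a screening context, sensitivity is usually the metric to protect, because a missed relevant record (a false negative) cannot be recovered later in the review, whereas a false positive only costs additional human checking.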
