Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini

Fazeli, M. S.; Kasireddy, E.; Pourrahmat, M.-M.; Chow, C.; Collet, J. P.

2026-05-20 health informatics
10.64898/2026.05.15.26353334 medRxiv

Background: Systematic literature reviews (SLRs) are essential in medical research but are often time-consuming and costly, creating a need for more efficient methods that maintain accuracy.

Objective: This study assessed the performance of the GPT-4o mini large language model (LLM) in automating the first phase of study selection in systematic reviews, which is based on titles and abstracts. Specifically, we evaluated whether the model improved efficiency without compromising quality.

Methods: Structured prompts were created for GPT-4o mini to facilitate title and abstract screening. The model's performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value.

Results: The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and a specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model screened 1,000 titles and abstracts in 40 minutes, compared with the 16 hours required by a human reviewer.

Conclusion: This study demonstrated strong performance and efficiency in automating title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.
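The reported metrics can be cross-checked with a little arithmetic. The following is a minimal sketch, not the authors' analysis code: the abstract gives the true positive count (549), the false positive count (2,218), and the total (15,605), but not the false negatives or true negatives. Here those two counts are back-calculated from the rounded 83.2% sensitivity, so they are approximations and should not be read as reported data. The class name `ConfusionCounts` and all variable names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing model decisions with human reviewer decisions."""
    tp: int  # included by both model and humans
    fp: int  # included by model, excluded by humans
    fn: int  # excluded by model, included by humans
    tn: int  # excluded by both

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Values stated in the abstract.
TOTAL = 15_605
TP = 549
FP = 2_218
REPORTED_SENSITIVITY = 0.832  # rounded value from the abstract

# ASSUMPTION: FN and TN are not reported; they are estimated here from the
# rounded sensitivity, so they may differ slightly from the study's counts.
human_includes = round(TP / REPORTED_SENSITIVITY)  # estimated TP + FN
FN = human_includes - TP
TN = TOTAL - TP - FP - FN

counts = ConfusionCounts(tp=TP, fp=FP, fn=FN, tn=TN)

for name in ("sensitivity", "specificity", "accuracy", "ppv", "npv"):
    print(f"{name}: {100 * getattr(counts, name):.1f}%")

# Screening time per 1,000 records: 16 hours (human) vs 40 minutes (model).
speedup = (16 * 60) / 40
print(f"speedup: {speedup:.0f}x")
```

Under these assumptions the recomputed values agree with the abstract to one decimal place. The low positive predictive value follows from the low share of relevant records: with only about 4% of records included by human reviewers, even a specificity of about 85% yields roughly four false positives for every true positive.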

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Research Synthesis Methods (20 papers in training set; Top 0.1%): 23.2%
2. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.1%): 18.8%
3. Journal of Clinical Epidemiology (28 papers in training set; Top 0.1%): 10.4%

50% of probability mass above

4. PLOS ONE (4510 papers in training set; Top 24%): 7.0%
5. JAMIA Open (37 papers in training set; Top 0.3%): 4.5%
6. BMC Medicine (163 papers in training set; Top 1%): 3.7%
7. BMC Medical Research Methodology (43 papers in training set; Top 0.3%): 3.0%
8. Journal of Biomedical Informatics (45 papers in training set; Top 0.5%): 2.8%
9. BMJ Open (554 papers in training set; Top 8%): 1.9%
10. Journal of Medical Internet Research (85 papers in training set; Top 3%): 1.7%
11. Scientific Reports (3102 papers in training set; Top 63%): 1.4%
12. BMC Medical Informatics and Decision Making (39 papers in training set; Top 2%): 1.3%
13. Healthcare (16 papers in training set; Top 1%): 1.3%
14. JMIR Medical Informatics (17 papers in training set; Top 1%): 0.9%
15. BMJ Health & Care Informatics (13 papers in training set; Top 0.8%): 0.8%
16. Trials (25 papers in training set; Top 2%): 0.8%
17. Computers in Biology and Medicine (120 papers in training set; Top 5%): 0.7%
18. BMC Research Notes (29 papers in training set; Top 0.7%): 0.7%
19. Wellcome Open Research (57 papers in training set; Top 3%): 0.7%
20. BMC Bioinformatics (383 papers in training set; Top 8%): 0.5%
21. Artificial Intelligence in Medicine (15 papers in training set; Top 0.9%): 0.5%
22. Bioinformatics (1061 papers in training set; Top 11%): 0.5%
23. International Journal of Medical Informatics (25 papers in training set; Top 2%): 0.5%
24. PLOS Biology (408 papers in training set; Top 24%): 0.5%
25. Nature Communications (4913 papers in training set; Top 67%): 0.5%