Evaluating the Reliability of a Custom GPT in Full-Text Screening of a Systematic Review
Davis, R. C.; List, S. S.; Chappell, K. G.; Heen, E.
Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). There are several AI options to assist with title/abstract screening; however, options for full-text screening (FTS) are limited. The objective of this study was to evaluate the reliability of a custom GPT (cGPT) for FTS. A cGPT powered by OpenAI's ChatGPT-4o was trained and tested with a subset of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT as an autonomous and assistant reviewer. Cohen's kappa was used to assess interrater agreement. The threshold for practical use was defined as a cGPT-human kappa score exceeding the lower bound of the confidence interval (CI) for the lowest human-human kappa score in inclusion/exclusion and exclusion-reason decisions. cGPT as an assistant reviewer met this reliability threshold. With the Cohen's kappa CI for human-human pairs ranging from 0.658 to 1.00 in the inclusion/exclusion decision, assistant cGPT-human kappa scores were encompassed in two of four pairings. In exclusion-reason classification, the benchmark human-human kappa score CI range was 0.606 to 0.912; assistant cGPT-human kappa scores were encompassed in one of four pairings. cGPT as an autonomous reviewer did not meet reliability thresholds. cGPT as an assistant could speed up systematic reviewing in a sufficiently reliable way. More research is needed to establish standardized thresholds for practical use. While the current study dealt with physiological population parameters, cGPTs can assist in FTS of systematic reviews in any field.

Highlights
- There are several AI options to assist in title/abstract screening in systematic reviewing; however, options for full-text screening are limited.
- The reliability of a tailor-made AI model in the form of a custom GPT was explored in the role of an assistant to a human reviewer and as an autonomous reviewer.
- Interrater agreement was sufficient when the model operated in the role of assistant reviewer but not in the role of autonomous reviewer. Here the model misclassified two articles out of ten, whereas the human reviewers erred in approximately one out of ten articles.
- The study shows that it is possible to craft a custom GPT as a useful assistant in systematic reviews. cGPTs can be crafted to assist in reviews in any field.
- An automated setup for inputting articles and coding cGPT responses is needed to maximize the potential time-saving benefit.
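The study's reliability metric, Cohen's kappa, measures interrater agreement corrected for chance. A minimal sketch of the computation is shown below; the screening decisions are hypothetical illustrations, not the study's actual data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed proportion of agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical inclusion/exclusion decisions for ten articles
human = ["include", "exclude", "exclude", "include", "exclude",
         "include", "exclude", "exclude", "include", "exclude"]
cgpt  = ["include", "exclude", "include", "include", "exclude",
         "include", "exclude", "exclude", "exclude", "exclude"]
print(round(cohen_kappa(human, cgpt), 3))  # 0.583
```

Here the two raters agree on 8 of 10 articles (p_o = 0.8), but because both include roughly 40% of articles, chance alone would yield p_e = 0.52, giving kappa = (0.8 - 0.52) / (1 - 0.52) ≈ 0.583 — below the study's lowest human-human CI bound of 0.658.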