Implementation of Human-in-the-Loop ChatGPT-based Patient Screening Across Multiple Diverse Clinical Trials
Dohopolski, M.; Esselink, K.; Desai, N.; Grones, B.; Patel, T.; Jiang, S.; Peterson, E.; Navar, A. M.
Purpose: Manual screening for trial eligibility is inefficient and costly. We prospectively evaluated a large language model (LLM)-assisted prescreening workflow across multiple active clinical trials.

Methods: We deployed a retrieval-augmented generation (RAG) LLM pipeline across multiple trials at an academic medical center. The LLM used structured electronic health record data and free-text notes to classify each eligibility criterion as met, likely met, likely not met, not met, uncertain, or no documentation found, with accompanying rationale. Coordinators received a patient list sorted by LLM-derived eligibility and reviewed each case, documenting their assessment of individual criteria and the final prescreening status (success vs failure). Criterion-level performance--accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score--was calculated and tracked over time. Patient prescreening status was also evaluated as a function of the percentage of individual AI-labeled criteria met (60--80% and ≥80%).

Results: From October 2024 to September 2025, 39,182 patients were prescreened using the LLM workflow across 26 studies (21 oncology, 5 non-oncology) encompassing 112 distinct criteria. A total of 914 patients with a high likelihood of eligibility underwent coordinator review (5,096 criteria evaluated). Aggregated criterion-level performance was as follows: accuracy 0.94 (95% CI, 0.92--0.96), sensitivity 0.98 (0.97--0.99), specificity 0.81 (0.71--0.88), PPV 0.95 (0.92--0.97), NPV 0.93 (0.90--0.95), and F1 score 0.97 (0.95--0.97). Twenty-seven criterion prompts across 14 of 26 trials were automatically updated based on coordinator feedback. Compared with patients who had 60--80% of AI-labeled criteria classified as met or likely met, those with ≥80% were more likely to be reviewed by coordinators (93.7% [372/397] vs 55.1% [544/987]) and more likely to be labeled prescreening successes (43.5% [162/372] vs 19.1% [104/544]). The average cost was $0.12 per patient.

Conclusion: An LLM-assisted, human-in-the-loop prescreening workflow demonstrated high criterion-level performance at low cost across a diverse set of actively enrolling clinical trials. Structured coordinator feedback enabled an automated learning system, improving screening efficiency while preserving necessary human oversight.
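All of the criterion-level metrics reported in the Results can be derived from a single 2x2 confusion matrix comparing LLM criterion labels against coordinator review (treated as the gold standard). A minimal sketch in Python; the function name is illustrative and the counts used below are invented for demonstration, not the study's data:

```python
def criterion_metrics(tp, fp, tn, fn):
    """Criterion-level performance from a 2x2 confusion matrix.

    tp: LLM said met, coordinator agreed
    fp: LLM said met, coordinator disagreed
    tn: LLM said not met, coordinator agreed
    fn: LLM said not met, coordinator disagreed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),    # recall: met criteria the LLM caught
        "specificity": tn / (tn + fp),    # unmet criteria correctly rejected
        "ppv": tp / (tp + fp),            # precision
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts for one criterion: 80 true positives, 3 false
# positives, 15 true negatives, 2 false negatives.
metrics = criterion_metrics(tp=80, fp=3, tn=15, fn=2)
```

The F1 score uses the algebraically equivalent form 2TP / (2TP + FP + FN), which avoids computing precision and recall separately and is undefined-safe as long as TP > 0.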