How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review
Brin, D.; Sorin, V.; Konen, E.; Glicksberg, B. S.; Nadkarni, G.; Klang, E.
ABSTRACT
Objective: The United States Medical Licensing Examination (USMLE) assesses physicians' competency, and passing is a requirement to practice medicine in the U.S. With the emergence of large language models (LLMs) like ChatGPT and GPT-4, understanding their performance on these exams illuminates their potential in medical education and healthcare.
Materials and Methods: A literature search following the 2020 PRISMA guidelines was conducted, focusing on studies using official USMLE questions and publicly available LLMs.
Results: Three relevant studies were found, with GPT-4 showing the highest accuracy rates of 80-90% on the USMLE. Open-ended prompts typically outperformed multiple-choice ones, with 5-shot prompting slightly edging out zero-shot.
Conclusion: LLMs, especially GPT-4, display proficiency in tackling USMLE-standard questions. While the USMLE is a structured evaluation tool, it may not fully capture the expansive capabilities and limitations of LLMs in medical scenarios. As AI integrates further into healthcare, ongoing assessments against trusted benchmarks are essential.