
How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review

Brin, D.; Sorin, V.; Konen, E.; Glicksberg, B. S.; Nadkarni, G.; Klang, E.

2023-09-03 health informatics
10.1101/2023.09.03.23294842 medRxiv

ABSTRACT

Objective: The United States Medical Licensing Examination (USMLE) assesses physicians' competency, and passing it is a requirement to practice medicine in the U.S. With the emergence of large language models (LLMs) like ChatGPT and GPT-4, understanding their performance on these exams illuminates their potential in medical education and healthcare.

Materials and Methods: A literature search following the 2020 PRISMA guidelines was conducted, focusing on studies using official USMLE questions and publicly available LLMs.

Results: Three relevant studies were found, with GPT-4 showing the highest accuracy rates of 80-90% on the USMLE. Open-ended prompts typically outperformed multiple-choice ones, with 5-shot prompting slightly edging out zero-shot.

Conclusion: LLMs, especially GPT-4, display proficiency in tackling USMLE-standard questions. While the USMLE is a structured evaluation tool, it may not fully capture the expansive capabilities and limitations of LLMs in medical scenarios. As AI integrates further into healthcare, ongoing assessments against trusted benchmarks are essential.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank. Journal; papers in training set; percentile; predicted probability

1. BMJ Health & Care Informatics; 13 papers in training set; Top 0.1%; 10.4%
2. Healthcare; 16 papers in training set; Top 0.1%; 8.2%
3. International Journal of Medical Informatics; 25 papers in training set; Top 0.1%; 7.1%
4. BMC Medical Informatics and Decision Making; 39 papers in training set; Top 0.5%; 6.3%
5. PLOS Digital Health; 91 papers in training set; Top 0.4%; 6.3%
6. Journal of the American Medical Informatics Association; 61 papers in training set; Top 0.6%; 4.3%
7. PLOS ONE; 4510 papers in training set; Top 34%; 4.3%
8. Scientific Reports; 3102 papers in training set; Top 28%; 4.3%

50% of probability mass above

9. JMIR Medical Informatics; 17 papers in training set; Top 0.3%; 3.6%
10. Biology Methods and Protocols; 53 papers in training set; Top 0.3%; 3.6%
11. Computers in Biology and Medicine; 120 papers in training set; Top 0.9%; 3.6%
12. npj Digital Medicine; 97 papers in training set; Top 1%; 3.2%
13. Journal of Medical Internet Research; 85 papers in training set; Top 2%; 2.6%
14. JAMIA Open; 37 papers in training set; Top 0.6%; 2.6%
15. BMC Medical Education; 20 papers in training set; Top 0.5%; 1.9%
16. Frontiers in Digital Health; 20 papers in training set; Top 0.5%; 1.9%
17. Artificial Intelligence in Medicine; 15 papers in training set; Top 0.3%; 1.7%
18. JMIR Public Health and Surveillance; 45 papers in training set; Top 2%; 1.7%
19. Journal of Personalized Medicine; 28 papers in training set; Top 0.4%; 1.7%
20. Frontiers in Public Health; 140 papers in training set; Top 5%; 1.7%
21. JMIR Formative Research; 32 papers in training set; Top 0.9%; 1.5%
22. BMJ Open; 554 papers in training set; Top 10%; 1.3%
23. Journal of Biomedical Informatics; 45 papers in training set; Top 1%; 1.2%
24. Frontiers in Artificial Intelligence; 18 papers in training set; Top 0.7%; 0.8%
25. Data in Brief; 13 papers in training set; Top 0.4%; 0.8%
26. DIGITAL HEALTH; 12 papers in training set; Top 0.7%; 0.7%
27. Journal of General Internal Medicine; 20 papers in training set; Top 1%; 0.6%
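
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked from the listed percentages. The following is a minimal sketch, not the site's own method: the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages as displayed in the ranking, so the sums are approximate.

```python
# Predicted probabilities (%) copied from the ranked journal list, in rank order.
# These are the rounded values as displayed, so sums are approximate.
probabilities = [
    10.4, 8.2, 7.1, 6.3, 6.3, 4.3, 4.3, 4.3,   # ranks 1-8
    3.6, 3.6, 3.6, 3.2, 2.6, 2.6, 1.9, 1.9,    # ranks 9-16
    1.7, 1.7, 1.7, 1.7, 1.5, 1.3, 1.2,         # ranks 17-23
    0.8, 0.8, 0.7, 0.6,                        # ranks 24-27
]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative) for the shortest ranked prefix whose
    cumulative probability reaches `threshold`; (None, total) if never reached.
    Hypothetical helper for illustration only."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


count, total = journals_to_reach(probabilities, 50.0)
print(f"{count} journals reach {total:.1f}% of the probability mass")
print(f"All {len(probabilities)} listed journals cover {sum(probabilities):.1f}%")
```

With the displayed values, the threshold is first crossed at rank 8 (about 51.2%), which matches the cut-off marked in the list. The 27 listed journals together cover about 87.9%, with the remainder spread over journals not shown.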