Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Johri, S.; Jeong, J.; Tran, B. A.; Schlessinger, D. I.; Wongvibulsin, S.; Cai, Z. R.; Daneshjou, R.; Rajpurkar, P.

2024-01-23 dermatology
10.1101/2023.09.12.23295399 medRxiv
The integration of Large Language Models (LLMs) like GPT-4 and GPT-3.5 into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy. Based on these findings, we propose a comprehensive set of guidelines for future evaluations of clinical LLMs. These guidelines emphasize realistic doctor-patient conversations, comprehensive history taking, open-ended questioning, and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Frontiers in Public Health | 140 papers in training set | Top 0.1% | 23.3%
2 | PLOS ONE | 4510 papers in training set | Top 17% | 10.4%
3 | Scientific Reports | 3102 papers in training set | Top 8% | 8.7%
4 | npj Digital Medicine | 97 papers in training set | Top 1% | 3.7%
5 | eLife | 5422 papers in training set | Top 33% | 2.5%
6 | Informatics in Medicine Unlocked | 21 papers in training set | Top 0.3% | 2.4%
50% of probability mass above
7 | Computers in Biology and Medicine | 120 papers in training set | Top 1% | 2.1%
8 | PLOS Digital Health | 91 papers in training set | Top 1% | 2.0%
9 | iScience | 1063 papers in training set | Top 12% | 1.8%
10 | Nature Communications | 4913 papers in training set | Top 51% | 1.7%
11 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 1.5%
12 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 5% | 1.4%
13 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.4%
14 | Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.4%
15 | Bioinformatics Advances | 184 papers in training set | Top 3% | 1.4%
16 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.3%
17 | BMC Cancer | 52 papers in training set | Top 2% | 1.1%
18 | GigaScience | 172 papers in training set | Top 2% | 1.0%
19 | IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 2% | 1.0%
20 | Frontiers in Immunology | 586 papers in training set | Top 6% | 0.9%
21 | Frontiers in Medicine | 113 papers in training set | Top 5% | 0.9%
22 | Frontiers in Psychiatry | 83 papers in training set | Top 3% | 0.8%
23 | Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.8% | 0.8%
24 | Data in Brief | 13 papers in training set | Top 0.3% | 0.8%
25 | International Journal of Medical Informatics | 25 papers in training set | Top 1% | 0.8%
26 | PLOS Computational Biology | 1633 papers in training set | Top 24% | 0.8%
27 | Royal Society Open Science | 193 papers in training set | Top 4% | 0.8%
28 | Frontiers in Digital Health | 20 papers in training set | Top 1% | 0.8%
29 | Cureus | 67 papers in training set | Top 5% | 0.8%
30 | Bulletin of Mathematical Biology | 84 papers in training set | Top 2% | 0.8%
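
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running total over the listed percentages. The sketch below is only an illustration: the function name `journals_to_reach` is hypothetical, and it assumes the cutoff is a plain cumulative sum over journals sorted by predicted probability, which the page does not state explicitly.

```python
from itertools import accumulate

# Predicted probabilities (%) of the six top-ranked journals, copied from the list.
top_probabilities = [23.3, 10.4, 8.7, 3.7, 2.5, 2.4]


def journals_to_reach(probabilities, threshold=50.0):
    """Count how many top-ranked entries are needed for the running total
    to reach `threshold`; return None if the total never gets there.

    Hypothetical helper: assumes `probabilities` is already sorted in
    descending order, as in the ranking.
    """
    for count, running_total in enumerate(accumulate(probabilities), start=1):
        if running_total >= threshold:
            return count
    return None


print(journals_to_reach(top_probabilities))
```

With the values listed, the running total is still below 50% after five journals (about 48.6%) and passes it with the sixth (about 51.0%), which matches the position of the "50% of probability mass above" marker.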