Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases

Cesur, T.; Gunes, Y. C.; Camur, E.; Dagli, M.

2024-06-25 radiology and imaging
10.1101/2024.06.25.24309247
Abstract

Purpose: This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 large language models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology, and investigated the impact of ChatGPT-4o assistance on radiologist performance.

Materials and Methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. The LLMs and Radiologist-III were provided with text-based information only, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Score) were analyzed using the chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.

Results: Unassisted diagnostic accuracy was 72.5% for the cardiac radiologist, 53.8% for General Radiologist-I, and 51.3% for General Radiologist-II. With ChatGPT-4o assistance, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively; the improvements for General Radiologists-I and II were statistically significant (P≤0.006). All radiologists' DDx Scores improved significantly with ChatGPT-4o assistance (P≤0.05). Notably, Radiologist-I's ChatGPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the cardiac radiologist's unassisted performance (P>0.05). Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). On the DDx Score, Claude 3 Opus outperformed all other models and Radiologist-III (P<0.05). General Radiologist-III's accuracy improved significantly from 48.8% to 63.8% with ChatGPT-4o assistance (P<0.001).

Conclusion: ChatGPT-4o may enhance the diagnostic performance of general radiologists in cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.
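The paired before/after comparison described above (the same radiologist reading the same 80 cases with and without assistance) is the setting for the McNemar test the authors list. As a minimal sketch, an exact two-sided McNemar test needs only the two discordant-pair counts; the counts below are hypothetical and chosen merely to be arithmetically consistent with General Radiologist-I's reported 53.8% (43/80) unassisted vs 70.0% (56/80) assisted accuracy, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value for paired binary outcomes.

    b = cases answered correctly only WITHOUT assistance,
    c = cases answered correctly only WITH assistance.
    Under H0 the discordant pairs are Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p: doubled lower-tail probability, clamped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical split of General Radiologist-I's discordant cases:
# 2 correct only unassisted, 15 correct only with ChatGPT-4o
# (net gain 13 cases = 56 - 43 correct out of 80).
p = mcnemar_exact(2, 15)
print(f"McNemar exact p = {p:.4f}")
```

With these illustrative counts the test rejects at the 0.05 level, matching the direction of the significance the abstract reports; the actual discordant counts would have to come from the paper's data.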

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

 1. European Radiology (based on 11 papers, Top 0.1%): 34.4%
 2. Diagnostics (based on 36 papers, Top 0.4%): 5.5%
 3. PLOS ONE (based on 1737 papers, Top 67%): 5.3%
 4. Scientific Reports (based on 701 papers, Top 40%): 4.7%
 5. Annals of Translational Medicine (based on 14 papers, Top 0.4%): 4.7%
    (50% of predicted probability mass above this point)
 6. PLOS Digital Health (based on 88 papers, Top 4%): 3.1%
 7. Cureus (based on 64 papers, Top 5%): 2.9%
 8. Medicine (based on 29 papers, Top 3%): 2.4%
 9. npj Digital Medicine (based on 85 papers, Top 7%): 2.4%
10. Computers in Biology and Medicine (based on 39 papers, Top 3%): 1.9%
11. Heliyon (based on 57 papers, Top 4%): 1.9%
12. Journal of the American Medical Informatics Association (based on 53 papers, Top 4%): 1.8%
13. Stroke: Vascular and Interventional Neurology (based on 12 papers, Top 1%): 1.8%
14. Informatics in Medicine Unlocked (based on 11 papers, Top 1%): 1.6%
15. Journal of Magnetic Resonance Imaging (based on 10 papers, Top 2%): 1.4%
16. Journal of Clinical Medicine (based on 77 papers, Top 13%): 1.2%
17. Frontiers in Oncology (based on 34 papers, Top 5%): 0.8%
18. Archives of Clinical and Biomedical Research (based on 18 papers, Top 2%): 0.8%
19. JMIRx Med (based on 29 papers, Top 6%): 0.7%
20. BMC Cancer (based on 21 papers, Top 5%): 0.7%
21. Neuro-Oncology Advances (based on 14 papers, Top 2%): 0.7%
22. The Lancet Digital Health (based on 25 papers, Top 5%): 0.7%
23. Cancers (based on 57 papers, Top 7%): 0.7%
24. Brain and Behavior (based on 19 papers, Top 5%): 0.7%
25. Magnetic Resonance in Medicine (based on 11 papers, Top 1%): 0.7%
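As a quick sanity check, the claim that the top 5 journals cover 50% of the predicted probability mass follows directly from the listed percentages:

```python
# Predicted journal probabilities (%) in ranked order, copied from the list above.
probs = [34.4, 5.5, 5.3, 4.7, 4.7, 3.1, 2.9, 2.4, 2.4, 1.9,
         1.9, 1.8, 1.8, 1.6, 1.4, 1.2, 0.8, 0.8, 0.7, 0.7,
         0.7, 0.7, 0.7, 0.7, 0.7]

# Count how many top-ranked journals are needed to reach 50% cumulative mass.
cumulative, n = 0.0, 0
for p in probs:
    cumulative += p
    n += 1
    if cumulative >= 50.0:
        break
print(n, round(cumulative, 1))  # top 5 journals reach 54.6%
```

The first four journals sum to 49.9%, so the fifth entry is what pushes the cumulative mass past the 50% threshold marked in the list.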