
Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases

Cesur, T.; Gunes, Y. C.; Camur, E.; Dagli, M.

2024-06-25 · Radiology and Imaging
medRxiv · DOI: 10.1101/2024.06.25.24309247
Abstract

Purpose: This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 large language models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated.

Materials and Methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. The LLMs and Radiologist-III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Scores) were analyzed using the chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.

Results: Unassisted diagnostic accuracy was 72.5% for the cardiac radiologist, 53.8% for General Radiologist-I, and 51.3% for General Radiologist-II. With ChatGPT-4o assistance, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively; the improvements for General Radiologists I and II were statistically significant (P≤0.006). All radiologists' DDx Scores improved significantly with ChatGPT-4o assistance (P≤0.05). Remarkably, Radiologist-I's GPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the cardiac radiologist's unassisted performance (P>0.05). Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). On the DDx Score, Claude 3 Opus outperformed all other models and Radiologist-III (P<0.05). The accuracy of General Radiologist-III improved significantly from 48.8% to 63.8% with GPT-4o assistance (P<0.001).

Conclusion: ChatGPT-4o may enhance the diagnostic performance of general radiologists in cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.
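As a hedged illustration of the paired comparison described in the methods, the sketch below runs McNemar's test on per-case correctness for one reader with and without ChatGPT-4o assistance over 80 cases. The 0/1 correctness vectors are invented placeholders (not the study's data), and the rates are chosen only to roughly echo the reported baseline; it shows the form such an analysis could take, assuming Python with NumPy and statsmodels.

```python
# Minimal sketch of a paired McNemar analysis on hypothetical data.
# The correctness vectors below are invented placeholders, not the
# study's actual per-case results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_cases = 80

# Hypothetical per-case correctness (1 = correct diagnosis) for the same
# 80 cases, read unassisted and then with ChatGPT-4o assistance.
unassisted = rng.binomial(1, 0.54, n_cases)   # ~53.8% baseline accuracy
improve = rng.binomial(1, 0.35, n_cases)      # gains on previously wrong cases
degrade = rng.binomial(1, 0.05, n_cases)      # occasional losses
assisted = np.where(unassisted == 1, 1 - degrade, improve)

# 2x2 paired contingency table:
#                     assisted correct | assisted wrong
# unassisted correct         a                 b
# unassisted wrong           c                 d
table = np.array([
    [np.sum((unassisted == 1) & (assisted == 1)),
     np.sum((unassisted == 1) & (assisted == 0))],
    [np.sum((unassisted == 0) & (assisted == 1)),
     np.sum((unassisted == 0) & (assisted == 0))],
])

# Exact McNemar test on the discordant pairs (b vs. c).
result = mcnemar(table, exact=True)
print(f"Accuracy unassisted: {unassisted.mean():.1%}")
print(f"Accuracy assisted:   {assisted.mean():.1%}")
print(f"McNemar P-value:     {result.pvalue:.4f}")
```

McNemar's test is the natural choice here because the same cases are read twice by the same reader, so only the discordant pairs (cases that flip between correct and incorrect) carry information about the effect of assistance.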

Matching journals

The top 3 journals account for 50% of the predicted probability mass.
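The cutoff marked in the table below follows from summing the ranked probabilities until they first reach 50%: 39.7% + 6.7% + 6.7% = 53.1%. A minimal Python sketch of that computation, assuming the predictions are available as a ranked list of (journal, probability) pairs taken from the table:

```python
# Minimal sketch: find how many top-ranked journals cover 50% of the
# predicted probability mass. Values are copied from the table below.
predictions = [
    ("European Radiology", 0.397),
    ("Diagnostics", 0.067),
    ("Scientific Reports", 0.067),
    ("Annals of Translational Medicine", 0.044),
    ("PLOS ONE", 0.039),
    # ... remaining journals omitted for brevity
]

cumulative = 0.0
for rank, (journal, prob) in enumerate(predictions, start=1):
    cumulative += prob
    if cumulative >= 0.5:
        print(f"Top {rank} journals cover {cumulative:.1%} of the mass")
        break
# -> Top 3 journals cover 53.1% of the mass
```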

Rank  Journal                                         Papers in training set  Percentile  Probability
1     European Radiology                              14                      Top 0.1%    39.7%
2     Diagnostics                                     48                      Top 0.2%    6.7%
3     Scientific Reports                              3102                    Top 15%     6.7%
      ---- 50% of predicted probability mass above this line ----
4     Annals of Translational Medicine                17                      Top 0.2%    4.4%
5     PLOS ONE                                        4510                    Top 37%     3.9%
6     Medical Physics                                 14                      Top 0.2%    3.8%
7     Ultrasound in Medicine & Biology                10                      Top 0.1%    3.2%
8     Medicine                                        30                      Top 0.7%    2.5%
9     Frontiers in Medicine                           113                     Top 2%      2.2%
10    PLOS Digital Health                             91                      Top 1%      2.2%
11    Frontiers in Oncology                           95                      Top 2%      1.7%
12    Stroke: Vascular and Interventional Neurology   13                      Top 0.3%    1.4%
13    Journal of Medical Imaging                      11                      Top 0.2%    1.0%
14    Computer Methods and Programs in Biomedicine    27                      Top 0.7%    0.9%
15    Heliyon                                         146                     Top 6%      0.8%
16    Informatics in Medicine Unlocked                21                      Top 1%      0.8%
17    Frontiers in Cardiovascular Medicine            49                      Top 2%      0.8%
18    Photoacoustics                                  11                      Top 0.4%    0.8%
19    Computers in Biology and Medicine               120                     Top 4%      0.8%
20    IEEE Access                                     31                      Top 1%      0.7%
21    Archives of Clinical and Biomedical Research    28                      Top 3%      0.7%
22    Journal of Magnetic Resonance Imaging           14                      Top 0.7%    0.5%
23    JMIRx Med                                       31                      Top 2%      0.5%