
Assessing Performance of Multimodal ChatGPT-4 on an image based Radiology Board-style Examination: An exploratory study

Bera, K.; Gupta, A.; Jiang, S.; Berlin, S.; Faraji, N.; Tippareddy, C.; Chiong, I.; Jones, R.; Nemer, O.; Nayate, A.; Tirumani, S. H.; Ramaiya, N.

medRxiv, 2024-01-13 (radiology and imaging)
doi: 10.1101/2024.01.12.24301222

Objective: To evaluate the performance of multimodal ChatGPT-4 on a radiology board-style examination containing text and radiologic images.

Materials and Methods: In this prospective exploratory study, conducted from October 30 to December 10, 2023, 110 multiple-choice questions containing images, designed to match the style and content of radiology board examinations such as the American Board of Radiology Core examination or the Canadian Board of Radiology examination, were prompted to multimodal ChatGPT-4. Questions were further sub-stratified by order of thinking (lower-order: recall, understanding; higher-order: analyze, synthesize), clinical domain (radiology subspecialty), imaging modality, and difficulty (rated by both radiologists and radiologists-in-training). ChatGPT's performance was assessed overall as well as within subcategories using Fisher's exact test with correction for multiple comparisons. Confidence in answering questions was rated on a Likert scale (1-5) by consensus between a radiologist and a radiologist-in-training. Reproducibility was assessed by comparing two different runs using two different accounts.

Results: ChatGPT-4 answered 55% (61/110) of image-rich questions correctly. While there was no significant difference in performance among the subgroups on exploratory analysis, performance was better on lower-order questions [61% (25/41)] than on higher-order questions [52% (36/69)] (P = .46). Among clinical domains, performance was best on cardiovascular imaging [80% (8/10)] and worst on thoracic imaging [30% (3/10)]. ChatGPT-4 was confident or highly confident in 89% (98/110) of its answers, even when incorrect. Reproducibility between the two runs was poor, with answers differing on 14% (15/110) of questions.

Conclusion: Despite no radiology-specific pre-training, the multimodal capabilities of ChatGPT appear promising on questions containing images. However, the lack of reproducibility between two runs, even with identical questions, poses challenges for reliability.
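The subgroup comparison reported in the abstract (lower-order 25/41 vs. higher-order 36/69 correct, P = .46 by Fisher's exact test) can be checked from the counts alone. The sketch below is an illustrative stdlib-only reconstruction, not the authors' analysis code; the two-sided test sums the hypergeometric probabilities of all 2x2 tables with the same margins that are no more likely than the observed table.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    row/column margins whose probability does not exceed the observed one.
    """
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d

    def p_table(x):
        # P(top-left cell = x) under the hypergeometric distribution
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance guards against floating-point ties
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Counts taken from the abstract:
# lower-order: 25 correct / 16 incorrect; higher-order: 36 correct / 33 incorrect
p_value = fisher_exact_two_sided(25, 16, 36, 33)
```

With these counts the test is far from significance, consistent with the reported P = .46 (the exact value may differ slightly depending on the two-sided convention used).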

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. European Radiology: 29.7% (14 papers in training set; Top 0.1%)
2. Diagnostics: 10.8% (48 papers in training set; Top 0.1%)
3. Scientific Reports: 6.8% (3102 papers in training set; Top 14%)
4. PLOS ONE: 6.8% (4510 papers in training set; Top 25%)
5. Ultrasound in Medicine & Biology: 3.3% (10 papers in training set; Top 0.1%)
6. Medical Physics: 3.1% (14 papers in training set; Top 0.2%)
7. Medicine: 2.6% (30 papers in training set; Top 0.7%)
8. BMJ Open: 2.5% (554 papers in training set; Top 7%)
9. JAMA Network Open: 2.0% (127 papers in training set; Top 2%)
10. Annals of Translational Medicine: 2.0% (17 papers in training set; Top 0.5%)
11. Archives of Clinical and Biomedical Research: 1.9% (28 papers in training set; Top 0.5%)
12. Stroke: Vascular and Interventional Neurology: 1.8% (13 papers in training set; Top 0.2%)
13. Frontiers in Oncology: 1.6% (95 papers in training set; Top 2%)
14. Frontiers in Medicine: 1.3% (113 papers in training set; Top 4%)
15. Journal of Medical Imaging: 1.2% (11 papers in training set; Top 0.2%)
16. PLOS Digital Health: 1.0% (91 papers in training set; Top 2%)
17. Journal of Clinical Medicine: 1.0% (91 papers in training set; Top 5%)
18. JMIRx Med: 1.0% (31 papers in training set; Top 1%)
19. Frontiers in Psychology: 0.8% (49 papers in training set; Top 1%)
20. Computer Methods and Programs in Biomedicine: 0.8% (27 papers in training set; Top 0.8%)
21. JCO Clinical Cancer Informatics: 0.8% (18 papers in training set; Top 0.8%)
22. npj Precision Oncology: 0.7% (48 papers in training set; Top 1%)
23. Photoacoustics: 0.7% (11 papers in training set; Top 0.5%)
24. The Lancet Digital Health: 0.7% (25 papers in training set; Top 1%)
25. Journal of Magnetic Resonance Imaging: 0.7% (14 papers in training set; Top 0.6%)
26. Informatics in Medicine Unlocked: 0.5% (21 papers in training set; Top 2%)
27. Frontiers in Artificial Intelligence: 0.5% (18 papers in training set; Top 1%)