Assessing Performance of Multimodal ChatGPT-4 on an image based Radiology Board-style Examination: An exploratory study

Bera, K.; Gupta, A.; Jiang, S.; Berlin, S.; Faraji, N.; Tippareddy, C.; Chiong, I.; Jones, R.; Nemer, O.; Nayate, A.; Tirumani, S. H.; Ramaiya, N.

2024-01-13 · radiology and imaging
doi: 10.1101/2024.01.12.24301222
Abstract

Objective: To evaluate the performance of multimodal ChatGPT-4 on a radiology board-style examination containing text and radiologic images.

Materials and Methods: In this prospective exploratory study, conducted from October 30 to December 10, 2023, 110 image-containing multiple-choice questions, designed to match the style and content of radiology board examinations such as the American Board of Radiology Core examination or the Canadian Board of Radiology examination, were prompted to multimodal ChatGPT-4. Questions were further stratified by order (lower-order: recall, understanding; higher-order: analyze, synthesize), clinical domain (radiology subspecialty), imaging modality, and difficulty (rated by both radiologists and radiologists-in-training). ChatGPT's performance was assessed overall as well as within subcategories using Fisher's exact test with adjustment for multiple comparisons. Confidence in answering questions was rated on a Likert scale (1-5) by consensus between a radiologist and a radiologist-in-training. Reproducibility was assessed by comparing two different runs using two different accounts.

Results: ChatGPT-4 answered 55% (61/110) of the image-rich questions correctly. Although exploratory subgroup analyses showed no significant differences, performance was better on lower-order questions [61% (25/41)] than on higher-order questions [52% (36/69)] (P = .46). Among clinical domains, performance was best on cardiovascular imaging [80% (8/10)] and worst on thoracic imaging [30% (3/10)]. ChatGPT-4 was confident or highly confident on 89% (98/110) of questions, even when its answers were incorrect. Reproducibility between the two runs was poor, with answers differing on 14% (15/110) of questions.

Conclusion: Despite no radiology-specific pretraining, the multimodal capabilities of ChatGPT appear promising on questions containing images. However, the lack of reproducibility between two runs, even with identical questions, poses reliability challenges.
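
As an illustration of the subgroup comparison reported above, here is a minimal sketch of Fisher's exact test on the lower-order versus higher-order counts from the abstract (25/41 vs. 36/69). The use of scipy is an assumption made for illustration; this is not the authors' analysis code.

```python
# Sketch: Fisher's exact test on the reported lower- vs higher-order accuracy.
# Counts are taken from the abstract; scipy is an assumed tool, not the
# authors' actual pipeline.
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = question order, columns = (correct, incorrect)
table = [
    [25, 41 - 25],  # lower-order: 25 correct of 41
    [36, 69 - 36],  # higher-order: 36 correct of 69
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.2f}")
# Should reproduce the non-significant P = .46 reported in the abstract.
```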

Matching journals

The top 7 journals account for 50% of the predicted probability mass (a short sketch of this cumulative cutoff follows the list).

1. European Radiology (21.6%; based on 11 papers; Top 0.1%)
2. PLOS ONE (8.3%; based on 1737 papers; Top 54%)
3. Diagnostics (5.8%; based on 36 papers; Top 0.3%)
4. Scientific Reports (5.2%; based on 701 papers; Top 35%)
5. Cureus (5.2%; based on 64 papers; Top 3%)
6. Journal of Clinical Medicine (3.2%; based on 77 papers; Top 4%)
7. BMC Cancer (3.1%; based on 21 papers; Top 1%)
[50% of the predicted probability mass lies above this line]
8. PLOS Digital Health (3.1%; based on 88 papers; Top 4%)
9. BMJ Open (2.7%; based on 553 papers; Top 32%)
10. Stroke: Vascular and Interventional Neurology (2.7%; based on 12 papers; Top 0.8%)
11. Journal of Magnetic Resonance Imaging (2.7%; based on 10 papers; Top 0.9%)
12. Frontiers in Oncology (1.7%; based on 34 papers; Top 4%)
13. Medicine (1.7%; based on 29 papers; Top 4%)
14. Annals of Translational Medicine (1.7%; based on 14 papers; Top 2%)
15. Archives of Clinical and Biomedical Research (1.7%; based on 18 papers; Top 0.4%)
16. Cancers (1.5%; based on 57 papers; Top 6%)
17. Neuro-Oncology Advances (1.5%; based on 14 papers; Top 1%)
18. Heliyon (1.5%; based on 57 papers; Top 6%)
19. Brain and Behavior (1.3%; based on 19 papers; Top 3%)
20. Informatics in Medicine Unlocked (1.3%; based on 11 papers; Top 2%)
21. Journal of the American Medical Informatics Association (0.9%; based on 53 papers; Top 6%)
22. JMIRx Med (0.9%; based on 29 papers; Top 5%)
23. npj Digital Medicine (0.7%; based on 85 papers; Top 13%)
24. JCO Clinical Cancer Informatics (0.7%; based on 14 papers; Top 4%)
25. Computers in Biology and Medicine (0.7%; based on 39 papers; Top 7%)
26. BMC Medical Education (0.7%; based on 16 papers; Top 1%)
27. Radiotherapy and Oncology (0.7%; based on 11 papers; Top 2%)
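
As promised above, here is a minimal sketch of how the "50% of probability mass" cutoff can be computed. The probabilities are transcribed from the list; the method (a cumulative sum over the ranked probabilities) is an assumption about how the matching tool derives the cutoff.

```python
from itertools import accumulate

# Predicted probabilities (%) for the top-ranked journals, transcribed from
# the list above (only enough entries to reach the cutoff are needed).
probs = [21.6, 8.3, 5.8, 5.2, 5.2, 3.2, 3.1, 3.1, 2.7, 2.7, 2.7]

# Walk the cumulative sum over the ranked probabilities and report the first
# rank at which at least 50% of the probability mass is covered.
for rank, cum in enumerate(accumulate(probs), start=1):
    if cum >= 50.0:
        print(f"top {rank} journals cover {cum:.1f}% of the probability mass")
        break
# -> top 7 journals cover 52.4% of the probability mass
```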