Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y.; Miki, S.; Yamagishi, Y.; Hanaoka, S.; Nakao, T.; Kikuchi, T.; Nakamura, Y.; Nomura, Y.; Yoshikawa, T.; Abe, O.

2025-06-23 radiology and imaging
10.1101/2025.06.23.25329534

Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).

Materials and Methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.

Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Adding image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5) but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.

Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology.

Secondary abstract: Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%, respectively) and received good legitimacy scores from human raters, demonstrating steady progress.
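The vision-vs-text-only comparison described in the abstract is a paired design (the same model answers the same questions under both conditions), which is why McNemar's exact test applies. A minimal sketch of that test, using only the standard library; the discordant-pair counts below are hypothetical, not the study's data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts.

    b: questions correct with vision but wrong text-only
    c: questions wrong with vision but correct text-only
    Concordant pairs do not enter the test; under the null, each
    discordant pair is a fair coin flip, so this is an exact
    two-sided binomial test with p = 0.5 on n = b + c trials.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts for one model: image input fixed 8 answers, broke 2.
p = mcnemar_exact(2, 8)
print(f"p = {p:.4f}")  # p = 0.1094
```

With only 10 discordant pairs the test is underpowered, which illustrates why an image-input benefit can fail to reach significance for most models, as the abstract reports.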

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                                   Based on      Percentile  Probability
1     European Radiology                                        11 papers     Top 0.1%    18.9%
2     Scientific Reports                                        701 papers    Top 18%     8.4%
3     PLOS ONE                                                  1737 papers   Top 60%     7.1%
4     PLOS Digital Health                                       88 papers     Top 2%      5.6%
5     Diagnostics                                               36 papers     Top 0.5%    5.2%
6     npj Digital Medicine                                      85 papers     Top 4%      5.0%
      ---- 50% of probability mass above ----
7     Journal of the American Medical Informatics Association   53 papers     Top 3%      3.3%
8     Cureus                                                    64 papers     Top 6%      2.7%
9     Heliyon                                                   57 papers     Top 4%      2.0%
10    Annals of Translational Medicine                          14 papers     Top 2%      1.7%
11    Neuro-Oncology Advances                                   14 papers     Top 1%      1.7%
12    Computers in Biology and Medicine                         39 papers     Top 4%      1.7%
13    JCO Clinical Cancer Informatics                           14 papers     Top 2%      1.5%
14    Informatics in Medicine Unlocked                          11 papers     Top 1%      1.5%
15    Journal of Medical Internet Research                      81 papers     Top 10%     1.5%
16    Stroke: Vascular and Interventional Neurology             12 papers     Top 1%      1.5%
17    Journal of Clinical Medicine                              77 papers     Top 10%     1.5%
18    The Lancet Digital Health                                 25 papers     Top 2%      1.5%
19    Journal of Magnetic Resonance Imaging                     10 papers     Top 2%      1.3%
20    Medicine                                                  29 papers     Top 5%      1.3%
21    BMC Cancer                                                21 papers     Top 4%      1.3%
22    Scientific Data                                           30 papers     Top 3%      0.9%
23    Archives of Clinical and Biomedical Research              18 papers     Top 1%      0.9%
24    BMJ Open                                                  553 papers    Top 48%     0.9%
25    International Journal of Medical Informatics              25 papers     Top 6%      0.7%
26    JMIRx Med                                                 29 papers     Top 6%      0.7%
27    International Journal of Cancer                           18 papers     Top 2%      0.7%
28    Radiotherapy and Oncology                                 11 papers     Top 2%      0.7%
29    BMC Medical Informatics and Decision Making               36 papers     Top 7%      0.7%