Evaluation of Closed and Open Large Language Models in Pediatric Cardiology Board Exam Performance
Nikolovski, N.; Morgan, C. T.; Gritti, M. N.
Introduction: Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and assess their clinical and educational utility.

Methods: ChatGPT-4o and DeepSeek-R1 each answered 88 text-based multiple-choice questions spanning 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1's processing time per question was recorded. Models were compared using an unpaired two-tailed t-test, and bivariate correlations were assessed with Pearson's r.

Results: ChatGPT-4o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.79). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming the other in 3 of 11. DeepSeek-R1's processing time correlated negatively with accuracy (r = -0.68, p = 0.02).

Conclusion: Both models approached the passing threshold on a pediatric cardiology board examination with comparable accuracy, suggesting potential for open-source models to enhance clinical and educational outcomes while supporting sustainable AI development.
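The statistical comparison described in Methods could be reproduced along the following lines. This is a minimal sketch: the per-subtopic accuracies and processing times below are illustrative placeholders, not the study's data, and the analysis assumes the t-test was applied to per-subtopic accuracy values with scipy's defaults (two-sided, unpaired).

```python
from scipy import stats

# Illustrative per-subtopic accuracies (fraction correct) across 11 chapters.
# These are placeholder values, NOT the study's reported data.
chatgpt_acc = [0.75, 0.62, 0.88, 0.62, 0.75, 0.62, 0.75, 0.62, 0.75, 0.62, 0.75]
deepseek_acc = [0.75, 0.50, 0.88, 0.62, 0.62, 0.75, 0.75, 0.50, 0.75, 0.62, 0.75]

# Unpaired two-tailed t-test comparing the two models' subtopic accuracies
# (ttest_ind is two-sided and assumes independent samples by default).
t_stat, p_value = stats.ttest_ind(chatgpt_acc, deepseek_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")

# Illustrative DeepSeek-R1 processing times per subtopic (seconds); placeholders.
processing_time = [40, 95, 30, 70, 85, 55, 45, 100, 50, 80, 60]

# Pearson's r between processing time and accuracy, as in the reported
# bivariate correlation (r = -0.68, p = 0.02 in the study).
r, p_corr = stats.pearsonr(processing_time, deepseek_acc)
print(f"r = {r:.2f}, p = {p_corr:.2f}")
```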