Retrieval-Augmented Claude Opus 4.7 and GPT-5.5 Surpass Human Performance on the Nuclear Cardiology Board Preparation Exam (and Claude Drafts a Paper About it)
Killekar, A.; Shanbhag, A.; Miller, R. J.; Dey, D.; Bourque, J.; Phillips, L.; Chareonthaitawee, P.; Slomka, P.
Show abstract
BackgroundPrevious studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context improved GPT-4o to 73.8% on text-only questions, but still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can exceed this gap is unknown. MethodsClaude Opus 4.7 and GPT-5.5 were administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations each, using RAG with a nuclear cardiology textbook, companion atlas, and ASNC clinical guidelines. Claude used local FAISS-based semantic retrieval; GPT-5.5 used Azures cloud-hosted vector store. Performance was compared to prior LLM results and 13 human FITs. ResultsAcross 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% {+/-} 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% {+/-} 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Compared to GPT-4o without context (63.1%), this represents a 23-percentage-point improvement in 18 months. ConclusionNext-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging. Condensed AbstractAcross 5 iterations each, Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation achieved mean accuracies of 86.3% and 86.7% on the 2023 ASNC Board Preparation Exam (168 questions), both surpassing the mean human fellow-in-training score of 78%. GPT-5.5 refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. These results represent a 23-percentage-point improvement over the best prior LLM without context (63.1%), demonstrating that RAG-enhanced LLMs have reached human-level proficiency in nuclear cardiology knowledge. Graphical Abstract O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=111 SRC="FIGDIR/small/26352768v2_ufig1.gif" ALT="Figure 1"> View larger version (49K): org.highwire.dtl.DTLVardef@5f2465org.highwire.dtl.DTLVardef@4e80d3org.highwire.dtl.DTLVardef@1ebbb93org.highwire.dtl.DTLVardef@167d3c1_HPS_FORMAT_FIGEXP M_FIG C_FIG Overview of the three-study research arc evaluating LLM performance on the 2023 ASNC Board Preparation Exam. Study 1 (2024) tested four LLMs without context (best: GPT-4o, 63.1%). Study 2 (2025) added textbook context to GPT-4o (73.8%). Study 3 (2026, current) evaluated Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation across 5 iterations each (mean 86.3% and 86.7%, respectively), both surpassing the human fellow-in-training mean of 78%. Right panel shows the performance scale with key thresholds.
Matching journals
The top 12 journals account for 50% of the predicted probability mass.