
Retrieval-Augmented Claude Opus 4.7 and GPT-5.5 Surpass Human Performance on the Nuclear Cardiology Board Preparation Exam (and Claude Drafts a Paper About it)

Killekar, A.; Shanbhag, A.; Miller, R. J.; Dey, D.; Bourque, J.; Phillips, L.; Chareonthaitawee, P.; Slomka, P.

2026-05-13 radiology and imaging
10.64898/2026.05.08.26352768 medRxiv

Background: Previous studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context improved GPT-4o to 73.8% on text-only questions, but this still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can close this gap is unknown.

Methods: Claude Opus 4.7 and GPT-5.5 were administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations each, using RAG with a nuclear cardiology textbook, companion atlas, and ASNC clinical guidelines. Claude used local FAISS-based semantic retrieval; GPT-5.5 used Azure's cloud-hosted vector store. Performance was compared to prior LLM results and 13 human FITs.

Results: Across 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% ± 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% ± 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Compared to GPT-4o without context (63.1%), this represents a 23-percentage-point improvement in 18 months.

Conclusion: Next-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging.

Condensed Abstract: Across 5 iterations each, Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation achieved mean accuracies of 86.3% and 86.7% on the 2023 ASNC Board Preparation Exam (168 questions), both surpassing the mean human fellow-in-training score of 78%. GPT-5.5 refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. These results represent a 23-percentage-point improvement over the best prior LLM without context (63.1%), demonstrating that RAG-enhanced LLMs have reached human-level proficiency in nuclear cardiology knowledge.

Graphical Abstract: Overview of the three-study research arc evaluating LLM performance on the 2023 ASNC Board Preparation Exam. Study 1 (2024) tested four LLMs without context (best: GPT-4o, 63.1%). Study 2 (2025) added textbook context to GPT-4o (73.8%). Study 3 (2026, current) evaluated Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation across 5 iterations each (mean 86.3% and 86.7%, respectively), both surpassing the human fellow-in-training mean of 78%. Right panel shows the performance scale with key thresholds.
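The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function and variable names are hypothetical, and it assumes every question carries equal weight, so the overall accuracy is the question-count-weighted mean of the text and image accuracies.

```python
# Figures copied from the abstract; all names here are illustrative.
N_TEXT, N_IMAGE = 141, 27
N_TOTAL = N_TEXT + N_IMAGE  # 168 questions in total

def overall_accuracy(text_pct, image_pct):
    """Question-weighted overall accuracy (%) from per-modality accuracies (%).

    Assumption: each question counts equally toward the overall score.
    """
    return (N_TEXT * text_pct + N_IMAGE * image_pct) / N_TOTAL

claude_overall = overall_accuracy(88.8, 73.3)  # reported overall: 86.3%
gpt_overall = overall_accuracy(88.5, 77.0)     # reported overall: 86.7%

# Mean refusals per iteration expressed as a share of all questions.
refusal_rate = 100 * 12.2 / N_TOTAL            # reported: 7.3%

# Percentage-point gain over GPT-4o without context (63.1%).
BASELINE = 63.1
claude_gain = 86.3 - BASELINE
gpt_gain = 86.7 - BASELINE

print(round(claude_overall, 1), round(gpt_overall, 1))
print(round(refusal_rate, 1))
print(round(claude_gain, 1), round(gpt_gain, 1))
```

Under that equal-weight assumption the weighted means round to the reported 86.3% and 86.7%, the refusal share rounds to 7.3%, and the gains over the baseline are 23.2 and 23.6 percentage points, consistent with the "23-percentage-point" summary.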

Matching journals

The top 12 journals account for 50% of the predicted probability mass.

1. Medical Physics: 14 papers in training set; Top 0.1%; 9.5%
2. Journal of Medical Imaging: 11 papers in training set; Top 0.1%; 5.1%
3. European Radiology: 14 papers in training set; Top 0.1%; 5.1%
4. npj Digital Medicine: 97 papers in training set; Top 0.9%; 5.1%
5. PLOS Digital Health: 91 papers in training set; Top 0.4%; 5.1%
6. Computers in Biology and Medicine: 120 papers in training set; Top 0.7%; 3.8%
7. Scientific Reports: 3102 papers in training set; Top 33%; 3.7%
8. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.2%; 3.7%
9. PLOS ONE: 4510 papers in training set; Top 43%; 2.8%
10. Frontiers in Artificial Intelligence: 18 papers in training set; Top 0.1%; 2.7%
11. Computer Methods and Programs in Biomedicine: 27 papers in training set; Top 0.2%; 2.5%
12. The Lancet Digital Health: 25 papers in training set; Top 0.2%; 2.2%
50% of probability mass above
13. Diagnostics: 48 papers in training set; Top 0.7%; 2.2%
14. JMIRx Med: 31 papers in training set; Top 0.4%; 2.2%
15. JAMA Network Open: 127 papers in training set; Top 2%; 2.0%
16. Photoacoustics: 11 papers in training set; Top 0.2%; 2.0%
17. BMJ Open: 554 papers in training set; Top 9%; 1.8%
18. Archives of Clinical and Biomedical Research: 28 papers in training set; Top 0.6%; 1.8%
19. GigaScience: 172 papers in training set; Top 1%; 1.7%
20. Patterns: 70 papers in training set; Top 0.8%; 1.7%
21. IEEE Access: 31 papers in training set; Top 0.4%; 1.5%
22. European Journal of Nuclear Medicine and Molecular Imaging: 19 papers in training set; Top 0.2%; 1.3%
23. Frontiers in Physiology: 93 papers in training set; Top 4%; 1.2%
24. Physics in Medicine & Biology: 17 papers in training set; Top 0.3%; 1.2%
25. International Journal of Radiation Oncology*Biology*Physics: 21 papers in training set; Top 0.4%; 0.9%
26. Frontiers in Neuroinformatics: 38 papers in training set; Top 0.6%; 0.9%
27. Ultrasound in Medicine & Biology: 10 papers in training set; Top 0.4%; 0.8%
28. Frontiers in Oncology: 95 papers in training set; Top 3%; 0.8%
29. Magnetic Resonance in Medicine: 72 papers in training set; Top 0.5%; 0.8%
30. Biomedical Optics Express: 84 papers in training set; Top 1.0%; 0.8%
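
The "top 12 journals account for 50%" statement can be checked by accumulating the listed percentages in rank order. The sketch below is a minimal illustration, not the site's own method: it assumes the final percentage on each entry is the predicted probability for that journal, uses the rounded values as displayed, and the function name is hypothetical.

```python
from itertools import accumulate

# Displayed percentages for ranks 1-13, copied from the list above.
# Assumption: the last percentage on each entry is its predicted probability.
probs = [9.5, 5.1, 5.1, 5.1, 5.1, 3.8, 3.7, 3.7, 2.8, 2.7, 2.5, 2.2, 2.2]

def journals_to_reach(probabilities, target):
    """Return the smallest rank whose cumulative probability reaches target.

    Returns None when the listed probabilities never add up to the target.
    """
    for rank, cumulative in enumerate(accumulate(probabilities), start=1):
        if cumulative >= target:
            return rank
    return None

print(journals_to_reach(probs, 50.0))
```

With the rounded values shown, the running total is 49.1% after rank 11 and 51.3% after rank 12, so the threshold is first crossed at rank 12, in line with the divider's position in the list.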