
Multidimensional Evaluation of Large Language Models on the AAP In-Service Examination: Assessing Accuracy, Calibration, and Citation Reliability

Dhaimade, P. A.; Henderson, R.

medRxiv preprint (dentistry and oral medicine), posted 2025-10-17. DOI: 10.1101/2025.10.14.25338040
Abstract

Background: Large language models (LLMs) have demonstrated rapid advancements in natural language understanding and generation, prompting their integration into biomedical research, clinical practice, and professional education. However, systematic evaluation of LLMs in specialty-specific domains such as dentistry and periodontology remains limited, particularly regarding multidimensional performance metrics.

Objective: To conduct a comprehensive, multidimensional assessment of three commercially available LLMs (GPT-4.0, GPT-5.0, and Claude Sonnet 4.0) on the American Academy of Periodontology (AAP) In-Service Examination, focusing on response accuracy, self-assessed confidence calibration, citation validity, and hallucination prevalence.

Methods: Models were evaluated on the 2024 AAP In-Service Examination (331 questions) in two formats: Full Test (all questions at once) and Individual Question (one at a time). Prompts were standardized; models selected answers and, for GPT-5.0 and Claude Sonnet 4.0, also provided confidence ratings and citations. Citation validity was assessed using a human-in-the-loop protocol with expert review. Statistical analyses included chi-square tests, McNemar's test, and logistic regression to assess accuracy, question fatigue, confidence calibration, and citation reliability.

Results: The LLMs achieved high overall accuracy (78-87%), with the Individual Question format consistently yielding higher scores than the Full Test format, though the differences were not statistically significant. Accuracy was highest in fact-dense domains (biochemistry, physiology, microbiology) and lowest in integrative domains (diagnosis, therapy). Significant question fatigue was observed in GPT-5.0 Full Test mode (OR = 0.997, p = 0.035) but not in Individual Question mode. Confidence scores predicted accuracy, with the strongest calibration in Individual Question mode. Citation analysis revealed frequent hallucinations, most of them critically erroneous, and citation validity was independent of answer accuracy.

Conclusions: LLMs can answer a broad spectrum of periodontal specialty questions, but their reliability varies with context and information presentation. While promising as adjunctive tools, their outputs, especially for complex reasoning and citations, require rigorous human review in educational and research settings to ensure accuracy and safety.

Author Summary: Artificial intelligence chatbots are rapidly entering medical education, yet we lack a comprehensive understanding of their reliability when students depend on them for learning. We developed a multidimensional evaluation framework to systematically assess AI performance beyond simple accuracy, examining how these systems behave across different medical topics, question types, and presentation formats. Using 331 real dental examination questions, we tested three major AI systems, analyzing not only correctness but also confidence calibration (whether AI confidence levels match actual accuracy) and implementing human-in-the-loop verification to check whether cited sources actually exist. Our findings highlight critical vulnerabilities in current AI systems. Most alarmingly, these chatbots fabricated nearly half of their citations while maintaining unwavering confidence in both correct and incorrect responses. This combination of overconfidence and misinformation means students cannot distinguish reliable from unreliable AI responses. Additionally, we documented progressive performance decline during sequential questioning, similar to human cognitive fatigue.
While we know AI systems generate rather than retrieve information, our research demonstrates the real-world consequences of this limitation. As artificial intelligence integrates into education, healthcare diagnostics, and insurance decisions, these findings underscore the urgent need for better evaluation frameworks and user education about AI limitations.
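The question-fatigue finding (OR = 0.997 per question in GPT-5.0 Full Test mode) corresponds to a logistic regression of per-question correctness on question position. Below is a minimal sketch of that kind of analysis, not the authors' actual code: the column names and the simulated results sheet are hypothetical stand-ins.

```python
# Sketch of a question-fatigue analysis: logistic regression of answer
# correctness on question position. The data are simulated stand-ins;
# the paper's real per-question results would replace them.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
position = np.arange(1, 332)                     # 331 exam questions
p = 1 / (1 + np.exp(-(1.6 - 0.003 * position)))  # mild simulated fatigue
df = pd.DataFrame({"position": position,
                   "correct": rng.binomial(1, p)})

X = sm.add_constant(df["position"])
fit = sm.Logit(df["correct"], X).fit(disp=0)

# Odds ratio per additional question; values just below 1 indicate a small
# per-question decline in the odds of a correct answer.
print(f"OR per question: {np.exp(fit.params['position']):.3f}")
print(np.exp(fit.conf_int()))  # 95% CI on the odds-ratio scale
```

Per-question odds ratios compound: under the reported point estimate, 0.997^330 ≈ 0.37, so the odds of a correct answer on the last question of the Full Test would be roughly a third of the odds on the first.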

Matching journals

The top 4 journals account for 50% of the predicted probability mass; the cutoff is marked in the table below and recomputed in the sketch after it.

Rank  Journal                                                   Papers in training set  Percentile  Probability
1     Biology Methods and Protocols                             53                      Top 0.1%    26.5%
2     PLOS Digital Health                                       91                      Top 0.1%    15.2%
3     PLOS ONE                                                  4510                    Top 26%     6.6%
4     European Radiology                                        14                      Top 0.2%    4.5%
----- 50% of probability mass above -----
5     Frontiers in Public Health                                140                     Top 1%      4.4%
6     npj Digital Medicine                                      97                      Top 1%      4.1%
7     Royal Society Open Science                                193                     Top 0.6%    3.7%
8     Journal of the American Medical Informatics Association   61                      Top 0.8%    3.4%
9     Journal of Medical Internet Research                      85                      Top 2%      2.8%
10    Healthcare                                                16                      Top 0.6%    1.7%
11    Scientific Reports                                        3102                    Top 56%     1.7%
12    PLOS Biology                                              408                     Top 9%      1.7%
13    BMC Medical Informatics and Decision Making               39                      Top 2%      1.5%
14    International Journal of Medical Informatics              25                      Top 1%      1.1%
15    Computers in Biology and Medicine                         120                     Top 3%      1.0%
16    Frontiers in Digital Health                               20                      Top 1%      0.8%
17    Biomolecules                                              95                      Top 2%      0.8%
18    Bioinformatics Advances                                   184                     Top 5%      0.7%
19    Bioinformatics                                            1061                    Top 10%     0.7%
20    Computer Methods and Programs in Biomedicine              27                      Top 1%      0.7%
21    Cancer Medicine                                           24                      Top 2%      0.5%
22    BMC Biology                                               248                     Top 7%      0.5%
23    BMC Bioinformatics                                        383                     Top 8%      0.5%
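The 50% cutoff marked in the table follows directly from the probability column. A minimal sketch that recomputes it, with the probabilities copied from the table above:

```python
# Find how many top-ranked journals cover 50% of the predicted
# probability mass, using the match probabilities from the table.
from itertools import accumulate

probs = [26.5, 15.2, 6.6, 4.5, 4.4, 4.1, 3.7, 3.4, 2.8, 1.7, 1.7, 1.7,
         1.5, 1.1, 1.0, 0.8, 0.8, 0.7, 0.7, 0.7, 0.5, 0.5, 0.5]  # ranks 1-23

cumulative = list(accumulate(probs))
cutoff = next(i for i, c in enumerate(cumulative, start=1) if c >= 50.0)
print(cutoff, round(cumulative[cutoff - 1], 1))  # -> 4 52.8
```

Note that the 23 journals shown account for only about 85% of the total mass; the remainder presumably falls on lower-ranked journals below the display threshold.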