Accuracy and Consistency of Frontier LLMs on Orthodontic Diagnostic Tasks: A Repeated-Trial Comparison

Kang, W. J.; Sim, J.; Loh, E. E. M.; Lim, A. C. Y.; Foong, K. W. C.

2026-05-20 health informatics
10.64898/2026.05.17.26353409 medRxiv

Importance. Large language models are increasingly explored as clinical decision support tools in orthodontics, yet existing evaluations have been confined to knowledge-based question answering, where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies fail to report the number of repeated queries performed, leaving output stochasticity unexamined.

Objective. To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, namely ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN DHC) classification, space analysis, and lateral cephalometric interpretation.

Methods. In this comparative cross-sectional study with a repeated-measures design, each model, accessed through its consumer-facing web interface under default provider settings rather than through an application programming interface, processed 200 purpose-built items (50 per task) across four independent trials, yielding 2,400 observations (3 models × 200 items × 4 trials). Responses were scored against a pre-established reference standard by two independent raters using strict binary exact-match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran's Q test with post-hoc McNemar's tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations).

Results. Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4 to 99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6 to 98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8 to 96.9%) (Cochran's Q=6.87, p=0.032). Each model exhibited distinct, non-overlapping error profiles concentrated at the normal-abnormal classification boundary. An accuracy-consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context-rich prompting eliminated all errors across all three models.

Interpretation. Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. An accuracy-consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, the absence of a non-thinking baseline means this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment pending prospective validation on real patient data.
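The statistical procedures named in the Methods (exact binomial confidence intervals, Cochran's Q, post-hoc McNemar's tests with Bonferroni correction) can be illustrated with a minimal pure-Python sketch. This is not the authors' analysis code: the function names and the toy correctness matrix are hypothetical, the McNemar variant shown is the exact binomial form (the abstract does not say which variant was used), and the closed-form p-value applies only to two degrees of freedom, that is, three models.

```python
from math import comb, exp

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson (exact binomial) interval for k successes in n trials."""
    def solve(cdf_k, target):
        # binom_cdf is decreasing in p, so bisect for the p where it equals target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(cdf_k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper

def cochran_q(rows):
    """Cochran's Q for per-item tuples of 0/1 outcomes, one column per model."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - total**2)
    return numerator / (k * total - sum(r * r for r in row_totals))

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return exp(-x / 2)

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value from paired 0/1 outcomes a and b."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    tail = sum(comb(n, i) for i in range(min(n01, n10) + 1)) / 2**n
    return min(1.0, 2 * tail)

def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m comparisons, capped at 1."""
    return min(1.0, p * m)

# Hypothetical toy data: 10 items scored 1 (correct) or 0 for models A, B, C.
TOY = [(1, 1, 1)] * 4 + [(1, 0, 0), (1, 0, 0), (1, 1, 0),
                         (1, 0, 1), (1, 0, 0), (0, 1, 1)]

q = cochran_q(TOY)
print("Cochran's Q:", q, "p:", chi2_sf_df2(q))
a, b = [r[0] for r in TOY], [r[1] for r in TOY]
print("A vs B, Bonferroni-adjusted p:", bonferroni(mcnemar_exact(a, b), 3))
print("Exact 95% CI for 198/200:", exact_ci(198, 200))
```

Two checks against the reported figures are possible. With Q = 6.87 and two degrees of freedom, the closed form exp(-Q/2) gives p of about 0.032, matching the abstract. Assuming the 99.0% accuracy corresponds to 198 of 200 items correct (the abstract does not state the denominator explicitly), `exact_ci(198, 200)` gives roughly 96.4% to 99.9%.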

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS Digital Health: 91 papers in training set, Top 0.1%, 20.0%
2. PLOS ONE: 4510 papers in training set, Top 16%, 10.8%
3. Scientific Reports: 3102 papers in training set, Top 8%, 9.0%
4. Frontiers in Digital Health: 20 papers in training set, Top 0.1%, 5.2%
5. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.9%, 2.9%
6. Philosophical Transactions of the Royal Society B: 51 papers in training set, Top 2%, 2.2%

50% of probability mass above

7. npj Digital Medicine: 97 papers in training set, Top 2%, 2.2%
8. Healthcare: 16 papers in training set, Top 0.4%, 2.0%
9. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 2.0%
10. International Journal of Medical Informatics: 25 papers in training set, Top 0.7%, 2.0%
11. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.4%, 1.8%
12. JMIR Medical Informatics: 17 papers in training set, Top 0.8%, 1.6%
13. BMJ Health & Care Informatics: 13 papers in training set, Top 0.5%, 1.6%
14. JAMA Network Open: 127 papers in training set, Top 3%, 1.4%
15. BMC Medical Research Methodology: 43 papers in training set, Top 0.7%, 1.4%
16. Trials: 25 papers in training set, Top 1%, 1.3%
17. Computers in Biology and Medicine: 120 papers in training set, Top 3%, 1.2%
18. Journal of Clinical Epidemiology: 28 papers in training set, Top 0.4%, 1.0%
19. PeerJ: 261 papers in training set, Top 11%, 1.0%
20. Journal of Clinical Medicine: 91 papers in training set, Top 5%, 1.0%
21. BMJ Open: 554 papers in training set, Top 11%, 1.0%
22. DIGITAL HEALTH: 12 papers in training set, Top 0.5%, 1.0%
23. European Radiology: 14 papers in training set, Top 0.6%, 1.0%
24. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 2%, 0.8%
25. Cureus: 67 papers in training set, Top 4%, 0.8%
26. Royal Society Open Science: 193 papers in training set, Top 4%, 0.8%
27. JAMA Pediatrics: 10 papers in training set, Top 0.2%, 0.8%
28. Journal of Personalized Medicine: 28 papers in training set, Top 1%, 0.7%
29. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 1%, 0.7%
30. BMJ Paediatrics Open: 21 papers in training set, Top 0.8%, 0.7%