
Closing the Pediatric Divide: A Performance Analysis of the GPT-5 Family in Medical Diagnostics

Mondillo, G.; Abbate, F. G.; Masino, M.; Colosimo, S.; Perrotta, A.; Frattolillo, V.

2025-08-29 · pediatrics
medRxiv · doi:10.1101/2025.08.28.25334657

Abstract

Background: Large Language Models (LLMs) have demonstrated significant potential in clinical medicine, but a persistent performance gap exists in the pediatric domain due to its unique complexities. This study provides the first comparative evaluation of the new GPT-5 family (Nano, Mini, and full) to assess the impact of model scale on diagnostic accuracy and on this adult-pediatric disparity.

Methods: A benchmarking study was conducted using 2,000 multiple-choice questions from the MedQA dataset, divided equally between adult (n=1,000) and pediatric (n=1,000) domains. GPT-5, GPT-5 Mini, and GPT-5 Nano were tested via API with standardized parameters (temperature=0, reasoning_effort=minimal, verbosity=low, max_tokens=170). Accuracy was calculated and statistically compared across domains for each model.

Results: A clear dose-response relationship was observed between model size and accuracy. GPT-5 Nano exhibited a significant performance gap, with 71.0% accuracy in adult medicine versus 55.4% in pediatrics (a 15.6-percentage-point difference, p<0.001). GPT-5 Mini substantially narrowed this gap to 5.7 points (81.5% vs. 75.8%, p=0.001). Critically, the full GPT-5 model eliminated the disparity, achieving comparable accuracy across domains, with 86.3% in adult medicine and a slightly higher 88.5% in pediatrics (p=0.138). Performance gains from scaling up were disproportionately larger in the pediatric domain.

Conclusion: The GPT-5 family marks a substantial advancement in medical AI. The full-size model not only achieves high diagnostic accuracy but, crucially, overcomes the previously documented performance limitations in pediatrics, demonstrating that sufficient model scale is vital for mastering the nuances of specialized clinical domains. These findings support a tiered implementation strategy based on task criticality and underscore the need for continued validation in real-world clinical settings.
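The abstract does not name the statistical test behind the reported p-values. As a minimal sketch, a standard two-sided two-proportion z-test (an assumption, not stated in the paper) applied to the reported accuracies with n=1,000 items per domain yields significance levels in line with those reported. The function name `two_prop_z` and the raw correct-answer counts (e.g. 710/1,000 for 71.0%) are reconstructed from the percentages for illustration.

```python
from math import sqrt, erf

def two_prop_z(s1, n1, s2, n2):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf, no SciPy needed)
    p_val = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_val

# Reported accuracies as counts out of 1,000 questions per domain
for model, adult, peds in [("GPT-5 Nano", 710, 554),
                           ("GPT-5 Mini", 815, 758),
                           ("GPT-5",      863, 885)]:
    z, p = two_prop_z(adult, 1000, peds, 1000)
    print(f"{model}: gap = {(adult - peds) / 10:+.1f} pts, z = {z:.2f}, p = {p:.3g}")
```

Under this test, the Nano gap is highly significant, the Mini gap remains significant at roughly the reported level (exact rounding may differ), and the full model's adult-pediatric difference is non-significant, matching p=0.138.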

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Match percentile | Probability |
|------|---------|------------------------|------------------|-------------|
| 1 | PLOS Digital Health | 91 | Top 0.1% | 24.1% |
| 2 | BioData Mining | 15 | Top 0.1% | 9.0% |
| 3 | PLOS ONE | 4,510 | Top 24% | 7.3% |
| 4 | Scientific Reports | 3,102 | Top 12% | 7.3% |
| 5 | Healthcare | 16 | Top 0.1% | 6.8% |
|   | *50% of probability mass above this line* | | | |
| 6 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.3% | 5.2% |
| 7 | Journal of the American Medical Informatics Association | 61 | Top 0.6% | 4.6% |
| 8 | npj Digital Medicine | 97 | Top 1% | 2.8% |
| 9 | JMIRx Med | 31 | Top 0.4% | 2.2% |
| 10 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.8% |
| 11 | JAMA Network Open | 127 | Top 2% | 1.8% |
| 12 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8% |
| 13 | BMC Bioinformatics | 383 | Top 5% | 1.3% |
| 14 | Computers in Biology and Medicine | 120 | Top 3% | 1.3% |
| 15 | Biology Methods and Protocols | 53 | Top 2% | 1.2% |
| 16 | GigaScience | 172 | Top 2% | 1.0% |
| 17 | Nature Medicine | 117 | Top 4% | 0.8% |
| 18 | Cureus | 67 | Top 5% | 0.8% |
| 19 | Proceedings of the National Academy of Sciences | 2,130 | Top 46% | 0.7% |
| 20 | Journal of Clinical Epidemiology | 28 | Top 0.6% | 0.7% |
| 21 | Genome Medicine | 154 | Top 9% | 0.7% |
| 22 | BMC Medical Education | 20 | Top 0.9% | 0.7% |
| 23 | Annals of Internal Medicine | 27 | Top 1% | 0.7% |
| 24 | European Journal of Neuroscience | 168 | Top 2% | 0.5% |
| 25 | Bioinformatics Advances | 184 | Top 5% | 0.5% |