
Closing the Pediatric Divide: A Performance Analysis of the GPT-5 Family in Medical Diagnostics

Mondillo, G.; Abbate, F. G.; Masino, M.; Colosimo, S.; Perrotta, A.; Frattolillo, V.

2025-08-29 · pediatrics
medRxiv · doi:10.1101/2025.08.28.25334657

Abstract

Background: Large Language Models (LLMs) have demonstrated significant potential in clinical medicine, but a persistent performance gap exists in the pediatric domain due to its unique complexities. This study provides the first comparative evaluation of the new GPT-5 family (Nano, Mini, and full) to assess the impact of model scale on diagnostic accuracy and on this adult-pediatric disparity.

Methods: A benchmarking study was conducted using 2,000 multiple-choice questions from the MedQA dataset, divided equally between adult (n=1,000) and pediatric (n=1,000) domains. GPT-5, GPT-5 Mini, and GPT-5 Nano were tested via API with standardized parameters (temperature=0, reasoning_effort=minimal, verbosity=low, max_tokens=170). Accuracy was calculated and statistically compared across domains for each model.

Results: A clear dose-response relationship was observed between model size and accuracy. GPT-5 Nano exhibited a significant performance gap, with 71.0% accuracy in adult medicine versus 55.4% in pediatrics (a 15.6-percentage-point difference, p<0.001). GPT-5 Mini substantially narrowed this gap to 5.7 points (81.5% vs. 75.8%, p=0.001). Critically, the full GPT-5 model eliminated the disparity, achieving comparable accuracy across domains, with 86.3% in adult medicine and a slightly higher 88.5% in pediatrics (p=0.138). Performance gains from scaling up were disproportionately larger in the pediatric domain.

Conclusion: The GPT-5 family marks a substantial advancement in medical AI. The full-size model not only achieves high diagnostic accuracy but, crucially, overcomes the previously documented performance limitations in pediatrics, demonstrating that sufficient model scale is vital for mastering the nuances of specialized clinical domains. These findings support a tiered implementation strategy based on task criticality and underscore the need for continued validation in real-world clinical settings.
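The abstract does not name the statistical test behind the reported p-values. As a minimal sketch, a standard two-sided two-proportion z-test (an assumption, not stated in the paper) applied to the reported accuracies with n=1,000 items per domain yields significance levels in line with those reported. The function name `two_prop_z` and the raw correct-answer counts (e.g. 710/1,000 for 71.0%) are reconstructed from the percentages for illustration.

```python
from math import sqrt, erf

def two_prop_z(s1, n1, s2, n2):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf, no SciPy needed)
    p_val = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_val

# Reported accuracies as counts out of 1,000 questions per domain
for model, adult, peds in [("GPT-5 Nano", 710, 554),
                           ("GPT-5 Mini", 815, 758),
                           ("GPT-5",      863, 885)]:
    z, p = two_prop_z(adult, 1000, peds, 1000)
    print(f"{model}: gap = {(adult - peds) / 10:+.1f} pts, z = {z:.2f}, p = {p:.3g}")
```

Under this test, the Nano gap is highly significant, the Mini gap remains significant at roughly the reported level (exact rounding may differ), and the full model's adult-pediatric difference is non-significant, matching p=0.138.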

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Match percentile | Probability |
|------|---------|------------------------|------------------|-------------|
| 1 | PLOS Digital Health | 91 | Top 0.1% | 24.1% |
| 2 | BioData Mining | 15 | Top 0.1% | 9.0% |
| 3 | PLOS ONE | 4,510 | Top 24% | 7.3% |
| 4 | Scientific Reports | 3,102 | Top 12% | 7.3% |
| 5 | Healthcare | 16 | Top 0.1% | 6.8% |
|   | *50% of probability mass above this line* | | | |
| 6 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.3% | 5.2% |
| 7 | Journal of the American Medical Informatics Association | 61 | Top 0.6% | 4.6% |
| 8 | npj Digital Medicine | 97 | Top 1% | 2.8% |
| 9 | JMIRx Med | 31 | Top 0.4% | 2.2% |
| 10 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.8% |
| 11 | JAMA Network Open | 127 | Top 2% | 1.8% |
| 12 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8% |
| 13 | BMC Bioinformatics | 383 | Top 5% | 1.3% |
| 14 | Computers in Biology and Medicine | 120 | Top 3% | 1.3% |
| 15 | Biology Methods and Protocols | 53 | Top 2% | 1.2% |
| 16 | GigaScience | 172 | Top 2% | 1.0% |
| 17 | Nature Medicine | 117 | Top 4% | 0.8% |
| 18 | Cureus | 67 | Top 5% | 0.8% |
| 19 | Proceedings of the National Academy of Sciences | 2,130 | Top 46% | 0.7% |
| 20 | Journal of Clinical Epidemiology | 28 | Top 0.6% | 0.7% |
| 21 | Genome Medicine | 154 | Top 9% | 0.7% |
| 22 | BMC Medical Education | 20 | Top 0.9% | 0.7% |
| 23 | Annals of Internal Medicine | 27 | Top 1% | 0.7% |
| 24 | European Journal of Neuroscience | 168 | Top 2% | 0.5% |
| 25 | Bioinformatics Advances | 184 | Top 5% | 0.5% |