
Evaluating Large Language Models for Transparent Quality-of-Care Measurement in Children with ADHD

Bannett, Y.; Pillai, M.; Huang, T.; Luo, I.; Gunturkun, F.; Hernandez-Boussard, T.

medRxiv, 2026-04-17 (pediatrics). DOI: 10.64898/2026.04.12.26350732

Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement.

Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review.

Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and to rank model explanations.

Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and to generate explanatory rationales and documentation evidence.

Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value [PPV], and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework).

Results: All three models demonstrated high performance compared with expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but the lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit.

Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.
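The reported F1-scores follow directly from the sensitivity and PPV values in the abstract: F1 is the harmonic mean of sensitivity (recall) and PPV (precision). A minimal sketch checking this relationship against the reported per-model numbers:

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """F1 is the harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# (sensitivity, PPV) pairs as reported in the abstract
models = {
    "Claude-3.5":    (0.89, 0.95),
    "LLaMA-3.3-70B": (0.91, 0.89),
    "GPT-4o":        (0.82, 0.97),
}

for name, (sens, ppv) in models.items():
    print(f"{name}: F1 = {f1_score(sens, ppv):.2f}")
# Claude-3.5: F1 = 0.92, LLaMA-3.3-70B: F1 = 0.90, GPT-4o: F1 = 0.89
```

The harmonic mean explains why GPT-4o's high PPV (0.97) still yields the lowest F1: the metric penalizes the imbalance caused by its low sensitivity (0.82).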

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | JAMA Network Open | 127 | Top 0.1% | 18.5% |
| 2 | The Journal of Pediatrics | 15 | Top 0.1% | 7.3% |
| 3 | Journal of the American Medical Informatics Association | 61 | Top 0.4% | 6.9% |
| 4 | Journal of Clinical Epidemiology | 28 | Top 0.1% | 4.9% |
| 5 | PLOS ONE | 4510 | Top 31% | 4.9% |
| 6 | Canadian Medical Association Journal | 15 | Top 0.1% | 3.6% |
| 7 | Archives of Disease in Childhood | 15 | Top 0.1% | 3.6% |
| 8 | JAMA Pediatrics | 10 | Top 0.1% | 3.1% |
| — | *50% of probability mass above this line* | — | — | — |
| 9 | PLOS Medicine | 98 | Top 2% | 2.6% |
| 10 | BioData Mining | 15 | Top 0.2% | 2.1% |
| 11 | PLOS Digital Health | 91 | Top 1% | 2.1% |
| 12 | Healthcare | 16 | Top 0.4% | 1.9% |
| 13 | Nature Human Behaviour | 85 | Top 2% | 1.8% |
| 14 | Nature Communications | 4913 | Top 49% | 1.8% |
| 15 | BMJ Open | 554 | Top 9% | 1.7% |
| 16 | Scientific Reports | 3102 | Top 59% | 1.7% |
| 17 | Pilot and Feasibility Studies | 12 | Top 0.3% | 1.3% |
| 18 | Psychological Medicine | 74 | Top 1% | 1.3% |
| 19 | Genetics in Medicine | 69 | Top 0.8% | 1.2% |
| 20 | Genome Medicine | 154 | Top 7% | 0.9% |
| 21 | Psychiatry Research | 35 | Top 1% | 0.8% |
| 22 | JMIRx Med | 31 | Top 2% | 0.8% |
| 23 | JMIR Formative Research | 32 | Top 1% | 0.8% |
| 24 | Pediatrics | 10 | Top 0.2% | 0.8% |
| 25 | Clinical Infectious Diseases | 231 | Top 4% | 0.8% |
| 26 | Annals of Internal Medicine | 27 | Top 0.8% | 0.8% |
| 27 | Frontiers in Digital Health | 20 | Top 1% | 0.8% |
| 28 | Clinical and Translational Science | 21 | Top 1% | 0.8% |
| 29 | Autism Research | 32 | Top 0.4% | 0.8% |
| 30 | Journal of Affective Disorders Reports | 10 | Top 0.3% | 0.8% |
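The "top 8 journals account for 50%" cutoff can be reproduced from the per-journal probabilities listed above: find the smallest prefix of the ranked list whose cumulative probability reaches 50%. A minimal sketch using the probabilities for the first ten ranks:

```python
from itertools import accumulate

# Per-journal predicted probabilities (%) for the top-ranked journals above
probs = [18.5, 7.3, 6.9, 4.9, 4.9, 3.6, 3.6, 3.1, 2.6, 2.1]

# Running totals of the probability mass
cumulative = list(accumulate(probs))

# Smallest number of journals whose combined mass reaches 50%
cutoff = next(i + 1 for i, total in enumerate(cumulative) if total >= 50.0)
print(cutoff)  # prints 8
```

Ranks 1-7 sum to 49.7%, just under the threshold; adding JAMA Pediatrics at rank 8 brings the cumulative mass to 52.8%, which is why the 50% marker falls after the eighth row.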