Large Language Models Improve Coding Accuracy and Reimbursement in a Neonatal Intensive Care Unit

Holmes, E.; Massarelli, C.; Richter, F.; Bernard, S.; Freeman, R.; Gavin, N.; Juliano, C.; Gelb, B. D.; Glicksberg, B. S.; Nadkarni, G. N.; Klang, E.

2025-07-24 | health informatics
medRxiv preprint, DOI: 10.1101/2025.07.23.25332086
Importance: Diagnosis coding is essential for clinical care, research validity, and hospital reimbursement. In neonatal settings, manual coding is frequently error-prone, contributing to misclassification and financial losses. Large language models (LLMs) offer a scalable approach to improve diagnostic consistency and optimize revenue.

Objective: To compare the diagnostic accuracy of LLMs with human coders in identifying common neonatal diagnoses and to assess the potential impact on revenue from Diagnosis-Related Group (DRG) assignment.

Design: Retrospective cross-sectional study using data from 2022 to 2023. LLMs were prompted with all physician notes from the admission. Two neonatologists independently and blindly adjudicated diagnoses from three sources: human coders, GPT-4o, and GPT-o3-mini.

Setting: A single academic referral center's neonatal intensive care unit (NICU).

Participants: A consecutive sample of 100 infants admitted to the NICU who did not require respiratory support. All available physician notes from the hospital stay were included.

Exposure: Two HIPAA-compliant LLMs (GPT-4o and GPT-o3-mini) were prompted to assign diagnoses from a standardized list based on physician notes. Three prompt iterations were developed and reviewed for optimization prior to final evaluation.

Main Outcomes and Measures: The primary outcome was diagnostic accuracy compared with physician adjudication. Secondary outcomes included changes in expected DRG assignment and projected annual revenue.

Results: Among 100 infants (median gestational age 35.6 weeks, 52% male), GPT-o3-mini achieved 79.1% diagnostic accuracy (95% CI, 74.0%-84.2%), comparable to human coders at 76.3% (95% CI, 70.9%-81.7%; P = .52). GPT-4o underperformed at 58.6% (95% CI, 52.5%-64.7%; P < .001 vs both). Accuracy of GPT-o3-mini did not differ by DRG impact. Extrapolated to one year, correct GPT-o3-mini diagnoses yielded projected revenue of $5.71 million, compared with $4.82 million from human coders, an 18% increase.

Conclusions and Relevance: A HIPAA-compliant LLM demonstrated diagnostic accuracy comparable to human coders in neonatal billing while identifying higher-acuity diagnoses that improved projected reimbursement. LLMs may serve as effective adjuncts to manual coding in neonatal care, with potential clinical and financial benefit.

Key Points

Question: Can a large language model support accurate diagnosis generation for neonatal billing?

Findings: In this retrospective study of 100 neonates hospitalized in the neonatal intensive care unit, GPT-o3-mini demonstrated diagnostic accuracy comparable to human coders, as confirmed by physician review. Its implementation could yield an estimated 18% increase in revenue.

Meaning: Large language models may serve as effective adjuncts in neonatal coding, offering both diagnostic precision and financial benefit.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 18.9% (61 papers in training set, top 0.1%)
2. BMJ Open: 8.5% (554 papers in training set, top 2%)
3. JAMA Pediatrics: 4.9% (10 papers in training set, top 0.1%)
4. JAMA Network Open: 4.9% (127 papers in training set, top 0.5%)
5. CMAJ Open: 4.4% (12 papers in training set, top 0.1%)
6. BMJ: 4.4% (49 papers in training set, top 0.2%)
7. BMJ Health & Care Informatics: 3.1% (13 papers in training set, top 0.2%)
8. PLOS ONE: 2.6% (4510 papers in training set, top 45%)

(50% of probability mass above this point)

9. BMJ Open Quality: 2.1% (15 papers in training set, top 0.4%)
10. Critical Care Explorations: 1.9% (15 papers in training set, top 0.2%)
11. JMIR Medical Informatics: 1.7% (17 papers in training set, top 0.7%)
12. PLOS Global Public Health: 1.7% (293 papers in training set, top 4%)
13. eClinicalMedicine: 1.7% (55 papers in training set, top 0.5%)
14. Trials: 1.7% (25 papers in training set, top 0.8%)
15. International Journal of Medical Informatics: 1.7% (25 papers in training set, top 0.8%)
16. BMJ Paediatrics Open: 1.5% (21 papers in training set, top 0.5%)
17. JAMIA Open: 1.5% (37 papers in training set, top 0.9%)
18. The Journal of Pediatrics: 1.5% (15 papers in training set, top 0.4%)
19. PLOS Digital Health: 1.3% (91 papers in training set, top 2%)
20. JAMA: 1.3% (17 papers in training set, top 0.1%)
21. BMC Health Services Research: 1.3% (42 papers in training set, top 1%)
22. The Lancet Digital Health: 1.2% (25 papers in training set, top 0.6%)
23. Journal of General Internal Medicine: 1.2% (20 papers in training set, top 0.7%)
24. Philosophical Transactions of the Royal Society B: 1.1% (51 papers in training set, top 4%)
25. BMC Medicine: 1.1% (163 papers in training set, top 5%)
26. Journal of Medical Internet Research: 1.0% (85 papers in training set, top 4%)
27. Scientific Reports: 1.0% (3102 papers in training set, top 69%)
28. BMC Pregnancy and Childbirth: 0.9% (20 papers in training set, top 0.6%)
29. Archives of Disease in Childhood: 0.8% (15 papers in training set, top 0.4%)
30. Canadian Medical Association Journal: 0.8% (15 papers in training set, top 0.3%)