
Validation of 13,102 ICD-10-CM Codes Using a Large Language Model-Based System

Wang, Y.; Song, Y.; Siu, R.; Nimma, I. R.; Yan, Y.; Savage, T. R.; Wang, Y.; Li, Z.; Ramai, D.; Wang, J.; Badurdeen, D.; Tao, C.; Kumbhari, V.; Huang, Y.

Posted: 2025-12-31 (health informatics)
DOI: 10.64898/2025.12.30.25343244 (medRxiv)

Objective: To comprehensively evaluate the validity of ICD-10-CM codes for both prevalent diagnoses and less common diseases, and to assess the performance of a large language model (LLM)-based system in validating these codes.

Materials and Methods: This retrospective study analyzed hospital admissions from the Medical Information Mart for Intensive Care (MIMIC-IV) database. We developed and validated an LLM-based system using GPT-4o, refined through iterative prompt engineering, to assess ICD-10-CM code validity. We measured the positive predictive value (PPV) of ICD-10-CM codes overall, the PPV of principal and secondary diagnoses, and the performance of the LLM-based system in code validation.

Results: Among 865,079 assigned codes, the PPV was 84.6% (95% CI, 84.5%-84.6%). Principal diagnoses had a PPV of 93.9% (95% CI, 93.7%-94.1%), while secondary diagnoses had a PPV of 83.8% (95% CI, 83.7%-83.9%). The LLM system demonstrated high performance in validating ICD codes, achieving 93.6% accuracy, 95.4% sensitivity, and 85.2% specificity. Among correctly assigned secondary diagnoses, the majority (67.9%) represented historical or baseline conditions, while 32.1% reflected active conditions that deviated from baseline status; 22.3% of these emerged after hospital admission. PPV decreased with later diagnosis positions, with the largest decline occurring between principal and secondary diagnoses.

Discussion and Conclusion: In this large-scale evaluation, ICD-10-CM codes exhibited generally high accuracy, though validity varied by diagnosis position and condition type. A validated LLM system performed comparably to physician review and offers a scalable means to improve coding accuracy. These findings support integrating LLM-based auditing into routine workflows to strengthen the quality of administrative and research data.

Matching journals

The top 4 journals account for over 50% of the predicted probability mass.

1. JMIR Medical Informatics: 18.5% (17 papers in training set; Top 0.1%)
2. Journal of the American Medical Informatics Association: 17.4% (61 papers in training set; Top 0.1%)
3. BMC Medical Informatics and Decision Making: 10.4% (39 papers in training set; Top 0.2%)
4. JAMIA Open: 10.0% (37 papers in training set; Top 0.1%)
(50% of cumulative probability mass above this point)
5. International Journal of Medical Informatics: 4.8% (25 papers in training set; Top 0.2%)
6. BMC Medical Research Methodology: 3.9% (43 papers in training set; Top 0.2%)
7. JMIR Public Health and Surveillance: 3.9% (45 papers in training set; Top 0.5%)
8. Journal of Medical Internet Research: 3.6% (85 papers in training set; Top 1%)
9. PLOS ONE: 3.0% (4510 papers in training set; Top 42%)
10. Journal of Biomedical Informatics: 3.0% (45 papers in training set; Top 0.5%)
11. BMJ Open: 2.1% (554 papers in training set; Top 8%)
12. npj Digital Medicine: 2.1% (97 papers in training set; Top 2%)
13. Scientific Reports: 1.9% (3102 papers in training set; Top 54%)
14. BMJ Health & Care Informatics: 1.7% (13 papers in training set; Top 0.4%)
15. JAMA Network Open: 0.9% (127 papers in training set; Top 4%)
16. Journal of General Internal Medicine: 0.9% (20 papers in training set; Top 0.8%)
17. The Lancet Digital Health: 0.8% (25 papers in training set; Top 1.0%)
18. Frontiers in Public Health: 0.7% (140 papers in training set; Top 8%)
19. PLOS Digital Health: 0.7% (91 papers in training set; Top 3%)
20. Frontiers in Digital Health: 0.7% (20 papers in training set; Top 1%)
21. Journal of the American Heart Association: 0.7% (119 papers in training set; Top 4%)
22. Computer Methods and Programs in Biomedicine: 0.7% (27 papers in training set; Top 1%)
23. Heliyon: 0.6% (146 papers in training set; Top 8%)
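The 50% cutoff noted in the list can be reproduced with a short sketch. The percentages are taken from the list above; the threshold logic (first rank at which the cumulative mass reaches 50%) is an assumption about how the cutoff was computed:

```python
# Predicted probability (%) per journal, in ranked order (from the list above).
probs = [18.5, 17.4, 10.4, 10.0, 4.8, 3.9, 3.9, 3.6,
         3.0, 3.0, 2.1, 2.1, 1.9, 1.7, 0.9, 0.9,
         0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6]

# Walk down the ranking until the cumulative probability mass reaches 50%.
cumulative = 0.0
cutoff_rank = None
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        cutoff_rank = rank
        break

print(cutoff_rank)           # rank at which the mass crosses 50% -> 4
print(round(cumulative, 1))  # cumulative mass at that rank -> 56.3
```

The top 4 entries sum to 56.3%, which is why the divider sits between ranks 4 and 5.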