Validation of 13,102 ICD-10-CM Codes Using a Large Language Model-Based System
Wang, Y.; Song, Y.; Siu, R.; Nimma, I. R.; Yan, Y.; Savage, T. R.; Wang, Y.; Li, Z.; Ramai, D.; Wang, J.; Badurdeen, D.; Tao, C.; Kumbhari, V.; Huang, Y.
Objective: To comprehensively evaluate the validity of ICD-10-CM codes for both prevalent diagnoses and less common diseases, and to assess the performance of a large language model (LLM)-based system in validating these codes.
Materials and Methods: This retrospective study analyzed hospital admissions from the Medical Information Mart for Intensive Care (MIMIC-IV) database. We developed and validated an LLM-based system using GPT-4o, refined through iterative prompt engineering, to assess ICD-10-CM code validity. We measured the positive predictive value (PPV) of ICD-10-CM codes overall, the PPV of principal and secondary diagnoses separately, and the performance of the LLM-based system in code validation.
Results: Among 865,079 assigned codes, the PPV was 84.6% (95% CI, 84.5%-84.6%). Principal diagnoses had a PPV of 93.9% (95% CI, 93.7%-94.1%), while secondary diagnoses had a PPV of 83.8% (95% CI, 83.7%-83.9%). The LLM system demonstrated high performance in validating ICD codes, achieving 93.6% accuracy, 95.4% sensitivity, and 85.2% specificity. Among correctly assigned secondary diagnoses, the majority (67.9%) represented historical or baseline conditions, while 32.1% reflected active conditions that deviated from baseline status; 22.3% of these emerged after hospital admission. PPV decreased at later diagnosis positions, with the largest decline occurring between principal and secondary diagnoses.
Discussion and Conclusion: In this large-scale evaluation, ICD-10-CM codes exhibited generally high accuracy, though accuracy varied by diagnosis position and condition type. A validated LLM system performed comparably to physician review and offers a scalable means to improve coding accuracy. These findings support the potential for integrating LLM-based auditing into routine workflows to strengthen the quality of administrative and research data.
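The quantities reported in the abstract (PPV with a 95% confidence interval, plus accuracy, sensitivity, and specificity of the validator) can be illustrated with a minimal sketch. Several points here are assumptions rather than details from the study: the abstract does not say which interval method was used, so the Wilson score interval is a stand-in; the function names `wilson_ci` and `validation_metrics` are hypothetical; "positive" is taken to mean that a code was judged valid; and the counts in the usage lines are invented purely for illustration, not study data.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def validation_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for a validator compared with a reference review.

    Assumed convention: tp = validator and reference both judge the code valid,
    tn = both judge it invalid, fp/fn = the validator disagrees with the reference.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Illustrative counts only; these are NOT figures from the study.
    confirmed, assigned = 850, 1000
    low, high = wilson_ci(confirmed, assigned)
    print(f"PPV = {confirmed / assigned:.1%} (95% CI, {low:.1%}-{high:.1%})")
    print(validation_metrics(tp=90, fp=10, tn=80, fn=20))
```

With very large denominators such as the 865,079 codes in the abstract, the interval becomes very narrow, which is consistent with the tight confidence bounds reported there.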
Matching journals
The top 4 journals account for 50% of the predicted probability mass.