Identify Patients at Risk of HIV Using a Clinical Large Language Model from Electronic Health Records

Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.

2026-04-23 hiv aids

10.64898/2026.04.21.26351427 medRxiv


This study developed a large language model (LLM)-based solution to identify people at risk of HIV using electronic health records (EHRs). We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort of 54,265 individuals, of whom only 3,342 (6%) had new HIV diagnoses. Our GatorTron-based solution achieved an F1 score of 53.5% and an AUC of 0.88, comparable to the traditional machine learning approaches. Subgroup analysis showed that, across age, sex, and race/ethnicity groups, both the LLM and the traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns between the LLM and the traditional machine learning models.
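
The abstract describes turning structured EHR data into narrative text ordered by visit date before passing it to the language model. The following is a minimal sketch of what such a serialization step might look like; it is not the authors' implementation. The `Visit` class, the `patient_to_narrative` function, the field names, and the sentence template are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Visit:
    # Hypothetical record layout; the real EHR schema is not given in the abstract.
    visit_date: date
    diagnoses: list[str]
    medications: list[str]


def patient_to_narrative(age: int, sex: str, race: str, visits: list[Visit]) -> str:
    """Serialize demographics and visits into one text string, oldest visit first."""
    parts = [f"Patient is a {age}-year-old {race} {sex}."]
    # Order visits chronologically, as the abstract states the narratives are ordered by visit date.
    for visit in sorted(visits, key=lambda v: v.visit_date):
        sentence = f"On {visit.visit_date.isoformat()}:"
        if visit.diagnoses:
            sentence += " diagnoses: " + ", ".join(visit.diagnoses) + "."
        if visit.medications:
            sentence += " medications: " + ", ".join(visit.medications) + "."
        parts.append(sentence)
    return " ".join(parts)


if __name__ == "__main__":
    # Toy, made-up data for illustration only.
    example = [
        Visit(date(2021, 5, 1), ["syphilis"], []),
        Visit(date(2020, 1, 15), [], ["doxycycline"]),
    ]
    print(patient_to_narrative(30, "male", "Hispanic", example))
```

A string produced this way could then be tokenized and fed to a text classifier. Sorting inside the function means the caller does not need to pre-order the visits.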

