Identify Patients at Risk of HIV Using a Clinical Large Language Model from Electronic Health Records

Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.

2026-04-23 hiv aids
10.64898/2026.04.21.26351427 medRxiv

This study developed a large language model (LLM)-based solution to identify people at risk of HIV using electronic health records (EHRs). We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort of 54,265 individuals, of whom only 3,342 (6%) had a new HIV diagnosis. Our GatorTron-based solution reached an F1 score of 53.5% and an AUC of 0.88, comparable to the traditional machine learning approaches. In subgroup analysis across age, sex, and race/ethnicity groups, both the LLM and the traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns across the LLM and the traditional machine learning models.
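
The step of turning structured EHR data into a narrative ordered by visit date can be pictured with a minimal sketch. This is not the authors' code: the function name `visits_to_narrative`, the record fields (`age`, `sex`, `date`, `diagnoses`, `medications`), the sentence template, and the toy patient are all assumptions made up for illustration.

```python
from datetime import date


def visits_to_narrative(demographics, visits):
    """Render structured records as one text narrative, earliest visit first.

    Field names and wording are hypothetical, not taken from the study.
    """
    lines = [f"Patient is a {demographics['age']}-year-old {demographics['sex']}."]
    # Sort by visit date so the text follows the patient's timeline.
    for visit in sorted(visits, key=lambda v: v["date"]):
        diagnoses = ", ".join(visit["diagnoses"]) or "none recorded"
        medications = ", ".join(visit["medications"]) or "none recorded"
        lines.append(
            f"On {visit['date'].isoformat()}: diagnoses: {diagnoses}; "
            f"medications: {medications}."
        )
    return " ".join(lines)


# Toy, invented patient record; visits are deliberately given out of order.
example = visits_to_narrative(
    {"age": 34, "sex": "male"},
    [
        {"date": date(2021, 5, 2), "diagnoses": ["syphilis"], "medications": ["penicillin G"]},
        {"date": date(2020, 11, 17), "diagnoses": ["gonorrhea"], "medications": []},
    ],
)
print(example)
```

A text of this kind could then be tokenized and passed to a clinical language model for classification; how the study actually formats and truncates the narratives is not stated in the abstract.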

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Medicine: 117 papers in training set, Top 0.1%, 12.6%
2. Nature Human Behaviour: 85 papers in training set, Top 0.1%, 10.4%
3. International Journal of Medical Informatics: 25 papers in training set, Top 0.1%, 9.1%
4. npj Digital Medicine: 97 papers in training set, Top 0.7%, 7.1%
5. Journal of Medical Internet Research: 85 papers in training set, Top 0.8%, 6.3%
6. PLOS ONE: 4510 papers in training set, Top 29%, 6.3%

50% of probability mass above

7. eLife: 5422 papers in training set, Top 14%, 6.3%
8. JAIDS Journal of Acquired Immune Deficiency Syndromes: 19 papers in training set, Top 0.1%, 4.8%
9. Nature Communications: 4913 papers in training set, Top 37%, 3.9%
10. PLOS Computational Biology: 1633 papers in training set, Top 10%, 3.6%
11. Science Translational Medicine: 111 papers in training set, Top 1%, 3.1%
12. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 2.1%
13. Communications Biology: 886 papers in training set, Top 11%, 1.5%
14. Communications Medicine: 85 papers in training set, Top 0.4%, 1.3%
15. Epidemics: 104 papers in training set, Top 1%, 1.2%
16. Clinical Infectious Diseases: 231 papers in training set, Top 4%, 1.1%
17. PLOS Medicine: 98 papers in training set, Top 4%, 0.9%
18. Journal of The Royal Society Interface: 189 papers in training set, Top 4%, 0.9%
19. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 43%, 0.8%
20. Science: 429 papers in training set, Top 20%, 0.7%
21. eBioMedicine: 130 papers in training set, Top 4%, 0.7%
22. Nature Genetics: 240 papers in training set, Top 8%, 0.7%
23. American Journal of Epidemiology: 57 papers in training set, Top 2%, 0.6%
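
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses the first eight probabilities from the list; the helper name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (%) of the first eight listed journals, in rank order.
probabilities = [12.6, 10.4, 9.1, 7.1, 6.3, 6.3, 6.3, 4.8]


def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative %) at the first rank where the sum meets the threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None  # threshold never reached by the values given


count, total = journals_to_reach(probabilities, 50.0)
print(count, round(total, 1))  # 6 51.8
```

The first five journals sum to 45.5%, so the sixth is the one that carries the cumulative total past 50%.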