Extracting TNFi Switching Reasons and Trajectories From Real-World Data Using Large Language Models

Miao, B. Y.; Binvignat, M.; Garcia-Agundez, A.; Bravo, M.; Williams, C. Y.; Miao, C. Q.; Alaa, A.; Rudrapatna, V. A.; Butte, A. J.; Schmajuk, G.; Yazdany, J.

2025-04-16 health informatics
10.1101/2025.04.14.25325834 medRxiv
Abstract

Importance: Tumor necrosis factor inhibitors (TNFi) are widely used for autoimmune conditions. Despite their efficacy, many patients switch TNFis due to lack of efficacy, cost, or adverse events. Understanding why switches occur is important but requires extensive chart review.

Objective: To determine whether large language models (LLMs) can automatically perform chart review, accurately identifying TNFi switching trajectories and reasons for switching in a large real-world cohort.

Design: Observational study using de-identified electronic health record data (2012-2023). Medication orders and associated clinical notes for TNFi agents were extracted; at least 6 months of follow-up was required to ascertain switches.

Setting: Single academic medical center (University of California, San Francisco).

Participants: 9,187 patients (mean [SD] age, 39.9 [19.0] years; 57.1% female) who received ≥1 TNFi with adequate follow-up. Among these, 1,481 (16.1%) had ≥1 TNFi switch, 418 (4.5%) had ≥2 switches, and 150 (1.6%) had ≥3 switches.

Exposures: Switching was defined as a change from one TNFi to a different TNFi at consecutive encounters.

Main Outcomes and Measures: Using GPT-4, we extracted which TNFi was stopped or started, along with the reason for switching: adverse event, drug resistance, insurance/cost, lack of efficacy, patient preference, other, or unknown. Performance was compared with eight open-source LLMs, structured medication data, and expert annotations.

Results: After applying inclusion criteria, 3,104 switches between different TNFi drugs were identified in 2,112 patients. GPT-4 achieved micro-F1 scores of 0.75 for the stopped TNFi, 0.80 for the started TNFi, and 0.83 for the switch reason. Among the open-source models, Starling-7B-beta and Llama-3-8B offered the most competitive overall performance relative to GPT-4 and achieved similar win-loss ratios. The primary reason identified by GPT-4 was lack of efficacy (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%).

Conclusions and Relevance: Both GPT-4 and locally deployable LLMs demonstrated potential for executing complex reasoning tasks, specifically identifying reasons for switching between TNF inhibitors. This finding suggests broader applications in clinical research and documentation. Further research is needed to assess model performance across additional medication classes and patient populations.

Key Points

Question: Can large language models (LLMs) identify TNF inhibitor (TNFi) switching trajectories and reasons from clinical notes?

Findings: We used de-identified electronic health records from the University of California, San Francisco (UCSF) for 9,187 patients who received ≥1 TNFi. GPT-4 achieved micro-F1 scores up to 0.830 in identifying reasons and specific TNFi starts/stops compared with clinical expert annotations, surpassing eight open-source LLMs. The best open-source models, Llama-3-8b-chat-hf and Starling-7B-beta, matched GPT-4 in determining which TNFi was started but had lower accuracy in identifying reasons for switching.

Meaning: The LLMs evaluated in this study were capable of performing the complex reasoning needed to identify reasons for switching between TNFis. The approach could extend to other biologics, to other pharmacoepidemiology studies, and to chart summarization.
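The abstract reports model performance as micro-averaged F1 scores. As a minimal illustration of how that metric is computed (not the study's actual code or data; the example labels below are invented), micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: aggregate TP/FP/FN over all classes, then compute F1.

    Assumes at least one prediction is made (avoids division by zero).
    """
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical switch-reason labels, mirroring the study's category names.
truth = ["lack of efficacy", "adverse event", "insurance/cost", "lack of efficacy"]
preds = ["lack of efficacy", "adverse event", "lack of efficacy", "lack of efficacy"]
print(micro_f1(truth, preds))  # 3 of 4 correct -> 0.75
```

For single-label multiclass tasks like switch-reason classification, micro-F1 coincides with overall accuracy, which is why the pooled counts above reduce to 3/4 here.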

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 51.8% |
| 2 | npj Digital Medicine | 97 | Top 0.5% | 10.1% |
| 3 | Journal of Biomedical Informatics | 45 | Top 0.4% | 3.6% |
| 4 | Journal of Medical Internet Research | 85 | Top 2% | 2.4% |
| 5 | JAMA Network Open | 127 | Top 2% | 2.1% |
| 6 | The Lancet Digital Health | 25 | Top 0.3% | 1.9% |
| 7 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8% |
| 8 | JAMIA Open | 37 | Top 0.9% | 1.7% |
| 9 | PLOS ONE | 4510 | Top 55% | 1.7% |
| 10 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.5% |
| 11 | JMIR Medical Informatics | 17 | Top 1% | 1.2% |
| 12 | Scientific Reports | 3102 | Top 66% | 1.2% |
| 13 | JCO Clinical Cancer Informatics | 18 | Top 0.6% | 1.1% |
| 14 | Pharmacoepidemiology and Drug Safety | 13 | Top 0.3% | 0.9% |
| 15 | Clinical Pharmacology & Therapeutics | 25 | Top 0.5% | 0.9% |
| 16 | BMC Medical Research Methodology | 43 | Top 1.0% | 0.9% |
| 17 | European Respiratory Journal | 54 | Top 2% | 0.9% |
| 18 | BMC Medicine | 163 | Top 7% | 0.7% |
| 19 | PLOS Digital Health | 91 | Top 3% | 0.7% |
| 20 | JMIR Public Health and Surveillance | 45 | Top 4% | 0.7% |
| 21 | iScience | 1063 | Top 35% | 0.7% |
| 22 | BMJ Open | 554 | Top 13% | 0.7% |
| 23 | Research Synthesis Methods | 20 | Top 0.2% | 0.6% |
| 24 | Inflammatory Bowel Diseases | 15 | Top 0.3% | 0.6% |