Extracting TNFi Switching Reasons and Trajectories From Real-World Data Using Large Language Models
Miao, B. Y.; Binvignat, M.; Garcia-Agundez, A.; Bravo, M.; Williams, C. Y.; Miao, C. Q.; Alaa, A.; Rudrapatna, V. A.; Butte, A. J.; Schmajuk, G.; Yazdany, J.
Importance: Tumor necrosis factor inhibitors (TNFi) are widely used for autoimmune conditions. Despite their efficacy, many patients switch TNFis because of lack of efficacy, cost, or adverse events. Understanding why switches occur is important but requires extensive chart review.

Objective: To determine whether large language models (LLMs) can automatically perform chart review, accurately identifying TNFi switching trajectories and reasons for switching in a large real-world cohort.

Design: Observational study using de-identified electronic health record data (2012-2023). Medication orders and associated clinical notes for TNFi agents were extracted; at least 6 months of follow-up was required to ascertain switches.

Setting: Single academic medical center (University of California, San Francisco).

Participants: 9,187 patients (mean [SD] age, 39.9 [19.0] years; 57.1% female) who received ≥1 TNFi with adequate follow-up. Among these, 1,481 (16.1%) had ≥1 TNFi switch, 418 (4.5%) had ≥2 switches, and 150 (1.6%) had ≥3 switches.

Exposures: Switching was defined as a change from one TNFi to a different TNFi at consecutive encounters.

Main Outcomes and Measures: Using GPT-4, we extracted which TNFi was stopped or started and the reason for switching: adverse event, drug resistance, insurance/cost, lack of efficacy, patient preference, other, or unknown. Performance was compared with eight open-source LLMs, structured medication data, and expert annotations.

Results: After applying inclusion criteria, 3,104 switches between different TNFi drugs in 2,112 patients were identified. GPT-4 achieved micro-F1 scores of 0.75 for the stopped TNFi, 0.80 for the started TNFi, and 0.83 for the switch reason. Among the open-source models, Starling-7B-beta and Llama-3-8B offered the most competitive performance relative to GPT-4 and achieved similar win-loss ratios.
The primary reason identified by GPT-4 was lack of efficacy (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%).

Conclusions and Relevance: Both GPT-4 and locally deployable LLMs demonstrated potential for executing complex reasoning tasks, specifically identifying reasons for switching between TNF inhibitors. This finding suggests broader applications in clinical research and documentation. Further research is needed to assess model performance across additional medication classes and patient populations.

Key Points

Question: Can large language models (LLMs) identify TNF inhibitor (TNFi) switching trajectories and reasons from clinical notes?

Findings: We used de-identified electronic health records from the University of California, San Francisco (UCSF) for 9,187 patients who received ≥1 TNFi. GPT-4 achieved micro-F1 scores up to 0.830 in identifying switch reasons and specific TNFi starts/stops compared with clinical expert annotations, surpassing eight open-source LLMs. The best open-source models, Llama-3-8b-chat-hf and Starling-7B-beta, matched GPT-4 in determining which TNFi was started but had lower accuracy in identifying reasons for switching.

Meaning: The LLMs evaluated in this study were capable of performing complex reasoning in identifying reasons for switching between TNFis. This approach could extend to other biologics, other pharmacoepidemiology studies, and chart summarization.
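The abstract reports micro-F1 scores, which pool true positives, false positives, and false negatives across all label classes before computing F1. As an illustration only (the labels and predictions below are hypothetical, not the study's data), a minimal sketch of this metric for a single-label extraction task such as switch-reason classification:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: aggregate TP, FP, and FN over all classes,
    then compute precision, recall, and F1 from the pooled counts."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        tp += sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp += sum(1 for g, p in zip(gold, pred) if p == label and g != label)
        fn += sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical expert annotations vs. model outputs for four switches:
gold = ["lack of efficacy", "adverse event", "insurance/cost", "lack of efficacy"]
pred = ["lack of efficacy", "adverse event", "lack of efficacy", "lack of efficacy"]
print(round(micro_f1(gold, pred), 2))  # 0.75
```

Note that for single-label multi-class tasks, each misclassification contributes one false positive and one false negative, so micro-F1 coincides with accuracy; it diverges from accuracy in multi-label settings.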