
Extracting adverse event nature, severity, timelines and resulting interventions from clinical notes of patients receiving CAR-T therapy using large language models.

Guillot, J.; Miao, B.; Suresh, A.; Sushil, M.; Williams, C. Y.; Vashisht, R.; Oskotsky, T. T.; Sirota, M.; Butte, A. J.

2026-05-05 health informatics
10.64898/2026.04.28.26351782 medRxiv

Chimeric Antigen Receptor T-cell (CAR-T) therapy, where genetically engineered patient T cells target tumor antigens, has transformed care for hematologic malignancies but requires careful tracking of adverse events (AEs) often documented only in unstructured EHR notes. We evaluated a Large Language Model (LLM)-based approach in UCSF's secure environment to extract AEs, dates, grades, and interventions within 30 days post-infusion for six commercial CAR-T products (2012-2023), benchmarking against two evaluators. Using GPT-4-0314 in a zero-shot setting with four prompts (prespecified AEs, non-prespecified AEs, cytokine release syndrome [CRS], and immune effector cell-associated neurotoxicity syndrome [ICANS]), we compared outputs against dual annotations on a random sample of 50 notes using accuracy, precision, recall, F1, and Cohen's kappa. From 4,762 progress notes for 293 patients (median age 65.6 years), CRS occurred in 80.2% (median onset 4 days); neutropenia in 70.0% (16 days); neutropenic fever in 64.8% (4 days); and ICANS in 34.8%. Interventions included tocilizumab and corticosteroids. Grades were frequently undocumented (CRS 62.3%, ICANS 56.1%); documented cases were mainly CRS grade 1 (59.4%) and ICANS grade 2 (28.0%). Performance was high on CRS and ICANS grading (accuracy of 0.97 and 0.91, respectively). Performance was moderate for extraction of prespecified AEs (accuracies 0.62-0.76) and non-prespecified AEs (accuracies 0.76-0.84). Inter-rater reliability was strong to near-perfect for CRS/ICANS presence and grade (kappa 0.86-0.96), moderate for dates and interventions, and weaker for broader AE attributes. LLM-derived insights can augment AE monitoring and real-world evidence generation by unlocking unstructured clinical detail and characteristic timelines after CAR-T therapy. However, performance varied for broader AE attributes, warranting cautious use. Performance was highest for detecting the presence and grade of CRS and ICANS, with strong to near-perfect inter-rater reliability.
While cautious use of LLMs for broad AE extraction is warranted due to the variable performance observed in this study, these results support integrating high-performing CRS/ICANS extraction into EHR workflows.

Author summary

Chimeric Antigen Receptor T-cell (CAR-T) therapy has transformed care for blood cancer but requires careful tracking of adverse events (AEs). We asked whether a large language model could read routine clinical notes and extract AEs after CAR T-cell therapy. We analyzed de-identified notes from the first month after infusion. The model identified when two key side effects occurred, cytokine release syndrome (a whole-body inflammatory reaction) and neurotoxicity (brain and nerve symptoms), and how severe they were, with accuracy similar to human reviewers. It also captured when side effects started and what treatments were given, though performance was more variable for the wider range of side effects beyond these two. In our data, these reactions often arose within the first week; blood count problems and infections were also common. Because many notes did not state severity explicitly, the model sometimes could not assign a grade. Our findings suggest that language models can help unlock important details hidden in clinical notes and could be incorporated into electronic records to support faster, more reliable side-effect monitoring and research. We recommend careful, supervised use and continued validation, especially for broader side-effect categories.
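The abstract reports accuracy, precision, recall, F1, and Cohen's kappa without showing how they are computed. The sketch below is one minimal way to calculate them for a binary label such as "CRS present in this note". It is not the authors' code: the function names (`binary_metrics`, `cohens_kappa`) and the ten toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def binary_metrics(reference, predicted):
    """Accuracy, precision, recall and F1 for two equal-length 0/1 label lists."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    # Fall back to 0.0 when a denominator is zero (no positives predicted or present).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequency per label.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 1 = AE present in the note, 0 = absent.
reference = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]

print(binary_metrics(reference, predicted))
print(round(cohens_kappa(reference, predicted), 3))
```

Kappa is worth reporting next to accuracy because it discounts agreement expected by chance, which matters when one label (for example, "no ICANS") dominates the sample.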

Matching journals

The top 2 journals account for 50% of the predicted probability mass.
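The page does not say how the 50% cutoff is derived. One plausible reading, assumed here, is that it marks the smallest rank at which the cumulative predicted probability reaches 50%. The helper name `rank_reaching_mass` is hypothetical; the probabilities are the top five values listed on this page.

```python
def rank_reaching_mass(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank whose running total
    meets the threshold, or None if the threshold is never reached."""
    total = 0.0
    for rank, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return rank, total
    return None


# Predicted probabilities (percent) for the five highest-ranked journals.
top_probabilities = [35.2, 19.1, 7.4, 5.0, 3.7]

rank, mass = rank_reaching_mass(top_probabilities)
print(f"{rank} journals cover {mass:.1f}% of the predicted probability mass")
```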

1. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.1%; 35.2%
2. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.1%; 19.1%
(50% of probability mass above)
3. npj Digital Medicine: 97 papers in training set; Top 0.7%; 7.4%
4. Journal of Biomedical Informatics: 45 papers in training set; Top 0.3%; 5.0%
5. Journal of Medical Internet Research: 85 papers in training set; Top 1%; 3.7%
6. JAMIA Open: 37 papers in training set; Top 0.5%; 2.7%
7. International Journal of Medical Informatics: 25 papers in training set; Top 0.5%; 2.7%
8. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 1%; 2.1%
9. BMJ Health & Care Informatics: 13 papers in training set; Top 0.4%; 1.7%
10. Frontiers in Digital Health: 20 papers in training set; Top 0.6%; 1.7%
11. JMIR Medical Informatics: 17 papers in training set; Top 1%; 1.1%
12. iScience: 1063 papers in training set; Top 23%; 1.1%
13. Med: 38 papers in training set; Top 0.5%; 1.0%
14. BMC Medical Research Methodology: 43 papers in training set; Top 0.9%; 1.0%
15. Scientific Reports: 3102 papers in training set; Top 70%; 0.9%
16. Informatics in Medicine Unlocked: 21 papers in training set; Top 1.0%; 0.8%
17. PLOS ONE: 4510 papers in training set; Top 65%; 0.8%
18. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.7%; 0.7%
19. Computers in Biology and Medicine: 120 papers in training set; Top 5%; 0.7%
20. Cancer Medicine: 24 papers in training set; Top 2%; 0.7%
21. Clinical Pharmacology & Therapeutics: 25 papers in training set; Top 0.9%; 0.7%
22. Inflammatory Bowel Diseases: 15 papers in training set; Top 0.3%; 0.5%
23. Cureus: 67 papers in training set; Top 6%; 0.5%
24. Journal of Personalized Medicine: 28 papers in training set; Top 2%; 0.5%