Extracting adverse event nature, severity, timelines and resulting interventions from clinical notes of patients receiving CAR-T therapy using large language models.

Guillot, J.; Miao, B.; Suresh, A.; Sushil, M.; Williams, C. Y.; Vashisht, R.; Oskotsky, T. T.; Sirota, M.; Butte, A. J.

2026-05-05 health informatics

10.64898/2026.04.28.26351782 medRxiv


Chimeric Antigen Receptor T-cell (CAR-T) therapy, in which genetically engineered patient T cells target tumor antigens, has transformed care for hematologic malignancies but requires careful tracking of adverse events (AEs), which are often documented only in unstructured EHR notes. We evaluated a Large Language Model (LLM)-based approach in UCSF's secure environment to extract AEs, dates, grades, and interventions within 30 days post-infusion for six commercial CAR-T products (2012-2023), benchmarking against two evaluators. Using GPT-4-0314 in a zero-shot setting with four prompts (prespecified AEs, non-prespecified AEs, CRS, ICANS), we compared outputs against dual annotations on a random sample of 50 notes using accuracy, precision, recall, F1, and Cohen's kappa. From 4,762 progress notes for 293 patients (median age 65.6), CRS occurred in 80.2% (median onset 4 days), neutropenia in 70.0% (16 days), neutropenic fever in 64.8% (4 days), and ICANS in 34.8%. Interventions included tocilizumab and corticosteroids. Grades were frequently undocumented (CRS 62.3%, ICANS 56.1%); documented cases were mainly CRS grade 1 (59.4%) and ICANS grade 2 (28.0%). Performance was high for CRS and ICANS grading (accuracy of 0.97 and 0.91, respectively) and moderate for extraction of prespecified AEs (accuracies 0.62-0.76) and non-prespecified AEs (accuracies 0.76-0.84). Inter-rater reliability was strong to near-perfect for CRS/ICANS presence and grade (kappa 0.86-0.96), moderate for dates and interventions, and weaker for broader AE attributes. LLM-derived insights can augment AE monitoring and real-world evidence generation by unlocking unstructured clinical detail and characteristic timelines after CAR-T therapy. However, performance varied for broader AE attributes, warranting cautious use. Performance was highest for detecting the presence and grade of CRS and ICANS, with strong to near-perfect inter-rater reliability.
While cautious use of LLMs for broad AE extraction is warranted due to the variable performance observed in this study, these results support integrating high-performing CRS/ICANS extraction into EHR workflows.

Author summary

Chimeric Antigen Receptor T-cell (CAR-T) therapy has transformed care for blood cancer but requires careful tracking of adverse events (AEs). We asked whether a large language model could read routine clinical notes and extract AEs after CAR T-cell therapy. We analyzed de-identified notes from the first month after infusion. The model identified when two key side effects occurred--cytokine release syndrome (a whole-body inflammatory reaction) and neurotoxicity (brain and nerve symptoms)--and how severe they were, with accuracy similar to human reviewers. It also captured when side effects started and what treatments were given, though performance was more variable for the wider range of side effects beyond these two. In our data, these reactions often arose within the first week; blood count problems and infections were also common. Because many notes did not state severity explicitly, the model sometimes could not assign a grade. Our findings suggest that language models can help unlock important details hidden in clinical notes and could be incorporated into electronic records to support faster, more reliable side-effect monitoring and research. We recommend careful, supervised use and continued validation, especially for broader side-effect categories.
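
The abstract reports accuracy, precision, recall, F1, and Cohen's kappa but does not say how they were computed. As a minimal sketch of the standard definitions (not the authors' code), the Python below computes them for a binary label such as "CRS present". The function names `cohens_kappa` and `binary_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def binary_metrics(reference, predicted, positive=1):
    """Accuracy, precision, recall and F1 against a reference annotation."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be the same non-empty length")
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (illustrative only): 1 = event present, 0 = absent.
annotator = [1, 1, 0, 0]
model = [1, 0, 0, 0]
kappa = cohens_kappa(annotator, model)
scores = binary_metrics(annotator, model)
```

Kappa corrects raw agreement for agreement expected by chance, which is why it can be well below accuracy when one label dominates.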

