Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models
Dickerson, J. C.; McClure, M. B.; Shaw, M.; Reitsma, M. B.; Dalal, N. H.; Kurian, A. W.; Caswell-Jin, J. L.
Background: Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and treatment history are often documented only in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM), and asked whether variables from complex longitudinal oncology records can be abstracted with performance similar to that of expert medical oncologists. Methods: We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLM were unnormalized, unlabeled, and unedited clinical notes, pathology reports, medication administration records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets. Results: Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status.
For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than that of the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard-ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use. Conclusions: Off-the-shelf LLMs in a fixed retrieval pipeline abstracted a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability on key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.