
Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models

Dickerson, J. C.; McClure, M. B.; Shaw, M.; Reitsma, M. B.; Dalal, N. H.; Kurian, A. W.; Caswell-Jin, J. L.

medRxiv preprint · 2026-03-25 · oncology
DOI: 10.64898/2026.03.23.26349012

Background: Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and treatment history are often documented only in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to determine whether variables from complex longitudinal oncology records can be abstracted with performance similar to that of expert medical oncologists.

Methods: We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLM were unnormalized, unlabeled, and unedited clinical notes, pathology reports, medication administration records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates from fully LLM-derived datasets were similar to those from expert-derived datasets.

Results: Among the 100 patients, the median chart contained more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status. For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than that of the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard-ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use.

Conclusions: Off-the-shelf LLMs in a fixed retrieval pipeline abstracted a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.
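The variable-level concordance figures reported above can be read as exact-match rates against the expert reference standard. A minimal sketch of that computation, using hypothetical labels rather than study data:

```python
# Hypothetical sketch: patient-level concordance between LLM output and an
# expert reference, as in the abstract's reported 99%/96%/91% figures.
# Labels below are illustrative examples, not data from the study.

def concordance(llm_labels, expert_labels):
    """Fraction of patients where the LLM label matches the expert label."""
    assert len(llm_labels) == len(expert_labels)
    matches = sum(a == b for a, b in zip(llm_labels, expert_labels))
    return matches / len(llm_labels)

# Toy example: recurrence status for five patients.
llm = ["recurred", "no recurrence", "no recurrence", "recurred", "no recurrence"]
expert = ["recurred", "no recurrence", "recurred", "recurred", "no recurrence"]
print(f"Concordance: {concordance(llm, expert):.0%}")  # → Concordance: 80%
```

In practice, concordance for multi-part variables such as therapy lines would need a patient-level matching rule (e.g., exact line-by-line agreement) rather than a single categorical comparison.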

Matching journals

The top 3 journals account for 50% of the predicted probability mass.
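The 50% cutoff can be derived by accumulating the ranked predicted probabilities until half the mass is covered. A small sketch using the top probabilities shown on this page (the cutoff rule itself is an assumption about how the widget works):

```python
# Sketch: find how many top-ranked journals are needed to cover 50% of the
# predicted probability mass. Probabilities are the top entries listed below.
probs = [0.418, 0.046, 0.042, 0.029, 0.027, 0.027, 0.026]  # ranked, descending

cumulative, k = 0.0, 0
for p in probs:
    cumulative += p
    k += 1
    if cumulative >= 0.5:
        break
print(k)  # → 3 (41.8% + 4.6% + 4.2% = 50.6% ≥ 50%)
```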

Rank  Journal                                        Papers in training set  Percentile  Probability
 1    JCO Clinical Cancer Informatics                  18                    Top 0.1%       41.8%
 2    Breast Cancer Research                           32                    Top 0.2%        4.6%
 3    Clinical Cancer Research                         58                    Top 0.3%        4.2%
      --- 50% of probability mass above ---
 4    npj Digital Medicine                             97                    Top 1%          2.9%
 5    Nature Communications                          4913                    Top 43%         2.7%
 6    JAMA Network Open                               127                    Top 1%          2.7%
 7    JCO Precision Oncology                           14                    Top 0.1%        2.6%
 8    PLOS Computational Biology                     1633                    Top 13%         2.2%
 9    European Journal of Cancer                       10                    Top 0.1%        2.2%
10    PLOS ONE                                       4510                    Top 49%         2.0%
11    npj Precision Oncology                           48                    Top 0.4%        2.0%
12    Cancer Medicine                                  24                    Top 0.7%        1.8%
13    Annals of Oncology                               13                    Top 0.4%        1.8%
14    Scientific Reports                             3102                    Top 55%         1.8%
15    npj Breast Cancer                                18                    Top 0.1%        1.4%
16    Frontiers in Oncology                            95                    Top 2%          1.4%
17    Nature Cancer                                    35                    Top 0.9%        1.3%
18    Nature Medicine                                 117                    Top 3%          1.0%
19    Cancers                                         200                    Top 4%          1.0%
20    JNCI Cancer Spectrum                             10                    Top 0.4%        1.0%
21    Cancer Research                                 116                    Top 3%          0.8%
22    BMC Cancer                                       52                    Top 2%          0.8%
23    Cancer Epidemiology, Biomarkers & Prevention     17                    Top 0.5%        0.8%
24    iScience                                       1063                    Top 29%         0.8%
25    BMC Bioinformatics                              383                    Top 7%          0.8%
26    British Journal of Cancer                        42                    Top 2%          0.8%
27    BMC Research Notes                               29                    Top 0.9%        0.5%
28    Database                                         51                    Top 1%          0.5%