
Metastasis Extraction from NSCLC Clinical Notes: A Retrospective Comparative Evaluation of Large Language Model-Based Classification

Balaji, S.; Campbell, K.; Chen, R.-Z.; Smith, D. G.; Reyna, M. A.; Sarker, A.; Wallach, J. D.; Parikh, R. B.; Bozkurt, S.

2026-04-29 health informatics
10.64898/2026.04.27.26351872 medRxiv

Background: Identifying metastasis status in non-small cell lung cancer (NSCLC) is critical for understanding disease prognosis, treatment courses, trial eligibility, and population-level cancer surveillance. However, metastasis status is inconsistently recorded in structured cancer registry fields, since manual abstraction of clinical notes is often a resource-intensive and error-prone process. This challenge highlights an opportunity to leverage large language models (LLMs) for large-scale metastasis extraction from real-world clinical documentation.

Objective: We conducted a retrospective, multi-cohort comparative evaluation of three distinct LLMs on two independent classification tasks: overall metastasis presence at any site and brain/CNS metastasis presence. We evaluated model performance on two independent NSCLC cohorts: (1) a registry-linked cohort used for model development and validation and (2) an independent cohort with manual note-level annotations for additional validation. We further explored whether our methods could analyze clinical documentation to recover missing or outdated metastasis information in structured registry labels.

Methods: Patient cohorts were derived from the Winship Cancer Institute. Cohort 1 (n=579 patients; 24,887 notes across 69 note types; 2023-2025) used registry-linked metastasis fields as the reference standard. Cohort 2 (n=22 patients; 644 radiology notes; 2010-2021) was drawn from two completed randomized trials and used dual-annotator manual labels (Cohen's κ: 0.93 for overall metastasis, 0.88 for CNS metastasis) as the reference standard. We fine-tuned the GatorTron-base encoder model separately for each binary classification task, and we evaluated MedGemma-27B-text and Llama 3.1-70B using zero-shot prompting. A separate cohort of 675 patients with missing or unknown registry labels was used for an exploratory missingness-recovery analysis, validated against manual annotations of a random subsample.

Results: More than half (54%) of initially identified Cohort 1 patients had missing or unknown registry metastasis labels. For overall metastasis, MedGemma demonstrated the best performance (Cohort 1: F1=0.80; Cohort 2 patient-level: F1=1.0; Cohort 2 note-level: F1=0.93). For brain/CNS metastasis, Llama 3.1-70B performed best in both cohorts (Cohort 1: F1=0.79; Cohort 2 patient-level: F1=0.93; Cohort 2 note-level: F1=0.86). The fine-tuned GatorTron model showed strong performance for classification of overall metastasis in Cohort 1 (F1=0.72). Error analysis indicated that most apparent errors reflected incomplete registry labels, ambiguous clinical language, or missing documentation rather than true model errors. In the exploratory recovery analysis, model predictions agreed with manual annotations at accuracy=0.90 and F1=0.89.

Conclusions: All models demonstrated relatively high performance. The zero-shot generative models were more robust to nuanced documentation and to context-dependent brain/CNS metastasis extraction. The fine-tuned encoder model demonstrated strong classification performance but may have been limited by potential inaccuracies in the registry reference standard used during model training. This study further demonstrated the potential of LLMs to recover clinically plausible structured labels from narrative text, complementing cancer registries for metastasis ascertainment.
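
The abstract reports F1 for model predictions and Cohen's kappa for inter-annotator agreement without defining either. The following is a minimal sketch of how both can be computed for binary labels; it is not the authors' evaluation code, and the function names and the toy label lists are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `a` as the reference and `b` as the prediction."""
    tp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    fp = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    fn = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    tn = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return tp, fp, fn, tn


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters with binary labels."""
    n = len(rater_a)
    tp, fp, fn, tn = confusion_counts(rater_a, rater_b)
    p_o = (tp + tn) / n                      # observed agreement
    pa = (tp + fn) / n                       # rater A positive rate
    pb = (tp + fp) / n                       # rater B positive rate
    p_e = pa * pb + (1 - pa) * (1 - pb)      # agreement expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


# Hypothetical toy labels (1 = metastasis present, 0 = absent); not study data.
reference = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

print(f1_score(reference, predicted))     # 0.75
print(cohen_kappa(reference, predicted))  # 0.5
```

On these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so F1 is 6/8; observed agreement is 0.75 against a chance agreement of 0.5, giving a kappa of 0.5.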

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.1%; 22.6%
2. npj Digital Medicine: 97 papers in training set; Top 0.6%; 8.4%
3. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.4%; 7.2%
4. Journal of Biomedical Informatics: 45 papers in training set; Top 0.3%; 4.9%
5. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 0.7%; 4.0%
6. The Lancet Digital Health: 25 papers in training set; Top 0.1%; 3.9%
(50% of probability mass above)
7. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.1%; 3.6%
8. Scientific Reports: 3102 papers in training set; Top 36%; 3.6%
9. JAMIA Open: 37 papers in training set; Top 0.5%; 3.1%
10. BMJ Health & Care Informatics: 13 papers in training set; Top 0.2%; 3.1%
11. International Journal of Medical Informatics: 25 papers in training set; Top 0.6%; 2.1%
12. Bioinformatics: 1061 papers in training set; Top 7%; 1.7%
13. PLOS Digital Health: 91 papers in training set; Top 1%; 1.7%
14. Nature Communications: 4913 papers in training set; Top 54%; 1.5%
15. PLOS ONE: 4510 papers in training set; Top 56%; 1.5%
16. JMIR Medical Informatics: 17 papers in training set; Top 0.8%; 1.5%
17. JAMA Network Open: 127 papers in training set; Top 3%; 1.3%
18. eBioMedicine: 130 papers in training set; Top 2%; 1.3%
19. Annals of Internal Medicine: 27 papers in training set; Top 0.6%; 1.2%
20. iScience: 1063 papers in training set; Top 21%; 1.2%
21. Biology Methods and Protocols: 53 papers in training set; Top 2%; 1.1%
22. Med: 38 papers in training set; Top 0.5%; 1.1%
23. Frontiers in Artificial Intelligence: 18 papers in training set; Top 0.5%; 1.0%
24. Nature Machine Intelligence: 61 papers in training set; Top 3%; 0.8%
25. Cancer Medicine: 24 papers in training set; Top 1%; 0.7%
26. Inflammatory Bowel Diseases: 15 papers in training set; Top 0.3%; 0.7%
27. Journal of Pathology Informatics: 13 papers in training set; Top 0.4%; 0.7%
28. JAMA: 17 papers in training set; Top 0.4%; 0.7%
29. Frontiers in Digital Health: 20 papers in training set; Top 2%; 0.6%
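
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses only the first eight values from the ranking; the function name `journals_needed` is hypothetical, and because the displayed percentages are rounded, the total is approximate.

```python
from typing import Optional, Sequence, Tuple

# Predicted probabilities (in percent) for ranks 1-8, copied from the ranking.
probabilities = [22.6, 8.4, 7.2, 4.9, 4.0, 3.9, 3.6, 3.6]


def journals_needed(probs: Sequence[float], threshold: float = 50.0) -> Optional[Tuple[int, float]]:
    """Return (rank, cumulative mass) at the first rank where the running sum reaches the threshold."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None  # threshold never reached with the values supplied


result = journals_needed(probabilities)
print(result)  # rank 6, with a cumulative mass of about 51.0
```

The running sum after rank 5 is about 47.1%, so the sixth journal is the first to push the cumulative mass past 50%.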