Metastasis Extraction from NSCLC Clinical Notes: A Retrospective Comparative Evaluation of Large Language Model-Based Classification
Balaji, S.; Campbell, K.; Chen, R.-Z.; Smith, D. G.; Reyna, M. A.; Sarker, A.; Wallach, J. D.; Parikh, R. B.; Bozkurt, S.
Background: Identifying metastasis status in non-small cell lung cancer (NSCLC) is critical for understanding disease prognosis, treatment course, trial eligibility, and population-level cancer surveillance. However, metastasis status is inconsistently recorded in structured cancer registry fields because manual abstraction of clinical notes is resource-intensive and error-prone. This challenge creates an opportunity to use large language models (LLMs) for large-scale metastasis extraction from real-world clinical documentation.
Objective: We conducted a retrospective, multi-cohort comparative evaluation of three LLMs on two independent binary classification tasks: presence of metastasis at any site (overall metastasis) and presence of brain/CNS metastasis. We evaluated model performance on two independent NSCLC cohorts: (1) a registry-linked cohort used for model development and validation and (2) an independent cohort with manual note-level annotations for additional validation. We further explored whether these methods could recover missing or outdated metastasis information in structured registry labels from clinical documentation.
Methods: Patient cohorts were derived from the Winship Cancer Institute. Cohort 1 (n=579 patients; 24,887 notes across 69 note types; 2023-2025) used registry-linked metastasis fields as the reference standard. Cohort 2 (n=22 patients; 644 radiology notes; 2010-2021) was drawn from two completed randomized trials and used dual-annotator manual labels (Cohen's κ: 0.93 for overall metastasis, 0.88 for CNS metastasis) as the reference standard. We fine-tuned the GatorTron-base encoder model separately for each binary classification task and evaluated MedGemma-27B-text and Llama 3.1-70B using zero-shot prompting.
A separate cohort of 675 patients with missing or unknown registry labels was used for an exploratory missingness-recovery analysis, validated against manual annotations of a random subsample.
Results: More than half (54%) of initially identified Cohort 1 patients had missing or unknown registry metastasis labels. For overall metastasis, MedGemma performed best (Cohort 1: F1=0.80; Cohort 2 patient-level: F1=1.0; Cohort 2 note-level: F1=0.93). For brain/CNS metastasis, Llama 3.1-70B performed best in both cohorts (Cohort 1: F1=0.79; Cohort 2 patient-level: F1=0.93; Cohort 2 note-level: F1=0.86). The fine-tuned GatorTron model showed strong performance for overall metastasis classification in Cohort 1 (F1=0.72). Error analysis indicated that most apparent errors reflected incomplete registry labels, ambiguous clinical language, or missing documentation rather than true model errors. In the exploratory recovery analysis, model predictions agreed with manual annotations (accuracy=0.90, F1=0.89).
Conclusions: All models demonstrated relatively high performance. The zero-shot generative models were more robust to nuanced documentation and to context-dependent brain/CNS metastasis extraction. The fine-tuned encoder model demonstrated strong classification performance but may have been limited by inaccuracies in the registry reference standard used during training. This study further demonstrated the potential of LLMs to recover clinically plausible structured labels from narrative text, complementing cancer registries for metastasis ascertainment.
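The F1 and Cohen's κ figures quoted in the abstract, and the difference between note-level and patient-level results, can be illustrated with a minimal sketch. The function names below are hypothetical, and the "any positive note makes the patient positive" aggregation rule is an assumption for illustration only; the abstract does not state how note-level predictions were rolled up to patients.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators assigning binary (0/1) labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each annotator's marginal positive rate.
    pa = sum(rater_a) / n
    pb = sum(rater_b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


def to_patient_level(note_labels):
    """Roll note-level labels up to patients.

    ASSUMPTION: a patient is positive if any of their notes is positive.
    note_labels is an iterable of (patient_id, label) pairs.
    """
    patients = defaultdict(int)
    for patient_id, label in note_labels:
        patients[patient_id] = max(patients[patient_id], label)
    return dict(patients)


# Toy data only; these are not values from the study.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_f1 = f1_score(toy_true, toy_pred)
toy_kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
toy_patients = to_patient_level([("p1", 0), ("p1", 1), ("p2", 0)])
```

Under this any-positive rule, a single false-positive note is enough to flip a patient-level prediction, so note-level and patient-level scores can diverge in either direction.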