
Combining Token Classification With Large Language Model Revision for Age-Friendly 4M Entity Recognition From Nursing Home Text Messages: Development and Evaluation Study

Amewudah, P.; Popescu, M.; Farmer, M. S.; Powell, K. R.

2026-04-01 · health informatics
medRxiv · DOI: 10.64898/2026.03.31.26349861

Background: Secure text messages (TMs) exchanged among interdisciplinary care teams in nursing homes (NHs) contain clinical information aligned with the Age-Friendly Health Systems 4Ms: What Matters, Medication, Mentation, and Mobility. Yet this information is not captured in any structured form, making it unavailable for systematic monitoring or quality reporting. Accurately and efficiently extracting 4M information from these messages could enable several downstream applications in long-term care settings. The task is challenging, however, because of the fragmented syntax, brevity, abbreviations, and informality of TMs.

Objective: This study aimed to develop and evaluate a multistage 4M Entity Recognition (4M-ER) pipeline that combines a fine-tuned token classifier with large language model (LLM) revision, using only locally deployed open-source models, to improve 4M information extraction from clinical TMs.

Methods: We used an expert-annotated dataset of 1,169 TMs collected from interdisciplinary teams across 16 Midwest NHs. The pipeline first identifies candidate text spans with a fine-tuned Bio-ClinicalBERT token classifier. A semantic similarity retriever then selects in-context exemplars to guide an LLM revision stage in which the LLM (Gemma, Phi, Qwen, or Mistral) performs boundary correction, label evaluation, and selective acceptance or rejection of candidate spans. Baselines for comparison included single-stage zero-shot LLMs, single-stage fine-tuned Bio-ClinicalBERT, and a fine-tuned LLM (Gemma) from a prior study. Ablation studies assessed the contribution of each pipeline stage and the effect of message filtering. Robustness was evaluated across 5 repeated runs.

Results: The 4M-ER pipeline outperformed the previously fine-tuned Gemma LLM across all 4M domains, achieving F1 (entity type) improvements of 2 to 11 percentage points without any additional fine-tuning and at roughly half the GPU memory (12 vs 24 GB). It also improved on single-stage fine-tuned Bio-ClinicalBERT in Mobility, Mentation, and What Matters (+0.02 to +0.05 F1). Error analysis showed that LLM revision reduced false positives by 25% to 35% by correcting misclassifications caused by conversational ambiguity, while the fine-tuned Bio-ClinicalBERT's high recall captured subtle entities that the fine-tuned Gemma missed. Silver data augmentation further improved the hardest domains, raising What Matters F1 from 0.59 to 0.67 and Mobility F1 from 0.64 to 0.67. Ablation studies confirmed that restricting LLMs to revision alone yielded the best balance of accuracy and efficiency.

Conclusions: The 4M-ER pipeline enables accurate and scalable extraction of 4M entities from clinical TMs by combining fine-tuned Bio-ClinicalBERT with LLM revision, using only locally deployed open-source models. The structured 4M data produced by the pipeline can support 4M taxonomy and ontology construction, as demonstrated in prior work, and provides a foundation for downstream applications including real-time clinical surveillance, compliance with emerging age-friendly quality measures, and predictive modeling in long-term care settings.
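The three-stage flow described in the abstract (token classification, exemplar retrieval, LLM revision) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the keyword lookup stands in for the fine-tuned Bio-ClinicalBERT classifier, word overlap stands in for the semantic similarity retriever, and the revision stage is a pass-through stub where a local LLM would correct boundaries and accept or reject candidates.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    label: str   # one of the 4M domains
    start: int
    end: int

def classify_tokens(message):
    """Stage 1 stand-in: propose candidate 4M spans.

    A toy keyword lookup replaces the fine-tuned Bio-ClinicalBERT
    token classifier used in the paper.
    """
    keywords = {"walker": "Mobility", "confused": "Mentation",
                "lisinopril": "Medication", "goals": "What Matters"}
    spans = []
    lowered = message.lower()
    for word, label in keywords.items():
        idx = lowered.find(word)
        if idx != -1:
            spans.append(Span(message[idx:idx + len(word)], label, idx, idx + len(word)))
    return spans

def retrieve_exemplars(message, annotated_corpus, k=2):
    """Stage 2 stand-in: pick in-context exemplars.

    Ranks annotated messages by word overlap; the real retriever
    would use semantic (embedding) similarity.
    """
    words = set(message.lower().split())
    ranked = sorted(annotated_corpus,
                    key=lambda m: -len(words & set(m.lower().split())))
    return ranked[:k]

def llm_revise(candidates, exemplars):
    """Stage 3 stand-in: boundary correction, label evaluation, and
    selective acceptance/rejection. Here every candidate is accepted
    unchanged; in the pipeline a local LLM performs this step."""
    return candidates

def run_4m_er(message, annotated_corpus):
    candidates = classify_tokens(message)
    exemplars = retrieve_exemplars(message, annotated_corpus)
    return llm_revise(candidates, exemplars)
```

The key design point the abstract emphasizes is that the LLM is restricted to *revising* candidates from the token classifier rather than extracting entities itself, which the ablation studies found to be both more accurate and more efficient.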

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Probability
---- | ------- | ---------------------- | ----- | -----------
1  | Journal of the American Medical Informatics Association | 61   | Top 0.1% | 18.4%
2  | Journal of Biomedical Informatics                       | 45   | Top 0.1% | 14.2%
3  | npj Digital Medicine                                    | 97   | Top 0.6% | 8.3%
4  | Journal of Medical Internet Research                    | 85   | Top 0.6% | 7.1%
5  | Frontiers in Digital Health                             | 20   | Top 0.1% | 7.1%
6  | JAMIA Open                                              | 37   | Top 0.3% | 4.8%
7  | Scientific Reports                                      | 3102 | Top 25%  | 4.8%
8  | BMC Medical Informatics and Decision Making             | 39   | Top 0.6% | 4.3%
9  | IEEE Journal of Biomedical and Health Informatics       | 34   | Top 0.4% | 3.9%
10 | Artificial Intelligence in Medicine                     | 15   | Top 0.2% | 2.4%
11 | International Journal of Medical Informatics            | 25   | Top 0.6% | 2.4%
12 | JMIR Medical Informatics                                | 17   | Top 0.5% | 2.1%
13 | PLOS Digital Health                                     | 91   | Top 1%   | 1.9%
14 | PLOS ONE                                                | 4510 | Top 52%  | 1.8%
15 | iScience                                                | 1063 | Top 23%  | 1.1%
16 | Patterns                                                | 70   | Top 2%   | 0.9%
17 | Computers in Biology and Medicine                       | 120  | Top 4%   | 0.8%
18 | BMC Medical Research Methodology                        | 43   | Top 1%   | 0.7%
19 | Computer Methods and Programs in Biomedicine            | 27   | Top 1%   | 0.7%
20 | Cureus                                                  | 67   | Top 6%   | 0.6%
21 | Schizophrenia                                           | 19   | Top 0.4% | 0.6%
22 | Biology Methods and Protocols                           | 53   | Top 3%   | 0.6%
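The claim that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities until the running total first reaches 50%:

```python
# Predicted probabilities (%) for the top-ranked journals, as listed above.
probs = [18.4, 14.2, 8.3, 7.1, 7.1, 4.8, 4.8, 4.3, 3.9, 2.4]

# Find the first rank at which the cumulative probability reaches 50%.
cumulative = 0.0
cutoff_rank = None
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cutoff_rank is None and cumulative >= 50.0:
        cutoff_rank = rank

print(cutoff_rank)  # → 5 (cumulative: 18.4, 32.6, 40.9, 48.0, 55.1)
```

The cumulative mass sits at 48.0% after rank 4 and crosses 50% at rank 5, which is where the widget draws its cutoff.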