
Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction

Liu, X.; Garg, M.; Jeon, E.; Jia, H.; Sauver, J. S.; Pagali, S. R.; Sohn, S.

2026-04-05 · health informatics
DOI: 10.64898/2026.04.03.26350117 (medRxiv)

Clinical narrative text contains crucial patient information, yet reliable extraction remains challenging due to linguistic variability, documentation habits, and differences across care settings. Large language models (LLMs) have shown strong accuracy on clinical information extraction (IE), but their reproducibility (stability under repeated runs) and robustness (stability under small, natural prompt variations) are less consistently quantified, despite being central to clinical deployment. In this study, we evaluate three open-weight LLMs representing distinct modeling choices: a dense general-purpose model (Llama 3.3), a mixture-of-experts (MoE) general-purpose model (Llama 4), and a domain-tuned medical model (MedGemma). We focus on binary clinical IE aligned with four mobility classes from the International Classification of Functioning, Disability and Health (ICF) framework. Using a controlled experimental design, we quantify (1) intra-prompt reproducibility across repeated sampling and (2) inter-prompt robustness across paraphrased prompts. We jointly report predictive performance (F1-score) and stability (Fleiss' kappa, κ), and test factor effects using three-way ANOVA with post-hoc comparisons. Results show that increasing temperature generally degrades agreement, although the magnitude depends on model and task; prompt paraphrasing can substantially reduce stability, with particularly large drops for the MoE model. Finally, we evaluate a practical mitigation, self-consistency via majority voting, which substantially improves κ and often improves or preserves F1-score, at the cost of additional inference. Together, these findings provide a reproducible framework and concrete recommendations for evaluating and improving LLM reliability in clinical IE.
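The abstract's two central quantities lend themselves to a compact illustration. Below is a minimal sketch, not the authors' code: the toy `runs` matrix, the choice of five repeated samples, and the use of statsmodels for Fleiss' kappa are all assumptions for illustration.

```python
# Minimal sketch (assumed setup, not the paper's code): measure run-to-run
# agreement with Fleiss' kappa, then apply the self-consistency mitigation
# by majority-voting the repeated runs into a single label per note.
from collections import Counter

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: rows = clinical notes, columns = repeated LLM runs of
# the same prompt at temperature > 0; entries = binary mobility labels.
runs = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
])

# Intra-prompt reproducibility: Fleiss' kappa across the repeated runs.
counts, _ = aggregate_raters(runs)  # notes x categories count table
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa across runs: {kappa:.3f}")

# Self-consistency via majority voting (an odd run count avoids ties).
majority = [Counter(row).most_common(1)[0][0] for row in runs]
print("Majority-vote labels:", majority)
```

The same kappa computation applies to inter-prompt robustness by treating paraphrased prompts, rather than repeated samples, as the "raters".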

Matching journals

The top 4 journals account for just over 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | npj Digital Medicine | 97 | Top 0.2% | 22.2%
2 | Journal of Biomedical Informatics | 45 | Top 0.1% | 14.1%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.3% | 9.9%
4 | Scientific Reports | 3102 | Top 7% | 9.9%
-- 50% of probability mass above --
5 | Artificial Intelligence in Medicine | 15 | Top 0.1% | 6.2%
6 | Frontiers in Digital Health | 20 | Top 0.3% | 3.5%
7 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.7% | 2.6%
8 | JMIR Medical Informatics | 17 | Top 0.7% | 1.8%
9 | International Journal of Medical Informatics | 25 | Top 0.9% | 1.6%
10 | JCO Clinical Cancer Informatics | 18 | Top 0.5% | 1.5%
11 | Journal of NeuroEngineering and Rehabilitation | 28 | Top 0.6% | 1.5%
12 | PLOS Digital Health | 91 | Top 2% | 1.3%
13 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2%
14 | The Lancet Digital Health | 25 | Top 0.7% | 1.2%
15 | PLOS ONE | 4510 | Top 62% | 1.1%
16 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
17 | iScience | 1063 | Top 25% | 0.9%
18 | JAMIA Open | 37 | Top 1% | 0.9%
19 | Journal of Medical Internet Research | 85 | Top 4% | 0.8%
20 | Nature Medicine | 117 | Top 5% | 0.7%
21 | Patterns | 70 | Top 3% | 0.7%
22 | Biology Methods and Protocols | 53 | Top 3% | 0.7%
23 | BMC Medical Research Methodology | 43 | Top 2% | 0.7%
24 | Philosophical Transactions of the Royal Society B | 51 | Top 7% | 0.6%
25 | Nature Communications | 4913 | Top 66% | 0.6%
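The "50% of probability mass" divider after rank 4 follows from a simple cumulative sum over the listed probabilities. A minimal sketch, using the table's own (rounded) numbers:

```python
# Cumulative probability mass over the top-ranked journals (values taken
# from the table above): accumulate in rank order until reaching 50%.
probs = [0.222, 0.141, 0.099, 0.099]  # ranks 1-4, as listed above

cumulative, n = 0.0, 0
for p in probs:
    cumulative += p
    n += 1
    if cumulative >= 0.5:
        break
print(f"Top {n} journals cover {cumulative:.1%} of the probability mass")
# -> Top 4 journals cover 56.1% of the probability mass
```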