Assessing physiological coherence in stress related predictions of large language models: a surrogate based analysis of the Mistral 3 family using wearable HRV data

Bolpagni, M.; Pozza, M.; Gabrielli, S.

2026-04-27 health informatics
10.64898/2026.04.24.26351717 medRxiv
Chronic psychological stress contributes to allostatic load and is associated with cardiovascular, metabolic, and mental health disorders. Wearable devices enable continuous, noninvasive monitoring of autonomic signals such as heart rate variability (HRV), creating new opportunities for real-time stress assessment. Large language models (LLMs) are increasingly explored as interfaces for interpreting such data, but it remains unclear whether their predictions reflect physiologically meaningful patterns or rely on superficial heuristics. In this study, we assess whether LLM-derived stress predictions are physiologically coherent and how this varies with model scale. Using a longitudinal wearable dataset collected in naturalistic conditions (35 participants; 5,100 five-minute windows with HRV and contextual features), we obtained stress pseudoprobabilities from three models in the Mistral 3 family (675B, 14B, 3B) via zero-shot prompting. To make model behavior interpretable, we trained surrogate models to approximate LLM outputs and analyzed feature-response relationships using SHAP. Our results indicate that surrogate models closely reproduced LLM predictions (R² up to 0.915; Cohen's κ up to 0.941), enabling high-fidelity characterization of decision patterns and providing a practical framework for auditing the physiological coherence of LLM-derived predictions. Physiological coherence increased with model scale: the largest model exhibited near-complete alignment with established HRV stress responses, together with stable, predominantly monotonic feature effects and a balanced integration of physiological and contextual information. This pattern weakened at smaller scales, with the mid-scale model showing partial alignment and the smallest model displaying reduced stability, greater feature concentration, and more irregular, non-monotonic relationships.
These findings indicate that larger LLMs encode more physiologically consistent representations of stress, whereas smaller models rely on simplified and less stable strategies, and highlight the value of surrogate-based analysis as a practical framework for evaluating LLM behavior in biomedical applications and supporting their responsible integration into wearable health analytics.
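
The abstract reports surrogate fidelity with two agreement metrics: a coefficient of determination on the continuous pseudoprobabilities and Cohen's kappa on class labels. The sketch below shows one way such metrics could be computed. It is not the authors' pipeline: the probability values are invented toy numbers, and the 0.5 binarization threshold and all function and variable names are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy values only (hypothetical): LLM pseudoprobabilities and the
# surrogate's approximation of them for six windows.
llm_prob = [0.10, 0.80, 0.35, 0.90, 0.60, 0.20]
surrogate_prob = [0.15, 0.75, 0.40, 0.85, 0.45, 0.25]

# Assumed threshold for turning pseudoprobabilities into stress labels.
THRESHOLD = 0.5
llm_label = [int(p >= THRESHOLD) for p in llm_prob]
surrogate_label = [int(p >= THRESHOLD) for p in surrogate_prob]

fidelity_r2 = r_squared(llm_prob, surrogate_prob)
fidelity_kappa = cohen_kappa(llm_label, surrogate_label)
print(f"R^2 = {fidelity_r2:.3f}, kappa = {fidelity_kappa:.3f}")
```

Here the LLM output is treated as the reference and the surrogate as the approximation, so both metrics measure how faithfully the surrogate mimics the LLM rather than how well either predicts stress.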

Matching journals

The top 5 journals account for 50% of the predicted probability mass.
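
As a quick check of the statement above, the following sketch accumulates the listed probabilities in rank order and reports how many journals are needed to reach a target share. The function name is hypothetical, and the input is simply the first eight percentages from this listing.

```python
def journals_needed(probabilities, target):
    """Return (count, cumulative) for the shortest ranked prefix whose
    summed probability reaches the target, or None if it never does."""
    cumulative = 0.0
    for count, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= target:
            return count, cumulative
    return None


# Predicted probabilities (percent) for ranks 1 to 8 in this listing.
top_probabilities = [19.6, 12.6, 8.5, 6.9, 4.3, 4.3, 4.0, 4.0]

result = journals_needed(top_probabilities, 50.0)
print(result)
```

With these values the first four journals sum to 47.6% and the fifth brings the total to 51.9%, which is why the 50% threshold falls after rank 5.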

Rank | Journal | Papers in training set | Top % | Predicted probability
1 | npj Digital Medicine | 97 | Top 0.2% | 19.6%
2 | Scientific Reports | 3102 | Top 3% | 12.6%
3 | Journal of Medical Internet Research | 85 | Top 0.5% | 8.5%
4 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.2% | 6.9%
5 | Computers in Biology and Medicine | 120 | Top 0.5% | 4.3%
50% of probability mass above
6 | Frontiers in Digital Health | 20 | Top 0.1% | 4.3%
7 | PLOS ONE | 4510 | Top 36% | 4.0%
8 | PLOS Digital Health | 91 | Top 0.6% | 4.0%
9 | Journal of Biomedical Informatics | 45 | Top 0.4% | 3.7%
10 | Sensors | 39 | Top 0.5% | 3.6%
11 | Patterns | 70 | Top 0.6% | 1.9%
12 | Physiological Measurement | 12 | Top 0.2% | 1.7%
13 | iScience | 1063 | Top 14% | 1.7%
14 | PLOS Computational Biology | 1633 | Top 16% | 1.7%
15 | JMIR mHealth and uHealth | 10 | Top 0.2% | 1.7%
16 | Advanced Science | 249 | Top 16% | 0.9%
17 | Nature Biomedical Engineering | 42 | Top 2% | 0.7%
18 | Frontiers in Physiology | 93 | Top 6% | 0.7%
19 | Biomedical Signal Processing and Control | 18 | Top 0.6% | 0.6%
20 | Translational Psychiatry | 219 | Top 5% | 0.6%
21 | JAMIA Open | 37 | Top 2% | 0.5%
22 | Communications Biology | 886 | Top 32% | 0.5%
23 | Frontiers in Artificial Intelligence | 18 | Top 1% | 0.5%
24 | American Journal of Physiology-Regulatory, Integrative and Comparative Physiology | 13 | Top 0.5% | 0.5%
25 | Computational and Structural Biotechnology Journal | 216 | Top 12% | 0.5%
26 | Journal of Personalized Medicine | 28 | Top 2% | 0.5%
27 | eBioMedicine | 130 | Top 6% | 0.5%