Assessing physiological coherence in stress related predictions of large language models: a surrogate based analysis of the Mistral 3 family using wearable HRV data
Bolpagni, M.; Pozza, M.; Gabrielli, S.
Chronic psychological stress contributes to allostatic load and is associated with cardiovascular, metabolic, and mental health disorders. Wearable devices enable continuous, noninvasive monitoring of autonomic signals such as heart rate variability (HRV), creating new opportunities for real-time stress assessment. Large language models (LLMs) are increasingly explored as interfaces for interpreting such data, but it remains unclear whether their predictions reflect physiologically meaningful patterns or rely on superficial heuristics. In this study, we assess whether LLM-derived stress predictions are physiologically coherent and how this coherence varies with model scale. Using a longitudinal wearable dataset collected in naturalistic conditions (35 participants; 5,100 five-minute windows with HRV and contextual features), we obtained stress pseudoprobabilities from three models in the Mistral 3 family (675B, 14B, 3B) via zero-shot prompting. To make model behavior interpretable, we trained surrogate models to approximate LLM outputs and analyzed feature-response relationships using SHAP. The surrogate models closely reproduced LLM predictions (R² up to 0.915; Cohen's κ up to 0.941), enabling high-fidelity characterization of decision patterns. Physiological coherence increased with model scale: the largest model exhibited near-complete alignment with established HRV stress responses, together with stable, predominantly monotonic feature effects and a balanced integration of physiological and contextual information. This pattern weakened at smaller scales, with the mid-scale model showing partial alignment and the smallest model displaying reduced stability, greater feature concentration, and more irregular, non-monotonic relationships. These findings indicate that larger LLMs encode more physiologically consistent representations of stress, whereas smaller models rely on simplified and less stable strategies. They also highlight the value of surrogate-based analysis as a practical framework for auditing LLM behavior in biomedical applications and for supporting the responsible integration of LLMs into wearable health analytics.
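The two surrogate-fidelity metrics reported in the abstract, R² on the continuous pseudoprobabilities and Cohen's κ on binarized labels, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the study: the toy values, the 0.5 decision threshold, and the function names (`r_squared`, `cohen_kappa`, `binarize`) are hypothetical, and the authors' actual pipeline may differ.

```python
from typing import List, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination of y_pred with respect to y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def cohen_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two label sequences."""
    a, b = list(labels_a), list(labels_b)
    n = len(a)
    # Observed agreement: fraction of windows where both labels match.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from the marginal frequencies.
    p_expected = sum(
        (a.count(label) / n) * (b.count(label) / n) for label in set(a) | set(b)
    )
    if p_expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def binarize(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map pseudoprobabilities to 0/1 labels (threshold is an assumption)."""
    return [int(p >= threshold) for p in probs]


# Hypothetical toy values: LLM stress pseudoprobabilities for six windows
# and the corresponding surrogate-model approximations.
llm_probs = [0.10, 0.80, 0.65, 0.30, 0.90, 0.45]
surrogate_probs = [0.15, 0.75, 0.60, 0.35, 0.85, 0.55]

fidelity_r2 = r_squared(llm_probs, surrogate_probs)
fidelity_kappa = cohen_kappa(binarize(llm_probs), binarize(surrogate_probs))
```

Here the LLM output is treated as the target and the surrogate output as the prediction, so both metrics measure how faithfully the surrogate mimics the LLM rather than how well either predicts ground-truth stress.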