
Automatic pain face analysis in mice: Applied to a varied dataset with non-standardized conditions

Andresen, N.; Wöllhaf, M.; Wilzopolski, J.; Lang, A.; Wolter, A.; Howe-Wittek, L.; Bekemeier, C.; Pawlak, L.-I.; Beyer, S.; Cynis, H.; Hietel, E.; Rieckmann, V.; Rieckmann, M.; Thöne-Reineke, C.; Lewejohann, L.; Hellwich, O.; Hohlbaum, K.

Posted 2026-02-18 · animal behavior and cognition
bioRxiv · doi:10.64898/2026.02.16.706098
Abstract

Biomedical research relies on scientifically validated tools to assess pain, suffering, and distress in laboratory animals and to ensure their well-being. In mice, the most frequently used laboratory animals, the Mouse Grimace Scale (MGS) provides a reliable tool for assessing facial expression changes caused by impaired well-being. However, no automated tool can yet reliably assess all features of the MGS across different mouse strains under varying experimental or housing conditions in real time, as the variability present in recorded image datasets poses substantial challenges for computer vision models. Despite this technical difficulty, variability across subsets in terms of mouse strain, treatment, laboratory, and image acquisition setup is essential for paving the way toward MGS assessment under non-standardized conditions in the home cage rather than in standardized cage-side recording setups. Against this background, a large and diverse dataset containing five subsets is introduced, and a deep learning model is trained to predict average MGS scores ranging from 0 to 2. The model achieved a root mean squared error (RMSE) of 0.26 when trained on all subsets of the dataset, outperforming the average human rater in terms of error magnitude. The correlation between human raters and automated MGS scores was very high (Pearson's r = 0.85). In the cross-dataset evaluation, one subset was excluded from training and used to test the model; this approach yielded higher errors than models trained and tested on the same subsets. A model restricted to the orbital tightening feature performed worse than one trained on all facial features of the MGS. Overall, the most reliable model for predicting average MGS scores on a novel dataset is the one trained on the combined subsets. Performance may be further enhanced by fine-tuning the model with human-generated MGS scores for a portion of the novel subset.
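The two evaluation metrics quoted above, RMSE and Pearson's r between human and automated MGS scores, can be sketched as follows. This is an illustrative reimplementation, not the authors' code, and the example score lists are hypothetical:

```python
import math

def rmse(pred, true):
    # Root mean squared error between predicted and human-rated MGS scores
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def pearson_r(x, y):
    # Pearson correlation coefficient between two score lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical average MGS scores (range 0-2), NOT data from the paper
human = [0.2, 0.8, 1.5, 0.4, 1.1]
model = [0.3, 0.7, 1.4, 0.6, 1.0]
print(f"RMSE = {rmse(model, human):.3f}, r = {pearson_r(model, human):.3f}")
```

A perfect model would give RMSE = 0 and r = 1; the paper reports RMSE 0.26 and r = 0.85 on the combined subsets.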

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank  Journal                                        Papers (training set)  Percentile  Probability
1     PLOS ONE                                       4510                   Top 6%      22.9%
2     Scientific Reports                             3102                   Top 0.7%    18.9%
3     Scientific Data                                174                    Top 0.1%    12.5%
----- 50% of probability mass above this line -----
4     PLOS Computational Biology                     1633                   Top 8%      4.4%
5     Sensors                                        39                     Top 0.3%    4.4%
6     iScience                                       1063                   Top 3%      4.0%
7     Journal of Neuroscience Methods                106                    Top 0.9%    1.7%
8     Behavior Research Methods                      25                     Top 0.1%    1.7%
9     Nature Communications                          4913                   Top 53%     1.5%
10    Frontiers in Human Neuroscience                67                     Top 2%      1.4%
11    Frontiers in Bioengineering and Biotechnology  88                     Top 2%      1.2%
12    eLife                                          5422                   Top 48%     1.2%
13    Biomedicines                                   66                     Top 2%      1.2%
14    Communications Biology                         886                    Top 15%     1.1%
15    SoftwareX                                      15                     Top 0.3%    0.9%
16    Science Advances                               1098                   Top 27%     0.8%
17    Animals                                        20                     Top 0.8%    0.8%
18    Biology Open                                   130                    Top 3%      0.8%
19    Journal of Visualized Experiments              30                     Top 0.7%    0.8%
20    Journal of Biophotonics                        16                     Top 0.7%    0.7%
21    Frontiers in Computational Neuroscience        53                     Top 3%      0.5%
22    Cell Reports Methods                           141                    Top 7%      0.5%
23    Frontiers in Neuroscience                      223                    Top 9%      0.5%
24    IEEE Access                                    31                     Top 1%      0.5%
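The "50% of probability mass" cutoff shown in the table can be reproduced by summing the descending-sorted probabilities until the target mass is reached. A minimal sketch, where `top_k_for_mass` is an illustrative helper (not part of the site) and the probabilities are the top rows of the table above:

```python
def top_k_for_mass(probs, target=0.5):
    # probs: journal probabilities sorted in descending order;
    # returns the number of top entries whose cumulative mass
    # first reaches the target fraction
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return k
    return len(probs)

# Top probabilities from the table: 22.9% + 18.9% + 12.5% = 54.3% >= 50%
probs = [0.229, 0.189, 0.125, 0.044, 0.044, 0.040]
print(top_k_for_mass(probs))  # prints 3
```

This matches the marker in the table: the first three journals together already exceed half of the predicted probability mass.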