Lack of Consensus for Manual Mouse Sleep Scoring Limits Implementation of Automatic Deep Learning Models
Rose, L.; Zahid, A. N.; Ciudad, J. G.; Egebjerg, C.; Piilgaard, L.; Soerensen, F. L.; Andersen, M.; Radovanovic, T.; Tsopanidou, A.; Nedergaard, M.; Arthaud, S.; Maciel, R.; Peyron, C.; Berteotti, C.; Martiere, V. L.; Silvani, A.; Zoccoli, G.; Borsa, M.; Adamantidis, A.; Moerup, M.; Kornum, B. R.
Scientists have for decades attempted to automate manual sleep staging, not only for human polysomnography data but also for rodent data. No model has, however, succeeded in fully replacing the manual procedure across clinics and laboratories. We hypothesize that this is due to the models' limited ability to generalize to data from unseen laboratories. Our findings show that, despite the high performance of four state-of-the-art models reported in their initial publications, these models struggle to generalize to other laboratories. We further show a significant improvement in model performance across labs by re-training the models on a diverse dataset from five different sites. To assess the contribution of variability in manual scoring, ten experts from five laboratories all labelled the same nine mouse sleep recordings. The results revealed substantial scoring variability, particularly for rapid eye movement (REM) sleep, both within and between labs. In conclusion, our study demonstrates that key challenges in the generalizability of state-of-the-art sleep scoring models are signal variability and label noise. Our study highlights the need for a standardized set of mouse sleep scoring guidelines to enable consistency and collaboration across the field. Until such a consensus is reached, we present four sufficiently robust models trained on diverse datasets that can serve as standardized tools across labs.
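To illustrate the kind of inter-rater comparison described above, a minimal sketch is shown below. This is not the paper's analysis pipeline; it simply computes pairwise Cohen's kappa between scorers' per-epoch hypnograms, and all scorer names and label arrays are hypothetical placeholders.

```python
# Hypothetical sketch: quantifying inter-rater agreement on mouse sleep scoring.
# Assumes per-epoch labels ("Wake", "NREM", "REM") for each scorer on the same
# recording; the scorers and arrays below are illustrative, not the study's data.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Per-epoch hypnograms from three hypothetical scorers of one recording.
scorers = {
    "lab_A_scorer_1": np.array(["Wake", "Wake", "NREM", "NREM", "REM", "REM"]),
    "lab_A_scorer_2": np.array(["Wake", "NREM", "NREM", "NREM", "REM", "Wake"]),
    "lab_B_scorer_1": np.array(["Wake", "Wake", "NREM", "REM", "REM", "REM"]),
}

# Pairwise Cohen's kappa (chance-corrected agreement) across all scorer pairs.
for (name_a, labels_a), (name_b, labels_b) in combinations(scorers.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```

Stage-specific disagreement, such as the REM variability reported in the abstract, could be examined similarly by computing per-class agreement or confusion matrices restricted to the stage of interest.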