Automated Sleep Stage and Event Detection Algorithms Using Quality-Controlled PSG Annotations
Kaneda, M.; Ogaki, S.; Nohara, T.; Fujita, S.; Osako, N.; Yagi, T.; Tomita, Y.; Ogata, T.
Study Objectives: To develop machine-learning models for sleep stage classification, arousal detection, and respiratory event detection from overnight polysomnography, and to evaluate their performance relative to expert scorers.
Methods: Overnight polysomnography recordings were obtained from healthy participants and participants referred for suspected sleep-disordered breathing. Four certified scorers completed calibration sessions and generated reference annotations for sleep stages, arousals, and respiratory events. A subset of recordings was independently annotated by all scorers to support consensus analyses, enabling direct comparison between model outputs and human inter-scorer agreement. Gradient-boosted decision tree models were trained using hand-crafted features derived from standard physiological signals.
Results: Sleep stage classification achieved accuracy 0.840, Cohen's kappa 0.791, and F1-score 0.841, with limits of agreement for total sleep time of approximately ±0.5 h. Arousal detection achieved an F1-score of 0.733, with limits of agreement for the arousal index of approximately ±15 events/h. Respiratory event detection achieved an F1-score of 0.818, with limits of agreement for the apnea-hypopnea index also within approximately ±15 events/h. In consensus analyses, model performance was comparable to human inter-scorer agreement for sleep stages and arousals, while remaining below human inter-scorer agreement for respiratory events, despite high absolute performance relative to prior studies.
Conclusions: The proposed models achieved performance approaching human-level agreement across major sleep scoring tasks. These findings indicate that high consistency in expert annotations is a key factor underlying robust model performance and support the use of quality-controlled annotations for developing reliable automated sleep analysis systems.
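The agreement statistics reported above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names `cohens_kappa` and `limits_of_agreement` are hypothetical, and the sketch assumes the limits of agreement follow the usual Bland-Altman convention (mean difference ± 1.96 standard deviations), which the abstract does not state. The values in the usage note are made up for illustration and are not study data.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of epoch labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def limits_of_agreement(model_values, reference_values):
    """Return (lower, bias, upper), assuming Bland-Altman 95% limits."""
    if len(model_values) != len(reference_values) or len(model_values) < 2:
        raise ValueError("need at least two paired values of equal length")
    diffs = [m - r for m, r in zip(model_values, reference_values)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias - spread, bias, bias + spread
```

As a usage illustration, `cohens_kappa(["W", "N2", "N2", "R"], ["W", "N2", "N3", "R"])` compares two short, made-up hypnograms epoch by epoch, and `limits_of_agreement([7.0, 6.5, 8.0], [7.2, 6.4, 7.9])` does the same for made-up per-recording total sleep times in hours.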
Statement of Significance: Manual scoring of overnight sleep studies remains a major bottleneck in sleep medicine, limiting efficiency, consistency, and large-scale research. This study demonstrates that interpretable automated analysis can achieve performance approaching human-level agreement for core sleep scoring tasks when reference annotations are highly consistent. By directly comparing model outputs with calibrated inter-scorer agreement, the results show that annotation quality is a key determinant of attainable accuracy, rather than model complexity alone. Such systems may provide stable and reproducible reference outputs that support clinical decision making, scorer training, and standardization across centers. Important remaining challenges include validation across institutions and populations, robustness to real-world signal artifacts, and extension to clinically meaningful subtypes of respiratory events.