
Empirical Review of LLM-driven Classification of Multidimensional Sleep Health Mentions from Free-Text Clinical Notes

Hussain, S.-A.; Calloway, A.; Sirrianni, J.; Fosler-Lussier, E.; Davenport, M.

medRxiv preprint, 2025-06-05 (pediatrics). DOI: 10.1101/2025.06.04.25328983

Accurate multidimensional sleep health (MSH) information is often fragmented and inconsistently represented within hospital infrastructures, leaving crucial details buried in unstructured clinical notes rather than discrete fields. This inconsistency complicates large-scale phenotyping, secondary analyses, and clinical decision support regarding sleep-related outcomes. In this work, we systematically explore two contemporary natural language processing approaches, prompt-based large language models (LLMs) and fine-tuned discriminative classifiers, to bridge this critical gap. We evaluate performance on extracting nine key MSH dimensions (timing, duration, efficiency, sleep disorders, daytime sleepiness, interventions, medication, behavior, and satisfaction) from clinical narratives using public datasets (MIMIC-III derivatives) and an internally annotated pediatric sleep corpus. Initially, we assess generative LLM performance using dynamic few-shot prompting, analyzing the impact of varying prompt structures, example quantity, and domain specificity without explicit task-specific fine-tuning. Subsequently, we fine-tune generative LLM architectures on both in-task and out-of-task data to quantify performance improvements and limitations. Lastly, we benchmark these generative approaches against encoder-based discriminative classifiers (ModernBERT) designed to directly estimate the binary presence of each MSH class within full clinical notes. Our experiments demonstrate that, given adequate training data, fine-tuned discriminative models consistently provide higher classification accuracy, lower inference latency, and more robust span-level identification than either prompted or fine-tuned generative LLMs. Nonetheless, generative LLMs retain moderate utility in low-data scenarios. Importantly, our results highlight persistent challenges, including difficulty extracting subtle sleep constructs such as sleep efficiency and daytime sleepiness, and biases associated with patient demographics and clinical departments. We conclude by suggesting future research directions: refining span extraction methods, mitigating biases in model performance, and exploring advanced chain-of-thought prompting techniques to achieve reliable, scalable MSH phenotyping within real-world clinical systems.
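The abstract describes "dynamic few-shot prompting," where the in-context examples are chosen per query rather than fixed. A minimal sketch of that idea, assuming a retrieval-by-similarity scheme: here toy Jaccard token overlap stands in for whatever retriever the paper actually uses, and the function names, label strings, and prompt wording are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of dynamic few-shot prompt construction for MSH classification.
# ASSUMPTIONS: the similarity measure (Jaccard token overlap), the prompt
# template, and all names here are illustrative; the paper's actual
# retrieval method and prompt format may differ.

def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_prompt(query_note: str, labeled_pool: list[dict], k: int = 2) -> str:
    """Select the k annotated notes most similar to the query and format
    them as few-shot examples, ending with the unlabeled query note."""
    ranked = sorted(labeled_pool,
                    key=lambda ex: jaccard(query_note, ex["text"]),
                    reverse=True)
    lines = ["Label each note for the sleep-health dimensions present "
             "(timing, duration, efficiency, sleep disorders, daytime "
             "sleepiness, interventions, medication, behavior, satisfaction)."]
    for ex in ranked[:k]:
        lines.append(f"Note: {ex['text']}\n"
                     f"Labels: {', '.join(ex['labels']) or 'none'}")
    lines.append(f"Note: {query_note}\nLabels:")
    return "\n\n".join(lines)

# Usage with a toy annotated pool: the example closest to the query is
# selected as the single shot.
pool = [
    {"text": "Patient reports daytime sleepiness during school hours.",
     "labels": ["daytime sleepiness"]},
    {"text": "Melatonin started at bedtime.",
     "labels": ["medication"]},
]
prompt = build_prompt("Child has significant daytime sleepiness at school.",
                      pool, k=1)
```

Because the shots are re-retrieved for every note, the prompt adapts to each input without any task-specific fine-tuning, which is the property the abstract contrasts against the fine-tuned discriminative baseline.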

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.1% | 33.1%
2 | Scientific Reports | 3102 | Top 10% | 8.4%
3 | Proceedings of the National Academy of Sciences | 2130 | Top 11% | 6.3%
4 | Nature Medicine | 117 | Top 0.5% | 4.4%
(50% of probability mass above this line)
5 | PLOS Digital Health | 91 | Top 0.5% | 4.3%
6 | npj Digital Medicine | 97 | Top 1.0% | 4.3%
7 | BioData Mining | 15 | Top 0.1% | 3.6%
8 | Computers in Biology and Medicine | 120 | Top 2% | 1.9%
9 | Journal of Biomedical Informatics | 45 | Top 0.7% | 1.8%
10 | PLOS ONE | 4510 | Top 53% | 1.7%
11 | Genome Medicine | 154 | Top 4% | 1.7%
12 | Nature Communications | 4913 | Top 53% | 1.5%
13 | Nature Human Behaviour | 85 | Top 3% | 1.5%
14 | SLEEP | 28 | Top 0.3% | 1.3%
15 | Journal of the American Medical Informatics Association | 61 | Top 1% | 1.3%
16 | Translational Psychiatry | 219 | Top 3% | 1.2%
17 | Advanced Science | 249 | Top 16% | 0.9%
18 | Journal of Proteome Research | 215 | Top 2% | 0.9%
19 | eBioMedicine | 130 | Top 3% | 0.9%
20 | Physiological Measurement | 12 | Top 0.4% | 0.8%
21 | PLOS Computational Biology | 1633 | Top 23% | 0.8%
22 | Annals of Neurology | 57 | Top 2% | 0.7%
23 | Frontiers in Psychiatry | 83 | Top 4% | 0.6%
24 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.6%
25 | eClinicalMedicine | 55 | Top 2% | 0.6%
26 | Schizophrenia | 19 | Top 0.4% | 0.6%
27 | iScience | 1063 | Top 37% | 0.6%