Observation-process features are associated with larger domain shift in sepsis mortality prediction: a cross-database evaluation using MIMIC-IV and eICU-CRD

Yamamoto, R.; Wu, F.; Sprehe, L. K.; Abeer, A.; Celi, L. A.; Tohyama, T.

2026-04-06 intensive care and critical care medicine

10.64898/2026.04.05.26350209 medRxiv

Clinical prediction models for sepsis frequently degrade when applied outside the development setting. Electronic health record data encode not only patient physiology but also observation processes, such as measurement timing and frequency, which may be predictive within a site but unstable across sites. The contribution of these observation-process features to cross-site performance degradation has not been quantified. In this retrospective cohort study, we developed models for in-hospital mortality in adult intensive care unit (ICU) patients meeting Sepsis-3 criteria using the Medical Information Mart for Intensive Care IV (MIMIC-IV) (n = 30,218; 16.3% mortality) and externally validated them in the eICU Collaborative Research Database (eICU-CRD) (n = 31,403; 13.9% mortality). We compared seven prespecified model specifications representing physiologic summary strategies (a single aggregate severity score, most recent values, extreme values, and within-window variability), each evaluated with and without measurement counts as observation-process features. Models were fit using logistic regression and gradient-boosted trees. Internally, discrimination improved with more detailed physiologic summaries and measurement counts (logistic regression area under the receiver operating characteristic curve [AUROC] from 0.819 to 0.834). In external validation, performance drops were larger for specifications using more complex physiologic representations. Adding measurement counts was associated with larger domain shift (AUROC change in logistic regression, -0.047 without counts versus -0.082 with counts). External calibration deteriorated progressively, with logistic regression calibration slopes decreasing from 1.007 for the simplest specification to 0.417 for the most complex. Gradient-boosted trees showed smaller incremental degradation from measurement counts but still exhibited domain shift in complex specifications.
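To make the summary strategies above concrete, the following is a minimal sketch of how one variable's measurements within an observation window might be reduced to a most recent value, extreme values, within-window variability, and a measurement count. The abstract does not give the authors' exact feature definitions, so the function name `summarize_window`, the choice of sample standard deviation as the variability measure, and the example values are all assumptions for illustration.

```python
import statistics


def summarize_window(values):
    """Summarize one variable's measurements within an observation window.

    `values` is assumed to be in chronological order. All names and
    definitions here are hypothetical, not the study's actual pipeline.
    """
    if not values:
        # No measurement taken: physiologic summaries are missing,
        # but the count (an observation-process feature) is still defined.
        return {"last": None, "min": None, "max": None, "sd": None, "count": 0}
    return {
        "last": values[-1],              # most recent value
        "min": min(values),              # extreme values
        "max": max(values),
        # Within-window variability; set to 0.0 for a single measurement.
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
        "count": len(values),            # measurement count
    }


# Example with made-up measurements of a single laboratory variable.
features = summarize_window([1.0, 3.0, 2.0])
```

Note that the count depends on how often clinicians order a test rather than on the patient's physiology alone, which is why it may behave differently across sites.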
Inclusion of observation-process features in sepsis mortality prediction models was associated with improved internal discrimination but worse external calibration and transportability. These findings highlight that feature engineering decisions involve a tradeoff between internal performance and external generalizability, and that calibration assessment provides the most sensitive indicator of reduced transportability.
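The two metrics reported above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: AUROC is computed as the probability that a randomly chosen death receives a higher predicted risk than a randomly chosen survivor, and the calibration slope is taken as the coefficient obtained by regressing the observed outcome on the logit of the predicted probability (a common definition, assumed here). The function names and the toy data are hypothetical.

```python
import math


def auroc(y_true, y_score):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_slope(y_true, y_prob, n_iter=50, tol=1e-10):
    """Slope b from logistic regression of y on logit(p), fit by Newton-Raphson."""
    eps = 1e-12
    x = []
    for p in y_prob:
        p = min(max(p, eps), 1.0 - eps)  # guard against logit(0) or logit(1)
        x.append(math.log(p / (1.0 - p)))
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            r, w = yi - mu, mu * (1.0 - mu)
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # information matrix is singular; stop updating
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return b


# Toy data: predictions of 0.1 and 0.9 where observed mortality is 20% and 80%.
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
p_overconfident = [0.1] * 10 + [0.9] * 10
slope = calibration_slope(y, p_overconfident)  # below 1: predictions too extreme
```

A slope near 1 indicates that predicted risks are neither too extreme nor too moderate, while a slope well below 1, as in the toy example, indicates overly extreme predictions. Discrimination is unaffected by such miscalibration as long as the ranking of patients is preserved, so the two metrics can diverge.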
