
Falsification Testing of Sepsis Prediction Models: Evaluating Independent Biological Signal After Controlling for Care-Process Intensity

Dickens, A. R.

2026-03-18 · health informatics
medRxiv · DOI: 10.64898/2026.03.17.26348414

Background: Automated sepsis early-warning systems have attracted substantial research investment, yet a fundamental question remains unresolved: do these models detect independent biological signal, or do they predominantly learn care-process intensity, the pattern of clinician ordering behavior applied to patients already suspected of being ill? We report a pre-registered falsification study testing this hypothesis across four independent clinical datasets.

Methods: A four-phase falsification framework with pre-specified thresholds was registered on OSF (March 11, 2026) before any data access. The primary confirmatory analysis used MIMIC-IV v3.1 (n=65,241 adult ICU stays, Beth Israel Deaconess Medical Center, 2008-2022). Exploratory replication analyses used eICU-CRD v2.0 (n=136,864, 208 US hospitals), MIMIC-III v1.4 (n=44,091), and the PhysioNet/CinC 2019 Sepsis Challenge (n=40,314). Each phase tested a distinct falsification criterion: (1) concordance across Sepsis-2, Sepsis-3, and CMS SEP-1 definitions; (2) model performance degradation when care-intensity proxy features are removed; (3) predictive performance of care-intensity features alone; and (4) discriminability of synthetic records generated to match care-intensity distributions.

Results: The pre-registered primary analysis (MIMIC-IV) did not confirm the hypothesis (0/4 phases confirmed). Biological features predicted Sepsis-3 labels with AUROC 0.901 (95% CI 0.899-0.904); removing care-intensity features reduced AUROC by only 0.0027. The pre-specified Phase 3 threshold (care-only AUROC >0.70) was not met by the primary logistic regression model (AUROC 0.660), although a sensitivity XGBoost model did exceed it (AUROC 0.729), suggesting a nonlinear care-intensity signal. A clinically significant finding nonetheless emerged consistently across all four datasets: mean pairwise Jaccard similarity between clinical sepsis definitions and administrative coding (CMS SEP-1) was approximately 0.32 at the primary site and 0.20 across the multi-center cohorts, indicating that hospital quality metrics and regulatory reporting systematically measure a different patient population than clinical definitions identify. Exploratory analyses also revealed a detectable care-intensity signal in the eICU multi-center cohort (AUROC drop = 0.076) that was not present at the single academic center.

Conclusions: At an elite academic medical center, sepsis prediction models detect genuine biological signal; care-process leakage is not the primary driver of model performance in MIMIC-IV. The more consequential and robust finding is the systematic divergence between clinical and administrative sepsis definitions across all datasets examined, which has direct implications for regulatory reporting, pay-for-performance metrics, and the validity of AI benchmarks built on administrative data.
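Phase 1's concordance criterion, and the headline divergence finding, rest on pairwise Jaccard similarity between the patient cohorts each sepsis definition labels positive. A minimal sketch of that computation, assuming each definition reduces to a set of positively labeled stay IDs; the sets and values below are illustrative, not drawn from the study's data:

```python
# Pairwise Jaccard similarity between sepsis-definition cohorts.
# Illustrative stay-ID sets; not the study's actual labels.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; 1.0 means the two definitions flag identical cohorts."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical positive-label sets for three definitions over the same stays.
labels = {
    "Sepsis-2": {101, 102, 103, 105, 108},
    "Sepsis-3": {102, 103, 104, 108, 109},
    "SEP-1":    {103, 108, 110},
}

pairwise = {
    (d1, d2): jaccard(labels[d1], labels[d2])
    for d1, d2 in combinations(labels, 2)
}
for pair, score in pairwise.items():
    print(pair, round(score, 3))
print("mean pairwise Jaccard:", round(sum(pairwise.values()) / len(pairwise), 3))
```

Under this metric, a Jaccard of roughly 0.32 means that, of all stays flagged by either a clinical definition or SEP-1, only about a third are flagged by both, which is the divergence the Conclusions emphasize.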

Matching journals

The top 7 journals account for 50% of the predicted probability mass (see the sketch after the table).

| Rank | Journal | Papers in training set | Percentile | Probability |
|---|---|---|---|---|
| 1 | npj Digital Medicine | 97 | Top 0.4% | 14.2% |
| 2 | Journal of the American Medical Informatics Association | 61 | Top 0.2% | 12.4% |
| 3 | Critical Care Explorations | 15 | Top 0.1% | 6.7% |
| 4 | JMIR Medical Informatics | 17 | Top 0.1% | 6.3% |
| 5 | Journal of Medical Internet Research | 85 | Top 1.0% | 4.8% |
| 6 | Scientific Reports | 3102 | Top 25% | 4.8% |
| 7 | The Lancet Digital Health | 25 | Top 0.1% | 4.8% |
| | *50% of probability mass above this row* | | | |
| 8 | Nature Communications | 4913 | Top 35% | 4.3% |
| 9 | BMC Medicine | 163 | Top 1% | 3.8% |
| 10 | BMC Medical Informatics and Decision Making | 39 | Top 0.9% | 3.0% |
| 11 | PLOS ONE | 4510 | Top 43% | 2.9% |
| 12 | International Journal of Medical Informatics | 25 | Top 0.7% | 1.9% |
| 13 | JAMIA Open | 37 | Top 0.8% | 1.7% |
| 14 | Journal of Infection | 71 | Top 1% | 1.7% |
| 15 | PLOS Digital Health | 91 | Top 2% | 1.5% |
| 16 | Frontiers in Medicine | 113 | Top 5% | 1.2% |
| 17 | Critical Care | 14 | Top 0.5% | 0.9% |
| 18 | BMJ Health & Care Informatics | 13 | Top 0.7% | 0.9% |
| 19 | BMC Medical Research Methodology | 43 | Top 1% | 0.9% |
| 20 | JAMA Network Open | 127 | Top 4% | 0.9% |
| 21 | BMC Infectious Diseases | 118 | Top 5% | 0.9% |
| 22 | BMJ Open | 554 | Top 12% | 0.8% |
| 23 | Annals of Internal Medicine | 27 | Top 0.9% | 0.8% |
| 24 | European Respiratory Journal | 54 | Top 2% | 0.7% |
| 25 | Journal of Biomedical Informatics | 45 | Top 1% | 0.7% |
| 26 | JCO Clinical Cancer Informatics | 18 | Top 0.9% | 0.7% |
| 27 | Clinical Chemistry | 22 | Top 0.9% | 0.7% |
| 28 | PLOS Computational Biology | 1633 | Top 25% | 0.7% |
| 29 | Patterns | 70 | Top 3% | 0.7% |
| 30 | Annals of Neurology | 57 | Top 2% | 0.7% |
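The 50%-of-probability-mass note above reads as a simple cumulative-sum cutoff: count how many top-ranked journals are needed before their probabilities sum past 50%. A minimal sketch under that assumed rule, using the probabilities transcribed from the table (the tool's actual implementation is not documented here):

```python
# Cumulative-sum cutoff for "top-k journals cover 50% of probability mass".
# Probabilities are the table's values in percent (top ten shown for brevity).
probs = [14.2, 12.4, 6.7, 6.3, 4.8, 4.8, 4.8, 4.3, 3.8, 3.0]

cumulative = 0.0
for k, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break
print(k, round(cumulative, 1))  # -> 7 54.0: the top 7 journals first exceed 50%
```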