Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU

Patel, K.; Beedala, P.

2026-05-05 health informatics
10.64898/2026.05.03.26352335 medRxiv

Background: Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment, a gap that undermines clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is a prerequisite for threshold-based decisions yet remains underreported.

Methods: We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines.

Results: The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832-0.860) and external AUROC 0.819 (95% CI: 0.815-0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept -0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001).

Conclusions: ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.
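
To make the calibration terms in the abstract concrete, the following is a minimal sketch of expected calibration error (ECE) and an intercept-only logistic recalibration. It is not the authors' implementation: the abstract does not state the binning scheme or fitting procedure, so equal-width bins and a bisection solver are assumptions, the function names are hypothetical, and the data are synthetic (predictions deliberately shifted upward on the logit scale to mimic risk overestimation).

```python
import math
import random


def logit(p, eps=1e-12):
    # Clip to avoid infinities at exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def expected_calibration_error(y_true, p_pred, n_bins=10):
    # Equal-width bins (an assumption): ECE is the sample-weighted mean of
    # |observed event rate - mean predicted probability| over non-empty bins.
    sums = [[0.0, 0.0] for _ in range(n_bins)]  # [sum of labels, sum of predictions]
    for y, p in zip(y_true, p_pred):
        b = min(int(p * n_bins), n_bins - 1)
        sums[b][0] += y
        sums[b][1] += p
    return sum(abs(sy - sp) for sy, sp in sums) / len(y_true)


def fit_intercept_shift(y_true, p_pred, lo=-20.0, hi=20.0, tol=1e-10):
    # Intercept-only recalibration: logistic model with logit(p) as a fixed
    # offset and a single free intercept. Its maximum-likelihood solution makes
    # the mean recalibrated probability equal the observed event rate, and the
    # mean is monotone in the intercept, so bisection suffices.
    logits = [logit(p) for p in p_pred]
    target = sum(y_true)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(sigmoid(l + mid) for l in logits) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


# Synthetic demonstration only; these are not MIMIC-IV or eICU data.
rng = random.Random(0)
true_logits = [rng.gauss(-2.5, 1.0) for _ in range(5000)]
y = [1 if rng.random() < sigmoid(z) else 0 for z in true_logits]
p_over = [sigmoid(z + 0.7) for z in true_logits]  # systematic overestimation

delta = fit_intercept_shift(y, p_over)
p_recal = [sigmoid(logit(p) + delta) for p in p_over]

ece_before = expected_calibration_error(y, p_over)
ece_after = expected_calibration_error(y, p_recal)
print(f"intercept shift: {delta:.3f}")
print(f"ECE before: {ece_before:.4f}, after: {ece_after:.4f}")
```

A negative fitted shift corresponds to the overestimation pattern the abstract reports: the predictions are too high on average for the new population, so every logit is moved down by the same amount. Because only the intercept changes, the ranking of patients, and therefore AUROC, is unaffected.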

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | JMIR Medical Informatics | 17 | Top 0.1% | 9.8%
2 | Scientific Reports | 3102 | Top 20% | 6.2%
3 | The Lancet Digital Health | 25 | Top 0.1% | 6.2%
4 | Journal of Medical Internet Research | 85 | Top 1.0% | 4.7%
5 | Critical Care Explorations | 15 | Top 0.1% | 4.7%
6 | PLOS ONE | 4510 | Top 32% | 4.7%
7 | npj Digital Medicine | 97 | Top 1% | 4.7%
8 | BMC Medical Research Methodology | 43 | Top 0.2% | 4.2%
9 | BMC Medical Informatics and Decision Making | 39 | Top 0.7% | 3.9%
10 | Journal of the American Medical Informatics Association | 61 | Top 0.8% | 3.5%
(50% of probability mass above this line)
11 | JAMA Network Open | 127 | Top 1% | 3.5%
12 | International Journal of Medical Informatics | 25 | Top 0.4% | 3.5%
13 | BMJ Open | 554 | Top 6% | 3.5%
14 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 3.0%
15 | JCO Clinical Cancer Informatics | 18 | Top 0.5% | 1.7%
16 | Journal of Infection | 71 | Top 1% | 1.7%
17 | Nature Communications | 4913 | Top 53% | 1.7%
18 | European Respiratory Journal | 54 | Top 1% | 1.4%
19 | PLOS Digital Health | 91 | Top 2% | 1.4%
20 | BMJ Health & Care Informatics | 13 | Top 0.6% | 1.3%
21 | BMC Medicine | 163 | Top 5% | 1.2%
22 | Annals of Internal Medicine | 27 | Top 0.7% | 1.1%
23 | Frontiers in Medicine | 113 | Top 5% | 0.9%
24 | eClinicalMedicine | 55 | Top 1% | 0.9%
25 | Frontiers in Digital Health | 20 | Top 1% | 0.9%
26 | Critical Care | 14 | Top 0.5% | 0.9%
27 | BMJ | 49 | Top 1% | 0.9%
28 | European Heart Journal - Digital Health | 15 | Top 0.6% | 0.8%
29 | JMIR Public Health and Surveillance | 45 | Top 4% | 0.7%
30 | International Journal of Environmental Research and Public Health | 124 | Top 7% | 0.7%