
Predictors of COVID-19 hospital outcomes: a machine learning analysis of the National COVID Cohort Collaborative

Vazquez, J.; Taylor, L.; Chen, Y.-Y. K.; Araya, K.; Farnsworth, M. G.; Xue, X.; Hasan, M.; N3C Consortium

2026-03-09 health informatics
10.64898/2026.03.06.26347822 medRxiv

Predicting hospital outcomes for patients with severe acute respiratory infections is critical for risk stratification and resource planning, yet heterogeneous electronic health record (EHR) data, class imbalance, and evolving clinical practice present persistent methodological challenges for machine learning (ML) approaches. We conducted a retrospective cohort study using EHR data harmonized to the OMOP common data model from the National COVID Cohort Collaborative (N3C; May 2020 to June 2025), including 263,619 adults hospitalized with COVID-19 across 51 contributing sites. We developed penalized linear regression (elastic net), random forest, XGBoost, and multilayer perceptron (MLP) models to predict hospital length of stay (LOS) and mortality (in-hospital and 60-day), using demographics, comorbidities, prior healthcare utilization, COVID-19 vaccination status, and hospital site as predictors. Missing data were handled via multiple imputation by chained equations (MICE), and class imbalance was addressed using SMOTE. Model performance was evaluated using area under the ROC curve (AUROC), Brier score, calibration plots, and decision curve analysis, following the TRIPOD reporting framework. Mortality prediction achieved moderate discrimination across all models (test AUROC = 0.71-0.73 for in-hospital mortality; 0.72-0.73 for 60-day all-cause mortality). Models trained without SMOTE achieved the highest AUROCs but assigned virtually no patients to the mortality class at the default 0.5 threshold. SMOTE improved recall and F1 score at the cost of reduced AUROC and precision. LOS was poorly explained by available structured predictors (best R² = 0.059). Remdesivir-treated patients (n = 103,536; 39.3%) were older, had higher comorbidity burden, and had higher unadjusted mortality than untreated patients. Common structured EHR features offer moderate utility for mortality risk stratification in hospitalized COVID-19 patients but are insufficient for LOS prediction. The consistent SMOTE-related tradeoff between discrimination and calibration underscores the need to report threshold-dependent metrics alongside AUROC in clinical ML studies, with implications for operational planning during future respiratory disease emergencies.
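
The abstract's point that a model can reach a reasonable AUROC yet flag almost no patients at the default 0.5 threshold can be illustrated with a small sketch. This is not the authors' pipeline: the data below are synthetic, and the prevalence, score ranges, and helper names (auroc, recall_at) are assumptions chosen only for illustration.

```python
import random


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as 0.5 (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, scores, threshold):
    """Fraction of true positives whose score reaches the threshold."""
    true_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return true_pos / sum(y_true)


rng = random.Random(0)

# Synthetic, imbalanced outcome: roughly 10% positives (an assumed prevalence).
y = [1 if rng.random() < 0.10 else 0 for _ in range(2000)]

# Hypothetical predicted risks: positives score higher on average, but no
# score exceeds 0.40, as often happens when the outcome is rare.
scores = [rng.uniform(0.10, 0.40) if label else rng.uniform(0.00, 0.30) for label in y]

auc = auroc(y, scores)
recall_default = recall_at(y, scores, 0.5)   # no score reaches 0.5
recall_lowered = recall_at(y, scores, 0.2)   # a lower operating point

print(f"AUROC: {auc:.3f}")
print(f"Recall at 0.5: {recall_default:.3f}")
print(f"Recall at 0.2: {recall_lowered:.3f}")
```

Because AUROC depends only on how scores are ranked, it is unaffected by where the threshold sits, while recall at 0.5 is zero here by construction. This is why threshold-dependent metrics are worth reporting alongside AUROC.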

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.4% | 14.3%
2. Scientific Reports | 3102 papers in training set | Top 4% | 12.4%
3. The Lancet Digital Health | 25 papers in training set | Top 0.1% | 10.0%
4. International Journal of Medical Informatics | 25 papers in training set | Top 0.1% | 9.1%
5. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.4% | 7.1%
(50% of probability mass above)
6. Journal of Medical Internet Research | 85 papers in training set | Top 1.0% | 4.8%
7. European Respiratory Journal | 54 papers in training set | Top 0.4% | 3.9%
8. Nature Communications | 4913 papers in training set | Top 38% | 3.8%
9. JMIR Medical Informatics | 17 papers in training set | Top 0.3% | 3.6%
10. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.3% | 3.1%
11. PLOS ONE | 4510 papers in training set | Top 48% | 2.1%
12. Patterns | 70 papers in training set | Top 0.8% | 1.8%
13. PLOS Digital Health | 91 papers in training set | Top 1% | 1.8%
14. Med | 38 papers in training set | Top 0.3% | 1.7%
15. Journal of Biomedical Informatics | 45 papers in training set | Top 1% | 1.1%
16. Frontiers in Digital Health | 20 papers in training set | Top 1.0% | 1.1%
17. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 0.9%
18. JAMIA Open | 37 papers in training set | Top 1% | 0.9%
19. Annals of Internal Medicine | 27 papers in training set | Top 1.0% | 0.7%
20. eBioMedicine | 130 papers in training set | Top 4% | 0.7%
21. JMIR Public Health and Surveillance | 45 papers in training set | Top 4% | 0.7%
22. eLife | 5422 papers in training set | Top 60% | 0.7%
23. Science Advances | 1098 papers in training set | Top 33% | 0.6%
24. BMC Medical Research Methodology | 43 papers in training set | Top 2% | 0.6%