Clinician-Informed Feature Engineering Improves Machine Learning Assignment of Molecular Endotypes in the Intensive Care Unit

Sines, B. J.; Hagan, R. S.; Jiang, X.; Pavlechko, E.; McClain, S.; Hunt, X.; Florou-Moreno, J.; Acquardo, J.; Risa, G.; Valsaraj, V.; Schisler, J. C.; Wolfgang, M. C.

2026-04-07 intensive care and critical care medicine
10.64898/2026.04.06.26350248 medRxiv

Objective: To develop a workflow that transforms electronic health record data into machine learning-ready features for molecular endotype assignment and to evaluate whether clinician-informed feature engineering improves model performance and interpretability.

Materials and Methods: We developed parallel clinician-informed and clinician-agnostic feature engineering pipelines to prepare raw EHR data from mechanically ventilated patients with respiratory failure. Molecular endotype labels derived from paired deep lung and blood profiling of subjects with acute lung injury were used to train candidate machine learning classifiers. Champion models from each pipeline were compared on predefined performance metrics.

Results: Bayesian network classifiers were the top-performing models in both pipelines. The clinician-informed pipeline generated fewer features than the clinician-agnostic pipeline (645 vs 1,127) and produced a lower misclassification rate in the final Bayesian network model (0.047 vs 0.14). In an independent cohort of subjects with acute lung injury, the clinician-informed model better distinguished corticosteroid-responsive from non-responsive subgroups.

Discussion: Clinical context improved feature engineering efficiency, model interpretability, and classification performance. These findings support the integration of domain expertise into machine learning workflows intended for critical care implementation.

Conclusions: Clinician-informed feature engineering can simplify machine learning models while improving performance and preserving clinical relevance. AI tools developed for healthcare should incorporate subject matter expertise early in the feature engineering and analytic workflow.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.1%): 22.4%
2. Critical Care Explorations (15 papers in training set; Top 0.1%): 12.3%
3. BMC Medical Informatics and Decision Making (39 papers in training set; Top 0.5%): 6.3%
4. Scientific Reports (3102 papers in training set; Top 18%): 6.3%
5. Clinical Chemistry (22 papers in training set; Top 0.1%): 4.8%

50% of probability mass above

6. npj Digital Medicine (97 papers in training set; Top 1%): 4.1%
7. JAMIA Open (37 papers in training set; Top 0.4%): 3.8%
8. Journal of Biomedical Informatics (45 papers in training set; Top 0.5%): 3.6%
9. PLOS ONE (4510 papers in training set; Top 40%): 3.6%
10. JAMA Network Open (127 papers in training set; Top 1%): 2.7%
11. PLOS Digital Health (91 papers in training set; Top 0.9%): 2.7%
12. Biology Methods and Protocols (53 papers in training set; Top 1.0%): 1.7%
13. Journal of General Internal Medicine (20 papers in training set; Top 0.5%): 1.7%
14. JMIR Public Health and Surveillance (45 papers in training set; Top 2%): 1.3%
15. eBioMedicine (130 papers in training set; Top 2%): 1.3%
16. European Respiratory Journal (54 papers in training set; Top 1%): 1.3%
17. Frontiers in Physiology (93 papers in training set; Top 4%): 1.2%
18. JMIR Medical Informatics (17 papers in training set; Top 1%): 1.2%
19. PLOS Computational Biology (1633 papers in training set; Top 21%): 0.9%
20. Bioinformatics (1061 papers in training set; Top 8%): 0.9%
21. American Journal of Respiratory Cell and Molecular Biology (38 papers in training set; Top 0.6%): 0.9%
22. BMC Medicine (163 papers in training set; Top 6%): 0.8%
23. Wellcome Open Research (57 papers in training set; Top 2%): 0.7%
24. Heliyon (146 papers in training set; Top 7%): 0.7%
25. Physiological Measurement (12 papers in training set; Top 0.4%): 0.7%
26. Frontiers in Medicine (113 papers in training set; Top 8%): 0.6%