Machine Learning Analysis of Electronic Health Records Identifies Interstitial Lung Disease and Predicts Mortality in Patients with Systemic Sclerosis

Peltekian, A. K.; Grudzinski, K. M.; Bemiss, B. C.; Dematte, J. E.; Richardson, C.; Markov, N. S.; Carns, M.; Field, N. S.; Zhu, M.; Soriano, A.; Dapas, M.; Perlman, H.; Gundersheimer, A.; Selvan, K. C.; Moore, D. F.; Rasmussen, L. V.; Varga, J.; Hinchcliff, M.; Warrior, K.; Gao, C. A.; Wunderink, R. G.; Budinger, G. S.; Choudhary, A. N.; Misharin, A. V.; Agrawal, A.; Esposito, A. J.

2025-06-04 respiratory medicine
10.1101/2025.06.02.25328786 medRxiv

Background: Interstitial lung disease (ILD) affects >40% of patients with systemic sclerosis (SSc/scleroderma) and is the leading cause of disease-related mortality. Although therapies may slow progression, outcomes remain poor, partly because ILD is often detected after irreversible lung injury has occurred. Chest computed tomography (CT) is a sensitive tool for ILD detection and is recommended at SSc diagnosis, but it is often not performed and is even less often performed serially. We sought to develop tools to predict ILD and mortality in patients with SSc using data routinely available in the electronic health record (EHR) to inform medical decision-making.

Methods: We analyzed longitudinal EHR data from two SSc cohorts: Northwestern University (1,169 participants; derivation cohort) and Yale University (376 participants; validation cohort). We identified clinical features from existing cohort-linked EHR queries, which compose a convenience sample of data from participants spanning decades rather than a single unified data collection effort. Three ILD experts independently reviewed CT reports and classified each as having or lacking ILD. To explore the structure of the derivation cohort data, patients with >=3 forced vital capacity (FVC) results available were identified and stratified according to prevalent or absent ILD. Using unsupervised trajectory-based clustering in exploratory analyses, we determined standardized patterns across groups. Machine learning (ML) models were then developed using clinical EHR data as predictor variables and prevalent ILD and all-cause mortality as outcome variables. Model performance was assessed using area under the receiver operating characteristic curve (AUC).

Results: Seventy-four clinical features with low missingness, including demographic, vital sign, laboratory, and pulmonary function test (PFT) data, were used for analyses. Four robust PFT trajectory clusters were identified that were associated with ILD prevalence and mortality in exploratory analyses. An ML model for ILD detection achieved an AUC of 0.832 and retained performance in the Yale cohort (AUC 0.754). In addition to established predictors such as autoantibodies and pulmonary function, the model identified routine laboratory measurements, including red cell distribution width (RDW), white blood cell count, and serum chloride, as important contributors. One-year mortality prediction achieved AUCs of 0.904 in the Northwestern cohort and 0.910 in the Yale cohort. Among patients with SSc-ILD, one-year mortality was predicted with AUCs of 0.744 and 0.902 in the Northwestern and Yale cohorts, respectively. Unexpectedly, we found that subtle laboratory abnormalities (such as change in RDW) contributed to predicting mortality.

Conclusions: Our prediction models, built from widely available EHR data, are useful tools to identify SSc patients at high risk for prevalent ILD and all-cause mortality. Integration of these models into clinical practice could enable scalable risk stratification and inform individualized ILD screening and monitoring strategies for SSc patients.
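
For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how it can be computed from binary outcomes and model scores. It uses the pairwise-ranking definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half). The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the authors' code or data.

```python
def roc_auc(labels, scores):
    """Pairwise-ranking AUC: fraction of (positive, negative) pairs
    in which the positive case has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic toy example (hypothetical values, not study data):
# 1 = outcome present, 0 = outcome absent.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

In this toy example, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.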

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. European Respiratory Journal (54 papers in training set; Top 0.1%): 28.4%
2. American Journal of Respiratory and Critical Care Medicine (39 papers in training set; Top 0.1%): 23.1%
(50% of probability mass above)
3. BMJ Open Respiratory Research (32 papers in training set; Top 0.1%): 5.0%
4. Arthritis & Rheumatology (33 papers in training set; Top 0.1%): 5.0%
5. Respiratory Research (19 papers in training set; Top 0.1%): 4.5%
6. Thorax (32 papers in training set; Top 0.2%): 3.7%
7. Scientific Reports (3102 papers in training set; Top 45%): 2.7%
8. American Journal of Respiratory Cell and Molecular Biology (38 papers in training set; Top 0.3%): 2.7%
9. ERJ Open Research (44 papers in training set; Top 0.3%): 2.4%
10. Leukemia (39 papers in training set; Top 0.4%): 1.9%
11. Clinical Immunology (21 papers in training set; Top 0.2%): 1.7%
12. Frontiers in Immunology (586 papers in training set; Top 4%): 1.7%
13. Annals of Clinical and Translational Neurology (29 papers in training set; Top 0.6%): 1.7%
14. JCI Insight (241 papers in training set; Top 3%): 1.7%
15. Journal of Allergy and Clinical Immunology (25 papers in training set; Top 0.4%): 1.5%
16. Frontiers in Medicine (113 papers in training set; Top 5%): 1.3%
17. The American Journal of Pathology (31 papers in training set; Top 0.5%): 0.7%
18. Journal of Translational Medicine (46 papers in training set; Top 3%): 0.7%
19. eLife (5422 papers in training set; Top 58%): 0.7%
20. Human Molecular Genetics (130 papers in training set; Top 4%): 0.7%
21. Metabolites (50 papers in training set; Top 2%): 0.5%
22. Critical Care Explorations (15 papers in training set; Top 0.6%): 0.5%
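
The statement that the top 2 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the listed percentages. The sketch below is only an illustration of that arithmetic; the helper name `journals_to_cover` is hypothetical, and the values are copied from the list above.

```python
def journals_to_cover(probabilities, threshold):
    """Return (count, cumulative mass) for the fewest top-ranked entries
    whose summed probability reaches the threshold, or None if never reached."""
    total = 0.0
    for count, p in enumerate(sorted(probabilities, reverse=True), start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Predicted probabilities (in percent) for the 22 listed journals.
listed = [28.4, 23.1, 5.0, 5.0, 4.5, 3.7, 2.7, 2.7, 2.4, 1.9, 1.7,
          1.7, 1.7, 1.7, 1.5, 1.3, 0.7, 0.7, 0.7, 0.7, 0.5, 0.5]

count, mass = journals_to_cover(listed, 50.0)
```

Here the first two entries sum to 28.4% + 23.1% = 51.5%, which crosses the 50% threshold, consistent with the summary line above.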