Data Heterogeneity in Federated Learning with Electronic Health Records: Case Studies of Risk Prediction for Acute Kidney Injury and Sepsis Diseases in Critical Care

Rajendran, S.; Xu, Z.; Pan, W.; Ghosh, A.; Wang, F.

2022-09-01 health informatics
10.1101/2022.08.30.22279382 medRxiv

With the wider availability of healthcare data such as Electronic Health Records (EHR), a growing number of data-driven approaches have been proposed to improve the quality of care delivery. Predictive modeling, which aims at building computational models for predicting clinical risk, is a popular research topic in healthcare analytics. However, concerns about the privacy of healthcare data may hinder the development of effective, generalizable predictive models, because such models often require rich, diverse data from multiple clinical institutions. Recently, federated learning (FL) has demonstrated promise in addressing this concern. However, data heterogeneity across the participating local sites may affect prediction performance. Exploring such heterogeneity of data sources would aid in building accurate risk prediction models in FL. Because acute kidney injury (AKI) and sepsis are highly prevalent among patients admitted to intensive care units (ICU), the early AI-based prediction of these conditions is an important topic in critical care medicine. In this study, we take AKI and sepsis onset risk prediction in the ICU as two examples to explore the impact of data heterogeneity in the FL framework for risk prediction using EHR data across multiple hospitals. In particular, we built predictive models based on local, pooled, and FL frameworks. The local framework only used data from each site itself. The pooled framework combined data from all sites. In the FL framework, each local site did not have access to other sites' data. A model was trained locally and its parameters were shared with a central aggregator, which used them to update the federated model's weights; the updated weights were then shared with each site. We found that models built within an FL framework outperformed their local counterparts. Then, we analyzed variable importance discrepancies across sites and frameworks. Finally, we explored potential sources of the heterogeneity within the EHR data. The different distributions of demographic profiles, medication use, and site information contributed to data heterogeneity.

Author Summary

The availability of a large amount of healthcare data such as Electronic Health Records (EHR) and advances in artificial intelligence (AI) techniques provide opportunities to build predictive models for disease risk prediction. Due to the sensitive nature of healthcare data, it is challenging to collect data from different hospitals and train a unified model on the combined data. Federated learning (FL) has recently demonstrated promise in handling fragmented healthcare data sources while preserving privacy. However, data heterogeneity in the FL framework may influence prediction performance. Exploring the heterogeneity of data sources would contribute to building accurate disease risk prediction models in FL. In this study, we take acute kidney injury (AKI) and sepsis prediction in intensive care units (ICU) as two examples to explore the effects of data heterogeneity in the FL framework for disease risk prediction using EHR data across multiple hospital sites. In particular, multiple predictive models were built based on local, pooled, and FL frameworks. The local framework only used data from each site itself. The pooled framework combined data from all sites. In the FL framework, each local site did not have access to other sites' data. We found that models built within an FL framework outperformed their local counterparts. Then, we analyzed variable importance discrepancies across sites and frameworks. Finally, we explored potential sources of the heterogeneity within EHR data. The different distributions of demographic profiles, medication use, and site information, such as the type of ICU at admission, contributed to data heterogeneity.
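
The FL round described in the abstract (local training at each site, parameters sent to a central aggregator, updated weights sent back) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the logistic-regression model, the sample-size-weighted averaging rule (in the style of federated averaging), the synthetic two-site data, and the helper names `local_update`, `aggregate`, `federated_train`, and `make_site`.

```python
import math
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Train a logistic regression on one site's data, starting from the
    current global weights. Only the resulting weights leave the site."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                w[j] -= lr * (p - y) * xj
    return w


def aggregate(site_weights, site_sizes):
    """Central aggregator: average site parameters, weighted by the number
    of samples at each site (an assumed aggregation rule)."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]


def federated_train(sites, dim, rounds=10):
    """Repeat: broadcast global weights, train locally, aggregate."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        local = [local_update(global_w, data) for data in sites]
        global_w = aggregate(local, [len(d) for d in sites])
    return global_w


def make_site(rng, n, shift):
    """Synthetic site: one feature whose mean differs by site, standing in
    for heterogeneity between hospitals. Each row is ([bias, feature], label)."""
    data = []
    for _ in range(n):
        f = rng.gauss(shift, 1.0)
        y = 1 if f + rng.gauss(0.0, 0.5) > 0 else 0
        data.append(([1.0, f], y))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    sites = [make_site(rng, 50, 0.0), make_site(rng, 150, 1.0)]
    print("federated weights:", federated_train(sites, dim=2))
    print("site-only weights:", [local_update([0.0, 0.0], d) for d in sites])
```

Comparing the federated weights with the site-only weights gives a rough feel for how differing site distributions pull the local models apart, which is the kind of discrepancy the study examines with real EHR data.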

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.1%, 37.1%
2. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.1%, 17.9%
50% of probability mass above
3. Journal of Biomedical Informatics: 45 papers in training set, Top 0.2%, 7.0%
4. International Journal of Medical Informatics: 25 papers in training set, Top 0.2%, 4.8%
5. JMIR Medical Informatics: 17 papers in training set, Top 0.3%, 3.9%
6. JAMIA Open: 37 papers in training set, Top 0.4%, 3.5%
7. npj Digital Medicine: 97 papers in training set, Top 1%, 3.5%
8. PLOS Digital Health: 91 papers in training set, Top 0.8%, 3.2%
9. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 2.4%
10. Patterns: 70 papers in training set, Top 1.0%, 1.7%
11. Scientific Reports: 3102 papers in training set, Top 65%, 1.3%
12. BMJ Health & Care Informatics: 13 papers in training set, Top 0.6%, 1.2%
13. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.5%, 0.9%
14. Bioinformatics: 1061 papers in training set, Top 9%, 0.9%
15. Frontiers in Artificial Intelligence: 18 papers in training set, Top 0.6%, 0.9%
16. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.7%
17. PLOS ONE: 4510 papers in training set, Top 69%, 0.7%
18. Frontiers in Digital Health: 20 papers in training set, Top 1%, 0.7%
19. Biology Methods and Protocols: 53 papers in training set, Top 3%, 0.7%
20. Heliyon: 146 papers in training set, Top 8%, 0.6%
21. GigaScience: 172 papers in training set, Top 4%, 0.6%