
An efficient distributed algorithm with application to COVID-19 data from heterogeneous clinical sites

Tong, J.; Luo, C.; Islam, M. N.; Sheils, N.; Buresh, J.; Edmondson, M.; Merkel, P. A.; Lautenbach, E.; Duan, R.; Chen, Y.

2020-11-18 epidemiology
10.1101/2020.11.17.20220681 medRxiv
Objectives: Integrating electronic health records (EHR) data from several clinical sites offers great opportunities to improve estimation with a more general population compared to analyses based on a single clinical site. However, sharing patient-level data across sites is practically challenging due to concerns about maintaining patient privacy. The objective of this study is to develop a novel distributed algorithm to integrate heterogeneous EHR data from multiple clinical sites without sharing patient-level data.

Materials and Methods: The proposed distributed algorithm for binary regression can effectively account for between-site heterogeneity and is communication-efficient. Our method is built on a pairwise likelihood function in the extended Mantel-Haenszel regression, which is known to be statistically highly efficient. We construct a surrogate pairwise likelihood function through approximating the target pairwise likelihood by its surrogate. We show that the proposed surrogate pairwise likelihood leads to a consistent and asymptotically normal estimator by effective communication without sharing individual patient-level data. We study the empirical performance of the proposed method through a systematic simulation study and an application with data of 14,215 COVID-19 patients from 230 clinical sites at UnitedHealth Group Clinical Research Database.

Results: The proposed method was shown to perform close to the gold standard approach under extensive simulation settings. When the event rate is <5%, the relative bias of the proposed estimator is 30% smaller than that of the meta-analysis estimator. The proposed method retained high accuracy across different sample sizes and event rates compared with meta-analysis. In the data evaluation, the proposed estimate has a relative bias <9% when the event rate is <1%, whereas the meta-analysis estimate has a relative bias at least 10% higher than that of the proposed method.

Conclusions: Our simulation study and data application demonstrate that the proposed distributed algorithm provides an estimator that is robust to heterogeneity in event rates when effectively integrating data from multiple clinical sites. Our algorithm is therefore an effective alternative to both meta-analysis and existing distributed algorithms for modeling heterogeneous multi-site binary outcomes.
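
The abstract builds on a pairwise likelihood for logistic regression with site-specific intercepts. As a rough illustration of why such a likelihood is insensitive to heterogeneous event rates, the sketch below fits one on pooled toy data. It is not the authors' distributed algorithm and involves no surrogate likelihood or communication step. It assumes a single scalar covariate, and every function name, the data layout, and the simulation settings are hypothetical. The key fact it relies on is standard: for a case i and a control j in the same site, the probability that i is the case given that exactly one of the two is equals expit((x_i - x_j) * beta), so the site intercept cancels.

```python
import math
import random


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_expit(z):
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def case_control_diffs(sites):
    """Return x_case - x_control for every within-site case/control pair.

    sites: list of (x, y) tuples, one per site (hypothetical layout),
    where x is a list of floats and y a list of 0/1 outcomes.
    """
    diffs = []
    for x, y in sites:
        cases = [xi for xi, yi in zip(x, y) if yi == 1]
        controls = [xi for xi, yi in zip(x, y) if yi == 0]
        diffs.extend(xc - xn for xc in cases for xn in controls)
    return diffs


def pairwise_loglik(beta, diffs):
    """Pairwise conditional log-likelihood; site intercepts have cancelled."""
    return sum(log_expit(beta * d) for d in diffs)


def score_and_info(beta, diffs):
    """First derivative and negative second derivative with respect to beta."""
    score = info = 0.0
    for d in diffs:
        p = expit(beta * d)
        score += d * (1.0 - p)
        info += d * d * p * (1.0 - p)
    return score, info


def fit_pairwise(sites, max_iter=50, tol=1e-10):
    """Maximize the pairwise log-likelihood by Newton's method, from beta = 0."""
    diffs = case_control_diffs(sites)
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_info(beta, diffs)
        if info == 0.0:  # no informative pairs (e.g. a site set with no cases)
            break
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


def simulate_sites(beta, intercepts, n_per_site, seed=0):
    """Toy data: one site per intercept, so event rates differ across sites."""
    rng = random.Random(seed)
    sites = []
    for a in intercepts:
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_site)]
        y = [1 if rng.random() < expit(a + beta * xi) else 0 for xi in x]
        sites.append((x, y))
    return sites
```

A call such as `fit_pairwise(simulate_sites(1.0, [-3.0, -2.0, -1.0, 0.0, 1.0], 200))` estimates the shared slope without estimating any of the five intercepts. Enumerating all pairs costs cases times controls per site, which is acceptable for a toy example but would need a smarter implementation at scale.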

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Statistics in Medicine | 34 | Top 0.1% | 22.3% |
| 2 | BMC Medical Research Methodology | 43 | Top 0.1% | 14.2% |
| 3 | PLOS ONE | 4510 | Top 25% | 6.8% |
| 4 | Journal of Biomedical Informatics | 45 | Top 0.3% | 4.8% |
| 5 | Scientific Reports | 3102 | Top 28% | 4.3% |

50% of probability mass above

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 6 | BMC Medical Informatics and Decision Making | 39 | Top 0.8% | 3.6% |
| 7 | Bioinformatics | 1061 | Top 6% | 3.0% |
| 8 | Epidemiology | 26 | Top 0.2% | 3.0% |
| 9 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.6% | 3.0% |
| 10 | Journal of the American Medical Informatics Association | 61 | Top 0.9% | 2.6% |
| 11 | PLOS Computational Biology | 1633 | Top 13% | 2.3% |
| 12 | American Journal of Epidemiology | 57 | Top 0.5% | 2.1% |
| 13 | Genetic Epidemiology | 46 | Top 0.3% | 2.1% |
| 14 | International Journal of Epidemiology | 74 | Top 1% | 1.7% |
| 15 | Journal of Medical Internet Research | 85 | Top 3% | 1.6% |
| 16 | BMC Bioinformatics | 383 | Top 5% | 1.3% |
| 17 | Medical Decision Making | 10 | Top 0.2% | 1.3% |
| 18 | Nature Communications | 4913 | Top 57% | 1.2% |
| 19 | PLOS Genetics | 756 | Top 11% | 1.2% |
| 20 | BMC Research Notes | 29 | Top 0.5% | 0.8% |
| 21 | International Journal of Medical Informatics | 25 | Top 2% | 0.8% |
| 22 | JAMIA Open | 37 | Top 1% | 0.8% |
| 23 | The Annals of Applied Statistics | 15 | Top 0.1% | 0.8% |
| 24 | npj Digital Medicine | 97 | Top 3% | 0.8% |
| 25 | BioData Mining | 15 | Top 0.9% | 0.7% |
| 26 | Biometrics | 22 | Top 0.2% | 0.7% |
| 27 | JMIR Public Health and Surveillance | 45 | Top 4% | 0.7% |
| 28 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 | Top 0.7% | 0.6% |
| 29 | Briefings in Bioinformatics | 326 | Top 8% | 0.6% |