
Virtual Pooling Enables Accurate, End-to-End Multi-Institutional Study Execution and Causal Inference Without Centralized Data Sharing

Ahmad, I.; Ayati, A.; Liu, K.; Ko, S.; Bonine, N.; Tabano, D.; Malik, N.; Lyu, T.; Zheng, K.; Rudrapatna, V. A.; Gupta, T.

2026-03-26 health informatics
10.64898/2026.03.24.26349123 medRxiv

Background: Multicenter retrospective studies often rely on bringing patient-level data together into a single repository, introducing substantial regulatory and operational barriers. Federated analytics provides a privacy-preserving alternative; however, existing implementations are complex to use, require extensive manual effort for data cleaning, preprocessing, and harmonization, and produce approximate rather than ground-truth results for many biostatistical methods. Virtual Pooling (VP) is a recently developed multicenter study execution platform designed to overcome these limitations. In this study, we evaluate whether VP can replicate a published multicenter retrospective study end-to-end (including data preprocessing, regression analysis, and causal inference) without centralized data aggregation.

Methods: We deployed VP at the University of California, San Francisco (UCSF) and the University of California, Irvine (UCI) and attempted to replicate a published study of diabetic eye disease screening practices (UCSF N = 2,592; UCI N = 5,642). VP supported all phases of this two-center study, including data cleaning, harmonization, feature engineering, imputation, propensity score estimation, patient matching, and model estimation, all conducted through a single interface without manual coordination between centers. We verified preprocessing correctness and compared descriptive statistics and causal effect estimates with those from the original study, which relied on data transfers across the centers. We also measured the latency overhead introduced by VP.

Results: VP was deployed without hospital infrastructure changes, new or non-standard governance agreements, or dedicated IT support. All preprocessing steps executed correctly, with individual preprocessing operations and descriptive statistics completing in under 1 second, logistic regression in under 10 seconds, and propensity score matching in under 30 seconds. Descriptive statistics for all 30 baseline covariates were numerically identical to the original study. Univariate regression results identifying predictors of completed screening were also identical, with recent eye clinic referral (OR = 56.7; 95% CI: 42.1-76.4) and history of eye disease (OR = 6.4; 95% CI: 5.6-7.4) as the strongest predictors. VP also reproduced pooled causal estimates of automated referrals, showing an increase in screening completion from 21% to 36% at UCSF and from 13% to 34% at UCI.

Conclusion: VP enables accurate, end-to-end multicenter clinical studies without centralized data sharing. By providing a single interface that supports the full analytical workflow, from uncleaned and unharmonized data through statistical results, and by exactly reproducing pooled results, VP eliminates manual coordination and data transfers across centers. These findings validate its practical potential to transform multicenter retrospective studies, particularly in contexts where data sharing is time-consuming, bureaucratic, or restricted.
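To illustrate how a descriptive statistic can match a pooled analysis without moving patient-level rows, the sketch below combines per-site counts, sums, and sums of squares into a pooled mean and sample variance. It is a minimal sketch of a generic sufficient-statistics approach, not a description of how VP works internally; the abstract does not specify VP's algorithms. The names `SiteSummary`, `summarize`, and `pooled_mean_variance` are hypothetical, and the numbers are made up for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregates one site shares instead of patient-level rows (hypothetical)."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    """Runs locally at a site; only a count and two sums leave the site."""
    return SiteSummary(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_variance(summaries):
    """Combine per-site aggregates into the pooled mean and sample variance."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from sums: (sum of squares - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Made-up illustrative values, not data from the study.
site_a = [54.0, 61.0, 47.0]
site_b = [66.0, 58.0, 72.0, 49.0]

pooled_mean, pooled_var = pooled_mean_variance(
    [summarize(site_a), summarize(site_b)]
)

# Reference: what a centralized analysis computes on the combined rows.
central_mean = statistics.mean(site_a + site_b)
central_var = statistics.variance(site_a + site_b)

print(pooled_mean, central_mean)
print(pooled_var, central_var)
```

The two routes are algebraically identical, so any difference is floating-point rounding. The sum-of-squares formula can lose precision when values are large relative to their spread; a production system would likely use a more stable combination rule.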

Matching journals

The top 2 journals together account for 59.1% of the predicted probability mass, passing the 50% mark.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 40.1% |
| 2 | npj Digital Medicine | 97 | Top 0.2% | 19.0% |
| 3 | JAMIA Open | 37 | Top 0.1% | 6.9% |
| 4 | Journal of Medical Internet Research | 85 | Top 0.9% | 4.9% |
| 5 | PLOS Digital Health | 91 | Top 0.9% | 2.9% |
| 6 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.7% |
| 7 | BMJ Health & Care Informatics | 13 | Top 0.5% | 1.5% |
| 8 | Patterns | 70 | Top 1% | 1.5% |
| 9 | JMIR Medical Informatics | 17 | Top 0.8% | 1.5% |
| 10 | PLOS ONE | 4510 | Top 58% | 1.4% |
| 11 | Nature Communications | 4913 | Top 54% | 1.4% |
| 12 | JCO Clinical Cancer Informatics | 18 | Top 0.6% | 1.2% |
| 13 | Scientific Reports | 3102 | Top 66% | 1.2% |
| 14 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2% |
| 15 | BMC Medical Research Methodology | 43 | Top 1% | 0.9% |
| 16 | The Lancet Digital Health | 25 | Top 0.8% | 0.9% |
| 17 | Med | 38 | Top 0.8% | 0.8% |
| 18 | GigaScience | 172 | Top 3% | 0.8% |
| 19 | Frontiers in Public Health | 140 | Top 8% | 0.7% |
| 20 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.7% |
| 21 | British Journal of Ophthalmology | 14 | Top 0.3% | 0.7% |
| 22 | Pharmacoepidemiology and Drug Safety | 13 | Top 0.5% | 0.7% |
| 23 | Journal of Clinical and Translational Science | 11 | Top 0.6% | 0.5% |
| 24 | Frontiers in Digital Health | 20 | Top 2% | 0.5% |
| 25 | Annals of Internal Medicine | 27 | Top 1% | 0.5% |

The 50% probability-mass cutoff falls after rank 2.
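The position of the 50% cutoff can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first five values from the list; the function name `ranks_to_reach` is hypothetical and is not part of the page's own tooling.

```python
def ranks_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative %) at which the running total first reaches threshold.

    Returns None if the listed probabilities never reach the threshold.
    """
    cumulative = 0.0
    for rank, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= threshold:
            return rank, cumulative
    return None


# Predicted probabilities (%) for ranks 1 to 5, copied from the list above.
top_five = [40.1, 19.0, 6.9, 4.9, 2.9]

result = ranks_to_reach(top_five)
print(result)  # rank 2, cumulative about 59.1
```

Rank 1 alone (40.1%) falls short of 50%, so the cutoff lands after rank 2, where the running total is about 59.1%.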