Virtual Pooling Enables Accurate, End-to-End Multi-Institutional Study Execution and Causal Inference Without Centralized Data Sharing
Ahmad, I.; Ayati, A.; Liu, K.; Ko, S.; Bonine, N.; Tabano, D.; Malik, N.; Lyu, T.; Zheng, K.; Rudrapatna, V. A.; Gupta, T.
Background: Multicenter retrospective studies often rely on bringing patient-level data together into a single repository, introducing substantial regulatory and operational barriers. Federated analytics provides a privacy-preserving alternative; however, existing implementations are complex to use, require extensive manual effort for data cleaning, preprocessing, and harmonization, and produce approximate rather than ground-truth results for many biostatistical methods. Virtual Pooling (VP) is a recently developed multicenter study execution platform designed to overcome these limitations. In this study, we evaluate whether VP can replicate a published multicenter retrospective study end-to-end (including data preprocessing, regression analysis, and causal inference) without centralized data aggregation.
Methods: We deployed VP at the University of California, San Francisco (UCSF) and the University of California, Irvine (UCI) and attempted to replicate a published study of diabetic eye disease screening practices (UCSF N = 2,592; UCI N = 5,642). VP supported all phases of this two-center study, including data cleaning, harmonization, feature engineering, imputation, propensity score estimation, patient matching, and model estimation, all conducted through a single interface without manual coordination between centers. We verified preprocessing correctness and compared descriptive statistics and causal effect estimates with those from the original study, which relied on data transfers across the centers. We also measured the latency overhead introduced by VP.
Results: VP was deployed without hospital infrastructure changes, new or non-standard governance agreements, or dedicated IT support. All preprocessing steps executed correctly, with individual preprocessing operations and descriptive statistics completing in under 1 second, logistic regression in under 10 seconds, and propensity score matching in under 30 seconds. Descriptive statistics for all 30 baseline covariates were numerically identical to the original study. Univariate regression results identifying predictors of completed screening were also identical, with recent eye clinic referral (OR = 56.7; 95% CI: 42.1-76.4) and history of eye disease (OR = 6.4; 95% CI: 5.6-7.4) as the strongest predictors. VP also reproduced pooled causal estimates of automated referrals, showing an increase in screening completion from 21% to 36% at UCSF and from 13% to 34% at UCI.
Conclusion: VP enables accurate, end-to-end multicenter clinical studies without centralized data sharing. By providing a single interface that supports the full analytical workflow, from uncleaned and unharmonized data through statistical results, and by exactly reproducing pooled results, VP eliminates manual coordination and data transfers across centers. These findings validate its practical potential to transform multicenter retrospective studies, particularly in contexts where data sharing is time-consuming, bureaucratic, or restricted.
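The abstract does not describe how VP computes its results, so the following is only an illustrative sketch of why a pooled descriptive statistic can, in principle, match a centralized analysis without moving patient-level rows. It assumes each site shares only three aggregates for a covariate (count, sum, and sum of squares); the names `SiteSummary`, `summarize`, and `pooled_mean_variance` are hypothetical and are not part of VP.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site could share instead of patient-level rows (assumed design)."""
    n: int            # number of patients at the site
    total: float      # sum of the covariate
    total_sq: float   # sum of squared covariate values


def summarize(values: Sequence[float]) -> SiteSummary:
    """Compute the per-site aggregates locally, inside the site's own environment."""
    return SiteSummary(
        n=len(values),
        total=math.fsum(values),
        total_sq=math.fsum(v * v for v in values),
    )


def pooled_mean_variance(summaries: Iterable[SiteSummary]) -> Tuple[float, float]:
    """Combine site aggregates into the pooled mean and sample variance.

    Algebraically this equals the mean and variance of the concatenated data;
    in floating point the sum-of-squares form can lose precision when the
    mean is large relative to the spread.
    """
    summaries = list(summaries)
    n = sum(s.n for s in summaries)
    total = math.fsum(s.total for s in summaries)
    total_sq = math.fsum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Toy values only; these are not data from the study.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
mean, variance = pooled_mean_variance([summarize(site_a), summarize(site_b)])
print(mean, variance)
```

Means and variances decompose exactly into per-site sums, which is why this case is simple; methods such as logistic regression or propensity score matching need iterative or more involved protocols, and this sketch makes no claim about how VP handles them.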