Graph-Based Synthetic EHR Generation with Improved Quality-Privacy Trade-offs for Opioid Use Disorder Prediction
Alam, M. A. U.; Shalhout, S. Z.
Electronic health record (EHR) data are critical for clinical research but are challenging to share due to privacy and re-identification risks, particularly in sensitive domains such as opioid use disorder (OUD). Synthetic data generation offers a promising alternative; however, existing methods often struggle to preserve complex multivariate dependencies while maintaining a strong balance between data utility and privacy. The recently proposed MIIC-SDG framework leverages multivariate information theory and Bayesian network modeling to capture dependency structures and introduces Quality-Privacy Scores (QPS) to evaluate this trade-off, yet its capacity to model nonlinear relationships and support multi-task predictive settings remains limited. In this work, we propose a multi-task extension of TabGraphSyn, a graph-based generative framework for privacy-preserving EHR synthesis. The method constructs patient similarity graphs from high-dimensional tabular data and learns topology-aware embeddings via a graph convolutional network; these embeddings are then incorporated into a conditional variational autoencoder for synthetic data generation. Unlike prior approaches, our framework jointly models multiple clinically relevant OUD targets, including 180-day opioid abuse outcome, opioid concept group, and opioid source concept group, enabling preservation of label-dependent relationships across tasks. We evaluate TabGraphSyn against MIIC-SDG under a unified framework including multi-task predictive utility, distributional similarity, identifiability risk, membership inference risk, and QPS-based metrics. Results on the NIH All of Us dataset show that TabGraphSyn achieves a stronger overall utility-privacy balance, outperforming MIIC-SDG on most headline metrics, including synthetic multi-task ROC-AUC (0.5278 vs 0.4932) and MetaQPS (AM: 0.0215 vs 0.0115; HM: 0.0391 vs 0.0223), while slightly underperforming in macro F1 (0.2321 vs 0.2840). These findings demonstrate improved modeling of nonlinear dependencies and more favorable quality-privacy trade-offs in multi-task settings, supporting the use of TabGraphSyn for realistic and privacy-aware synthetic EHR data generation.
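The first two stages described in the abstract (a patient similarity graph built from tabular data, followed by graph-convolutional embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine similarity metric, the k-nearest-neighbour rule, the symmetric normalization, the function names `knn_similarity_graph` and `gcn_layer`, and the random toy data are all assumptions made for illustration. The conditional variational autoencoder stage and the multi-task heads are omitted.

```python
import numpy as np

def knn_similarity_graph(X, k=2):
    """Symmetric k-nearest-neighbour adjacency from cosine similarity (assumed metric)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)       # row-normalize, guarding against zero rows
    sim = Xn @ Xn.T                            # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)             # never pick a patient as its own neighbour
    n = X.shape[0]
    nbrs = np.argsort(-sim, axis=1)[:, :k]     # indices of the k most similar patients
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                  # symmetrize: keep an edge if either side chose it

def gcn_layer(A, H, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # 6 hypothetical patients, 4 features
A = knn_similarity_graph(X, k=2)
W = rng.normal(size=(4, 3))                    # untrained weights, illustration only
Z = gcn_layer(A, X, W)                         # topology-aware embeddings, shape (6, 3)
```

In a full pipeline along the lines the abstract describes, embeddings such as `Z` would be learned rather than produced by random weights, and would then condition the generative model.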