Graph-Based Synthetic EHR Generation with Improved Quality-Privacy Trade-offs for Opioid Use Disorder Prediction

Alam, M. A. U.; Shalhout, S. Z.

2026-04-27 pain medicine
10.64898/2026.04.24.26351704 medRxiv
Electronic health record (EHR) data are critical for clinical research but are challenging to share due to privacy and re-identification risks, particularly in sensitive domains such as opioid use disorder (OUD). Synthetic data generation offers a promising alternative; however, existing methods often struggle to preserve complex multivariate dependencies while balancing data utility against privacy. The recently proposed MIIC-SDG framework leverages multivariate information theory and Bayesian network modeling to capture dependency structures and introduces Quality-Privacy Scores (QPS) to evaluate this trade-off, yet its capacity to model nonlinear relationships and support multi-task predictive settings remains limited. In this work, we propose a multi-task extension of TabGraphSyn, a graph-based generative framework for privacy-preserving EHR synthesis. The method constructs patient similarity graphs from high-dimensional tabular data and learns topology-aware embeddings via a graph convolutional network; these embeddings are then incorporated into a conditional variational autoencoder for synthetic data generation. Unlike prior approaches, our framework jointly models multiple clinically relevant OUD targets, including 180-day opioid abuse outcome, opioid concept group, and opioid source concept group, enabling preservation of label-dependent relationships across tasks. We evaluate TabGraphSyn against MIIC-SDG under a unified framework covering multi-task predictive utility, distributional similarity, identifiability risk, membership inference risk, and QPS-based metrics. Results on the NIH All of Us dataset show that TabGraphSyn achieves a stronger overall utility-privacy balance, outperforming MIIC-SDG on most headline metrics, including synthetic multi-task ROC-AUC (0.5278 vs 0.4932) and MetaQPS (AM: 0.0215 vs 0.0115; HM: 0.0391 vs 0.0223), while slightly underperforming in macro F1 (0.2321 vs 0.2840). These findings demonstrate improved modeling of nonlinear dependencies and more favorable quality-privacy trade-offs in multi-task settings, supporting the use of TabGraphSyn for realistic and privacy-aware synthetic EHR data generation.
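
To make the graph-construction and embedding steps described in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes cosine similarity with a k-nearest-neighbour rule for the patient graph and the standard symmetric-normalized graph-convolution step for the embedding; the abstract specifies neither choice. The function names `knn_similarity_graph` and `gcn_layer`, the toy data, and the random weights are all hypothetical, and the conditional variational autoencoder stage is omitted.

```python
import numpy as np

def knn_similarity_graph(X, k=3):
    """Symmetric k-nearest-neighbour adjacency from cosine similarity (assumed metric)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)      # unit-normalise each patient row
    S = Xn @ Xn.T                             # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)              # exclude self from the neighbour search
    n = X.shape[0]
    nbrs = np.argsort(-S, axis=1)[:, :k]      # indices of the k most similar patients
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                 # symmetrise: keep an edge if either side chose it

def gcn_layer(A, H, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                   # 8 hypothetical patients, 5 features
A = knn_similarity_graph(X, k=3)
W = rng.normal(size=(5, 4))                   # untrained, illustrative weights
Z = gcn_layer(A, X, W)                        # topology-aware embeddings, shape (8, 4)
```

In a full pipeline of the kind the abstract describes, embeddings such as `Z` would be passed to the generative model as conditioning input alongside the task labels.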

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | npj Digital Medicine | 97 | Top 0.2% | 22.4%
2 | Science Advances | 1098 | Top 0.1% | 10.4%
3 | Clinical Pharmacology & Therapeutics | 25 | Top 0.1% | 6.3%
4 | Nature Biomedical Engineering | 42 | Top 0.3% | 3.6%
5 | Nature Computational Science | 50 | Top 0.2% | 3.6%
6 | Human Brain Mapping | 295 | Top 2% | 3.2%
7 | Advanced Science | 249 | Top 6% | 3.2%
50% of probability mass above
8 | Journal of Biomedical Informatics | 45 | Top 0.6% | 2.6%
9 | Nature Medicine | 117 | Top 1% | 2.6%
10 | Scientific Reports | 3102 | Top 48% | 2.3%
11 | Nature Communications | 4913 | Top 49% | 1.9%
12 | Frontiers in Digital Health | 20 | Top 0.6% | 1.7%
13 | Proceedings of the National Academy of Sciences | 2130 | Top 33% | 1.7%
14 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 1% | 1.5%
15 | Journal of Medical Internet Research | 85 | Top 3% | 1.5%
16 | Nature Machine Intelligence | 61 | Top 2% | 1.3%
17 | Journal of the American Medical Informatics Association | 61 | Top 1% | 1.3%
18 | Genome Medicine | 154 | Top 6% | 1.2%
19 | Science Translational Medicine | 111 | Top 4% | 1.2%
20 | eLife | 5422 | Top 50% | 1.1%
21 | Clinical and Translational Science | 21 | Top 0.7% | 0.9%
22 | Communications Medicine | 85 | Top 0.6% | 0.9%
23 | Bioinformatics | 1061 | Top 8% | 0.9%
24 | Patterns | 70 | Top 2% | 0.9%
25 | PLOS ONE | 4510 | Top 64% | 0.9%
26 | Nature Biotechnology | 147 | Top 7% | 0.7%
27 | PLOS Digital Health | 91 | Top 3% | 0.7%
28 | The American Journal of Human Genetics | 206 | Top 4% | 0.7%
29 | Brain | 154 | Top 5% | 0.7%
30 | Nature Human Behaviour | 85 | Top 4% | 0.7%