Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data
Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.
Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve the statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, narrowing the gap to real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come at negligible cost to fidelity or privacy and are robust to reductions in training data.
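As a rough illustration of the two headline metrics, the sketch below fits the same regression on a "real" and a "synthetic" dataset, correlates the fitted coefficients, and computes a relative gap reduction. It is not the authors' evaluation code: the use of ordinary least squares, Pearson correlation, the gap-reduction formula, the function names, and all toy data are assumptions made for illustration only.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return slope coefficients only."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]


def coefficient_correlation(X_real, y_real, X_syn, y_syn):
    """Pearson correlation between coefficients fit on real vs. synthetic data (assumed metric)."""
    b_real = fit_ols(X_real, y_real)
    b_syn = fit_ols(X_syn, y_syn)
    return float(np.corrcoef(b_real, b_syn)[0, 1])


def gap_reduction(real_score, baseline_score, improved_score):
    """Fraction of the real-vs-baseline performance gap closed (assumed definition)."""
    old_gap = real_score - baseline_score
    new_gap = real_score - improved_score
    return 1.0 - new_gap / old_gap


# Toy data only: a stand-in for "real" data with a known linear relationship.
rng = np.random.default_rng(0)
true_beta = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
X_real = rng.normal(size=(500, 5))
y_real = X_real @ true_beta + rng.normal(scale=0.1, size=500)

# A "synthetic" dataset that preserves the covariate-outcome relationship.
X_syn = rng.normal(size=(500, 5))
y_syn = X_syn @ true_beta + rng.normal(scale=0.1, size=500)

# A "synthetic" dataset whose outcome is unrelated to the covariates.
y_bad = rng.normal(size=500)

good = coefficient_correlation(X_real, y_real, X_syn, y_syn)
bad = coefficient_correlation(X_real, y_real, X_syn, y_bad)
print(f"preserved relationship: r = {good:.3f}")
print(f"broken relationship:    r = {bad:.3f}")

# Hypothetical scores, not taken from the paper.
print(f"gap reduction: {gap_reduction(0.80, 0.70, 0.78):.1%}")
```

A generator that preserves marginal distributions but not covariate-outcome relationships would score well on fidelity yet poorly on the first metric, which is the failure mode the abstract targets.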
Matching journals
The top 6 journals account for 50% of the predicted probability mass.