Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data

Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.

2026-03-16 health informatics

10.64898/2026.03.11.26348077 medRxiv

Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve the statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, closing the gap to real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.
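
The two headline metrics in the abstract (agreement between real-data and synthetic-data regression coefficients, and the share of the performance gap that is closed) can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: ordinary least squares as the regression model, Pearson correlation as the agreement measure, and a simple ratio definition of gap reduction. The helper names `fit_ols`, `coefficient_correlation`, and `gap_reduction` are hypothetical, and the data are toy draws, not MIMIC-III or ACS.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares and return the slope coefficients (intercept dropped)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]


def coefficient_correlation(X_real, y_real, X_syn, y_syn):
    """Pearson correlation between coefficients fit on real vs. synthetic data (assumed metric)."""
    b_real = fit_ols(X_real, y_real)
    b_syn = fit_ols(X_syn, y_syn)
    return float(np.corrcoef(b_real, b_syn)[0, 1])


def gap_reduction(real_score, baseline_score, improved_score):
    """Fraction of the gap to the real-data score that the improved model closes (assumed definition)."""
    gap_before = abs(real_score - baseline_score)
    gap_after = abs(real_score - improved_score)
    return 1.0 - gap_after / gap_before


# Toy stand-ins: both "real" and "synthetic" samples share the same true coefficients.
rng = np.random.default_rng(0)
true_beta = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
X_real = rng.normal(size=(500, 5))
y_real = X_real @ true_beta + rng.normal(scale=0.1, size=500)
X_syn = rng.normal(size=(500, 5))
y_syn = X_syn @ true_beta + rng.normal(scale=0.1, size=500)

r = coefficient_correlation(X_real, y_real, X_syn, y_syn)
print(f"coefficient correlation: {r:.3f}")
```

In this toy setting the synthetic sample preserves the outcome relationship by construction, so the correlation is close to 1. A generator that ignores that relationship would be expected to score much lower, which is the kind of shortfall the abstract reports for RLSYN.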

Matching journals

● Non-profit ◐ University press ○ Commercial

The top 6 journals account for 50% of the predicted probability mass.

npj Digital Medicine

○ 97 papers in training set

Nature Communications

○ 4913 papers in training set

Nature Computational Science

○ 50 papers in training set

Nature Biomedical Engineering

○ 42 papers in training set

○ 70 papers in training set

Nature Machine Intelligence

○ 61 papers in training set

50% of probability mass above

◐ 1061 papers in training set

Scientific Reports

○ 3102 papers in training set

Journal of the American Medical Informatics Association

◐ 61 papers in training set

○ 336 papers in training set

○ 38 papers in training set

Science Translational Medicine

● 111 papers in training set

Nature Medicine

○ 117 papers in training set

Science Advances

● 1098 papers in training set

○ 167 papers in training set

Journal of Biomedical Informatics

○ 45 papers in training set

○ 1063 papers in training set

○ 575 papers in training set

Nature Biotechnology

○ 147 papers in training set

● 4510 papers in training set

Communications Biology

○ 886 papers in training set

JCO Clinical Cancer Informatics

● 18 papers in training set

Advanced Science

○ 249 papers in training set

● 5422 papers in training set

Genome Research

● 409 papers in training set

Nature Genetics

○ 240 papers in training set

IEEE Journal of Biomedical and Health Informatics

● 34 papers in training set

● 429 papers in training set

○ 555 papers in training set

JMIR Medical Informatics

◐ 17 papers in training set