Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data

Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.

2026-03-16 health informatics
10.64898/2026.03.11.26348077 medRxiv
Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve the statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, reducing the gap to real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 97 papers in training set, Top 0.4%, 12.4%
2. Nature Communications: 4913 papers in training set, Top 17%, 10.1%
3. Nature Computational Science: 50 papers in training set, Top 0.1%, 10.1%
4. Nature Biomedical Engineering: 42 papers in training set, Top 0.1%, 10.1%
5. Patterns: 70 papers in training set, Top 0.1%, 4.4%
6. Nature Machine Intelligence: 61 papers in training set, Top 0.8%, 3.7%

50% of probability mass above

7. Bioinformatics: 1061 papers in training set, Top 5%, 3.6%
8. Scientific Reports: 3102 papers in training set, Top 41%, 3.1%
9. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.9%, 2.7%
10. Nature Methods: 336 papers in training set, Top 3%, 2.4%
11. Med: 38 papers in training set, Top 0.1%, 2.4%
12. Science Translational Medicine: 111 papers in training set, Top 2%, 2.1%
13. Nature Medicine: 117 papers in training set, Top 2%, 1.9%
14. Science Advances: 1098 papers in training set, Top 17%, 1.7%
15. Cell Systems: 167 papers in training set, Top 7%, 1.7%
16. Journal of Biomedical Informatics: 45 papers in training set, Top 0.9%, 1.5%
17. iScience: 1063 papers in training set, Top 19%, 1.3%
18. Nature: 575 papers in training set, Top 13%, 1.0%
19. Nature Biotechnology: 147 papers in training set, Top 6%, 1.0%
20. PLOS ONE: 4510 papers in training set, Top 62%, 1.0%
21. Communications Biology: 886 papers in training set, Top 18%, 0.9%
22. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.7%, 0.9%
23. Advanced Science: 249 papers in training set, Top 16%, 0.9%
24. eLife: 5422 papers in training set, Top 55%, 0.8%
25. Genome Research: 409 papers in training set, Top 4%, 0.8%
26. Nature Genetics: 240 papers in training set, Top 7%, 0.8%
27. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.8%
28. Science: 429 papers in training set, Top 20%, 0.7%
29. Genome Biology: 555 papers in training set, Top 8%, 0.7%
30. JMIR Medical Informatics: 17 papers in training set, Top 2%, 0.7%