A unified benchmark of synthetic data generation for clinical transcriptomic cancer cohorts

Trinh, T.-C.; Woillard, J.-B.; Uguzzoni, G.; Battail, C.

2026-05-16 bioinformatics
10.64898/2026.05.13.724858 bioRxiv

Balancing biological utility against patient privacy remains a key challenge for secure sharing of clinical transcriptomic datasets used in artificial intelligence for precision oncology. Here, we introduce the first benchmarking study tailored to high-dimensional clinical transcriptomic cancer data, comparing synthetic data generation methods across three clinical cancer trials. Our framework, SynOmicsBench, combines standardized preprocessing with multidimensional evaluation, prioritizing downstream biological validation alongside statistical fidelity and attack-based privacy assessment. No single method dominated all dimensions; Gaussian Copula achieved the most balanced performance, followed by Avatar. The results also show that metric-based similarity alone is insufficient to ensure preservation of higher-order molecular dependencies. Synthetic data consistently reproduced the directionality of biomedical signals, but with attenuated effect sizes and variability across replicates, so it supports hypothesis generation when multi-seed synthesis is adopted. Collectively, this framework provides a reproducible decision-support tool for method selection and promotes biologically informed, privacy-aware adoption of synthetic data in precision oncology.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Advanced Science (249 papers in training set; Top 1%; 9.9%)
2. Nature Communications (4913 papers in training set; Top 24%; 8.2%)
3. Nature Biomedical Engineering (42 papers in training set; Top 0.1%; 6.7%)
4. Nature Machine Intelligence (61 papers in training set; Top 0.5%; 6.2%)
5. npj Digital Medicine (97 papers in training set; Top 0.9%; 6.2%)
6. Cell Systems (167 papers in training set; Top 2%; 6.2%)
7. Genome Medicine (154 papers in training set; Top 1%; 4.7%)
8. Nature Medicine (117 papers in training set; Top 0.7%; 3.9%)

50% of probability mass above

9. PLOS Computational Biology (1633 papers in training set; Top 10%; 3.5%)
10. Briefings in Bioinformatics (326 papers in training set; Top 2%; 2.7%)
11. Nucleic Acids Research (1128 papers in training set; Top 7%; 2.7%)
12. Nature Biotechnology (147 papers in training set; Top 3%; 2.5%)
13. Scientific Reports (3102 papers in training set; Top 54%; 1.8%)
14. eLife (5422 papers in training set; Top 43%; 1.7%)
15. Science Advances (1098 papers in training set; Top 18%; 1.7%)
16. Cell Reports Medicine (140 papers in training set; Top 4%; 1.7%)
17. npj Precision Oncology (48 papers in training set; Top 0.6%; 1.6%)
18. Cell Genomics (162 papers in training set; Top 4%; 1.6%)
19. Cancer Research (116 papers in training set; Top 2%; 1.6%)
20. Nature Methods (336 papers in training set; Top 5%; 1.5%)
21. Patterns (70 papers in training set; Top 1%; 1.5%)
22. npj Systems Biology and Applications (99 papers in training set; Top 1%; 1.3%)
23. Cancer Research Communications (46 papers in training set; Top 0.7%; 1.2%)
24. PLOS ONE (4510 papers in training set; Top 60%; 1.2%)
25. Communications Biology (886 papers in training set; Top 15%; 1.2%)
26. GigaScience (172 papers in training set; Top 3%; 0.8%)
27. Cancer Cell (38 papers in training set; Top 2%; 0.7%)
28. Bioinformatics (1061 papers in training set; Top 10%; 0.6%)