Synthetic RNA-seq cohorts for data sharing: a discovery-aware benchmark at transcriptome scale
Nanda, A.; Saha, S.
Background
Sharing patient-level gene expression data is essential for translational discovery but carries documented re-identification risks. Bulk RNA-seq count matrices can retain genotypic signals, and paired clinical metadata compounds this risk through quasi-identifier matching. Synthetic RNA-seq cohorts offer a complementary path for privacy-preserving data sharing, but the field lacks a multi-axis benchmark that probes biological fidelity and empirical privacy risk at transcriptome scale. Here we present a multi-axis benchmark framework that reflects how transcriptomic cohorts are used in translational practice.

Methods
We benchmarked three generative models across four cohorts drawn from datasets spanning oncology (TCGA-LUAD), sepsis (GSE184900), and pediatric IBD (RISK/GSE57945): dbTwin (a non-deep-learning, target-conditioned method that operates natively at RNA-seq scale), class-MVN (a low-rank, target-conditioned multivariate Gaussian model), and PCA-CTGAN (a tabular GAN trained in PCA-compressed space). Synthetic cohorts were generated from the training folds of a five-fold stratified design. We evaluated differential expression (DE) gene recovery, log2 fold-change (log2FC) and significance (padj) concordance, held-out AUC under train-on-synthetic, test-on-real (TSTR) evaluation, SHAP concordance, and distance-based memorization risk.

Results
class-MVN recovered 64.8% and 43.1% of real DE genes in the two binary cohorts, with high fold-change correlation but lower significance concordance (r = 0.24-0.68) and inflated DE gene counts. dbTwin recovered 78.7% and 91.8% of real DE genes in the same cohorts, with high fold-change correlation and stronger significance concordance (r ≥ 0.88). Both methods matched held-out real AUC under TSTR, but SHAP agreement differed substantially: dbTwin preserved feature attribution patterns across cohorts (SHAP top-50 genes r = 0.84-0.99 across two binary and two multiclass cohorts), whereas class-MVN showed moderate performance for majority classes but degraded in multiclass and imbalanced settings (SHAP r = 0.31-0.79). PCA-CTGAN performed poorly across most DE and ML metrics. Distance-to-closest-record analysis did not indicate memorization by any of the models.

Conclusions
We introduced a multi-axis, transcriptome-scale, discovery-aware benchmark to validate synthetic RNA-seq cohorts for translational workflows and evaluated three generative models across four real-world cohorts. These results support the use of synthetic RNA-seq cohorts for exploratory analysis and method development, while emphasizing the need for careful validation before use in higher-stakes applications. All benchmark code and data are available at https://github.com/Nanda-Aditya/rna-syn-bench.
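As a rough illustration of three of the evaluation axes named in the abstract (DE gene recovery, log2FC concordance, and distance to closest record), the following is a minimal sketch. It is not the authors' implementation from the linked repository: the function names, the use of Pearson correlation, and the Euclidean metric are all assumptions made here for illustration.

```python
import numpy as np


def de_recovery(real_de, synth_de):
    """Fraction of real DE genes that are also called DE in the synthetic cohort.

    Hypothetical helper: both arguments are iterables of gene identifiers.
    """
    real_de, synth_de = set(real_de), set(synth_de)
    if not real_de:
        raise ValueError("real_de must contain at least one gene")
    return len(real_de & synth_de) / len(real_de)


def log2fc_concordance(real_lfc, synth_lfc):
    """Pearson correlation between gene-matched log2 fold-change vectors.

    Assumption: both vectors are ordered by the same gene list.
    """
    return float(np.corrcoef(real_lfc, synth_lfc)[0, 1])


def distance_to_closest_record(synth, reference):
    """Euclidean distance from each synthetic sample to its nearest reference sample.

    Rows are samples, columns are genes. The squared-norm identity
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b avoids building a
    samples x samples x genes array, which matters at transcriptome scale.
    """
    s = np.asarray(synth, dtype=float)
    r = np.asarray(reference, dtype=float)
    d2 = (s ** 2).sum(axis=1)[:, None] + (r ** 2).sum(axis=1)[None, :] - 2.0 * (s @ r.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by floating-point rounding
    return np.sqrt(d2).min(axis=1)
```

One common way to read the last metric, assumed here rather than taken from the abstract, is to compute the distances once against the training fold and once against the held-out fold: synthetic samples that sit much closer to training records than to held-out records would suggest memorization.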