Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics
10.64898/2026.05.04.722226 bioRxiv

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offer a potential solution to data shortfalls, provided that the generated data are an adequate facsimile of the underlying distribution; assessing the adequacy of such synthetic data remains an open problem. In this work, we evaluate a previously proposed synthetic data generator. The generator applies a series of transformations to the observed data to accommodate the small sample size, resulting in an uncoupled representation in which uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, we (1) develop a nonparametric method for assessing multivariate similarity based on the Cramér-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so that each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random-projection-level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test statistic. The key innovation was decoupling the significance level of the two-sample test from the one governing the final normal inference. We showed that the same projection framework also evaluates the full multivariate covariance structure.
The generator produced high-fidelity multivariate synthetic data when the absence of bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
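
The projection-testing idea in the abstract can be illustrated with a minimal Python sketch. It rests on the Cramér-Wold theorem: two random vectors have the same distribution exactly when all of their one-dimensional projections do. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `random_projection_z`, the `draw_synthetic` callable, the choice of the Kolmogorov-Smirnov test as the two-sample test, and the simple binomial standardization that treats all rejection indicators as independent Bernoulli(alpha) trials under the null. The authors' actual aggregation across replicates and directions may differ.

```python
import numpy as np
from scipy.stats import ks_2samp, norm


def random_projection_z(x, draw_synthetic, n_dirs=200, n_reps=5, alpha=0.05, seed=0):
    """Pool two-sample test rejections over random directions into a z-score.

    x              : (n, d) array of observed data.
    draw_synthetic : hypothetical callable rng -> (m, d) array, one fresh
                     synthetic sample per replicate (an assumption of this sketch).
    alpha          : level of the per-projection two-sample test; it is chosen
                     separately from the level used to judge the final z-score.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    rejections = 0
    for _ in range(n_dirs):
        # Random unit direction, uniform on the sphere.
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        px = x @ u
        for _ in range(n_reps):
            py = draw_synthetic(rng) @ u
            # One Bernoulli outcome: did the KS test reject at level alpha?
            rejections += int(ks_2samp(px, py).pvalue < alpha)
    n_trials = n_dirs * n_reps
    # Assumed null model: rejections ~ Binomial(n_trials, alpha), normal approx.
    z = (rejections - n_trials * alpha) / np.sqrt(n_trials * alpha * (1 - alpha))
    return z, norm.sf(z)  # one-sided p-value: too many rejections is evidence of mismatch


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    observed = rng.standard_normal((200, 3))
    z, p = random_projection_z(observed, lambda r: r.standard_normal((200, 3)) + 2.0)
    print(f"z = {z:.1f}, one-sided p = {p:.3g}")
```

Because every projection reuses the same observed sample, the Bernoulli outcomes are not strictly independent, so the z-score from this sketch should be read as a rough diagnostic, not a calibrated test.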

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology (1633 papers in training set): Top 1%, 18.6%
2. Bioinformatics (1061 papers in training set): Top 2%, 14.3%
3. Statistics in Medicine (34 papers in training set): Top 0.1%, 8.4%
4. PLOS ONE (4510 papers in training set): Top 38%, 3.7%
5. Computer Methods and Programs in Biomedicine (27 papers in training set): Top 0.1%, 3.7%
6. BioData Mining (15 papers in training set): Top 0.1%, 3.6%

50% of probability mass above

7. Scientific Reports (3102 papers in training set): Top 37%, 3.6%
8. Frontiers in Genetics (197 papers in training set): Top 2%, 3.6%
9. BMC Bioinformatics (383 papers in training set): Top 3%, 2.6%
10. GigaScience (172 papers in training set): Top 0.9%, 2.4%
11. Journal of Biomedical Informatics (45 papers in training set): Top 0.7%, 1.9%
12. Biostatistics (21 papers in training set): Top 0.1%, 1.7%
13. Journal of Neuroscience Methods (106 papers in training set): Top 1.0%, 1.5%
14. Patterns (70 papers in training set): Top 1%, 1.3%
15. Biology Methods and Protocols (53 papers in training set): Top 2%, 1.1%
16. Bioinformatics Advances (184 papers in training set): Top 4%, 0.8%
17. IEEE Journal of Biomedical and Health Informatics (34 papers in training set): Top 2%, 0.8%
18. The Annals of Applied Statistics (15 papers in training set): Top 0.1%, 0.7%
19. IEEE Access (31 papers in training set): Top 1.0%, 0.7%
20. Computational and Structural Biotechnology Journal (216 papers in training set): Top 9%, 0.7%
21. Frontiers in Bioinformatics (45 papers in training set): Top 0.9%, 0.7%
22. Genetic Epidemiology (46 papers in training set): Top 1.0%, 0.6%
23. PLOS Genetics (756 papers in training set): Top 17%, 0.6%
24. BMC Medical Genomics (36 papers in training set): Top 2%, 0.6%