Deep Generative Models for Discrete Genotype Simulation

Xie, S.; Tribout, T.; Boichard, D.; Hanczar, B.; Chiquet, J.; Barrey, E.

2025-08-12 bioinformatics
10.1101/2025.08.08.669289 bioRxiv
Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cow and multiple chromosomes from human. Model performance was assessed using a well-established set of metrics drawn from both the deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype associations. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Probability  Percentile  Papers in training set  Journal
1     14.8%        Top 2%      1061                    Bioinformatics
2     8.5%         Top 4%      1633                    PLOS Computational Biology
3     8.5%         Top 0.1%    17                      IEEE Transactions on Computational Biology and Bioinformatics
4     6.4%         Top 0.8%    197                     Frontiers in Genetics
5     4.4%         Top 1.0%    206                     The American Journal of Human Genetics
6     4.3%         Top 1%      326                     Briefings in Bioinformatics
7     4.0%         Top 3%      167                     Cell Systems
50% of probability mass above
8     3.7%         Top 38%     4913                    Nature Communications
9     3.3%         Top 0.5%    34                      IEEE Journal of Biomedical and Health Informatics
10    2.1%         Top 50%     3102                    Scientific Reports
11    2.1%         Top 1%      61                      Nature Machine Intelligence
12    2.1%         Top 0.2%    15                      BioData Mining
13    1.9%         Top 12%     1063                    iScience
14    1.9%         Top 0.1%    37                      Journal of Computational Biology
15    1.9%         Top 7%      756                     PLOS Genetics
16    1.7%         Top 53%     4510                    PLOS ONE
17    1.7%         Top 4%      383                     BMC Bioinformatics
18    1.3%         Top 3%      409                     Genome Research
19    1.3%         Top 5%      154                     Genome Medicine
20    1.2%         Top 38%     2130                    Proceedings of the National Academy of Sciences
21    1.1%         Top 4%      184                     Bioinformatics Advances
22    1.1%         Top 0.9%    49                      European Journal of Human Genetics
23    1.1%         Top 14%     1128                    Nucleic Acids Research
24    0.9%         Top 1%      50                      Nature Computational Science
25    0.9%         Top 0.5%    32                      IEEE/ACM Transactions on Computational Biology and Bioinformatics
26    0.9%         Top 26%     1098                    Science Advances
27    0.8%         Top 21%     886                     Communications Biology
28    0.8%         Top 17%     249                     Advanced Science
29    0.8%         Top 9%      216                     Computational and Structural Biotechnology Journal
30    0.7%         Top 2%      147                     PNAS Nexus
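
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below is only an illustration of that arithmetic: the helper name `ranks_to_reach` is hypothetical, and the percentages are copied by hand from the listing above rather than taken from the underlying model.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the seven top-ranked journals,
# copied by hand from the listing above.
top_probs = [14.8, 8.5, 8.5, 6.4, 4.4, 4.3, 4.0]


def ranks_to_reach(probs, threshold):
    """Return how many leading entries are needed for the running sum
    to reach `threshold`, or None if it is never reached.

    Hypothetical helper written for this illustration.
    """
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


# Running totals after each rank; the last value is the mass held by the top 7.
cumulative = list(accumulate(top_probs))

print(ranks_to_reach(top_probs, 50.0))  # prints 7
```

The running total stays below 50% through rank 6 and crosses it at rank 7, which agrees with where the listing places its "50% of probability mass above" marker.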