Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation

Huang, P.; Charton, F.; Schmelzle, J.-N. M.; Darnell, S. S.; Prins, P.; Garrison, E.; Suh, G. E.

2024-09-20 bioinformatics
10.1101/2024.09.18.612131 bioRxiv

Language models (LMs) have been widely used to learn DNA sequence patterns and to generate synthetic sequences. In this paper, we present an approach for generating synthetic DNA data that combines pangenomes with LMs. We introduce three pangenome-based tokenization schemes that enhance DNA sequence generation. Our experimental results show that pangenome-based tokenization outperforms classical methods in generating high-utility synthetic DNA sequences, with significant improvements in training efficiency and sequence quality.

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 3% | 10.0%
2. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.1% | 8.3%
3. Nature Communications | 4913 papers in training set | Top 27% | 6.7%
4. Cell Systems | 167 papers in training set | Top 2% | 6.3%
5. Scientific Reports | 3102 papers in training set | Top 25% | 4.8%
6. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.4% | 4.2%
7. Briefings in Bioinformatics | 326 papers in training set | Top 2% | 3.9%
8. Advanced Science | 249 papers in training set | Top 5% | 3.9%
9. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.6%

50% of probability mass above

10. PLOS ONE | 4510 papers in training set | Top 40% | 3.5%
11. Genome Research | 409 papers in training set | Top 1% | 3.5%
12. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 2% | 3.0%
13. iScience | 1063 papers in training set | Top 7% | 3.0%
14. PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.7%
15. Nucleic Acids Research | 1128 papers in training set | Top 8% | 2.6%
16. Journal of Chemical Information and Modeling | 207 papers in training set | Top 2% | 1.8%
17. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.2% | 1.8%
18. Communications Biology | 886 papers in training set | Top 9% | 1.7%
19. Nature Computational Science | 50 papers in training set | Top 0.6% | 1.7%
20. Genome Biology | 555 papers in training set | Top 5% | 1.5%
21. Journal of Computational Biology | 37 papers in training set | Top 0.3% | 1.2%
22. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.2%
23. Nature Biotechnology | 147 papers in training set | Top 6% | 0.9%
24. Patterns | 70 papers in training set | Top 2% | 0.9%
25. Bioinformatics Advances | 184 papers in training set | Top 4% | 0.9%
26. GigaScience | 172 papers in training set | Top 3% | 0.9%
27. BioData Mining | 15 papers in training set | Top 0.7% | 0.9%
28. Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
29. Frontiers in Genetics | 197 papers in training set | Top 12% | 0.6%