Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation
Huang, P.; Charton, F.; Schmelzle, J.-N. M.; Darnell, S. S.; Prins, P.; Garrison, E.; Suh, G. E.
Language models (LMs) have been widely used to learn DNA sequence patterns and to generate synthetic sequences. In this paper, we present an approach to generating synthetic DNA data that combines pangenomes with LMs. We introduce three pangenome-based tokenization schemes for DNA sequence generation. Our experimental results show that pangenome-based tokenization outperforms classical tokenization methods in generating high-utility synthetic DNA sequences, with significant improvements in both training efficiency and sequence quality.
Matching journals
The top 9 journals account for 50% of the predicted probability mass.