TandemTwister: Scalable genotyping and advanced visualization of tandem repeats
Al Raei, L. W.; Ghareghani, M.; Moeinzadeh, H.; Vingron, M.
Tandem repeats are genomic regions consisting of consecutively repeated units with variable copy numbers and possible mutations. They are used in DNA fingerprinting and have been implicated in complex traits and genetic disorders, including neurodegenerative and developmental diseases. The large and growing number of tandem repeat loci in the human genome calls for fast, scalable tools for accurate genotyping and visualization, because accurate characterization of these variants is essential for understanding their functional impact and their associations with phenotypes. We developed TandemTwister, a novel algorithm implemented in C++, as a highly scalable, parallelized tool for tandem repeat copy number genotyping. We also created an interactive visualization tool for quick manual inspection that displays exact motif occurrences, counts, and population information across haplotypes. TandemTwister demonstrates high accuracy and runtime efficiency for tandem repeat genotyping across all long-read sequencing technologies and assembled genomes. We evaluated TandemTwister on the Ashkenazim trio, using data from different sequencing technologies and a set of 1.2 million annotated tandem repeat regions. TandemTwister was the fastest and most accurate tandem repeat genotyping tool in our comparison with state-of-the-art tools. On PacBio HiFi data, for example, TandemTwister ran in 17 minutes on 32 CPU cores and achieved 99.4% recall, 98.0% Mendelian consistency, and 94% sequence accuracy. We also demonstrated successful super-population clustering and examined inheritance patterns of tandem repeats and haplotype blocks in three trio sets. Finally, TandemTwister was able to detect pathogenic repeat expansions: applied to a cohort of 31 individuals with neurodegenerative and developmental disorders, it successfully distinguished healthy from pathogenic copy numbers.
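To make the Mendelian consistency metric concrete, here is a minimal Python sketch of how such a rate could be computed from copy number genotypes in a trio. It is illustrative only and is not taken from TandemTwister: the function names, the representation of a genotype as a pair of integer copy numbers, and the exact-match criterion (no tolerance for small copy number differences) are all assumptions, and the abstract does not state how the reported 98.0% was defined.

```python
from typing import Iterable, Tuple

# Hypothetical representation: the copy numbers called on the two haplotypes.
Genotype = Tuple[int, int]


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Return True if one child allele matches a paternal allele and the
    other child allele matches a maternal allele (exact copy number match,
    which is an assumption of this sketch)."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mendelian_consistency(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of loci whose (child, father, mother) genotypes are consistent."""
    trios = list(trios)
    if not trios:
        raise ValueError("at least one locus is required")
    consistent = sum(is_mendelian_consistent(c, f, m) for c, f, m in trios)
    return consistent / len(trios)
```

For example, a child genotype of (10, 12) with a father of (10, 11) and a mother of (12, 15) is consistent under this definition, whereas a child genotype of (5, 9) with parents (5, 6) and (6, 8) is not, because neither parent carries an allele with 9 copies.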