TandemTwister: Scalable genotyping and advanced visualization of tandem repeats

Al Raei, L. W.; Ghareghani, M.; Moeinzadeh, H.; Vingron, M.

2026-01-31 genomics
10.64898/2026.01.28.702315 bioRxiv
Tandem repeats are genomic regions consisting of consecutively repeated units with variable copy numbers and possible mutations. They are used in DNA fingerprinting and have been implicated in complex traits and genetic disorders, including neurodegenerative and developmental diseases. The vast and expanding number of tandem repeat loci in the human genome underscores the need for fast, scalable tools for accurate genotyping and visualization. Accurate characterization of these variants is essential for understanding their functional impacts and associations with phenotypes. We developed TandemTwister, a novel algorithm implemented in C++, as a highly scalable and parallelized tool for tandem repeat copy number genotyping. Additionally, we created an interactive visualization tool to facilitate quick manual inspection, displaying exact motif occurrences, counts, and population information across haplotypes. TandemTwister demonstrates high accuracy and runtime efficiency for tandem repeat genotyping across all long-read sequencing technologies and assembled genomes. We evaluated TandemTwister on the Ashkenazim trio across different sequencing technologies, using a set of 1.2 million annotated tandem repeat regions. Compared with state-of-the-art tools, TandemTwister was the fastest and most accurate tandem repeat genotyping tool. On PacBio HiFi data, for example, TandemTwister ran in 17 minutes on 32 CPU cores and achieved 99.4% recall, 98.0% Mendelian consistency, and 94% sequence accuracy. We also demonstrated successful super-population clustering and examined inheritance patterns of tandem repeats and haplotype blocks in three trio sets. TandemTwister also detected pathogenic repeat expansions: applied to a cohort of 31 individuals with neurodegenerative and developmental disorders, it successfully distinguished healthy from pathogenic copy numbers.
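The abstract reports Mendelian consistency on a trio without defining it. The sketch below illustrates one common way such a metric can be computed from per-haplotype copy numbers; it is not taken from TandemTwister. The function names, the genotype representation as a pair of integers, and the `tolerance` parameter are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical representation: copy numbers on the two haplotypes of one locus.
Genotype = Tuple[int, int]


def is_mendelian_consistent(child: Genotype, mother: Genotype,
                            father: Genotype, tolerance: int = 0) -> bool:
    """Return True if one child allele can come from each parent.

    `tolerance` (an assumption, not from the source) allows a copy-number
    difference of up to that many units when matching alleles.
    """
    def matches(allele: int, parent: Genotype) -> bool:
        return any(abs(allele - p) <= tolerance for p in parent)

    a, b = child
    # Either assignment of the child's alleles to the parents is acceptable.
    return ((matches(a, mother) and matches(b, father))
            or (matches(a, father) and matches(b, mother)))


def mendelian_consistency(
        trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of loci whose (child, mother, father) genotypes are consistent."""
    trios = list(trios)
    if not trios:
        return 0.0
    consistent = sum(is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return consistent / len(trios)
```

For instance, a child genotype of (10, 12) is consistent with a mother (10, 11) and a father (12, 15), but not with a father (13, 15) unless a tolerance of one copy is allowed.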

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Genome Research: 409 papers in training set, Top 0.1%, 12.1%
2. Bioinformatics: 1061 papers in training set, Top 2%, 12.1%
3. Nucleic Acids Research: 1128 papers in training set, Top 2%, 9.9%
4. Genome Biology: 555 papers in training set, Top 0.4%, 9.9%
5. Nature Communications: 4913 papers in training set, Top 19%, 9.9%
50% of probability mass above
6. Genome Medicine: 154 papers in training set, Top 1%, 6.2%
7. Nature Methods: 336 papers in training set, Top 2%, 6.2%
8. Bioinformatics Advances: 184 papers in training set, Top 0.8%, 4.7%
9. Nature Biotechnology: 147 papers in training set, Top 2%, 3.8%
10. NAR Genomics and Bioinformatics: 214 papers in training set, Top 2%, 1.8%
11. Nature Computational Science: 50 papers in training set, Top 0.7%, 1.6%
12. Scientific Reports: 3102 papers in training set, Top 61%, 1.6%
13. GigaScience: 172 papers in training set, Top 2%, 1.5%
14. PLOS ONE: 4510 papers in training set, Top 57%, 1.5%
15. Cell Genomics: 162 papers in training set, Top 4%, 1.3%
16. BMC Bioinformatics: 383 papers in training set, Top 6%, 1.1%
17. Briefings in Bioinformatics: 326 papers in training set, Top 6%, 0.9%
18. Science: 429 papers in training set, Top 18%, 0.9%
19. The American Journal of Human Genetics: 206 papers in training set, Top 3%, 0.9%
20. Cell Reports Methods: 141 papers in training set, Top 5%, 0.8%
21. Communications Biology: 886 papers in training set, Top 22%, 0.8%
22. PLOS Genetics: 756 papers in training set, Top 14%, 0.8%
23. BMC Genomics: 328 papers in training set, Top 5%, 0.8%
24. Nature: 575 papers in training set, Top 16%, 0.7%
25. PLOS Computational Biology: 1633 papers in training set, Top 28%, 0.6%
26. Nature Machine Intelligence: 61 papers in training set, Top 4%, 0.6%
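
The "50% of probability mass" cutoff in the list can be checked with a short cumulative sum. The sketch below assumes the displayed (rounded) percentages are the predicted probabilities; the function name `journals_covering` is hypothetical and only the first seven values from the list are used.

```python
from typing import Optional, Sequence


def journals_covering(probabilities: Sequence[float],
                      mass: float = 50.0) -> Optional[int]:
    """Smallest number of top-ranked journals whose summed probability
    (in percent) reaches `mass`; None if the list never reaches it."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= mass:
            return count
    return None


# Predicted probabilities (percent) of the seven top-ranked journals above.
top_probabilities = [12.1, 12.1, 9.9, 9.9, 9.9, 6.2, 6.2]
cutoff = journals_covering(top_probabilities)
```

With these values the first four journals sum to about 44.0% and the first five to about 53.9%, so five is the smallest number of journals that reaches the 50% mark.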