
Long-read analysis of tetrameric microsatellites with vmwhere supports GGAA repeat length-dependent chromatin state association in Ewing sarcoma

Peterson, S. K.; Massie, A. M.; Rubinsteyn, A.; Wang, J. R.; Davis, I. J.

2026-04-10 cancer biology
10.64898/2026.04.08.717017 bioRxiv

Microsatellites are abundant genomic elements that contribute to genetic diversity and disease-associated regulatory variation. Although long-read sequencing enables accurate resolution of repetitive regions, computational methods for fully resolved microsatellite genotyping remain limited. Here, we introduce variant motif where (vmwhere), a computational framework for identifying, genotyping, decomposing, and visualizing complex tetrameric microsatellites from long-read sequencing data. Using simulated error-free reads, vmwhere accurately measures several genotyping metrics, including allele length, repeat length, maximum consecutive repeat length, and motif density. Applied to long-read whole-genome sequencing data, vmwhere identified sequence interruptions, motif-specific differences in repeat architecture, and ancestry-associated allele variation, including long repeat alleles that exceed short-read sequencing limitations. We applied vmwhere to GGAA microsatellites in Ewing sarcoma, an aggressive pediatric cancer driven by the EWS-FLI1 fusion oncoprotein, which binds to microsatellites and remodels chromatin. Genome-wide integration of long-read-defined microsatellite architecture with chromatin accessibility and EWS-FLI1 binding revealed that GGAA repeat structure was associated with chromatin state, with longer consecutive repeat microsatellites exhibiting increased EWS-FLI1 binding and chromatin accessibility. Cell line-specific expansions and contractions of GGAA microsatellite repeat length were associated with gains and losses of chromatin accessibility. Further, we identified haplotype-specific chromatin states, with preferential binding and accessibility at longer alleles. Together, these results establish vmwhere as a scalable framework for resolving population-level microsatellite variation and linking repeat architecture to chromatin state. Repeat structure and length characteristics provide insights into genotype-function relationships at microsatellite repeats in cancer.
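
To make the genotyping metrics named in the abstract concrete, the sketch below computes rough analogues of them for a single allele sequence. It is not vmwhere's implementation: the function name `microsatellite_metrics`, the exact-match scanning rule, and the metric definitions (motif copies counted as exact matches, density as the fraction of bases covered by motif copies) are all assumptions made for illustration.

```python
import re


def microsatellite_metrics(seq: str, motif: str = "GGAA") -> dict:
    """Hypothetical helper: summarize one allele under assumed metric definitions."""
    seq = seq.upper()
    motif = motif.upper()
    # Find maximal runs of back-to-back exact motif copies, left to right.
    runs = [m.group(0) for m in re.finditer(f"(?:{re.escape(motif)})+", seq)]
    copies_per_run = [len(run) // len(motif) for run in runs]
    total_copies = sum(copies_per_run)
    return {
        # Assumed: allele length is the full sequence length in base pairs.
        "allele_length_bp": len(seq),
        # Assumed: repeat length is the total number of exact motif copies.
        "motif_copies": total_copies,
        # Longest uninterrupted run of motif copies.
        "max_consecutive_copies": max(copies_per_run, default=0),
        # Assumed: density is the fraction of bases covered by motif copies.
        "motif_density": total_copies * len(motif) / len(seq) if seq else 0.0,
    }


# Five GGAA copies, one interrupting GGAT, then three more GGAA copies.
example = "GGAA" * 5 + "GGAT" + "GGAA" * 3
metrics = microsatellite_metrics(example)
print(metrics)
```

In this toy allele the single GGAT interruption leaves the total motif count at eight but caps the longest consecutive run at five, which is the kind of distinction between overall repeat length and consecutive repeat length that the abstract relates to chromatin state.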

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Nature Communications | 4913 papers in training set | Top 4% | 21.7%
2 | Genome Medicine | 154 papers in training set | Top 0.2% | 17.9%
3 | Nucleic Acids Research | 1128 papers in training set | Top 2% | 9.7%
4 | Genome Biology | 555 papers in training set | Top 1% | 6.1%
50% of probability mass above
5 | PLOS Computational Biology | 1633 papers in training set | Top 7% | 4.7%
6 | Cell Genomics | 162 papers in training set | Top 1% | 3.8%
7 | Cancer Research | 116 papers in training set | Top 1% | 3.1%
8 | Molecular Cancer | 14 papers in training set | Top 0.2% | 2.5%
9 | Cell Systems | 167 papers in training set | Top 6% | 1.8%
10 | Nature Genetics | 240 papers in training set | Top 5% | 1.4%
11 | The American Journal of Human Genetics | 206 papers in training set | Top 3% | 1.4%
12 | Cell Reports Methods | 141 papers in training set | Top 3% | 1.3%
13 | Nature Biotechnology | 147 papers in training set | Top 5% | 1.3%
14 | Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.2%
15 | Cancer Discovery | 61 papers in training set | Top 1% | 1.2%
16 | Genome Research | 409 papers in training set | Top 3% | 1.1%
17 | Cell Reports | 1338 papers in training set | Top 30% | 0.9%
18 | Communications Biology | 886 papers in training set | Top 18% | 0.9%
19 | Bioinformatics | 1061 papers in training set | Top 9% | 0.9%
20 | Bioinformatics Advances | 184 papers in training set | Top 4% | 0.9%
21 | Science Translational Medicine | 111 papers in training set | Top 6% | 0.8%
22 | Cell Reports Medicine | 140 papers in training set | Top 8% | 0.7%
23 | eLife | 5422 papers in training set | Top 59% | 0.7%
24 | Science | 429 papers in training set | Top 20% | 0.7%
25 | Scientific Reports | 3102 papers in training set | Top 76% | 0.7%
26 | PLOS ONE | 4510 papers in training set | Top 70% | 0.7%
27 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 48% | 0.6%