Using the DNA language model, GROVER, to parse effects of sequence, chromatin and regulatory features on genome stability

Joubert, P. M.; Sanabria, M.; Poetsch, A. R.

2026-04-04 genomics
10.1101/2025.07.23.666402 bioRxiv
Genome stability is shaped by DNA sequence and chromatin context, but their relative contributions to double-strand break (DSB) sensitivity remain unclear. We show that the DNA language model, GROVER, can infer DSB locations from sequence. DSB hotspots tend to contain GC-rich sequences that belong to promoters, genes and short interspersed nuclear elements (SINEs). Additionally, we identified several specific short sequences (tokens) that are associated with modulating DSB sensitivity. A second model that uses chromatin and genome regulatory features outperforms the sequence-only model, highlighting complementary and cell-type-specific information. Integrating sequence and genome biological features yields the best performance, demonstrating their synergy. Analyzing this integrated model revealed that, depending on the sample, genome stability information encoded in H3K36me3 and DNase-seq can be learned from the sequence, whereas the information encoded in H3K27ac or H3K9me3 cannot. Embedding chromatin data directly into the GROVER architecture enabled cell-type-specific modeling with performance matching the full chromatin feature model. Our results suggest that while chromatin and regulatory context provides important information, such as cell-type specificity, much of the information shaping DSB patterns is already encoded in the DNA sequence itself. Our integrative modeling approach not only reveals DSB patterns but also provides a generalizable strategy for tracing predictions in genomic data.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | PLOS Computational Biology | 1633 papers in training set | Top 0.6% | 22.7%
2 | Nucleic Acids Research | 1128 papers in training set | Top 1% | 12.6%
3 | Genome Biology | 555 papers in training set | Top 0.6% | 8.5%
4 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.5% | 6.4%
50% of probability mass above
5 | Frontiers in Genetics | 197 papers in training set | Top 2% | 3.6%
6 | Bioinformatics | 1061 papers in training set | Top 5% | 3.6%
7 | Cell Systems | 167 papers in training set | Top 4% | 3.6%
8 | Nature Communications | 4913 papers in training set | Top 43% | 2.8%
9 | Nature Genetics | 240 papers in training set | Top 3% | 2.4%
10 | Cell Genomics | 162 papers in training set | Top 2% | 2.4%
11 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.1%
12 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 29% | 1.9%
13 | Scientific Reports | 3102 papers in training set | Top 53% | 1.9%
14 | iScience | 1063 papers in training set | Top 13% | 1.8%
15 | Cell Reports | 1338 papers in training set | Top 24% | 1.7%
16 | eLife | 5422 papers in training set | Top 41% | 1.7%
17 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.7%
18 | Genome Research | 409 papers in training set | Top 3% | 1.3%
19 | PLOS Genetics | 756 papers in training set | Top 10% | 1.3%
20 | The American Journal of Human Genetics | 206 papers in training set | Top 3% | 1.0%
21 | BMC Biology | 248 papers in training set | Top 3% | 0.9%
22 | npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
23 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.8%
24 | Communications Biology | 886 papers in training set | Top 29% | 0.6%
25 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.6%
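
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running total over the listed percentages. The following is a minimal sketch, not the site's own method: the function name `journals_to_reach` is hypothetical, and the values are the first six percentages copied from the list.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for the six top-ranked journals,
# copied from the list above.
probs = [22.7, 12.6, 8.5, 6.4, 3.6, 3.6]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached.

    Hypothetical helper for illustration only.
    """
    for count, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(probs, 50.0))
```

The running totals are roughly 22.7, 35.3, 43.8 and 50.2, so the threshold is first crossed at the fourth journal, consistent with the statement above the list.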