
Genomic Foundation Models Reveal Chromatin-Domain-Scale Transposable Element Impacts on Rice Genome Architecture

Fan, J.; Zhao, H.; Lv, Q.; Wang, X.; Man, R.; Xie, N.; Zhao, Z.

2026-05-13 plant biology
10.64898/2026.05.11.724192 bioRxiv

Alignment-based detection of transposable element (TE) insertion polymorphisms suffers from reference bias and multi-mapping errors in repetitive genomic regions, creating a fundamental validation bottleneck for population-scale structural variant catalogs. Here, we demonstrate that the OneGenome-Rice (OGR) genomic foundation model (GFM), a 1.25-billion-parameter Mixtral architecture trained on 422 rice genomes without TE annotations, provides an entirely orthogonal, alignment-free approach that resolves TE-mediated structural divergence at chromatin-domain resolution. At the CTB4a cold-tolerance locus on chromosome 4, OGR embeddings revealed that the aus subpopulation (NONA_BOKRA) carries 2.2-fold higher structural divergence from indica than japonica, consistent with its 728 subpopulation-exclusive cold-protective TE insertions. Sliding-window analysis across 4.4 megabases identified a 25.6-fold divergence enhancement at TE clusters relative to the conserved CTB4a gene body. Critically, the minimal effective resolution was established at approximately 20 kilobases, corresponding to the median size of topologically associating domains (TADs) in the rice genome, while individual TE sites at 500 base pairs were undetectable (P = 0.94). Non-neural baselines confirmed that the signal derives from learned representations of genomic context rather than from simple nucleotide statistics. These findings establish GFMs as orthogonal validation tools for population-scale TE genotyping and provide computational evidence that TE functional effects are organized at the chromatin-domain level, with direct implications for prioritizing functional TE variants in crop breeding.
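The sliding-window comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the use of cosine distance, the assumption that the two sequences are coordinate-matched, and the `toy_embed` stand-in (a plain dinucleotide count vector in place of OGR embeddings) are all hypothetical choices made so the example runs without the model. In practice, `embed` would be replaced by a call that returns a model embedding for each window.

```python
import math
from itertools import product

# All 16 dinucleotides in a fixed order, used by the toy embedding below.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]


def toy_embed(seq):
    """Hypothetical stand-in for a model embedding: dinucleotide counts.

    A real analysis would call the foundation model here instead.
    """
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    return [counts[k] for k in KMERS]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot compare a zero-norm embedding")
    return 1.0 - dot / (norm_u * norm_v)


def sliding_window_divergence(seq_a, seq_b, window, step, embed=toy_embed):
    """Embed matching windows of two sequences and return (start, distance) pairs.

    Assumes seq_a and seq_b share coordinates, which is a simplification.
    """
    n = min(len(seq_a), len(seq_b))
    profile = []
    for start in range(0, n - window + 1, step):
        emb_a = embed(seq_a[start:start + window])
        emb_b = embed(seq_b[start:start + window])
        profile.append((start, cosine_distance(emb_a, emb_b)))
    return profile


# Toy usage: the second window differs completely, the first is identical.
a = "ACGT" * 10
b = "ACGT" * 5 + "A" * 20
for start, dist in sliding_window_divergence(a, b, window=20, step=20):
    print(start, round(dist, 3))
```

The window size is the parameter of interest here: the abstract reports that the signal was detectable at roughly 20 kilobases but not at 500 base pairs, so a real analysis would sweep `window` across that range.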

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 18%, 10.2%
2. Cell Systems: 167 papers in training set, Top 1%, 9.8%
3. Science: 429 papers in training set, Top 3%, 9.8%
4. Nature: 575 papers in training set, Top 4%, 8.0%
5. Nature Plants: 84 papers in training set, Top 0.2%, 7.0%
6. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 13%, 6.1%

50% of probability mass above

7. Genome Biology: 555 papers in training set, Top 2%, 4.7%
8. Nature Genetics: 240 papers in training set, Top 2%, 4.2%
9. Cell: 370 papers in training set, Top 5%, 3.9%
10. Science Advances: 1098 papers in training set, Top 7%, 3.5%
11. Advanced Science: 249 papers in training set, Top 7%, 3.0%
12. Developmental Cell: 168 papers in training set, Top 7%, 2.4%
13. Molecular Cell: 308 papers in training set, Top 6%, 2.3%
14. Nature Biotechnology: 147 papers in training set, Top 4%, 1.8%
15. eLife: 5422 papers in training set, Top 44%, 1.6%
16. New Phytologist: 309 papers in training set, Top 3%, 1.6%
17. Cell Reports: 1338 papers in training set, Top 26%, 1.4%
18. Nucleic Acids Research: 1128 papers in training set, Top 13%, 1.3%
19. Nature Ecology & Evolution: 113 papers in training set, Top 3%, 1.2%
20. The Plant Cell: 141 papers in training set, Top 2%, 1.2%
21. Plant Physiology: 217 papers in training set, Top 2%, 1.2%
22. Molecular Plant: 36 papers in training set, Top 1%, 0.9%
23. Cell Genomics: 162 papers in training set, Top 6%, 0.8%
24. Plant Biotechnology Journal: 56 papers in training set, Top 1%, 0.7%
25. Plant Communications: 35 papers in training set, Top 1%, 0.7%
26. Nature Cell Biology: 99 papers in training set, Top 5%, 0.7%
27. Nature Methods: 336 papers in training set, Top 6%, 0.7%
28. The Plant Journal: 197 papers in training set, Top 3%, 0.7%
29. Communications Biology: 886 papers in training set, Top 27%, 0.7%
30. Development: 440 papers in training set, Top 4%, 0.7%
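
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. This is a minimal sketch: the function name `journals_to_reach` is hypothetical, and the input is simply the listed percentages in rank order.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative mass) once the running total reaches threshold.

    `probabilities` is expected in rank order (highest first), in percent.
    Returns None if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Percentages for ranks 1 to 8, copied from the listing above.
top_probs = [10.2, 9.8, 9.8, 8.0, 7.0, 6.1, 4.7, 4.2]
count, mass = journals_to_reach(top_probs, 50.0)
print(count, round(mass, 1))
```

The first five entries sum to 44.8%, so the sixth journal is the one that carries the running total past 50%.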