
Disentangling RNA evolution and thermodynamics in genomic language models

Xu, Y.; Pai, N.; Wayment-Steele, H. K.

2026-05-30 biophysics
10.64898/2026.05.28.728275 bioRxiv

Genomic language models (gLMs) trained only on large-scale nucleic acid sequence data seem to capture signals of RNA structure, yet the specifics of how remain unclear. Using the categorical Jacobian (CJ), a model-agnostic operation for querying pairwise dependencies, we systematically compared three flagship gLMs: RNA-FM, Evo 2, and gLM2. We found that CJ signals recover base pairs supported by evolutionary covariation analyses, consistent with findings in protein language models. Surprisingly, CJ also recovers base pairs lacking evolutionary support but predicted by biophysical nearest-neighbor models. Is it possible gLMs have "learned" RNA thermodynamics? We noticed that nearest-neighbor RNA folding models often predict reflected structures when given reversed sequences, consistent with these models' modular and grammar-like nature. We leveraged this observation to create a simple "mirror test" that we found gLMs routinely fail, indicating they have not learned generalizable biophysics-based rules for RNA structure. Nevertheless, their apparent thermodynamic signal potentially confounds interpreting gLM pairwise dependencies as evidence of evolutionary conservation. We therefore introduce a method using synthetic sequences as a control for detecting significant learned signal. Our results demonstrate that gLMs can mimic thermodynamics through learned sequence context rather than general physical principles, but solutions exist for disentangling patterns in language models.
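
The two operations named in the abstract can be illustrated with a minimal sketch. It assumes the CJ is built in the commonly described way: set each position to each nucleotide in turn and record the change in the model's output logits at every position. The Frobenius-norm reduction to an L x L map, all function names, and `toy_logits` (a hypothetical stand-in, not RNA-FM, Evo 2, or gLM2) are assumptions made for illustration and are not taken from the paper. `mirror_pairs` shows only the index arithmetic behind the "mirror test": if a structure is simply reflected when the sequence is reversed, a pair (i, j) becomes (L-1-j, L-1-i).

```python
import numpy as np

ALPHABET = "ACGU"  # assumed token order for this sketch

def one_hot(seq):
    """Encode an RNA string as an (L, 4) one-hot matrix."""
    idx = [ALPHABET.index(c) for c in seq]
    return np.eye(len(ALPHABET))[idx]

def categorical_jacobian(logits_fn, seq):
    """J[i, a, j, b]: change in logit b at position j when position i is set to token a."""
    x = one_hot(seq)
    L, A = x.shape
    base = logits_fn(x)
    J = np.zeros((L, A, L, A))
    for i in range(L):
        for a in range(A):
            mutated = x.copy()
            mutated[i] = 0.0
            mutated[i, a] = 1.0
            J[i, a] = logits_fn(mutated) - base
    return J

def coupling_map(J):
    """Reduce J to an L x L score: Frobenius norm over token axes, symmetrized, diagonal zeroed."""
    scores = np.sqrt((J ** 2).sum(axis=(1, 3)))
    scores = 0.5 * (scores + scores.T)
    np.fill_diagonal(scores, 0.0)
    return scores

def mirror_pairs(pairs, length):
    """Pairs expected for the reversed sequence if the structure is a pure reflection."""
    return sorted((length - 1 - j, length - 1 - i) for i, j in pairs)

def toy_logits(x):
    """Hypothetical stand-in model: logits at position j are the complement of position L-1-j."""
    complement = np.eye(4)[::-1]  # A<->U, C<->G in ACGU order
    return x[::-1] @ complement

seq = "GGACGUCC"
scores = coupling_map(categorical_jacobian(toy_logits, seq))
print(scores.argmax(axis=1))           # strongest partner of each position under the toy model
print(mirror_pairs([(0, 5), (1, 4)], 8))  # → [(2, 7), (3, 6)]
```

To use this with a real gLM, `logits_fn` would wrap a forward pass that returns per-position logits over the nucleotide tokens; a mirror test could then compare the top-scoring pairs for the reversed sequence against `mirror_pairs` of the pairs found for the original.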

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology: 1633 papers in training set, Top 0.2%, 32.7%
2. Cell Systems: 167 papers in training set, Top 0.7%, 14.2%
3. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 6%, 10.0%
50% of probability mass above
4. eLife: 5422 papers in training set, Top 14%, 6.3%
5. Nucleic Acids Research: 1128 papers in training set, Top 4%, 4.8%
6. Nature Communications: 4913 papers in training set, Top 43%, 3.0%
7. Bioinformatics Advances: 184 papers in training set, Top 2%, 1.9%
8. Biophysical Journal: 545 papers in training set, Top 3%, 1.5%
9. Nature Methods: 336 papers in training set, Top 5%, 1.2%
10. RNA: 169 papers in training set, Top 0.3%, 1.2%
11. Virus Evolution: 140 papers in training set, Top 1%, 1.1%
12. Molecular Biology and Evolution: 488 papers in training set, Top 4%, 0.9%
13. Structure: 175 papers in training set, Top 3%, 0.9%
14. Cell Reports: 1338 papers in training set, Top 32%, 0.8%
15. Genome Biology: 555 papers in training set, Top 7%, 0.8%
16. Neuron: 282 papers in training set, Top 8%, 0.8%
17. Nature: 575 papers in training set, Top 16%, 0.7%
18. Journal of Chemical Information and Modeling: 207 papers in training set, Top 3%, 0.7%
19. iScience: 1063 papers in training set, Top 32%, 0.7%
20. Scientific Reports: 3102 papers in training set, Top 75%, 0.7%
21. Molecular Systems Biology: 142 papers in training set, Top 2%, 0.7%
22. Journal of Chemical Theory and Computation: 126 papers in training set, Top 0.9%, 0.7%
23. Protein Science: 221 papers in training set, Top 2%, 0.7%
24. Science: 429 papers in training set, Top 22%, 0.6%
25. PLOS ONE: 4510 papers in training set, Top 71%, 0.6%
26. Proteins: Structure, Function, and Bioinformatics: 82 papers in training set, Top 1%, 0.6%