
Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines

Vahedi Torghabeh, B.; Moslemi, C.; Dybdal Jensen, J.; Hentrup, S.; Li, T.; Yu, X.; Wang, H.; Asp, T.; Ramstein, G. P.

2026-03-02 · bioRxiv (bioinformatics)
DOI: 10.64898/2026.02.27.708524

Predicting gene expression from cis-regulatory DNA sequences in promoter and terminator regions is a central challenge in plant genomics, and a prerequisite for assessing the effects of regulatory mutations on gene expression. Here, we developed deep learning sequence-to-expression (S2E) models that leverage context-aware sequence embeddings from the PlantCaduceus genomic language model, instead of one-hot encoded sequences, to predict gene expression across 17 plant species. To further improve predictions, we integrated chromatin accessibility data as auxiliary regulatory features. First, we evaluated the models on unseen gene families via cross-validation, demonstrating that their prediction accuracy across all species outperforms PhytoExpr, the current state-of-the-art (SOTA) S2E model in plants (Pearson R = 0.82 vs. R = 0.74). We then validated variant effect predictions on an experimental dataset of 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed SOTA S2E models in predicting between-gene expression differences (regression coefficient β = 0.78 vs. β = 0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA S2E models showed only weak associations (β = 0.38 vs. β = 0.08). These results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge to be addressed in future S2E studies.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Genome Biology                                      555                     Top 0.1%    18.0%
2     Nature Communications                               4913                    Top 24%     8.1%
3     Bioinformatics                                      1061                    Top 4%      6.1%
4     NAR Genomics and Bioinformatics                     214                     Top 0.4%    4.7%
5     Molecular Plant                                     36                      Top 0.3%    4.2%
6     Plant Communications                                35                      Top 0.3%    4.2%
7     Horticulture Research                               43                      Top 0.5%    3.8%
8     Briefings in Bioinformatics                         326                     Top 2%      3.5%
----- 50% of probability mass above -----
9     PLOS Computational Biology                          1633                    Top 11%     3.0%
10    Nucleic Acids Research                              1128                    Top 7%      2.6%
11    Genome Medicine                                     154                     Top 3%      2.5%
12    Cell Systems                                        167                     Top 5%      2.5%
13    New Phytologist                                     309                     Top 2%      2.4%
14    Advanced Science                                    249                     Top 9%      2.0%
15    Nature Plants                                       84                      Top 0.9%    2.0%
16    Genome Research                                     409                     Top 2%      1.8%
17    Nature Genetics                                     240                     Top 5%      1.6%
18    Cell Genomics                                       162                     Top 4%      1.6%
19    Computational and Structural Biotechnology Journal  216                     Top 5%      1.6%
20    Journal of Genetics and Genomics                    36                      Top 1%      1.3%
21    Nature Machine Intelligence                         61                      Top 2%      1.3%
22    Plant Physiology                                    217                     Top 2%      1.3%
23    PLOS ONE                                            4510                    Top 61%     1.2%
24    Nature Methods                                      336                     Top 5%      1.2%
25    Bioinformatics Advances                             184                     Top 4%      1.2%
26    Genomics, Proteomics & Bioinformatics               171                     Top 5%      0.9%
27    Frontiers in Genetics                               197                     Top 8%      0.9%
28    Communications Biology                              886                     Top 20%     0.9%
29    Plant Biotechnology Journal                         56                      Top 1%      0.9%
30    BMC Bioinformatics                                  383                     Top 7%      0.8%
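The "top 8 journals account for 50% of the probability mass" claim can be checked with a quick cumulative-sum sketch. The probabilities below are taken from the table; the variable names are illustrative, not part of the prediction tool:

```python
# Predicted probabilities (%) for the top-ranked journals, from the table above.
probs = [18.0, 8.1, 6.1, 4.7, 4.2, 4.2, 3.8, 3.5, 3.0, 2.6]

# Walk down the ranking until the cumulative probability reaches 50%.
cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break

print(rank, round(cumulative, 1))  # prints: 8 52.6
```

The top 7 journals sum to 49.1%, just under the threshold, so the eighth entry (Briefings in Bioinformatics) is the one that pushes the cumulative mass past 50%.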