Evolutionary-scale protein language models uncover beneficial variants in a Sorghum bicolor diversity panel

Johansen, N. H.; Sendowski, J. S.-O.; Nikolaidou, E.; Chatzivasileiou, S.; Wang, S.; Song, B.; Olson, A.; Bataillon, T.; Ramstein, G. P.

2026-04-13 genetics
10.64898/2026.04.10.717708 bioRxiv

Quantitative genetic approaches such as genome-wide association studies and genomic prediction are widely used to identify favourable genetic variation, but linkage disequilibrium limits their resolution. Comparative genomics approaches, especially protein language models (PLMs), have emerged as powerful alternatives that detect phylogenetic residue conservation (PRC) across evolutionary time scales. However, the extent to which these tools can guide the detection of impactful variants for field agronomic traits remains unclear. In this study, we used the pre-trained PLM ESM2 to predict PRC scores of nonsynonymous mutations segregating within a diverse sorghum panel of 387 accessions (SAP). We inferred the distribution of fitness effects (DFE) of the same set of nonsynonymous mutations from unfolded site frequency spectra to assess whether the DFE covaried with PRC scores. Furthermore, we estimated the load of putatively nonneutral mutations carried by each SAP accession and evaluated associations between this mutation load and phenotypic performance across multiple agronomic traits. Our results show that ESM2 can detect mutations associated with fitness-enhancing effects in SAP, as indicated by enrichment of positive selection signatures among variants with positive PRC scores. Significant associations between phenotypic performance and mutation load were also detected for several agronomic traits, indicating that PLMs can identify functionally important genetic variation. However, these signals were not consistent across all traits in the SAP population. Altogether, our findings suggest that large language models may support breeding efforts, as PLM predictions covaried with fitness effects and captured agronomic performance for some traits in plant populations.
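
The mutation-load analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the score threshold of -5.0, the 0/2 genotype coding, and the toy data are all assumptions made for illustration, and a plain Pearson correlation stands in for whatever association model the study actually used.

```python
from math import sqrt

def mutation_load(genotypes, scores, threshold):
    """Per accession, sum alternate-allele counts over variants whose
    score falls below `threshold` (hypothetical cutoff for 'putatively
    nonneutral'; the study's actual criterion is not given here)."""
    flagged = [j for j, s in enumerate(scores) if s < threshold]
    return [sum(row[j] for j in flagged) for row in genotypes]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy data (invented): 4 accessions x 4 variants, coded as
# alternate-allele counts (0 or 2 for inbred lines).
genotypes = [
    [0, 2, 0, 0],
    [2, 0, 2, 0],
    [0, 0, 2, 2],
    [2, 2, 2, 0],
]
scores = [-8.0, -1.0, -6.5, 0.7]   # invented per-variant scores
phenotype = [10.0, 6.0, 8.5, 5.0]  # invented trait values

load = mutation_load(genotypes, scores, threshold=-5.0)
r = pearson(load, phenotype)
print(load, round(r, 2))
```

In this toy example, accessions carrying more low-scoring alleles have lower trait values, so the correlation is negative. Real analyses would typically also account for population structure, which this sketch ignores.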

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Journal of Genetics and Genomics                 36                      Top 0.1%    8.2%
2     PLOS Genetics                                    756                     Top 2%      7.0%
3     Plant Communications                             35                      Top 0.1%    7.0%
4     New Phytologist                                  309                     Top 0.9%    6.7%
5     Horticulture Research                            43                      Top 0.3%    6.2%
6     The Plant Genome                                 53                      Top 0.1%    4.8%
7     Plant Biotechnology Journal                      56                      Top 0.3%    3.9%
8     Frontiers in Plant Science                       240                     Top 2%      3.5%
9     Proceedings of the National Academy of Sciences  2130                    Top 21%     3.5%
      (50% of probability mass above)
10    Journal of Experimental Botany                   195                     Top 1%      3.5%
11    Plant Physiology                                 217                     Top 1%      3.0%
12    Genomics, Proteomics & Bioinformatics            171                     Top 2%      3.0%
13    The Plant Journal                                197                     Top 2%      3.0%
14    Theoretical and Applied Genetics                 46                      Top 0.1%    2.0%
15    Frontiers in Genetics                            197                     Top 4%      1.9%
16    Scientific Reports                               3102                    Top 54%     1.9%
17    Communications Biology                           886                     Top 7%      1.7%
18    Nature Communications                            4913                    Top 52%     1.7%
19    eLife                                            5422                    Top 43%     1.7%
20    Genome Biology                                   555                     Top 5%      1.6%
21    PLOS ONE                                         4510                    Top 56%     1.6%
22    Genetics                                         225                     Top 3%      1.3%
23    Molecular Plant                                  36                      Top 1%      1.1%
24    Molecular Biology and Evolution                  488                     Top 3%      1.1%
25    GENETICS                                         189                     Top 1%      0.9%
26    G3 Genes|Genomes|Genetics                        351                     Top 2%      0.9%
27    Molecular Ecology                                304                     Top 4%      0.9%
28    in silico Plants                                 24                      Top 0.3%    0.9%
29    Plant Phenomics                                  17                      Top 0.3%    0.9%
30    PLOS Computational Biology                       1633                    Top 22%     0.9%
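
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first twelve entries; the function name `journals_to_reach` is hypothetical, and because the displayed percentages are rounded, the running total is only approximate.

```python
# Probabilities (in percent) of the twelve top-ranked journals, as listed.
probabilities = [8.2, 7.0, 7.0, 6.7, 6.2, 4.8, 3.9, 3.5, 3.5, 3.5, 3.0, 3.0]

def journals_to_reach(probs, target):
    """Return (rank, cumulative mass) at the first rank where the
    running total reaches `target`, or None if it is never reached."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return rank, total
    return None

rank, mass = journals_to_reach(probabilities, 50.0)
print(rank, round(mass, 1))  # prints: 9 50.8
```

The first eight journals sum to about 47.3%, so the ninth is the one that carries the total past the 50% mark.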