Simple baselines rival protein language models in mutation-dense design tasks

Talpir, I.; Fleishman, S. J.

2026-05-06 bioinformatics
10.64898/2026.05.01.722313 bioRxiv
Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Although protein language models (pLMs) have been used in zero-shot and transfer-learning design studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute significantly to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | PLOS Computational Biology | 1633 | Top 2% | 14.0% |
| 2 | Journal of Chemical Information and Modeling | 207 | Top 0.5% | 12.0% |
| 3 | Cell Systems | 167 | Top 1% | 9.8% |
| 4 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.1% | 8.2% |
| 5 | Journal of Chemical Theory and Computation | 126 | Top 0.2% | 6.2% |
| | 50% of probability mass above | | | |
| 6 | Nature Communications | 4913 | Top 38% | 3.9% |
| 7 | The Journal of Physical Chemistry B | 158 | Top 0.6% | 3.5% |
| 8 | Proceedings of the National Academy of Sciences | 2130 | Top 22% | 3.5% |
| 9 | Protein Science | 221 | Top 0.4% | 3.5% |
| 10 | Bioinformatics | 1061 | Top 6% | 2.5% |
| 11 | Bioinformatics Advances | 184 | Top 2% | 2.5% |
| 12 | Computational and Structural Biotechnology Journal | 216 | Top 3% | 2.3% |
| 13 | Journal of Cheminformatics | 25 | Top 0.2% | 2.3% |
| 14 | Briefings in Bioinformatics | 326 | Top 3% | 1.8% |
| 15 | Chemical Science | 71 | Top 1% | 1.7% |
| 16 | Scientific Reports | 3102 | Top 60% | 1.7% |
| 17 | Biophysical Journal | 545 | Top 3% | 1.4% |
| 18 | eLife | 5422 | Top 46% | 1.4% |
| 19 | Structure | 175 | Top 3% | 0.9% |
| 20 | PLOS ONE | 4510 | Top 64% | 0.9% |
| 21 | Frontiers in Molecular Biosciences | 100 | Top 4% | 0.9% |
| 22 | Journal of Molecular Biology | 217 | Top 3% | 0.9% |
| 23 | BMC Bioinformatics | 383 | Top 7% | 0.8% |
| 24 | ACS Synthetic Biology | 256 | Top 3% | 0.8% |
| 25 | Nature Biotechnology | 147 | Top 7% | 0.8% |
| 26 | International Journal of Molecular Sciences | 453 | Top 14% | 0.8% |
| 27 | Nucleic Acids Research | 1128 | Top 18% | 0.7% |
| 28 | NAR Genomics and Bioinformatics | 214 | Top 4% | 0.7% |
| 29 | Nature Machine Intelligence | 61 | Top 4% | 0.7% |
| 30 | The American Journal of Human Genetics | 206 | Top 5% | 0.6% |
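
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by summing the listed percentages in rank order. The snippet below is a minimal sketch of that arithmetic. The helper name `journals_to_reach` is hypothetical and is not part of the page or of any library. The values are the five highest probabilities copied from the list above.

```python
# Predicted probabilities (%) of the five highest-ranked journals,
# copied from the ranking above.
probabilities = [14.0, 12.0, 9.8, 8.2, 6.2]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative) for the shortest prefix whose sum reaches threshold.

    Hypothetical helper for illustration; returns (None, total) if the
    threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


count, cumulative = journals_to_reach(probabilities, 50.0)
print(count, round(cumulative, 1))  # prints: 5 50.2
```

The first four entries sum to 44.0%, so the fifth journal is the one that carries the cumulative total past the 50% mark.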