Combining evolutionary and assay-labelled data for protein fitness prediction

Hsu, C.; Nisonoff, H.; Fannjiang, C.; Listgarten, J.

2021-03-29 synthetic biology
10.1101/2021.03.28.437402 bioRxiv

Predictive modelling of protein properties has become increasingly important to the field of machine-learning-guided protein engineering. In one of the two existing approaches, sequences evolutionarily related to a query protein drive the modelling process, without any property measurements from the laboratory. In the other, a set of protein variants of interest is assayed, and then a supervised regression model is estimated with the assay-labelled data. Although a handful of recent methods have shown promise in combining the evolutionary and supervised approaches, this hybrid problem has not been examined in depth, leaving it unclear how practitioners should proceed, and how method developers should build on existing work. Herein, we present a systematic assessment of methods for protein fitness prediction when evolutionary and assay-labelled data are available. We find that a simple baseline approach we introduce is competitive with and often outperforms more sophisticated methods. Moreover, our simple baseline is plug-and-play with a wide variety of established methods, and does not add any substantial computational burden. Our analysis highlights the importance of systematic evaluations and sufficient baselines.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Protein Engineering, Design and Selection | 14 papers in training set | Top 0.1% | 21.9%
2 | Bioinformatics | 1061 papers in training set | Top 3% | 8.9%
3 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 0.9% | 6.1%
4 | ACS Synthetic Biology | 256 papers in training set | Top 0.8% | 4.7%
5 | Journal of Molecular Biology | 217 papers in training set | Top 0.4% | 4.2%
6 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.9%
7 | Physical Biology | 43 papers in training set | Top 0.5% | 3.2%
50% of probability mass above
8 | Nature Communications | 4913 papers in training set | Top 43% | 3.0%
9 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 2.7%
10 | International Journal of Molecular Sciences | 453 papers in training set | Top 4% | 2.5%
11 | Protein Science | 221 papers in training set | Top 0.6% | 2.4%
12 | Briefings in Bioinformatics | 326 papers in training set | Top 3% | 2.0%
13 | Synthetic Biology | 21 papers in training set | Top 0.1% | 2.0%
14 | Journal of Cheminformatics | 25 papers in training set | Top 0.3% | 1.8%
15 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 1% | 1.8%
16 | Scientific Reports | 3102 papers in training set | Top 55% | 1.8%
17 | Biology Methods and Protocols | 53 papers in training set | Top 0.7% | 1.8%
18 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.4% | 1.8%
19 | Chemical Science | 71 papers in training set | Top 0.8% | 1.8%
20 | Journal of The Royal Society Interface | 189 papers in training set | Top 2% | 1.7%
21 | Cell Systems | 167 papers in training set | Top 7% | 1.6%
22 | PLOS ONE | 4510 papers in training set | Top 55% | 1.6%
23 | Communications Chemistry | 39 papers in training set | Top 0.4% | 1.3%
24 | eLife | 5422 papers in training set | Top 48% | 1.3%
25 | ACS Omega | 90 papers in training set | Top 4% | 0.8%
26 | Communications Biology | 886 papers in training set | Top 26% | 0.7%
27 | PeerJ | 261 papers in training set | Top 16% | 0.7%
28 | Nucleic Acids Research | 1128 papers in training set | Top 19% | 0.7%
29 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 48% | 0.6%
30 | BMC Bioinformatics | 383 papers in training set | Top 8% | 0.6%