
Transformer protein language models are unsupervised structure learners

Rao, R. M.; Meier, J.; Sercu, T.; Ovchinnikov, S.; Rives, A.

2020-12-15 · synthetic biology
bioRxiv · doi:10.1101/2020.12.15.422761

Abstract

Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
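
The idea in the abstract, reading protein contacts directly out of Transformer attention maps, is usually realized as a small post-processing step: symmetrize each head's attention map, apply the average product correction (APC) familiar from coevolution-based contact prediction, and feed the resulting per-pair features to a simple classifier. The following is a minimal sketch of that step, assuming attention maps of shape (layers, heads, L, L) for a single sequence with special tokens already removed; it is not the authors' released code.

```python
import torch

def apc(x: torch.Tensor) -> torch.Tensor:
    """Average product correction over the last two (L, L) dimensions.

    Removes background row/column signal, as in coevolution-based contact
    maps: out[i, j] = x[i, j] - x[i, :].sum() * x[:, j].sum() / x.sum().
    """
    row = x.sum(dim=-1, keepdim=True)            # (..., L, 1)
    col = x.sum(dim=-2, keepdim=True)            # (..., 1, L)
    total = x.sum(dim=(-1, -2), keepdim=True)    # (..., 1, 1)
    return x - row * col / total

def attention_to_contact_features(attentions: torch.Tensor) -> torch.Tensor:
    """Turn attention maps for one sequence into per-residue-pair features.

    attentions: (num_layers, num_heads, L, L) attention weights, special
    tokens already stripped. Returns (L, L, num_layers * num_heads): each
    map is symmetrized and APC-corrected, ready for a simple pairwise
    classifier to score contacts.
    """
    sym = attentions + attentions.transpose(-1, -2)   # symmetrize each head
    features = apc(sym)                               # remove background signal
    layers, heads, length, _ = features.shape
    return features.reshape(layers * heads, length, length).permute(1, 2, 0)
```

A simple logistic regression over the last feature dimension, fit on a small number of solved structures, can then turn these features into contact probabilities, so inference reduces to the single forward pass the abstract mentions.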

Matching journals

The top 7 journals account for 50% of the predicted probability mass (see the check after the table).

Rank | Journal | Papers in training set | Percentile | Probability mass
1  | Bioinformatics | 1061 | Top 1% | 18.6%
2  | Cell Systems | 167 | Top 1% | 9.1%
3  | Proceedings of the National Academy of Sciences | 2130 | Top 11% | 6.3%
4  | Nature Communications | 4913 | Top 29% | 6.3%
5  | Protein Engineering, Design and Selection | 14 | Top 0.1% | 4.3%
6  | Nature Computational Science | 50 | Top 0.1% | 4.3%
7  | PLOS Computational Biology | 1633 | Top 8% | 4.2%
(50% of probability mass above this point)
8  | Communications Biology | 886 | Top 2% | 3.7%
9  | Nature Methods | 336 | Top 3% | 3.6%
10 | Structure | 175 | Top 0.8% | 3.6%
11 | Journal of Molecular Biology | 217 | Top 0.8% | 2.9%
12 | eLife | 5422 | Top 32% | 2.6%
13 | Nucleic Acids Research | 1128 | Top 9% | 2.1%
14 | Protein Science | 221 | Top 0.9% | 1.7%
15 | Scientific Reports | 3102 | Top 58% | 1.7%
16 | Journal of Chemical Information and Modeling | 207 | Top 2% | 1.7%
17 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.5% | 1.7%
18 | Science | 429 | Top 15% | 1.5%
19 | Nature | 575 | Top 12% | 1.5%
20 | Nature Biotechnology | 147 | Top 5% | 1.5%
21 | iScience | 1063 | Top 23% | 1.1%
22 | Neuron | 282 | Top 8% | 0.9%
23 | Molecular Systems Biology | 142 | Top 1% | 0.9%
24 | PLOS ONE | 4510 | Top 68% | 0.7%
25 | Nature Machine Intelligence | 61 | Top 3% | 0.7%
26 | Frontiers in Molecular Biosciences | 100 | Top 5% | 0.7%
27 | Journal of Cheminformatics | 25 | Top 0.6% | 0.7%
28 | Synthetic Biology | 21 | Top 0.2% | 0.7%
29 | Journal of The Royal Society Interface | 189 | Top 5% | 0.6%
30 | ACS Synthetic Biology | 256 | Top 4% | 0.6%
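
The 50% cutoff noted above the table can be recomputed from the listed probabilities; a minimal sketch, using only values transcribed from the table:

```python
from itertools import accumulate

# Predicted probability mass for the top-ranked journals, as listed above.
top_predictions = [
    ("Bioinformatics", 0.186),
    ("Cell Systems", 0.091),
    ("Proceedings of the National Academy of Sciences", 0.063),
    ("Nature Communications", 0.063),
    ("Protein Engineering, Design and Selection", 0.043),
    ("Nature Computational Science", 0.043),
    ("PLOS Computational Biology", 0.042),
]

# Smallest number of top-ranked journals whose cumulative mass reaches one half.
cumulative = list(accumulate(p for _, p in top_predictions))
k = next(i + 1 for i, mass in enumerate(cumulative) if mass >= 0.5)
print(f"Top {k} journals cover {cumulative[k - 1]:.1%} of the probability mass")
# Top 7 journals cover 53.1% of the probability mass
```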