
Transformer protein language models are unsupervised structure learners

Rao, R. M.; Meier, J.; Sercu, T.; Ovchinnikov, S.; Rives, A.

2020-12-15 · synthetic biology
bioRxiv · doi:10.1101/2020.12.15.422761

Abstract

Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
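
The idea in the abstract, reading protein contacts directly out of Transformer attention maps, is usually realized as a small post-processing step: symmetrize each head's attention map, apply the average product correction (APC) familiar from coevolution-based contact prediction, and feed the resulting per-pair features to a simple classifier. The following is a minimal sketch of that step, assuming attention maps of shape (layers, heads, L, L) for a single sequence with special tokens already removed; it is not the authors' released code.

```python
import torch

def apc(x: torch.Tensor) -> torch.Tensor:
    """Average product correction over the last two (L, L) dimensions.

    Removes background row/column signal, as in coevolution-based contact
    maps: out[i, j] = x[i, j] - x[i, :].sum() * x[:, j].sum() / x.sum().
    """
    row = x.sum(dim=-1, keepdim=True)            # (..., L, 1)
    col = x.sum(dim=-2, keepdim=True)            # (..., 1, L)
    total = x.sum(dim=(-1, -2), keepdim=True)    # (..., 1, 1)
    return x - row * col / total

def attention_to_contact_features(attentions: torch.Tensor) -> torch.Tensor:
    """Turn attention maps for one sequence into per-residue-pair features.

    attentions: (num_layers, num_heads, L, L) attention weights, special
    tokens already stripped. Returns (L, L, num_layers * num_heads): each
    map is symmetrized and APC-corrected, ready for a simple pairwise
    classifier to score contacts.
    """
    sym = attentions + attentions.transpose(-1, -2)   # symmetrize each head
    features = apc(sym)                               # remove background signal
    layers, heads, length, _ = features.shape
    return features.reshape(layers * heads, length, length).permute(1, 2, 0)
```

A simple logistic regression over the last feature dimension, fit on a small number of solved structures, can then turn these features into contact probabilities, so inference reduces to the single forward pass the abstract mentions.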

Matching journals

The top 7 journals account for 50% of the predicted probability mass (see the check after the table).

Rank | Journal | Papers in training set | Percentile | Probability mass
1  | Bioinformatics | 1061 | Top 1% | 18.6%
2  | Cell Systems | 167 | Top 1% | 9.1%
3  | Proceedings of the National Academy of Sciences | 2130 | Top 11% | 6.3%
4  | Nature Communications | 4913 | Top 29% | 6.3%
5  | Protein Engineering, Design and Selection | 14 | Top 0.1% | 4.3%
6  | Nature Computational Science | 50 | Top 0.1% | 4.3%
7  | PLOS Computational Biology | 1633 | Top 8% | 4.2%
(50% of probability mass above this point)
8  | Communications Biology | 886 | Top 2% | 3.7%
9  | Nature Methods | 336 | Top 3% | 3.6%
10 | Structure | 175 | Top 0.8% | 3.6%
11 | Journal of Molecular Biology | 217 | Top 0.8% | 2.9%
12 | eLife | 5422 | Top 32% | 2.6%
13 | Nucleic Acids Research | 1128 | Top 9% | 2.1%
14 | Protein Science | 221 | Top 0.9% | 1.7%
15 | Scientific Reports | 3102 | Top 58% | 1.7%
16 | Journal of Chemical Information and Modeling | 207 | Top 2% | 1.7%
17 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.5% | 1.7%
18 | Science | 429 | Top 15% | 1.5%
19 | Nature | 575 | Top 12% | 1.5%
20 | Nature Biotechnology | 147 | Top 5% | 1.5%
21 | iScience | 1063 | Top 23% | 1.1%
22 | Neuron | 282 | Top 8% | 0.9%
23 | Molecular Systems Biology | 142 | Top 1% | 0.9%
24 | PLOS ONE | 4510 | Top 68% | 0.7%
25 | Nature Machine Intelligence | 61 | Top 3% | 0.7%
26 | Frontiers in Molecular Biosciences | 100 | Top 5% | 0.7%
27 | Journal of Cheminformatics | 25 | Top 0.6% | 0.7%
28 | Synthetic Biology | 21 | Top 0.2% | 0.7%
29 | Journal of The Royal Society Interface | 189 | Top 5% | 0.6%
30 | ACS Synthetic Biology | 256 | Top 4% | 0.6%
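
The 50% cutoff noted above the table can be recomputed from the listed probabilities; a minimal sketch, using only values transcribed from the table:

```python
from itertools import accumulate

# Predicted probability mass for the top-ranked journals, as listed above.
top_predictions = [
    ("Bioinformatics", 0.186),
    ("Cell Systems", 0.091),
    ("Proceedings of the National Academy of Sciences", 0.063),
    ("Nature Communications", 0.063),
    ("Protein Engineering, Design and Selection", 0.043),
    ("Nature Computational Science", 0.043),
    ("PLOS Computational Biology", 0.042),
]

# Smallest number of top-ranked journals whose cumulative mass reaches one half.
cumulative = list(accumulate(p for _, p in top_predictions))
k = next(i + 1 for i, mass in enumerate(cumulative) if mass >= 0.5)
print(f"Top {k} journals cover {cumulative[k - 1]:.1%} of the probability mass")
# Top 7 journals cover 53.1% of the probability mass
```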