
Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Unsal, S.; Atas, H.; Albayrak, M.; Turhan, K.; Acar, A. C.; Dogan, T.

2020-10-28 bioinformatics
10.1101/2020.10.28.359828 bioRxiv

Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on: (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques to protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.

Matching journals

The top 12 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 1%, 6.9%
2. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.5%, 6.5%
3. Bioinformatics: 1061 papers in training set, Top 4%, 4.9%
4. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 4.9%
5. Briefings in Bioinformatics: 326 papers in training set, Top 1%, 4.2%
6. Computational Biology and Chemistry: 23 papers in training set, Top 0.1%, 4.0%
7. PLOS ONE: 4510 papers in training set, Top 35%, 4.0%
8. Scientific Reports: 3102 papers in training set, Top 33%, 3.7%
9. Computers in Biology and Medicine: 120 papers in training set, Top 0.8%, 3.7%
10. Database: 51 papers in training set, Top 0.2%, 2.6%
11. PLOS Computational Biology: 1633 papers in training set, Top 12%, 2.5%
12. NAR Genomics and Bioinformatics: 214 papers in training set, Top 1%, 2.5%

50% of probability mass above

13. Journal of Proteome Research: 215 papers in training set, Top 1.0%, 2.4%
14. Journal of Chemical Information and Modeling: 207 papers in training set, Top 2%, 2.1%
15. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 1%, 2.1%
16. Biology Methods and Protocols: 53 papers in training set, Top 0.6%, 2.1%
17. BioData Mining: 15 papers in training set, Top 0.2%, 1.9%
18. PeerJ: 261 papers in training set, Top 5%, 1.9%
19. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.2%, 1.8%
20. Frontiers in Genetics: 197 papers in training set, Top 4%, 1.8%
21. GigaScience: 172 papers in training set, Top 1%, 1.7%
22. Molecules: 37 papers in training set, Top 1%, 1.4%
23. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.4%
24. International Journal of Molecular Sciences: 453 papers in training set, Top 10%, 1.2%
25. PROTEOMICS: 35 papers in training set, Top 0.5%, 1.2%
26. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.4%, 1.2%
27. ACS Omega: 90 papers in training set, Top 3%, 1.0%
28. Frontiers in Molecular Biosciences: 100 papers in training set, Top 4%, 0.9%
29. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 5%, 0.8%
30. Protein Science: 221 papers in training set, Top 2%, 0.8%