Large protein databases reveal structural complementarity and functional locality

Szczerbiak, P.; Szydlowski, L. M.; Wydmanski, W.; Renfrew, P. D.; Koehler Leman, J.; Kosciolek, T.

2024-10-16 bioinformatics
10.1101/2024.08.14.607935 bioRxiv

Recent breakthroughs in protein structure prediction have led to an unprecedented surge in high-quality 3D models, highlighting the need for efficient computational solutions to manage and analyze this wealth of structural data. In our work, we comprehensively examine the structural clusters obtained from the AlphaFold Protein Structure Database (AFDB), a high-quality subset of ESMAtlas, and the Microbiome Immunity Project (MIP). We create a single cohesive low-dimensional representation of the resulting protein space. Our results show that, while each database occupies distinct regions within the protein structure space, they collectively exhibit significant overlap in their functional profiles. High-level biological functions tend to cluster in particular regions, revealing a shared functional landscape despite the diverse sources of data. By creating a single, cohesive low-dimensional representation of protein structure space integrating data from diverse sources, localizing functional annotations within this space, and providing an open-access web-server for exploration, this work offers insights for future research concerning protein sequence-structure-function relationships, enabling various biological questions to be asked about taxonomic assignments, environmental factors, or functional specificity. This approach is generalizable to other or future datasets, enabling further discovery beyond findings presented here.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology: 1633 papers in training set, Top 2%, 12.4%
2. Bioinformatics: 1061 papers in training set, Top 2%, 12.4%
3. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.2%, 9.1%
4. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.2%, 6.3%
5. Journal of Chemical Information and Modeling: 207 papers in training set, Top 0.9%, 6.3%
6. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 4.8%

50% of probability mass above

7. Bioinformatics Advances: 184 papers in training set, Top 0.9%, 4.3%
8. GigaScience: 172 papers in training set, Top 0.4%, 3.9%
9. PLOS ONE: 4510 papers in training set, Top 44%, 2.7%
10. Scientific Reports: 3102 papers in training set, Top 44%, 2.7%
11. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 2.4%
12. BMC Bioinformatics: 383 papers in training set, Top 4%, 2.1%
13. Protein Science: 221 papers in training set, Top 0.7%, 1.9%
14. PeerJ: 261 papers in training set, Top 6%, 1.9%
15. Nucleic Acids Research: 1128 papers in training set, Top 11%, 1.7%
16. Journal of Structural Biology: 58 papers in training set, Top 0.7%, 1.7%
17. Journal of Molecular Biology: 217 papers in training set, Top 2%, 1.5%
18. Frontiers in Immunology: 586 papers in training set, Top 6%, 1.2%
19. Frontiers in Genetics: 197 papers in training set, Top 7%, 1.1%
20. Scientific Data: 174 papers in training set, Top 2%, 1.1%
21. Journal of Proteome Research: 215 papers in training set, Top 2%, 0.9%
22. Communications Biology: 886 papers in training set, Top 19%, 0.9%
23. Database: 51 papers in training set, Top 0.7%, 0.9%
24. International Journal of Molecular Sciences: 453 papers in training set, Top 14%, 0.8%
25. Computers in Biology and Medicine: 120 papers in training set, Top 5%, 0.7%
26. Frontiers in Molecular Biosciences: 100 papers in training set, Top 5%, 0.7%
27. Cell Systems: 167 papers in training set, Top 12%, 0.7%
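
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running total over the listed percentages. The sketch below is only an illustration: the page does not say how the cutoff is computed, so it assumes the cutoff is the smallest rank at which the cumulative probability reaches the threshold. The function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the eight highest-ranked journals,
# copied from the list above in rank order.
probs = [12.4, 12.4, 9.1, 6.3, 6.3, 4.8, 4.3, 3.9]


def journals_to_reach(probabilities, threshold):
    """Return the smallest rank whose cumulative probability reaches threshold.

    Assumption: the cutoff is defined by a simple running total.
    Returns None if the threshold is never reached.
    """
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank
    return None


cutoff = journals_to_reach(probs, 50.0)
print(cutoff)
```

Under that assumption the running total first passes 50% at rank 6, which agrees with the divider shown in the list.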