
Adding 3Di characters to amino acid datasets can improve resolution, but the effect is weaker in shorter and alpha-helical proteins

Fullmer, M. S.; Puente-Lelievre, C.; Matzke, N. J.

2025-08-11 evolutionary biology
10.1101/2025.06.30.662300 bioRxiv

The recent introduction of Foldseek's 3Di character alphabet to encode 3D protein structure has opened up new possibilities for structural phylogenetics. These characters, like protein structure, are more conserved than amino acids, raising the possibility of better resolution of very deep branches on the tree of life. As 3Di characters have a 20-letter alphabet, they are readily treatable with off-the-shelf algorithms for model-based phylogenetic inference and related methods such as bootstrapping. However, it remains to be seen if 3Di phylogenies are broadly more resolved than sequence-based phylogenies. We present data using samples from nine protein superfamilies showing that 3Di combines with sequence to produce better resolved phylogenies than either sequence or 3Di alone. We also show that information-theoretic measures, applied to superfamily alignments, significantly correlate with resolution in phylogenies derived from these alignments. Further, we identify the proportion of alpha helices in proteins as a major driver in reducing the information carried by 3Di character alignments, explaining the relatively poor performance of 3Di characters on superfamilies with highly conserved structure but high alpha-helical content. Our results provide encouragement for the further use of 3Di to address challenging questions in deep history, but also sound a note of caution about which proteins it is most suitable for.

SIGNIFICANCE: 3Di characters have been suggested as a method to generate well-resolved deep phylogenies. Our results show that 3Di characters combined with sequences can improve resolution in the deepest nodes of protein superfamily trees. However, our results also show that 3Di characters may not be suitable for all protein types.
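As a rough illustration of what an information-theoretic measure on an alignment can look like, the sketch below computes per-column Shannon entropy and its mean over a toy alignment. The abstract does not say which measures the authors used, so the choice of Shannon entropy, the function names `column_entropy` and `mean_alignment_entropy`, and the toy sequences are all assumptions made for illustration only.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-."):
    """Shannon entropy (in bits) of one alignment column, ignoring gap characters."""
    symbols = [c for c in column if c not in gap_chars]
    if not symbols:
        return 0.0  # an all-gap column carries no information here
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_alignment_entropy(sequences):
    """Mean per-column entropy of an alignment given as equal-length strings."""
    if not sequences:
        raise ValueError("alignment is empty")
    length = len(sequences[0])
    if length == 0:
        raise ValueError("alignment has no columns")
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) transposes rows (sequences) into columns (sites)
    entropies = [column_entropy(col) for col in zip(*sequences)]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Hypothetical toy alignment; the letters are placeholders, not real data.
    toy_alignment = ["ACDA", "ACEA", "AC-C", "ADFC"]
    print(mean_alignment_entropy(toy_alignment))
```

The same functions apply unchanged to amino acid or 3Di alignments, since both use 20-letter alphabets; comparing the two values for one superfamily would be one simple way to contrast how variable each character type is across sites.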

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Molecular Biology and Evolution | 488 | Top 0.1% | 21.8% |
| 2 | PLOS Computational Biology | 1633 | Top 3% | 12.1% |
| 3 | Journal of Molecular Evolution | 21 | Top 0.1% | 8.2% |
| 4 | Bioinformatics | 1061 | Top 4% | 7.0% |
| 5 | Proceedings of the National Academy of Sciences | 2130 | Top 18% | 3.8% |
| | 50% of probability mass above | | | |
| 6 | BMC Ecology and Evolution | 49 | Top 0.4% | 3.8% |
| 7 | Scientific Reports | 3102 | Top 38% | 3.6% |
| 8 | Genome Biology and Evolution | 280 | Top 0.5% | 3.5% |
| 9 | PeerJ | 261 | Top 3% | 3.5% |
| 10 | Protein Science | 221 | Top 0.5% | 3.0% |
| 11 | Molecular Phylogenetics and Evolution | 61 | Top 0.1% | 3.0% |
| 12 | Systematic Biology | 121 | Top 0.2% | 2.5% |
| 13 | BMC Genomics | 328 | Top 3% | 1.6% |
| 14 | F1000Research | 79 | Top 2% | 1.6% |
| 15 | BMC Bioinformatics | 383 | Top 5% | 1.6% |
| 16 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.6% |
| 17 | Journal of Theoretical Biology | 144 | Top 1% | 1.3% |
| 18 | PLOS ONE | 4510 | Top 61% | 1.2% |
| 19 | Journal of Computational Biology | 37 | Top 0.6% | 0.8% |
| 20 | Methods in Ecology and Evolution | 160 | Top 2% | 0.7% |
| 21 | eLife | 5422 | Top 60% | 0.7% |
| 22 | Frontiers in Ecology and Evolution | 60 | Top 4% | 0.6% |
| 23 | Communications Biology | 886 | Top 30% | 0.6% |