Multiple versus pairwise sequence alignments for protein phylogenetics using foundation models

Alibutud, R. F.; Kumar, S.

2026-05-29 bioinformatics
10.64898/2026.05.26.727927 bioRxiv

Phylogenetic inference is a common task in molecular and evolutionary biology and has conventionally required a multiple sequence alignment (MSA), a statistical model of amino acid substitutions, and an optimality principle. Recently, global models of amino acid substitutions have been inferred from millions of MSAs using transformer-based deep learning, resulting in protein foundation models (pFMs), also known as protein language models (PLMs). Training pFMs on MSAs hypothetically enables them to encode residue dependencies and the phylogenetic structure of the MSA collection. In contrast, pFMs trained on individual sequences lack access to such phylogenetic structure. Here, we assess the gains in phylogenetic inference offered by training pFMs on MSAs by comparing the accuracies of phylogenies inferred using two types of pFMs: one trained on a large collection of MSAs (msat-pFM, [1]) and the other trained on a collection of single sequences (esm-pFM, [2]). For the msat-pFM analysis, we inferred neighbor-joining trees using pairwise distances estimated directly from the sequence attention matrices. For the esm-pFM analysis, pairwise distances were obtained using the correlation of attentions of homologous residues, where pairwise sequence alignments (PSAs) were used to establish residue homologies. Surprisingly, MSA phylogenies inferred using the msat-pFM were less accurate than PSA phylogenies inferred using the esm-pFM. This pattern was seen across datasets spanning both small and large numbers of species and proteins. Also, PSA phylogenies obtained using residue attentions from early esm-pFM layers were much more accurate. These results suggest that the multiple sequence alignment step, which is obligatory to establish residue homologies across multiple sequences, may not add information when using evolutionary distances based on attentions in pFMs.
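
The esm-pFM distance described in the abstract (correlation of attentions at homologous residues, with homology taken from a pairwise alignment) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each sequence has already been reduced to one attention summary value per residue (a hypothetical preprocessing step), that gaps are written as "-", and that a correlation r is turned into a distance as 1 - r; the abstract does not state the exact transformation. The function names are illustrative only.

```python
from math import sqrt


def aligned_pairs(aln_a, aln_b, gap="-"):
    """Return (i, j) residue index pairs for alignment columns without gaps.

    aln_a and aln_b are the two gapped rows of one pairwise alignment.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = []
    i = j = 0  # ungapped residue positions in each sequence
    for char_a, char_b in zip(aln_a, aln_b):
        if char_a != gap and char_b != gap:
            pairs.append((i, j))
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def attention_distance(att_a, att_b, aln_a, aln_b):
    """Distance between two sequences from per-residue attention summaries.

    ASSUMPTION: distance = 1 - Pearson r over homologous (aligned) residues.
    att_a and att_b hold one value per ungapped residue.
    """
    pairs = aligned_pairs(aln_a, aln_b)
    x = [att_a[i] for i, _ in pairs]
    y = [att_b[j] for _, j in pairs]
    return 1.0 - pearson(x, y)
```

Computing this distance for every pair of sequences yields a distance matrix that could then be passed to any neighbor-joining implementation. Because each pair is aligned independently, no multiple sequence alignment is needed at any step.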

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.3%; 44.3%
2. PLOS Computational Biology; 1633 papers in training set; Top 5%; 6.8%
(50% of probability mass above)
3. BMC Bioinformatics; 383 papers in training set; Top 1%; 6.7%
4. Bioinformatics Advances; 184 papers in training set; Top 0.6%; 5.2%
5. Briefings in Bioinformatics; 326 papers in training set; Top 1%; 4.2%
6. Scientific Reports; 3102 papers in training set; Top 32%; 3.8%
7. Molecular Biology and Evolution; 488 papers in training set; Top 2%; 2.9%
8. NAR Genomics and Bioinformatics; 214 papers in training set; Top 1%; 2.2%
9. Journal of Computational Biology; 37 papers in training set; Top 0.1%; 1.8%
10. Nature Communications; 4913 papers in training set; Top 50%; 1.8%
11. Cell Systems; 167 papers in training set; Top 8%; 1.4%
12. Nature Methods; 336 papers in training set; Top 5%; 1.3%
13. Protein Science; 221 papers in training set; Top 1%; 1.0%
14. Genome Research; 409 papers in training set; Top 3%; 0.9%
15. PeerJ; 261 papers in training set; Top 13%; 0.8%
16. Journal of Molecular Biology; 217 papers in training set; Top 3%; 0.8%
17. Nature Computational Science; 50 papers in training set; Top 1%; 0.8%
18. Biomolecules; 95 papers in training set; Top 2%; 0.8%
19. Communications Biology; 886 papers in training set; Top 21%; 0.8%
20. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 44%; 0.8%
21. Journal of Proteome Research; 215 papers in training set; Top 2%; 0.7%
22. Nature Machine Intelligence; 61 papers in training set; Top 4%; 0.5%
23. Proteins: Structure, Function, and Bioinformatics; 82 papers in training set; Top 1%; 0.5%