A deep-learning-based score to evaluate multiple sequence alignments

Serok, N.; Polonsky, K.; Ashkenazy, H.; Mayrose, I.; Thorne, J. L.; Pupko, T.

2026-02-05 bioinformatics
10.64898/2026.02.02.703429 bioRxiv

Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.
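To make the sum-of-pairs objective concrete, here is a minimal sketch of an SP score in Python. The abstract does not state the scoring scheme the authors use, so the `match`, `mismatch`, and `gap` values and the function name `sp_score` are illustrative assumptions. Alignment programs typically use a substitution matrix and affine gap penalties rather than this simple linear cost.

```python
from itertools import combinations


def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length aligned strings.

    The default costs are illustrative assumptions, not values from the paper.
    """
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    # Walk the alignment column by column and score every pair of rows.
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap aligned with a gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    # Two alternative alignments of the same three unaligned sequences.
    alignment_a = ["AC-T", "ACGT", "A-GT"]
    alignment_b = ["ACT-", "ACGT", "AGT-"]
    print(sp_score(alignment_a), sp_score(alignment_b))
```

Scoring several alternative alignments of the same sequences this way and comparing each score with that alignment's distance from a reference is the kind of comparison the abstract describes. The higher-scoring alignment under such a scheme is not guaranteed to be the one closest to the reference.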

Matching journals

The top 3 journals account for just over 50% of the predicted probability mass (26.0% + 18.7% + 8.5% = 53.2%).

1. Molecular Biology and Evolution: 488 papers in training set, Top 0.1%, 26.0%
2. Systematic Biology: 121 papers in training set, Top 0.1%, 18.7%
3. Bioinformatics: 1061 papers in training set, Top 3%, 8.5%
(50% of probability mass above)
4. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 11%, 6.3%
5. Nature Communications: 4913 papers in training set, Top 35%, 4.3%
6. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.0%
7. Science: 429 papers in training set, Top 11%, 2.6%
8. Methods in Ecology and Evolution: 160 papers in training set, Top 1%, 2.5%
9. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.9%
10. Cell Systems: 167 papers in training set, Top 6%, 1.9%
11. Nature Methods: 336 papers in training set, Top 5%, 1.5%
12. Scientific Reports: 3102 papers in training set, Top 64%, 1.3%
13. Nature Computational Science: 50 papers in training set, Top 1%, 1.0%
14. Bioinformatics Advances: 184 papers in training set, Top 4%, 0.9%
15. Virus Evolution: 140 papers in training set, Top 1%, 0.8%
16. Structure: 175 papers in training set, Top 3%, 0.8%
17. Nature Biotechnology: 147 papers in training set, Top 7%, 0.8%
18. Genome Research: 409 papers in training set, Top 4%, 0.7%
19. Genetics: 225 papers in training set, Top 4%, 0.7%
20. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.7%
21. The American Journal of Human Genetics: 206 papers in training set, Top 4%, 0.6%
22. eLife: 5422 papers in training set, Top 61%, 0.6%
23. Briefings in Bioinformatics: 326 papers in training set, Top 8%, 0.5%
24. Journal of Computational Biology: 37 papers in training set, Top 0.8%, 0.5%
25. Communications Biology: 886 papers in training set, Top 32%, 0.5%