A deep-learning-based score to evaluate multiple sequence alignments
Serok, N.; Polonsky, K.; Ashkenazy, H.; Mayrose, I.; Thorne, J. L.; Pupko, T.
Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.
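To make the sum-of-pairs objective concrete, the following minimal sketch scores an alignment column by column over all pairs of sequences and then picks the highest-scoring alternative. The match, mismatch, and gap values, the linear gap penalty, and the function names `sp_score` and `pick_best_by_sp` are illustrative assumptions; the abstract does not specify the scoring scheme, and alignment programs typically use substitution matrices and affine gap penalties instead.

```python
from itertools import combinations


def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    Assumed scheme (not taken from the paper): fixed match/mismatch
    scores and a linear per-position gap penalty; gap-gap pairs score 0.
    """
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    for column in zip(*msa):                   # iterate over alignment columns
        for a, b in combinations(column, 2):   # every pair of rows in the column
            if a == "-" and b == "-":
                continue                       # gap aligned to gap: no contribution
            if a == "-" or b == "-":
                total += gap                   # residue aligned to a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def pick_best_by_sp(alternatives, **scoring):
    """Return the alternative MSA with the highest SP score."""
    return max(alternatives, key=lambda msa: sp_score(msa, **scoring))


# Two hypothetical alternative alignments of the same three toy sequences.
msa_a = ["AC-GT", "ACAGT", "AC-GA"]
msa_b = ["ACG-T", "ACAGT", "ACG-A"]
best = pick_best_by_sp([msa_a, msa_b])
```

Selecting by `sp_score` alone is the baseline the abstract questions: the alternative with the best SP score is not guaranteed to be the one closest to the reference alignment, which is what the learned scores are intended to improve on.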