Quartet-based Genome-scale Species Tree Inference using Multicopy Gene Family Trees
Rafi, A.; Rumi, A. M. S.; Hakim, S. A.; Bayzid, M. S.
Species tree estimation from multi-copy gene family trees, which include both paralogs and orthologs, is challenging because of gene tree discordance caused by biological processes such as incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Quartet-based species tree estimation methods, such as ASTRAL, Quartet Max-Cut (QMC), and Quartet Fiduccia-Mattheyses (QFM), have gained substantial popularity for their accuracy and statistical guarantees. However, most of these methods rely on single-copy gene trees and model only ILS, which limits their applicability to large genomic datasets. ASTRAL-Pro incorporates both orthology and paralogy for species tree inference under GDL by employing a refined quartet similarity measure based on the concept of species-driven quartets (SQs). In this study, we show that these SQ-based techniques can be effectively leveraged within the QFM framework. Doing so required substantial algorithmic re-engineering, including efficient techniques for computing the initial bipartition in QFM and novel combinatorial methods for computing refined quartet scores directly from gene family trees. We extensively evaluated our method, wQFM-GDL, on benchmark simulated and real biological datasets and compared it with ASTRAL-Pro3, SpeciesRax, and DupLoss-2. wQFM-GDL outperforms all other methods in 113 of the 124 model conditions considered in this study, and the performance differences become more pronounced as dataset size increases. In particular, on the larger datasets with 200 and 500 taxa, wQFM-GDL significantly outperforms all other methods in all 72 model conditions and achieves, on average, nearly a 25% reduction in reconstruction error compared with ASTRAL-Pro3. wQFM-GDL is freely available as open source at https://github.com/abdur-rafi/wQFM-GDL.
Matching journals
The top 3 journals account for 50% of the predicted probability mass.