Quartet-based Genome-scale Species Tree Inference using Multicopy Gene Family Trees

Rafi, A.; Rumi, A. M. S.; Hakim, S. A.; Bayzid, M. S.

2025-04-10 evolutionary biology
10.1101/2025.04.04.647228 bioRxiv

Species tree estimation from multi-copy gene family trees, which include both paralogs and orthologs, is challenging because of the gene tree discordance caused by biological processes such as incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Quartet-based species tree estimation methods, such as ASTRAL, Quartet Max-Cut (QMC), and Quartet Fiduccia-Mattheyses (QFM), have gained substantial popularity for their accuracy and statistical guarantees. However, most of these methods rely on single-copy gene trees and model only ILS, which limits their applicability to large genomic datasets. ASTRAL-Pro incorporates both orthology and paralogy for species tree inference under GDL by employing a refined quartet similarity measure based on the concept of species-driven quartets (SQs). In this study, we show that these SQ-based techniques can be effectively leveraged within the QFM framework. This required substantial algorithmic re-engineering, including the development of efficient techniques for computing the initial bipartition in QFM and novel combinatorial methods for computing refined quartet scores directly from gene family trees. We extensively evaluated our method, wQFM-GDL, on benchmark simulated and real biological datasets and compared it with ASTRAL-Pro3, SpeciesRax, and DupLoss-2. wQFM-GDL outperforms all other methods in 113 of the 124 model conditions considered in this study, with performance differences becoming more pronounced as dataset size increases. In particular, for larger datasets with 200 and 500 taxa, wQFM-GDL significantly outperforms all leading methods in all 72 model conditions and achieves, on average, nearly a 25% reduction in reconstruction error compared with ASTRAL-Pro3. wQFM-GDL is freely available in open-source form at https://github.com/abdur-rafi/wQFM-GDL.
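
For readers unfamiliar with quartet-based scoring, the following is a minimal sketch of the plain quartet similarity that single-copy methods build on. It is not the species-driven quartet (SQ) measure of ASTRAL-Pro or wQFM-GDL, and it assumes each taxon appears exactly once per tree, so it would not handle multi-copy gene family trees. The nested-tuple tree encoding and the function names `clades`, `quartet_topology`, and `shared_quartets` are illustrative assumptions, not part of any of the tools named above.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set of this subtree, leaf sets of every subtree below it)."""
    if isinstance(tree, str):  # a leaf is a taxon name
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    found = []
    for child in tree:  # an internal node is a tuple of children
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree, four):
    """Return the split the tree induces on four taxa, or None if unresolved.

    An edge separates two of the four taxa from the other two exactly when
    the clade below that edge contains exactly two of them.
    """
    four = frozenset(four)
    _, all_clades = clades(tree)
    for clade in all_clades:
        inside = clade & four
        if len(inside) == 2:
            return frozenset([inside, four - inside])
    return None


def shared_quartets(tree_a, tree_b, taxa):
    """Count four-taxon subsets on which both trees induce the same split."""
    count = 0
    for four in combinations(sorted(taxa), 4):
        top_a = quartet_topology(tree_a, four)
        if top_a is not None and top_a == quartet_topology(tree_b, four):
            count += 1
    return count


# Hypothetical single-copy trees on five taxa (assumed for illustration only).
gene_tree = ((("A", "B"), "C"), ("D", "E"))
species_tree = ((("A", "C"), "B"), ("D", "E"))
print(shared_quartets(gene_tree, species_tree, "ABCDE"))  # 3 of the 5 quartets agree
```

Quartet-based methods look for a species tree that maximizes a score of this kind summed over all input gene trees. The refined SQ-based score described in the abstract additionally accounts for duplication events, which this sketch does not attempt to model.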

Matching journals

The top 3 journals together account for more than 50% of the predicted probability mass (33.6% + 14.6% + 10.3% = 58.5%).

1. Genome Research: 409 papers in training set, Top 0.1%, 33.6%
2. Bioinformatics: 1061 papers in training set, Top 2%, 14.6%
3. Systematic Biology: 121 papers in training set, Top 0.1%, 10.3%
(50% of probability mass above)
4. Molecular Biology and Evolution: 488 papers in training set, Top 0.9%, 4.9%
5. Bioinformatics Advances: 184 papers in training set, Top 2%, 2.8%
6. PLOS Computational Biology: 1633 papers in training set, Top 12%, 2.7%
7. Nature Communications: 4913 papers in training set, Top 45%, 2.5%
8. Journal of Computational Biology: 37 papers in training set, Top 0.1%, 2.4%
9. Genome Biology: 555 papers in training set, Top 3%, 2.1%
10. Nature Computational Science: 50 papers in training set, Top 0.5%, 1.8%
11. iScience: 1063 papers in training set, Top 24%, 1.0%
12. The Plant Journal: 197 papers in training set, Top 3%, 0.9%
13. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 40%, 0.9%
14. Molecular Ecology Resources: 161 papers in training set, Top 1.0%, 0.8%
15. Nucleic Acids Research: 1128 papers in training set, Top 16%, 0.8%
16. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 5%, 0.8%
17. Genome Biology and Evolution: 280 papers in training set, Top 2%, 0.8%
18. PLOS Genetics: 756 papers in training set, Top 14%, 0.8%
19. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 0.8%
20. Briefings in Bioinformatics: 326 papers in training set, Top 6%, 0.8%
21. PLOS ONE: 4510 papers in training set, Top 67%, 0.8%
22. Cell Systems: 167 papers in training set, Top 12%, 0.7%
23. Peer Community Journal: 254 papers in training set, Top 4%, 0.7%
24. Nature Genetics: 240 papers in training set, Top 8%, 0.7%
25. eLife: 5422 papers in training set, Top 61%, 0.7%
26. Communications Biology: 886 papers in training set, Top 28%, 0.7%
27. Plant Communications: 35 papers in training set, Top 2%, 0.5%
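
The position of the "50% of probability mass" marker can be checked by accumulating the listed probabilities in rank order. The sketch below assumes the marker denotes the shortest ranked prefix whose cumulative probability reaches the threshold; the function name `smallest_prefix_reaching` is hypothetical, and only the four highest listed probabilities are used.

```python
def smallest_prefix_reaching(probabilities, threshold):
    """Return (rank, cumulative mass) for the first rank at which the running
    total of probabilities reaches the threshold, or None if it never does."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


# Predicted probabilities (in percent) of the four top-ranked journals above.
top_probabilities = [33.6, 14.6, 10.3, 4.9]
rank, mass = smallest_prefix_reaching(top_probabilities, 50.0)
print(rank, round(mass, 1))  # 3 58.5
```

The first two journals sum to 48.2%, just short of the threshold, so the third is needed and the cumulative mass at the marker is 58.5%.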