
Disentangling the Impacts of Incomplete Lineage Sorting and Gene Tree Estimation Error on Species Tree Inference

Tahmid, N.; Rhythm, S. I.; Bayzid, M. S.

2026-02-21 bioinformatics
10.64898/2026.02.21.707162 bioRxiv

Accurate species tree inference from genome-scale data is complicated by gene tree discordance, which can arise both from biological processes such as incomplete lineage sorting (ILS) and from technical factors such as gene tree estimation error (GTEE). While both factors reduce the accuracy of summary methods that infer species trees from gene trees, their relative impact and characteristic patterns remain poorly understood. Here, we systematically disentangle the effects of ILS and GTEE by simulating gene tree datasets with comparable overall discordance levels, but with discordance arising exclusively from either ILS or GTEE. Using widely employed summary methods such as ASTRAL and wQFM, we show that GTEE typically has a stronger detrimental effect on species tree accuracy than ILS, even at matched discordance levels. We further characterize the structure of gene tree distributions under these two sources of discordance and show that ILS induces a structured, constrained skew in quartet distributions, whereas GTEE generates more uniform, high-entropy noise that does not diminish with additional genes. Our results provide an empirical framework for a nuanced understanding of how ILS and GTEE shape gene tree distributions and influence species tree inference, and highlight the importance of appropriately distinguishing biological and estimation-driven discordance when inferring species trees from limited or noisy datasets.
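The contrast between a structured skew and uniform, high-entropy noise in quartet distributions can be illustrated with a small sketch. This is not the authors' pipeline: it is a minimal illustration, assuming the standard multispecies coalescent result that a quartet with internal branch length t (in coalescent units) matches the species tree with probability 1 - (2/3)e^(-t), with each of the two alternative topologies at (1/3)e^(-t). The function names and the choice of Shannon entropy in bits are assumptions made here for illustration, and the uniform distribution stands in for an idealized limiting case of estimation noise rather than a model taken from the paper.

```python
import math

def msc_quartet_probs(t):
    """Quartet topology probabilities under the multispecies coalescent.

    t is the internal branch length in coalescent units. Returns
    [concordant, discordant_1, discordant_2]. Hypothetical helper name.
    """
    d = math.exp(-t) / 3.0  # probability of each discordant topology
    return [1.0 - 2.0 * d, d, d]

def shannon_entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# ILS-only case: skewed toward the concordant topology, with the two
# discordant topologies tied.
ils = msc_quartet_probs(0.5)

# Idealized noise case (assumption): all three topologies equally likely.
uniform_noise = [1.0 / 3.0] * 3

print("ILS quartet probabilities:", ils)
print("ILS entropy (bits):", shannon_entropy_bits(ils))
print("Uniform entropy (bits):", shannon_entropy_bits(uniform_noise))
```

For any t > 0 the ILS distribution has lower entropy than the uniform one, whose entropy is log2(3) bits, the maximum for three outcomes.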

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Molecular Biology and Evolution | 488 | Top 0.2% | 14.8% |
| 2 | Systematic Biology | 121 | Top 0.1% | 14.8% |
| 3 | PLOS Computational Biology | 1633 | Top 3% | 10.1% |
| 4 | Proceedings of the National Academy of Sciences | 2130 | Top 9% | 7.2% |
| 5 | Nature Communications | 4913 | Top 26% | 6.8% |
50% of probability mass above
| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 6 | Cell Systems | 167 | Top 2% | 4.9% |
| 7 | Genome Research | 409 | Top 0.8% | 4.0% |
| 8 | Genome Biology | 555 | Top 3% | 3.3% |
| 9 | Science | 429 | Top 10% | 3.1% |
| 10 | The American Journal of Human Genetics | 206 | Top 2% | 2.7% |
| 11 | eLife | 5422 | Top 34% | 2.4% |
| 12 | Scientific Reports | 3102 | Top 50% | 2.1% |
| 13 | Genetics | 225 | Top 2% | 2.1% |
| 14 | Nature Genetics | 240 | Top 4% | 1.7% |
| 15 | PLOS ONE | 4510 | Top 63% | 1.0% |
| 16 | PLOS Genetics | 756 | Top 12% | 1.0% |
| 17 | Nature Computational Science | 50 | Top 1% | 0.8% |
| 18 | PNAS Nexus | 147 | Top 2% | 0.7% |
| 19 | Journal of Computational Biology | 37 | Top 0.6% | 0.7% |
| 20 | Science Advances | 1098 | Top 31% | 0.7% |
| 21 | Bioinformatics | 1061 | Top 10% | 0.7% |
| 22 | Nucleic Acids Research | 1128 | Top 20% | 0.6% |
| 23 | Genome Biology and Evolution | 280 | Top 2% | 0.6% |
| 24 | Peer Community Journal | 254 | Top 4% | 0.6% |