Disentangling the Impacts of Incomplete Lineage Sorting and Gene Tree Estimation Error on Species Tree Inference
Tahmid, N.; Rhythm, S. I.; Bayzid, M. S.
Accurate species tree inference from genome-scale data is complicated by gene tree discordance, which can arise both from biological processes such as incomplete lineage sorting (ILS) and from technical factors such as gene tree estimation error (GTEE). While both factors reduce the accuracy of summary methods that infer species trees from gene trees, their relative impact and characteristic patterns remain poorly understood. Here, we systematically disentangle the effects of ILS and GTEE by simulating gene tree datasets with comparable overall discordance levels, but with discordance arising exclusively from either ILS or GTEE. Using widely employed summary methods such as ASTRAL and wQFM, we show that GTEE typically has a stronger detrimental effect on species tree accuracy than ILS, even at matched discordance levels. We further characterize the structure of gene tree distributions under these two sources of discordance and show that ILS induces a structured, constrained skew in quartet distributions, whereas GTEE generates more uniform, high-entropy noise that does not diminish with additional genes. Our results provide an empirical framework for a nuanced understanding of how ILS and GTEE shape gene tree distributions and influence species tree inference, and highlight the importance of appropriately distinguishing biological and estimation-driven discordance when inferring species trees from limited or noisy datasets.
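The contrast between a "structured, constrained skew" and "uniform, high-entropy noise" can be made concrete with Shannon entropy over the three possible unrooted topologies of a single quartet. The sketch below is illustrative only and is not the authors' pipeline: the function name `quartet_entropy` and the example counts are assumptions, chosen to show how a skewed distribution (one dominant topology, two roughly equal minor ones) scores lower than a near-uniform one.

```python
import math


def quartet_entropy(counts):
    """Shannon entropy (in bits) of a quartet topology count vector.

    `counts` holds how many gene trees support each of the three
    unrooted topologies for one set of four taxa (hypothetical input).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    # Skip zero counts: the limit of p * log(p) as p -> 0 is 0.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical counts, not taken from the paper.
skewed = [60, 20, 20]        # one dominant topology, two balanced minor ones
near_uniform = [34, 33, 33]  # close to equal support for all three

print(f"skewed:       {quartet_entropy(skewed):.3f} bits")
print(f"near-uniform: {quartet_entropy(near_uniform):.3f} bits")
print(f"maximum:      {math.log2(3):.3f} bits")
```

The upper bound for three topologies is log2(3) bits, reached only when all three are equally supported; values near that bound indicate little usable signal for that quartet.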
Matching journals
The top 5 journals account for 50% of the predicted probability mass.