
k-Nearest Common Leaves algorithm for phylogenetic tree completion

Koshkarov, A.; Tahiri, N.

2026-04-04 evolutionary biology
10.64898/2026.04.02.716144 bioRxiv

Phylogenetic trees represent the evolutionary histories of taxa and support tasks such as clustering and Tree of Life reconstruction. Many established comparison methods, including the Robinson-Foulds (RF) distance, assume identical taxon sets. A methodological gap remains for trees with distinct but overlapping taxa. Existing approaches either prune non-common leaves, which can discard information, or complete both trees such that they share the same taxa. Completion is more comprehensive, but current methods typically ignore branch lengths, which are essential for identifying evolutionary patterns. This paper introduces k-Nearest Common Leaves (k-NCL), an algorithm for completing rooted phylogenetic trees defined on different but overlapping taxa. The method uses branch lengths and topological characteristics and does not rely on a specific distance measure. The k-NCL algorithm is designed to preserve evolutionary relationships in the trees under comparison. The running time is O(n^2), where n is the size of the union of the two leaf sets. Additional properties include preservation of original distances and topology, symmetry, and uniqueness of the completion. Implemented in Python, k-NCL is evaluated on biological datasets of amphibians, birds, mammals, and sharks. Experimental results show that RF combined with k-NCL improves phylogenetic tree clustering performance compared to the RF(+) tree completion approach.

Availability and implementation: An open-source implementation of k-NCL in Python and the datasets used in this study are available at https://github.com/tahiri-lab/KNCL.
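
To make the gap described in the abstract concrete, the following minimal sketch shows the pruning baseline, not k-NCL itself: two rooted trees with overlapping taxa are restricted to their common leaves before a cluster-based rooted RF distance is computed. The nested-tuple tree encoding, the function names, and the choice of a cluster-based rooted RF variant are assumptions made for illustration; branch lengths are ignored here, which is exactly the limitation the abstract points out.

```python
from typing import FrozenSet, Optional, Set, Tuple, Union

# Hypothetical encoding: a tree is a leaf name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names under a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Leaf sets of internal nodes, excluding the root and single leaves."""
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree, is_root: bool) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c, False) for c in node))
        if not is_root:
            found.add(below)
        return below

    visit(tree, True)
    return found


def prune(tree: Tree, keep: FrozenSet[str]) -> Optional[Tree]:
    """Restrict a tree to the leaves in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [p for p in (prune(c, keep) for c in tree) if p is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def rooted_rf(t1: Tree, t2: Tree) -> int:
    """Size of the symmetric difference of the two cluster sets."""
    if leaves(t1) != leaves(t2):
        raise ValueError("RF distance requires identical taxon sets")
    return len(clusters(t1) ^ clusters(t2))


t1: Tree = ((("A", "B"), "C"), ("D", "E"))  # taxa A-E
t2: Tree = ((("A", "C"), "B"), ("D", "F"))  # taxa A-D and F

common = leaves(t1) & leaves(t2)            # {A, B, C, D}
p1, p2 = prune(t1, common), prune(t2, common)
print(rooted_rf(p1, p2))                    # distance on the pruned trees
```

Pruning removes E and F entirely, so whatever those taxa say about the two histories never enters the comparison. A completion method would instead add F to the first tree and E to the second so that both are defined on the union of the leaf sets.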

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Methods in Ecology and Evolution (160 papers in training set): Top 0.1%, 26.1%
2. Bioinformatics (1061 papers in training set): Top 1%, 18.9%
3. Journal of Computational Biology (37 papers in training set): Top 0.1%, 6.5%

50% of probability mass above

4. BMC Bioinformatics (383 papers in training set): Top 2%, 4.9%
5. Bioinformatics Advances (184 papers in training set): Top 1%, 3.7%
6. PLOS ONE (4510 papers in training set): Top 38%, 3.7%
7. PLOS Computational Biology (1633 papers in training set): Top 12%, 2.8%
8. Systematic Biology (121 papers in training set): Top 0.2%, 2.6%
9. BMC Ecology and Evolution (49 papers in training set): Top 0.7%, 2.4%
10. Molecular Biology and Evolution (488 papers in training set): Top 2%, 2.1%
11. Scientific Reports (3102 papers in training set): Top 55%, 1.8%
12. Journal of Systematics and Evolution (11 papers in training set): Top 0.1%, 1.8%
13. BMC Genomics (328 papers in training set): Top 2%, 1.7%
14. Ecological Informatics (29 papers in training set): Top 0.4%, 1.7%
15. Molecular Ecology Resources (161 papers in training set): Top 1.0%, 0.8%
16. Peer Community Journal (254 papers in training set): Top 3%, 0.8%
17. PeerJ (261 papers in training set): Top 13%, 0.8%
18. Ecology and Evolution (232 papers in training set): Top 4%, 0.8%
19. Nature Communications (4913 papers in training set): Top 63%, 0.7%
20. Genome Biology and Evolution (280 papers in training set): Top 2%, 0.7%
21. Briefings in Bioinformatics (326 papers in training set): Top 7%, 0.7%
22. Data in Brief (13 papers in training set): Top 0.6%, 0.7%
23. Systematic Entomology (11 papers in training set): Top 0.1%, 0.5%
24. NAR Genomics and Bioinformatics (214 papers in training set): Top 5%, 0.5%
25. Communications Biology (886 papers in training set): Top 32%, 0.5%
26. SoftwareX (15 papers in training set): Top 0.6%, 0.5%
27. Nature Genetics (240 papers in training set): Top 9%, 0.5%