A new clustering method for building multiple trees using deep learning.

Tahiri, N.

2019-10-04 evolutionary biology
10.1101/781252 bioRxiv

Each gene has its own evolutionary history, which can differ substantially from those of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer or hybridization events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree, which may display evolutionary patterns different from those of the species tree, or Tree of Life, that represents the main patterns of vertical descent. Here, we present a new efficient method for inferring single or multiple consensus trees and supertrees for a given set of phylogenetic trees (i.e., additive trees or X-trees). Traditional tree consensus methods output a single consensus tree or supertree. We show how Machine Learning (ML) models, based on properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one cluster (when the data are homogeneous) or multiple clusters (when the data are heterogeneous). We adapt the popular Accuracy, Precision, Sensitivity, and F1 scores to tree clustering. Special attention is paid to the relevant, but very challenging, problem of inferring alternative supertrees built from phylogenies defined on different, but mutually overlapping, sets of species. The use of an approximate objective function in clustering makes the new method faster than existing tree clustering techniques and thus suitable for the analysis of large genomic datasets.
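
The abstract relies on the Robinson and Foulds (RF) topological distance as the basis for clustering. As a rough illustration only, the sketch below computes the RF distance between unrooted trees defined on the same leaf set and builds the pairwise distance matrix that a clustering step could consume. The nested-tuple tree encoding and the function names (`leaf_sets`, `splits`, `rf_distance`, `pairwise_rf`) are assumptions made for this example; it is not the author's implementation, and it does not reproduce the paper's objective function or the supertree case with overlapping leaf sets.

```python
def leaf_sets(tree, out):
    """Return the set of leaves under `tree`; record every internal clade in `out`."""
    if isinstance(tree, str):              # a leaf is a plain label
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as the side
    that does NOT contain a fixed reference leaf (so equal splits compare equal)."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # keep only splits with at least two leaves on each side
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def rf_distance(t1, t2):
    """RF distance: number of bipartitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


def pairwise_rf(trees):
    """Symmetric matrix of RF distances, usable as input to a clustering algorithm."""
    return [[rf_distance(a, b) for b in trees] for a in trees]


# Hypothetical example trees on the leaves A..E (assumed data, not from the paper).
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
t3 = ("A", ("B", ("C", ("D", "E"))))   # same unrooted topology as t1

print(rf_distance(t1, t2))   # 2
print(rf_distance(t1, t3))   # 0
print(pairwise_rf([t1, t2, t3]))
```

Trees with small mutual RF distances would be candidates for the same cluster, each of which can then be summarized by its own consensus tree.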

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 0.9% | 26.1%
2 | Journal of Computational Biology | 37 papers in training set | Top 0.1% | 14.5%
3 | PLOS Computational Biology | 1633 papers in training set | Top 4% | 8.5%
4 | Bioinformatics Advances | 184 papers in training set | Top 0.8% | 4.4%
50% of probability mass above
5 | BMC Bioinformatics | 383 papers in training set | Top 2% | 4.4%
6 | Genome Research | 409 papers in training set | Top 0.8% | 4.0%
7 | PLOS ONE | 4510 papers in training set | Top 36% | 4.0%
8 | BMC Genomics | 328 papers in training set | Top 0.8% | 3.6%
9 | Systematic Biology | 121 papers in training set | Top 0.2% | 2.6%
10 | Molecular Biology and Evolution | 488 papers in training set | Top 2% | 2.4%
11 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.1%
12 | Scientific Reports | 3102 papers in training set | Top 59% | 1.7%
13 | Methods in Ecology and Evolution | 160 papers in training set | Top 1% | 1.5%
14 | eLife | 5422 papers in training set | Top 50% | 1.1%
15 | PLOS Genetics | 756 papers in training set | Top 12% | 1.0%
16 | BMC Ecology and Evolution | 49 papers in training set | Top 2% | 0.9%
17 | Journal of Genetics and Genomics | 36 papers in training set | Top 2% | 0.8%
18 | Genetics | 225 papers in training set | Top 4% | 0.8%
19 | iScience | 1063 papers in training set | Top 32% | 0.8%
20 | Communications Biology | 886 papers in training set | Top 23% | 0.8%
21 | PeerJ | 261 papers in training set | Top 16% | 0.7%
22 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 47% | 0.6%
23 | Genome Biology and Evolution | 280 papers in training set | Top 2% | 0.5%
24 | Nature Communications | 4913 papers in training set | Top 67% | 0.5%
25 | Molecular Ecology Resources | 161 papers in training set | Top 1% | 0.5%
26 | Journal of Systematics and Evolution | 11 papers in training set | Top 0.4% | 0.5%
27 | Journal of Molecular Evolution | 21 papers in training set | Top 0.5% | 0.5%