Back

Weighted Centroid Trees: A general approach for summarizing phylogenies in tumor mutation tree inference

Vasei, H.; Foroughmand-Araabi, M.-H.; Daneshgar, A.

2023-09-15 bioinformatics
10.1101/2023.09.11.557167 bioRxiv
Show abstract

Tumor mutation trees are the primary tools to model the evolution of cancer. Not only some tumor phylogeny inference methods may produce a set of trees having potential and parallel evolutionary histories, but also mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree, exhibiting the shared evolutionary processes, is definitely of great importance and can be beneficial in many applications. In this study, we present a general setup to study and analyse the problem of finding a best-fit (centroid) tree to a given set of trees and we use our general setup to analyse mutation trees as our main motivation. For this let{varepsilon} :[T] n [->] [R]nxn be an embedding of labeled rooted trees into the space of real square matrices and also let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a closest tree to a given matrix A with respect to{varepsilon} and L, i.e., a tree T *(A) for which L({varepsilon}(T *(A)) - A) is minimized. Within this setup, our potential candidates for the embedding are adjacency, ancestry, and distance matrices of trees, where we consider the cases of L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L({varepsilon}(T1) -{varepsilon} (T2)) defines a family of dissimilarity measures, covering previously studied parent-child and ancestor-descendent metrics. Also, we show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and is[N][P] -hard for the ancestry and the distance embeddings. The weighted centroid tree problem for a given set of trees of size k is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding matrix set. In this article we consider uniform weighted-sums for which all weights are equal, where in particular, the (classical) centroid tree is defined to be a solution when all weights are chosen to be equal to 1/k (i.e., the mean case). Similarly, the{omega} -weighted centroid tree is a solution when all weights are equal to{omega} /k. To show the generality of our setup, we prove that the solution-set of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution-set of the consensus tree problem for parent-child and ancestor-descendent distances already handled by the algorithms GraPhyC(2018) and TuELiP(2023), respectively. Next, to tackle this problem for some new cases, we provide integer linear programs to handle the nearest mapped tree problem for the ancestry and the distance embeddings, giving rise to solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, to solvethe 2-weighted centroid tree problem for the case of the ancestry matrix and we justify the importance of the weighted setup by showing the pioneering performance of WAncILP2 both in a comprehensive simulation analysis as well as on a real breast cancer dataset, in which, by finding the centroids as representatives of data clusters, we provide supporting evidence for the fact that some common aspects of these centroids can be considered as suitable candidates for reliable evolutionary information in relation to the original data. metrics.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1
Journal of Computational Biology
37 papers in training set
Top 0.1%
18.6%
2
Bioinformatics
1061 papers in training set
Top 2%
17.5%
3
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
8.4%
4
PLOS Computational Biology
1633 papers in training set
Top 5%
6.8%
50% of probability mass above
5
Bulletin of Mathematical Biology
84 papers in training set
Top 0.5%
4.0%
6
BMC Bioinformatics
383 papers in training set
Top 2%
4.0%
7
Journal of Mathematical Biology
37 papers in training set
Top 0.1%
3.1%
8
PLOS ONE
4510 papers in training set
Top 43%
2.9%
9
Scientific Reports
3102 papers in training set
Top 50%
2.1%
10
Theoretical Population Biology
47 papers in training set
Top 0.1%
1.7%
11
Journal of The Royal Society Interface
189 papers in training set
Top 3%
1.5%
12
Entropy
20 papers in training set
Top 0.2%
1.3%
13
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.4%
1.2%
14
IEEE/ACM Transactions on Computational Biology and Bioinformatics
32 papers in training set
Top 0.4%
0.9%
15
Bioinformatics Advances
184 papers in training set
Top 4%
0.9%
16
Cell Systems
167 papers in training set
Top 10%
0.9%
17
Biometrics
22 papers in training set
Top 0.1%
0.9%
18
Systematic Biology
121 papers in training set
Top 0.4%
0.8%
19
Briefings in Bioinformatics
326 papers in training set
Top 6%
0.8%
20
iScience
1063 papers in training set
Top 29%
0.8%
21
Journal of Theoretical Biology
144 papers in training set
Top 2%
0.7%
22
Genome Research
409 papers in training set
Top 4%
0.7%
23
Mathematics
11 papers in training set
Top 0.5%
0.7%
24
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 46%
0.7%
25
Peer Community Journal
254 papers in training set
Top 4%
0.7%
26
Royal Society Open Science
193 papers in training set
Top 6%
0.6%
27
IEEE Access
31 papers in training set
Top 1%
0.6%
28
Biostatistics
21 papers in training set
Top 0.2%
0.6%