Weighted Centroid Trees: A general approach for summarizing phylogenies in tumor mutation tree inference

Vasei, H.; Foroughmand-Araabi, M.-H.; Daneshgar, A.

2023-09-15 bioinformatics

10.1101/2023.09.11.557167 bioRxiv

Show abstract

Tumor mutation trees are the primary tools to model the evolution of cancer. Not only some tumor phylogeny inference methods may produce a set of trees having potential and parallel evolutionary histories, but also mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree, exhibiting the shared evolutionary processes, is definitely of great importance and can be beneficial in many applications. In this study, we present a general setup to study and analyse the problem of finding a best-fit (centroid) tree to a given set of trees and we use our general setup to analyse mutation trees as our main motivation. For this let{varepsilon} :[T] n [->] [R]nxn be an embedding of labeled rooted trees into the space of real square matrices and also let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a closest tree to a given matrix A with respect to{varepsilon} and L, i.e., a tree T *(A) for which L({varepsilon}(T *(A)) - A) is minimized. Within this setup, our potential candidates for the embedding are adjacency, ancestry, and distance matrices of trees, where we consider the cases of L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L({varepsilon}(T1) -{varepsilon} (T2)) defines a family of dissimilarity measures, covering previously studied parent-child and ancestor-descendent metrics. Also, we show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and is[N][P] -hard for the ancestry and the distance embeddings. The weighted centroid tree problem for a given set of trees of size k is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding matrix set. In this article we consider uniform weighted-sums for which all weights are equal, where in particular, the (classical) centroid tree is defined to be a solution when all weights are chosen to be equal to 1/k (i.e., the mean case). Similarly, the{omega} -weighted centroid tree is a solution when all weights are equal to{omega} /k. To show the generality of our setup, we prove that the solution-set of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution-set of the consensus tree problem for parent-child and ancestor-descendent distances already handled by the algorithms GraPhyC(2018) and TuELiP(2023), respectively. Next, to tackle this problem for some new cases, we provide integer linear programs to handle the nearest mapped tree problem for the ancestry and the distance embeddings, giving rise to solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, to solvethe 2-weighted centroid tree problem for the case of the ancestry matrix and we justify the importance of the weighted setup by showing the pioneering performance of WAncILP2 both in a comprehensive simulation analysis as well as on a real breast cancer dataset, in which, by finding the centroids as representatives of data clusters, we provide supporting evidence for the fact that some common aspects of these centroids can be considered as suitable candidates for reliable evolutionary information in relation to the original data. metrics.

Weighted Centroid Trees: A general approach for summarizing phylogenies in tumor mutation tree inference

Matching journals