Deconvolving Phylogenetic Distance Mixtures
Arasti, S.; Sapci, A. O. B.; Rachtman, E.; El-Kebir, M.; Mirarab, S.
Mixtures of multiple constituent organisms are sequenced in several widely used applications, including metagenomics and metabarcoding. Characterizing the elements of the sequence mixture and their abundance with respect to a reference set of known organisms has been the subject of intense research across several domains, including microbiome analyses. Methods must overcome two key challenges. First, the mixture constituents are related to each other through an evolutionary history and hence should not be considered independent entities. Second, sequence data is noisy, with each short read providing a limited signal. Existing approaches attempt to address these challenges, but addressing both simultaneously has proved difficult. To capture evolutionary dependencies, methods either define hierarchical clusters (e.g., taxonomies or operational taxonomic/genomic units) or use phylogenetic trees. To handle noise, they assemble reads into contigs, use statistical priors to summarize read placements, or attempt to analyze all reads jointly using k-mers. Despite this rich literature, a natural approach to simultaneously address both challenges has been underexplored: compute a distance from the mixture to all references, deconvolve those distances, and place the sample on multiple branches of a reference phylogeny with associated abundances. This multi-placement approach is a natural extension of the single-read phylogenetic placement used in practice. We argue that by placing the entire sample on multiple branches instead of placing reads individually, we can obtain a less noisy profile of the mixture. We formalize this approach as the phylogenetic distance deconvolution (PDD) problem, show some limits on its identifiability, and propose both a slow exact algorithm and an efficient greedy heuristic with local refinements.
Benchmarking shows that these heuristics are effective and that our implementation of the PDD approach (called DecoDiPhy) can accurately deconvolve phylogenetic mixture distances while scaling quadratically. Applied to metagenomics, DecoDiPhy consolidates reads mapped to a large number of branches on a reference tree into a much smaller number of placements. The consolidated placements improve the accuracy of downstream tasks, such as sample differentiation and detection of differentially abundant taxa.
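The deconvolution idea can be illustrated with a toy sketch. The code below is not DecoDiPhy and does not reproduce either algorithm from the paper. It assumes, purely for illustration, that the distance from a mixture to each reference is the abundance-weighted average of the distances from its constituents to that reference, that exactly two constituents are present, and that candidate placements sit at the leaves of a four-taxon tree ((A:1,B:1):1,(C:1,D:1):1). The names `mix`, `deconvolve_pair`, and `dist` are hypothetical.

```python
from itertools import combinations

# Path distances on the toy tree ((A:1,B:1):1,(C:1,D:1):1).
# dist[c][r] is the distance from candidate placement c to reference r,
# with references ordered A, B, C, D. (Hypothetical toy data.)
dist = {
    "A": [0.0, 2.0, 4.0, 4.0],
    "B": [2.0, 0.0, 4.0, 4.0],
    "C": [4.0, 4.0, 0.0, 2.0],
    "D": [4.0, 4.0, 2.0, 0.0],
}

def mix(dist, abundances):
    """Assumed forward model: abundance-weighted average of distance rows."""
    n_refs = len(next(iter(dist.values())))
    return [sum(p * dist[c][r] for c, p in abundances.items())
            for r in range(n_refs)]

def deconvolve_pair(dist, mixture):
    """Exhaustively try every pair of candidates and fit one mixing weight.

    For a pair (a, b), the least-squares weight w on a has a closed form;
    clipping it to [0, 1] is exact because the objective is a 1-D convex
    quadratic in w.
    """
    best = None
    for a, b in combinations(sorted(dist), 2):
        diff = [x - y for x, y in zip(dist[a], dist[b])]
        denom = sum(d * d for d in diff)
        if denom == 0:
            continue  # identical rows cannot be told apart
        resid = [m - y for m, y in zip(mixture, dist[b])]
        w = sum(r * d for r, d in zip(resid, diff)) / denom
        w = min(1.0, max(0.0, w))
        err = sum((m - (w * x + (1 - w) * y)) ** 2
                  for m, x, y in zip(mixture, dist[a], dist[b]))
        if best is None or err < best[0]:
            best = (err, {a: w, b: 1 - w})
    return best

# Simulate a noiseless mixture of 70% A and 30% C, then recover it.
mixture = mix(dist, {"A": 0.7, "C": 0.3})
err, weights = deconvolve_pair(dist, mixture)
```

The exhaustive search over pairs grows combinatorially with the number of constituents and candidate placements, which gives some intuition for why an exact approach can be slow and a greedy heuristic attractive. Real data would also be noisy and would allow placements along internal branches, neither of which this sketch models.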