
Deconvolving Phylogenetic Distance Mixtures

Arasti, S.; Sapci, A. O. B.; Rachtman, E.; El-Kebir, M.; Mirarab, S.

2026-01-21 evolutionary biology
10.64898/2026.01.18.700179 bioRxiv

Mixtures of multiple constituent organisms are sequenced in several widely used applications, including metagenomics and metabarcoding. Characterizing the elements of the sequence mixture and their abundance with respect to a reference set of known organisms has been the subject of intense research across several domains, including microbiome analyses, and methods must overcome two key challenges. First, the mixture constituents are related to each other through an evolutionary history and hence should not be considered independent entities. Second, sequence data is noisy, with each short read providing a limited signal. Existing approaches attempt to address these challenges, but handling both at once has proved difficult. To handle evolutionary dependencies, methods either define hierarchical clusters (e.g., taxonomies or operational taxonomic/genomic units) or use phylogenetic trees. To handle noise, they either assemble reads into contigs, use statistical priors to summarize read placements, or attempt to analyze all reads jointly using k-mers. Despite this rich literature, a natural approach to simultaneously address both challenges has been underexplored: compute a distance from the mixture to all references, deconvolve those distances, and place the sample on multiple branches of a reference phylogeny with associated abundances. This multi-placement approach is a natural extension of the single-read phylogenetic placement used in practice. We argue that by placing the entire sample on multiple branches instead of placing reads individually, we can obtain a less noisy profile of the mixture. We formalize this approach as the phylogenetic distance deconvolution (PDD) problem, show some limits on the identifiability of PDD, and propose both a slow exact algorithm and an efficient greedy heuristic with local refinements.
Benchmarking shows that these heuristics are effective and that our implementation of the PDD approach (called DecoDiPhy) can accurately deconvolve phylogenetic mixture distances while scaling quadratically. Applied to metagenomics, DecoDiPhy consolidates reads mapped to a large number of branches on a reference tree to a much smaller number of placements. The consolidated placements improve the accuracy of downstream tasks, such as sample differentiation and detection of differentially abundant taxa.
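
To make the "compute distances, deconvolve, place with abundances" idea concrete, here is a minimal sketch. It is not DecoDiPhy and not the algorithm from the paper. It assumes, purely for illustration, that the observed distance from the mixture to each reference is the abundance-weighted average of the tree distances from each constituent to that reference, that constituents sit exactly at leaves rather than anywhere along a branch, and that there are exactly two of them. The function name `best_two_way_mixture`, the four-taxon tree, and all numbers are hypothetical.

```python
from itertools import combinations


def best_two_way_mixture(profiles, observed):
    """Exhaustively fit observed ~= w*a + (1-w)*b over all candidate pairs.

    profiles: dict mapping a candidate placement to its vector of tree
              distances to every reference (illustrative assumption).
    observed: vector of distances from the mixture to every reference.
    Returns (squared_error, {placement: abundance}) for the best pair.
    """
    best = None
    for (name_a, a), (name_b, b) in combinations(profiles.items(), 2):
        diff = [x - y for x, y in zip(a, b)]
        denom = sum(t * t for t in diff)
        if denom == 0:
            continue  # identical profiles cannot be told apart
        # Closed-form least-squares weight for one pair, clipped to [0, 1].
        w = sum((o - y) * t for o, y, t in zip(observed, b, diff)) / denom
        w = min(1.0, max(0.0, w))
        err = sum(
            (o - (w * x + (1 - w) * y)) ** 2
            for o, x, y in zip(observed, a, b)
        )
        if best is None or err < best[0]:
            best = (err, {name_a: w, name_b: 1 - w})
    return best


# Hypothetical reference tree ((A:1,B:1):1,(C:1,D:1):1); each row lists
# the path distances from one leaf to the references A, B, C, D.
profiles = {
    "A": [0, 2, 4, 4],
    "B": [2, 0, 4, 4],
    "C": [4, 4, 0, 2],
    "D": [4, 4, 2, 0],
}

# Noise-free distances for a made-up sample that is 70% A and 30% C.
observed = [1.2, 2.6, 2.8, 3.4]

err, weights = best_two_way_mixture(profiles, observed)
print(weights, err)
```

On this noise-free toy input the search recovers placements A and C with abundances of about 0.7 and 0.3. Enumerating every pair is only workable for tiny trees, which is consistent with the abstract's motivation for a greedy heuristic, though the heuristic itself is not reproduced here.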

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Bioinformatics                                   1061                    Top 0.5%    38.0%
2     Genome Research                                  409                     Top 0.3%    6.9%
3     Microbiome                                       139                     Top 0.8%    4.3%
4     PLOS Computational Biology                       1633                    Top 8%      4.2%
      (50% of probability mass above)
5     Nature Communications                            4913                    Top 37%     4.0%
6     Journal of Computational Biology                 37                      Top 0.1%    2.7%
7     Nature Biotechnology                             147                     Top 3%      2.7%
8     Methods in Ecology and Evolution                 160                     Top 1%      2.4%
9     Proceedings of the National Academy of Sciences  2130                    Top 27%     2.1%
10    Molecular Biology and Evolution                  488                     Top 2%      2.1%
11    Cell Systems                                     167                     Top 6%      1.9%
12    Nature Computational Science                     50                      Top 0.5%    1.8%
13    Genome Biology                                   555                     Top 4%      1.7%
14    eLife                                            5422                    Top 43%     1.7%
15    Systematic Biology                               121                     Top 0.3%    1.7%
16    PLOS ONE                                         4510                    Top 57%     1.5%
17    Microbial Genomics                               204                     Top 2%      0.9%
18    Bioinformatics Advances                          184                     Top 4%      0.9%
19    BMC Bioinformatics                               383                     Top 6%      0.9%
20    Molecular Ecology Resources                      161                     Top 0.9%    0.9%
21    Peer Community Journal                           254                     Top 4%      0.8%
22    Nature Microbiology                              133                     Top 4%      0.8%
23    mSphere                                          281                     Top 6%      0.7%
24    Nature Genetics                                  240                     Top 8%      0.6%
25    iScience                                         1063                    Top 37%     0.6%
26    NAR Genomics and Bioinformatics                  214                     Top 4%      0.6%
27    Ecology Letters                                  121                     Top 2%      0.5%
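
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses only the first five values from the ranking; the helper name `journals_to_reach` is hypothetical, and the page does not say how it computes the cutoff, so treating it as "the first rank at which the cumulative sum reaches 50%" is an assumption.

```python
def journals_to_reach(probabilities, target=50.0):
    """Return (k, cumulative) for the first rank k whose running sum
    of percentages reaches the target, or None if it is never reached."""
    total = 0.0
    for k, p in enumerate(probabilities, start=1):
        total += p
        if total >= target:
            return k, total
    return None


# Predicted probabilities (in percent) for ranks 1 to 5, as listed.
top_probabilities = [38.0, 6.9, 4.3, 4.2, 4.0]

k, cumulative = journals_to_reach(top_probabilities)
print(k, round(cumulative, 1))
```

The running sum is 38.0, 44.9, 49.2, and then 53.4 after the fourth journal, so the 50% threshold is first crossed at rank 4.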