
Beta Diversity Meta-Analysis Shows Transformations Have Broadly Similar Performance in Machine Learning Applications Regardless of Compositional or Phylogenetic Awareness

Fry Brumit, D.; Sorgen, A. A.; Fodor, A.

2026-01-23 · bioinformatics
bioRxiv · doi:10.64898/2026.01.20.699043
Abstract

Background
Beta diversity quantifies pairwise differences between two or more communities through matrix transformations, which are either naive to phylogeny or phylogenetically aware. Methods have recently been introduced that also account for compositionality and sparsity and that display an increased magnitude of pseudo-F scores, as produced by PERMANOVA, to measure effect size. In this study, we ask how transformations that consider phylogeny, sparsity, and compositionality compare to older, simpler methods across five publicly available datasets.

Results
Applying random forest methods to 107 features across 5 datasets did not yield a consistent difference in classification performance between beta diversity methods. Limiting datasets to just three eigenvalue decomposition (EVD) axes leads to a small but reliably detectable decrease in performance compared to giving random forest models access to log-normalized or even un-normalized raw count tables. Increasing the number of included EVD axes improves classification performance across all available models up to ~10-20 axes. We observed larger variation in PERMANOVA pseudo-F scores for some features associated with phylogenetically and compositionally aware beta diversity algorithms across multiple datasets, but did not find that these improved scores yielded consistently increased resolution or accuracy for machine learning methods.

Conclusions
While EVD remains an essential technique for dimension reduction, retaining higher-dimensional structure beyond 3 EVD axes may improve performance. Elevated but insignificant pseudo-F scores may be explained by the higher variance in pseudo-F scores for phylogenetically or compositionally aware methods compared to simpler methods. This indicates that pseudo-F scores are an unreliable overall metric of algorithm performance. Taken together, our results show that the choice of beta diversity metric does not yield a substantial difference in effect size or machine learning performance. We conclude that analysts are free to choose appropriate methods for each dataset, balancing simplicity against corrections for phylogeny, sparsity, and compositionality, and that these choices are unlikely to affect the overall power and resolution of biological conclusions drawn from microbial data.
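The pipeline the abstract describes (dissimilarity matrix → eigenvalue decomposition → random forest on a chosen number of axes) can be sketched as follows. This is not the paper's code: the synthetic count table, the Bray-Curtis metric, the PCoA-style double-centering, and the scikit-learn random forest are all assumptions chosen to illustrate the general approach of varying how many EVD axes the classifier sees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic count table: 60 samples x 40 taxa, two groups, with the
# second group enriched in the first 5 taxa (purely illustrative data).
n, p = 60, 40
counts = rng.poisson(5, size=(n, p)).astype(float)
labels = np.repeat([0, 1], n // 2)
counts[labels == 1, :5] += rng.poisson(8, size=(n // 2, 5))

def bray_curtis(x):
    # Pairwise Bray-Curtis dissimilarity (a phylogeny-naive metric).
    num = np.abs(x[:, None, :] - x[None, :, :]).sum(-1)
    den = (x[:, None, :] + x[None, :, :]).sum(-1)
    return num / den

d = bray_curtis(counts)

# Classical PCoA: eigenvalue decomposition of the double-centered
# squared-distance matrix; axes are eigenvectors scaled by sqrt(eigenvalue).
j = np.eye(n) - np.ones((n, n)) / n
b = -0.5 * j @ (d ** 2) @ j
evals, evecs = np.linalg.eigh(b)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
axes = evecs * np.sqrt(np.clip(evals, 0, None))

# Compare classifier accuracy when retaining 3 vs 15 EVD axes.
for k in (3, 15):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    acc = cross_val_score(rf, axes[:, :k], labels, cv=5).mean()
    print(f"{k} axes: mean CV accuracy = {acc:.2f}")
```

Swapping `bray_curtis` for a phylogenetically or compositionally aware transformation changes only the matrix `d`; the EVD and classification steps are unchanged, which is what makes this kind of head-to-head comparison across metrics straightforward.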

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
 1    BMC Bioinformatics                                   383                    Top 0.3%    18.3%
 2    PLOS ONE                                            4510                    Top 20%      9.9%
 3    PLOS Computational Biology                          1633                    Top 3%       9.9%
 4    PeerJ                                                261                    Top 0.3%     8.2%
 5    mSphere                                              281                    Top 1%       4.2%
----- 50% of probability mass above -----
 6    NAR Genomics and Bioinformatics                      214                    Top 0.6%     3.9%
 7    mSystems                                             361                    Top 3%       3.9%
 8    Methods in Ecology and Evolution                     160                    Top 0.9%     3.5%
 9    Microbial Genomics                                   204                    Top 0.6%     3.5%
10    Bioinformatics Advances                              184                    Top 1%       3.5%
11    Bioinformatics                                      1061                    Top 6%       2.7%
12    Scientific Reports                                  3102                    Top 46%      2.6%
13    Frontiers in Bioinformatics                           45                    Top 0.1%     2.0%
14    GigaScience                                          172                    Top 2%       1.6%
15    F1000Research                                         79                    Top 2%       1.5%
16    BMC Genomics                                         328                    Top 4%       1.2%
17    Microbiology Spectrum                                435                    Top 4%       0.9%
18    Frontiers in Microbiology                            375                    Top 8%       0.9%
19    Microbiome                                           139                    Top 3%       0.9%
20    Computational and Structural Biotechnology Journal   216                    Top 8%       0.9%
21    Molecular Ecology Resources                          161                    Top 1%       0.7%
22    Environmental Microbiome                              26                    Top 0.6%     0.7%
23    Wellcome Open Research                                57                    Top 2%       0.7%
24    Metabarcoding and Metagenomics                        12                    Top 0.1%     0.7%
25    Frontiers in Genetics                                197                    Top 11%      0.7%
26    Cell Reports Methods                                 141                    Top 6%       0.6%
27    Ecological Informatics                                29                    Top 0.9%     0.6%
28    Fungal Genetics and Biology                           14                    Top 0.3%     0.6%