Comparative performance of reference-based metagenomic tools to identify species-level taxa among families of bacteria: benchmarking Mycobacteriaceae and Neisseriaceae

Harrison, L. B.; Ahmed, J. O.; Coulibaly, G. M.; Veyrier, F. J.

2026-04-23 bioinformatics
10.1101/2025.06.23.661092 bioRxiv

Hypotheses concerning the ecology and evolution of bacteria commonly relate to the presence and abundance of species in various settings and conditions. Shotgun metagenomics may address these hypotheses, which previously relied on PCR or culture. However, determining the presence or absence of a given species of interest is not trivial, particularly when closely related species are present in the reference database or the metagenomic sample. Reference-based methods to detect species-level taxa mostly rely on thresholding of aligned reads, mapped k-mers, or derivative metrics such as genomic coverage, and this creates a trade-off between recall/completeness and precision/purity. New methods for species-level profiling (YACHT, metapresence and sylph) have recently been published. Here we test the performance of these methods, along with Kraken2/bracken and MetaPhlAn4, to detect related species of interest using simulated metagenomic samples from genomes in the families Mycobacteriaceae and Neisseriaceae, which contain closely related genomes. Among the methods tested, metapresence, when used with an alignment quality filter, and sylph offer the best overall performance. Sylph maintains high precision but requires a depth of coverage greater than approximately 0.1x to reliably detect a genome's presence. Metapresence has a lower limit of detection of hundreds of reads, but this is balanced against relatively lower precision. Both methods are relatively robust to the presence of reads from genomes outside the groups of interest. We demonstrate the application of these methods in two real-world datasets: a mycobacterial community in a drinking water system and the community of Neisseriaceae present in the human oral cavity.

Importance: Detecting which bacterial species of interest are present in a given sample is fundamental to studies of microbial ecology and evolution, and to applied microbiology (e.g. clinical diagnostics). Culture-dependent and culture-independent (e.g. PCR) approaches are increasingly complemented by metagenomic approaches, but methods to accurately identify specific low-abundance species-level genomes in a shotgun metagenomic sample are still being refined. Here we comprehensively test YACHT, Kraken2/bracken, metapresence, MetaPhlAn4, and sylph using two simulated datasets of bacterial families, Mycobacteriaceae and Neisseriaceae, that contain closely related species. Our simulations exploit natural genomic diversity to create a challenging benchmark. We demonstrate that metapresence and sylph perform best, with the former being well-suited to low-biomass host-associated datasets, and the latter to environmental metagenomic samples. This study is the first extensive benchmark of these methods for this use case, and demonstrates that these methods can accurately identify closely related species of interest.
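To make the quantities in the abstract concrete, the following minimal Python sketch relates a read count to depth of coverage and computes precision and recall for a presence/absence call. It is an illustration only, not the authors' pipeline: the function names, the 150 bp read length, the 5 Mbp genome size, and the species labels are all hypothetical assumptions chosen for the example.

```python
# Illustrative sketch only; all names and numbers here are assumptions,
# not values or code taken from the preprint.

def depth_of_coverage(n_reads, read_length, genome_size):
    """Mean depth of coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def precision_recall(predicted, truth):
    """Precision and recall of a set of species called present,
    compared against the set of species truly present."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical case: 500 reads of 150 bp from a 5 Mbp genome.
    depth = depth_of_coverage(500, 150, 5_000_000)
    print(f"depth of coverage: {depth:.3f}x")

    # Hypothetical calls: species A and B are correct, C is a false positive,
    # and D and E are missed.
    p, r = precision_recall({"A", "B", "C"}, {"A", "B", "D", "E"})
    print(f"precision: {p:.2f}, recall: {r:.2f}")
```

Under these assumed values, a few hundred reads correspond to a depth well below 0.1x, which is one way to read the contrast the abstract draws between a detection limit stated in reads and one stated in depth of coverage.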

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 2%; 18.3%
2. Microbial Genomics; 204 papers in training set; Top 0.1%; 17.2%
3. Microbiome; 139 papers in training set; Top 0.2%; 9.9%
4. mSystems; 361 papers in training set; Top 1%; 8.2%

50% of probability mass above

5. BMC Bioinformatics; 383 papers in training set; Top 2%; 4.8%
6. mSphere; 281 papers in training set; Top 0.9%; 4.8%
7. PLOS Computational Biology; 1633 papers in training set; Top 10%; 3.5%
8. Molecular Ecology Resources; 161 papers in training set; Top 0.5%; 2.1%
9. Genome Biology; 555 papers in training set; Top 4%; 2.1%
10. Nature Communications; 4913 papers in training set; Top 47%; 2.0%
11. PLOS ONE; 4510 papers in training set; Top 51%; 1.9%
12. Cell Reports Methods; 141 papers in training set; Top 2%; 1.7%
13. Nucleic Acids Research; 1128 papers in training set; Top 12%; 1.5%
14. Briefings in Bioinformatics; 326 papers in training set; Top 4%; 1.5%
15. Frontiers in Microbiology; 375 papers in training set; Top 6%; 1.3%
16. Microbiology Spectrum; 435 papers in training set; Top 4%; 1.2%
17. Scientific Reports; 3102 papers in training set; Top 67%; 1.2%
18. Genome Research; 409 papers in training set; Top 3%; 0.9%
19. BMC Genomics; 328 papers in training set; Top 5%; 0.9%
20. Nature Microbiology; 133 papers in training set; Top 4%; 0.9%
21. The Lancet Microbe; 43 papers in training set; Top 1%; 0.9%
22. Methods in Ecology and Evolution; 160 papers in training set; Top 2%; 0.7%
23. PeerJ; 261 papers in training set; Top 18%; 0.6%
24. BMC Biology; 248 papers in training set; Top 6%; 0.6%
25. Frontiers in Cellular and Infection Microbiology; 98 papers in training set; Top 7%; 0.6%
26. mBio; 750 papers in training set; Top 13%; 0.6%