Meta16S: large-scale discovery and taxonomic assignment of unknown microbes from 16S amplicon sequencing samples

Cumbo, F.; Felici, G.; Blankenberg, D.; Valeriani, F.; Romano Spica, V.; Santoni, D.

2026-05-20 microbiology
10.64898/2026.05.19.726236 bioRxiv

Background: The exponential growth of public metagenomic datasets offers an unprecedented opportunity to explore microbial diversity. However, analyzing this vast amount of data presents significant computational challenges. While shotgun metagenomics provides deep functional and taxonomic resolution, its high cost still limits its application. On the other hand, 16S rRNA gene sequencing remains a cost-effective and widely used alternative, but tools are needed to maximize its discovery potential. Traditional clustering is not scalable, obstructing the creation of a comprehensive and continuously updated catalog of microbial life from 16S data.

Methods: We developed a reproducible and scalable Snakemake pipeline for the incremental clustering of 16S rRNA amplicons. The workflow begins by constructing a reference database from bacterial and archaeal genomes. It then processes 16S rRNA samples sequentially. For each new sample, sequences are first mapped against the existing cluster centroids. Sequences that match known centroids are assigned accordingly, while unmapped sequences are clustered independently to form novel operational taxonomic units (OTUs). These new centroids are then merged with the existing database, allowing it to grow dynamically without the need for computationally prohibitive all-at-once re-clustering.

Results: Our pipeline enables the efficient and continuous expansion of a 16S rRNA cluster database. By processing a large corpus of public 16S rRNA samples, we generated a comprehensive atlas of tens of thousands of OTUs. A significant fraction of these clusters, particularly at the genus and family levels, were classified as unknown.

Conclusions: This work provides a powerful, open-source tool for large-scale analysis of 16S rRNA samples. The incremental clustering strategy overcomes the scalability limitations of traditional methods, allowing researchers to leverage public data and discover novel microbes in their own microbiome samples.
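The incremental step described in the Methods (map against existing centroids, cluster the unmapped sequences separately, merge the new centroids) can be illustrated with a minimal Python sketch. This is not the authors' Snakemake pipeline: the abstract does not name the mapping tool or the identity threshold, so the `identity` function (a `difflib` ratio standing in for a real alignment identity), the greedy longest-first clustering, the 0.97 default, and the `OTU_n` naming scheme are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumed stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_centroid(seq, centroids, threshold):
    """Return the id of the most similar centroid at or above threshold, else None."""
    best_id, best_score = None, threshold
    for cid, centroid in centroids.items():
        score = identity(seq, centroid)
        if score >= best_score:
            best_id, best_score = cid, score
    return best_id


def add_sample(centroids, sample, threshold=0.97):
    """Process one sample against the centroid database, growing it in place.

    Returns a dict mapping each input sequence to its assigned cluster id.
    The threshold default and the id scheme are illustrative assumptions.
    """
    assignments, unmapped = {}, []

    # Step 1: map every sequence against the existing centroids.
    for seq in sample:
        cid = best_centroid(seq, centroids, threshold)
        if cid is None:
            unmapped.append(seq)
        else:
            assignments[seq] = cid

    # Step 2: cluster the unmapped sequences on their own (greedy, longest first).
    novel = {}
    for seq in sorted(unmapped, key=len, reverse=True):
        cid = best_centroid(seq, novel, threshold)
        if cid is None:
            # Hypothetical naming: assumes existing ids follow OTU_1..OTU_n.
            cid = f"OTU_{len(centroids) + len(novel) + 1}"
            novel[cid] = seq
        assignments[seq] = cid

    # Step 3: merge the novel centroids into the database; no re-clustering
    # of previously processed samples is needed.
    centroids.update(novel)
    return assignments
```

Calling `add_sample` once per sample keeps the cost of each step proportional to the size of the new sample and the current centroid set, which is the property the abstract contrasts with all-at-once re-clustering.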

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Microbiome: 139 papers in training set; Top 0.1%; 18.3%
2. mSystems: 361 papers in training set; Top 0.4%; 14.1%
3. Bioinformatics: 1061 papers in training set; Top 3%; 8.3%
4. ISME Communications: 103 papers in training set; Top 0.2%; 6.7%
5. Microbial Genomics: 204 papers in training set; Top 0.4%; 6.3%
50% of probability mass above
6. mSphere: 281 papers in training set; Top 2%; 3.5%
7. Methods in Ecology and Evolution: 160 papers in training set; Top 0.9%; 3.5%
8. PLOS ONE: 4510 papers in training set; Top 41%; 3.2%
9. PLOS Computational Biology: 1633 papers in training set; Top 11%; 3.0%
10. Nature Communications: 4913 papers in training set; Top 45%; 2.6%
11. Bioinformatics Advances: 184 papers in training set; Top 2%; 2.3%
12. GigaScience: 172 papers in training set; Top 1.0%; 2.0%
13. mBio: 750 papers in training set; Top 8%; 1.5%
14. Nucleic Acids Research: 1128 papers in training set; Top 12%; 1.5%
15. NAR Genomics and Bioinformatics: 214 papers in training set; Top 2%; 1.3%
16. Scientific Reports: 3102 papers in training set; Top 64%; 1.3%
17. Cell Reports Methods: 141 papers in training set; Top 3%; 1.3%
18. Molecular Ecology Resources: 161 papers in training set; Top 0.8%; 1.2%
19. Frontiers in Microbiology: 375 papers in training set; Top 7%; 1.1%
20. Microbiology Spectrum: 435 papers in training set; Top 5%; 0.8%
21. PeerJ: 261 papers in training set; Top 14%; 0.8%
22. BMC Bioinformatics: 383 papers in training set; Top 7%; 0.8%
23. Microorganisms: 101 papers in training set; Top 2%; 0.7%
24. Microbiology Resource Announcements: 22 papers in training set; Top 1.0%; 0.7%
25. Metabolites: 50 papers in training set; Top 1%; 0.7%
26. Frontiers in Bioinformatics: 45 papers in training set; Top 1%; 0.6%
27. Journal of Visualized Experiments: 30 papers in training set; Top 1.0%; 0.6%
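The "50% of probability mass" marker in the list above appears to sit at the point where the running total of predicted probabilities first reaches 50%: the top four entries sum to 47.4% and the top five to 53.7%. A small Python sketch of that reading follows; the function name `journals_to_reach` is hypothetical, and the threshold interpretation is an assumption rather than something the page states.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Count how many top-ranked entries are needed for the running total
    to reach the threshold; return None if it is never reached."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Predicted probabilities (in percent) for the six highest-ranked journals above.
top_probabilities = [18.3, 14.1, 8.3, 6.7, 6.3, 3.5]
print(journals_to_reach(top_probabilities))  # prints 5
```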