Meta16S: large-scale discovery and taxonomic assignment of unknown microbes from 16S amplicon sequencing samples
Cumbo, F.; Felici, G.; Blankenberg, D.; Valeriani, F.; Romano Spica, V.; Santoni, D.
Background: The exponential growth of public metagenomic datasets offers an unprecedented opportunity to explore microbial diversity. However, analyzing this vast amount of data presents significant computational challenges. While shotgun metagenomics provides deep functional and taxonomic resolution, its high cost still limits its application. On the other hand, 16S rRNA gene sequencing remains a cost-effective and widely used alternative, but tools are needed to maximize its discovery potential. Traditional clustering is not scalable, obstructing the creation of a comprehensive and continuously updated catalog of microbial life from 16S data.

Methods: We developed a reproducible and scalable Snakemake pipeline for the incremental clustering of 16S rRNA amplicons. The workflow begins by constructing a reference database from bacterial and archaeal genomes. It then processes 16S rRNA samples sequentially. For each new sample, sequences are first mapped against the existing cluster centroids. Sequences that match known centroids are assigned accordingly, while unmapped sequences are clustered independently to form novel operational taxonomic units (OTUs). These new centroids are then merged with the existing database, allowing it to grow dynamically without the need for computationally prohibitive all-at-once re-clustering.

Results: Our pipeline enables the efficient and continuous expansion of a 16S rRNA cluster database. By processing a large corpus of public 16S rRNA samples, we generated a comprehensive atlas of tens of thousands of OTUs. A significant fraction of these clusters, particularly at the genus and family levels, were classified as unknown.

Conclusions: This work provides a powerful, open-source tool for large-scale analysis of 16S rRNA samples. The incremental clustering strategy overcomes the scalability limitations of traditional methods, allowing researchers to leverage public data and discover novel microbes in their own microbiome samples.
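
The incremental strategy described in the Methods (map to existing centroids, cluster the unmapped sequences, merge the new centroids) can be illustrated with a minimal Python sketch. This is not the authors' Snakemake pipeline. The function names (`identity`, `best_centroid`, `add_sample`), the `OTU_<n>` naming scheme, the greedy clustering of unmapped reads, and the 0.97 default threshold are all assumptions made for illustration, and the position-wise identity is a toy stand-in for the alignment-based similarity a real tool would compute.

```python
from typing import Dict, List, Optional, Tuple


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): matching positions over the longer length.
    A real pipeline would use an alignment-based identity instead."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def best_centroid(seq: str, centroids: Dict[str, str],
                  threshold: float) -> Optional[str]:
    """Return the ID of the most similar centroid, or None if no centroid
    reaches the identity threshold."""
    best_id, best_score = None, -1.0
    for cid, cseq in centroids.items():
        score = identity(seq, cseq)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id if best_score >= threshold else None


def add_sample(sample: List[str], centroids: Dict[str, str],
               threshold: float = 0.97) -> List[Tuple[str, str]]:
    """Process one sample incrementally and update `centroids` in place.
    Returns (sequence, cluster ID) pairs: mapped reads first, then novel ones."""
    assignments: List[Tuple[str, str]] = []
    unmapped: List[str] = []

    # Step 1: map every sequence against the existing centroids.
    for seq in sample:
        cid = best_centroid(seq, centroids, threshold)
        if cid is None:
            unmapped.append(seq)
        else:
            assignments.append((seq, cid))

    # Step 2: cluster the unmapped sequences on their own (greedy, in input
    # order); each sequence matching no novel centroid founds a new cluster.
    novel: Dict[str, str] = {}
    next_id = len(centroids)
    for seq in unmapped:
        cid = best_centroid(seq, novel, threshold)
        if cid is None:
            cid = f"OTU_{next_id}"  # hypothetical naming scheme
            next_id += 1
            novel[cid] = seq
        assignments.append((seq, cid))

    # Step 3: merge the novel centroids into the database, so later samples
    # map against them without re-clustering earlier data.
    centroids.update(novel)
    return assignments


if __name__ == "__main__":
    db = {"OTU_0": "ACGTACGTAC"}  # stand-in for the reference database
    print(add_sample(["ACGTACGTAC", "TTTTGGGGCC", "TTTTGGGGCA"], db, 0.9))
    print(add_sample(["TTTTGGGGCC"], db, 0.9))
    print(len(db))
```

In this sketch each new sample is compared only against the current centroids, never against all previously processed reads, which is what lets the database grow without an all-at-once re-clustering.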
Matching journals
The top 5 journals account for 50% of the predicted probability mass.