
Microbiome

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Microbiome's content profile, based on 139 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is already an above-average fit.

1
MetaGEAR Explorer: Rapid interactive searches and cross-cohort analyses of microbiome gene associations in disease

Rios, E.; Jin, S.; Zhang, C.; Neuhaus, F.; He, X.; Weissenberger, S.; Schirmer, M.

2026-03-31 bioinformatics 10.64898/2026.03.30.715271 medRxiv
Top 0.1%
33.2%

The human gut microbiome has been linked to inflammatory bowel disease (IBD) and colorectal cancer (CRC), yet identifying disease-associated microbial genes across diverse human cohort studies remains challenging due to inconsistent data processing and the high dimensionality of gene-level abundance profiles. Here we present MetaGEAR Explorer, a web platform comprising a user interface and web services for interactive and programmatic gene-centric exploration of >33 million microbial gene families across 9,053 metagenomic samples from 24 IBD, CRC, and healthy cohorts. MetaGEAR Explorer facilitates gene searches against a catalog of non-redundant gene families via nucleotide or amino acid sequence queries (BLAST) and Pfam domain-based searches. For matched gene families, the platform computes disease-stratified prevalence, cross-cohort disease associations, species-level taxonomic stratification, and functional domain annotations. Importantly, users can also explore the genomic context of individual gene families via contig-based co-localization networks derived from metagenomic species pangenome (MSP) assignments and pivot from sequence to domain searches to identify functional homologs. Additionally, the platform features a dedicated catalog to interactively browse 13,795 MSPs and export results programmatically via API endpoints. We demonstrate MetaGEAR Explorer's utility using the narG-encoding nitrate reductase gene and a case study of colibactin self-protection genes (clbS and DUF1706 homologs), where the platform revealed a consistent shift from commensals to Gammaproteobacteria carriers in disease. In summary, MetaGEAR Explorer enables rapid cross-cohort functional meta-analyses and is freely available at https://metagear-explorer.schirmerlab.de.

2
ZeaMiC: a Publicly Available Culture Collection of Maize Root-Associated Bacteria

Garrell, A.-K.; Ginnan, N.; Swift, J. F.; Pal, G.; Zervas, A.; Pestalozzi, C.; Tang, C.; Tso, F.; Ford, N. E.; Niu, B.; Castrillo, G.; Schlaeppi, K.; Hahnke, R. L.; Wagner, M. R.; Kleiner, M.

2026-03-24 microbiology 10.64898/2026.03.23.713778 medRxiv
Top 0.1%
32.0%

Plant-associated microbiota are composed of hundreds of microbial species. For many of them, little is known about their individual functions and even less is known about their emergent community-level traits. While culture-independent methods provide valuable insights into the composition, diversity, and functional potential of plant-associated microbiota, culture-dependent methods are essential for reductionist lines of inquiry into the roles of individual species and their interactions within a community. Here, we present ZeaMiC, a publicly available culture collection of root-associated bacteria from Zea mays (maize). This resource comprises 88 isolates obtained from diverse soils and several maize genotypes, with live cultures available through DSMZ (German Collection of Microorganisms and Cell Cultures) both as single stocks and as cost-effective bundles (https://www.dsmz.de/collection/catalogue/microorganisms/microbiota/zeamic). To maximize relevance, isolates were selected to be representative of maize root-associated microbiomes in the Corn Belt of the United States, based on abundance-occupancy patterns from previously published root microbiome data, phylogenetic diversity, and literature-based evidence of functional importance. Whole-genome sequencing and annotation revealed genes associated with root colonization, plant growth promotion, and nutrient cycling, including functions such as chemotaxis, biofilm formation, secretion systems, hormone modulation, and phosphate solubilization. This collection serves as a community resource for future mechanistic studies of plant-microbe and microbe-microbe interactions, filling the gap in our understanding of the ecological interactions in plant microbiomes.

3
Disentangling Production and Persistence of Extracellular Virions in Grassland Soils with SIP-Viromics

Trubl, G.; Roux, S.; Kellom, M.; Vyshenska, D.; Tomatsu, A.; Singh, K.; Kimbrel, J.; Eloe-Fadrosh, E. A.; Malmstrom, R. R.; Pett-Ridge, J.; Blazewicz, S. J.

2026-05-15 microbiology 10.1101/2025.05.25.655894 medRxiv
Top 0.1%
31.9%

Viruses are abundant and ecologically important in soils, yet the persistence and production dynamics of extracellular virions remain poorly understood. We applied a genome-resolved stable isotope probing viromics (SIP-viromics) approach, combining H₂¹⁸O labeling with viral metagenomics, to track virion turnover in seasonally dry grassland soils following rewetting. We identified 354 viral populations (vOTUs) using individual-sample and combined metagenome assemblies. Only 22% of vOTUs exhibited significant ¹⁸O enrichment, indicating active replication and new virion production during the 1-week incubation; the majority (78%) persisted without detectable replication, consistent with a viral seed bank. Active vOTUs accounted for 4.76-5.15% of total virions per gram of soil, with viral loads ranging from 3.15 × 10¹⁰ to 6.59 × 10¹⁰ virions per gram. Probabilistic and deterministic sensitivity analyses spanning viral DNA fraction and genome length reinforced that persistent virions represented the majority of the extracellular viral pool post-wet-up, regardless of parameter assumptions. Host predictions linked both active and persistent vOTUs primarily to Actinomycetota and Pseudomonadota, bacterial groups known to rapidly resuscitate following rewetting, suggesting that some viruses exhibit rapid turnover while others persist over longer timescales, forming a stable viral pool capable of reinitiating infections during favorable conditions. These results demonstrate that SIP-viromics can distinguish newly produced from persistent virions and reveal host-associated patterns of lytic infection and virion production. Our findings advance understanding of soil virus-host interactions and highlight the ecological role of persistent virions as a genetic reservoir contributing to microbial turnover and biogeochemical cycling following environmental disturbance.
Importance: Understanding the persistence and production dynamics of soil viruses is critical for elucidating their roles in microbial community dynamics and nutrient cycling, yet these processes have remained largely uncharacterized due to methodological limitations. By integrating stable isotope probing with viromics, this study provides a robust framework for directly distinguishing newly produced from persistent virions in situ. Unlike conventional viromics, which only catalogs viral diversity, SIP-viromics enables quantification of active viral replication and persistence under natural soil conditions. Our results demonstrate that most virions in a seasonally dry soil persisted through a rewetting event, with active replication limited to a minority of viral populations. Persistent virions were primarily linked to dominant bacterial groups, indicating that host ecophysiology and environmental stability strongly influence lytic infection. Collectively, these findings highlight viruses as long-term reservoirs of genetic material, capable of shaping microbial dynamics and ecosystem processes over time. This work establishes SIP-viromics as a powerful approach for studying virus-host interactions and their ecological significance in terrestrial environments.

4
Who Infects Whom? Exploiting Bacterial Minicells for Targeted Virome Enrichment and Phage-Host Interaction Analysis through an Integrated Metagenomic Approach

Pedramfar, A.; Ensenat, E.; Allcock, N. S.; Millard, A. D.; Galyov, E. E.

2026-04-09 microbiology 10.64898/2026.04.08.717211 medRxiv
Top 0.1%
23.3%

Linking bacteriophages (phages) to their hosts remains a fundamental challenge to understanding microbial ecology, viral evolution, and horizontal gene transfer. Although phages are the most abundant biological entities on Earth, the majority of them remain uncharacterized due to the lack of efficient host-linking approaches. Traditional methods, such as plaque assays, have significant limitations as they depend on visible lysis and therefore fail to detect phages that do not form plaques. Conversely, shotgun metagenomics can recover viral genomes directly from environmental samples; however, it cannot directly link phages to their bacterial hosts. In this study, we addressed this limitation by tackling the critical question of "who infects whom?" through the development of a novel, culture-independent approach that utilises an anucleate bacterial minicells-based platform to enrich for phages capable of infecting a target bacterial host. To validate our approach, purified Escherichia coli minicells were exposed to a concentrated viral fraction derived from sewage samples. Genomic DNA from phages that successfully infected and interacted with the E. coli minicells was isolated, amplified, and sequenced. Metagenomic analysis revealed a distinct E. coli-specific virome, including several putatively novel phage species and genera. This platform effectively bridges the gap between culture-dependent and metagenomic methods, providing a scalable, host-targeted tool for identifying phage-host pairs. Our approach also opens new opportunities for studying phage-host interaction networks in complex microbial ecosystems and enhances our ability to investigate viral diversity, host specificity, and the ecological roles of phages in natural environments.

5
Benchmarking strain-level profiling of Escherichia coli in short-read gut metagenomes

Galbraith, M.; Williams, D.; Shaw, L. P.; Lipworth, S.; Stoesser, N.

2026-05-19 bioinformatics 10.64898/2026.05.19.726160 medRxiv
Top 0.1%
22.6%

Metagenomes offer the potential to characterise Escherichia coli strain-level diversity within the human gut microbiome, informing our understanding of colonisation diversity and the genetic features distinguishing infection from carriage. Among numerous reference-based tools for short-read metagenomic strain-level profiling, the best approach remains unclear. Here, we benchmarked six published tools (PanTax, PathoScope, StrainGE, Strainify, StrainR2 and StrainScan) for their ability to detect co-existing strains of E. coli and estimate their relative abundance across real and simulated metagenomes of increasing complexity with varying reference database composition. In the ZymoBIOMICS® D6331 dataset, only PanTax achieved zero error when predicting the equal abundance of five E. coli strains. In a differentially abundant four-strain mock community dataset (SRR13355226), StrainScan had the lowest mean absolute proportional error (0.89), driven by reduced sensitivity (0.5), followed by PathoScope (4.08). Across simulated metagenomes reflecting the healthy adult gut microbiome, all tools demonstrated high sensitivity (≥0.833), but specificity, precision and F1 score were selectively improved in some tools through detection thresholds to remove low-abundance false positives. Outright, StrainGE achieved the highest F1 score (0.978). Predicted relative abundances of the E. coli K12-MG1655 (phylogroup A) and O157:H7 Sakai (phylogroup E) strains spiked into simulated metagenomes across varying abundance ratios were generally accurate, with PanTax and StrainR2 showing the lowest mean absolute proportional error (0.06). When truly present strains were removed from the reference database, out-of-phylogroup assignments were observed for some tools. Collectively, our results demonstrate that published metagenomic strain-level profiling tools vary in their ability to profile E. coli strains, indicating that method selection should be guided by intended application.
These findings will facilitate characterisation of E. coli strain-level diversity within short-read gut metagenomes with greater accuracy than previously possible.
Impact statement: Strain-level diversity within the human gut microbiome can be important for human health, with species such as Escherichia coli existing as both commensal and pathogenic strains. Most existing gut microbiome datasets are from short-read (i.e., Illumina) sequencing, and numerous bioinformatic tools have been developed to profile strain-level variation from these data. However, the existing literature is often difficult to navigate given that the available tools have been benchmarked in various ways and are subject to author bias. This is, to our knowledge, the first independent benchmarking of six published tools for profiling E. coli at strain-level resolution from short-read metagenomes. Using both real and simulated datasets of increasing complexity, we demonstrate substantial variation in tool performance in terms of strain detection and relative abundance estimation, highlighting that tool choice should be guided by the specific research question, as no single method performs optimally across all scenarios. This work provides an unbiased framework for tool selection and will support more accurate and reproducible E. coli strain-level analyses in gut microbiome research from short-read metagenomic data.
Data summary: The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. Supplementary methods, six supplementary tables and four supplementary figures are available in the online Supplementary Material. Code for simulating metagenomes using InSilicoSeq, SLURM job scripts for the simulated metagenomes dataset and R visualization and statistical analysis scripts are available within a dedicated public GitHub repository (https://github.com/mattgal11/benchmarking_short_read_strain_profilers).
The following supplementary data are available on FigShare (https://doi.org/10.6084/m9.figshare.32125474):
- Normalised per-contig relative abundances for 98 species assemblies used to construct the baseline gut microbiome profile for InSilicoSeq metagenome simulation (Normalised_relative_abundance_for_InSilicoSeq_simulated_metagenomes_gut_microbiome_profile.csv)
- ZymoBIOMICS® D6331 gut microbiome standard dataset predicted relative abundance data (Zymobiomics_D6331_raw_predicted_abundance.csv)
- SRR13355226 mock community (99% human reads; 1% E. coli reads) paired-end reads with human reads depleted (SRR13355226_depleted_R1.fastq.gz & SRR13355226_depleted_R2.fastq.gz)
- SRR13355226 mock community dataset raw predicted abundance data, with and without human read removal (SRR13355226_raw_predicted_abundance_with_and_without_human_read_removal.csv)
- Simulated metagenomes dataset raw call types and detection metric values with increasing detection thresholds (Simulated_metagenomes_raw_call_type_assingments_and_detection_thresholds.csv)
- Simulated metagenomes dataset (all references) predicted relative abundance data (Simulated_metagenomes_all_references_raw_predicted_abundances.csv)
- Simulated metagenomes dataset (all references) mapped reads for PathoScope and Strainify (all_refs_pathoscope_reads_mapped.csv & all_refs_strainify_reads_mapped.csv)
- Simulated metagenomes dataset (reduced reference database) predicted relative abundance data (Simulated_metagenomes_K12_and_Sakai_removed_from_reference_database_raw_predicted_abundance.csv)
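The detection metrics named in this abstract (sensitivity, precision, F1 score) and the idea of a detection threshold can be illustrated with a small sketch. This is not the authors' code: the function names, the strain labels, and the 1% threshold are hypothetical, and the standard set-based definitions of the metrics are assumed.

```python
def apply_threshold(abundances, min_abundance):
    """Keep only strains whose predicted relative abundance meets the threshold.

    `abundances` maps a strain name to its predicted relative abundance (0-1).
    The threshold value is an illustrative assumption, not one from the preprint.
    """
    return {s for s, a in abundances.items() if a >= min_abundance}


def detection_metrics(true_strains, predicted_strains):
    """Compute sensitivity, precision and F1 from sets of strain names."""
    true_set, pred_set = set(true_strains), set(predicted_strains)
    tp = len(true_set & pred_set)   # strains correctly detected
    fp = len(pred_set - true_set)   # strains reported but absent
    fn = len(true_set - pred_set)   # strains present but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# Hypothetical example: two true strains plus one low-abundance false positive.
truth = {"K12-MG1655", "O157:H7 Sakai"}
predicted = {"K12-MG1655": 0.60, "O157:H7 Sakai": 0.39, "strain_X": 0.01}

raw = detection_metrics(truth, predicted.keys())
filtered = detection_metrics(truth, apply_threshold(predicted, 0.05))
```

In this toy case the threshold removes the false positive without losing a true strain, so precision and F1 rise while sensitivity is unchanged. A threshold set too high would instead lower sensitivity.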

6
Micro16S: Universal Phylogenetic 16S rRNA Gene Representations for Deep Learning of the Microbiome

Bishop, H. V.; Ogilvie, O. J.; Dobson, R. C. J.; Herbold, C. W.

2026-03-24 bioinformatics 10.64898/2026.03.21.713432 medRxiv
Top 0.1%
22.3%

Existing self-supervised microbiome models represent taxa as discrete, independent units restricted to fixed vocabularies, disregarding their evolutionary context. Here we present Micro16S, a deep learning approach that embeds 16S ribosomal RNA gene sequences into a continuous vector space according to phylogenetic relationships derived from the Genome Taxonomy Database. Using a combination of triplet and pair loss objectives, the model learns representations where spatial proximity reflects phylogenetic relatedness, while remaining largely invariant to the specific 16S rRNA region. Evaluations demonstrate taxonomically coherent clustering across most ranks and substantially improved region invariance compared to k-mer frequency baselines. A transformer pretrained on 50,418 unlabelled gut microbiome samples using these embeddings captured biologically meaningful community structure, though classical machine learning baselines outperformed Micro16S across six benchmark classification tasks, highlighting the limitations of the current system. These results establish the feasibility of phylogenetic embeddings for microbiome deep learning and identify mining algorithm design and class imbalance as primary targets for future improvement.
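The abstract mentions a triplet loss objective but does not give its exact form. As a minimal sketch, assuming the standard margin-based triplet loss with Euclidean distance (the margin value and the toy embeddings are illustrative, not taken from Micro16S):

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin triplet loss (an assumed form, not Micro16S's exact loss).

    The loss is zero once the anchor is closer to the positive (a close
    phylogenetic relative) than to the negative (a distant one) by at
    least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings: the relative sits near the anchor, the distant taxon far away.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Swapping positive and negative yields a violated triplet with positive loss.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimising this quantity over many triplets pulls related sequences together and pushes unrelated ones apart, which is the sense in which spatial proximity can come to reflect phylogenetic relatedness.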

7
StrataBionn: a neural network supervised classification method for microbial communities

Symons, A. E.; Huynh, A. V.; Cornejo, O. E.

2026-04-02 genomics 10.64898/2026.03.31.715659 medRxiv
Top 0.1%
22.0%

The classification of microbial communities into discrete states or "community state types" (CSTs) is fundamental to understanding host-microbiome interactions and their clinical implications. Traditional methods, such as nearest-neighbor approaches, often struggle with the inherent noise, high dimensionality, and non-linear signatures of taxonomic profiles. We present a novel supervised framework for microbial community classification, leveraging an Artificial Neural Network (ANN) architecture implemented in a new tool we named StrataBionn. We rigorously evaluated our approach using large-scale vaginal microbiome datasets, directly benchmarking performance against VALENCIA and a Random Forest (RF) classifier. To demonstrate the versatility of our models, we further extended the framework to oral microbiome classification, assessing its stability across diverse anatomical sites. Our supervised models consistently outperformed the nearest-neighbor approach across all evaluated datasets. In the vaginal microbiome, our method achieved an 11.6% to 13.3% increase in performance across all primary metrics, including precision, recall, accuracy, and F1-score. Furthermore, we demonstrate that this performance advantage is maintained in the oral microbiome, highlighting the generalizability of our neural network and ensemble strategies to various microbial ecosystems without the need for niche-specific algorithmic adjustments. By capturing complex feature dependencies that distance-based methods overlook, our approach provides a more robust and accurate census of microbial community structures. StrataBionn's ability to learn classification schemes for any microbiome with high accuracy and explainability, through the use of provided utilities to visualize feature-space classification boundaries and perform perturbation analysis on trained classifiers, makes it ideal for broad application in microecology research.
This framework offers a scalable, high-performance alternative for microbiome researchers, facilitating more precise clinical stratification and biological insights across host body sites.

8
MATRIX: Rapid Quantification of Total and Active Microbial Cells with Single Cell Phenotypes for Environmental Microbiomes

Gonzalo, M.; Liu, X.; Dufour, Y. S.; Shade, A.

2026-03-18 microbiology 10.64898/2026.03.16.712149 medRxiv
Top 0.1%
21.8%

Quantifying the abundance and activity of bacteria within populations and communities is fundamental to systems microbiology and microbiome research. Yet direct microscopic cell counting remains low-throughput, labor-intensive, and prone to user variability, leading many researchers to rely on indirect proxies such as optical density or multicopy marker-gene quantification. These indirect approaches do not distinguish between active and inactive cells and can obscure ecological interpretation. Here, we introduce MATRIX (Microbial Activity and Total cell quantification via Rapid Imaging and eXtraction), an efficient workflow that integrates sample extraction, fluorescence staining, automated microscopy and image analysis, and Bayesian statistical inference to quantify total and redox-active cells and derive single-cell measurements for environmental microbial populations and communities. We demonstrate its reproducibility and versatility using both cultured isolates and high-diversity soil communities. The resulting quantitative, phenotypic datasets provide rapid, direct measurements of population or community size and activity, enabling well-powered analyses that strengthen mechanistic insight into microbial responses and improve the ecological grounding of microbiome studies.
Importance: Microbiome studies commonly rely on relative abundance data, which cannot distinguish whether compositional shifts reflect true population growth, declines in total community size, or both. Without explicit measurements of population and community sizes, mechanistic interpretation of microbiome dynamics remains incomplete. Here we present a rapid, high-throughput workflow, MATRIX, that quantifies both total and redox-active bacterial cells from environmental samples. By integrating single-cell phenotypes with community-level metrics, this approach anchors microbiome datasets in direct ecological accounting rather than proxies.
These measurements can clarify whether observed changes in community structure represent shifts in abundance, activity, or both, improving inference about microbial responses to stress or environmental change. MATRIX therefore offers an efficient way to incorporate quantitative ecology into systems-microbiology and microbiome studies and to strengthen the link between microbial cellular physiology, community dynamics, and ecosystem function.

9
A widespread gut bacterial lineage distinguished by redox metabolism and phage defense

Noecker, C.; Guo, L.; Date, C.; Rai, N.; Daramy, F.; Ramirez Hernandez, L. A.; Kyaw, T. S.; Trepka, K. R.; Gupta, C. L.; Ha, C. W. Y.; Babdor, J.; Spitzer, M. H.; Turnbaugh, P. J.

2026-04-01 microbiology 10.64898/2026.03.31.715625 medRxiv
Top 0.1%
21.7%

Genomic variation within gut microbial species can have consequences for host health and disease. However, for low-abundance species, these variations can be difficult to capture by both culture-dependent and -independent approaches. Here, we focus on the prevalent but low-abundance gut Actinomycetota Eggerthella lenta. We developed a selective medium for sensitive and specific isolation of E. lenta from human stool. Genomes from 87 new E. lenta isolates were combined with prior high-quality assemblies, shedding light on within-species functional diversity. Phylogenetic analysis revealed a broadly distributed subclade, which we refer to as E. lenta Group B. This lineage was differentiated by its metabolic potential and bacteriophage defense, though mobile elements were shared broadly across the species. Notably, Group B was positively associated with intestinal inflammation in subjects with inflammatory bowel disease. Overall, these results emphasize the importance of bacterial population structure in host-microbiome interactions and provide a framework to study low-abundance gut taxa.
Highlights:
- Selective media enables E. lenta isolation and reveals high prevalence in humans
- Discovery of a distinctive lineage within E. lenta undergoing genome reduction
- E. lenta Group B has altered metabolism, phage defense, and disease associations
- A widespread conjugative plasmid could enable improved genetics

10
WasteFams: A database of protein families from global wastewater microbiomes

Galaras, A.; Chasapi, I. N.; Aplakidou, E.; Chasapi, M. N.; Lamari, E.; Diplari, S.; Georgakopoulos-Soares, I.; Karatzas, E.; Baltoumas, F. A.; Kyrpides, N.; Pavlopoulos, G.

2026-05-12 bioinformatics 10.64898/2026.05.08.723720 medRxiv
Top 0.1%
19.0%

Wastewater surveillance has emerged as a critical tool for global epidemiology, yet the functional diversity of wastewater microbiomes remains poorly characterized at the protein level. Here, we present WasteFams, the first comprehensive database dedicated to the systematic exploration of protein families in wastewater metagenomic and metatranscriptomic studies worldwide. Integrating data from 580 metagenomes, 132 metatranscriptomes, and 1,709 reference genomes, WasteFams catalogs 3,887 non-redundant protein families (containing ≥100 members) derived from over 105 million predicted proteins. Each protein family is enriched with multi-layered annotations, including AlphaFold3 structural predictions, taxonomic classifications, and biome-specific metadata. To further expand their functional annotation, we integrated deep genomic context analysis to link protein families to Mobile Genetic Elements (MGEs), Biosynthetic Gene Clusters (BGCs), Antibiotic Resistance Genes (ARGs), and CRISPR elements. Accessible through the EnvoFams portal, WasteFams provides a user-friendly interface featuring advanced search capabilities, sequence and structural similarity tools, and interactive visualization modules. As global initiatives increasingly leverage wastewater for public health and environmental insights, WasteFams can serve as a critical resource for discovering novel microbial functions, monitoring resistance mechanisms, and exploring the biotechnological potential of secondary metabolites within wastewater-engineered ecosystems.

11
Mitag4taxa: Extracting SSU rRNA Illumina reads from metagenomes for taxonomic classification

He, Y.; Du, Y.; Nguyen, L.; Wang, Y.

2026-05-05 bioinformatics 10.64898/2026.05.01.722230 medRxiv
Top 0.1%
18.5%

The prevailing taxonomic profiling methods for environmental samples rely heavily on PCR amplification of SSU ribosomal RNA (rRNA) genes and genome-based reference databases. Extracting SSU rRNA fragments from Illumina metagenomic sequencing data is PCR-independent, but recognizing those fragments is technically challenging. Here we present Mitag4taxa, a computational pipeline designed for taxonomic profiling of microbial communities from metagenomic Illumina sequencing reads containing rRNA tags (mitag). Hidden Markov Models (HMMs) for SSU rRNA genes, for the V4 region of 16S rRNA genes, and for the V9 region of 18S rRNA genes were built from representative sequences of different families and the corresponding hypervariable regions in the SILVA database. The pipeline identifies and extracts 16S and 18S rRNA gene fragments, along with their quality scores, from metagenomic or metatranscriptomic datasets using an HMM search with these models. The hypervariable regions, including the V4 region of 16S rRNA and the V9 region of 18S rRNA genes, can be further scanned and recruited for taxonomic classification and biodiversity estimation. To demonstrate its reliability, the performance of Mitag4taxa was evaluated using both real and simulated datasets. In human gut metagenomic assessments, taxonomic profiles derived from Mitag4taxa showed high consistency with those based on conventional 16S rRNA gene amplicons, identifying dominant families such as Bacteroidaceae and Prevotellaceae with similar relative abundances. Statistical analyses confirmed highly significant positive correlations between Mitag4taxa and amplicon-based community structures. The 18S V9 module was further validated using shotgun metagenomic data from deep-sea sediment cores, successfully recovering key eukaryotic taxa such as Collodaria and Leotiomycetes.
Furthermore, benchmarking against the RiboTagger software using CAMI marine simulated datasets revealed that Mitag4taxa achieved a higher average F1 score and lower error metrics. Overall, Mitag4taxa provides a complementary rRNA gene amplicon- and genome-independent strategy for microbial community profiling, enabling improved detection of both prokaryotic and eukaryotic taxa from metagenomic and metatranscriptomic sequencing data.

12
Meta16S: large-scale discovery and taxonomic assignment of unknown microbes from 16S amplicon sequencing samples

Cumbo, F.; Felici, G.; Blankenberg, D.; Valeriani, F.; Romano Spica, V.; Santoni, D.

2026-05-20 microbiology 10.64898/2026.05.19.726236 medRxiv
Top 0.1%
18.3%

Background: The exponential growth of public metagenomic datasets offers an unprecedented opportunity to explore microbial diversity. However, analyzing this vast amount of data presents significant computational challenges. While shotgun metagenomics provides deep functional and taxonomic resolution, its high cost still limits its application. On the other hand, 16S rRNA gene sequencing remains a cost-effective and widely used alternative, but tools are needed to maximize its discovery potential. Traditional clustering is not scalable, obstructing the creation of a comprehensive and continuously updated catalog of microbial life from 16S data.
Methods: We developed a reproducible and scalable Snakemake pipeline for the incremental clustering of 16S rRNA amplicons. The workflow begins by constructing a reference database from bacterial and archaeal genomes. It then processes 16S rRNA samples sequentially. For each new sample, sequences are first mapped against the existing cluster centroids. Sequences that match known centroids are assigned accordingly, while unmapped sequences are clustered independently to form novel operational taxonomic units (OTUs). These new centroids are then merged with the existing database, allowing it to grow dynamically without the need for computationally prohibitive all-at-once re-clustering.
Results: Our pipeline enables the efficient and continuous expansion of a 16S rRNA cluster database. By processing a large corpus of public 16S rRNA samples, we generated a comprehensive atlas of tens of thousands of OTUs. A significant fraction of these clusters, particularly at the genus and family levels, were classified as unknown.
Conclusions: This work provides a powerful, open-source tool for large-scale analysis of 16S rRNA samples. The incremental clustering strategy overcomes the scalability limitations of traditional methods, allowing researchers to leverage public data and discover novel microbes in their own microbiome samples.
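The incremental strategy described in the Methods (map to existing centroids, cluster the unmapped sequences on their own, then merge) can be sketched in a few lines. This is a toy illustration, not the Meta16S pipeline: the position-wise identity function, the greedy clustering step, and the 0.9 threshold are all assumptions standing in for whatever alignment and clustering tools the authors use.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the longer length.

    A real pipeline would use an alignment-based identity; this is a stand-in.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def best_match(seq, centroids, threshold):
    """Index of the most similar centroid at or above `threshold`, else None."""
    best_idx, best_score = None, threshold
    for idx, centroid in enumerate(centroids):
        score = identity(seq, centroid)
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx


def add_sample(centroids, sample, threshold=0.9):
    """Process one sample and return the grown centroid list."""
    # Step 1: map every sequence against the existing centroids.
    unmapped = [s for s in sample if best_match(s, centroids, threshold) is None]
    # Step 2: cluster only the unmapped sequences, independently (greedy).
    new_centroids = []
    for s in unmapped:
        if best_match(s, new_centroids, threshold) is None:
            new_centroids.append(s)
    # Step 3: merge the novel centroids into the database.
    return centroids + new_centroids


db = ["AAAAAAAAAA"]
db = add_sample(db, ["AAAAAAAAAA", "CCCCCCCCCC", "CCCCCCCCCC"])
db = add_sample(db, ["CCCCCCCCCG", "GGGGGGGGGG"])
```

Each sample is compared only against the current centroids plus its own unmapped remainder, so earlier samples never need to be re-clustered as the database grows.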

13
16S rRNA k-mer composition encodes microbial functional potential

Liu, J.; De Paolis Klauza, M. C.; Bromberg, Y.

2026-04-18 bioinformatics 10.64898/2026.04.16.718937 medRxiv
Top 0.1%
18.1%

16S rRNA amplicon sequencing is widely used for microbiome profiling, but most methods rely on reference databases of characterized organisms, limiting its accuracy in function prediction for underrepresented environments. We discovered that 16S rRNA k-mer composition carries substantial functional signal: (i) whole-genome k-mer profiles predict genome-encoded functions, and (ii) 16S rRNA k-mer profiles reflect their source genomes' composition. Building on these relationships, we developed embeRNA, a neural network framework that predicts functions directly from 16S rRNA k-mer embeddings without requiring taxonomy assignment or phylogenetic placement. embeRNA outputs per-function probability scores, enabling users to tune decision thresholds to balance precision and recall or account for community novelty. In a stringent "novel microbes" benchmark - where all test sequences shared <97% identity with training data - embeRNA outperformed reference-based methods, particularly for hard-to-label functions. Applied to soil metagenomes with paired 16S and whole metagenome shotgun sequencing (WMS) data, embeRNA recovered most WMS-inferred functions and produced abundance profiles strongly correlated with WMS results, attaining better performance than a reference-based approach. Our findings demonstrate that 16S rRNA directly captures functional potential, and 16S amplicon sequencing data can complement WMS-based inference to broaden functional characterization of microbiomes, especially in understudied environments.
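To make "k-mer composition" concrete, here is a minimal sketch of turning a sequence into a normalised k-mer frequency vector. The choice of k, the handling of ambiguous bases, and the function name are assumptions for illustration. The abstract does not state how embeRNA builds its k-mer embeddings.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Return relative frequencies of all 4**k DNA k-mers, in sorted ACGT order.

    k-mers containing characters other than A, C, G, T (e.g. N) are ignored.
    The value of k is an illustrative assumption.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# "ACGT" contains the 2-mers AC, CG and GT, each once.
profile = kmer_profile("ACGT", k=2)
```

A vector like this has a fixed length regardless of sequence length, which is what lets it serve as input to a classifier without first assigning taxonomy.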

14
Genome-resolved metagenomics reveals abundant novel, non-methanogenic lineages in tropical alpine paramo soils

Betancurt Anzola, D.; Vanegas, J.; Jurgensen, S.; Wrighton, K. C.; Santamaria Vanegas, J.; Couradeau, E.

2026-04-29 microbiology 10.64898/2026.04.28.721461 medRxiv
Top 0.1%
18.1%

Wetlands are the largest natural source of atmospheric methane, yet tropical high-altitude wetlands remain underrepresented in global climate frameworks. Here, we investigated soil metagenomes from the paramo ecosystem in Chingaza National Natural Park (Colombia) across three vegetation-defined ecosites. Microbial community composition differed significantly among ecosites, with peatland soils exhibiting the highest diversity. Genome-resolved metagenomics recovered 109 high-quality metagenome-assembled genomes (MAGs), of which 37.6% represent phylogenetically novel lineages absent from current genomic databases. These novel taxa were not restricted to the rare biosphere but comprised abundant members of the reconstructed community, highlighting a substantial gap in global microbial reference frameworks. Functional analyses revealed widespread carbon fixation potential via the Wood-Ljungdahl pathway and complementary bacterial pathways, but no evidence of methanogenesis: genes encoding the methyl-coenzyme M reductase complex (mcrABG) were absent across all MAGs. Instead, metabolic potential was consistent with acetogenic carbon fixation coupled to sulfate reduction, suggesting an alternative carbon cycling regime relative to canonical methane-producing wetlands. Together, these results identify the tropical alpine paramo as a reservoir of abundant and phylogenetically novel microbial diversity with distinct metabolic potential. Incorporating these lineages into global databases will be essential for improving predictions of carbon cycling in underrepresented high-altitude ecosystems.

15
Evolutionary Histories and Environments Shape Ugandan and Global Oral Microbiomes

Ademola-Popoola, I. J.; Grogen, K. E.; Abdul-Aziz, M. A.; Ta, C. K.; Tang, K.; Blekhman, R.; Barreiro, L. S.; Perry, G. H.; Weyrich, L. S.

2026-05-22 evolutionary biology 10.64898/2026.05.20.726600 medRxiv
Top 0.1%
17.5%

Industrialization has been identified as the single biggest factor driving global microbiome diversity. While many studies examining gut microbiomes attribute these shifts to dietary increases in fat and reductions in protein, oral microbiome responses to industrialization remain debated. The oral microbiome is more resilient due to long-standing coevolution with host tissues and biofilm stability. However, limited geographic and historical representation has constrained our understanding of how these transitions unfolded globally in oral microbiomes. Here, we investigate oral microbiome variation in Batwa rainforest hunter-gatherers and neighboring Bakiga subsistence farmers from southwestern Uganda, comparing them with publicly available data from Tanzanian, Venezuelan, and industrialized populations from North America, Europe, and Australia. Using 16S rRNA gene sequencing, we characterized salivary microbiota and evaluated differences in local and global diversity, composition, and differential abundance. Ugandan populations showed significant compositional differences but similar levels of diversity, suggesting that shared environments and dietary overlap may shape microbial assemblages despite distinct cultural histories. Globally, strong continental and industrialization effects were observed in the oral microbiome, with all industrial populations clustering separately from people living in other locations. African populations also clustered separately from non-African groups. Oral microbiome diversity was highest in Ugandan individuals and lowest in industrialized populations, mirroring patterns previously observed in the gut microbiome. Together, these findings demonstrate that both geography and subsistence strategy structure global oral microbiome variation. 
They also clarify the position that oral microbial communities record biocultural transitions and highlight the need to better understand the industrial mechanisms that shape microbial diversity in the oral cavity.

16
A global metagenomic atlas uncovers ubiquitous biosynthetic potential linked to adaptation in extreme environments

Du, R.; He, R.; Qi, Q.; Li, Z.; Tang, Q.; Zhang, Z.; Xu, X.; Peng, H.; Liu, J.; Medema, M. H.; Xu, Q.

2026-04-20 microbiology 10.64898/2026.04.17.719132 medRxiv
Top 0.1%
17.2%

Extreme environments impose severe physicochemical stresses that drive microorganisms to evolve specialized survival strategies. Microbial secondary metabolites encoded by biosynthetic gene clusters (BGCs) are recognized as important mediators of microbial adaptation to environmental stress. However, their ecological roles, particularly habitat-dependent preferences across different environments, remain poorly understood. Although extreme environments provide opportunities to mine microbiomes for unique adaptations, such research is hampered by the lack of a systematic overview of their genomic diversity, BGC diversity, and the relationships between them. Here, we constructed a standardized extremophilic genomic catalogue (SEGC) from 1,462 metagenomic samples spanning seven representative extreme habitats. The catalogue comprised 54,661 metagenome-assembled genomes (MAGs) representing 21,805 species, 66.1% of which were previously uncharacterized. With this catalogue, we identified 162,855 BGCs distributed across 81.5% of MAGs. Gene cluster family analysis showed strong habitat dependence, largely explained by species-level habitat specificity. Terpene biosynthetic pathways illustrated habitat-linked adaptive strategies, with hopan-22-ol biosynthesis enriched in acid mine, deep sea and hydrothermal plume environments, while retinal-based phototrophy predominated in cryosphere and saline-alkaline habitats. Metatranscriptomic analyses supported in situ activity of these pathways. In conclusion, we present a global atlas of biosynthetic potential across extreme-environment microbiota and reveal habitat-dependent patterns of secondary metabolism linked to microbial survival.

17
Culture-enriched metagenomics enables genome-resolved detection of low abundance ESKAPE and Vibrio pathogens in coastal habitats

Ho, J. Y.; Hu, D.; Kang, D. Y.; Sim, C. B. W.; Wijaya, W.; Boucher, Y. F.

2026-05-14 microbiology 10.64898/2026.05.14.725077 medRxiv
Top 0.1%
17.2%

Coastal marine environments are increasingly recognised as reservoirs of antimicrobial-resistant (AMR) pathogens. However, it remains challenging to recover high-quality genomes of clinically relevant bacteria present at low abundance from complex natural systems. Here, we applied culture-enriched metagenomics to systematically track the diversity and dynamics of major AMR pathogens within the coastal marine system of St. John's Island, Singapore, as a model ecosystem for pathogen surveillance. Selective media-based enrichment recovered 773 metagenome-assembled genomes (MAGs) from 92 multi-matrix environmental samples, which include coastal water, sediment, and seaweed, capturing diverse AMR ESKAPE and Vibrio species. Distinct bacterial signatures and dispersal patterns were observed in each niche: for example, microbes that signal human impact were detected at the beach, while fish-associated pathogens were present at the aquaculture facility outlet. Notably, the high-quality MAGs enabled subspecies-level identification and supported AMR gene detection across six distinct coastal habitats. Detailed differences in the recovery of specific pathogens across enrichment media were also identified, demonstrating the method's efficacy in finding media suitable for surveillance of specific organisms, such as deciding between liquid or solid formulations. MAGs recovered from culture-enriched metagenomics were highly similar to genomes obtained from pure isolates, as demonstrated for Klebsiella pneumoniae. The preserved culture-enriched stocks were capable of recovering organisms of interest when individual isolates were required for further study. Overall, our findings highlight the utility of culture-enriched metagenomics as a cost-effective, sensitive approach to uncovering the genomic landscape of pathogens with environmental reservoirs, with implications for AMR surveillance and ecological risk assessment.

18
Metaxa: A Transformer-Based Deep Learning Model for Taxonomic Classification of Long Nanopore Reads

Friganovic, K.; Stanojevic, D.; Chen, P.-S. B.; Sikic, M.

2026-04-23 bioinformatics 10.64898/2026.04.20.719780 medRxiv
Top 0.1%
17.2%

A significant fraction of the microbial diversity remains unclassified, hindering our understanding of microbial roles in health and ecosystems. State-of-the-art methods like Kraken 2 perform well for taxa that are present in the database. However, their accuracy drops significantly when classifying taxa that are not included. While deep learning has advanced many fields, its applications in metagenomics remain limited, and its full potential has yet to be realized. Here, we present Metaxa, a transformer-based deep learning model designed for the taxonomic classification of long-read Nanopore sequences. Metaxa leverages the sequential context of Nanopore reads, enabling robust classification beyond fixed k-mer profiles. Our results show that Metaxa matches Kraken 2 on in-sample data at both the species and genus levels, and significantly outperforms both Kraken 2 and MetageNN at the genus level on out-of-sample datasets where the species genome is absent from the reference database but a different species from the same genus is present. Furthermore, Metaxa demonstrates strong generalization across different Nanopore chemistries (R9.4.1 and R10.4.1). This work highlights the potential of deep learning models to improve metagenomic classification accuracy, especially in complex or underexplored environments where traditional tools fall short.

19
Diverse phages of ammonia oxidizers with the potential to modulate nitrification

Turner, A. A. B.; Stahn, M.; Millard, A.; Sauvageau, D.; Stein, L. Y.

2026-05-04 microbiology 10.64898/2026.05.02.722434 medRxiv
Top 0.1%
17.2%

Agriculture is a major source of anthropogenic greenhouse-gas emissions, being the largest source of nitrous oxide (N2O), an extremely potent greenhouse gas and ozone-depleting agent. Soil N2O emissions are largely driven by microbial nitrification, in which ammonia-oxidizing microorganisms catalyze the rate-limiting oxidation of ammonia to nitrite. Nitrification not only mediates N2O fluxes but also reduces fertilization efficiency and contributes to eutrophication through nitrate leaching. Bacteriophage (phage)-based control of microbial communities is rapidly garnering interest in a number of fields; however, phages infecting ammonia-oxidizers are largely uncharacterized, with only one lytic phage having been described, limiting the potential for phage-mediated nitrification inhibition. Here, we present the largest set of phages infecting ammonia-oxidizing bacteria (AOB) to date: 45 dsDNA phages identified from urban wastewater, infecting four AOB species, with 16 demonstrating cross-genus host ranges and capable of eliminating nitrification activity in liquid cultures. Phylogenetic and taxonomic analyses revealed six proposed families of Caudoviricetes and numerous monophyletic clades, likely representing higher-level lineages. Structure-guided genome annotation revealed that these phages carry diverse and seldom-seen auxiliary metabolic genes, ranging from a complete ABC transporter cassette to a large antimicrobial resistance gene cluster. These results unveil the previously unrecognized diversity of AOB phages and their potential to alter host physiology. Our data demonstrate a broad taxonomic and functional repertoire of cultured AOB phages, greatly expanding the panel of known AOB phages, and suggest that viruses play a more significant and complex role in nitrification than previously understood. Moreover, we outline an effective methodological framework for isolating AOB phages from environmental samples. 
These results will help reframe our understanding of environmental nitrification and enable intensified selection and use of phages for its control.

20
Optimizing a Culture-Enriched Hybrid Metagenomics Pipeline to Assess the AMR Footprint of Livestock Manure in Anaerobic Digestate

Rahman, N.; Rahman, A. S. M. Z.; Levin, D. B.; McAllister, T.; Cicek, N.; Derakhshani, H.

2026-04-24 microbiology 10.64898/2026.04.24.720626 medRxiv
Top 0.1%
16.9%

The extent to which anaerobic digestate acts as a reservoir of antimicrobial resistance genes (ARGs) is likely underestimated, as conventional metagenomics may underrepresent low-abundance determinants and lacks sufficient resolution to reliably link ARGs to mobile genetic elements (MGEs). This study used hybrid assemblies to evaluate whether culture-enriched metagenomics (CEMG), with and without antimicrobial selectivity, improves detection of ARGs in digestate and characterization of ARG-MGE-host linkages. Culture enrichment substantially increased ARG recovery: mean ARG signal rose from 15.4 CPM in fresh digestate (FD; direct metagenomic samples) to 124 CPM in CEMG without antibiotics and 160.0 CPM in antibiotic-selective CEMG, the latter corresponding to an approximately 10.4-fold increase over FD. While only 9 unique ARGs were detected in FD, enrichment recovered 112, including clinically important genes such as those encoding vancomycin resistance, extended-spectrum β-lactamase, and linezolid resistance. Oxygen availability emerged as the strongest factor structuring enrichment, with aerobic and anaerobic samples forming distinct clusters and exhibiting shifts in dominant taxa and resistome composition. Antibiotic selection produced more targeted, class-specific shifts, with tetracycline resistance consistently enriched across treatments. Hybrid metagenomic assembly further resolved ARG-MGE-host linkages, revealing extensive co-localization of ARGs with MGEs and heavy metal resistance genes. Together, these findings demonstrate that antibiotic-selective culture enrichment enhances resistome surveillance by improving detection of low-abundance ARGs, while hybrid assembly provides critical genomic context to assess their mobility and host associations. 
IMPORTANCE: Livestock manure and its byproducts, such as anaerobic digestate, are recognized as important environmental reservoirs of antimicrobial resistance, yet current metagenomic approaches may underestimate this risk by failing to detect low abundance but clinically relevant resistance determinants. Here, we show that integrating culture enrichment with hybrid metagenomics improves the recovery of antimicrobial resistance genes and reveals their association with mobile genetic elements and bacterial hosts. This approach captures a cultivable and condition-responsive fraction of the resistome that is not readily accessible through direct metagenomic sequencing alone, providing a more informative framework for environmental AMR surveillance.
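
The fold increase quoted in this abstract can be checked from the three reported mean ARG signals. The short calculation below uses only those rounded means, so it is an approximation; the variable names are illustrative.

```python
# Mean ARG signal in counts per million (CPM), as reported in the abstract.
fresh_digestate = 15.4
cemg_no_antibiotics = 124.0
cemg_antibiotic_selective = 160.0

# Fold change relative to fresh digestate (direct metagenomic samples).
fold_no_antibiotics = cemg_no_antibiotics / fresh_digestate
fold_antibiotic_selective = cemg_antibiotic_selective / fresh_digestate

print(f"CEMG without antibiotics: {fold_no_antibiotics:.1f}-fold")
print(f"Antibiotic-selective CEMG: {fold_antibiotic_selective:.1f}-fold")
```

With these inputs the antibiotic-selective condition works out to roughly 10.4-fold, matching the reported figure, while enrichment without antibiotics corresponds to roughly 8.1-fold.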