Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling
Velo-Suarez, L.; Herzig, A. F.; Bocher, O.; Le Folgoc, G.; Le Roux, L.; Delmas, C.; Zins, M.; Deleuze, J.-F.; Hery-Arnaud, G.; Genin, E.
Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ~43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ~4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ~10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10 reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ~2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
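The depth-normalization step mentioned in the abstract (rarefaction) can be illustrated with a minimal sketch: each sample's per-taxon read counts are subsampled without replacement to a common depth before richness is compared. This is not the authors' pipeline; the function names `rarefy` and `richness`, the toy count vectors, and the choice of the smaller sample's total as the common depth are all assumptions made for illustration.

```python
import numpy as np

def rarefy(counts, depth, seed=0):
    """Subsample per-taxon read counts to `depth` reads without replacement."""
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = np.random.default_rng(seed)
    # A multivariate hypergeometric draw is equivalent to picking `depth`
    # individual reads at random, without replacement, from the sample.
    return rng.multivariate_hypergeometric(counts, depth)

def richness(counts):
    """Number of taxa with at least one read."""
    return int(np.count_nonzero(counts))

# Hypothetical toy counts (one entry per taxon), not data from the study.
deep_sample = np.array([5000, 3000, 1500, 400, 90, 10])
shallow_sample = np.array([520, 310, 140, 30, 0, 0])

# Assumed convention: rarefy both samples to the smaller sample's total.
common_depth = int(min(deep_sample.sum(), shallow_sample.sum()))
deep_rarefied = rarefy(deep_sample, common_depth)
shallow_rarefied = rarefy(shallow_sample, common_depth)

print("richness before:", richness(deep_sample), richness(shallow_sample))
print("richness after: ", richness(deep_rarefied), richness(shallow_rarefied))
```

After rarefaction both samples contain the same number of reads, so any remaining difference in richness or composition cannot be attributed to raw sequencing depth alone.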
Importance: Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles -- without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host-microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.