Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Velo-Suarez, L.; Herzig, A. F.; Bocher, O.; Le Folgoc, G.; Le Roux, L.; Delmas, C.; Zins, M.; Deleuze, J.-F.; Hery-Arnaud, G.; Genin, E.

2026-04-01 genomics
10.64898/2026.03.27.714786 bioRxiv

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ~43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ~4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ~10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10 reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ~2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
Importance: Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles -- without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host-microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.
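
The depth-normalization step mentioned in the abstract (rarefaction) can be illustrated with a minimal sketch. This is not the authors' pipeline: the taxon names, the read counts, and the `rarefy` and `richness` helpers are all hypothetical, chosen only to show how subsampling two samples to a common depth makes their species richness comparable.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a taxon -> count dict.

    Hypothetical helper for illustration; real pipelines typically rely on
    a dedicated ecology or microbiome package.
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    # Expand the table into one label per read, then draw without replacement.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    picked = Counter(random.Random(seed).sample(reads, depth))
    return {taxon: picked.get(taxon, 0) for taxon in counts}


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


# Toy counts (assumed, not from the study): one deep and one shallow sample.
deep = {"taxon_A": 5000, "taxon_B": 3000, "taxon_C": 1500,
        "taxon_D": 400, "taxon_E": 90, "taxon_F": 10}
shallow = {"taxon_A": 500, "taxon_B": 300, "taxon_C": 150,
           "taxon_D": 40, "taxon_E": 10, "taxon_F": 0}

# Rarefy both samples to the depth of the shallower one.
common_depth = min(sum(deep.values()), sum(shallow.values()))
deep_rarefied = rarefy(deep, common_depth)
shallow_rarefied = rarefy(shallow, common_depth)

print("depth used:", common_depth)
print("richness before:", richness(deep), richness(shallow))
print("richness after: ", richness(deep_rarefied), richness(shallow_rarefied))
```

Drawing without replacement keeps every rarefied count at or below its original value, so rare taxa in the deeper sample can drop out once the depth advantage is removed.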

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Predicted probability
1 | Nature Biotechnology | 147 | Top 0.1% | 32.3%
2 | Nature Communications | 4913 | Top 19% | 9.9%
3 | Nature Genetics | 240 | Top 2% | 4.8%
4 | Nature Microbiology | 133 | Top 0.7% | 4.2%
(50% of probability mass above)
5 | Microbiome | 139 | Top 1% | 3.5%
6 | Proceedings of the National Academy of Sciences | 2130 | Top 23% | 3.0%
7 | Cell Reports Methods | 141 | Top 1% | 2.8%
8 | Cell | 370 | Top 8% | 2.7%
9 | Nature | 575 | Top 8% | 2.5%
10 | mSystems | 361 | Top 4% | 1.8%
11 | Nucleic Acids Research | 1128 | Top 10% | 1.7%
12 | Nature Aging | 51 | Top 1.0% | 1.7%
13 | Genome Biology | 555 | Top 4% | 1.7%
14 | Cell Genomics | 162 | Top 3% | 1.7%
15 | Nature Methods | 336 | Top 4% | 1.6%
16 | eLife | 5422 | Top 46% | 1.5%
17 | Cell Reports | 1338 | Top 26% | 1.5%
18 | Genome Medicine | 154 | Top 6% | 1.2%
19 | Cell Host & Microbe | 113 | Top 4% | 1.2%
20 | PLOS Computational Biology | 1633 | Top 21% | 1.1%
21 | Genome Research | 409 | Top 4% | 0.9%
22 | PLOS ONE | 4510 | Top 65% | 0.9%
23 | mBio | 750 | Top 10% | 0.9%
24 | Cell Systems | 167 | Top 11% | 0.9%
25 | Scientific Reports | 3102 | Top 74% | 0.8%
26 | Science Translational Medicine | 111 | Top 6% | 0.8%
27 | Microbial Genomics | 204 | Top 3% | 0.6%
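
As a quick check of the statement that the top 4 journals account for 50% of the predicted probability mass, the cumulative sum of the listed probabilities can be computed directly. The sketch below uses the first six probabilities from the list above; the `ranks_to_reach` helper is a hypothetical name introduced only for this illustration.

```python
def ranks_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative %) at which the running sum first meets threshold."""
    running = 0.0
    for rank, p in enumerate(probabilities, start=1):
        running += p
        if running >= threshold:
            return rank, running
    return None  # threshold never reached with the values supplied


# Predicted probabilities (%) for ranks 1-6, copied from the list above.
top_probs = [32.3, 9.9, 4.8, 4.2, 3.5, 3.0]

rank, cumulative = ranks_to_reach(top_probs)
print(f"threshold reached at rank {rank} with {cumulative:.1f}% cumulative mass")
```

The running totals are 32.3, 42.2, 47.0, and 51.2, so the 50% threshold is first crossed at rank 4.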