Systematic detection of abnormal samples reveals widespread mislabeling in metagenomic studies

Ye, W.; Zhou, Y.; Chen, J.; Wanxin, L.; Du, S.

2026-03-25 microbiology
10.64898/2026.03.22.713545 bioRxiv

The human microbiome plays a critical role in health and disease, and its dynamic nature has made longitudinal sampling a key strategy for elucidating microbiome-disease relationships. Although the gut microbiome generally stabilizes over time, a subset of samples frequently shows marked deviations from an individual's baseline profile. We refer to these as abnormal samples. We developed a three-stage workflow to identify and classify abnormal samples and to determine their underlying causes. We then systematically investigated abnormal samples across 16 publicly available metagenomic datasets comprising a total of 5,171 metagenomes. Our analysis revealed that abnormal samples often result from mislabeling during sample collection, processing, or sequencing. Among these, fecal samples from family members were more likely to be mislabeled. We found evidence of mislabeling in 75% of longitudinal datasets, involving up to dozens of samples per study, and in 25% of randomly selected cross-sectional datasets. Additional factors such as disease status (e.g., inflammatory bowel disease), sampling intervals, and sampling density may also contribute to sample abnormalities owing to true biological variation. These findings highlight that mislabeling is a common yet underrecognized issue in microbiome research. Our work underscores the importance of identifying and correcting abnormal samples to ensure data integrity in microbiome studies and provides a practical solution for quality control in large-scale metagenomic datasets.
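
The abstract does not describe the three stages of the workflow, so the following is not the authors' method. It is only a minimal sketch of the general idea that a sample deviating from its labeled individual's baseline, while closely resembling another individual's samples, is a candidate for mislabeling. The Bray-Curtis dissimilarity, the 0.5 threshold, the function names, and the toy abundance profiles are all assumptions chosen for illustration.

```python
from statistics import mean


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two profiles (dict: taxon -> abundance)."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    den = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0


def flag_possible_mislabels(samples, threshold=0.5):
    """Flag samples far from their own subject and closer to another subject.

    samples: list of (sample_id, subject_id, profile) tuples.
    threshold: hypothetical cutoff on the mean distance to the subject's
    other samples; a real analysis would have to calibrate this.
    Returns a list of (sample_id, labeled_subject, nearest_subject).
    """
    flagged = []
    for sid, subj, prof in samples:
        # Group distances from this sample to all others by subject label.
        dists = {}
        for oid, osubj, oprof in samples:
            if oid != sid:
                dists.setdefault(osubj, []).append(bray_curtis(prof, oprof))
        if subj not in dists:
            continue  # no other sample from this subject, so no baseline
        own = mean(dists[subj])
        others = {s: mean(d) for s, d in dists.items() if s != subj}
        if not others:
            continue
        nearest = min(others, key=others.get)
        if own > threshold and others[nearest] < own:
            flagged.append((sid, subj, nearest))
    return flagged


# Toy data: "a4" is labeled as subject A but resembles subject B's samples.
samples = [
    ("a1", "A", {"Bacteroides": 0.70, "Prevotella": 0.10, "Faecalibacterium": 0.20}),
    ("a2", "A", {"Bacteroides": 0.65, "Prevotella": 0.15, "Faecalibacterium": 0.20}),
    ("a3", "A", {"Bacteroides": 0.60, "Prevotella": 0.20, "Faecalibacterium": 0.20}),
    ("a4", "A", {"Bacteroides": 0.12, "Prevotella": 0.68, "Faecalibacterium": 0.20}),
    ("b1", "B", {"Bacteroides": 0.10, "Prevotella": 0.70, "Faecalibacterium": 0.20}),
    ("b2", "B", {"Bacteroides": 0.15, "Prevotella": 0.65, "Faecalibacterium": 0.20}),
]
flagged = flag_possible_mislabels(samples)
print(flagged)
```

A sample flagged this way is only a candidate: as the abstract notes, disease status, sampling interval, and sampling density can also produce genuine biological deviations from baseline.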

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | mSystems | 361 papers in training set | Top 0.1% | 22.3%
2 | Microbiome | 139 papers in training set | Top 0.2% | 10.0%
3 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 1% | 6.3%
4 | mSphere | 281 papers in training set | Top 1% | 4.1%
5 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.9%
6 | Scientific Reports | 3102 papers in training set | Top 38% | 3.6%
50% of probability mass above
7 | PLOS ONE | 4510 papers in training set | Top 40% | 3.6%
8 | npj Biofilms and Microbiomes | 56 papers in training set | Top 0.7% | 2.7%
9 | Frontiers in Microbiology | 375 papers in training set | Top 4% | 2.6%
10 | Nature Communications | 4913 papers in training set | Top 47% | 2.1%
11 | Microbiology Spectrum | 435 papers in training set | Top 2% | 1.9%
12 | Microbial Genomics | 204 papers in training set | Top 1.0% | 1.9%
13 | mBio | 750 papers in training set | Top 8% | 1.7%
14 | PLOS Biology | 408 papers in training set | Top 10% | 1.7%
15 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.7%
16 | Scientific Data | 174 papers in training set | Top 1% | 1.7%
17 | eLife | 5422 papers in training set | Top 43% | 1.6%
18 | Bioinformatics | 1061 papers in training set | Top 8% | 1.5%
19 | Genome Biology | 555 papers in training set | Top 5% | 1.5%
20 | ISME Communications | 103 papers in training set | Top 1% | 1.2%
21 | Gut Microbes | 70 papers in training set | Top 0.8% | 0.9%
22 | Science China Life Sciences | 26 papers in training set | Top 2% | 0.9%
23 | iScience | 1063 papers in training set | Top 30% | 0.8%
24 | Methods in Ecology and Evolution | 160 papers in training set | Top 2% | 0.7%
25 | Epidemics | 104 papers in training set | Top 2% | 0.7%
26 | Frontiers in Cellular and Infection Microbiology | 98 papers in training set | Top 6% | 0.7%
27 | Cell Systems | 167 papers in training set | Top 13% | 0.7%
28 | GigaScience | 172 papers in training set | Top 4% | 0.6%
29 | Advanced Science | 249 papers in training set | Top 22% | 0.6%
30 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 11% | 0.6%