KuafuPrimer: Machine learning empowers the design of 16S amplicon sequencing primers toward minimal bias for bacterial communities

Zhang, H.; Jiang, X.; Yu, X.; Wang, H.; Lu, P.; Hou, J.; Guo, Q.; Xiao, T.; Wu, S.; Yin, H.; Geng, P. X.; Guo, J.; Jousset, A.; Wei, Z.; Xiao, Y.; Zhu, H.

2026-03-31 bioinformatics
10.64898/2026.03.29.714677 bioRxiv

Amplicon sequencing targeting the 16S rRNA gene is a widely used and cost-effective method for exploring bacterial communities. However, its performance is often limited by primer bias arising from the arbitrary use of universal primers across diverse microbial communities and habitats. We propose KuafuPrimer to design optimal 16S rRNA gene primers with minimal bias for targeted bacterial communities, using few-shot machine learning to guide the primer design procedure based on a small number of samples. Simulations on 809 samples across 26 representative environments and habitats showed that KuafuPrimer-designed primers outperformed the universal primers in taxonomic accuracy, achieving an average 16.31% relative reduction in primer bias, with reductions of up to 46.08% in plant samples. Notably, KuafuPrimer detected 29 rare and key taxa undetectable by the universal primers. Validation with 317 longitudinal gut microbiota samples demonstrated that KuafuPrimer-designed primers consistently outperformed the universal primers at the temporal, individual, and cohort levels, with relative bias reductions of 5.03%, 3.53%, and 3.10%, respectively. Finally, real PCR experiments on human gut samples from Clostridioides difficile-infected and healthy groups showed that polymerase chain reaction products amplified with KuafuPrimer-designed primers correlated better with metagenomic data than those amplified with the universal primers. More importantly, KuafuPrimer successfully detected Clostridioides difficile, the key pathogen missed by the universal primers, highlighting its potential for improving clinical diagnostics. In summary, KuafuPrimer provides a machine learning-based primer design strategy for targeted bacterial communities, with demonstrated utility in large-scale microbiome initiatives, longitudinal surveys, and clinical diagnostics.

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. Microbiome | 139 papers in training set | Top 0.1% | 13.8%
2. Microbial Genomics | 204 papers in training set | Top 0.4% | 6.1%
3. Nature Communications | 4913 papers in training set | Top 31% | 6.1%
4. Briefings in Bioinformatics | 326 papers in training set | Top 1% | 4.7%
5. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.1%
6. Bioinformatics | 1061 papers in training set | Top 5% | 3.8%
7. Nucleic Acids Research | 1128 papers in training set | Top 6% | 3.5%
8. Scientific Reports | 3102 papers in training set | Top 40% | 3.5%
9. Genome Medicine | 154 papers in training set | Top 2% | 3.5%
10. Genome Biology | 555 papers in training set | Top 3% | 3.5%
50% of probability mass above
11. PLOS Computational Biology | 1633 papers in training set | Top 11% | 3.5%
12. PLOS ONE | 4510 papers in training set | Top 44% | 2.8%
13. mSystems | 361 papers in training set | Top 4% | 2.5%
14. Advanced Science | 249 papers in training set | Top 8% | 2.5%
15. BMC Genomics | 328 papers in training set | Top 1% | 2.5%
16. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.4%
17. Frontiers in Microbiology | 375 papers in training set | Top 4% | 2.0%
18. Cell Reports Methods | 141 papers in training set | Top 2% | 1.8%
19. Bioinformatics Advances | 184 papers in training set | Top 3% | 1.6%
20. GigaScience | 172 papers in training set | Top 2% | 1.6%
21. Molecular Ecology Resources | 161 papers in training set | Top 0.6% | 1.6%
22. Nature Machine Intelligence | 61 papers in training set | Top 2% | 1.4%
23. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.2%
24. mSphere | 281 papers in training set | Top 5% | 1.2%
25. Genome Research | 409 papers in training set | Top 4% | 0.9%
26. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 0.9%
27. Nature Biotechnology | 147 papers in training set | Top 7% | 0.9%
28. Methods in Ecology and Evolution | 160 papers in training set | Top 2% | 0.8%
29. Cell Systems | 167 papers in training set | Top 12% | 0.8%
30. Frontiers in Genetics | 197 papers in training set | Top 9% | 0.8%
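
The statement that the top 10 journals account for about 50% of the predicted probability mass can be checked against the listed percentages. The sketch below is only an illustration, not part of the prediction tool: the helper name `rank_reaching` is hypothetical, and the values are the rounded percentages shown in the list, so sums are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-30, copied from the list above.
# These are the rounded values as displayed, so totals are approximate.
probs = [13.8, 6.1, 6.1, 4.7, 4.1, 3.8, 3.5, 3.5, 3.5, 3.5,
         3.5, 2.8, 2.5, 2.5, 2.5, 2.4, 2.0, 1.8, 1.6, 1.6,
         1.6, 1.4, 1.2, 1.2, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8]


def rank_reaching(mass_pct, values):
    """Return the first 1-based rank whose cumulative sum reaches mass_pct.

    Hypothetical helper for illustration; returns None if the target
    is never reached.
    """
    for rank, total in enumerate(accumulate(values), start=1):
        if total >= mass_pct - 1e-9:  # small tolerance for float rounding
            return rank
    return None


cutoff = rank_reaching(50.0, probs)   # rank at which 50% is first reached
top10 = sum(probs[:10])               # cumulative mass of the top 10
listed_total = sum(probs)             # mass covered by the 30 listed journals

print(f"50% reached at rank {cutoff}; top-10 mass = {top10:.1f}%; "
      f"listed total = {listed_total:.1f}%")
```

With the displayed values, the cumulative sum first reaches 50% at rank 10, which matches the position of the "50% of probability mass above" marker. The 30 listed journals do not sum to 100%, so the remaining mass presumably belongs to journals not shown.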