KuafuPrimer: Machine learning empowers the design of 16S amplicon sequencing primers toward minimal bias for bacterial communities
Zhang, H.; Jiang, X.; Yu, X.; Wang, H.; Lu, P.; Hou, J.; Guo, Q.; Xiao, T.; Wu, S.; Yin, H.; Geng, P. X.; Guo, J.; Jousset, A.; Wei, Z.; Xiao, Y.; Zhu, H.
Amplicon sequencing targeting the 16S rRNA gene is a widely used and cost-effective method for exploring bacterial communities. However, its performance is often limited by primer bias arising from the arbitrary use of universal primers across diverse microbial communities and habitats. We propose KuafuPrimer, which designs 16S rRNA gene primers toward minimal bias for targeted bacterial communities, using few-shot machine learning to guide primer design from a small number of samples. Simulations on 809 samples across 26 representative environments and habitats showed that KuafuPrimer-designed primers outperformed the universal primers in taxonomic accuracy, achieving an average 16.31% relative reduction in primer bias, with reductions of up to 46.08% in plant samples. Notably, KuafuPrimer-designed primers detected 29 rare and key taxa undetectable by the universal primers. Validation with 317 longitudinal gut microbiota samples demonstrated that KuafuPrimer-designed primers consistently outperformed the universal primers at the temporal, individual, and cohort levels, with relative bias reductions of 5.03%, 3.53%, and 3.10%, respectively. Finally, real polymerase chain reaction (PCR) experiments on human gut samples from Clostridioides difficile-infected and healthy groups showed that PCR products amplified with KuafuPrimer-designed primers correlated better with metagenomic data than those amplified with the universal primers. More importantly, the KuafuPrimer-designed primers detected Clostridioides difficile, the key pathogen missed by the universal primers, highlighting their potential for improving clinical diagnostics. In summary, KuafuPrimer provides a machine learning-based primer design strategy for targeted bacterial communities, with demonstrated utility in large-scale microbiome initiatives, longitudinal surveys, and clinical diagnostics.
Matching journals
The top 10 journals account for 50% of the predicted probability mass.