
A Bioinformatic Pipeline for Consensus Taxonomic Classification of Long-Read Amplicons

Paulsen, A. A.; LaSarre, B.; Delp, D.; Beattie, G. A.; Halverson, L. J.

2026-04-30 microbiology
10.64898/2026.04.29.721641 bioRxiv

Characterizing community composition is fundamental to understanding microbial community function. Recent advances in Oxford Nanopore Technologies (ONT) long-read sequencing now allow community profiling using full-length gene amplicons, affording better taxonomic resolution than standard short-amplicon Illumina sequencing. However, robust ONT-compatible profiling workflows are lacking. To address this, we have created the Amplicon Consensus Taxonomy (ACT) pipeline for classifying long-read amplicons. ACT combines output from three existing pipelines (Emu, Sintax, and LACA) to leverage the strengths of each while offsetting their individual limitations. We also developed the ACT database (ACT-DB), a sequence-similarity-aware reference database that clusters highly similar sequences into multi-taxa groups to reduce overclassification. We benchmarked ACT performance against Emu and Sintax using a defined simple mock community, simulated datasets, and a complex rhizosphere community supplemented with novel species. While ACT exhibited generally comparable or superior performance across datasets, it demonstrated a marked advantage over Emu and Sintax in identifying novel and low-abundance taxa in both simple and complex communities, resulting in significantly higher species-richness estimates that better reflected those observed in prior Illumina amplicon studies. Furthermore, by clustering ambiguous reference sequences, ACT-DB allowed ACT to resolve reads to meaningful multi-species groups, improving resolution without coercing artificial precision. Together, ACT and ACT-DB form a robust long-read amplicon profiling workflow that confidently identifies known species while reducing overclassification and preserving low-abundance and unknown taxa.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. ISME Communications; 103 papers in training set; Top 0.1%; 12.5%
2. mSystems; 361 papers in training set; Top 0.7%; 10.1%
3. Microbiome; 139 papers in training set; Top 0.2%; 10.1%
4. mBio; 750 papers in training set; Top 1%; 10.1%
5. Nature Communications; 4913 papers in training set; Top 22%; 8.4%

50% of probability mass above

6. mSphere; 281 papers in training set; Top 0.6%; 6.4%
7. Genome Biology; 555 papers in training set; Top 2%; 3.6%
8. Frontiers in Microbiology; 375 papers in training set; Top 3%; 3.1%
9. The ISME Journal; 194 papers in training set; Top 1%; 2.1%
10. Nature Microbiology; 133 papers in training set; Top 2%; 2.1%
11. PLOS ONE; 4510 papers in training set; Top 50%; 1.9%
12. Microbiology Spectrum; 435 papers in training set; Top 2%; 1.8%
13. Nucleic Acids Research; 1128 papers in training set; Top 11%; 1.7%
14. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 33%; 1.7%
15. Cell Reports Methods; 141 papers in training set; Top 3%; 1.5%
16. Scientific Reports; 3102 papers in training set; Top 64%; 1.3%
17. Environmental Microbiome; 26 papers in training set; Top 0.5%; 0.8%
18. eLife; 5422 papers in training set; Top 56%; 0.8%
19. npj Biofilms and Microbiomes; 56 papers in training set; Top 2%; 0.8%
20. PLOS Computational Biology; 1633 papers in training set; Top 25%; 0.7%
21. Methods in Ecology and Evolution; 160 papers in training set; Top 2%; 0.7%
22. Cell; 370 papers in training set; Top 17%; 0.7%
23. Microbiology Resource Announcements; 22 papers in training set; Top 0.9%; 0.7%
24. Molecular Ecology Resources; 161 papers in training set; Top 1%; 0.7%
25. Scientific Data; 174 papers in training set; Top 2%; 0.7%
26. New Phytologist; 309 papers in training set; Top 5%; 0.7%
27. Briefings in Bioinformatics; 326 papers in training set; Top 7%; 0.7%
28. Bioinformatics Advances; 184 papers in training set; Top 5%; 0.6%
29. Water Research; 74 papers in training set; Top 2%; 0.6%
30. Communications Biology; 886 papers in training set; Top 29%; 0.6%
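
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by hand from the listed percentages. The sketch below is illustrative only: it assumes the final percentage on each entry is that journal's predicted probability and that the cutoff is the smallest number of top-ranked journals whose running total reaches 50%. The page does not say how the cutoff is actually computed, and the function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals,
# copied from the list above in rank order.
probabilities = [12.5, 10.1, 10.1, 10.1, 8.4, 6.4, 3.6, 3.1, 2.1, 2.1]


def journals_to_reach(threshold, probs):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


# Assumed cutoff rule: smallest prefix whose cumulative probability is >= 50%.
cutoff = journals_to_reach(50.0, probabilities)
print(cutoff)
```

Under these assumptions the first four journals sum to about 42.8% and the first five to about 51.2%, which agrees with the cutoff line placed after rank 5.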