Mitag4taxa: Extracting SSU rRNA Illumina reads from metagenomes for taxonomic classification

He, Y.; Du, Y.; Nguyen, L.; Wang, Y.

2026-05-05 bioinformatics
10.64898/2026.05.01.722230 bioRxiv
The prevailing taxonomic profiling methods for environmental samples rely heavily on PCR amplification of small-subunit (SSU) ribosomal RNA (rRNA) genes and on genome-based reference databases. Extracting SSU rRNA fragments directly from Illumina metagenomic sequencing data is PCR-independent, but recognizing those fragments is technically challenging. Here we present Mitag4taxa, a computational pipeline for taxonomic profiling of microbial communities from metagenomic Illumina sequencing reads that contain rRNA tags (mitags). A Hidden Markov Model (HMM) was built for SSU rRNA genes, and further HMMs were built for the V4 region of 16S rRNA genes and the V9 region of 18S rRNA genes, using, respectively, representative sequences of different families and the corresponding hypervariable regions from the SILVA database. The pipeline identifies and extracts 16S and 18S rRNA gene fragments, along with their quality scores, from metagenomic or metatranscriptomic datasets using an HMM search against these models. The hypervariable regions (V4 of 16S rRNA genes and V9 of 18S rRNA genes) can then be scanned and recruited for taxonomic classification and biodiversity estimation. To assess its reliability, Mitag4taxa was evaluated on both real and simulated datasets. In human gut metagenomic assessments, taxonomic profiles derived from Mitag4taxa were highly consistent with those based on conventional 16S rRNA gene amplicons, identifying dominant families such as Bacteroidaceae and Prevotellaceae at similar relative abundances. Statistical analyses confirmed highly significant positive correlations between Mitag4taxa and amplicon-based community structures. The 18S V9 module was further validated using shotgun metagenomic data from deep-sea sediment cores, where it recovered key eukaryotic taxa such as Collodaria and Leotiomycetes. Furthermore, benchmarking against the RiboTagger software on CAMI marine simulated datasets showed that Mitag4taxa achieved a higher average F1 score and lower error metrics. Overall, Mitag4taxa provides a complementary, rRNA gene amplicon- and genome-independent strategy for microbial community profiling, enabling improved detection of both prokaryotic and eukaryotic taxa from metagenomic and metatranscriptomic sequencing data.
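The extraction step described in the abstract (keeping rRNA-containing reads together with their quality scores) can be illustrated with a minimal sketch. This is not the Mitag4taxa implementation: it assumes the HMM search has already produced a set of matching read IDs, and the names `read_fastq`, `extract_hits`, and `hit_ids` are hypothetical. It also assumes plain four-line FASTQ records.

```python
import io
from typing import Iterator, Set, TextIO, Tuple

# (read_id, sequence, quality string)
FastqRecord = Tuple[str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # skip stray blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        # Read ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0]
        yield read_id, seq, qual


def extract_hits(handle: TextIO, hit_ids: Set[str]) -> Iterator[FastqRecord]:
    """Keep only reads whose ID is in hit_ids (assumed to come from an HMM search)."""
    for record in read_fastq(handle):
        if record[0] in hit_ids:
            yield record


# Hypothetical usage with an in-memory FASTQ and a made-up hit list.
example = "@r1 sample\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n"
for read_id, seq, qual in extract_hits(io.StringIO(example), {"r1"}):
    print(read_id, seq, qual)
```

Because the quality string travels with each retained read, downstream steps such as quality filtering remain possible after extraction.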

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Microbiome | 139 papers in training set | Top 0.1% | 18.5%
2. Bioinformatics | 1061 papers in training set | Top 3% | 10.0%
3. Briefings in Bioinformatics | 326 papers in training set | Top 0.6% | 7.1%
4. PLOS ONE | 4510 papers in training set | Top 31% | 4.8%
5. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.8%
6. Genome Biology | 555 papers in training set | Top 2% | 4.3%
7. Frontiers in Microbiology | 375 papers in training set | Top 2% | 4.3%
50% of probability mass above
8. Scientific Reports | 3102 papers in training set | Top 37% | 3.6%
9. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 3.6%
10. mSystems | 361 papers in training set | Top 3% | 3.6%
11. Molecular Ecology Resources | 161 papers in training set | Top 0.4% | 2.9%
12. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.6%
13. Nature Communications | 4913 papers in training set | Top 45% | 2.6%
14. Nucleic Acids Research | 1128 papers in training set | Top 8% | 2.4%
15. PLOS Computational Biology | 1633 papers in training set | Top 16% | 1.7%
16. Microorganisms | 101 papers in training set | Top 1% | 1.3%
17. ISME Communications | 103 papers in training set | Top 1% | 1.2%
18. Microbial Genomics | 204 papers in training set | Top 2% | 1.2%
19. mSphere | 281 papers in training set | Top 4% | 1.2%
20. iScience | 1063 papers in training set | Top 23% | 1.1%
21. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 0.9%
22. Methods in Ecology and Evolution | 160 papers in training set | Top 2% | 0.9%
23. Water Research | 74 papers in training set | Top 1% | 0.8%
24. Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
25. Frontiers in Bioinformatics | 45 papers in training set | Top 1% | 0.7%
26. Advanced Science | 249 papers in training set | Top 21% | 0.7%
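
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked against the listed percentages with a short cumulative sum. The sketch below is only an arithmetic illustration: the function name `journals_to_reach` is hypothetical, the input is assumed to be in rank order, and the displayed percentages are rounded, so the result is approximate.

```python
from typing import Sequence


def journals_to_reach(probabilities: Sequence[float], threshold: float = 50.0) -> int:
    """Return how many top-ranked entries are needed for the running sum to reach threshold.

    probabilities are percentages in rank order (highest first).
    """
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank
    raise ValueError("threshold not reached by the listed probabilities")


# Percentages of the first eight journals as displayed in the list above.
listed = [18.5, 10.0, 7.1, 4.8, 4.8, 4.3, 4.3, 3.6]
print(journals_to_reach(listed))
```

With the displayed values, the first six entries sum to about 49.5% and the seventh brings the total to about 53.8%, which agrees with the stated cutoff.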