EMITS: expectation-maximization abundance estimation for fungal ITS communities from long-read sequencing
O'Brien, A.; Lagos, C.; Fernandez, K.; Ojeda, B.; Parada, P.
As long-read amplicon sequencing becomes routine for fungal metabarcoding, species-level abundance estimation from ITS amplicons remains limited by naive best-hit classification. Best-hit assignment misattributes reads among closely related species that share similar ITS sequences, and it fragments abundance across redundant database entries. Here we present EMITS, a Rust-based tool that applies expectation-maximization (EM) to iteratively resolve ambiguous read-to-reference mappings from minimap2 alignments against the UNITE database, producing probabilistic species-level abundance estimates. EMITS includes platform-specific presets for Oxford Nanopore and PacBio chemistries and performs taxonomic aggregation across UNITE accessions. We validated EMITS using three complementary approaches: controlled simulations with tunable alignment noise, an Oxford Nanopore (ONT) mock community of 10 fungal species with known composition, and a synthetic community of 21 species derived from UNITE reference sequences. In simulations, EM reduced L1 error by 80-92% compared to naive counting under realistic noise conditions. On the ONT mock community, EM correctly resolved within-genus species assignments where naive counting misattributed reads (e.g., Trichophyton mentagrophytes vs. T. simii; Penicillium species) and consolidated abundance across redundant database accessions. On the synthetic community, EM reduced false-positive abundance by 54% and improved overall accuracy by 13.4%. Together with ITSxRust [O'Brien et al., 2026] for upstream ITS extraction, EMITS provides a complete high-performance pipeline for long-read fungal amplicon profiling.
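To make the contrast between naive best-hit counting and EM concrete, the following is a minimal Python sketch of a generic EM abundance estimator. It is not the EMITS implementation (which is written in Rust and reads minimap2 alignments). The input format (one dict of reference-to-alignment-weight per read), the function names, the accession-to-species map, and the toy data are all assumptions made for illustration.

```python
def naive_abundance(read_hits):
    """Best-hit counting: each read goes entirely to its top-scoring reference."""
    refs = sorted({r for hits in read_hits for r in hits})
    counts = dict.fromkeys(refs, 0.0)
    for hits in read_hits:
        counts[max(hits, key=hits.get)] += 1.0
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}


def em_abundance(read_hits, max_iter=200, tol=1e-9):
    """Generic EM: alternate soft read assignment and abundance re-estimation."""
    refs = sorted({r for hits in read_hits for r in hits})
    theta = {r: 1.0 / len(refs) for r in refs}  # uniform starting abundances
    for _ in range(max_iter):
        counts = dict.fromkeys(refs, 0.0)
        for hits in read_hits:
            # E-step: split the read across its candidate references in
            # proportion to (current abundance x alignment weight).
            norm = sum(theta[r] * w for r, w in hits.items())
            if norm == 0.0:
                continue
            for r, w in hits.items():
                counts[r] += theta[r] * w / norm
        # M-step: abundances are the normalized expected read counts.
        total = sum(counts.values())
        new_theta = {r: c / total for r, c in counts.items()}
        delta = sum(abs(new_theta[r] - theta[r]) for r in refs)
        theta = new_theta
        if delta < tol:
            break
    return theta


def aggregate(theta, ref_to_species):
    """Sum abundances of redundant accessions that map to the same species."""
    out = {}
    for ref, p in theta.items():
        species = ref_to_species.get(ref, ref)
        out[species] = out.get(species, 0.0) + p
    return out


def l1_error(estimate, truth):
    """L1 distance between two abundance profiles."""
    keys = set(estimate) | set(truth)
    return sum(abs(estimate.get(k, 0.0) - truth.get(k, 0.0)) for k in keys)


# Toy data (hypothetical): every read truly comes from sp_A. Six reads align
# only to sp_A; four noisy reads score slightly higher against the close
# relative sp_B than against sp_A.
reads = [{"sp_A": 1.0}] * 6 + [{"sp_A": 0.9, "sp_B": 1.0}] * 4
truth = {"sp_A": 1.0, "sp_B": 0.0}

naive = naive_abundance(reads)  # best-hit sends the four noisy reads to sp_B
em = em_abundance(reads)        # EM pulls them back toward sp_A
print(l1_error(naive, truth), l1_error(em, truth))
```

In this toy case sp_B has no uniquely aligned reads, so its estimated abundance shrinks at every iteration and the ambiguous reads are progressively reassigned to sp_A. The `aggregate` helper shows one way abundance split across redundant accessions could be summed to species level, given a hypothetical accession-to-species mapping.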
Matching journals
The top 2 journals account for 50% of the predicted probability mass.