Benchmarking strain-level profiling of Escherichia coli in short-read gut metagenomes

Galbraith, M.; Williams, D.; Shaw, L. P.; Lipworth, S.; Stoesser, N.

2026-05-19 bioinformatics

10.64898/2026.05.19.726160 bioRxiv

2. Abstract

Metagenomes offer the potential to characterise Escherichia coli strain-level diversity within the human gut microbiome, informing our understanding of colonisation diversity and the genetic features distinguishing infection from carriage. Among the numerous reference-based tools for short-read metagenomic strain-level profiling, the best approach remains unclear. Here, we benchmarked six published tools (PanTax, PathoScope, StrainGE, Strainify, StrainR2 and StrainScan) for their ability to detect co-existing strains of E. coli and estimate their relative abundance across real and simulated metagenomes of increasing complexity with varying reference database composition. In the ZymoBIOMICS® D6331 dataset, only PanTax achieved zero error when predicting the equal abundance of five E. coli strains. In a differentially abundant four-strain mock community dataset (SRR13355226), StrainScan had the lowest mean absolute proportional error (0.89), driven by reduced sensitivity (0.5), followed by PathoScope (4.08). Across simulated metagenomes reflecting the healthy adult gut microbiome, all tools demonstrated high sensitivity (≥0.833), but specificity, precision and F1 score improved for some tools when detection thresholds were applied to remove low-abundance false positives. Outright, StrainGE achieved the highest F1 score (0.978). Predicted relative abundances of the E. coli K12-MG1655 (phylogroup A) and O157:H7 Sakai (phylogroup E) strains spiked into simulated metagenomes across varying abundance ratios were generally accurate, with PanTax and StrainR2 showing the lowest mean absolute proportional error (0.06). When truly present strains were removed from the reference database, out-of-phylogroup assignments were observed for some tools. Collectively, our results demonstrate that published metagenomic strain-level profiling tools vary in their ability to profile E. coli strains, indicating that method selection should be guided by intended application.
These findings will facilitate characterisation of E. coli strain-level diversity within short-read gut metagenomes with greater accuracy than previously possible.

3. Impact statement

Strain-level diversity within the human gut microbiome can be important for human health, with species such as Escherichia coli existing as both commensal and pathogenic strains. Most existing gut microbiome datasets are from short-read (i.e., Illumina) sequencing, and numerous bioinformatic tools have been developed to profile strain-level variation from these data. However, the existing literature is often difficult to navigate given that the available tools have been benchmarked in various ways and are subject to author bias. This is, to our knowledge, the first independent benchmarking of six published tools for profiling E. coli at strain-level resolution from short-read metagenomes. Using both real and simulated datasets of increasing complexity, we demonstrate substantial variation in tool performance in terms of strain detection and relative abundance estimation, highlighting that tool choice should be guided by the specific research question, as no single method performs optimally across all scenarios. This work provides an unbiased framework for tool selection and will support more accurate and reproducible E. coli strain-level analyses in gut microbiome research from short-read metagenomic data.

4. Data summary

The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. Supplementary methods, six supplementary tables and four supplementary figures are available in the online Supplementary Material. Code for simulating metagenomes using InSilicoSeq, SLURM job scripts for the simulated metagenomes dataset, and R visualization and statistical analysis scripts are available within a dedicated public GitHub repository (https://github.com/mattgal11/benchmarking_short_read_strain_profilers).
The following supplementary data are available on FigShare (https://doi.org/10.6084/m9.figshare.32125474):

- Normalised per-contig relative abundances for 98 species assemblies used to construct the baseline gut microbiome profile for InSilicoSeq metagenome simulation (Normalised_relative_abundance_for_InSilicoSeq_simulated_metagenomes_gut_microbiome_profile.csv)
- ZymoBIOMICS® D6331 gut microbiome standard dataset predicted relative abundance data (Zymobiomics_D6331_raw_predicted_abundance.csv)
- SRR13355226 mock community (99% human reads; 1% E. coli reads) paired-end reads with human reads depleted (SRR13355226_depleted_R1.fastq.gz & SRR13355226_depleted_R2.fastq.gz)
- SRR13355226 mock community dataset raw predicted abundance data, with and without human read removal (SRR13355226_raw_predicted_abundance_with_and_without_human_read_removal.csv)
- Simulated metagenomes dataset raw call types and detection metric values with increasing detection thresholds (Simulated_metagenomes_raw_call_type_assingments_and_detection_thresholds.csv)
- Simulated metagenomes dataset (all references) predicted relative abundance data (Simulated_metagenomes_all_references_raw_predicted_abundances.csv)
- Simulated metagenomes dataset (all references) mapped reads for PathoScope and Strainify (all_refs_pathoscope_reads_mapped.csv & all_refs_strainify_reads_mapped.csv)
- Simulated metagenomes dataset (reduced reference database) predicted relative abundance data (Simulated_metagenomes_K12_and_Sakai_removed_from_reference_database_raw_predicted_abundance.csv)
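
For readers unfamiliar with the metrics named in the abstract (sensitivity, specificity, precision, F1 score, detection thresholds and mean absolute proportional error), the following is a minimal Python sketch of how such values could be computed from a tool's predicted strain abundances. It is not the authors' code, which is provided as R scripts in the repository above. The function names, strain labels and example abundances are hypothetical, and the error definition used here (|predicted - true| / true, averaged over the truly present strains) is an assumption, since the abstract does not define the metric.

```python
from typing import Dict, Set


def apply_threshold(predicted: Dict[str, float], threshold: float) -> Set[str]:
    """Call a strain as detected if its predicted relative abundance
    is at or above the detection threshold."""
    return {strain for strain, abundance in predicted.items() if abundance >= threshold}


def detection_metrics(called: Set[str], truth: Set[str], database: Set[str]) -> Dict[str, float]:
    """Sensitivity, specificity, precision and F1 for one sample.

    True negatives are reference strains in the database that are
    neither truly present nor called.
    """
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(database - called - truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def mean_absolute_proportional_error(predicted: Dict[str, float], truth: Dict[str, float]) -> float:
    """Assumed definition: mean over truly present strains of
    |predicted - true| / true. A missed strain counts as predicted 0,
    so it contributes an error of 1. True abundances must be > 0."""
    errors = [abs(predicted.get(strain, 0.0) - true) / true for strain, true in truth.items()]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical five-strain reference database and two-strain sample.
    database = {"strain_A", "strain_B", "strain_C", "strain_D", "strain_E"}
    truth = {"strain_A": 0.5, "strain_B": 0.5}
    predicted = {"strain_A": 0.5, "strain_B": 0.25, "strain_C": 0.25}

    for threshold in (0.0, 0.3):
        called = apply_threshold(predicted, threshold)
        print(threshold, detection_metrics(called, set(truth), database))
    print(mean_absolute_proportional_error(predicted, truth))
```

In this toy example, raising the threshold from 0.0 to 0.3 removes the false-positive call for strain_C (specificity and precision rise to 1.0) but also drops the truly present strain_B (sensitivity falls to 0.5), which illustrates why threshold choice trades sensitivity against specificity.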
