Benchmarking strain-level profiling of Escherichia coli in short-read gut metagenomes

Galbraith, M.; Williams, D.; Shaw, L. P.; Lipworth, S.; Stoesser, N.

2026-05-19 bioinformatics
10.64898/2026.05.19.726160 bioRxiv
Metagenomes offer the potential to characterise Escherichia coli strain-level diversity within the human gut microbiome, informing our understanding of colonisation diversity and the genetic features distinguishing infection from carriage. Among numerous reference-based tools for short-read metagenomic strain-level profiling, the best approach remains unclear. Here, we benchmarked six published tools (PanTax, PathoScope, StrainGE, Strainify, StrainR2 and StrainScan) for their ability to detect co-existing strains of E. coli and estimate their relative abundance across real and simulated metagenomes of increasing complexity with varying reference database composition. In the ZymoBIOMICS® D6331 dataset, only PanTax achieved zero error when predicting the equal abundance of five E. coli strains. In a differentially abundant four-strain mock community dataset (SRR13355226), StrainScan had the lowest mean absolute proportional error (0.89), driven by reduced sensitivity (0.5), followed by PathoScope (4.08). Across simulated metagenomes reflecting the healthy adult gut microbiome, all tools demonstrated high sensitivity (≥0.833), but specificity, precision and F1 score were selectively improved in some tools through detection thresholds to remove low-abundance false positives. Outright, StrainGE achieved the highest F1 score (0.978). Predicted relative abundances of the E. coli K12-MG1655 (phylogroup A) and O157:H7 Sakai (phylogroup E) strains spiked into simulated metagenomes across varying abundance ratios were generally accurate, with PanTax and StrainR2 showing the lowest mean absolute proportional error (0.06). When truly present strains were removed from the reference database, out-of-phylogroup assignments were observed for some tools. Collectively, our results demonstrate that published metagenomic strain-level profiling tools vary in their ability to profile E. coli strains, indicating that method selection should be guided by intended application. These findings will facilitate characterisation of E. coli strain-level diversity within short-read gut metagenomes with greater accuracy than previously possible.

3. Impact statement

Strain-level diversity within the human gut microbiome can be important for human health, with species such as Escherichia coli existing as both commensal and pathogenic strains. Most existing gut microbiome datasets are from short-read (i.e., Illumina) sequencing, and numerous bioinformatic tools have been developed to profile strain-level variation from these data. However, the existing literature is often difficult to navigate given that the available tools have been benchmarked in various ways and are subject to author bias. This is, to our knowledge, the first independent benchmarking of six published tools for profiling E. coli at strain-level resolution from short-read metagenomes. Using both real and simulated datasets of increasing complexity, we demonstrate substantial variation in tool performance in terms of strain detection and relative abundance estimation, highlighting that tool choice should be guided by the specific research question, as no single method performs optimally across all scenarios. This work provides an unbiased framework for tool selection and will support more accurate and reproducible E. coli strain-level analyses in gut microbiome research from short-read metagenomic data.

4. Data summary

The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. Supplementary methods, six supplementary tables and four supplementary figures are available in the online Supplementary Material. Code for simulating metagenomes using InSilicoSeq, SLURM job scripts for the simulated metagenomes dataset and R visualization and statistical analysis scripts are available within a dedicated public GitHub repository (https://github.com/mattgal11/benchmarking_short_read_strain_profilers).
The following supplementary data are available on FigShare (https://doi.org/10.6084/m9.figshare.32125474):
- Normalised per-contig relative abundances for 98 species assemblies used to construct the baseline gut microbiome profile for InSilicoSeq metagenome simulation (Normalised_relative_abundance_for_InSilicoSeq_simulated_metagenomes_gut_microbiome_profile.csv)
- ZymoBIOMICS® D6331 gut microbiome standard dataset predicted relative abundance data (Zymobiomics_D6331_raw_predicted_abundance.csv)
- SRR13355226 mock community (99% human reads; 1% E. coli reads) paired-end reads with human reads depleted (SRR13355226_depleted_R1.fastq.gz & SRR13355226_depleted_R2.fastq.gz)
- SRR13355226 mock community dataset raw predicted abundance data, with and without human read removal (SRR13355226_raw_predicted_abundance_with_and_without_human_read_removal.csv)
- Simulated metagenomes dataset raw call types and detection metric values with increasing detection thresholds (Simulated_metagenomes_raw_call_type_assingments_and_detection_thresholds.csv)
- Simulated metagenomes dataset (all references) predicted relative abundance data (Simulated_metagenomes_all_references_raw_predicted_abundances.csv)
- Simulated metagenomes dataset (all references) mapped reads for PathoScope and Strainify (all_refs_pathoscope_reads_mapped.csv & all_refs_strainify_reads_mapped.csv)
- Simulated metagenomes dataset (reduced reference database) predicted relative abundance data (Simulated_metagenomes_K12_and_Sakai_removed_from_reference_database_raw_predicted_abundance.csv)
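
The abstract reports sensitivity, specificity, precision, F1 score and mean absolute proportional error for each tool. As a rough illustration of how such figures could be derived, the Python sketch below computes them from hypothetical call counts and abundance tables. The function names, the example numbers, and the error definition used here (|predicted - expected| / expected, averaged over expected strains, with undetected strains counted as zero abundance) are assumptions for illustration; they are not taken from the authors' code and may differ from the definitions used in the paper.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    """Hypothetical tally of strain calls for one tool on one metagenome."""
    tp: int  # strains present and reported
    fp: int  # strains absent but reported
    tn: int  # strains absent and not reported
    fn: int  # strains present but missed


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 rather than raising when a category is empty.
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: CallCounts) -> dict[str, float]:
    """Standard confusion-matrix metrics for strain detection."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    precision = _ratio(c.tp, c.tp + c.fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def mean_absolute_proportional_error(
    expected: dict[str, float], predicted: dict[str, float]
) -> float:
    """Assumed definition: mean over expected strains of |pred - exp| / exp.

    A strain missing from `predicted` is treated as predicted abundance 0,
    so each missed strain contributes an error of 1.0.
    """
    errors = [
        abs(predicted.get(strain, 0.0) - exp) / exp
        for strain, exp in expected.items()
    ]
    return sum(errors) / len(errors)


# Hypothetical example values, not data from the study.
example_counts = CallCounts(tp=5, fp=1, tn=94, fn=1)
example_metrics = detection_metrics(example_counts)

example_expected = {"strain_A": 0.5, "strain_B": 0.5}
example_predicted = {"strain_A": 0.6}  # strain_B was not detected
example_error = mean_absolute_proportional_error(example_expected, example_predicted)
```

Under this assumed definition a missed strain inflates the abundance error, which is one way reduced sensitivity could drive a tool's mean absolute proportional error.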

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Microbiome | 139 papers in training set | Top 0.1% | 22.6%
2. mSystems | 361 papers in training set | Top 0.5% | 12.6%
3. Bioinformatics | 1061 papers in training set | Top 3% | 7.2%
4. Nature Communications | 4913 papers in training set | Top 29% | 6.4%
5. Microbial Genomics | 204 papers in training set | Top 0.4% | 4.9%
50% of probability mass above
6. mSphere | 281 papers in training set | Top 1% | 4.0%
7. Nature Biotechnology | 147 papers in training set | Top 3% | 2.7%
8. Nucleic Acids Research | 1128 papers in training set | Top 7% | 2.7%
9. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.1%
10. Briefings in Bioinformatics | 326 papers in training set | Top 3% | 2.1%
11. PLOS Computational Biology | 1633 papers in training set | Top 15% | 1.8%
12. Genome Biology | 555 papers in training set | Top 4% | 1.7%
13. Methods in Ecology and Evolution | 160 papers in training set | Top 1% | 1.7%
14. GigaScience | 172 papers in training set | Top 2% | 1.5%
15. Cell Reports Methods | 141 papers in training set | Top 3% | 1.5%
16. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
17. Molecular Ecology Resources | 161 papers in training set | Top 0.8% | 1.2%
18. Frontiers in Microbiology | 375 papers in training set | Top 7% | 1.2%
19. PLOS ONE | 4510 papers in training set | Top 62% | 1.0%
20. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 1.0%
21. PeerJ | 261 papers in training set | Top 12% | 0.9%
22. Cell Systems | 167 papers in training set | Top 11% | 0.9%
23. Scientific Reports | 3102 papers in training set | Top 70% | 0.9%
24. Genome Research | 409 papers in training set | Top 4% | 0.8%
25. eLife | 5422 papers in training set | Top 58% | 0.8%
26. Microbiology Spectrum | 435 papers in training set | Top 5% | 0.8%
27. Gut Microbes | 70 papers in training set | Top 1% | 0.8%
28. NAR Genomics and Bioinformatics | 214 papers in training set | Top 4% | 0.7%
29. Microorganisms | 101 papers in training set | Top 3% | 0.6%
30. Metabolites | 50 papers in training set | Top 1% | 0.6%
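
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The Python sketch below assumes that the final percentage in each journal entry is that journal's predicted probability; the function name is hypothetical and the values are copied from the first seven entries of the list.

```python
def journals_to_reach(probabilities: list[float], target: float = 0.5) -> int:
    """Return how many top-ranked journals are needed to reach `target` mass."""
    cumulative = 0.0
    # Rank from most to least probable, accumulating until the target is met.
    for rank, p in enumerate(sorted(probabilities, reverse=True), start=1):
        cumulative += p
        if cumulative >= target:
            return rank
    raise ValueError("target mass is not reached by the supplied probabilities")


# Predicted probabilities for ranks 1 to 7, as listed above (as fractions).
listed = [0.226, 0.126, 0.072, 0.064, 0.049, 0.040, 0.027]
needed = journals_to_reach(listed)
```

With these values the running total first passes 0.5 at the fifth journal (0.488 after four, 0.537 after five), which agrees with the position of the "50% of probability mass above" marker.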