
Decoding resistance: interpretable machine learning to predict ciprofloxacin resistance in Shigella spp

Gohari, M. R.; Zhang, P.; Villegas, A.; Rosella, L. C.; Patel, S. N.; Hopkins, J. P.; Duvvuri, V. R.

2026-04-11 infectious diseases
10.64898/2026.04.07.26350353 medRxiv

Antimicrobial resistance (AMR) is a growing global public health threat that complicates the treatment and control of bacterial infections. Shigella spp., a leading cause of bacterial diarrhea worldwide, have increasingly exhibited resistance to multiple antimicrobial agents commonly recommended for treating severe shigellosis. Although conventional antimicrobial susceptibility testing (AST) remains the reference standard, it is time-consuming and provides limited insight into the genetic mechanisms underlying resistance. Whole-genome sequencing (WGS) has emerged as a complementary approach for AMR detection by enabling direct identification of genetic resistance determinants encoded in bacterial genomes. Machine learning (ML) methods applied to genomic features such as k-mers have shown promise for predicting resistance phenotypes from WGS data; however, applications to Shigella remain limited. In this study, we developed and evaluated an interpretable ML framework for predicting ciprofloxacin resistance using k-mer features derived from WGS data of 1,424 Shigella isolates collected in Ontario, Canada, between 2018 and 2025. K-mers were extracted from known gene targets associated with ciprofloxacin resistance, including chromosomal quinolone resistance-determining regions (QRDRs: gyrA and parC) and plasmid-mediated determinants (qnr). Supervised ML approaches were trained and compared. We evaluated the influence of k-mer length (k = 11, 15, 21, and 31) on predictive performance and model interpretability, and compared models based on chromosomal determinants alone with models incorporating both chromosomal and plasmid-mediated determinants. The Random Forest classifier achieved the most consistent performance across models. Inclusion of plasmid-mediated determinants improved predictive accuracy relative to chromosomal-only models.
Although differences across k-mer lengths were modest, k = 11 produced the highest area under the receiver operating characteristic curve (AUC) and the lowest Brier score. SHAP analyses localized high-impact features within the QRDRs of gyrA and parC, supporting biological interpretability. These findings demonstrate that biologically informed k-mer-based ML models can accurately and transparently predict ciprofloxacin resistance in Shigella, supporting their potential integration into genomic AMR surveillance and digital public health frameworks.

Author summary

In this study, we used genome sequencing data to develop machine learning models that predict ciprofloxacin resistance in Shigella directly from bacterial DNA. We focused on small DNA fragments (k-mers) derived from known resistance genes and mutations. Among the approaches tested, a Random Forest model showed the most consistent performance. Combining chromosomal mutations with plasmid-mediated resistance genes improved prediction accuracy and helped identify key genetic regions associated with resistance. These findings demonstrate that machine learning applied to genomic data can accurately and interpretably predict antibiotic resistance, supporting its potential use in genomic surveillance and public health monitoring.
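For readers unfamiliar with the two technical terms used in the abstract, the following is a minimal Python sketch of what a k-mer count and a Brier score are. It is not the authors' pipeline: the function names `kmer_counts` and `brier_score`, the toy sequence, the labels, and the probabilities are all hypothetical, and k = 3 is used only to keep the example readable (the study evaluated k = 11, 15, 21, and 31).

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def brier_score(y_true, y_prob) -> float:
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy sequence, not a real gyrA or parC fragment.
toy_sequence = "ATGGCGATGGCG"
counts = kmer_counts(toy_sequence, k=3)

# Hypothetical labels (1 = resistant, 0 = susceptible) and predicted probabilities.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.1]
score = brier_score(labels, probabilities)
```

A lower Brier score indicates predicted probabilities that sit closer to the observed outcomes, which is why the abstract reports the lowest value as the best. In a full workflow, counts like these would typically form the columns of a feature matrix passed to a classifier.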

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. Microbial Genomics: 204 papers in training set, Top 0.1%, 14.8%
2. Scientific Reports: 3102 papers in training set, Top 6%, 10.1%
3. PLOS Computational Biology: 1633 papers in training set, Top 6%, 6.3%
4. PLOS ONE: 4510 papers in training set, Top 34%, 4.3%
5. BMC Genomics: 328 papers in training set, Top 0.8%, 3.6%
6. Frontiers in Microbiology: 375 papers in training set, Top 3%, 2.7%
7. Biology Methods and Protocols: 53 papers in training set, Top 0.4%, 2.6%
8. Journal of Clinical Microbiology: 120 papers in training set, Top 0.8%, 2.4%
9. mSystems: 361 papers in training set, Top 4%, 2.1%
10. Microbiology Spectrum: 435 papers in training set, Top 2%, 2.1%

50% of probability mass above

11. JAC-Antimicrobial Resistance: 13 papers in training set, Top 0.2%, 1.9%
12. mBio: 750 papers in training set, Top 7%, 1.9%
13. Briefings in Bioinformatics: 326 papers in training set, Top 4%, 1.8%
14. Clinical Infectious Diseases: 231 papers in training set, Top 3%, 1.7%
15. GigaScience: 172 papers in training set, Top 1%, 1.7%
16. The Lancet Microbe: 43 papers in training set, Top 0.6%, 1.7%
17. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.7%
18. mSphere: 281 papers in training set, Top 4%, 1.3%
19. eLife: 5422 papers in training set, Top 47%, 1.3%
20. Genome Medicine: 154 papers in training set, Top 5%, 1.3%
21. Frontiers in Genetics: 197 papers in training set, Top 6%, 1.3%
22. BMC Infectious Diseases: 118 papers in training set, Top 4%, 1.2%
23. PeerJ: 261 papers in training set, Top 10%, 1.2%
24. Nature Communications: 4913 papers in training set, Top 57%, 1.1%
25. Bioinformatics: 1061 papers in training set, Top 8%, 1.0%
26. PLOS Digital Health: 91 papers in training set, Top 2%, 0.8%
27. The Journal of Infectious Diseases: 182 papers in training set, Top 5%, 0.7%
28. Frontiers in Public Health: 140 papers in training set, Top 8%, 0.7%
29. Open Forum Infectious Diseases: 134 papers in training set, Top 3%, 0.7%
30. Antibiotics: 32 papers in training set, Top 1%, 0.7%