Performance of IBD machine learning classifiers varies across microbiome training data independent of geographic diversity

Wolf, A.; Cirolia, G.; Gustafson, J. T.; Aswani, A.

2026-05-22 microbiology

10.64898/2026.05.21.727052 bioRxiv

Microbiome-based machine learning classifiers show increasing promise for disease identification across gastrointestinal, metabolic, and immune-mediated conditions. Inflammatory bowel disease (IBD), a chronic immune-mediated disorder associated with disruption of the gut microbiome, has been a particularly successful application area. However, while many predictive models achieve high performance within individual datasets, their ability to generalize across independent populations and geographic contexts remains unclear. Here, we tested whether model class and training dataset composition influence model generalizability across geographically diverse evaluation studies. We compiled seven publicly available shotgun metagenomic studies spanning five geographic regions, comprising 697 individuals who either had IBD or were healthy controls. We trained 246,986 model configurations across seven model classes and five distinct training dataset combinations and evaluated top-performing models on independent studies from the USA, Ireland, Germany, Israel, and China. Extreme gradient boosting and random forest models showed the highest and most consistent performance across training datasets, a ranking that was maintained on independent evaluation studies. However, models trained on geographically diverse datasets did not outperform those trained on USA-only datasets. Instead, model performance was strongly dependent on the evaluation study itself, with consistent differences in achievable accuracy across studies. Although most models achieved similar AUC scores, there was limited overlap in the key microbial species they identified. Furthermore, even for the small set of disease-predictive microbes shared between models, the direction of enrichment between IBD and healthy subjects was often reversed from one study population to another. These findings suggest that study-specific factors constrain generalization and may help explain the lack of consistent microbiome-based biomarkers for IBD.

Importance

Machine learning models based on the human gut microbiome are increasingly proposed as diagnostic tools for inflammatory bowel disease, but our findings suggest that identifying reliable microbiome biomarkers poses a challenge. Models trained on different datasets often selected different species as important predictors, even when diagnostic performance was similar, indicating that disease-associated microbes may depend strongly on the patient populations studied. Even species repeatedly selected across training datasets frequently showed inconsistent associations with disease, helping explain low agreement across microbiome studies. Importantly, models performed well across new patient groups regardless of the geographic diversity present in the training datasets. By identifying microbial species repeatedly selected across datasets, model types, and evaluation studies, we identified a smaller group of more consistent biomarkers, including enrichment of Klebsiella pneumoniae and Erysipelatoclostridium ramosum and depletion of Lachnospiraceae and Alistipes species, which may represent stronger candidates for transferable microbiome markers.
