
VAE (Variational Autoencoder) Based Gastrotype Identification and Predictive Diagnosis of Helicobacter pylori Infection

Ma, Z.; Qiao, Y.

2026-04-13 gastroenterology
10.64898/2026.04.11.26350690 medRxiv

Background: The enterotype concept proposed that gut microbiomes cluster into discrete types. Subsequent critiques showed that such clustering depends on methodological choices, that the number of clusters is not fixed, and that faecal samples cannot capture spatial heterogeneity along the gastrointestinal tract. The stomach remains particularly understudied, and no systematic classification exists for gastric microbial community types.

Methods: We assembled a multi-cohort dataset of 566 gastric mucosal samples spanning healthy controls to gastric cancer, including both Helicobacter pylori (HP)-negative and HP-positive individuals. We applied the key methodological lessons of the enterotype debate in three ways. First, we used a variational autoencoder (VAE) for dimensionality reduction, learning a continuous latent representation without forcing discrete structure. Second, we determined the optimal number of clusters with the Silhouette index (an absolute validation measure) across K=2 to K=10 rather than selecting a cluster number arbitrarily. Third, we evaluated multiple clustering solutions transparently. This VAE-plus-silhouette workflow directly addresses the critiques levelled against the original enterotype analysis.

Results: Four gastrotypes were identified, with K=4 achieving the highest mean silhouette score, indicating good cluster cohesion and separation. Two gastrotypes (Variovorax-type and Trabulsiella-type) were significantly enriched in HP-positive samples, while two gastrotypes (Bacteroides-type and Streptococcus-type) were significantly enriched in HP-negative samples. Random Forest and Gradient Boosting achieved excellent baseline performance for predicting HP infection (AUC = 0.990 and 0.993, respectively).

Conclusions: The VAE-plus-silhouette workflow provides a robust, data-driven approach for identifying gastrotypes without forcing discrete structure or arbitrarily fixing cluster numbers. Using this framework, we identified four gastrotypes with significantly different HP infection rates: Variovorax-type and Trabulsiella-type showed strong HP-positive enrichment, while Bacteroides-type and Streptococcus-type showed strong HP-negative enrichment. These findings demonstrate that methodological advances from the enterotype controversy can be successfully transferred to the stomach, offering a reproducible taxonomy for stratifying HP infection status with potential clinical utility.
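
The silhouette-based choice of K described in the Methods can be illustrated with a small sketch. It is not the authors' code: the abstract does not name the clustering algorithm, so k-means with a deterministic farthest-point initialisation is an assumption here, and the 2-D points in `latent` are synthetic stand-ins for VAE latent coordinates, not study data. All function names are hypothetical. The sketch scans K=2 to K=10 and keeps the K with the highest mean silhouette width, s(i) = (b - a) / max(a, b).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    # Deterministic seeding (an assumption): start from the first point, then
    # repeatedly add the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, n_iter=100):
    # Plain Lloyd iterations; stops early once assignments no longer change.
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def mean_silhouette(points, labels):
    # a = mean distance to the point's own cluster,
    # b = mean distance to the nearest other cluster.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def select_k(points, k_values):
    # Score every candidate K and return the one with the highest mean silhouette.
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Synthetic stand-in for a latent embedding: four tight 3x3 grids of points.
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = (-0.5, 0.0, 0.5)
latent = [(cx + dx, cy + dy) for cx, cy in centers for dx in offsets for dy in offsets]

best_k, scores = select_k(latent, range(2, 11))
print(best_k, round(scores[best_k], 3))
```

On real data one would more likely pass the VAE latent matrix to a library routine such as scikit-learn's `silhouette_score`; the hand-written version above is only meant to make the selection rule explicit.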

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Gut Microbes | 70 | Top 0.1% | 22.1%
2 | Microbiome | 139 | Top 0.1% | 17.9%
3 | mSystems | 361 | Top 1% | 8.2%
4 | Frontiers in Cellular and Infection Microbiology | 98 | Top 0.5% | 6.2%
(50% of probability mass above)
5 | Scientific Reports | 3102 | Top 29% | 4.2%
6 | Frontiers in Microbiology | 375 | Top 2% | 4.2%
7 | Nature Communications | 4913 | Top 41% | 3.5%
8 | PLOS ONE | 4510 | Top 55% | 1.7%
9 | PLOS Computational Biology | 1633 | Top 17% | 1.7%
10 | Nature Biomedical Engineering | 42 | Top 0.9% | 1.7%
11 | mBio | 750 | Top 8% | 1.5%
12 | Microbial Genomics | 204 | Top 1% | 1.5%
13 | Methods in Ecology and Evolution | 160 | Top 2% | 1.5%
14 | Bioinformatics | 1061 | Top 8% | 1.2%
15 | PeerJ | 261 | Top 10% | 1.2%
16 | Journal of Medical Internet Research | 85 | Top 3% | 1.2%
17 | Microbiology Spectrum | 435 | Top 4% | 0.9%
18 | npj Biofilms and Microbiomes | 56 | Top 2% | 0.9%
19 | Cell Reports Medicine | 140 | Top 7% | 0.9%
20 | eBioMedicine | 130 | Top 4% | 0.8%
21 | Cell Reports Methods | 141 | Top 5% | 0.8%
22 | Bioinformatics Advances | 184 | Top 5% | 0.7%
23 | mSphere | 281 | Top 6% | 0.7%
24 | Genome Biology | 555 | Top 8% | 0.7%
25 | Science China Life Sciences | 26 | Top 3% | 0.6%
26 | International Journal of Environmental Research and Public Health | 124 | Top 8% | 0.6%
27 | Briefings in Bioinformatics | 326 | Top 8% | 0.6%
28 | GigaScience | 172 | Top 4% | 0.6%