The Second Brain: Diffusion Models for Realistic Human Microbiome Generation

Yee, B.; Fu, J.

2026-05-11 bioinformatics
10.64898/2026.05.07.723523 bioRxiv

The human microbiome is a critical determinant of health and disease, but microbiome machine learning is constrained by limited data availability, heterogeneous cohort coverage, and privacy risks from individually identifying microbial signatures. Synthetic microbiome generation could support method development and privacy-preserving sharing, provided that generated samples preserve the ecological zero-inflation of real communities. We present a diffusion-based generative model with a sparsity-preserving decoder built around two sparsity-focused mechanisms: (1) prevalence-aware bias initialization that anchors per-taxon presence probabilities to observed prevalences from epoch one; and (2) a hard sparsity loss implemented with straight-through gradient estimators. The implementation also uses hyperbolic taxonomic embeddings as an unvalidated, phylogeny-aware architectural prior in the diffusion backbone. Evaluated on the American Gut Project (4,827 samples, 500 taxa), the full 15.2M-parameter model achieves parametric-level sparsity preservation: 1.4% deviation in the main comparison and 2.6% ± 0.5% deviation across three AGP seeds. SparseDOSSA2 achieves the lowest sparsity deviation in this comparison (0.7%), and MIDASim also passes the operational sparsity threshold (4.9%). Among the three threshold-passing methods, MIDASim achieves the best ecological distance scores, SparseDOSSA2 is best on sparsity deviation, and our model achieves the best prevalence correlation (0.996) while narrowly improving on SparseDOSSA2 on Bray-Curtis (0.0485 vs. 0.0495) and UniFrac (0.0400 vs. 0.0435) discrepancies. PERMANOVA remains able to distinguish generated from real AGP samples (F = 64.29), which we treat as an important limitation rather than evidence of indistinguishability.
These results support a deliberately narrow conclusion: this is, to our knowledge, the first deep generative model to match parametric-level sparsity preservation for human microbiome profiles while remaining competitive on standard ecological distance metrics.
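
The prevalence-aware bias initialization named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the decoder's presence head ends in a sigmoid, so setting each taxon's bias to the logit of its observed prevalence makes the initial presence probability equal that prevalence. The function names, the clipping constant `eps`, and the toy abundance table are all hypothetical.

```python
import math

def taxon_prevalence(samples):
    """Fraction of samples in which each taxon has a nonzero abundance."""
    n_samples = len(samples)
    n_taxa = len(samples[0])
    return [
        sum(1 for row in samples if row[j] > 0) / n_samples
        for j in range(n_taxa)
    ]

def prevalence_bias(prevalence, eps=1e-4):
    """Logit of each clipped prevalence (eps is an assumed clipping constant).

    Clipping keeps the logit finite for taxa present in every sample or in none.
    """
    biases = []
    for p in prevalence:
        p = min(max(p, eps), 1.0 - eps)
        biases.append(math.log(p / (1.0 - p)))
    return biases

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy abundance table: 4 samples x 3 taxa (illustrative values, not AGP data).
toy = [
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.6, 0.0, 0.4],
]

prev = taxon_prevalence(toy)
bias = prevalence_bias(prev)
# With zero incoming weights, the presence head would output these probabilities.
initial_presence = [sigmoid(b) for b in bias]
```

Under these assumptions, a taxon seen in one of four samples starts with a presence probability of 0.25 before any training, which is one plausible reading of anchoring presence probabilities "from epoch one".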

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Biotechnology: 147 papers in training set; Top 0.2%; 18.6%
2. Nature Communications: 4913 papers in training set; Top 21%; 9.1%
3. PLOS Computational Biology: 1633 papers in training set; Top 4%; 8.3%
4. Cell Systems: 167 papers in training set; Top 2%; 7.1%
5. Microbiome: 139 papers in training set; Top 0.8%; 4.3%
6. Nature Microbiology: 133 papers in training set; Top 0.8%; 4.3%

50% of probability mass above

7. Bioinformatics: 1061 papers in training set; Top 5%; 4.3%
8. Nature Machine Intelligence: 61 papers in training set; Top 0.7%; 4.1%
9. Genome Biology: 555 papers in training set; Top 2%; 3.6%
10. Nature Methods: 336 papers in training set; Top 3%; 3.2%
11. Scientific Reports: 3102 papers in training set; Top 55%; 1.8%
12. Briefings in Bioinformatics: 326 papers in training set; Top 4%; 1.7%
13. PLOS ONE: 4510 papers in training set; Top 54%; 1.7%
14. mSystems: 361 papers in training set; Top 5%; 1.7%
15. Nature Medicine: 117 papers in training set; Top 3%; 1.5%
16. Nature Biomedical Engineering: 42 papers in training set; Top 1%; 1.5%
17. Cell Reports Methods: 141 papers in training set; Top 3%; 1.5%
18. Advanced Science: 249 papers in training set; Top 13%; 1.3%
19. BMC Bioinformatics: 383 papers in training set; Top 5%; 1.3%
20. Patterns: 70 papers in training set; Top 2%; 1.2%
21. Nature: 575 papers in training set; Top 14%; 1.1%
22. Nature Genetics: 240 papers in training set; Top 6%; 0.9%
23. Nucleic Acids Research: 1128 papers in training set; Top 16%; 0.9%
24. Nature Computational Science: 50 papers in training set; Top 2%; 0.8%
25. Cell Genomics: 162 papers in training set; Top 6%; 0.8%
26. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 45%; 0.7%
27. Frontiers in Genetics: 197 papers in training set; Top 10%; 0.7%
28. Genome Medicine: 154 papers in training set; Top 9%; 0.7%
29. Cell Host & Microbe: 113 papers in training set; Top 6%; 0.6%