
SNPic: SNP Topic Modeling for Interpretable Clustering of Complex phenotypes

Leyi, Z.; Seiler, C.; Speed, D.; Micheroli, R.; Ospelt, C.

2026-04-24 genetics
10.64898/2026.04.22.720106 bioRxiv

Genome-wide association studies (GWAS) have cataloged thousands of disease-associated variants, yet a central challenge remains: decoding the shared, pleiotropic architecture that links complex phenotypes. Existing approaches, including dimensionality reduction methods and regression-based genetic models, either lack interpretability or rely on external linkage disequilibrium (LD) reference panels, limiting their ability to recover coherent biological mechanisms. Here we introduce the SNP topic model (SNPic), a generative probabilistic framework that reframes GWAS summary statistics as a structured corpus and models genetic architecture using principles from natural language processing (NLP). By treating phenotypes as documents and genes or the whole corpus of traits as words, SNPic applies topic models such as Latent Dirichlet Allocation (LDA) to infer latent "genetic topics": interpretable, overlapping biological modules that jointly explain complex traits. This formulation enables simultaneous reconstruction of trait relationships and identification of their underlying molecular drivers. SNPic integrates two complementary schemes within a unified modeling framework: Sumstat-as-word, which captures global phenotypic structure, and Gene-as-word, which resolves mechanistic detail. To ensure robustness, we introduce a stability-optimized inference pipeline based on bootstrap resampling, allowing data-driven selection of the number of topics and filtering of stochastic signals. Across extensive simulations, SNPic consistently outperforms conventional dimensionality reduction methods in recovering latent structure under both linear and non-linear, highly overlapping genetic architectures. Applied to integrated FinnGen and UK Biobank datasets, SNPic identifies reproducible genetic topics corresponding to distinct biological programs, including HLA-mediated immune processes and transporter-driven metabolic regulation, with strong tissue-specific support.
The framework further generalizes across species, organizing complex traits in maize, Arabidopsis thaliana, and cattle into biologically coherent modules. Together, these results establish SNPic as a scalable and interpretable framework that shifts GWAS analysis from association cataloging toward the construction of a knowledge graph representing the latent semantic architecture of the genome. By unifying statistical genetics with NLP, SNPic reframes GWAS analysis as a probabilistic language modeling task, enabling the systematic decoding of complex trait architectures and delivering a systemic graph of cross-phenotype relationships.
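To make the "phenotypes as documents, genes as words" idea concrete, here is a minimal sketch of LDA fitted by textbook collapsed Gibbs sampling on a toy corpus. It is not the authors' implementation: the trait and gene names, the token counts, the hyperparameters (`alpha`, `beta`), and the function `fit_lda_gibbs` are all hypothetical, and how SNPic actually derives word counts from summary statistics is not specified in the abstract.

```python
import random


def fit_lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not SNPic itself).

    docs maps a phenotype name to a list of gene tokens.
    Returns (theta, phi, vocab): per-phenotype topic proportions,
    per-topic gene distributions, and the sorted gene vocabulary.
    """
    rng = random.Random(seed)
    names = list(docs)
    vocab = sorted({gene for tokens in docs.values() for gene in tokens})
    word_id = {gene: i for i, gene in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * n_topics for _ in names]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, name in enumerate(names):
        z = [rng.randrange(n_topics) for _ in docs[name]]
        for gene, k in zip(docs[name], z):
            doc_topic[d][k] += 1
            topic_word[k][word_id[gene]] += 1
            topic_total[k] += 1
        assignments.append(z)

    for _ in range(n_iter):
        for d, name in enumerate(names):
            for i, gene in enumerate(docs[name]):
                w = word_id[gene]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = {
        name: [
            (doc_topic[d][t] + alpha) / (len(docs[name]) + n_topics * alpha)
            for t in range(n_topics)
        ]
        for d, name in enumerate(names)
    }
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + n_words * beta)
         for w in range(n_words)]
        for t in range(n_topics)
    ]
    return theta, phi, vocab


# Made-up corpus: each token stands for one association signal mapped to a gene.
toy_corpus = {
    "trait_A": ["immune_gene_1"] * 5 + ["immune_gene_2"] * 4 + ["immune_gene_3"] * 3,
    "trait_B": ["immune_gene_1"] * 3 + ["immune_gene_2"] * 5 + ["immune_gene_3"] * 2,
    "trait_C": ["transport_gene_1"] * 4 + ["transport_gene_2"] * 5 + ["transport_gene_3"] * 3,
    "trait_D": ["transport_gene_1"] * 6 + ["transport_gene_2"] * 2 + ["transport_gene_3"] * 4,
    "trait_E": ["immune_gene_1"] * 3 + ["transport_gene_2"] * 3,  # mixed trait
}

theta, phi, vocab = fit_lda_gibbs(toy_corpus, n_topics=2)
for name, row in theta.items():
    print(name, [round(p, 2) for p in row])
```

Each row of `theta` gives a phenotype's mixture over topics, and each row of `phi` gives a topic's distribution over genes, which is what lets a topic be read as an overlapping gene module. In practice a library implementation such as scikit-learn's `LatentDirichletAllocation` could replace the hand-written sampler.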

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Probability
1 | Nature Genetics | 240 | Top 0.1% | 27.7%
2 | The American Journal of Human Genetics | 206 | Top 0.3% | 12.4%
3 | Nature Communications | 4913 | Top 25% | 7.2%
4 | Genome Biology | 555 | Top 1% | 6.3%
(50% of probability mass above)
5 | Cell Genomics | 162 | Top 1% | 4.2%
6 | Nature | 575 | Top 6% | 4.0%
7 | Nature Biotechnology | 147 | Top 3% | 3.6%
8 | Proceedings of the National Academy of Sciences | 2130 | Top 20% | 3.6%
9 | Nucleic Acids Research | 1128 | Top 7% | 2.7%
10 | Science | 429 | Top 13% | 1.9%
11 | Genome Medicine | 154 | Top 4% | 1.8%
12 | PLOS Genetics | 756 | Top 8% | 1.7%
13 | Bioinformatics | 1061 | Top 7% | 1.7%
14 | Nature Neuroscience | 216 | Top 4% | 1.7%
15 | Nature Human Behaviour | 85 | Top 3% | 1.5%
16 | PLOS Computational Biology | 1633 | Top 19% | 1.3%
17 | Science Translational Medicine | 111 | Top 4% | 1.2%
18 | Genetics | 225 | Top 3% | 1.2%
19 | Nature Methods | 336 | Top 6% | 0.9%
20 | Science Advances | 1098 | Top 28% | 0.8%
21 | Genome Research | 409 | Top 4% | 0.8%
22 | Nature Plants | 84 | Top 2% | 0.6%
23 | Nature Computational Science | 50 | Top 2% | 0.6%
24 | Cell Reports | 1338 | Top 36% | 0.6%
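
The 50% statement above can be checked from the listed probabilities. The sketch below assumes it means the smallest number of top-ranked journals whose cumulative predicted probability reaches at least 50%. The helper name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (percent) for ranks 1-24, copied from the list above.
probs = [27.7, 12.4, 7.2, 6.3, 4.2, 4.0, 3.6, 3.6, 2.7, 1.9, 1.8, 1.7,
         1.7, 1.7, 1.5, 1.3, 1.2, 1.2, 0.9, 0.8, 0.8, 0.6, 0.6, 0.6]


def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank reaching the threshold."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None  # threshold never reached by the listed journals


rank, mass = journals_to_reach(probs)
print(f"top {rank} journals cover {mass:.1f}% of the predicted probability")
```

The first three journals sum to 47.3%, so the fourth is the one that crosses the threshold, bringing the cumulative mass to 53.6%.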