Vector2Variant: Discovery of Genetic Associations from ML Derived Representations without Phenotype Engineering

Sooknah, M.; Srinivasan, R.; Sankarapandian, S.; Chen, Z.; Xu, J.

2026-04-17 genetic and genomic medicine
10.64898/2026.04.10.26350624 medRxiv

Genome-wide association studies (GWAS) have transformed our understanding of human biology, but are constrained by the need for predefined phenotypes. We introduce Vector2Variant (V2V), a general-purpose framework that transforms any set of high-dimensional measurements (such as machine learning embeddings) into a genome-wide scan for associations, without requiring rigid specification of a phenotype. Rather than testing genetic variants against single traits, V2V finds the axis in multivariate space along which carriers and non-carriers maximally differ, and produces a continuous "projection phenotype" that can be interpreted by association with disease labels. The projection phenotypes correlate with orthogonal clinical biomarkers never seen during training, suggesting the learned axes capture biologically meaningful variation. We applied V2V to imaging, timeseries, and omics modalities in the UK Biobank and recovered established biology (like the role of CASP9 in renal failure) without the need for targeted measurements, alongside novel associations including a frameshift variant in LRRIQ1 (potentially protective for cardiovascular disease). V2V is computationally efficient at genome-wide scale, producing summary statistics and disease associations that facilitate target prioritization without the need for phenotype engineering.
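To make the idea of a "projection phenotype" concrete, here is a minimal sketch for a single variant. The abstract does not specify how V2V finds its axis, so this uses Fisher's linear discriminant as a stand-in: the direction that best separates carrier and non-carrier means relative to the pooled within-group covariance. The function name `projection_phenotype`, the `ridge` parameter, and the synthetic data are all hypothetical and serve only as illustration.

```python
import numpy as np

def projection_phenotype(embeddings, is_carrier, ridge=1e-3):
    """Project samples onto a carrier-vs-non-carrier axis.

    Illustrative stand-in (Fisher's linear discriminant), not
    necessarily the method used by V2V.
    """
    X = np.asarray(embeddings, dtype=float)
    carrier = np.asarray(is_carrier, dtype=bool)

    mean_carrier = X[carrier].mean(axis=0)
    mean_other = X[~carrier].mean(axis=0)
    mean_diff = mean_carrier - mean_other

    # Pooled within-group covariance, with a small ridge term so the
    # linear solve stays stable when dimensions are correlated.
    centered = np.vstack([X[carrier] - mean_carrier, X[~carrier] - mean_other])
    cov = centered.T @ centered / (len(X) - 2)
    cov += ridge * np.eye(X.shape[1])

    axis = np.linalg.solve(cov, mean_diff)
    axis /= np.linalg.norm(axis)

    # One continuous score per individual.
    return X @ axis, axis

# Synthetic example: 16-dimensional embeddings in which carriers
# are shifted along dimension 3.
rng = np.random.default_rng(0)
n, d = 2000, 16
is_carrier = rng.random(n) < 0.05
X = rng.normal(size=(n, d))
X[is_carrier, 3] += 1.0

phenotype, axis = projection_phenotype(X, is_carrier)
```

The resulting `phenotype` vector is the kind of continuous score that could then be tested against disease labels or biomarkers. Because the axis here is fitted and evaluated on the same samples, a real analysis would need held-out data or a permutation scheme to avoid overstating the separation.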

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature: 575 papers in training set, Top 2%, 13.8%
2. Nature Genetics: 240 papers in training set, Top 0.7%, 10.0%
3. Nature Communications: 4913 papers in training set, Top 20%, 9.7%
4. Genome Biology: 555 papers in training set, Top 1%, 6.1%
5. Cell Systems: 167 papers in training set, Top 2%, 6.1%
6. Nature Medicine: 117 papers in training set, Top 0.4%, 6.1%

50% of probability mass above

7. Cell: 370 papers in training set, Top 5%, 4.0%
8. Nature Methods: 336 papers in training set, Top 3%, 3.5%
9. Nature Neuroscience: 216 papers in training set, Top 3%, 3.5%
10. Genome Medicine: 154 papers in training set, Top 2%, 3.4%
11. The American Journal of Human Genetics: 206 papers in training set, Top 1%, 2.9%
12. Science: 429 papers in training set, Top 10%, 2.9%
13. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 27%, 2.3%
14. Cell Genomics: 162 papers in training set, Top 3%, 2.0%
15. Nature Biomedical Engineering: 42 papers in training set, Top 0.6%, 2.0%
16. Nature Biotechnology: 147 papers in training set, Top 5%, 1.6%
17. Nature Microbiology: 133 papers in training set, Top 3%, 1.6%
18. Bioinformatics: 1061 papers in training set, Top 7%, 1.6%
19. Nucleic Acids Research: 1128 papers in training set, Top 13%, 1.3%
20. Scientific Reports: 3102 papers in training set, Top 68%, 1.2%
21. Neuron: 282 papers in training set, Top 8%, 0.9%
22. Nature Computational Science: 50 papers in training set, Top 2%, 0.7%
23. Cancer Discovery: 61 papers in training set, Top 2%, 0.7%
24. Nature Structural & Molecular Biology: 218 papers in training set, Top 5%, 0.7%
25. PLOS Computational Biology: 1633 papers in training set, Top 28%, 0.6%
26. Nature Human Behaviour: 85 papers in training set, Top 5%, 0.6%
27. Nature Immunology: 71 papers in training set, Top 2%, 0.6%
28. Genome Research: 409 papers in training set, Top 5%, 0.6%