Improving GWAS performance in underrepresented groups by appropriate modeling of genetics, environment, and sociocultural factors

Cataldo-Ramirez, C.; Lin, M.; McMahon, A.; Gignoux, C.; Weaver, T. D.; Henn, B. M.

2026-04-08 genetics
10.1101/2024.10.28.620716 bioRxiv

Genome-wide association studies (GWAS) and polygenic score (PGS) development are typically constrained by the data available in biobank repositories, in which European cohorts are vastly overrepresented. Here, we increase the utility of non-European participant data within the UK Biobank (UKB) by characterizing the genetic affinities of UKB participants who self-identify as Bangladeshi, Indian, Pakistani, "White and Asian" (WA), and "Any Other Asian" (AOA), with the aim of building a larger, more robust South Asian sample for future genetic analyses. We assess the relationships between genetic structure and self-selected ethnic identities and use consistent patterns of clustering in the dataset to train a support vector machine (SVM). We used the SVM to reassign n = 1,853 AOA and WA participants at the subcontinental level, increasing the UKB South Asian group by 1,381 participants. We further leverage these samples to assess GWAS performance and PGS development. We include environmental covariates in the height GWAS by implementing a rigorous covariate selection procedure, and compare the outputs of two GWAS models: GWASnull and GWASenv. We show that PGS derived from both GWAS models yield predictions comparable to those of PGS models developed with an order-of-magnitude larger training sample, and that environmentally adjusted PGS models reduce the sex bias in predictive performance. In summary, we demonstrate how GWAS performance can be improved by leveraging ambiguous ethnicity codes, using ancestry-matched imputation panels, and including environmental covariates.
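The reassignment step described in the abstract can be illustrated with a toy sketch. The abstract does not state the SVM kernel, input features, or hyperparameters, so everything below is an assumption made for illustration: a two-class linear SVM trained by plain stochastic subgradient descent, synthetic two-dimensional coordinates standing in for principal components, and the hypothetical labels `cluster_A` and `cluster_B`. It is not the authors' pipeline.

```python
import random

def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=200, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss.

    labels must be +1 or -1. Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Point is misclassified or inside the margin:
                # take a subgradient step on the hinge loss.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x, names=("cluster_A", "cluster_B")):
    """Map the sign of the decision function to a (hypothetical) label."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return names[1] if score >= 0 else names[0]

# Synthetic stand-ins for two principal-component coordinates per participant
# (assumption: real analyses would use PCs computed from genotype data).
data_rng = random.Random(42)
train_points, train_labels = [], []
for center, label in (((-2.0, 0.0), -1), ((2.0, 0.0), +1)):
    for _ in range(30):
        train_points.append(
            (data_rng.gauss(center[0], 0.5), data_rng.gauss(center[1], 0.5))
        )
        train_labels.append(label)

w, b = train_linear_svm(train_points, train_labels)

# Participants with ambiguous self-reported codes get the label of the
# cluster whose side of the decision boundary they fall on.
ambiguous = [(-1.8, 0.1), (2.1, -0.3)]
assigned = [predict(w, b, x) for x in ambiguous]
print(assigned)
```

A real analysis would more likely use an established SVM implementation with multiple classes and a check on classification confidence before reassigning anyone; the sketch only shows the train-on-consistent-clusters, predict-on-ambiguous-codes pattern.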

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Genetics                                         225                     Top 0.2%    21.5%
2     Human Genetics and Genomics Advances             70                      Top 0.1%    16.7%
3     Cell Genomics                                    162                     Top 0.2%    9.7%
4     The American Journal of Human Genetics           206                     Top 0.7%    6.5%
50% of probability mass above
5     G3 Genes|Genomes|Genetics                        351                     Top 0.4%    6.1%
6     Nature Communications                            4913                    Top 31%     6.0%
7     PLOS Genetics                                    756                     Top 3%      4.6%
8     GENETICS                                         189                     Top 0.4%    2.6%
9     Human Molecular Genetics                         130                     Top 2%      1.8%
10    Frontiers in Genetics                            197                     Top 5%      1.7%
11    International Journal of Epidemiology            74                      Top 1%      1.6%
12    Scientific Reports                               3102                    Top 61%     1.6%
13    Proceedings of the National Academy of Sciences  2130                    Top 34%     1.6%
14    eLife                                            5422                    Top 44%     1.6%
15    Bioinformatics                                   1061                    Top 8%      1.4%
16    American Journal of Biological Anthropology      11                      Top 0.2%    1.3%
17    Genetic Epidemiology                             46                      Top 0.6%    1.2%
18    Human Brain Mapping                              295                     Top 4%      1.1%
19    PLOS ONE                                         4510                    Top 65%     0.9%
20    Nature Genetics                                  240                     Top 8%      0.7%
21    Communications Biology                           886                     Top 31%     0.6%
22    European Journal of Human Genetics               49                      Top 2%      0.6%
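
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below copies the first six percentages from the list; the function name `journals_needed` and the 50% threshold argument are illustrative choices, not part of the page.

```python
def journals_needed(probabilities, threshold=50.0):
    """Return (rank, cumulative_total) at which the running sum of
    probabilities first reaches the threshold, or None if it never does."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None

# Predicted probabilities (percent) for ranks 1 to 6, copied from the list.
top_probabilities = [21.5, 16.7, 9.7, 6.5, 6.1, 6.0]

rank, total = journals_needed(top_probabilities)
print(rank, round(total, 1))
```

The running sum after three journals is 47.9%, just short of the threshold, which is why the cutoff falls after the fourth entry.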