Concordance and dissonance: A genome-wide analysis of self-declared versus inferred ancestry in 10,250 participants from the HostSeq cohort

Warren, R. L.; Birol, I.

2025-06-13 genomics
10.1101/2025.06.10.658783 bioRxiv
Accurate characterization of human diversity is foundational to equitable genomics. In this study, we analyzed self-declared and genome-derived ancestry in 10,250 participants from the pan-Canadian HostSeq cohort. Using the alignment-free ntRoot algorithm on whole genome sequencing data, we inferred global and local ancestry at the continental super-population level and compared these with self-reported sociocultural identity categories. We observed high concordance among individuals self-identifying as White (98.8%), Black (97.2%), East Asian (96.1%), and South Asian (89.9%). Concordance was lower among those self-identifying as Hispanic (74.6%), Middle Eastern / Central Asian (67.9%), or Indigenous (40.7%), reflecting greater admixture complexity. Agreement between expected and inferred ancestry labels was modest (Cohen's kappa (κ) = -0.01 unweighted; 0.35 weighted), and ancestry discordance was strongly associated with higher Shannon entropy of ancestry fractions. Principal component analysis of ntRoot-derived ancestry composition revealed tightly clustered profiles in some groups and broader, overlapping distributions in others, illustrating how sociocultural identities and genomic data capture distinct but intersecting dimensions of human diversity. These findings support the complementary use of genome-derived continental ancestry fractions alongside self-identification, particularly in settings where sociocultural labels may be incomplete, heterogeneous, or poorly aligned with genetic background. This approach can improve scientific rigor and enhance inclusion in population-scale genomics while respecting the social meaning of identity. We emphasize that genetic ancestry estimates are not proxies for race, which is a social construct with no biological basis.
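
The two summary statistics named in the abstract, Shannon entropy of ancestry fractions and Cohen's kappa, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the base-2 logarithm, the five-way fraction vectors, and the `EUR`/`AFR` labels are all assumptions made for illustration, and only the unweighted form of kappa is shown.

```python
import math
from collections import Counter


def shannon_entropy(fractions):
    """Shannon entropy in bits of ancestry fractions that sum to 1.

    Zero fractions contribute nothing. Base 2 is an assumption here;
    the abstract does not state which logarithm base was used.
    """
    return sum(-p * math.log2(p) for p in fractions if p > 0)


def cohens_kappa(expected, inferred):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    if len(expected) != len(inferred) or not expected:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(expected)
    # Observed agreement: share of individuals whose two labels match.
    observed = sum(e == i for e, i in zip(expected, inferred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    exp_counts = Counter(expected)
    inf_counts = Counter(inferred)
    chance = sum(exp_counts[k] * inf_counts[k] for k in exp_counts) / (n * n)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - chance) / (1.0 - chance)


# Hypothetical inputs, not data from the study.
single_source = [1.0, 0.0, 0.0, 0.0, 0.0]   # one ancestry component only
two_way_admixed = [0.5, 0.5, 0.0, 0.0, 0.0]  # equal split of two components

expected_labels = ["EUR", "EUR", "AFR", "AFR"]
inferred_labels = ["EUR", "EUR", "AFR", "EUR"]

entropy_single = shannon_entropy(single_source)
entropy_admixed = shannon_entropy(two_way_admixed)
kappa = cohens_kappa(expected_labels, inferred_labels)
```

In this toy input the single-source profile has zero entropy and the evenly admixed profile has higher entropy, which is the direction of the association the abstract reports between entropy and label discordance.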

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Cell Genomics, 162 papers in training set, Top 0.1%, 37.7%
2. The American Journal of Human Genetics, 206 papers in training set, Top 0.3%, 14.3%
50% of probability mass above
3. Genome Medicine, 154 papers in training set, Top 2%, 4.3%
4. Genome Biology, 555 papers in training set, Top 2%, 3.7%
5. Scientific Reports, 3102 papers in training set, Top 37%, 3.6%
6. Cell, 370 papers in training set, Top 6%, 3.6%
7. Nature Communications, 4913 papers in training set, Top 40%, 3.6%
8. eLife, 5422 papers in training set, Top 28%, 3.2%
9. European Journal of Human Genetics, 49 papers in training set, Top 0.5%, 2.1%
10. Frontiers in Genetics, 197 papers in training set, Top 4%, 2.1%
11. Human Genetics and Genomics Advances, 70 papers in training set, Top 0.3%, 1.7%
12. Nature Genetics, 240 papers in training set, Top 4%, 1.7%
13. American Journal of Biological Anthropology, 11 papers in training set, Top 0.2%, 1.3%
14. Bioinformatics Advances, 184 papers in training set, Top 4%, 0.9%
15. Proceedings of the National Academy of Sciences, 2130 papers in training set, Top 43%, 0.8%
16. BMC Biology, 248 papers in training set, Top 4%, 0.8%
17. Nature Biotechnology, 147 papers in training set, Top 7%, 0.7%
18. Philosophical Transactions of the Royal Society B, 51 papers in training set, Top 6%, 0.7%
19. Science, 429 papers in training set, Top 20%, 0.7%
20. Methods in Ecology and Evolution, 160 papers in training set, Top 2%, 0.7%
21. Nucleic Acids Research, 1128 papers in training set, Top 20%, 0.6%