Integrating Structural Variants into Sequence-Based GWAS Using a Pangenome and Imputation Framework in French Dairy Cattle

NAJI, M.; Sorin, V.; Grohs, C.; Fritz, S.; Klopp, C.; Faraut, T.; Boichard, D.; Sanchez, M.-P.; Boussaha, M.

2026-01-21 genomics
10.64898/2026.01.18.700144 bioRxiv

Structural variants (SVs) are most effectively identified using long-read (LR) sequencing. However, LR data remain limited, and sequenced samples often lack associated phenotypic information. To overcome this limitation, we combined pangenome-based (variation graph) and imputation approaches to enable large-scale SV association studies in the three main French dairy cattle breeds. A variation graph was constructed using 69,892 deletions, 89,900 insertions, and 17,402 duplications detected in 176 LR samples. We subsequently genotyped 939 samples for each SV in the panel by realigning their short-read (SR) sequences to the graph. Validation analyses showed high genotype concordance rates for deletions (0.79) and insertions (0.79); however, the rates for duplications were low (0.14), leading to their exclusion from this study. Retained SVs were combined with single nucleotide variants (SNVs) and served as a sequence-level imputation reference panel. From the SNP genotyping array data, we imputed SVs and SNVs for 11,902 Holstein, 3,753 Montbeliarde, and 3,053 Normande bulls. After quality control, more than 14 million SNVs and 40 thousand SVs were retained for within-breed genome-wide association analyses (GWAS) with daughter yield deviations for 13 traits related to milk production, udder health, fertility, and stature. The GWAS results revealed genetic architectures consistent with earlier findings and uncovered thirty-six unique significant associations between SVs and traits. Conditional analysis revealed that ten of these SVs were the primary variants in the quantitative trait loci related to fat content, protein content, and stature.
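The genotype concordance rates quoted in the abstract can be illustrated with a small sketch. The abstract does not give the exact definition used, so this assumes the simplest one: the fraction of compared calls in which the SR-based genotype carries the same alleles as the LR-based genotype, ignoring phase and skipping missing calls. The function name and the tuple encoding of genotypes are hypothetical, chosen for illustration only.

```python
import math


def genotype_concordance(truth, called):
    """Fraction of non-missing genotype pairs that carry the same alleles.

    Assumption (not taken from the paper): genotypes are tuples of allele
    indices such as (0, 1), None marks a missing call, and phase is ignored,
    so (0, 1) and (1, 0) count as concordant.
    """
    compared = 0
    matches = 0
    for t, c in zip(truth, called):
        if t is None or c is None:
            continue  # skip sites where either call is missing
        compared += 1
        if sorted(t) == sorted(c):
            matches += 1
    # No comparable sites: the rate is undefined.
    return matches / compared if compared else math.nan


# Toy data: three comparable sites, two of which agree.
lr_truth = [(0, 0), (0, 1), (1, 1), (0, 1)]
sr_calls = [(0, 0), (1, 0), (0, 1), None]
rate = genotype_concordance(lr_truth, sr_calls)
```

Computing this rate separately per SV type (deletion, insertion, duplication) would give per-class figures of the kind reported above. Other definitions, such as non-reference concordance, are equally plausible and would yield different values.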

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Nature Communications | 4913 papers in training set | Top 14% | 12.3%
2. BMC Genomics | 328 papers in training set | Top 0.1% | 10.1%
3. Genetics Selection Evolution | 33 papers in training set | Top 0.1% | 7.2%
4. Scientific Reports | 3102 papers in training set | Top 12% | 7.2%
5. Communications Biology | 886 papers in training set | Top 0.3% | 6.4%
6. Frontiers in Genetics | 197 papers in training set | Top 0.8% | 6.4%
7. PLOS Genetics | 756 papers in training set | Top 4% | 4.2%

50% of probability mass above

8. Genome Medicine | 154 papers in training set | Top 2% | 4.0%
9. Genome Research | 409 papers in training set | Top 1.0% | 3.6%
10. eLife | 5422 papers in training set | Top 31% | 2.7%
11. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 25% | 2.6%
12. Cell Genomics | 162 papers in training set | Top 2% | 2.1%
13. PLOS ONE | 4510 papers in training set | Top 54% | 1.7%
14. Science Advances | 1098 papers in training set | Top 17% | 1.7%
15. Journal of Dairy Science | 11 papers in training set | Top 0.1% | 1.7%
16. Emerging Infectious Diseases | 103 papers in training set | Top 1% | 1.7%
17. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 4% | 1.7%
18. Genomics | 60 papers in training set | Top 1% | 1.5%
19. Human Molecular Genetics | 130 papers in training set | Top 2% | 1.3%
20. Genome Biology | 555 papers in training set | Top 5% | 1.3%
21. iScience | 1063 papers in training set | Top 32% | 0.7%
22. PNAS Nexus | 147 papers in training set | Top 2% | 0.7%
23. Cell Reports | 1338 papers in training set | Top 34% | 0.7%
24. Genetics | 225 papers in training set | Top 4% | 0.7%
25. Gigabyte | 60 papers in training set | Top 2% | 0.6%
26. Scientific Data | 174 papers in training set | Top 3% | 0.6%
27. The American Journal of Human Genetics | 206 papers in training set | Top 4% | 0.6%
28. International Journal of Epidemiology | 74 papers in training set | Top 3% | 0.6%
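
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the percentages listed above. This is a minimal sketch; the function name is hypothetical, and only the first ten listed probabilities are included since the cutoff falls within them.

```python
def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Predicted probabilities (in percent) for ranks 1 to 10, as listed above.
top_probabilities = [12.3, 10.1, 7.2, 7.2, 6.4, 6.4, 4.2, 4.0, 3.6, 2.7]
cutoff_rank = journals_to_reach(top_probabilities, 50.0)
```

The first six entries sum to 49.6%, just short of the threshold, and the seventh brings the total to 53.8%, which is why the cutoff line sits after rank 7.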