
A comparative study of structural variant calling strategies using the Alzheimer's Disease Sequencing Project's whole genome family data

Malamon, J. S.; Farrell, J. J.; Xia, L. C.; Dombroski, B. A.; Lee, W.-P.; Das, R. G.; Vardarajan, B. N.; Way, J.; Kuzma, A. B.; Valladares, O.; Leung, Y. Y.; Scanlon, A.; Lopez, I. A. B.; Brehony, J.; Worley, K. C.; Zhang, N. R.; Wang, L.-S.; Farrer, L. A.; Schellenberg, G. D.

2022-05-20 genetics
10.1101/2022.05.19.492472 bioRxiv

Background: Reliable detection and accurate genotyping of structural variants (SVs) and insertions/deletions (indels) from whole-genome sequence (WGS) data remain a significant challenge. We present a protocol covering variant calling, quality control, call merging, sensitivity analysis, in silico genotyping, and laboratory validation for generating a high-quality deletion call set from whole-genome sequences as part of the Alzheimer's Disease Sequencing Project (ADSP). This dataset contains 578 individuals from 111 families.

Methods: We applied two complementary pipelines (Scalpel and Parliament) for SV/indel calling, breakpoint refinement, genotyping, and local reassembly to produce a high-quality annotated call set. Sensitivity was measured in sample replicates (N=9) for all callers using in silico variant spike-in across a wide range of event sizes. We focused on deletions because these events were more reliably called. To evaluate caller specificity, we developed a novel metric, the D-score, that leverages deletion sharing frequencies within and outside of families to rank recurring deletions. Overall quality across size bins was assessed with the kinship coefficient. Individual callers were evaluated for computational cost, performance, sensitivity, and specificity. Call quality was evaluated by Sanger sequencing of predicted loss-of-function (LOF) variants, variants near AD candidate genes, and randomly selected genome-wide deletions ranging from 2 to 17,000 bp.

Results: We generated a high-quality deletion call set across a wide range of event sizes, consisting of 152,301 deletions with an average of 263 per genome. A total of 114 of 146 predicted deletions (78.1%) were validated by Sanger sequencing. Scalpel was more accurate in calling deletions ≤100 bp, whereas Parliament was more sensitive for deletions >900 bp. We validated 83.0% (88/106) and 72.5% (37/51) of calls made by Scalpel and Parliament, respectively. Eleven deletions called by both Parliament and Scalpel in the 101-900 bp bin were tested, and all were confirmed by Sanger sequencing.

Conclusions: We developed a flexible protocol to assess the quality of deletion detection across a wide range of sizes. We also generated a truth set of deletions validated by Sanger sequencing, with precise breakpoints covering a wide spectrum of sizes between 1 and 17,000 bp.
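The validation percentages quoted in the Results can be rechecked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the authors' pipeline. It uses only the counts given in the abstract, and it assumes (the abstract does not say so explicitly) that the 11 deletions called by both callers are the only overlap between the Scalpel and Parliament validation sets. Under that assumption the per-caller counts add up to the overall 114 of 146.

```python
# Counts quoted in the abstract, as (validated, tested) pairs.
validation = {
    "all": (114, 146),
    "Scalpel": (88, 106),
    "Parliament": (37, 51),
}


def percent(validated, tested):
    """Validation rate as a percentage, rounded to one decimal place."""
    return round(100 * validated / tested, 1)


rates = {name: percent(v, t) for name, (v, t) in validation.items()}

# Assumption (not stated in the abstract): the 11 deletions called by both
# callers, all of which were confirmed, are the only calls shared between
# the two callers' tested sets.
shared_tested = 11
shared_validated = 11

# Inclusion-exclusion over the two callers' tested and validated sets.
union_tested = 106 + 51 - shared_tested
union_validated = 88 + 37 - shared_validated

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"union: {union_validated} of {union_tested}")
```

If the overlap assumption were wrong, the union counts would not match the reported totals, so the check also serves as a consistency test of the quoted figures.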

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1 | Alzheimer's & Dementia | 143 papers in training set | Top 0.1% | 40.1%
2 | International Journal of Epidemiology | 74 papers in training set | Top 0.1% | 10.3%
(50% of probability mass above)
3 | Bioinformatics | 1061 papers in training set | Top 3% | 8.6%
4 | Neurology Genetics | 14 papers in training set | Top 0.1% | 4.9%
5 | BMC Bioinformatics | 383 papers in training set | Top 2% | 4.0%
6 | Genetics in Medicine | 69 papers in training set | Top 0.4% | 3.7%
7 | PLOS ONE | 4510 papers in training set | Top 48% | 2.1%
8 | Genome Medicine | 154 papers in training set | Top 4% | 1.9%
9 | Genetic Epidemiology | 46 papers in training set | Top 0.4% | 1.8%
10 | Scientific Reports | 3102 papers in training set | Top 58% | 1.7%
11 | BMC Genomics | 328 papers in training set | Top 3% | 1.5%
12 | Neurobiology of Aging | 95 papers in training set | Top 2% | 1.1%
13 | Nature Communications | 4913 papers in training set | Top 58% | 1.0%
14 | Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring | 38 papers in training set | Top 0.9% | 0.9%
15 | Annals of Neurology | 57 papers in training set | Top 2% | 0.8%
16 | The American Journal of Human Genetics | 206 papers in training set | Top 4% | 0.8%
17 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.8%
18 | Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
19 | Journal of Alzheimer’s Disease | 39 papers in training set | Top 1% | 0.7%
20 | Neurology | 44 papers in training set | Top 2% | 0.5%
21 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.5%
22 | Annals of Clinical and Translational Neurology | 29 papers in training set | Top 1% | 0.5%
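
The statement that the top 2 journals account for 50% of the predicted probability mass can be checked from the listed percentages with a running sum. This is a minimal sketch using only the first five entries of the list. The function name `journals_to_reach` is hypothetical and is not taken from the site that produced the ranking.

```python
from itertools import accumulate

# (journal, predicted probability in percent) for the five top-ranked entries.
predictions = [
    ("Alzheimer's & Dementia", 40.1),
    ("International Journal of Epidemiology", 10.3),
    ("Bioinformatics", 8.6),
    ("Neurology Genetics", 4.9),
    ("BMC Bioinformatics", 4.0),
]


def journals_to_reach(predictions, threshold):
    """Return how many top-ranked journals are needed for the cumulative
    probability to reach `threshold` percent, or None if it is never reached."""
    running_totals = accumulate(prob for _, prob in predictions)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(predictions, 50))
```

With these values the first two journals sum to about 50.4%, which is the point where the listing places its 50% marker.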