
Sequence effects on patterns of variation and DNA strand asymmetries observed from whole-genome sequenced UK Biobank participants

Curtis, D.

2026-03-07 genetics
10.64898/2026.03.06.710079 bioRxiv

UK Biobank has released whole-genome sequence data for 500,000 participants, including allele counts for hundreds of millions of variants. These variants were considered in the context of the pentanucleotide background on which they occurred. Frequencies of singleton variants were obtained and compared with frequencies of more common variants. Results were highly correlated across chromosomes, reflecting systematic effects. C>T singleton variants were less frequent in the CG context but the opposite was true for more common variants, suggesting that such variants are relatively well tolerated and not subject to strong negative selection. The frequencies of singleton variant types were strongly influenced by their trinucleotide context, and the total counts of variants in each trinucleotide context could be well approximated by combining five mutational signatures obtained from genomes of cancer cells. For some variant types, there were marked asymmetries in counts between the plus and minus DNA strands. The patterns of these asymmetries for singleton variants differed between chromosomes, with five chromosomes being negatively correlated with the rest. These asymmetries did not appear to be related to strand-specific gene content. There were also strand asymmetries for some pentanucleotide sequences in the reference genome, and these were consistent across chromosomes. For example, the sequence TTCGT is seen 673,300 times on the plus strand but only 465,807 times on the minus strand. These findings must reflect strand-specific mechanisms affecting mutation and selection which are not currently well understood and which could be investigated further. This research has been conducted using the UK Biobank Resource.
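
To make the pentanucleotide strand comparison in the abstract concrete, the following is a minimal Python sketch of one way such counts could be obtained. It is not the author's method, which the abstract does not describe. It assumes that an occurrence of a sequence on the minus strand is counted as an occurrence of its reverse complement on the plus strand, and the function names and the toy sequence are hypothetical.

```python
# Hypothetical helper names; the abstract does not say how the counts were made.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_overlapping(seq: str, kmer: str) -> int:
    """Count occurrences of kmer in seq, allowing overlapping matches."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


def strand_counts(plus_strand: str, kmer: str) -> tuple[int, int]:
    """Return (plus, minus) counts of kmer given the plus-strand sequence.

    Assumption: a minus-strand occurrence of kmer corresponds to an
    occurrence of its reverse complement when reading the plus strand.
    """
    plus_strand = plus_strand.upper()
    kmer = kmer.upper()
    plus = count_overlapping(plus_strand, kmer)
    minus = count_overlapping(plus_strand, reverse_complement(kmer))
    return plus, minus


# Toy sequence for illustration only, not real genome data.
toy = "TTCGTAAACGAATTCGT"
plus, minus = strand_counts(toy, "TTCGT")
print(plus, minus)
```

On a real reference chromosome the same function would be applied to the plus-strand sequence read from a FASTA file, and bases other than A, C, G and T (such as N) would simply never match a pentanucleotide.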

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Frontiers in Genetics; 197 papers in training set; Top 0.1%; 18.6%
2. Scientific Reports; 3102 papers in training set; Top 4%; 12.3%
3. European Journal of Human Genetics; 49 papers in training set; Top 0.1%; 7.2%
4. BMC Genomics; 328 papers in training set; Top 0.4%; 4.9%
5. PLOS ONE; 4510 papers in training set; Top 31%; 4.9%
6. Nature Communications; 4913 papers in training set; Top 37%; 4.0%

50% of probability mass above

7. PLOS Genetics; 756 papers in training set; Top 4%; 4.0%
8. Genome Research; 409 papers in training set; Top 1.0%; 3.6%
9. Genes; 126 papers in training set; Top 0.4%; 3.1%
10. Genetic Epidemiology; 46 papers in training set; Top 0.3%; 2.4%
11. Journal of Medical Genetics; 28 papers in training set; Top 0.2%; 2.1%
12. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 1%; 1.7%
13. Human Molecular Genetics; 130 papers in training set; Top 2%; 1.3%
14. PLOS Computational Biology; 1633 papers in training set; Top 19%; 1.3%
15. npj Genomic Medicine; 33 papers in training set; Top 0.5%; 1.3%
16. BMC Cancer; 52 papers in training set; Top 2%; 1.3%
17. Communications Biology; 886 papers in training set; Top 16%; 1.1%
18. Genome Biology and Evolution; 280 papers in training set; Top 1%; 1.1%
19. International Journal of Molecular Sciences; 453 papers in training set; Top 13%; 0.9%
20. Genome Biology; 555 papers in training set; Top 6%; 0.9%
21. BMC Bioinformatics; 383 papers in training set; Top 6%; 0.9%
22. Genome Medicine; 154 papers in training set; Top 7%; 0.8%
23. F1000Research; 79 papers in training set; Top 4%; 0.8%
24. The American Journal of Human Genetics; 206 papers in training set; Top 3%; 0.8%
25. Biology; 43 papers in training set; Top 2%; 0.8%
26. PeerJ; 261 papers in training set; Top 15%; 0.7%
27. GENETICS; 189 papers in training set; Top 1%; 0.7%
28. Nucleic Acids Research; 1128 papers in training set; Top 19%; 0.7%
29. iScience; 1063 papers in training set; Top 35%; 0.7%
30. BMC Genomic Data; 12 papers in training set; Top 0.2%; 0.7%