Fast computation of principal components of genomic similarity matrices

Hahn, G.; Lutz, S.; Hecker, J.; Prokopenko, D.; Cho, M.; Silverman, E. K.; Weiss, S. T.; Lange, C.

2022-10-08 bioinformatics
10.1101/2022.10.06.511168 bioRxiv
Computing a similarity measure for genomic data, for instance the (genomic) covariance matrix, the Jaccard matrix, or the genomic relationship matrix (GRM), is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases in, for instance, linear regressions. However, computing both a similarity matrix and its singular value decomposition (SVD) is computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (the genomic covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way that allows for an exact, faster SVD computation, and we propose an exact algorithm to compute their principal components. The algorithm is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The exception is the (unweighted) Jaccard matrix, whose structure does not admit the fast SVD computation. Second, we introduce an approximate Jaccard matrix to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the distance (in L2 norm and angle) between the principal components of the Jaccard matrix and those of our proposed approximation, thus putting the approximation on a solid mathematical foundation. We illustrate all computations on both simulated data and data from the 1000 Genomes Project, showing that the approximation error is very low in practice.
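
The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea: a randomized SVD that obtains the leading principal components of a centered similarity matrix while touching the genotype data only through sparse matrix products. It is a minimal sketch under stated assumptions, not the authors' implementation. The assumptions are that `X` is a SciPy sparse matrix with individuals as rows and variants as columns, that the similarity matrix is proportional to `Xc @ Xc.T` where `Xc` is `X` with each column mean subtracted, and that a generic randomized range finder with power iterations is used. The function names `centered_matmul`, `centered_rmatmul`, and `sparse_pcs` are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def centered_matmul(X, mu, Q):
    """Return (X - 1 mu^T) @ Q using only a sparse-times-dense product."""
    return X @ Q - np.outer(np.ones(X.shape[0]), mu @ Q)


def centered_rmatmul(X, mu, Y):
    """Return (X - 1 mu^T)^T @ Y using only a sparse-times-dense product."""
    return X.T @ Y - np.outer(mu, Y.sum(axis=0))


def sparse_pcs(X, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k left singular vectors and singular values of
    the column-centered matrix Xc = X - 1 mu^T. The vectors are the leading
    principal components of the n x n matrix Xc @ Xc.T, which is never formed.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()   # per-variant means
    size = min(k + oversample, n, m)          # sketch width

    # Random sketch of the column space of Xc.
    Y = centered_matmul(X, mu, rng.standard_normal((m, size)))

    # Power iterations with re-orthonormalization for numerical stability.
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(centered_rmatmul(X, mu, Y))
        Y = centered_matmul(X, mu, Z)

    Q, _ = np.linalg.qr(Y)                    # orthonormal basis, n x size
    B = centered_rmatmul(X, mu, Q).T          # small matrix Q^T Xc, size x m
    Ub, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                                # lift back to n dimensions
    return U[:, :k], s[:k]
```

The centering is applied implicitly as a rank-one correction, so the sparse matrix is never densified. In general the result is an approximation whose quality depends on `oversample`, `n_iter`, and the spectrum of the data; it becomes exact up to rounding only when the sketch width reaches the number of individuals or variants.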

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 0.8% | 26.5%
2 | Journal of Computational Biology | 37 papers in training set | Top 0.1% | 8.6%
3 | BMC Bioinformatics | 383 papers in training set | Top 1% | 8.6%
4 | PLOS Computational Biology | 1633 papers in training set | Top 5% | 7.0%
50% of probability mass above
5 | Theoretical Population Biology | 47 papers in training set | Top 0.1% | 3.7%
6 | Frontiers in Genetics | 197 papers in training set | Top 2% | 3.7%
7 | Genetics | 225 papers in training set | Top 2% | 2.1%
8 | PLOS ONE | 4510 papers in training set | Top 47% | 2.1%
9 | Biophysical Journal | 545 papers in training set | Top 3% | 1.8%
10 | Algorithms for Molecular Biology | 15 papers in training set | Top 0.1% | 1.8%
11 | Scientific Reports | 3102 papers in training set | Top 57% | 1.7%
12 | Statistics in Medicine | 34 papers in training set | Top 0.2% | 1.7%
13 | Biostatistics | 21 papers in training set | Top 0.1% | 1.7%
14 | Physical Review E | 95 papers in training set | Top 0.7% | 1.7%
15 | Journal of Bioinformatics and Systems Biology | 14 papers in training set | Top 0.2% | 1.5%
16 | The Annals of Applied Statistics | 15 papers in training set | Top 0.1% | 1.3%
17 | The American Journal of Human Genetics | 206 papers in training set | Top 3% | 1.3%
18 | Biometrics | 22 papers in training set | Top 0.1% | 1.3%
19 | Bioinformatics Advances | 184 papers in training set | Top 4% | 1.3%
20 | Genetic Epidemiology | 46 papers in training set | Top 0.6% | 1.1%
21 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 3% | 1.0%
22 | IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.4% | 1.0%
23 | Cancers | 200 papers in training set | Top 4% | 0.8%
24 | GENETICS | 189 papers in training set | Top 1% | 0.7%
25 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 45% | 0.7%
26 | Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
27 | Cell Systems | 167 papers in training set | Top 13% | 0.7%
28 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.7% | 0.7%
29 | Nature Communications | 4913 papers in training set | Top 67% | 0.5%