DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism

Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.

2026-05-19 bioinformatics
10.64898/2026.05.15.725487 bioRxiv
Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
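
The abstract's remark about block-based methods that rely on associative operations can be illustrated with a small sketch. The code below is not DistPCA's API: the `Matrix` type and the functions `accumulate_gram` and `gram_blocked` are hypothetical names introduced here, and the Gram matrix X^T X is used only as one plausible example of a block-decomposable quantity. It shows that summing per-block partial products gives the same result as a single pass, so only one row block has to be resident in memory at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix. Hypothetical helper type, not part of DistPCA.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Adds X[r0:r1, :]^T * X[r0:r1, :] into G, where G is cols x cols.
void accumulate_gram(const Matrix& X, std::size_t r0, std::size_t r1, Matrix& G) {
    assert(G.rows == X.cols && G.cols == X.cols && r0 <= r1 && r1 <= X.rows);
    for (std::size_t i = r0; i < r1; ++i)
        for (std::size_t a = 0; a < X.cols; ++a)
            for (std::size_t b = 0; b < X.cols; ++b)
                G.at(a, b) += X.at(i, a) * X.at(i, b);
}

// Computes X^T X one row block at a time. Each block yields an independent
// partial result; the partials are then combined by plain addition.
Matrix gram_blocked(const Matrix& X, std::size_t block_rows) {
    assert(block_rows > 0);
    Matrix G(X.cols, X.cols);
    for (std::size_t r0 = 0; r0 < X.rows; r0 += block_rows) {
        const std::size_t r1 = std::min(X.rows, r0 + block_rows);
        Matrix partial(X.cols, X.cols);          // per-block partial product
        accumulate_gram(X, r0, r1, partial);
        for (std::size_t k = 0; k < G.data.size(); ++k)
            G.data[k] += partial.data[k];        // associative combine step
    }
    return G;
}
```

Calling `gram_blocked(X, X.rows)` processes the whole matrix as one block, while a smaller `block_rows` splits the work into independent pieces. In a distributed setting the combine step could plausibly be a sum reduction across processes; whether DistPCA does this is not stated in the abstract. Floating-point addition is only approximately associative, so results on real data may differ in the last bits depending on block order.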

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.6%; 33.2%
2. BMC Bioinformatics; 383 papers in training set; Top 0.6%; 12.4%
3. Bioinformatics Advances; 184 papers in training set; Top 0.4%; 6.4%
50% of probability mass above
4. Genome Research; 409 papers in training set; Top 0.4%; 6.4%
5. Nucleic Acids Research; 1128 papers in training set; Top 6%; 3.6%
6. Frontiers in Genetics; 197 papers in training set; Top 3%; 2.9%
7. Genome Biology; 555 papers in training set; Top 3%; 2.4%
8. Nature Communications; 4913 papers in training set; Top 46%; 2.4%
9. Briefings in Bioinformatics; 326 papers in training set; Top 3%; 2.4%
10. IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.1%; 2.1%
11. The American Journal of Human Genetics; 206 papers in training set; Top 2%; 1.9%
12. Nature Computational Science; 50 papers in training set; Top 0.6%; 1.7%
13. NAR Genomics and Bioinformatics; 214 papers in training set; Top 2%; 1.7%
14. Nature Methods; 336 papers in training set; Top 5%; 1.5%
15. Genome Medicine; 154 papers in training set; Top 5%; 1.3%
16. PLOS ONE; 4510 papers in training set; Top 58%; 1.3%
17. PLOS Computational Biology; 1633 papers in training set; Top 19%; 1.2%
18. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 5%; 0.9%
19. GigaScience; 172 papers in training set; Top 3%; 0.8%
20. iScience; 1063 papers in training set; Top 29%; 0.8%
21. IEEE Journal of Biomedical and Health Informatics; 34 papers in training set; Top 2%; 0.8%
22. Scientific Reports; 3102 papers in training set; Top 74%; 0.8%
23. PLOS Genetics; 756 papers in training set; Top 17%; 0.6%
24. Advanced Science; 249 papers in training set; Top 22%; 0.6%
25. Cell Systems; 167 papers in training set; Top 14%; 0.6%
26. European Journal of Human Genetics; 49 papers in training set; Top 2%; 0.5%
27. Communications Biology; 886 papers in training set; Top 32%; 0.5%