Back

Haplotype Matching with GBWT for Pangenome Graphs

Sanaullah, A.; Villalobos, S.; Zhi, D.; Zhang, S.

2025-02-07 bioinformatics
10.1101/2025.02.03.634410 bioRxiv
Show abstract

Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph. The graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets. Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1
Bioinformatics
1061 papers in training set
Top 0.6%
33.2%
2
Bioinformatics Advances
184 papers in training set
Top 0.3%
7.2%
3
Genome Research
409 papers in training set
Top 0.3%
7.2%
4
BMC Bioinformatics
383 papers in training set
Top 1%
6.9%
50% of probability mass above
5
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
6.4%
6
Nucleic Acids Research
1128 papers in training set
Top 5%
3.7%
7
PLOS Computational Biology
1633 papers in training set
Top 12%
2.6%
8
Genome Biology
555 papers in training set
Top 3%
2.4%
9
Cell Systems
167 papers in training set
Top 5%
2.1%
10
PLOS ONE
4510 papers in training set
Top 48%
2.1%
11
iScience
1063 papers in training set
Top 10%
2.1%
12
Frontiers in Genetics
197 papers in training set
Top 4%
1.9%
13
Genetics
225 papers in training set
Top 2%
1.7%
14
Scientific Reports
3102 papers in training set
Top 58%
1.7%
15
Nature Communications
4913 papers in training set
Top 52%
1.7%
16
Journal of Molecular Biology
217 papers in training set
Top 2%
1.2%
17
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.4%
1.1%
18
Journal of Computational Biology
37 papers in training set
Top 0.4%
0.9%
19
Nature Biotechnology
147 papers in training set
Top 7%
0.8%
20
BioData Mining
15 papers in training set
Top 0.8%
0.8%
21
NAR Genomics and Bioinformatics
214 papers in training set
Top 4%
0.8%
22
Nature Methods
336 papers in training set
Top 7%
0.6%
23
PLOS Genetics
756 papers in training set
Top 18%
0.5%