
PanKmer: k-mer based and reference-free pangenome analysis

Aylward, A. J.; Petrus, S.; Mamerto, A.; Hartwick, N. T.; Michael, T. P.

2023-04-02 bioinformatics
10.1101/2023.03.31.535143 bioRxiv

Summary: Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be "anchored" in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g., identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias.

Availability and implementation: PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as GitLab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/.

Supplementary information: Supplementary data are available online.
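The k-mer presence-absence table described in the summary can be illustrated with a toy sketch. This is not PanKmer's API or index format: the function names (`kmers`, `presence_absence`, `jaccard`), the example sequences, and the choice of Jaccard similarity as the statistic are all assumptions made for illustration, and the sketch ignores reverse complements and any on-disk encoding.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-length substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_absence(genomes, k):
    """Map each observed k-mer to a tuple of 0/1 flags, one per genome.

    Hypothetical helper: genome columns are ordered by sorted genome name.
    """
    names = sorted(genomes)
    per_genome = {name: kmers(genomes[name], k) for name in names}
    all_kmers = set().union(*per_genome.values())
    table = {
        kmer: tuple(int(kmer in per_genome[name]) for name in names)
        for kmer in sorted(all_kmers)
    }
    return names, table


def jaccard(table, i, j):
    """Jaccard similarity between genome columns i and j of the table."""
    shared = sum(1 for row in table.values() if row[i] and row[j])
    either = sum(1 for row in table.values() if row[i] or row[j])
    return shared / either if either else 0.0


# Toy sequences, invented for illustration only.
genomes = {
    "A": "ACGTACGT",
    "B": "ACGTTCGT",
    "C": "TTTTTTTT",
}
names, table = presence_absence(genomes, k=4)
for i, j in combinations(range(len(names)), 2):
    print(names[i], names[j], round(jaccard(table, i, j), 3))
```

Each row of `table` records which genomes contain a given k-mer, so a single-base difference between two genomes shows up as a small group of k-mers present in one column and absent in the other. Pairwise similarity then reduces to counting rows, with no reference genome or alignment involved.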

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 1%; 22.5%
2. Genome Biology; 555 papers in training set; Top 0.1%; 22.5%
3. Nature Communications; 4913 papers in training set; Top 26%; 6.8%
(50% of probability mass above)
4. Nature Methods; 336 papers in training set; Top 2%; 4.9%
5. Nature Biotechnology; 147 papers in training set; Top 2%; 4.3%
6. Nucleic Acids Research; 1128 papers in training set; Top 5%; 4.0%
7. Genome Research; 409 papers in training set; Top 1.0%; 3.6%
8. Bioinformatics Advances; 184 papers in training set; Top 1%; 3.6%
9. BMC Bioinformatics; 383 papers in training set; Top 3%; 3.6%
10. NAR Genomics and Bioinformatics; 214 papers in training set; Top 0.9%; 3.3%
11. Molecular Biology and Evolution; 488 papers in training set; Top 2%; 2.1%
12. Molecular Plant; 36 papers in training set; Top 0.8%; 1.7%
13. Briefings in Bioinformatics; 326 papers in training set; Top 5%; 1.3%
14. GigaScience; 172 papers in training set; Top 2%; 1.2%
15. Genome Medicine; 154 papers in training set; Top 6%; 0.9%
16. Nature Genetics; 240 papers in training set; Top 6%; 0.9%
17. Cell Reports Methods; 141 papers in training set; Top 5%; 0.8%
18. Plant Communications; 35 papers in training set; Top 1%; 0.7%
19. Journal of Open Source Software; 22 papers in training set; Top 0.3%; 0.7%
20. PLOS Computational Biology; 1633 papers in training set; Top 26%; 0.7%
21. Plant Biotechnology Journal; 56 papers in training set; Top 1%; 0.6%
22. Nature; 575 papers in training set; Top 17%; 0.6%
23. Methods in Ecology and Evolution; 160 papers in training set; Top 3%; 0.6%