
Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity

Parmigiani, L.; Peterlongo, P.

2026-03-18 bioinformatics
10.64898/2026.03.16.711983 bioRxiv

A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
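To make the Hill-number weighting concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes each node's weight is the number of genomes (colors) traversing it, normalized over all nodes; the abstract does not spell out the exact weighting, so that normalization, the function name `hill_number`, and the toy input are all hypothetical. The formula itself is the standard Hill number of order q, D_q = (sum_i p_i^q)^(1/(1-q)), with the q = 1 case taken as the limit exp(-sum_i p_i ln p_i).

```python
import math

def hill_number(counts, q):
    """Standard Hill number of order q for a list of positive counts.

    Assumption (not from the paper): counts[i] is the number of genomes
    traversing node i, and p_i = counts[i] / sum(counts).
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))

# Hypothetical toy graph: three nodes shared by 4 genomes, four nodes seen in 1 genome.
node_genome_counts = [4, 4, 4, 1, 1, 1, 1]

for q in (0, 1, 2):
    print(q, hill_number(node_genome_counts, q))
```

At q = 0 the value is simply the number of nodes, so rare nodes count as much as common ones; as q increases, nodes traversed by few genomes contribute less, which is the sense in which Hill numbers temper the influence of rare sequences.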

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 0.7%, 28.3%
2. BMC Bioinformatics: 383 papers in training set, Top 0.6%, 13.0%
3. Bioinformatics Advances: 184 papers in training set, Top 0.3%, 7.3%
4. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.4%
50% of probability mass above
5. Nucleic Acids Research: 1128 papers in training set, Top 4%, 4.3%
6. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.5%, 4.1%
7. Briefings in Bioinformatics: 326 papers in training set, Top 2%, 3.7%
8. Methods in Ecology and Evolution: 160 papers in training set, Top 1%, 2.4%
9. Journal of Computational Biology: 37 papers in training set, Top 0.1%, 2.1%
10. Genome Research: 409 papers in training set, Top 2%, 2.1%
11. Genome Biology: 555 papers in training set, Top 4%, 1.7%
12. iScience: 1063 papers in training set, Top 14%, 1.7%
13. Nature Communications: 4913 papers in training set, Top 51%, 1.7%
14. Scientific Reports: 3102 papers in training set, Top 58%, 1.7%
15. Frontiers in Microbiology: 375 papers in training set, Top 6%, 1.5%
16. PLOS ONE: 4510 papers in training set, Top 58%, 1.4%
17. Microbiome: 139 papers in training set, Top 2%, 1.4%
18. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 5%, 0.8%
19. Molecular Ecology Resources: 161 papers in training set, Top 1%, 0.8%
20. Frontiers in Genetics: 197 papers in training set, Top 10%, 0.7%
21. Peer Community Journal: 254 papers in training set, Top 4%, 0.7%
22. mSystems: 361 papers in training set, Top 8%, 0.7%
23. Cell Systems: 167 papers in training set, Top 14%, 0.5%
24. BMC Genomics: 328 papers in training set, Top 7%, 0.5%
25. GigaScience: 172 papers in training set, Top 4%, 0.5%
26. Nature Biotechnology: 147 papers in training set, Top 9%, 0.5%
27. Genome Biology and Evolution: 280 papers in training set, Top 2%, 0.5%
28. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 12%, 0.5%