Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets

Martayan, I.; Cazaux, B.; Limasset, A.; Marchet, C.

2024-03-25 bioinformatics
10.1101/2024.01.29.577700 bioRxiv
In this paper, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Building on Conway and Bromage's concept, CBL uses the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. To support dynamic operations and set operations, we propose a dynamic bit vector structure that parallels the Elias-Fano scheme. This structure is implemented in a Rust library that balances construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. To our knowledge, CBL is the only exact k-mer structure offering in-place set operations. This combination of abilities makes it a flexible Swiss Army knife for k-mer set management. Availability: https://github.com/imartayan/CBL
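
The "smallest cyclic rotation" idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the CBL implementation (the library is written in Rust and its internals are not described here); the function names `smallest_rotation` and `restore` are hypothetical, and the simple quadratic search is chosen for readability rather than speed.

```python
def smallest_rotation(kmer: str) -> tuple[str, int]:
    """Return the lexicographically smallest cyclic rotation of `kmer`
    together with the offset at which that rotation starts."""
    k = len(kmer)
    doubled = kmer + kmer  # every rotation is a length-k window of this string
    # Illustrative O(k^2) search; ties resolve to the smallest offset.
    offset = min(range(k), key=lambda i: doubled[i:i + k])
    return doubled[offset:offset + k], offset


def restore(rotation: str, offset: int) -> str:
    """Recover the original k-mer from its smallest rotation and offset."""
    k = len(rotation)
    return rotation[k - offset:] + rotation[:k - offset]


rotation, offset = smallest_rotation("GATTACA")
print(rotation, offset)
print(restore(rotation, offset))
```

All cyclic rotations of a k-mer map to the same smallest rotation, so storing that rotation plus a small offset is enough to recover the original k-mer exactly.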

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Bioinformatics | 1061 | Top 2% | 15.4%
2 | Algorithms for Molecular Biology | 15 | Top 0.1% | 7.1%
3 | Scientific Reports | 3102 | Top 21% | 5.1%
4 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.1% | 5.1%
5 | Bioinformatics Advances | 184 | Top 0.8% | 4.5%
6 | iScience | 1063 | Top 3% | 4.5%
7 | PLOS ONE | 4510 | Top 33% | 4.5%
8 | BMC Bioinformatics | 383 | Top 2% | 4.5%
(50% of probability mass above)
9 | Nature Communications | 4913 | Top 38% | 3.8%
10 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 | Top 0.1% | 2.6%
11 | Nucleic Acids Research | 1128 | Top 8% | 2.2%
12 | PLOS Computational Biology | 1633 | Top 13% | 2.2%
13 | Gigabyte | 60 | Top 0.6% | 1.9%
14 | Journal of Molecular Biology | 217 | Top 1% | 1.9%
15 | Cell Systems | 167 | Top 7% | 1.8%
16 | GigaScience | 172 | Top 1% | 1.7%
17 | Proceedings of the National Academy of Sciences | 2130 | Top 34% | 1.6%
18 | Journal of Computational Biology | 37 | Top 0.3% | 1.4%
19 | Frontiers in Genetics | 197 | Top 6% | 1.3%
20 | Genome Research | 409 | Top 3% | 1.2%
21 | Journal of Chemical Information and Modeling | 207 | Top 2% | 1.2%
22 | Advanced Science | 249 | Top 15% | 1.0%
23 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 1.0%
24 | Computational and Structural Biotechnology Journal | 216 | Top 7% | 0.9%
25 | Genomics, Proteomics & Bioinformatics | 171 | Top 5% | 0.9%
26 | IEEE Access | 31 | Top 0.8% | 0.8%
27 | Frontiers in Bioinformatics | 45 | Top 0.8% | 0.8%
28 | Royal Society Open Science | 193 | Top 5% | 0.8%
29 | PeerJ | 261 | Top 15% | 0.8%
30 | Communications Biology | 886 | Top 31% | 0.5%