The omnitig framework can improve genome assembly contiguity in practice

Schmidt, S.; Toivonen, S.; Medvedev, P.; Tomescu, A. I.

2023-02-02 bioinformatics
10.1101/2023.01.30.526175 bioRxiv

Despite the long history of genome assembly research, there remains a large gap between theoretical and practical work. On one hand, there is practical software with little theoretical underpinning of its accuracy; on the other, there are theoretical algorithms that have not been adopted in practice. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs, giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the Drosophila melanogaster and Caenorhabditis elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible computational costs and either no increase or only a small increase in the number of misassemblies.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1  Bioinformatics; 1061 papers in training set; Top 1%; 22.5%
2  BMC Bioinformatics; 383 papers in training set; Top 0.4%; 17.5%
3  PLOS Computational Biology; 1633 papers in training set; Top 5%; 6.8%
4  PLOS ONE; 4510 papers in training set; Top 32%; 4.8%
50% of probability mass above
5  Nature Communications; 4913 papers in training set; Top 35%; 4.3%
6  Genome Research; 409 papers in training set; Top 0.8%; 4.0%
7  Algorithms for Molecular Biology; 15 papers in training set; Top 0.1%; 3.6%
8  Peer Community Journal; 254 papers in training set; Top 0.8%; 3.6%
9  iScience; 1063 papers in training set; Top 7%; 2.7%
10  Cell Systems; 167 papers in training set; Top 5%; 2.6%
11  Genome Biology; 555 papers in training set; Top 4%; 1.8%
12  Bioinformatics Advances; 184 papers in training set; Top 3%; 1.7%
13  Journal of Computational Biology; 37 papers in training set; Top 0.2%; 1.5%
14  Frontiers in Genetics; 197 papers in training set; Top 7%; 1.2%
15  Scientific Reports; 3102 papers in training set; Top 66%; 1.2%
16  Nucleic Acids Research; 1128 papers in training set; Top 15%; 0.9%
17  NAR Genomics and Bioinformatics; 214 papers in training set; Top 3%; 0.9%
18  Nature Biotechnology; 147 papers in training set; Top 6%; 0.9%
19  Molecular Biology and Evolution; 488 papers in training set; Top 4%; 0.9%
20  IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.6%; 0.8%
21  PLOS Genetics; 756 papers in training set; Top 14%; 0.8%
22  Genome Biology and Evolution; 280 papers in training set; Top 2%; 0.7%
23  Genetics; 225 papers in training set; Top 4%; 0.7%
24  Nature Methods; 336 papers in training set; Top 7%; 0.6%
25  PeerJ; 261 papers in training set; Top 18%; 0.6%