Back

Co-linear chaining with overlaps and gap costs

Jain, C.; Gibney, D.; Thankachan, S. V.

2021-02-03 bioinformatics
10.1101/2021.02.03.429492 bioRxiv
Show abstract

Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in O(n) time, where n denotes the count of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal anchored edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient above 0.9 with edit distance for closely as well as distantly related sequences.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1
Bioinformatics
1061 papers in training set
Top 0.8%
27.5%
2
Nature Communications
4913 papers in training set
Top 14%
12.3%
3
Cell Systems
167 papers in training set
Top 2%
6.3%
4
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
4.8%
50% of probability mass above
5
Genome Research
409 papers in training set
Top 0.7%
4.3%
6
BMC Bioinformatics
383 papers in training set
Top 2%
3.8%
7
Scientific Reports
3102 papers in training set
Top 37%
3.6%
8
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 20%
3.6%
9
Bioinformatics Advances
184 papers in training set
Top 1%
3.6%
10
Nature Methods
336 papers in training set
Top 3%
2.9%
11
Nature Biotechnology
147 papers in training set
Top 3%
2.6%
12
iScience
1063 papers in training set
Top 8%
2.6%
13
Nature Computational Science
50 papers in training set
Top 0.4%
2.1%
14
PLOS Computational Biology
1633 papers in training set
Top 14%
2.1%
15
Communications Biology
886 papers in training set
Top 11%
1.5%
16
PLOS ONE
4510 papers in training set
Top 59%
1.3%
17
Journal of Computational Biology
37 papers in training set
Top 0.3%
1.2%
18
Genome Biology
555 papers in training set
Top 6%
0.9%
19
Nucleic Acids Research
1128 papers in training set
Top 15%
0.9%
20
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.6%
0.8%
21
NAR Genomics and Bioinformatics
214 papers in training set
Top 4%
0.7%
22
Advanced Science
249 papers in training set
Top 20%
0.7%
23
The American Journal of Human Genetics
206 papers in training set
Top 4%
0.6%
24
Briefings in Bioinformatics
326 papers in training set
Top 7%
0.6%