The effect of removing repeat-induced overlaps in de novo assembly

Shiarli Hossein Zade, R.; Abeel, T.

2023-04-18 bioinformatics
10.1101/2023.04.16.537101 bioRxiv

Determining accurate genotypes is important for associating phenotypes with genotypes. De novo genome assembly is a critical step in determining the complete genotype of species for which no reference exists yet. The main challenge in de novo assembly of eukaryotic genomes, particularly plant genomes, is the repetitive DNA sequences they contain. The introduction of third-generation sequencing and its long reads promised to resolve repeat-related problems. While there have been notable improvements, reads originating from these repeats still create errors because they introduce false overlaps in the assembly graph. This study analyzes the effect of repeats on de novo assembly and aims to improve the performance of existing de novo assembly algorithms by removing repeat-induced overlaps. First, we show the improvements in de novo assembly that are possible from removing repeat-induced overlaps. Then we propose several methods for detecting and removing repeat-induced overlaps and evaluate their performance on several simulated datasets.
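
To make the notion of a repeat-induced overlap concrete, here is a minimal sketch, not the authors' method. It assumes simulated reads whose true genomic coordinates on a single sequence are known, and it treats an overlap as repeat-induced when the two reads' true intervals do not intersect. The names Read, is_repeat_induced, filter_overlaps, and min_shared are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    """A simulated read with its known origin (assumption: one linear genome)."""
    name: str
    start: int  # true start on the simulated genome, 0-based inclusive
    end: int    # true end on the simulated genome, exclusive


def is_repeat_induced(a: Read, b: Read, min_shared: int = 1) -> bool:
    """Return True when the true intervals share fewer than min_shared bases."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    return shared < min_shared


def filter_overlaps(
    overlaps: List[Tuple[str, str]], reads: Dict[str, Read]
) -> List[Tuple[str, str]]:
    """Keep only overlaps whose two reads truly come from the same locus."""
    return [
        (x, y) for x, y in overlaps
        if not is_repeat_induced(reads[x], reads[y])
    ]


# Toy data: r1 and r2 are true neighbours; r3 comes from a distant locus and
# is reported as overlapping r1 (for example, because both span a repeat copy).
reads = {
    "r1": Read("r1", 0, 1000),
    "r2": Read("r2", 800, 1800),
    "r3": Read("r3", 5000, 6000),
}
overlaps = [("r1", "r2"), ("r1", "r3")]
kept = filter_overlaps(overlaps, reads)
print(kept)
```

A ground-truth filter of this kind is only available for simulated data. On real reads the true origins are unknown, so detection would have to rely on other signals.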

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set; Top 0.3%; 18.4%
2. Bioinformatics: 1061 papers in training set; Top 4%; 6.3%
3. PLOS Computational Biology: 1633 papers in training set; Top 6%; 6.2%
4. PLOS ONE: 4510 papers in training set; Top 34%; 4.3%
5. Journal of Bioinformatics and Systems Biology: 14 papers in training set; Top 0.1%; 4.1%
6. Frontiers in Genetics: 197 papers in training set; Top 2%; 3.9%
7. The Plant Genome: 53 papers in training set; Top 0.2%; 3.5%
8. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.5%
50% of probability mass above
9. Frontiers in Plant Science: 240 papers in training set; Top 3%; 3.0%
10. Scientific Reports: 3102 papers in training set; Top 43%; 2.8%
11. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 2%; 2.7%
12. Gigabyte: 60 papers in training set; Top 0.4%; 2.6%
13. BMC Genomics: 328 papers in training set; Top 1%; 2.6%
14. Journal of Computational Biology: 37 papers in training set; Top 0.1%; 1.9%
15. Genome Biology: 555 papers in training set; Top 4%; 1.8%
16. Journal of Genetics and Genomics: 36 papers in training set; Top 1.0%; 1.7%
17. Genes: 126 papers in training set; Top 1%; 1.7%
18. Plant Physiology: 217 papers in training set; Top 2%; 1.3%
19. Plant Methods: 39 papers in training set; Top 0.5%; 1.3%
20. Horticulture Research: 43 papers in training set; Top 1%; 1.2%
21. NAR Genomics and Bioinformatics: 214 papers in training set; Top 3%; 1.2%
22. Bioinformatics Advances: 184 papers in training set; Top 4%; 1.1%
23. Plant Direct: 81 papers in training set; Top 2%; 1.1%
24. PeerJ: 261 papers in training set; Top 12%; 0.9%
25. GigaScience: 172 papers in training set; Top 3%; 0.9%
26. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set; Top 0.5%; 0.8%
27. Genome Research: 409 papers in training set; Top 4%; 0.7%
28. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 10%; 0.7%
29. BioData Mining: 15 papers in training set; Top 1%; 0.7%
30. Frontiers in Bioinformatics: 45 papers in training set; Top 1%; 0.6%
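
The 50% cutoff stated for the ranking above can be checked with a running sum of the listed probabilities. The sketch below is illustrative only: it copies the first ten percentages from the list, and the helper name journals_to_reach is hypothetical rather than part of the site that produced the ranking.

```python
from typing import List, Optional, Tuple

# Predicted probabilities (percent) for ranks 1 to 10, copied from the list.
probabilities = [18.4, 6.3, 6.2, 4.3, 4.1, 3.9, 3.5, 3.5, 3.0, 2.8]


def journals_to_reach(
    probs: List[float], threshold: float
) -> Tuple[Optional[int], float]:
    """Return the first rank at which the running sum reaches the threshold,
    together with the running sum at that point (rank is None if never reached)."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total


rank, mass = journals_to_reach(probabilities, 50.0)
print(rank, round(mass, 1))
```

With these values the running sum first reaches 50% at rank 8, where it is about 50.2%, matching the stated cutoff.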