Phasing or purging: tackling the genome assembly of a highly heterozygous animal species in the era of high-accuracy long reads

Guiglielmoni, N.; Schiffer, P. H.

2024-06-17 genomics
10.1101/2024.06.16.599187 bioRxiv

The revolution of high-accuracy long reads offers unprecedented quality and contiguity in genome assembly. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies have made significant strides in improving their sequencing technologies, yielding reads with error rates below 1% and lengths ranging from kilobases to megabases. These advancements have prompted the development of assembly tools tailored to leverage the enhanced accuracy of long reads. However, the challenge of collapsing haplotypes into high-quality haploid assemblies persists, especially for highly heterozygous genomes. This raises questions about the feasibility and desirability of phased assemblies versus collapsed haploid assemblies. To address these challenges, we benchmarked five assembly tools on ultra-low input PacBio HiFi and Nanopore R10.4 reads from the parthenogenetic nematode species Plectus sambesii. We propose a comprehensive methodology for assessing phased assemblies, repurposing existing evaluation programs to collect haplotype-relevant statistics. Our evaluation criteria include assembly size, contiguity, and completeness, with a focus on assessing the accuracy of phased assemblies by examining duplicated BUSCO orthologs and k-mer spectra. Additionally, we present strategies for generating collapsed assemblies by purging haplotigs. This study provides valuable insights and guidelines for generating high-quality phased and collapsed de novo genome assemblies from highly accurate long reads, particularly beneficial for non-model species genome assembly projects.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 0.1%, 12.3%
2. Gigabyte: 60 papers in training set, Top 0.1%, 9.9%
3. Molecular Ecology Resources: 161 papers in training set, Top 0.1%, 9.9%
4. Genome Biology: 555 papers in training set, Top 2%, 4.8%
5. BMC Bioinformatics: 383 papers in training set, Top 2%, 4.8%
6. G3: Genes, Genomes, Genetics: 222 papers in training set, Top 0.1%, 4.2%
7. BMC Genomics: 328 papers in training set, Top 0.6%, 4.2%
50% of probability mass above
8. Journal of Heredity: 35 papers in training set, Top 0.1%, 4.1%
9. GigaScience: 172 papers in training set, Top 0.6%, 3.5%
10. Scientific Data: 174 papers in training set, Top 0.5%, 3.5%
11. PLOS Computational Biology: 1633 papers in training set, Top 10%, 3.5%
12. Bioinformatics: 1061 papers in training set, Top 6%, 3.0%
13. Methods in Ecology and Evolution: 160 papers in training set, Top 1%, 2.3%
14. Frontiers in Genetics: 197 papers in training set, Top 4%, 2.1%
15. Scientific Reports: 3102 papers in training set, Top 51%, 2.0%
16. NAR Genomics and Bioinformatics: 214 papers in training set, Top 2%, 1.9%
17. PLOS ONE: 4510 papers in training set, Top 55%, 1.7%
18. Genetics: 225 papers in training set, Top 2%, 1.7%
19. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.7%
20. Genome Research: 409 papers in training set, Top 2%, 1.7%
21. iScience: 1063 papers in training set, Top 25%, 0.9%
22. Biology Methods and Protocols: 53 papers in training set, Top 2%, 0.9%
23. Nucleic Acids Research: 1128 papers in training set, Top 18%, 0.7%
24. PeerJ: 261 papers in training set, Top 16%, 0.7%
25. Developmental Dynamics: 50 papers in training set, Top 0.8%, 0.7%
26. Briefings in Bioinformatics: 326 papers in training set, Top 8%, 0.6%
27. Nature Communications: 4913 papers in training set, Top 66%, 0.6%