
MCHelper automatically curates transposable element libraries across species

Orozco, S.; Sierra, P.; Durbin, R.; Gonzalez, J.

2023-10-20 genomics
10.1101/2023.10.17.562682 bioRxiv

The number of species with high-quality genome sequences continues to increase, in part due to the scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element sequences, which have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits, is becoming ever more evident. However, accurate genome-wide annotation of transposable element sequences is still technically challenging. Several de novo transposable element identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and lacks reproducibility. In this work, we present the Manual Curator Helper tool MCHelper, which automates the TE library curation process. By leveraging MCHelper's fully automated mode with the outputs from three de novo transposable element identification tools, RepeatModeler2, EDTA and REPET, in fruit fly, rice, hooded crow, zebrafish, maize, and human, we show a substantial improvement in the quality of the transposable element libraries and genome annotations. MCHelper libraries are less redundant, with up to 65% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and up to ~48% fewer "unclassified/unknown" transposable element consensus sequences. Genome-wide transposable element annotations were also improved, including larger unfragmented insertions. Moreover, MCHelper is an easy-to-install and easy-to-use tool.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Mobile DNA; 27 papers in training set; Top 0.1%; 22.8%
2. Genome Biology; 555 papers in training set; Top 0.6%; 8.5%
3. Nucleic Acids Research; 1128 papers in training set; Top 3%; 6.9%
4. Cell Genomics; 162 papers in training set; Top 0.5%; 6.5%
5. Nature Biotechnology; 147 papers in training set; Top 2%; 4.2%
6. Nature Communications; 4913 papers in training set; Top 37%; 4.0%
50% of probability mass above
7. Bioinformatics; 1061 papers in training set; Top 5%; 3.6%
8. Genome Research; 409 papers in training set; Top 1%; 2.5%
9. NAR Genomics and Bioinformatics; 214 papers in training set; Top 1%; 2.1%
10. Plant Communications; 35 papers in training set; Top 0.6%; 2.1%
11. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 3%; 1.9%
12. Scientific Reports; 3102 papers in training set; Top 53%; 1.9%
13. Bioinformatics Advances; 184 papers in training set; Top 3%; 1.8%
14. Molecular Plant; 36 papers in training set; Top 0.8%; 1.7%
15. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 4%; 1.7%
16. Communications Biology; 886 papers in training set; Top 12%; 1.3%
17. Frontiers in Genetics; 197 papers in training set; Top 7%; 1.2%
18. Frontiers in Plant Science; 240 papers in training set; Top 4%; 1.2%
19. BMC Genomics; 328 papers in training set; Top 3%; 1.2%
20. Science; 429 papers in training set; Top 17%; 1.1%
21. Cell Reports Methods; 141 papers in training set; Top 4%; 1.0%
22. GigaScience; 172 papers in training set; Top 2%; 0.9%
23. Gigabyte; 60 papers in training set; Top 1%; 0.9%
24. PLOS ONE; 4510 papers in training set; Top 66%; 0.8%
25. The Plant Journal; 197 papers in training set; Top 3%; 0.8%
26. Frontiers in Cell and Developmental Biology; 218 papers in training set; Top 9%; 0.8%
27. Plant Biotechnology Journal; 56 papers in training set; Top 1%; 0.8%
28. Nature Plants; 84 papers in training set; Top 2%; 0.8%
29. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 45%; 0.7%
30. Nature Genetics; 240 papers in training set; Top 8%; 0.7%
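
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only an illustration: the function name `journals_needed` and the choice to hard-code the first eight percentages are assumptions made here, not part of the page or of any tool it describes.

```python
from itertools import accumulate

# Predicted probabilities (in percent) of the eight highest-ranked journals,
# copied from the ranking above: Mobile DNA through Genome Research.
probabilities = [22.8, 8.5, 6.9, 6.5, 4.2, 4.0, 3.6, 2.5]


def journals_needed(probs, threshold=50.0):
    """Return how many top-ranked entries are needed for the running
    sum of probabilities to reach the threshold, or None if it never does.
    (Hypothetical helper, written for this illustration only.)"""
    for rank, running_total in enumerate(accumulate(probs), start=1):
        if running_total >= threshold:
            return rank
    return None


print(journals_needed(probabilities))  # prints 6
```

The running sum is about 48.9% after five journals and about 52.9% after six, which is consistent with the cutoff being placed after Nature Communications.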