Choosing representative proteins based on splicing structure similarity improves the accuracy of gene tree reconstruction

Kuitche Kamela, E.; Degen, M.; Wang, S.; Ouangraoua, A.

2020-04-10 bioinformatics
10.1101/2020.04.09.034785 bioRxiv
Constructing accurate gene trees is important because gene trees play a key role in several biological studies, such as species tree reconstruction, gene functional analysis, and gene family evolution studies. The accuracy of these studies depends on the accuracy of the input gene trees. Although several methods have been developed to improve the construction and correction of gene trees by using the relationship with a species tree in addition to a multiple sequence alignment, there is still considerable room for improvement in both the accuracy of gene trees and the computing time. In particular, accounting for alternative splicing, which allows eukaryotic genes to produce multiple transcripts/proteins per gene, is a way to improve the quality of the multiple sequence alignments used by gene tree reconstruction methods. Current methods for gene tree reconstruction usually use a set of transcripts composed of one representative transcript per gene to generate multiple sequence alignments, which are then used to estimate gene trees. Thus, the accuracy of the estimated gene tree depends on the choice of the representative transcripts. In this work, we present an alternative-splicing-aware method, called the Splicing Homology Transcript (SHT) method, that estimates gene trees by carefully selecting an accurate set of homologous transcripts to represent the genes of a gene family. We introduce a new similarity measure that quantifies the level of homology between transcripts by combining a splicing structure-based similarity score with a sequence-based similarity score. We present a new method to cluster transcripts into a set of splicing homology groups based on this similarity measure. The method is applied to reconstruct gene trees of the Ensembl database gene families, and the results are compared with the current EnsemblCompara gene trees. The results show that the new approach improves gene tree accuracy thanks to the use of the new similarity measure between transcripts. An implementation of the method, as well as the data used and generated in this work, is available at https://github.com/UdeS-CoBIUS/SplicingHomologGeneTree/.
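
The idea of combining a splicing structure-based score with a sequence-based score, and then grouping transcripts by that combined score, can be illustrated with a minimal sketch. The abstract does not give the actual formula or clustering procedure, so everything here is an assumption made for illustration: the weighted average with weight `alpha`, the threshold-based single-linkage grouping, the function names, and the toy scores. It should not be read as the SHT implementation.

```python
from itertools import combinations


def combined_similarity(structure_score, sequence_score, alpha=0.5):
    """Mix two scores in [0, 1] with a weighted average.

    ASSUMPTION: the weighted average and `alpha` are illustrative only;
    the abstract does not state how the two scores are combined.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * structure_score + (1.0 - alpha) * sequence_score


def cluster_transcripts(transcripts, similarity, threshold):
    """Group transcripts whose pairwise similarity reaches `threshold`.

    ASSUMPTION: single-linkage grouping via union-find is a stand-in for
    the paper's clustering method, which the abstract does not describe.
    """
    parent = {t: t for t in transcripts}

    def find(t):
        # Follow parent links to the group root, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(transcripts, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in transcripts:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical (structure score, sequence score) pairs for three transcripts.
toy_scores = {
    frozenset(("g1_t1", "g2_t1")): (0.9, 0.8),
    frozenset(("g1_t1", "g2_t2")): (0.2, 0.6),
    frozenset(("g2_t1", "g2_t2")): (0.3, 0.7),
}


def toy_similarity(a, b):
    structure_score, sequence_score = toy_scores[frozenset((a, b))]
    return combined_similarity(structure_score, sequence_score)


groups = cluster_transcripts(["g1_t1", "g2_t1", "g2_t2"], toy_similarity, threshold=0.7)
print(groups)
```

With these toy scores, only the pair `g1_t1`/`g2_t1` reaches the threshold, so those two transcripts share a group and `g2_t2` stays alone. The repository linked in the abstract is the place to look for the method's real scoring and clustering.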

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set; Top 0.1%; 42.4%
2. Bioinformatics: 1061 papers in training set; Top 3%; 10.8%
(50% of probability mass above)
3. PLOS ONE: 4510 papers in training set; Top 25%; 6.8%
4. Journal of Computational Biology: 37 papers in training set; Top 0.1%; 3.8%
5. PLOS Computational Biology: 1633 papers in training set; Top 9%; 3.8%
6. Scientific Reports: 3102 papers in training set; Top 52%; 2.0%
7. Briefings in Bioinformatics: 326 papers in training set; Top 3%; 2.0%
8. BMC Genomics: 328 papers in training set; Top 2%; 1.8%
9. Bioinformatics Advances: 184 papers in training set; Top 3%; 1.4%
10. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set; Top 0.3%; 1.4%
11. NAR Genomics and Bioinformatics: 214 papers in training set; Top 3%; 1.0%
12. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 5%; 1.0%
13. Neuroinformatics: 40 papers in training set; Top 0.8%; 0.8%
14. Computational Biology and Chemistry: 23 papers in training set; Top 0.4%; 0.8%
15. Frontiers in Bioinformatics: 45 papers in training set; Top 0.7%; 0.8%
16. F1000Research: 79 papers in training set; Top 4%; 0.8%
17. Frontiers in Genetics: 197 papers in training set; Top 9%; 0.8%
18. IEEE Access: 31 papers in training set; Top 0.9%; 0.8%
19. Journal of Bioinformatics and Systems Biology: 14 papers in training set; Top 0.6%; 0.8%
20. GigaScience: 172 papers in training set; Top 3%; 0.8%
21. Gigabyte: 60 papers in training set; Top 1%; 0.8%
22. Nucleic Acids Research: 1128 papers in training set; Top 17%; 0.8%
23. Methods: 29 papers in training set; Top 0.7%; 0.7%
24. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 10%; 0.7%
25. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 3%; 0.5%
26. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.9%; 0.5%
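
The statement that the top 2 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the ranked probabilities listed above. This is a minimal sketch: the function name `journals_to_reach` is hypothetical, and only the first five entries of the list are copied in, which is enough to reach the 50% target.

```python
# Predicted probabilities (in percent) for the five top-ranked journals,
# copied from the ranked list above.
ranked = [
    ("BMC Bioinformatics", 42.4),
    ("Bioinformatics", 10.8),
    ("PLOS ONE", 6.8),
    ("Journal of Computational Biology", 3.8),
    ("PLOS Computational Biology", 3.8),
]


def journals_to_reach(ranked_probabilities, target):
    """Return (count, cumulative mass) once the running total reaches target.

    Returns None if the listed probabilities never reach the target.
    """
    total = 0.0
    for count, (_, probability) in enumerate(ranked_probabilities, start=1):
        total += probability
        if total >= target:
            return count, total
    return None


count, mass = journals_to_reach(ranked, 50.0)
print(f"{count} journals reach {mass:.1f}% of the probability mass")
```

The first journal alone contributes 42.4%, which is below the target; adding the second brings the running total to 53.2%, so two journals are needed.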