Supervised and Unsupervised Classification of lncRNA Subtypes

Sen, R.; Fallmann, J.; Walter, M. E. M. T.; Stadler, P. F.

2020-08-17 bioinformatics
10.1101/2020.07.20.211433 bioRxiv
Many small nucleolar RNAs and many of the hairpin precursors of miRNAs are processed from long non-protein-coding (lncRNA) host genes. In contrast to their highly conserved and heavily structured payload, the host genes feature poorly conserved sequences. Nevertheless, there is mounting evidence that the host genes have biological functions. No obvious connections between the function of the host genes and the function of their payloads have been reported. Here we investigate whether there is an association of host gene function or mechanisms with the type of payload. To assess this hypothesis, we test whether the miRNA host genes (MIRHGs), snoRNA host genes (SNHGs), and other lncRNA host genes can be distinguished based on sequence and structure features. A positive answer would imply a correlation between host genes and their payload. While the three classes can be distinguished reliably when the classifier is allowed to extract features from the payloads, this is no longer the case when only the sequence and structure of parts of the host gene distal from the snoRNA or miRNA payload are used for classification. Our data indicate that the functions of MIRHGs and SNHGs are largely independent of the functions of their payloads. Furthermore, there is no evidence that the MIRHGs and SNHGs form coherent classes of long non-coding RNAs distinguished by features other than their payloads.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. RNA: 169 papers in training set, Top 0.1%, 26.7%
2. PLOS Computational Biology: 1633 papers in training set, Top 3%, 10.8%
3. PLOS ONE: 4510 papers in training set, Top 21%, 8.7%
4. Scientific Reports: 3102 papers in training set, Top 13%, 7.0%
50% of probability mass above
5. Frontiers in Genetics: 197 papers in training set, Top 0.6%, 7.0%
6. RNA Biology: 70 papers in training set, Top 0.1%, 6.5%
7. NAR Genomics and Bioinformatics: 214 papers in training set, Top 1.0%, 2.8%
8. Nucleic Acids Research: 1128 papers in training set, Top 7%, 2.7%
9. Journal of Molecular Biology: 217 papers in training set, Top 1%, 1.7%
10. PeerJ: 261 papers in training set, Top 7%, 1.7%
11. Nature Communications: 4913 papers in training set, Top 51%, 1.7%
12. iScience: 1063 papers in training set, Top 19%, 1.4%
13. Journal of Bioinformatics and Systems Biology: 14 papers in training set, Top 0.3%, 1.3%
14. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 7%, 1.0%
15. BMC Genomics: 328 papers in training set, Top 5%, 0.8%
16. Genes: 126 papers in training set, Top 3%, 0.8%
17. BMC Bioinformatics: 383 papers in training set, Top 7%, 0.8%
18. PLOS Genetics: 756 papers in training set, Top 14%, 0.8%
19. Genome Biology: 555 papers in training set, Top 7%, 0.7%
20. Bioinformatics: 1061 papers in training set, Top 10%, 0.7%
21. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 3%, 0.7%
22. Molecular Biology and Evolution: 488 papers in training set, Top 5%, 0.5%
23. Communications Biology: 886 papers in training set, Top 31%, 0.5%
24. Genomics: 60 papers in training set, Top 3%, 0.5%
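
The "top 4 journals" figure can be checked by summing the listed probabilities in rank order until the running total reaches 50%. The following is a minimal sketch of that arithmetic, not the page's own code; the function name `journals_to_reach` and the variable names are hypothetical, and only the first six entries from the list above are included.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the top-ranked journals, copied from the list above.
probabilities = [
    ("RNA", 26.7),
    ("PLOS Computational Biology", 10.8),
    ("PLOS ONE", 8.7),
    ("Scientific Reports", 7.0),
    ("Frontiers in Genetics", 7.0),
    ("RNA Biology", 6.5),
]


def journals_to_reach(ranked, threshold):
    """Return how many top-ranked entries are needed before the
    cumulative probability reaches `threshold`, or None if it never does."""
    running_totals = accumulate(p for _, p in ranked)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    return None


top_n = journals_to_reach(probabilities, 50.0)
print(top_n)
```

With the values listed, the running total is about 46.2% after three journals and about 53.2% after four, so the threshold is first crossed at the fourth entry.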