
De novo protein discovery in non-model organisms

Ali, A.

2026-05-13 bioinformatics
10.64898/2026.05.08.723910 bioRxiv

We developed plant (Parallel Annotation of Transcriptomes), a de novo method that can potentially compare RNA-seq data of any two species without a reference genome. plant is conceptually similar to chromatography: in the same way that a complex mixture is separated to isolate its individual components, we applied a computational method to identify, annotate, and quantify components across transcriptomes. The comparison points are universal protein domain annotations rather than species-specific genes, as would be the case in a differential gene expression analysis. We examined several Selaginella species from the One Thousand Plant Transcriptomes Initiative (1KP), which has made RNA-seq data for various plant species publicly available. The raw reads were assembled with Trinity. The assembled transcripts were then searched against the Pfam protein domain database with InterProScan and, separately, quantified with kallisto. By merging the annotations with the abundance estimates, we were able to see how often a particular protein domain (a predicted protein structure) is expressed. These quantified protein domain annotations are comparable across species, assuming a relatively short evolutionary distance. We were also able to identify species-specific protein domains and trace each annotation back to its gene. A bubble plot was created to visualize the distribution of Pfam annotations and GO terms across species.
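The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how plant combines annotations with abundances, so everything here is an assumption: the function names are hypothetical, abundance is taken to be TPM, a domain's expression is taken to be the sum of the TPM of every transcript carrying it, and the transcript IDs and values are made up. The Pfam accessions serve only as labels.

```python
from collections import defaultdict


def domain_expression(annotations, tpm):
    """Sum transcript abundance per Pfam accession (assumed aggregation rule).

    annotations: iterable of (transcript_id, pfam_accession) pairs
    tpm: dict mapping transcript_id -> abundance (assumed to be TPM)
    """
    # Collapse repeated hits so a transcript counts once per domain.
    unique_pairs = set(annotations)
    totals = defaultdict(float)
    for transcript_id, pfam in unique_pairs:
        # Transcripts without an abundance estimate contribute nothing.
        totals[pfam] += tpm.get(transcript_id, 0.0)
    return dict(totals)


def species_specific(expr_a, expr_b):
    """Return (domains expressed only in A, domains expressed only in B)."""
    in_a = {domain for domain, value in expr_a.items() if value > 0}
    in_b = {domain for domain, value in expr_b.items() if value > 0}
    return in_a - in_b, in_b - in_a


# Hypothetical data for two species; IDs and abundances are illustrative only.
species_a = domain_expression(
    [("t1", "PF00069"), ("t1", "PF00069"), ("t2", "PF00069"), ("t2", "PF00400")],
    {"t1": 10.0, "t2": 5.0},
)
species_b = domain_expression(
    [("u1", "PF00069"), ("u2", "PF00076")],
    {"u1": 2.0, "u2": 4.0},
)
only_a, only_b = species_specific(species_a, species_b)
```

Because the comparison key is the domain accession rather than a gene identifier, the two dictionaries can be compared directly even though the transcript IDs of the two assemblies share nothing.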

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 0.5%, 14.5%
2. Bioinformatics: 1061 papers in training set, Top 2%, 14.1%
3. PLOS Computational Biology: 1633 papers in training set, Top 3%, 12.1%
4. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.3%, 6.2%
5. PLOS ONE: 4510 papers in training set, Top 40%, 3.5%

50% of probability mass above

6. GigaScience: 172 papers in training set, Top 0.6%, 3.5%
7. Plant Direct: 81 papers in training set, Top 1.0%, 2.0%
8. Nature Communications: 4913 papers in training set, Top 49%, 1.9%
9. Molecular Biology and Evolution: 488 papers in training set, Top 2%, 1.9%
10. eLife: 5422 papers in training set, Top 40%, 1.8%
11. Genome Biology: 555 papers in training set, Top 4%, 1.8%
12. The Plant Cell: 141 papers in training set, Top 1%, 1.7%
13. Communications Biology: 886 papers in training set, Top 9%, 1.7%
14. Scientific Reports: 3102 papers in training set, Top 59%, 1.7%
15. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.5%
16. Molecular Systems Biology: 142 papers in training set, Top 0.8%, 1.5%
17. iScience: 1063 papers in training set, Top 20%, 1.3%
18. Genome Research: 409 papers in training set, Top 3%, 1.2%
19. Nucleic Acids Research: 1128 papers in training set, Top 14%, 1.2%
20. Genetics: 225 papers in training set, Top 3%, 1.2%
21. Frontiers in Plant Science: 240 papers in training set, Top 4%, 1.2%
22. Cell Systems: 167 papers in training set, Top 11%, 0.9%
23. Nature Protocols: 30 papers in training set, Top 0.2%, 0.9%
24. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 9%, 0.8%
25. Cell Reports Methods: 141 papers in training set, Top 5%, 0.8%
26. The Plant Journal: 197 papers in training set, Top 3%, 0.7%
27. Journal of Genetics and Genomics: 36 papers in training set, Top 2%, 0.7%
28. PeerJ: 261 papers in training set, Top 16%, 0.7%
29. New Phytologist: 309 papers in training set, Top 5%, 0.7%
30. BMC Genomics: 328 papers in training set, Top 7%, 0.6%