Deciphering Bacterial and Archaeal Transcriptional Dark Matter and Its Architectural Complexity

Mattick, J. S. A.; Bromley, R. E.; Watson, K. J.; Adkins, R. S.; Holt, C. I.; Lebov, J. F.; Sparklin, B. C.; Tyson, T. S.; Rasko, D. A.; Hotopp, J. C. D.

2024-04-03 genomics
10.1101/2024.04.02.587803 bioRxiv
Transcripts are potential therapeutic targets, yet bacterial transcripts remain biological dark matter with uncharacterized biodiversity. We developed and applied an algorithm to predict transcripts for Escherichia coli K12 and E2348/69 strains (Bacteria:gamma-Proteobacteria) with newly generated ONT direct RNA sequencing data while predicting transcripts for Listeria monocytogenes strains Scott A and RO15 (Bacteria:Firmicutes), Pseudomonas aeruginosa strains SG17M and NN2 (Bacteria:gamma-Proteobacteria), and Haloferax volcanii (Archaea:Halobacteria) using publicly available data. From >5 million E. coli K12 ONT direct RNA sequencing reads, 2,484 mRNAs are predicted and contain more than half of the predicted E. coli proteins. While the number of predicted transcripts varied by strain based on the amount of sequence data used for the predictions, across all strains examined, the average size of the predicted mRNAs is 1.6-1.7 kbp while the median sizes of the predicted bacterial 5′ and 3′ UTRs are 30-90 bp. Given the lack of bacterial and archaeal transcript annotation, most predictions are of novel transcripts, but we also predicted many previously characterized mRNAs and ncRNAs, including post-transcriptionally generated transcripts and small RNAs associated with pathogenesis in the E. coli E2348/69 LEE pathogenicity islands. We predicted small transcripts in the 100-200 bp range as well as >10 kbp transcripts for all strains; the longest transcript was the nuo operon transcript in two of the seven strains and a phage/prophage transcript in another two. This quick, easy, inexpensive, and reproducible method will facilitate the presentation of operons, transcripts, and UTR predictions alongside CDS and protein predictions in bacterial genome annotation as important resources for the research community.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. BMC Genomics: 328 papers in training set; Top 0.1%; 22.7%
2. NAR Genomics and Bioinformatics: 214 papers in training set; Top 0.1%; 10.5%
3. Genome Biology: 555 papers in training set; Top 0.9%; 6.9%
4. Microbial Genomics: 204 papers in training set; Top 0.3%; 6.9%
5. Nucleic Acids Research: 1128 papers in training set; Top 3%; 6.4%

50% of probability mass above

6. mSystems: 361 papers in training set; Top 3%; 3.6%
7. PLOS Computational Biology: 1633 papers in training set; Top 12%; 2.8%
8. Frontiers in Genetics: 197 papers in training set; Top 3%; 2.1%
9. GigaScience: 172 papers in training set; Top 0.9%; 2.1%
10. BMC Bioinformatics: 383 papers in training set; Top 4%; 1.7%
11. Genome Medicine: 154 papers in training set; Top 4%; 1.7%
12. Microbiome: 139 papers in training set; Top 2%; 1.7%
13. Nature Biotechnology: 147 papers in training set; Top 4%; 1.7%
14. Bioinformatics: 1061 papers in training set; Top 7%; 1.7%
15. Genomics: 60 papers in training set; Top 1.0%; 1.7%
16. PLOS ONE: 4510 papers in training set; Top 56%; 1.5%
17. Genome Research: 409 papers in training set; Top 3%; 1.3%
18. Scientific Reports: 3102 papers in training set; Top 66%; 1.2%
19. Cell Genomics: 162 papers in training set; Top 5%; 0.9%
20. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 7%; 0.9%
21. Nature Communications: 4913 papers in training set; Top 59%; 0.9%
22. RNA Biology: 70 papers in training set; Top 0.5%; 0.8%
23. Frontiers in Microbiology: 375 papers in training set; Top 8%; 0.8%
24. Microbiology Spectrum: 435 papers in training set; Top 5%; 0.8%
25. Frontiers in Cellular and Infection Microbiology: 98 papers in training set; Top 6%; 0.8%
26. iScience: 1063 papers in training set; Top 31%; 0.8%
27. Cell Reports Methods: 141 papers in training set; Top 5%; 0.8%
28. Journal of Genetics and Genomics: 36 papers in training set; Top 2%; 0.7%
29. Database: 51 papers in training set; Top 1%; 0.6%
30. ISME Communications: 103 papers in training set; Top 2%; 0.6%
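
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a running total. The sketch below assumes that the final percentage on each entry in the list above is that journal's predicted probability; the page does not label the column, so this is an interpretation. The function name `journals_to_reach` is hypothetical and used only for illustration.

```python
# Final percentage listed for each of the five top-ranked journals,
# copied from the list above (assumed to be the predicted probability).
top_probs = [
    ("BMC Genomics", 22.7),
    ("NAR Genomics and Bioinformatics", 10.5),
    ("Genome Biology", 6.9),
    ("Microbial Genomics", 6.9),
    ("Nucleic Acids Research", 6.4),
]


def journals_to_reach(probs, threshold):
    """Return how many entries, taken in rank order, are needed for the
    running total to reach `threshold` (in percent), or None if the
    total never reaches it."""
    running = 0.0
    for count, (_, prob) in enumerate(probs, start=1):
        running += prob
        if running >= threshold:
            return count
    return None


if __name__ == "__main__":
    total = sum(prob for _, prob in top_probs)
    print(f"Combined share of the top 5: {total:.1f}%")
    print("Journals needed to reach 50%:", journals_to_reach(top_probs, 50.0))
```

Under that assumption, the first four entries sum to about 47.0% and the fifth brings the total to about 53.4%, which is consistent with the 50% marker being placed after the fifth journal.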