Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes

Insana, G.; Martin, M. J.; Pearson, W. R.

2026-05-19 bioinformatics
10.64898/2026.05.19.725897 bioRxiv

MMseqs2 clustering was used to examine the uniformity of proteomes from 20 bacterial species. Clusters with proteins from ≥50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of the proteins at the mode length. In contrast to this uniformity, some clusters contain dozens to hundreds of proteins that are considerably shorter (<75%) than the mode length, and a few clusters include proteins that are >133% of the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein, which were missed because of frame-shifts, termination codons, or initiation codon choice. As with "short-outlier" proteins, the ~5% of proteomes missing from the core (50% participation) cluster set encode the missing protein more than 98% of the time. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
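
The length and participation thresholds in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `(proteome_id, length)` input layout, and the tie-breaking rule for the mode are assumptions made for illustration. Only the three thresholds (<75% and >133% of the mode length, 50% participation) come from the abstract.

```python
from collections import Counter

# Thresholds taken from the abstract; everything else here is hypothetical.
SHORT_FRACTION = 0.75      # shorter than 75% of the mode length
LONG_FRACTION = 1.33       # longer than 133% of the mode length
CORE_PARTICIPATION = 0.50  # cluster has proteins from >= 50% of proteomes


def mode_length(lengths):
    """Return the most common length; ties go to the smallest (an assumption)."""
    counts = Counter(lengths)
    best = max(counts.values())
    return min(length for length, n in counts.items() if n == best)


def classify_cluster(members):
    """Split one cluster's members into short outliers, mode-like, and long outliers.

    members: list of (proteome_id, protein_length) tuples (hypothetical layout).
    """
    mode = mode_length([length for _, length in members])
    labels = {"short": [], "mode_like": [], "long": []}
    for proteome_id, length in members:
        ratio = length / mode
        if ratio < SHORT_FRACTION:
            labels["short"].append((proteome_id, length))
        elif ratio > LONG_FRACTION:
            labels["long"].append((proteome_id, length))
        else:
            labels["mode_like"].append((proteome_id, length))
    return mode, labels


def is_core(members, n_proteomes):
    """True if the cluster has a protein from at least 50% of the proteomes."""
    present = {proteome_id for proteome_id, _ in members}
    return len(present) / n_proteomes >= CORE_PARTICIPATION


if __name__ == "__main__":
    cluster = [("p1", 300), ("p2", 300), ("p3", 300), ("p4", 210), ("p5", 420)]
    mode, labels = classify_cluster(cluster)
    print(mode, labels)
    print(is_core(cluster, n_proteomes=8))
```

In this toy cluster the 210-residue protein is 70% of the 300-residue mode and is labelled a short outlier, while the 420-residue protein is 140% of the mode and is labelled a long outlier.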

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Predicted probability
1 | mSystems | 361 | 0.4% | 13.7%
2 | PLOS Computational Biology | 1633 | 2% | 13.7%
3 | Bioinformatics | 1061 | 3% | 8.7%
4 | Genome Biology | 555 | 0.7% | 8.0%
5 | Nature Communications | 4913 | 32% | 6.0%
50% of probability mass above
6 | Proceedings of the National Academy of Sciences | 2130 | 19% | 3.8%
7 | BMC Bioinformatics | 383 | 3% | 3.5%
8 | Genome Research | 409 | 1% | 3.4%
9 | Journal of Proteome Research | 215 | 0.9% | 2.9%
10 | mSphere | 281 | 2% | 2.9%
11 | Nucleic Acids Research | 1128 | 8% | 2.6%
12 | Cell Systems | 167 | 5% | 2.6%
13 | Journal of Molecular Biology | 217 | 1% | 2.3%
14 | Frontiers in Microbiology | 375 | 5% | 1.8%
15 | Molecular Systems Biology | 142 | 0.5% | 1.8%
16 | NAR Genomics and Bioinformatics | 214 | 2% | 1.6%
17 | BMC Genomics | 328 | 3% | 1.6%
18 | Scientific Reports | 3102 | 65% | 1.3%
19 | PLOS ONE | 4510 | 61% | 1.1%
20 | GigaScience | 172 | 2% | 0.9%
21 | mBio | 750 | 10% | 0.9%
22 | eLife | 5422 | 53% | 0.9%
23 | Microbial Genomics | 204 | 2% | 0.9%
24 | PeerJ | 261 | 12% | 0.9%
25 | Nature Microbiology | 133 | 4% | 0.8%
26 | PLOS Biology | 408 | 21% | 0.7%
27 | Microbiome | 139 | 4% | 0.6%
28 | Peer Community Journal | 254 | 5% | 0.6%