Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes

Insana, G.; Martin, M. J.; Pearson, W. R.

2026-05-19 bioinformatics

10.64898/2026.05.19.725897 bioRxiv


MMseqs2 clustering was used to examine the uniformity of proteomes from 20 bacterial species. Clusters with proteins from ≥50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of its proteins at the mode length. In contrast to this uniformity, some clusters contain dozens to hundreds of proteins that are considerably shorter (<75%) than the mode length, and a few clusters include proteins that are >133% of the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein, which were missed because of frame-shifts, termination codons, or initiation codon choice. As with "short-outlier" proteins, the ~5% of proteomes missing from the core (50% participation) cluster set encode the missing protein more than 98% of the time. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
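The length and participation criteria in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `classify_lengths` and `is_core`, the tie-breaking rule for the mode, and the example data are assumptions made for illustration. Only the thresholds (<75% and >133% of the mode length, 50% participation) come from the abstract.

```python
from collections import Counter

def classify_lengths(lengths, short_frac=0.75, long_frac=1.33):
    """Label each protein length in one cluster relative to the cluster's mode length.

    Thresholds follow the abstract (<75% and >133% of the mode length).
    Tie-breaking (smallest of the most common lengths) is an assumption.
    """
    counts = Counter(lengths)
    top = max(counts.values())
    mode = min(length for length, c in counts.items() if c == top)
    labels = []
    for n in lengths:
        if n < short_frac * mode:
            labels.append("short")
        elif n > long_frac * mode:
            labels.append("long")
        else:
            labels.append("typical")
    return mode, labels

def is_core(cluster_proteomes, n_proteomes, min_participation=0.5):
    """True if the cluster has proteins from at least min_participation of proteomes."""
    return len(set(cluster_proteomes)) / n_proteomes >= min_participation

# Hypothetical cluster: eight proteins at the mode length plus three others.
mode, labels = classify_lengths([300] * 8 + [200, 410, 299])
```

In this hypothetical cluster the mode length is 300, so the 200-residue protein falls below the 75% cutoff (225) and the 410-residue protein exceeds the 133% cutoff (399), while the 299-residue protein counts as typical.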

