WasteFams: A database of protein families from global wastewater microbiomes

Galaras, A.; Chasapi, I. N.; Aplakidou, E.; Chasapi, M. N.; Lamari, E.; Diplari, S.; Georgakopoulos-Soares, I.; Karatzas, E.; Baltoumas, F. A.; Kyrpides, N.; Pavlopoulos, G.

2026-05-12 bioinformatics
10.64898/2026.05.08.723720 bioRxiv

Wastewater surveillance has emerged as a critical tool for global epidemiology, yet the functional diversity of wastewater microbiomes remains poorly characterized at the protein level. Here, we present WasteFams, the first comprehensive database dedicated to the systematic exploration of protein families in wastewater metagenomic and metatranscriptomic studies worldwide. Integrating data from 580 metagenomes, 132 metatranscriptomes, and 1,709 reference genomes, WasteFams catalogs 3,887 non-redundant protein families (containing ≥100 members) derived from over 105 million predicted proteins. Each protein family is enriched with multi-layered annotations, including AlphaFold3 structural predictions, taxonomic classifications, and biome-specific metadata. To further expand their functional annotation, we integrated deep genomic context analysis to link protein families to Mobile Genetic Elements (MGEs), Biosynthetic Gene Clusters (BGCs), Antibiotic Resistance Genes (ARGs), and CRISPR elements. Accessible through the EnvoFams portal, WasteFams provides a user-friendly interface featuring advanced search capabilities, sequence and structural similarity tools, and interactive visualization modules. As global initiatives increasingly leverage wastewater for public health and environmental insights, WasteFams can serve as a critical resource for discovering novel microbial functions, monitoring resistance mechanisms, and exploring the biotechnological potential of secondary metabolites within wastewater-engineered ecosystems.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank. Journal, papers in training set, percentile, predicted probability

1. Nature Communications, 4913 papers in training set, Top 2%, 23.0%
2. Microbiome, 139 papers in training set, Top 0.1%, 19.0%
3. Nature Biotechnology, 147 papers in training set, Top 1%, 6.5%
4. mSystems, 361 papers in training set, Top 2%, 5.0%

50% of probability mass above

5. Nucleic Acids Research, 1128 papers in training set, Top 4%, 4.4%
6. Genome Biology, 555 papers in training set, Top 2%, 4.4%
7. Genome Medicine, 154 papers in training set, Top 3%, 2.8%
8. Advanced Science, 249 papers in training set, Top 7%, 2.7%
9. ISME Communications, 103 papers in training set, Top 1%, 1.8%
10. Water Research, 74 papers in training set, Top 0.9%, 1.7%
11. Cell Systems, 167 papers in training set, Top 7%, 1.7%
12. Nature, 575 papers in training set, Top 12%, 1.5%
13. PLOS ONE, 4510 papers in training set, Top 57%, 1.4%
14. Cell Reports Methods, 141 papers in training set, Top 3%, 1.4%
15. Environmental Science & Technology Letters, 22 papers in training set, Top 0.2%, 1.3%
16. Nature Microbiology, 133 papers in training set, Top 4%, 1.0%
17. Bioinformatics, 1061 papers in training set, Top 8%, 1.0%
18. PLOS Computational Biology, 1633 papers in training set, Top 22%, 0.9%
19. mBio, 750 papers in training set, Top 10%, 0.8%
20. Briefings in Bioinformatics, 326 papers in training set, Top 6%, 0.8%
21. Nature Methods, 336 papers in training set, Top 6%, 0.7%
22. Computational and Structural Biotechnology Journal, 216 papers in training set, Top 10%, 0.7%
23. Scientific Data, 174 papers in training set, Top 3%, 0.5%
24. Science of The Total Environment, 179 papers in training set, Top 6%, 0.5%
25. npj Biofilms and Microbiomes, 56 papers in training set, Top 2%, 0.5%
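
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a short calculation. The sketch below is illustrative only: the function name `journals_to_reach` is hypothetical, and the input is simply the first six probabilities copied from the list above, not output from the site's prediction model.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the six top-ranked journals,
# copied by hand from the list above (an assumption of this sketch).
probabilities = [23.0, 19.0, 6.5, 5.0, 4.4, 4.4]


def journals_to_reach(mass, probs):
    """Return (k, total): the smallest top-k whose summed probability is >= mass.

    Returns None if the listed probabilities never reach `mass`.
    """
    for k, running_total in enumerate(accumulate(probs), start=1):
        if running_total >= mass:
            return k, running_total
    return None


k, total = journals_to_reach(50.0, probabilities)
print(k, total)  # prints: 4 53.5
```

The running total is 48.5% after three journals and 53.5% after four, so four is the smallest number of journals whose combined probability reaches the 50% mark.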