Computational pipeline reveals nature's untapped reservoir of halogenating enzymes

Szenei, J.; Burke, A.; Liong, A.; Korenskaia, A.; Lukowski, A. L.; Ziemert, N.; Nikel, P. I.; Leao, P. N.; Moore, B. S.; Weber, T.; Blin, K.

2026-01-22 bioinformatics
10.64898/2026.01.20.700248 bioRxiv

Microbial halogenated natural products (hNPs) hold ecological, agricultural, and biomedical relevance. The hNP-producing potential of an organism can be assessed by precise prediction of its biosynthetic enzymes, yet detailed annotations of halogenases are often missing from genomic and metagenomic data. We created a manually curated database (https://halogenases.secondarymetabolites.org/) containing information on the halide specificity, the role and position of verified catalytic residues, and the results of mutagenesis studies for more than 120 experimentally validated or in silico inferred halogenases. This collection of experimental data supports a computational pipeline that annotates halogenating enzymes at the family, substrate, and halide-scope level by relying on catalytic residues, conserved motifs, and profile Hidden Markov Models (pHMMs). Our analysis with sequence similarity networks (SSNs) highlighted several underexplored clusters in the UniRef50 database. One such finding was a halogenase from Rhodopirellula baltica (RhobaVHPO), previously labelled as a hypothetical chloroperoxidase, which clustered apart from the known chloroperoxidases and bromoperoxidases but accepted chloride and preferred bromide. Our database and workflow provide extensive and scalable solutions for the systematic and precise annotation of halogenating enzymes in genomic and metagenomic data. The in-depth categorization of halogenases will improve the chemical structure prediction of microbial hNPs, supporting ecological assessments and natural product discovery.
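To make the idea of motif-based annotation concrete, here is a minimal Python sketch. It is not the authors' pipeline: it only scans a protein sequence for two motifs commonly reported for flavin-dependent halogenases (GxGxxG for FAD binding, and WxWxIP). The function names, the choice of motifs, and the toy sequence are assumptions made for illustration; a real workflow would combine such checks with pHMM hits and curated catalytic residues.

```python
import re

# Assumption: motifs commonly reported for flavin-dependent halogenases.
# They are NOT taken from the authors' database or pipeline.
FDH_MOTIFS = {
    "GxGxxG": r"G.G..G",   # FAD-binding motif
    "WxWxIP": r"W.W.IP",   # motif typical of flavin-dependent halogenases
}


def scan_motifs(sequence):
    """Return {motif_name: [0-based start positions]} for motifs found."""
    seq = sequence.upper()
    hits = {}
    for name, pattern in FDH_MOTIFS.items():
        # A lookahead reports overlapping occurrences as well.
        positions = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
        if positions:
            hits[name] = positions
    return hits


def looks_like_fdh(sequence):
    """Crude flag: both motifs present, with GxGxxG upstream of WxWxIP."""
    hits = scan_motifs(sequence)
    if "GxGxxG" not in hits or "WxWxIP" not in hits:
        return False
    return min(hits["GxGxxG"]) < max(hits["WxWxIP"])


# Made-up toy sequence, used only to exercise the functions.
toy = "MSKIVGAGTAGWMAAAAAAWSWFIPLLL"
print(scan_motifs(toy))
print(looks_like_fdh(toy))
```

A motif scan like this is cheap but coarse; it can only suggest a family-level candidate and says nothing about substrate or halide scope.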

Matching journals

The top 13 journals account for 50% of the predicted probability mass.

1. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.3% | 7.4%
2. Nature Communications | 4913 papers in training set | Top 27% | 6.6%
3. Environmental Science & Technology | 64 papers in training set | Top 0.4% | 6.5%
4. Journal of Hazardous Materials | 19 papers in training set | Top 0.1% | 5.0%
5. RSC Advances | 18 papers in training set | Top 0.1% | 4.4%
6. Water Research | 74 papers in training set | Top 0.5% | 3.7%
7. PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.7%
8. Environmental Science & Technology Letters | 22 papers in training set | Top 0.1% | 2.8%
9. Frontiers in Microbiology | 375 papers in training set | Top 3% | 2.8%
10. Journal of Chemical Information and Modeling | 207 papers in training set | Top 1% | 2.7%
11. eLife | 5422 papers in training set | Top 34% | 2.2%
12. PLOS ONE | 4510 papers in training set | Top 47% | 2.1%
13. Science of The Total Environment | 179 papers in training set | Top 3% | 1.9%
50% of probability mass above
14. Scientific Reports | 3102 papers in training set | Top 56% | 1.7%
15. mSystems | 361 papers in training set | Top 5% | 1.7%
16. Metabolites | 50 papers in training set | Top 0.4% | 1.7%
17. iScience | 1063 papers in training set | Top 17% | 1.5%
18. Environmental Research | 46 papers in training set | Top 1.0% | 1.4%
19. Bioinformatics | 1061 papers in training set | Top 8% | 1.3%
20. Scientific Data | 174 papers in training set | Top 2% | 1.1%
21. mBio | 750 papers in training set | Top 9% | 1.0%
22. Nucleic Acids Research | 1128 papers in training set | Top 15% | 0.9%
23. mSphere | 281 papers in training set | Top 5% | 0.8%
24. Journal of Structural Biology | 58 papers in training set | Top 1% | 0.8%
25. Protein Science | 221 papers in training set | Top 2% | 0.8%
26. PeerJ | 261 papers in training set | Top 14% | 0.8%
27. ACS Synthetic Biology | 256 papers in training set | Top 3% | 0.8%
28. Chemical Communications | 24 papers in training set | Top 1% | 0.8%
29. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 43% | 0.8%
30. Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
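
The statement that the top 13 journals account for 50% of the predicted probability mass can be checked by cumulative summation of the listed probabilities. The sketch below is a plain arithmetic check, not part of the prediction model; the helper name `rank_reaching` is hypothetical, and because the listed percentages are rounded to one decimal, the totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-30, copied from the list above.
probs = [7.4, 6.6, 6.5, 5.0, 4.4, 3.7, 3.7, 2.8, 2.8, 2.7, 2.2, 2.1, 1.9,
         1.7, 1.7, 1.7, 1.5, 1.4, 1.3, 1.1, 1.0, 0.9, 0.8, 0.8, 0.8, 0.8,
         0.8, 0.8, 0.8, 0.7]


def rank_reaching(values, threshold):
    """Return (rank, cumulative total) at the first rank whose running
    sum reaches the threshold, or None if it is never reached."""
    for rank, total in enumerate(accumulate(values), start=1):
        if total >= threshold:
            return rank, total
    return None


rank, total = rank_reaching(probs, 50.0)
print(rank, round(total, 1))  # prints: 13 51.8
```

With the rounded values, ranks 1 to 12 sum to about 49.9%, so rank 13 is the first to cross the 50% line, which agrees with the marker in the list.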