Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank

Smieja, P.; Zadrozna, M.; Syed, K.; Nelson, D.; Gront, D.

2026-03-19 bioinformatics
10.64898/2026.03.17.712328 bioRxiv

Cytochrome P450 monooxygenases (CYPs/P450s) form a highly diverse enzyme superfamily central to biotechnology, pharmacology, and environmental science. Despite the large number of available structures, identifying and comparing P450 entries in structural repositories remains challenging due to their extreme sequence divergence and inconsistent annotation practices. In particular, many deposits lack standardized nomenclature (CYPid) and instead rely on legacy or author-defined common names (such as P450cam, P450BM-3, and P450-PCN1), which are often inconsistent in formatting and specificity. For a superfamily as diverse in sequence as the P450s, this hinders reliable retrieval and cross-referencing, making even the identification of all P450 structures in the database nontrivial. To overcome these obstacles, we developed a structure-guided discovery and validation workflow combining keyword search, Hidden Markov Models, and structural alignment, enabling robust detection and annotation. This strategy identified 1,513 deposits representing 674 unique sequences. All sequences were reannotated using the P450Atlas server and manually verified, confirming high assignment accuracy. In the process, we also identified five new CYP subfamilies. The resulting dataset constitutes the first rigorously curated, structure-linked registry of P450 enzymes, integrated into a publicly accessible resource and supported by an automated pipeline that periodically scans newly released entries. By unifying structurally validated identification with standardized CYP nomenclature, this work establishes a reliable framework for accurate retrieval, comparison, and future large-scale analyses of P450 enzymes.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Nature Communications; 4913 papers in training set; Top 7%; 17.8%
2. Bioinformatics; 1061 papers in training set; Top 4%; 6.5%
3. Scientific Reports; 3102 papers in training set; Top 16%; 6.5%
4. Scientific Data; 174 papers in training set; Top 0.3%; 4.9%
5. PLOS ONE; 4510 papers in training set; Top 36%; 4.0%
6. Journal of Chemical Information and Modeling; 207 papers in training set; Top 1%; 4.0%
7. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 1%; 3.9%
8. Communications Biology; 886 papers in training set; Top 2%; 3.6%

50% of probability mass above

9. Protein Science; 221 papers in training set; Top 0.6%; 2.5%
10. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 3%; 1.9%
11. BMC Bioinformatics; 383 papers in training set; Top 4%; 1.8%
12. Metabolites; 50 papers in training set; Top 0.4%; 1.8%
13. NAR Genomics and Bioinformatics; 214 papers in training set; Top 2%; 1.7%
14. Nucleic Acids Research; 1128 papers in training set; Top 12%; 1.5%
15. ACS Omega; 90 papers in training set; Top 2%; 1.3%
16. International Journal of Molecular Sciences; 453 papers in training set; Top 9%; 1.3%
17. Genome Medicine; 154 papers in training set; Top 6%; 1.2%
18. Molecules; 37 papers in training set; Top 1%; 1.1%
19. PLOS Computational Biology; 1633 papers in training set; Top 20%; 1.1%
20. Journal of Molecular Biology; 217 papers in training set; Top 3%; 1.0%
21. ACS Synthetic Biology; 256 papers in training set; Top 2%; 1.0%
22. Advanced Science; 249 papers in training set; Top 16%; 0.9%
23. Database; 51 papers in training set; Top 0.7%; 0.9%
24. Acta Pharmaceutica Sinica B; 11 papers in training set; Top 0.8%; 0.8%
25. Clinical and Translational Science; 21 papers in training set; Top 1%; 0.8%
26. Toxicological Sciences; 38 papers in training set; Top 0.6%; 0.8%
27. RSC Advances; 18 papers in training set; Top 1%; 0.8%
28. Redox Biology; 64 papers in training set; Top 1%; 0.7%
29. RSC Chemical Biology; 32 papers in training set; Top 0.6%; 0.7%
30. Analytical Chemistry; 205 papers in training set; Top 3%; 0.7%
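
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a running total over the listed percentages. The following is a minimal sketch; the function name `journals_to_reach` is hypothetical, and the values are the first ten probabilities copied from the ranking above.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-10, copied from the ranking above.
probabilities = [17.8, 6.5, 6.5, 4.9, 4.0, 4.0, 3.9, 3.6, 2.5, 1.9]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached.

    Hypothetical helper for illustration; not part of the source page.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(probabilities, 50.0))  # prints 8
```

The running total is about 47.6% after seven journals and about 51.2% after eight, which matches the position of the 50% marker in the list.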