Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank
Smieja, P.; Zadrozna, M.; Syed, K.; Nelson, D.; Gront, D.
Cytochrome P450 monooxygenases (CYPs/P450s) form a highly diverse enzyme superfamily central to biotechnology, pharmacology, and environmental science. Despite the large number of available structures, identifying and comparing P450 entries in structural repositories remains challenging because of their extreme sequence divergence and inconsistent annotation practices. In particular, many deposits lack the standardized nomenclature (CYPid) and instead rely on legacy or author-defined common names (such as P450cam, P450BM-3, and P450-PCN1), which are often inconsistent in formatting and specificity. For a superfamily as sequence-diverse as the P450s, this hinders reliable retrieval and cross-referencing, making even the identification of all P450 structures in the database nontrivial. To overcome these obstacles, we developed a structure-guided discovery and validation workflow combining keyword search, Hidden Markov Models, and structural alignment, enabling robust detection and annotation. This strategy identified 1,513 deposits representing 674 unique sequences. All sequences were reannotated using the P450Atlas server and manually verified, confirming high assignment accuracy. In the process, we also identified five new CYP subfamilies. The resulting dataset constitutes the first rigorously curated, structure-linked registry of P450 enzymes, integrated into a publicly accessible resource and supported by an automated pipeline that periodically scans newly released entries. By unifying structurally validated identification with standardized CYP nomenclature, this work establishes a reliable framework for accurate retrieval, comparison, and future large-scale analyses of P450 enzymes.
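The abstract does not say how the three detection methods are combined or how deposits are collapsed into unique sequences, so the following Python sketch is only one plausible reading. It assumes each method yields a set of candidate PDB IDs, takes their union while recording which methods flagged each entry, and then groups deposits that share an identical sequence. The function names, the toy IDs, and the toy sequences are all hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def merge_candidates(keyword_hits, hmm_hits, structure_hits):
    """Union three candidate sets and record which methods flagged each ID.

    Assumption: each argument is an iterable of PDB IDs produced by one
    detection method (keyword search, HMM scan, structural alignment).
    """
    evidence = defaultdict(set)
    methods = (
        ("keyword", keyword_hits),
        ("hmm", hmm_hits),
        ("structure", structure_hits),
    )
    for method, hits in methods:
        for pdb_id in hits:
            # PDB IDs are case-insensitive, so normalize before merging.
            evidence[pdb_id.upper()].add(method)
    return dict(evidence)


def group_by_sequence(sequences):
    """Map each distinct sequence to the sorted list of deposits carrying it.

    Assumption: `sequences` maps a deposit ID to a single chain sequence;
    real entries may hold several chains and would need a finer key.
    """
    groups = defaultdict(list)
    for pdb_id, seq in sequences.items():
        groups[seq].append(pdb_id)
    return {seq: sorted(ids) for seq, ids in groups.items()}


# Toy data only: these IDs and sequences are placeholders, not real deposits.
evidence = merge_candidates({"t001", "t002"}, {"T002", "T003"}, {"T002"})
groups = group_by_sequence({"T001": "MKV", "T002": "MKV", "T003": "MAA"})
print(len(evidence), "candidate deposits,", len(groups), "unique sequences")
```

Keeping the per-method evidence, rather than only the merged set, makes it easy to flag entries found by a single method for the kind of manual verification the abstract mentions.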
Matching journals
The top 8 journals account for 50% of the predicted probability mass.