Fine-Tuning Protein Language Models Enhances the Identification and Interpretation of Transcription Factors
Hassan, M. T.; Gaffar, S.; Zahid, H.; Lee, S. J.
Transcription factors (TFs) are pivotal regulators of gene expression and play essential roles in diverse cellular activities. The three-dimensional organization of the genome and transcriptional regulation are predominantly orchestrated by TFs. By recruiting the transcriptional machinery to gene enhancers or promoters, TFs can either activate or repress transcription, thereby controlling gene activity and various biological pathways. Accurate identification of TFs is vital for elucidating gene regulatory mechanisms within cells. However, experimental identification remains labor-intensive and time-consuming, highlighting the necessity for efficient computational approaches. In this study, we present a two-layer predictive framework utilizing protein language models (pLMs) via full fine-tuning and parameter-efficient fine-tuning. The initial layer robustly classifies and identifies transcription factors, while the subsequent layer predicts TFs with a binding preference for methylated DNA (TFPMs). Our approach further incorporates attention weights and protein sequence motifs to enhance interpretability and predictive capability. By leveraging attention mechanisms, we highlight biologically relevant regions of the protein sequences that contribute most strongly to the predictions. Additionally, motif analysis facilitates the identification of conserved sequence patterns that are critical for TF recognition and function. Across both TF and TFPM classification tasks, the inclusion of these features allowed our methods to consistently surpass contemporary models, as demonstrated by independent test results.

Keypoints
- Developed a two-layer predictive framework using protein language models (pLMs) with both full fine-tuning and parameter-efficient fine-tuning methods.
- The first layer accurately identifies transcription factors (TFs), and the second layer predicts TFs with binding preference for methylated DNA (TFPMs).
- Integrated attention weights and protein sequence motifs to enhance model interpretability by highlighting biologically relevant sequence regions and conserved patterns.
- Achieved superior performance compared to state-of-the-art methods, validated by independent testing.

Mir Tanveerul Hassan obtained his M.Tech. in Computer Science from the University of Kashmir, India, in 2020, and later earned his Ph.D. in Electronics and Information Engineering from Jeonbuk National University, Jeonju, South Korea. He is currently serving as a postdoctoral fellow at the Jeonbuk RICE Intelligence Innovation Research Center. His research interests encompass computational biology, bioinformatics, and pattern recognition.

Saima Gaffar received her B.Tech. and M.Tech. degrees in Computer Science from the University of Kashmir, Srinagar, India, and her Ph.D. in Electronics and Information Engineering from Jeonbuk National University, South Korea. Her research focuses on bioinformatics, computational biology, deep learning, and image processing.

Hamza Zahid received his B.S. degree in Mechatronics Engineering from the University of Engineering and Technology, Peshawar, Pakistan. He is currently pursuing the integrated M.S. and Ph.D. degrees in Electronics and Information Engineering at Jeonbuk National University, South Korea. His primary research interests include the applications of artificial intelligence in computational drug discovery.

Sang Jun Lee received his B.S., M.S., and Ph.D. degrees in Electrical Engineering from POSTECH, South Korea. Following his doctoral studies, he worked as a senior researcher at the Samsung Advanced Institute of Technology (SAIT). He is currently an Associate Professor in the Division of Electronics and Information Engineering at Jeonbuk National University, South Korea. His research interests include image analysis, deep learning, and medical image processing.
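The two-layer cascade described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only: the scoring functions below are hypothetical placeholders, whereas the paper's actual layers are fine-tuned protein language models.

```python
# Two-layer cascade sketch: layer 1 screens for transcription factors (TFs);
# only sequences that pass are sent to layer 2, which flags TFs with a
# binding preference for methylated DNA (TFPMs).
# NOTE: predict_tf and predict_tfpm are illustrative stand-ins, not the
# paper's fine-tuned pLM classifiers.

def predict_tf(sequence: str) -> float:
    """Layer 1 placeholder: score based on basic residues (K, R, H),
    which are common in DNA-binding regions. Illustrative only."""
    basic = sum(sequence.count(aa) for aa in "KRH")
    return min(1.0, basic / max(len(sequence), 1) * 4)

def predict_tfpm(sequence: str) -> float:
    """Layer 2 placeholder: arbitrary heuristic standing in for the
    methylated-DNA-preference classifier."""
    return min(1.0, sequence.count("C") / max(len(sequence), 1) * 10)

def classify(sequence: str,
             tf_threshold: float = 0.5,
             tfpm_threshold: float = 0.5) -> str:
    """Cascade logic: non-TFs exit at layer 1; TFs are further split
    into plain TFs and TFPMs by layer 2."""
    if predict_tf(sequence) < tf_threshold:
        return "non-TF"
    if predict_tfpm(sequence) < tfpm_threshold:
        return "TF"
    return "TFPM"
```

In the paper's setting, each placeholder would be replaced by a fine-tuned pLM head (full or parameter-efficient fine-tuning), but the cascade structure is the same: layer 2 is evaluated only for sequences that layer 1 calls TF.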