Fine-Tuning Protein Language Models Enhances the Identification and Interpretation of the Transcription Factors

Hassan, M. T.; Gaffar, S.; Zahid, H.; Lee, S. J.

2025-11-28 bioinformatics
10.1101/2025.11.27.691010 bioRxiv

Transcription factors (TFs) are pivotal regulators of gene expression and play essential roles in diverse cellular activities. The three-dimensional organization of the genome and transcriptional regulation are predominantly orchestrated by TFs. By recruiting the transcriptional machinery to gene enhancers or promoters, TFs can either activate or repress transcription, thereby controlling gene activity and various biological pathways. Accurate identification of TFs is vital for elucidating gene regulatory mechanisms within cells. However, experimental identification remains labor-intensive and time-consuming, highlighting the necessity for efficient computational approaches. In this study, we present a two-layer predictive framework utilizing protein language models (pLMs) via full fine-tuning and parameter-efficient fine-tuning. The initial layer robustly classifies and identifies transcription factors, while the subsequent layer predicts TFs with a binding preference for methylated DNA (TFPMs). Our approach further incorporates attention weights and protein sequence motifs to enhance interpretability and predictive capability. By leveraging attention mechanisms, we highlight biologically relevant regions of the protein sequences that contribute most strongly to the predictions. Additionally, motif analysis facilitates the identification of conserved sequence patterns that are critical for TF recognition and function. Across both TF and TFPM classification tasks, the inclusion of these features allowed our methods to consistently surpass contemporary models, as demonstrated by independent test results.

Key points:
- Developed a two-layer predictive framework using protein language models (pLMs) with both full fine-tuning and parameter-efficient fine-tuning methods.
- The first layer accurately identifies transcription factors (TFs), and the second layer predicts TFs with binding preference for methylated DNA (TFPMs).
- Integrated attention weights and protein sequence motifs to enhance model interpretability by highlighting biologically relevant sequence regions and conserved patterns.
- Achieved superior performance compared to state-of-the-art methods, validated by independent testing.

Mir Tanveerul Hassan obtained his M.Tech. in Computer Science from the University of Kashmir, India, in 2020, and later earned his Ph.D. in Electronics and Information Engineering from Jeonbuk National University, Jeonju, South Korea. He is currently serving as a postdoctoral fellow at the Jeonbuk RICE Intelligence Innovation Research Center. His research interests encompass computational biology, bioinformatics, and pattern recognition.

Saima Gaffar received her B.Tech. and M.Tech. degrees in Computer Science from the University of Kashmir, Srinagar, India, and her Ph.D. in Electronics and Information Engineering from Jeonbuk National University, South Korea. Her research focuses on bioinformatics, computational biology, deep learning, and image processing.

Hamza Zahid received his B.S. degree in Mechatronics Engineering from the University of Engineering and Technology, Peshawar, Pakistan. He is currently pursuing the integrated M.S. and Ph.D. degrees in Electronics and Information Engineering at Jeonbuk National University, South Korea. His primary research interests include the applications of artificial intelligence in computational drug discovery.

Sang Jun Lee received his B.S., M.S., and Ph.D. degrees in Electrical Engineering from POSTECH, South Korea. Following his doctoral studies, he worked as a senior researcher at the Samsung Advanced Institute of Technology (SAIT). He is currently an Associate Professor in the Division of Electronics and Information Engineering at Jeonbuk National University, South Korea. His research interests include image analysis, deep learning, and medical image processing.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.1% | 10.6%
2 | Briefings in Bioinformatics | 326 papers in training set | Top 0.3% | 10.3%
3 | Bioinformatics | 1061 papers in training set | Top 3% | 9.3%
4 | IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.1% | 6.5%
5 | IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.2% | 6.5%
6 | PLOS Computational Biology | 1633 papers in training set | Top 8% | 4.4%
7 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.1% | 4.0%
(50% of probability mass above)
8 | BMC Bioinformatics | 383 papers in training set | Top 3% | 3.7%
9 | Advanced Science | 249 papers in training set | Top 8% | 2.1%
10 | Communications Biology | 886 papers in training set | Top 5% | 2.1%
11 | Physical Review E | 95 papers in training set | Top 0.6% | 1.8%
12 | Nature Communications | 4913 papers in training set | Top 50% | 1.7%
13 | PLOS ONE | 4510 papers in training set | Top 53% | 1.7%
14 | iScience | 1063 papers in training set | Top 14% | 1.7%
15 | Frontiers in Genetics | 197 papers in training set | Top 5% | 1.7%
16 | Journal of Computational Biology | 37 papers in training set | Top 0.3% | 1.4%
17 | Quantitative Biology | 11 papers in training set | Top 0.3% | 1.4%
18 | Neurocomputing | 13 papers in training set | Top 0.3% | 1.4%
19 | Journal of Molecular Biology | 217 papers in training set | Top 2% | 1.2%
20 | Journal of Bioinformatics and Systems Biology | 14 papers in training set | Top 0.3% | 1.1%
21 | Cell Systems | 167 papers in training set | Top 10% | 1.0%
22 | Scientific Reports | 3102 papers in training set | Top 70% | 0.9%
23 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.9%
24 | Patterns | 70 papers in training set | Top 2% | 0.8%
25 | npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
26 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.8%
27 | Bioinformatics Advances | 184 papers in training set | Top 4% | 0.8%
28 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 3% | 0.8%
29 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 5% | 0.8%
30 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 45% | 0.7%
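
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the listed percentages. The sketch below is illustrative only: the function name `journals_needed` is hypothetical, and it assumes the cutoff means the first rank at which the running total reaches or exceeds 50%, which the page does not state explicitly. Only the first ten listed probabilities are copied in.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-10, copied from the list above.
probabilities = [10.6, 10.3, 9.3, 6.5, 6.5, 4.4, 4.0, 3.7, 2.1, 2.1]


def journals_needed(probs, threshold=50.0):
    """Return the smallest number of top-ranked entries whose cumulative
    probability reaches the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


print(journals_needed(probabilities))  # prints 7
```

Under that assumption, ranks 1 to 6 sum to about 47.6% and adding rank 7 brings the total to about 51.6%, which matches the position of the 50% marker in the list.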