Predicting transcriptional activation domain function using Graph Neural Networks

Farheen, F.; Broyles, B. K.; Zhang, Y.; Ibtehaz, N.; Erkine, A. M.; Kihara, D.

2024-05-12 bioinformatics
10.1101/2024.05.08.593266 bioRxiv
Analysis of factors that lead to the functionality of transcriptional activation domains remains a crucial and yet challenging task owing to the significant diversity in their sequences and their intrinsically disordered nature. Almost all existing methods for predicting activation domains have relied either on traditional machine learning approaches, such as logistic regression, that are unable to capture complex patterns in data, or on plain convolutional neural networks, and have been limited in their exploration of structural features. However, there is tremendous potential in inspecting the structural properties of activation domains, and an opportunity to investigate complex relationships between features of residues in the sequence. To address these gaps, we have utilized graph neural networks (GNNs), which represent structural data in the form of nodes and edges and allow nodes to exchange information among themselves. We experimented with two kinds of graph formulations, one with residues as nodes and the other with atoms as nodes. A logistic regression model was also developed to analyze feature importance. For all the models, several feature combinations were tested. The residue-level GNN model with the combination of amino acid type, residue position, acidic/basic/aromatic property and secondary structure features performed best, with accuracy, F1 score and AUROC of 97.9%, 71% and 97.1%, respectively, outperforming other existing methods in the literature when applied to the dataset we used. Among the other structure-based features that were analyzed, the amphipathic property of helices also proved to be an important feature for classification. Logistic regression results showed that the most dominant feature that makes a sequence functional is the frequency of different types of amino acids in the sequence. Our results have consistently shown that functional sequences have more acidic and aromatic residues, whereas basic residues are seen more in non-functional sequences.
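
To make the residue-level formulation concrete, the following is a minimal sketch of how a sequence could be turned into a graph with residues as nodes. It is not the authors' implementation. The abstract does not say how edges are defined, so the sequence-window edges, the `window` parameter, the function names `residue_graph` and `message_pass`, and the grouping of histidine with the basic residues are all assumptions made for illustration. Secondary structure features are left out because they would require an external predictor.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
ACIDIC = set("DE")
BASIC = set("KRH")      # assumption: histidine counted as basic
AROMATIC = set("FWY")


def residue_graph(seq, window=2):
    """Return (node_features, adjacency) for a residue-level graph.

    Each node carries a one-hot amino acid type, a relative position,
    and acidic/basic/aromatic flags. Edges connect residues that are at
    most `window` positions apart in the sequence (hypothetical choice).
    """
    n = len(seq)
    feats = np.zeros((n, len(AMINO_ACIDS) + 4))
    for i, aa in enumerate(seq):
        feats[i, AMINO_ACIDS.index(aa)] = 1.0   # amino acid type
        feats[i, 20] = i / max(n - 1, 1)        # relative position in [0, 1]
        feats[i, 21] = aa in ACIDIC
        feats[i, 22] = aa in BASIC
        feats[i, 23] = aa in AROMATIC
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i != j:
                adj[i, j] = 1.0
    return feats, adj


def message_pass(feats, adj):
    """One round of mean-neighbor aggregation, with no learned weights."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj @ feats / np.maximum(deg, 1.0)


feats, adj = residue_graph("DDFWLEK")
hidden = message_pass(feats, adj)
print(feats.shape, adj.shape, hidden.shape)
```

A trained GNN would replace the plain averaging in `message_pass` with learned transformations and stack several such rounds before pooling node states into a single prediction for the sequence.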

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. PLOS ONE: 4510 papers in training set, Top 15%, 12.5%
2. Scientific Reports: 3102 papers in training set, Top 9%, 8.5%
3. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.5%, 6.5%
4. BMC Bioinformatics: 383 papers in training set, Top 2%, 4.9%
5. Briefings in Bioinformatics: 326 papers in training set, Top 1%, 4.9%
6. Computational Biology and Chemistry: 23 papers in training set, Top 0.1%, 4.4%
7. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.2%
8. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.1%, 3.3%
9. Journal of Chemical Information and Modeling: 207 papers in training set, Top 2%, 2.4%
50% of probability mass above
10. Frontiers in Genetics: 197 papers in training set, Top 3%, 2.1%
11. Computers in Biology and Medicine: 120 papers in training set, Top 2%, 2.1%
12. Bioinformatics: 1061 papers in training set, Top 7%, 1.9%
13. Genes: 126 papers in training set, Top 0.7%, 1.9%
14. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 1.9%
15. F1000Research: 79 papers in training set, Top 2%, 1.7%
16. Physical Biology: 43 papers in training set, Top 1%, 1.7%
17. Journal of Bioinformatics and Systems Biology: 14 papers in training set, Top 0.2%, 1.3%
18. Biosystems: 18 papers in training set, Top 0.2%, 1.3%
19. Journal of Computational Biology: 37 papers in training set, Top 0.3%, 1.2%
20. ACS Omega: 90 papers in training set, Top 3%, 1.1%
21. Informatics in Medicine Unlocked: 21 papers in training set, Top 0.8%, 1.0%
22. Molecules: 37 papers in training set, Top 1%, 0.9%
23. Journal of Biosciences: 12 papers in training set, Top 0.1%, 0.8%
24. The Journal of Physical Chemistry B: 158 papers in training set, Top 2%, 0.8%
25. Biology Methods and Protocols: 53 papers in training set, Top 2%, 0.8%
26. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.8%
27. Biomolecules: 95 papers in training set, Top 2%, 0.8%
28. BioSystems: 11 papers in training set, Top 0.3%, 0.8%
29. npj Systems Biology and Applications: 99 papers in training set, Top 2%, 0.8%
30. PeerJ: 261 papers in training set, Top 15%, 0.8%
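
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses the first ten values from the list; the function name `journals_to_reach` is hypothetical, and the percentages are taken as printed, so rounding on the page may shift the total slightly.

```python
# Predicted probabilities (in percent) for ranks 1 to 10, as listed above.
probs = [12.5, 8.5, 6.5, 4.9, 4.9, 4.4, 4.2, 3.3, 2.4, 2.1]


def journals_to_reach(probs, threshold=50.0):
    """Return (rank, cumulative total) at which the running sum first
    reaches the threshold, or None if it never does."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


rank, total = journals_to_reach(probs)
print(rank, round(total, 1))
```

With these values the running sum stays below 50% through rank 8 and crosses it at rank 9, which matches the position of the "50% of probability mass above" marker.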