Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics
10.64898/2026.05.15.725340 bioRxiv
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder automatically learns latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, which reflects how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those trained on the hybrid representation. The comparison showed that incorporating the latent features derived from the one-dimensional convolutional neural network improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve (ROC-AUC) of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work on this fundamental problem.

Author Summary

Protein-protein interactions are fundamental to all biological processes. However, predicting these interactions remains a key problem in molecular biology, and computational approaches have been tested to address it. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interactions. First, we trained a deep learning model that automatically learned the primary sequence and its related characteristics, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences and called the two together "hybrid features." We then systematically compared frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among the four at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
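
To make the idea of a hybrid feature set concrete, the following is a minimal Python sketch, not the authors' implementation. It computes amino acid frequency features for a sequence and concatenates them with a latent vector for each protein in a candidate pair. The function names (`aa_frequencies`, `hybrid_pair_features`), the two-dimensional latent vectors, and the choice to represent a pair by simple concatenation are all assumptions made for illustration; the abstract does not specify the latent dimension or how pairs are encoded. In practice the latent vectors would presumably come from the trained autoencoder's encoder.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, B, Z) are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def hybrid_pair_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency and latent features for a candidate protein pair.

    `latent_a` and `latent_b` are placeholders for autoencoder embeddings
    (hypothetical values here, not output of a trained model).
    """
    return (
        aa_frequencies(seq_a) + list(latent_a)
        + aa_frequencies(seq_b) + list(latent_b)
    )


# Hypothetical sequences and latent vectors, for illustration only.
features = hybrid_pair_features("MKTAYIAK", "GAVLIG", [0.1, -0.3], [0.7, 0.2])
print(len(features))  # 20 + 2 + 20 + 2 = 44 features per pair
```

A vector like `features` could then be passed, one row per protein pair, to a classifier such as a random forest. Dropping the latent entries gives the frequency-only baseline that the hybrid representation is compared against.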

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 0.2%, 22.7%
2. Bioinformatics: 1061 papers in training set, Top 2%, 14.5%
3. PLOS Computational Biology: 1633 papers in training set, Top 7%, 4.9%
4. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.3%, 4.9%
5. PLOS ONE: 4510 papers in training set, Top 33%, 4.3%
50% of probability mass above
6. Biology Methods and Protocols: 53 papers in training set, Top 0.2%, 4.0%
7. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 3.3%
8. Physical Biology: 43 papers in training set, Top 0.5%, 3.1%
9. F1000Research: 79 papers in training set, Top 0.9%, 2.4%
10. Journal of Bioinformatics and Systems Biology: 14 papers in training set, Top 0.1%, 2.1%
11. Bioinformatics Advances: 184 papers in training set, Top 2%, 2.1%
12. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 3%, 1.9%
13. Scientific Reports: 3102 papers in training set, Top 52%, 1.9%
14. Frontiers in Genetics: 197 papers in training set, Top 4%, 1.9%
15. PeerJ: 261 papers in training set, Top 7%, 1.7%
16. BioData Mining: 15 papers in training set, Top 0.3%, 1.7%
17. Computational Biology and Chemistry: 23 papers in training set, Top 0.2%, 1.3%
18. BMC Genomics: 328 papers in training set, Top 3%, 1.2%
19. Proteins: Structure, Function, and Bioinformatics: 82 papers in training set, Top 0.6%, 1.2%
20. Protein Science: 221 papers in training set, Top 1%, 1.2%
21. Briefings in Bioinformatics: 326 papers in training set, Top 5%, 1.1%
22. Frontiers in Molecular Biosciences: 100 papers in training set, Top 3%, 1.0%
23. Journal of Chemical Information and Modeling: 207 papers in training set, Top 3%, 0.9%
24. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.5%, 0.8%
25. Biomolecules: 95 papers in training set, Top 3%, 0.7%
26. International Journal of Molecular Sciences: 453 papers in training set, Top 17%, 0.7%
27. Journal of Computational Biology: 37 papers in training set, Top 0.8%, 0.5%
28. Biology: 43 papers in training set, Top 4%, 0.5%