
Information Leakage in Enzyme Substrate Prediction

Atabaigi Elmi, V.; Joeres, R.; Kalinina, O. V.

2026-03-01 bioinformatics
10.64898/2026.02.26.708291 bioRxiv

Enzymes are essential catalysts in many cellular processes. Understanding their interactions with small molecules, such as regulators, cofactors, and, most importantly, substrates, is crucial for understanding the biochemical processes that occur in cells. Correctly interpreting the roles of small molecules that interact with enzymes is key to elucidating enzyme function. Recently, the prediction of enzyme-small molecule interactions has attracted growing interest from developers of computational and, especially, deep-learning methods, and numerous datasets and models with remarkable performance have been published. In this work, we critically examine one of the most popular datasets and three models trained on it, identifying leaked information that may inflate reported model performance. We show that the inspected models are susceptible to information leakage and that their performance drops to near-random when the leakage is removed.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology; 1633 papers in training set; Top 0.7%; 22.5%
2. Briefings in Bioinformatics; 326 papers in training set; Top 0.7%; 6.8%
3. Journal of Chemical Information and Modeling; 207 papers in training set; Top 0.9%; 6.4%
4. Bioinformatics; 1061 papers in training set; Top 4%; 6.3%
5. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 0.8%; 4.8%
6. Journal of Cheminformatics; 25 papers in training set; Top 0.1%; 3.6%
(50% of probability mass above)
7. BMC Bioinformatics; 383 papers in training set; Top 3%; 3.6%
8. Scientific Reports; 3102 papers in training set; Top 43%; 2.9%
9. Nature Machine Intelligence; 61 papers in training set; Top 1%; 2.7%
10. IEEE/ACM Transactions on Computational Biology and Bioinformatics; 32 papers in training set; Top 0.1%; 2.1%
11. Communications Biology; 886 papers in training set; Top 9%; 1.7%
12. Advanced Science; 249 papers in training set; Top 11%; 1.7%
13. Nature Communications; 4913 papers in training set; Top 51%; 1.7%
14. IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.2%; 1.7%
15. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 4%; 1.5%
16. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 36%; 1.3%
17. Journal of Molecular Biology; 217 papers in training set; Top 2%; 1.3%
18. Computational Biology and Chemistry; 23 papers in training set; Top 0.2%; 1.3%
19. NAR Genomics and Bioinformatics; 214 papers in training set; Top 3%; 1.2%
20. International Journal of Molecular Sciences; 453 papers in training set; Top 11%; 1.2%
21. PLOS ONE; 4510 papers in training set; Top 62%; 1.1%
22. Computers in Biology and Medicine; 120 papers in training set; Top 3%; 0.9%
23. Proteins: Structure, Function, and Bioinformatics; 82 papers in training set; Top 0.7%; 0.9%
24. Journal of Chemical Theory and Computation; 126 papers in training set; Top 0.7%; 0.9%
25. Patterns; 70 papers in training set; Top 2%; 0.8%
26. npj Systems Biology and Applications; 99 papers in training set; Top 2%; 0.8%
27. ACS Omega; 90 papers in training set; Top 4%; 0.7%
28. eLife; 5422 papers in training set; Top 59%; 0.7%
29. Frontiers in Genetics; 197 papers in training set; Top 10%; 0.7%
30. Chemical Science; 71 papers in training set; Top 2%; 0.6%