Improving the assessment of deep learning models in the context of drug-target interaction prediction

Torrisi, M.; de la Vega de Leon, A.; Climent, G.; Loos, R.; Panjkovich, A.

2022-04-21 bioinformatics
10.1101/2022.04.20.488898 bioRxiv
Abstract: Machine learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limit the applicability domain of machine learning techniques. Moreover, widespread approaches to splitting datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, which breaks the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based machine learning models for drug-target prediction. We show that previous approaches to reducing bias in binding datasets focus on drug or target information only and thus lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a machine learning model, based on the systematic separation of test samples with respect to the drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) drug or (2) target present in the training set, or (3) neither.
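
The separation of test samples described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each interaction is a (drug, target) pair of identifiers, and it reads the three scenarios as "only the drug seen in training", "only the target seen in training" and "neither seen". The function name `stratify_test_pairs`, the group labels, and the extra `both_seen` group (pairs whose drug and target each appear in training, though not necessarily together) are hypothetical and added for illustration.

```python
from collections import defaultdict


def stratify_test_pairs(train_pairs, test_pairs):
    """Group test (drug, target) pairs by which entity appears in training.

    Hypothetical helper: the group labels are illustrative and are not
    taken from the paper.
    """
    # Collect every drug and every target that occurs in the training set.
    train_drugs = {drug for drug, _ in train_pairs}
    train_targets = {target for _, target in train_pairs}

    groups = defaultdict(list)
    for drug, target in test_pairs:
        drug_seen = drug in train_drugs
        target_seen = target in train_targets
        if drug_seen and target_seen:
            key = "both_seen"          # extra group, assumed for completeness
        elif drug_seen:
            key = "drug_seen_only"     # scenario (1), as interpreted here
        elif target_seen:
            key = "target_seen_only"   # scenario (2), as interpreted here
        else:
            key = "neither_seen"       # scenario (3)
        groups[key].append((drug, target))
    return dict(groups)


# Toy identifiers, invented for this example only.
train = [("d1", "t1"), ("d2", "t2")]
test = [("d1", "t2"), ("d1", "t3"), ("d3", "t1"), ("d3", "t3")]
groups = stratify_test_pairs(train, test)
```

Reporting a model's metrics separately for each group, rather than once over the pooled test set, is one way to make visible how much of a score depends on drugs or targets already seen during training.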

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Journal of Chemical Information and Modeling: 207 papers in training set, Top 0.1%, 29.3%
2. Journal of Cheminformatics: 25 papers in training set, Top 0.1%, 19.8%
3. Bioinformatics: 1061 papers in training set, Top 4%, 5.1%
(50% of probability mass above)
4. Scientific Reports: 3102 papers in training set, Top 26%, 4.6%
5. PLOS Computational Biology: 1633 papers in training set, Top 9%, 3.8%
6. Briefings in Bioinformatics: 326 papers in training set, Top 2%, 3.3%
7. Artificial Intelligence in the Life Sciences: 11 papers in training set, Top 0.1%, 3.1%
8. Molecules: 37 papers in training set, Top 0.5%, 2.2%
9. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 3%, 2.2%
10. BMC Bioinformatics: 383 papers in training set, Top 4%, 2.2%
11. Computers in Biology and Medicine: 120 papers in training set, Top 2%, 1.9%
12. Frontiers in Molecular Biosciences: 100 papers in training set, Top 3%, 1.0%
13. PLOS ONE: 4510 papers in training set, Top 63%, 0.9%
14. Patterns: 70 papers in training set, Top 2%, 0.8%
15. International Journal of Molecular Sciences: 453 papers in training set, Top 13%, 0.8%
16. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.5%, 0.8%
17. Frontiers in Pharmacology: 100 papers in training set, Top 4%, 0.8%
18. Bioinformatics Advances: 184 papers in training set, Top 5%, 0.8%
19. Communications Biology: 886 papers in training set, Top 23%, 0.8%
20. npj Systems Biology and Applications: 99 papers in training set, Top 2%, 0.8%
21. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.9%, 0.5%
22. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 48%, 0.5%