Improving the assessment of deep learning models in the context of drug-target interaction prediction

Torrisi, M.; de la Vega de Leon, A.; Climent, G.; Loos, R.; Panjkovich, A.

2022-04-21 bioinformatics

10.1101/2022.04.20.488898 bioRxiv

Abstract: Machine learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the sparsity and nonuniformity of the available datasets, which limit the applicability domain of machine learning techniques. Moreover, widespread approaches to splitting datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, which breaks the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based machine learning models for drug-target prediction. We show that previous approaches to reducing bias in binding datasets focus on drug or target information only and thus lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a machine learning model, based on the systematic separation of test samples with respect to the drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) drug or (2) target present in the training set, or (3) neither.
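
The separation of test samples described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `stratify_test_set`, the group labels, and the representation of an interaction as a `(drug_id, target_id)` pair are all assumptions made here. The sketch also assumes that scenario (1) means only the drug was seen in training and scenario (2) means only the target was seen; test pairs whose drug and target both appear in training are collected in a separate `both_seen` group, which is likewise an assumption.

```python
from collections import defaultdict


def stratify_test_set(train, test):
    """Group test interactions by which parts of the pair occur in training.

    `train` and `test` are iterables of (drug_id, target_id) pairs.
    All names and group labels here are hypothetical, for illustration only.
    """
    train = list(train)  # materialize so it can be scanned twice
    train_drugs = {drug for drug, _ in train}
    train_targets = {target for _, target in train}

    groups = defaultdict(list)
    for drug, target in test:
        seen_drug = drug in train_drugs
        seen_target = target in train_targets
        if seen_drug and seen_target:
            key = "both_seen"      # assumed extra group, not one of the three scenarios
        elif seen_drug:
            key = "drug_seen"      # scenario (1): only the drug is in training
        elif seen_target:
            key = "target_seen"    # scenario (2): only the target is in training
        else:
            key = "neither_seen"   # scenario (3): drug and target are both new
        groups[key].append((drug, target))
    return dict(groups)


# Toy data, made up for this example.
example_train = [("d1", "t1"), ("d2", "t2")]
example_test = [("d1", "t2"), ("d1", "t3"), ("d3", "t1"), ("d3", "t3")]
example_groups = stratify_test_set(example_train, example_test)
```

Reporting a model's metrics separately for each group, rather than once over the whole test set, is one way to see how far performance depends on drugs or targets already encountered during training.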
