Performance comparison of TCR-pMHC prediction tools reveals a strong data dependency

Deng, L.; Ly, C.; Abdollahi, S.; Zhao, Y.; Prinz, I.; Bonn, S.

2022-11-24 bioinformatics
10.1101/2022.11.24.517666 bioRxiv

The interaction of T-cell receptors (TCRs) with peptide-major histocompatibility complex (pMHC) molecules plays a crucial role in adaptive immune responses. Various models currently aim to predict TCR-pMHC binding, but a standard dataset and procedure for comparing their performance are still missing. In this work we provide a general method for data collection, preprocessing, splitting, and generation of negative examples, as well as comprehensive datasets for comparing TCR-pMHC prediction models. We collected, harmonized, and merged all the major publicly available TCR-pMHC binding data and used it to compare the performance of five state-of-the-art deep learning models (TITAN, NetTCR, ERGO, DLpTCR, and ImRex). Our performance evaluation focuses on two scenarios: 1) different splitting methods for generating training and testing data, to assess model generalization, and 2) different data versions that vary in size and peptide imbalance, to assess model robustness. Our results indicate that the five contemporary models do not generalize to peptides that are absent from the training set. We also show that model performance depends strongly on data balance and size, which indicates relatively low model robustness. These results suggest that TCR-pMHC binding prediction remains highly challenging and requires further high-quality data and novel algorithmic approaches.
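To make the generalization scenario concrete, the following is a minimal sketch of one way to hold out whole peptides and to generate negative examples by mismatching TCRs and peptides. It is an illustration only, not the authors' pipeline: the function names `split_by_peptide` and `add_shuffled_negatives`, the 20% test fraction, the one-negative-per-positive ratio, and the toy sequences are all assumptions.

```python
import random


def split_by_peptide(pairs, test_fraction=0.2, seed=0):
    """Split (tcr, peptide) pairs so that test peptides never occur in training.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    rng = random.Random(seed)
    peptides = sorted({pep for _, pep in pairs})
    rng.shuffle(peptides)
    n_test = max(1, int(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])
    train = [p for p in pairs if p[1] not in test_peptides]
    test = [p for p in pairs if p[1] in test_peptides]
    return train, test


def add_shuffled_negatives(pairs, seed=0):
    """Label known pairs as 1 and add one mismatched pair per TCR as 0.

    Assumption: a TCR paired with a peptide it is not recorded with is
    treated as a non-binder, which may not hold for real data.
    """
    rng = random.Random(seed)
    positives = set(pairs)
    peptides = sorted({pep for _, pep in pairs})
    labelled = [(tcr, pep, 1) for tcr, pep in pairs]
    for tcr, _ in pairs:
        candidates = [p for p in peptides if (tcr, p) not in positives]
        if candidates:
            labelled.append((tcr, rng.choice(candidates), 0))
    return labelled


# Toy placeholder sequences, not real binding data.
pairs = [
    ("TCR_A", "PEP_1"),
    ("TCR_B", "PEP_1"),
    ("TCR_C", "PEP_2"),
    ("TCR_D", "PEP_3"),
    ("TCR_E", "PEP_4"),
    ("TCR_F", "PEP_5"),
]
train, test = split_by_peptide(pairs)
labelled_train = add_shuffled_negatives(train)
```

A model evaluated on `test` under this split only ever sees peptides it was not trained on, which is the setting in which the abstract reports that the compared models fail to generalize.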

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | ImmunoInformatics | 11 | Top 0.1% | 22.8%
2 | Briefings in Bioinformatics | 326 | Top 0.3% | 12.5%
3 | PLOS Computational Biology | 1633 | Top 3% | 10.2%
4 | Frontiers in Immunology | 586 | Top 0.7% | 8.5%
(50% of probability mass above)
5 | Computers in Biology and Medicine | 120 | Top 0.3% | 6.4%
6 | Bioinformatics | 1061 | Top 5% | 4.4%
7 | BMC Bioinformatics | 383 | Top 3% | 2.6%
8 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 2.6%
9 | Scientific Reports | 3102 | Top 47% | 2.4%
10 | Frontiers in Physiology | 93 | Top 2% | 2.1%
11 | GigaScience | 172 | Top 1% | 1.7%
12 | Nature Machine Intelligence | 61 | Top 2% | 1.7%
13 | PLOS ONE | 4510 | Top 53% | 1.7%
14 | Frontiers in Bioinformatics | 45 | Top 0.3% | 1.5%
15 | Nucleic Acids Research | 1128 | Top 13% | 1.2%
16 | iScience | 1063 | Top 26% | 0.9%
17 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.8%
18 | Expert Systems with Applications | 11 | Top 0.4% | 0.8%
19 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 | Top 0.6% | 0.8%
20 | npj Systems Biology and Applications | 99 | Top 2% | 0.8%
21 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.8%
22 | Frontiers in Genetics | 197 | Top 10% | 0.7%
23 | Informatics in Medicine Unlocked | 21 | Top 1% | 0.7%
24 | Bioinformatics Advances | 184 | Top 5% | 0.7%
25 | Cell Reports Methods | 141 | Top 7% | 0.5%
26 | International Journal of Molecular Sciences | 453 | Top 19% | 0.5%
27 | Frontiers in Pharmacology | 100 | Top 6% | 0.5%