STAPLER: Efficient learning of TCR-peptide specificity prediction from full-length TCR-peptide data

Kwee, B. P. Y.; Messemaker, M.; Marcus, E.; Oliveira, G.; Scheper, W.; Wu, C.; Teuwen, J.; Schumacher, T.

bioRxiv preprint (bioinformatics), posted 2023-04-28. DOI: 10.1101/2023.04.25.538237

The prediction of peptide-MHC (pMHC) recognition by αβ T-cell receptors (TCRs) remains a major biomedical challenge. Here, we develop STAPLER (Shared TCR And Peptide Language bidirectional Encoder Representations from transformers), a transformer language model that uses a joint TCRαβ-peptide input to allow the learning of patterns within and between TCRαβ and peptide sequences that encode recognition. First, we demonstrate how data leakage during negative data generation can confound performance estimates of neural network-based models in predicting TCR-pMHC specificity. We then demonstrate that, because of its pre-training and fine-tuning masked language modeling tasks, STAPLER outperforms both neural network-based and distance-based ML models in predicting the recognition of known antigens in an independent dataset, in particular for antigens for which little related data is available. Based on this ability to efficiently learn from limited labeled TCR-peptide data, STAPLER is well-suited to utilize growing TCR-pMHC datasets to achieve accurate prediction of TCR-pMHC specificity.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
-----|---------|------------------------|------------|------------
1  | Nature Machine Intelligence | 61 | Top 0.1% | 23.2%
2  | Cell Systems | 167 | Top 0.9% | 10.4%
3  | Bioinformatics | 1061 | Top 3% | 8.7%
4  | PLOS Computational Biology | 1633 | Top 5% | 6.5%
5  | Nature Communications | 4913 | Top 34% | 4.4%
(50% of probability mass above this line)
6  | Nature Biotechnology | 147 | Top 2% | 3.7%
7  | Genome Medicine | 154 | Top 2% | 3.7%
8  | Proceedings of the National Academy of Sciences | 2130 | Top 24% | 2.8%
9  | Nucleic Acids Research | 1128 | Top 7% | 2.7%
10 | Nature Methods | 336 | Top 4% | 2.1%
11 | mAbs | 28 | Top 0.2% | 1.7%
12 | Patterns | 70 | Top 0.8% | 1.7%
13 | Cell Reports Medicine | 140 | Top 3% | 1.7%
14 | Bioinformatics Advances | 184 | Top 3% | 1.4%
15 | Cell Genomics | 162 | Top 4% | 1.4%
16 | Scientific Reports | 3102 | Top 63% | 1.4%
17 | eLife | 5422 | Top 48% | 1.3%
18 | Briefings in Bioinformatics | 326 | Top 5% | 1.3%
19 | PLOS ONE | 4510 | Top 61% | 1.1%
20 | Genome Research | 409 | Top 3% | 1.0%
21 | Frontiers in Immunology | 586 | Top 6% | 1.0%
22 | Science Advances | 1098 | Top 26% | 0.9%
23 | Nature Biomedical Engineering | 42 | Top 2% | 0.8%
24 | Communications Biology | 886 | Top 22% | 0.8%
25 | Advanced Science | 249 | Top 18% | 0.8%
26 | Nature Medicine | 117 | Top 5% | 0.8%
27 | iScience | 1063 | Top 33% | 0.7%
28 | Nature Computational Science | 50 | Top 2% | 0.7%
29 | Leukemia | 39 | Top 0.8% | 0.7%
30 | BMC Bioinformatics | 383 | Top 8% | 0.5%
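The "50% of probability mass" note can be sanity-checked by accumulating the listed percentages: a minimal sketch, using the top-10 probabilities from the table above (the threshold is first crossed at rank 5, at a cumulative 53.2%).

```python
# Cumulative probability mass over the ranked journal predictions above.
probs = [23.2, 10.4, 8.7, 6.5, 4.4, 3.7, 3.7, 2.8, 2.7, 2.1]  # top 10, in %

cumulative = []
total = 0.0
for p in probs:
    total += p
    cumulative.append(round(total, 1))

# Rank at which the running total first reaches 50%, per the table's note.
crossing_rank = next(i + 1 for i, c in enumerate(cumulative) if c >= 50.0)
print(crossing_rank)   # 5
print(cumulative[3])   # 48.8 (rank 4 is still below 50%)
print(cumulative[4])   # 53.2
```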