
EpiBERTope: a sequence-based pre-trained BERT model improves linear and structural epitope prediction by learning long-distance protein interactions effectively

Park, M.; Seo, S.-w.; Park, E.; Kim, J.

2022-03-02 bioinformatics
10.1101/2022.02.27.481241 bioRxiv

Motivation: Epitopes are the immunogenic regions of an antigen that antibodies recognize in a highly specific manner to trigger an immune response. Predicting such regions is extremely difficult, yet it has profound implications for understanding the complex mechanisms of humoral immunogenicity.

Results: Here, we present EpiBERTope, a BERT-based epitope prediction model pre-trained on the Swiss-Prot protein database, which can predict both linear and structural epitopes using protein sequences only. The model achieves AUCs of 0.922 and 0.667 on the linear and structural epitope datasets, respectively, outperforming all benchmark classification models, including random forest, gradient boosting, naive Bayesian, and support vector machine models. In conclusion, EpiBERTope is a sequence-based model that captures content-based global interactions of antigen sequences, which will be transformative in epitope discovery with high specificity.

Contact: minjun.park@standigm.com
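For readers unfamiliar with the AUC metric quoted in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from binary epitope/non-epitope labels and model scores. It uses the pairwise-ranking definition of AUC; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper or its datasets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = epitope, 0 = non-epitope); not from the paper.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported 0.922 (linear) and 0.667 (structural) values.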

Matching journals

The top 3 journals account for at least 50% of the predicted probability mass (55.2% combined).

1. ImmunoInformatics (11 papers in training set): Top 0.1%, 33.7%
2. Bioinformatics (1061 papers in training set): Top 2%, 15.0%
3. Briefings in Bioinformatics (326 papers in training set): Top 0.8%, 6.5%
50% of probability mass above
4. Frontiers in Immunology (586 papers in training set): Top 2%, 4.4%
5. Computational and Structural Biotechnology Journal (216 papers in training set): Top 1%, 4.1%
6. Antibody Therapeutics (16 papers in training set): Top 0.1%, 3.7%
7. Computers in Biology and Medicine (120 papers in training set): Top 1%, 2.7%
8. PLOS Computational Biology (1633 papers in training set): Top 13%, 2.4%
9. Scientific Reports (3102 papers in training set): Top 47%, 2.4%
10. PLOS ONE (4510 papers in training set): Top 47%, 2.1%
11. BMC Bioinformatics (383 papers in training set): Top 4%, 1.7%
12. Genomics, Proteomics & Bioinformatics (171 papers in training set): Top 4%, 1.5%
13. Nature Machine Intelligence (61 papers in training set): Top 2%, 1.3%
14. GigaScience (172 papers in training set): Top 2%, 1.3%
15. mAbs (28 papers in training set): Top 0.3%, 0.9%
16. iScience (1063 papers in training set): Top 26%, 0.9%
17. Bioinformatics Advances (184 papers in training set): Top 4%, 0.8%
18. IEEE Journal of Biomedical and Health Informatics (34 papers in training set): Top 2%, 0.8%
19. Communications Biology (886 papers in training set): Top 23%, 0.8%
20. BioMed Research International (25 papers in training set): Top 4%, 0.7%
21. Cell Reports Methods (141 papers in training set): Top 6%, 0.7%
22. BioSystems (11 papers in training set): Top 0.4%, 0.7%
23. Protein Science (221 papers in training set): Top 2%, 0.7%
24. Nature Communications (4913 papers in training set): Top 67%, 0.5%
25. BMC Medical Genomics (36 papers in training set): Top 2%, 0.5%
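
The 50% cutoff in the ranking above can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first five entries, with values copied from the list; the helper name `journals_to_reach` is hypothetical and is not part of the site that produced the ranking.

```python
# Predicted probabilities (in percent) for the top five journals, copied from the list.
ranked = [
    ("ImmunoInformatics", 33.7),
    ("Bioinformatics", 15.0),
    ("Briefings in Bioinformatics", 6.5),
    ("Frontiers in Immunology", 4.4),
    ("Computational and Structural Biotechnology Journal", 4.1),
]


def journals_to_reach(ranked, threshold):
    """Return (count, cumulative mass) at the first rank where the running
    sum of probabilities reaches the threshold, or None if it never does."""
    total = 0.0
    for count, (_, prob) in enumerate(ranked, start=1):
        total += prob
        if total >= threshold:
            return count, total
    return None


count, mass = journals_to_reach(ranked, 50.0)
print(f"{count} journals reach {mass:.1f}% of the probability mass")
```

The first two journals sum to 48.7%, so the third is needed to cross the 50% mark.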