Resolution of recursive data corruption to transform T-cell epitope discovery

Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.

2026-04-01 bioinformatics
10.64898/2026.03.30.710191 bioRxiv
Accurate prediction of MHC class I-presented peptides is essential for any vaccine or T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. Here we show that this discrepancy may come from a common methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, resulting in an iterative confirmation bias. An audit of the IEDB, the largest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability on new data, effectively making it impossible to objectively test model performance, which can lead to choosing suboptimal solutions and decreasing the chance of any therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. A preclinical cancer vaccine study validated that 2 of the 4 deepMHCflare-nominated peptides were immunogenic, with a third independently confirmed in the literature.
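
The contrast the abstract draws between AUROC and top-of-list retrieval can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, the hypothetical protein IDs, and the toy scores are assumptions made for illustration, and Precision@k is computed here in its common form (hits among the k highest-scoring candidates, divided by k).

```python
from typing import Dict, Sequence, Tuple


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int = 4) -> float:
    """Fraction of the k highest-scoring candidates that are true positives.

    Divides by k even when fewer than k candidates exist (a common convention).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_precision_at_k(
    per_protein: Dict[str, Tuple[Sequence[float], Sequence[int]]], k: int = 4
) -> float:
    """Protein-centric view: rank peptides within each protein, then average."""
    values = [precision_at_k(s, y, k) for s, y in per_protein.values()]
    return sum(values) / len(values)


# Toy data (hypothetical): 100 candidate peptides from one protein, with the
# two true epitopes ranked 5th and 6th by the model.
toy_scores = [100 - i for i in range(100)]
toy_labels = [0] * 100
toy_labels[4] = 1
toy_labels[5] = 1

print(f"AUROC       = {auroc(toy_scores, toy_labels):.3f}")
print(f"Precision@4 = {precision_at_k(toy_scores, toy_labels, k=4):.2f}")
```

In this toy case the two positives outrank almost every negative, so AUROC stays close to 1, yet neither appears in the top four, so Precision@4 is zero. That is the kind of gap a shortlist-based metric exposes when only a handful of peptides can be taken forward.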

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Predicted probability
1     Cell Systems                                     167                     Top 0.2%    22.5%
2     Nature Communications                            4913                    Top 11%     14.4%
3     Nature Machine Intelligence                      61                      Top 0.3%    6.8%
4     Nature Biotechnology                             147                     Top 1%      6.4%
(50% of probability mass above)
5     Advanced Science                                 249                     Top 5%      3.7%
6     Nature Methods                                   336                     Top 3%      3.6%
7     Cell Genomics                                    162                     Top 2%      3.1%
8     Nucleic Acids Research                           1128                    Top 8%      2.6%
9     Science Advances                                 1098                    Top 10%     2.6%
10    Proceedings of the National Academy of Sciences  2130                    Top 25%     2.6%
11    eLife                                            5422                    Top 36%     2.1%
12    Nature                                           575                     Top 10%     1.9%
13    Frontiers in Immunology                          586                     Top 4%      1.7%
14    Science                                          429                     Top 14%     1.7%
15    Nature Genetics                                  240                     Top 4%      1.7%
16    Nature Cell Biology                              99                      Top 3%      1.5%
17    Communications Biology                           886                     Top 12%     1.3%
18    Nature Medicine                                  117                     Top 3%      1.1%
19    Cell Reports Medicine                            140                     Top 6%      0.9%
20    Nature Biomedical Engineering                    42                      Top 1%      0.9%
21    iScience                                         1063                    Top 27%     0.9%
22    Cell                                             370                     Top 15%     0.9%
23    Patterns                                         70                      Top 2%      0.8%
24    Nature Chemical Biology                          104                     Top 3%      0.7%
25    Genome Medicine                                  154                     Top 9%      0.6%
26    Structure                                        175                     Top 4%      0.6%
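
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The snippet below is a minimal sketch: the helper name `journals_to_reach` is hypothetical, and the values are the first five predicted probabilities from the listing above.

```python
from typing import Optional, Sequence


def journals_to_reach(probabilities: Sequence[float], threshold: float = 50.0) -> Optional[int]:
    """Return how many top-ranked entries are needed for the cumulative
    percentage to reach `threshold`, or None if it is never reached."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Predicted probabilities (%) for ranks 1-5, copied from the listing.
top_five = [22.5, 14.4, 6.8, 6.4, 3.7]
print(journals_to_reach(top_five))  # 22.5 + 14.4 + 6.8 + 6.4 is about 50.1
```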