Resolution of recursive data corruption to transform T-cell epitope discovery
Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.
Accurate prediction of MHC class I-presented peptides is essential for vaccine and T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical success. Here we show that this discrepancy may stem from a common methodological error: immunopeptidomics datasets are contaminated by existing prediction models through prediction-based deconvolution and filtering, which produces an iterative confirmation bias. An audit of the IEDB, the largest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability to new data. Objective testing of model performance becomes effectively impossible, which can lead to the selection of suboptimal models and reduce the chance of a therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. In a preclinical cancer vaccine study, 2 of the 4 deepMHCflare-nominated peptides were confirmed to be immunogenic, and a third was independently confirmed in the literature.
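The contrast the abstract draws between AUROC and top-of-list retrieval can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (`precision_at_k`, `mean_precision_at_k_per_protein`, `auroc`) and the toy scores are hypothetical, and the per-protein grouping is only one plausible reading of "protein-centric" ranking. The toy case places four non-presented peptides above the four true ones, with 92 further non-presented peptides below them.

```python
from collections import defaultdict


def precision_at_k(scores, labels, k=4):
    """Fraction of true positives among the k highest-scoring peptides.

    Always divides by k, so a list with fewer than k peptides is penalized.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def mean_precision_at_k_per_protein(records, k=4):
    """Average Precision@k over proteins.

    records: iterable of (protein_id, score, label) tuples (hypothetical layout).
    """
    by_protein = defaultdict(list)
    for protein_id, score, label in records:
        by_protein[protein_id].append((score, label))
    values = [
        precision_at_k([s for s, _ in items], [l for _, l in items], k)
        for items in by_protein.values()
    ]
    return sum(values) / len(values)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in list size; fine for a toy example.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 4 negatives ranked first, then 4 positives, then 92 negatives.
scores = [1.00, 0.99, 0.98, 0.97] + [0.90, 0.89, 0.88, 0.87] + [i * 0.001 for i in range(92)]
labels = [0, 0, 0, 0] + [1, 1, 1, 1] + [0] * 92

print(precision_at_k(scores, labels, k=4))  # 0.0
print(round(auroc(scores, labels), 3))      # 0.958
```

In this toy case every true peptide still outranks 92 of the 96 negatives, so AUROC stays near 0.96 even though none of the four top-ranked candidates is a true positive. A metric computed only on the top of the list is sensitive to exactly this kind of failure.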
Matching journals
The top 4 journals account for 50% of the predicted probability mass.