ImmunoInformatics — Latest Matching Preprints

1

HALPred-B: Host-Aware Linear B-Cell Epitope Prediction: Challenges, Limitations, and Variability Across Species

Gautam, P.; Mitra, P.; Sinha, I.

2026-06-26 bioinformatics 10.64898/2026.06.22.733770 medRxiv

Top 0.1%

28.6%

Predicting linear B-cell epitopes is a fundamental immunoinformatics task with direct impact on vaccine design and antibody engineering. Recent advances in machine learning have improved predictive performance, but most existing approaches are trained on aggregated datasets and assume that antigenic patterns are conserved across host organisms. This assumption ignores host-dependent immunological variability and limits how well models generalize across species. We present the first systematic host-wise evaluation of machine learning-based linear B-cell epitope prediction, using curated datasets from the Immune Epitope Database (IEDB). We build separate datasets for human, mouse, and non-human primate hosts and assess several classification models, including Random Forest, Support Vector Machine (SVM), Gradient Boosting, XGBoost, and K-Nearest Neighbors (KNN). The models use sequence-derived feature representations, such as AAIndex descriptors, biochemical properties from ExPASy, and dipeptide composition. Our results show that predictive performance differs substantially across hosts. Models achieve up to 86.07% accuracy and 0.93 ROC-AUC on human datasets but perform worse on mouse and non-human primate datasets. This gap reflects dataset bias and differences in sequence distribution, as well as the inability of existing features to capture host-specific immunological context. These results indicate that the prediction of linear B-cell epitopes is intrinsically host-specific and that a single global model does not generalize well across species. We propose incorporating host-aware modeling strategies and organism-specific features to improve predictive reliability and biological relevance.
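
As an illustration of one of the sequence-derived features named in the abstract above, the following is a minimal sketch of dipeptide composition: the frequency of each of the 400 ordered amino-acid pairs in a peptide. The function name, the normalization by the number of valid pairs, and the handling of non-standard residues are assumptions made for this example and are not taken from the preprint.

```python
from itertools import product

# The 20 standard amino acids; pairs containing any other symbol are skipped
# (an assumption for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Each entry is the count of one ordered residue pair divided by the
    number of valid overlapping pairs in the peptide.
    """
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Peptide too short or made only of non-standard residues.
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / total for dp in DIPEPTIDES]


# Example: "ACDAC" contains the pairs AC, CD, DA, AC.
features = dipeptide_composition("ACDAC")
ac_frequency = features[DIPEPTIDES.index("AC")]
```

A vector like this could be concatenated with other descriptors and passed to any of the classifiers listed in the abstract; how the authors actually combine features is not stated there.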

2

A Simple Generative Model for the Prediction of T-Cell Receptor - Peptide Binding in T-Cell Therapy for Cancer

Papanikolaou, A.; Sivtsov, V.; Zereik, E.; Ruggiero, E.; Bonini, C.; Bonsignorio, F.

2025-03-19 bioinformatics 10.1101/2025.03.18.643937 medRxiv

Top 0.1%

27.1%

Objective: To develop a deep learning model capable of predicting epitope peptides recognized by specific CDR3 (Complementarity-Determining Region 3) sequences of T-cell receptors (TCRs) in the context of Major Histocompatibility Complex (MHC) molecules, addressing the challenges of incomplete datasets and the need for novel sequence generation in adoptive T-cell therapy for cancer. Methods: We implemented a sequence-to-sequence generative model named "GRIP" (Generative Reconstruction of antIgen Peptides) using a Long Short-Term Memory (LSTM) network with attention mechanisms. The model was trained and validated on publicly available datasets, employing data balancing, label smoothing, and dynamic learning rate scheduling to enhance performance and generalization. Accuracy was assessed at the amino acid level. Results: The model achieved a training accuracy of 97% and a test accuracy of 85% for predicting epitope sequences at the amino acid level. Probabilistic sequence generation allowed GRIP to produce biologically plausible epitope sequences, even for unseen CDR3 inputs. Attention-based interpretability provided insights into the model's focus on critical sequence elements. The model outperformed existing approaches in handling data imbalance and generalization to novel epitopes. Conclusion: GRIP offers a novel solution to the TCR-epitope binding problem by generating potential epitope sequences instead of matching to known data, addressing a fundamental gap in existing models. This approach has significant implications for personalized immunotherapy, facilitating the design of targeted T-cell therapies for cancer.

3

Automated Discovery of Patterns in T-Cell Receptor Physicochemical Signatures

Shams, Z.; Bishop, E.; Mckee-Reid, L.; Rumbelow, J.

2025-07-10 bioinformatics 10.1101/2025.07.07.663455 medRxiv

Top 0.1%

22.2%

Accurately distinguishing antigen-reactive from non-reactive T-cell receptors (TCRs) is critical for advancing TCR-based immunotherapies and vaccines. Predicting antigen reactivity from physicochemical properties of the TCR sequence alone could enable rapid, low-cost identification of TCRs of interest, accelerating therapeutic discovery. In this paper, we use the Discovery Engine, a novel system for automated knowledge discovery from data, to classify published tumour antigen-reactive and non-reactive TCRs collected from cancer patients. Beyond classification, the Discovery Engine extracts interpretable combinatorial patterns (e.g., combinations of CDR3 length, net charge, and hydrophobicity) that predict whether a TCR is antigen-reactive. These patterns point to biologically meaningful features linked to tumour antigen recognition and could inform rational TCR design and prioritization. Notably, over half of the predictive patterns involve features from both the alpha and beta chains, highlighting the importance of considering both in assessing antigen specificity.

4

T-cell receptor specific protein language model for prediction and interpretation of epitope binding (ProtLM.TCR)

Essaghir, A.; Sathiyamoorthy, N. K.; Smyth, P.; Postelnicu, A.; Ghiviriga, S.; Ghita, A.; Singh, A.; Kapil, S.; Phogat, S.; Singh, G.

2022-11-29 bioinformatics 10.1101/2022.11.28.518167 medRxiv

Top 0.1%

18.3%

The cellular adaptive immune response relies on epitope recognition by T-cell receptors (TCRs). We used a language model for TCRs (ProtLM.TCR) to predict TCR-epitope binding. This model was pre-trained on a large set of TCR sequences (~62 × 10^6) before being fine-tuned to predict TCR-epitope binding across multiple human leukocyte antigen (HLA) class I types. We then tested ProtLM.TCR on a balanced set of binders and non-binders for each epitope, avoiding model shortcuts like HLA categories. We compared pan-HLA versus HLA-specific models, and our results show that while computational prediction of novel TCR-epitope binding probability is feasible, more epitopes and more diverse training datasets are required to achieve better generalized performance in de novo epitope binding prediction tasks. We also show that ProtLM.TCR embeddings outperform BLOSUM scores and hand-crafted embeddings. Finally, we used the LIME framework to examine the interpretability of these predictions.

5

HAIRpred2: Human Host-Specific Prediction of Antibody-Interacting Residues Using Hybrid Physicochemical and Structural Features

Mehta, N. K.; Sahni, R.; Kumar, N.; Raghava, G. P. S.

2026-05-13 bioinformatics 10.64898/2026.05.09.723672 medRxiv

Top 0.1%

18.2%

Prediction of conformational B-cell epitopes is critical for vaccine design, immunotherapy, and antibody engineering. To date, several host-independent computational methods have been developed for predicting antibody-interacting residues in antigen structures. However, it is well established that antigen-antibody (Ag-Ab) interactions vary depending on the host immune system, indicating the importance of developing host-specific prediction models. In this study, we present, for the first time, a human host-specific method, HAIRpred2, that predicts antibody-interacting residues in an antigen from its tertiary structure. The dataset was derived from HAIRpred and comprises 277 human Ag-Ab complexes, with 221 structures used for training and 56 for independent testing. Preliminary analysis revealed that residues with a relative surface accessibility (RSA) below 0.05, corresponding to buried regions, are highly likely to be non-interacting, underscoring the importance of structural accessibility in antibody recognition. To identify the most informative features, we evaluated multiple feature representations, including RSA, large language model (LLM)-based embeddings, distance-based features, and physicochemical properties. A model trained on single-residue RSA features achieved an AUC of 0.72. Incorporating a sliding window of 15 residues to capture local structural context improved performance to an AUC of 0.75. The best performance (AUC = 0.78 on the independent test set) was achieved by integrating RSA with physicochemical descriptors. Benchmarking against existing antibody-interaction prediction methods on the same independent dataset demonstrated that HAIRpred2 outperforms current tools, further highlighting the advantage of host-specific modeling. HAIRpred2 is freely available as a web server at https://webs.iiitd.edu.in/raghava/hairpred2/.

Highlights:
- Development of HAIRpred2, the first human host-specific method for predicting antibody-interacting residues.
- Analysis of 277 human antigen-antibody complexes to capture host-dependent interaction patterns.
- Relative surface accessibility (RSA) identified as a key determinant, with buried residues rarely participating in interactions.
- Integration of RSA with physicochemical features achieved the best performance (AUC = 0.78) on an independent dataset.
- HAIRpred2 outperforms existing methods and is available as a web server for epitope prediction.
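
The HAIRpred2 abstract describes two RSA-based ideas: treating residues with RSA below 0.05 as buried, and giving each residue a 15-residue sliding window of RSA values as local context. The sketch below shows one way such features could be built. The function names, the centered window, and the zero padding at the chain ends are assumptions for illustration; the abstract does not say how chain termini are handled.

```python
def is_buried(rsa_value, threshold=0.05):
    """Flag a residue as buried when its RSA is below the threshold.

    The 0.05 cutoff is the value quoted in the abstract.
    """
    return rsa_value < threshold


def window_features(rsa, window=15, pad_value=0.0):
    """Build one RSA window per residue, centered on that residue.

    Positions beyond the chain ends are filled with pad_value
    (an assumption made for this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered")
    half = window // 2
    padded = [pad_value] * half + list(rsa) + [pad_value] * half
    return [padded[i:i + window] for i in range(len(rsa))]


# Toy per-residue RSA values for a three-residue chain.
rsa = [0.01, 0.30, 0.62]
features = window_features(rsa)
buried_flags = [is_buried(value) for value in rsa]
```

Each row of `features` has 15 values with the residue of interest at index 7, so a per-residue classifier sees its neighbors' accessibility as well as its own.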

6

Determining epitope specificity of T cell receptors with TCRGP

Jokinen, E.; Huuhtanen, J.; Mustjoki, S.; Heinonen, M.; Lähdesmäki, H.

2019-08-21 bioinformatics 10.1101/542332 medRxiv

Top 0.1%

18.1%

T cell receptors (TCRs) can recognize various pathogens and consequently start immune responses. TCRs can be sequenced from individuals, and methods analyzing the specificity of the TCRs can help us better understand individuals' immune status in different diseases. We have developed TCRGP, a novel Gaussian process method to predict if TCRs recognize certain epitopes. This method can utilize CDR sequences from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes. We have experimented with epitope-specific data against 29 epitopes and performed a comprehensive evaluation with existing prediction methods. On this data, TCRGP outperforms other state-of-the-art methods in epitope-specificity predictions. We also propose a novel analysis approach for combined single-cell RNA and TCRαβ (scRNA+TCRαβ) sequencing data by quantifying epitope-specific TCRs with TCRGP in phenotypes identified from scRNA-seq data. With this approach, we find HBV-epitope specific T cells and their transcriptomic states in hepatocellular carcinoma patients.

7

Revealing the hidden sequence distribution of epitope-specific TCR repertoires and its influence on machine learning model performance

Gielis, S.; Chernigovskaya, M.; Pavlovic, M.; Van Deuren, V.; Vandoren, R.; Valkiers, S.; Laukens, K.; Greiff, V.; Meysman, P.

2024-10-24 bioinformatics 10.1101/2024.10.21.619364 medRxiv

Top 0.1%

16.6%

Numerous efforts have been made to decipher the epitope-T cell receptor (TCR) recognition code. Both simple machine learning techniques and deep learning strategies have been used to train models to predict the binding of epitopes by TCR sequences. A good training data set rests at the basis of every accurate prediction model, yet little attention has been given to the composition of these data sets. In this paper, we studied the natural distribution of TCR sequences within epitope-specific TCR repertoires, i.e. a set of TCRs binding the same epitope, and its impact on the predictability of TCR-epitope interactions. We found that the observed diversity of these repertoires can result from a smaller set of core binding motifs constrained by TCR generation. Moreover, a clear relationship was found between the sequence distribution of the training data and performance metrics, emphasizing the importance of the used ground-truth data when using machine learning models in this domain. Taken together, these findings inform data set composition to help push epitope-TCR prediction models to the next level.

8

SUMO - In Silico Sequence Assessment Using Multiple Optimization Parameters

Evers, A.; Malhotra, S.; Bolick, W.-G.; Najafian, A.; Borisovska, M.; Warszawski, S.; Fomekong Nanfack, Y.; Kuhn, D.; Rippmann, F.; Crespo, A.; Sood, V.

2022-11-22 bioinformatics 10.1101/2022.11.19.517175 medRxiv

Top 0.1%

15.7%

To select the most promising screening hits from antibody and VHH display campaigns for subsequent in-depth profiling and optimization, it is highly desirable to assess and select sequences on properties beyond only their binding signals from the sorting process. In addition, developability risk criteria, sequence diversity and the anticipated complexity for sequence optimization are relevant attributes for hit selection and optimization. Here, we describe an approach for the in silico developability assessment of antibody and VHH sequences. This method not only allows for ranking and filtering multiple sequences with regard to their predicted developability properties and diversity, but also visualizes relevant sequence and structural features of potentially problematic regions and thereby provides rationales and starting points for multi-parameter sequence optimization.

9

Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

O'Brien, H.; Salm, M.; Morton, L. T.; Szukszto, M.; O'Farrell, F.; Boulton, C.; King, L.; Bola, S. K.; Becker, P.; Craig, A.; Nielsen, M.; Samuels, Y.; Swanton, C.; Mansour, M. R.; Hadrup, S. R.; Quezada, S.

2024-05-26 bioinformatics 10.1101/2024.05.22.595296 medRxiv

Top 0.1%

15.1%

Neoantigen immunogenicity prediction is a highly challenging problem in the development of personalised medicines. Low reactivity rates in called neoantigens result in a difficult prediction scenario with limited training datasets. Here we describe Genesis, a modular protein language modelling approach to immunogenicity prediction for CD8+ reactive epitopes. Genesis comprises a pMHC encoding module trained on three pMHC prediction tasks, an optional TCR encoding module, and a set of context-specific immunogenicity prediction head modules. Compared with state-of-the-art models for each task, the Genesis encoding module performs comparably or better on pMHC binding affinity, eluted ligand prediction, and stability tasks. Genesis outperforms all compared models on pMHC immunogenicity prediction (area under the receiver operating characteristic curve: 0.619; average precision: 0.514), with a 7% increase in average precision compared to the next best model. Genesis shows further improved performance on immunogenicity prediction with the integration of TCR context information. Genesis performance is further analysed for interpretability, which locates areas of weakness found across existing immunogenicity models and highlights possible biases in public datasets.

10

MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity

Simoes, C. D. M. S.; Maidana, R. L. B. R.; De Assis, S. C.; Guerra, J. V. d. S.; Ribeiro-Filho, H. V.

2026-04-10 bioinformatics 10.64898/2026.04.07.717034 medRxiv

Top 0.1%

14.9%

The T cell receptor (TCR) recognition of multiple peptides presented by the major histocompatibility complex (MHC) is a key natural phenomenon, enabling the T cell repertoire to respond to a broad array of antigens. Despite its importance to the immune response, T cell cross-reactivity poses a major challenge for the development of novel T cell-based therapies. In this study, we present MHCXGraph, a graph-based computational approach for identifying conserved and immunologically relevant regions across multiple structures of peptides bound to MHC molecules (pMHC). Our approach provides three operational modes with user-defined parameters, allowing flexible configuration according to specific scientific needs while delivering fully interpretable results through user-friendly interfaces. We evaluated MHCXGraph across three case studies, including peptides bound to classical MHC Class I, MHC Class II, and unbound HLA alleles, demonstrating its ability to capture conserved structural determinants beyond sequence similarity. By integrating structural information with efficient graph-based analysis, MHCXGraph addresses key limitations of sequence-based methods while maintaining computational scalability. Collectively, these results indicate that MHCXGraph can be readily integrated into computational pipelines for T cell cross-reactivity discovery, especially in the context of de novo pMHC engager design and T cell-based vaccine development.

11

Benchmarking AlphaFold and related deep learning approaches for modeling antibody and TCR antigen recognition

Yin, R.; Saravanakumar, S.; Shi, S. Y.; Park, M.; Lin, V.; Lee, J.; Cheung, M.; Felbinger, N.; Kaufman, S.; Eisenberg, M.; Pierce, B.

2026-07-06 bioinformatics 10.64898/2026.07.04.736425 medRxiv

Top 0.1%

14.8%

Determining the structural basis of antigen recognition by antibodies and T cell receptors (TCRs) provides critical insights into effective immune targeting and can inform design of biotherapeutics and vaccines. Accurate computational modeling of antibodies and TCRs in complex with their targets poses a major challenge for predictive methods, including AlphaFold, which is generally accurate for modeling protein complexes but has shown limited success for immune recognition. In this study we assessed the performance of AlphaFold2, AlphaFold3, increased sampling protocols, and related deep learning methods for modeling antibody-protein, antibody-peptide, and TCR-peptide-major histocompatibility complex (pMHC) recognition. We show that increased sampling and AlphaFold3 generally improve performance relative to default sampling and AlphaFold2; however, predictive accuracy and improvement levels varied considerably among interface classes, with antibody-peptide complexes remaining challenging despite their small antigen size. Comparing per-case success across methods showed some complementarity, indicating opportunities for increased success through model pooling approaches, for instance increasing antibody-peptide near-native success from 41% to 59%. Analysis of AlphaFold confidence scores and modeling of a noncanonical complex provided further insights into predictive performance. These results highlight considerations for predictive antibody and TCR complex modeling efforts, while revealing key distinctions among protocols, scoring, and immune complex classes.

12

Revised Adaptive Immune Receptor Data in the Immune Epitope Database

Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.

2026-06-06 bioinformatics 10.64898/2026.06.03.728549 medRxiv

Top 0.1%

13.4%

The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and, where available, the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.

13

POP-UP TCR: Prediction of Previously Unseen Paired TCR-pMHC

Tickotsky, N.

2023-09-29 bioinformatics 10.1101/2023.09.28.560071 medRxiv

Top 0.1%

13.0%

Motivation: The major role of T lymphocytes (T cells) in adaptive immunity drives efforts to elucidate the mechanisms behind T-cell epitope recognition. Results: We analyzed solved structures of T-cell receptors (TCRs) and their cognate epitopes and used the data to train a set of machine learning models, POP-UP TCR, that predict the binding of any peptide to any TCR, including peptide and TCR sequences that were not included in the training set. We address biological issues that should be considered in the design of machine learning models for TCR-peptide binding and suggest that models trained only on beta chains give satisfactory predictions. Finally, we apply our models to a large data set of TCR repertoires from COVID-19 patients and find that TCRs from patients in severe/critical condition have significantly lower scores for binding SARS-CoV-2 epitopes compared to TCRs from moderate patients (p-value <0.001). Availability and Implementation: POP-UP TCR is available at: https://github.com/NiliTicko/POP-UP-TCR Contact: nilibrac@bgu.ac.il

14

UnivAIRRse: A Unified Framework for Organizing and Comparing Adaptive Immune Receptor Repertoire Simulators

Abdollahi, N.; Kaveh, S.; Shayesteh, S.; Mommahed, S.; Alemzadeh, Y.; Zarrin, R.; Chaker Hosseini Zavareh, F.; Esmaeili, P.; Hassanzadeh, R.; Kossida, S.; Eslahchi, C.

2026-02-19 bioinformatics 10.64898/2026.02.19.706510 medRxiv

Top 0.1%

12.8%

Adaptive immune receptor repertoire sequencing (AIRR-seq) enables large-scale profiling of B- and T-cell receptor diversity and has become a cornerstone of modern computational immunology. However, AIRR-seq provides only a partial and lossy molecular snapshot of immune dynamics, lacking explicit ground truth for clonal ancestry, lineage trajectories, antigen specificity, and longitudinal immune evolution. This limitation complicates benchmarking, method validation, and mechanistic interpretation of repertoire analysis pipelines. Here, we introduce UnivAIRRse, a unified hierarchical framework that organizes AIRR simulators within a shared conceptual coordinate system spanning five operational levels, from observed sequence data to the theoretical generative potential of the adaptive immune system. By explicitly distinguishing sequence-, clonal-, specificity-, repertoire-, and generative-level representations, UnivAIRRse enables systematic comparison of simulator assumptions, biological scope, abstraction level, and application focus. To our knowledge, this is the first review to formalize such a unified structure across biological, computational, and functional layers of AIRR simulation. Using this framework, we review how simulation supports benchmarking, strengthens computational inference, and enables multi-scale investigation of immune repertoire formation and evolution. We identify persistent limitations in existing simulators, including incomplete biological context, limited modularity, restricted interoperability, and overreliance on AIRR-seq as a molecular proxy for complex spatiotemporal immune processes. To operationalize this framework, we provide an interactive web-based AIRR Simulation Landscape Explorer (publicly available at https://www.imgt.org/AIRR-Simulator/) that enables dynamic filtering and comparison of simulators across biological scope, abstraction level, output fidelity, and application focus. 
Finally, we outline emerging directions toward digital-twin-ready immune simulation, emphasizing modular architectures, longitudinal multi-omic integration, uncertainty quantification, and dynamic model updating. By providing a coherent conceptual and operational coordinate system, UnivAIRRse establishes a foundation for reproducible, interpretable, and clinically actionable modeling of adaptive immune repertoires, bridging current simulation practices with the next generation of predictive and personalized immunological modeling.

15

TCRanalyzer: A user-friendly tool for comprehensive analysis of T-cell diversity, dynamics and potential antigen targets

Seifert, N.; Reinke, S.; Kurz, N. S.; Demmer, J.; Kornrumpf, K.; Beissbarth, T.; Klapper, W.; Altenbuchinger, M.

2025-05-15 bioinformatics 10.1101/2025.05.09.652820 medRxiv

Top 0.1%

12.2%

T cells are critical for immune responses, recognizing antigens via their unique T-cell receptors (TCRs). Analyzing the diverse TCR repertoires, especially the hypervariable CDR3 region, is essential for understanding immune function in health and disease. Current TCR analysis tools often require specialized expertise or computational resources, or sacrifice biological information for efficiency. To address these limitations, we developed TCRanalyzer, a fast and comprehensive TCR analysis pipeline within a user-friendly graphical interface. TCRanalyzer covers all steps from data loading, aggregation and optional sequence clustering, to the analysis of TCR diversity metrics, clonal expansion and antigen specificity. Applied to datasets from patients with either benign or malignant tumors, TCRanalyzer identified changes in TCR clonality, clonal expansion and shifts in antigen specificity across different cohorts or following immunotherapy, thereby demonstrating its potential to dissect critical immunological processes. TCRanalyzer provides a robust and user-friendly tool for TCR sequence analysis, enhancing research in immunology and related fields. Availability: TCRanalyzer is available at https://hub.docker.com/r/tcranalyzer/application. Contact: nicole.seifert@bioinf.med.uni-goettingen.de

16

Performance comparison of TCR-pMHC prediction tools reveals a strong data dependency

Deng, L.; Ly, C.; Abdollahi, S.; Zhao, Y.; Prinz, I.; Bonn, S.

2022-11-24 bioinformatics 10.1101/2022.11.24.517666 medRxiv

Top 0.1%

12.0%

The interaction of T-cell receptors with peptide-major histocompatibility complex molecules plays a crucial role in adaptive immune responses. Currently there are various models aiming at predicting TCR-pMHC binding, while a standard dataset and procedure to compare the performance of these approaches is still missing. In this work we provide a general method for data collection, preprocessing, splitting and generation of negative examples, as well as comprehensive datasets to compare TCR-pMHC prediction models. We collected, harmonized, and merged all the major publicly available TCR-pMHC binding data and compared the performance of five state-of-the-art deep learning models (TITAN, NetTCR, ERGO, DLpTCR and ImRex) using this data. Our performance evaluation focuses on two scenarios: 1) different splitting methods for generating training and testing data to assess model generalization and 2) different data versions that vary in size and peptide imbalance to assess model robustness. Our results indicate that the five contemporary models do not generalize to peptides that have not been in the training set. We can also show that model performance is strongly dependent on the data balance and size, which indicates a relatively low model robustness. These results suggest that TCR-pMHC binding prediction remains highly challenging and requires further high quality data and novel algorithmic approaches.

17

Self-Contemplating In-Context Learning Enhances T Cell Receptor Generation for Novel Epitopes

Zhang, P.; Bang, S.; Lee, H.

2025-01-28 bioinformatics 10.1101/2025.01.27.634873 medRxiv

Top 0.1%

11.9%

Computational design of T cell receptors (TCRs) that bind to epitopes holds the potential to revolutionize targeted immunotherapy. However, computational design of TCRs for novel epitopes is challenging due to the scarcity of training data and the absence of known cognate TCRs for novel epitopes. In this study, we aim to generate high-quality cognate TCRs, particularly for novel epitopes with no known cognate TCRs, a problem that remains under-explored in the field. We propose to incorporate in-context learning, successfully used with large language models to perform new generative tasks, into the task of TCR generation for novel epitopes. By providing cognate TCRs as additional context, we enhance the model's ability to generate high-quality TCRs for novel epitopes. We first unlock the power of in-context learning by training a model to generate new TCRs based on both a target epitope and a small set of its cognate TCRs, so-called in-context training (ICT). Because novel epitopes lack known binding TCRs, we then have the model generate its own TCR contexts from the target epitope and use them as an inference prompt, referred to as self-contemplation prompting (SCP). Our experiments first demonstrate that aligning the training and inference distributions through ICT is critical for effectively leveraging context TCRs. Subsequently, we show that providing context TCRs significantly improves TCR generation for novel epitopes. Furthermore, we show that TCR generation using SCP-synthesized context TCRs achieves performance comparable to, and sometimes surpassing, that obtained with ground-truth context TCRs, especially when combined with refined prompt selection based on binding affinity and authenticity metrics.

18

IMMREP25: Unseen Peptides

Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.

2026-04-01 bioinformatics 10.64898/2026.03.30.715276 medRxiv

Top 0.1%

11.8%

T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed that multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that, especially for unseen peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
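
The abstract above reports AUC_0.1 without defining it. One common reading is the area under the ROC curve restricted to false positive rates up to 0.1, rescaled so that random guessing scores 0.5 and a perfect ranking scores 1.0 (the McClish standardization). The sketch below implements that reading in plain Python. Whether IMMREP25 uses exactly this definition is an assumption, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC curve points (fpr, tpr), grouping tied scores together."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Consume every example that shares the current score.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def standardized_partial_auc(labels, scores, max_fpr=0.1):
    """Partial ROC AUC on [0, max_fpr], rescaled to the range 0.5 to 1.0."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the curve linearly at the FPR cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    chance_area = 0.5 * max_fpr ** 2  # area under the diagonal
    perfect_area = max_fpr            # area under a perfect curve
    return 0.5 * (1.0 + (area - chance_area) / (perfect_area - chance_area))


# Toy data: a perfect ranking and an uninformative (all-tied) ranking.
perfect = standardized_partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = standardized_partial_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

Under this definition the reported macro-AUC_0.1 of 0.60 would sit a fifth of the way from chance to perfect within the low false-positive region, with the macro average presumably taken per peptide.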

19

A Macro-scale Comparison Algorithm for Analysis of TCR Repertoire Completeness

Esponda, F.; Sulc, P.; Blattman, J.; Forrest, S.

2021-03-08 bioinformatics 10.1101/2021.03.07.434284 medRxiv

Top 0.1%

10.8%

Recent advances in biotechnology are beginning to generate whole immunome datasets, which will enable the comparison of immune repertoires between individuals, e.g., to assess immunocompetence. Existing algorithms cluster cell types based on the relative expression abundance of about 20,000 genes, but such algorithms have limited utility when comparing immunome datasets with many more orders of magnitude of variation (>10^12), such as occurs in immunoreceptor sequences in highly polyclonal naive repertoires. In this paper we present a method for comparing immune repertoires by identifying macro-level features that are conserved between similar individuals. Our method allows us to detect some blind spots in naive populations and to assess whether a repertoire is likely complete by examining only a sample of its sequences. Author Summary: In this paper we present a method for comparing the immune repertoires of different individuals. Repertoires are represented by a sample of genetic sequences. Our technique coarse-grains each individual's data into groups, matches groups between individuals, and finds significant differences.

20

An improved deep learning model for immunogenic B epitope prediction

Sajeed, R.; Pradhan, S.; Srinivasan, R.; Rana, S.

2025-08-01 bioinformatics 10.1101/2025.07.29.667398 medRxiv

Top 0.1%

10.0%

The recognition of B epitopes by B cells of the immune system initiates an immune response that leads to the production of antibodies to combat bacterial and viral infections. Computational methods for predicting the epitopes on antigens have shown promising results in the development of subunit vaccines and therapeutics. Recently, the use of protein language models (pLMs) for epitope prediction has led to a substantial increase in prediction accuracy. However, further improvements in precision are necessary for practical applications. Here, we develop and evaluate a series of models using different combinations of features and feature fusion techniques on a curated independent test set. Our results show that the models that use protein embeddings along with structural features are better at predicting both linear and conformational B epitopes when compared to a baseline model that uses only protein embeddings as features. Additionally, we show that the embeddings of ESM-2, an evolutionary scale model, likely capture T-B reciprocity.