ImmunoInformatics
○ Elsevier BV
All preprints, ranked by how well they match ImmunoInformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Park, M.; Seo, S.-w.; Park, E.; Kim, J.
Motivation: Epitopes are the immunogenic regions of an antigen that are recognized by antibodies in a highly specific manner to trigger an immune response. Predicting such regions is extremely difficult, yet it has profound implications for understanding the complex mechanisms of humoral immunogenicity. Results: Here, we present a BERT-based epitope prediction model called EpiBERTope, pre-trained on the Swiss-Prot protein database, which can predict both linear and structural epitopes using protein sequences only. The model achieves AUCs of 0.922 and 0.667 on the linear and structural epitope datasets, respectively, outperforming all benchmark classification models, including random forest, gradient boosting, naive Bayes, and support vector machine models. In conclusion, EpiBERTope is a sequence-based model that captures content-based global interactions of antigen sequences, which will be transformative in epitope discovery with high specificity. Contact: minjun.park@standigm.com
Ko, S.; Li, H.; Kim, H.; Shin, W.-H.; Ko, J.; Choi, Y.
Background: Interactions between peptide and MHC class II (pMHC-II) are crucial for T-cell recognition and immune responses, as MHC-II molecules present peptide fragments to T cells, enabling the distinction between self and non-self antigens. Accurately predicting the pMHC-II binding core is particularly important because it provides insights into pMHC-II interactions and T-cell receptor engagement. Given the high polymorphism and peptide-binding promiscuity of MHC-II molecules, computational prediction methods are essential for understanding pMHC-II interactions. While sequence-based methods are widely used, recent advances in AlphaFold-based structure prediction have opened new possibilities for improving pMHC-II binding core predictions. Results: We benchmarked four recent pMHC-II prediction methods with a focus on binding core prediction: two sequence-based methods, NetMHCIIpan and DeepMHCII, and two AlphaFold-based structure prediction methods, AlphaFold2 fine-tuned for peptide interactions (AF2-FT) and AlphaFold3 (AF3). The AlphaFold-based methods showed strong performance in predicting positive binders, with AF3 achieving the highest positive recall (0.86) and AF2-FT performing similarly (0.81). However, both methods frequently misclassified unbound peptides as binders. NetMHCIIpan excelled at identifying non-binders, achieving the highest negative recall (0.93), but had lower positive recall (0.44). In contrast, DeepMHCII demonstrated moderate performance without any notable strength. Consensus approaches combining AlphaFold-based methods for binder identification with filtering using NetMHCIIpan improved overall prediction precision (0.94 and 0.87 for known and unknown binding status, respectively). Conclusions: This study highlights the complementary strengths of AlphaFold-based and sequence-based methods for predicting pMHC-II binding core regions.
AlphaFold-based methods excel in predicting positive binders, while NetMHCIIpan is highly effective at identifying non-binders. Future research should focus on improving the prediction of unbound peptides for AlphaFold-based models. Since NetMHCIIpan's binding core predictive ability is already high, future efforts should concentrate on enhancing its binding prediction to further improve overall accuracy.
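The consensus idea in the abstract above (accept binders proposed by an AlphaFold-based method, then filter them with NetMHCIIpan) can be illustrated with a minimal Python sketch. The `Call` record, its field names, and the toy data are hypothetical and chosen only for illustration; the abstract does not specify how the two outputs are thresholded or merged.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-peptide record; field names are illustrative only."""
    peptide: str
    af_binder: bool          # assumed: AlphaFold-based method calls a binder
    netmhc_nonbinder: bool   # assumed: NetMHCIIpan flags a non-binder


def consensus_binder(call: Call) -> bool:
    # Keep an AlphaFold-based positive only if NetMHCIIpan does not veto it.
    return call.af_binder and not call.netmhc_nonbinder


def precision(predicted: list[bool], truth: list[bool]) -> float:
    # Precision = true positives / all predicted positives.
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    return tp / (tp + fp) if (tp + fp) else float("nan")


# Toy data, not taken from the study.
calls = [
    Call("PEPTIDE_A", af_binder=True, netmhc_nonbinder=False),
    Call("PEPTIDE_B", af_binder=True, netmhc_nonbinder=True),
    Call("PEPTIDE_C", af_binder=False, netmhc_nonbinder=True),
]
truth = [True, False, False]
af_only = [c.af_binder for c in calls]
combined = [consensus_binder(c) for c in calls]
```

On this toy input the filter removes the one false positive that the AlphaFold-style call lets through, which is the mechanism by which such a consensus can raise precision.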
Dorey-Robinson, D. L. W.; Maccar, G.; Borne, R.; Hammond, J.
The advent and continual improvement of high-throughput sequencing technologies have made immunoglobulin repertoire sequencing accessible and informative regardless of study species. However, to fully map changes in polyclonal dynamics, precise annotation of these constantly rearranging genes is pivotal. For this reason, data-agnostic tools able to learn from presented data are required. Most sequence annotation tools are designed primarily for use with human and mouse antibody sequences; they use databases with fixed species lists and apply very specific assumptions that select against unique structural characteristics. We present IgMAT, which utilises a reduced amino acid alphabet, incorporates multiple HMM alignments into a single consensus, and enables the incorporation of user-defined databases to better represent the species of interest. Availability and implementation: IgMAT has been developed as a Python module and is available on GitHub (https://github.com/TPI-Immunogenetics/igmat) for download under the GPLv3 license. Supplementary information: Model Breakdowns
Pham, M.-D. N.; Su, C. T.-T.; Nguyen, T.-N.; Nguyen, H.-N.; Nguyen, D. D. A.; Giang, H.; Nguyen, D.-T.; Phan, M.-D.; Nguyen, V.
Motivation: Antigen recognition by T-cell receptors (TCRs) triggers cascades of immune responses. Successful prediction of TCR and antigen (as peptide) binding would therefore be a significant advance for immunotherapy. However, most current TCR-peptide interaction predictors fail to predict unseen data. This limitation may derive from the conventional use of TCR and/or peptide sequences as input, which may not adequately reflect their structural characteristics. Therefore, incorporating TCR and peptide structural information into the prediction model is necessary to improve generalizability. Results: We present epiTCR-KDA, a new predictor of TCR-peptide binding that utilises structural information, specifically the dihedral angles between the residues of both the peptide and the TCR. This structural descriptor was integrated into a model constructed using knowledge distillation to enhance its generalizability. epiTCR-KDA demonstrated competitive prediction performance, with an AUC of 0.99 for seen data and an AUC of 0.86 for unseen data. Across multiple public datasets, epiTCR-KDA consistently outperformed other predictors, such as epiTCR, NetTCR, BERTrand, TEIM-Seq, TEINet, and ImRex, maintaining a median AUC of 0.9 (ranging from 0.82 to 0.91). Further analysis of epiTCR-KDA performance indicated that the cosine similarity of the dihedral angle vectors between the unseen testing data and the training data is crucial for its stable performance. In conclusion, our epiTCR-KDA model, with its capacity to predict for unseen data, has brought us one step closer to the development of a highly effective pipeline for affordable antigen-based immunotherapy. Availability and implementation: epiTCR-KDA is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR-KDA)
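The role of cosine similarity between test and training dihedral-angle vectors, mentioned in the abstract above, can be made concrete with a short Python sketch. How epiTCR-KDA actually encodes dihedral angles is not stated in the abstract, so the fixed-length numeric vectors and the helper names below are assumptions made purely for illustration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def max_similarity_to_training(test_vec: list[float],
                               train_vecs: list[list[float]]) -> float:
    # Hypothetical helper: how close is an unseen sample to its nearest
    # training sample in the (assumed) dihedral-angle feature space?
    return max(cosine_similarity(test_vec, t) for t in train_vecs)


# Toy vectors standing in for dihedral-angle descriptors.
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unseen = [2.0, 0.0, 0.0]
closest = max_similarity_to_training(unseen, train)
```

A low maximum similarity would flag a test sample that is structurally far from everything in training, which is the situation the abstract associates with less stable performance.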
Alvarez, B.; Nielsen, M.
In this work, we present a deep learning-based approach for motif discovery in the Major Histocompatibility Complex (MHC) system, which plays a key role in the immune response. We explore the use of convolutional neural networks (CNNs) to identify peptide binding motifs for different MHC alleles. By training models on data from specific MHC Class I and II molecules, we demonstrate how 1-dimensional convolutional filters can effectively capture motifs and binding preferences. Our study introduces a method for extracting motif logos directly from the trained models, providing insights into how internal neural network representations align with known biological motifs. The results show significant alignment with experimental binding motifs, underscoring the utility of deep learning in immunological research and the potential for improving vaccine design and immunotherapy.
Abdollahi, N.; Septenville, A. L. d.; Davi, F.; Bernardes, J. S.
Motivation: The adaptive B-cell response is driven by the expansion, somatic hypermutation, and selection of B-cell clones. Their number, size and sequence diversity are essential characteristics of B-cell populations. Identifying clones in B-cell populations is central to several repertoire studies such as statistical analysis, repertoire comparisons, and clonal tracking. Several clonal grouping methods have been developed to group sequences from B-cell immune repertoires. Such methods have been principally evaluated on simulated benchmarks since experimental data containing clonally related sequences can be difficult to obtain. However, experimental data might contain multiple sources of sequence variability that hamper their artificial reproduction. Therefore, the generation of high-precision ground truth data that preserves real repertoire distributions is necessary to accurately evaluate clonal grouping methods. Results: We propose a novel methodology to generate ground truth data sets from real repertoires. Our procedure requires V(D)J annotations to obtain the initial clones, and iteratively applies an optimisation step that moves sequences among clones to increase their cohesion and separation. We first show that our method accurately identifies clonally related sequences in simulated repertoires with higher mutation rates. Next, by comparing the performance of a widely used clonal grouping algorithm on several generated benchmarks, we demonstrate that real benchmarks generated by our method constitute a challenge for clonal grouping methods. Our method can be used to generate a large number of benchmarks and contribute to the construction of more accurate clonal grouping tools. Availability and implementation: The source code and generated data sets are freely available at github.com/NikaAb/BCR_GTG
Deng, L.; Ly, C.; Abdollahi, S.; Zhao, Y.; Prinz, I.; Bonn, S.
The interaction of T-cell receptors with peptide-major histocompatibility complex molecules plays a crucial role in adaptive immune responses. Currently there are various models aiming at predicting TCR-pMHC binding, while a standard dataset and procedure to compare the performance of these approaches is still missing. In this work we provide a general method for data collection, preprocessing, splitting and generation of negative examples, as well as comprehensive datasets to compare TCR-pMHC prediction models. We collected, harmonized, and merged all the major publicly available TCR-pMHC binding data and compared the performance of five state-of-the-art deep learning models (TITAN, NetTCR, ERGO, DLpTCR and ImRex) using this data. Our performance evaluation focuses on two scenarios: 1) different splitting methods for generating training and testing data to assess model generalization and 2) different data versions that vary in size and peptide imbalance to assess model robustness. Our results indicate that the five contemporary models do not generalize to peptides that have not been in the training set. We can also show that model performance is strongly dependent on the data balance and size, which indicates a relatively low model robustness. These results suggest that TCR-pMHC binding prediction remains highly challenging and requires further high quality data and novel algorithmic approaches.
Ye, C.; Hu, W.; Gaeta, B.
DNA sequencing technologies are providing new insights into the immune response by allowing the large-scale sequencing of the rearranged immunoglobulin genes present in an individual. However, the applications of this approach are limited by the lack of methods for determining the antigen(s) that an immunoglobulin encoded by a given sequence binds to. Computational methods for predicting antibody-antigen interactions that leverage structure prediction and docking have been proposed, but these methods require knowledge of the 3D structures. As a step towards the development of a machine learning method suitable for predicting antibody-antigen binding affinities from sequence data, a weighted nearest neighbor machine learning approach was applied to the problem. A prediction program was coded in Python and evaluated using cross-validation on a dataset of 600 antibodies interacting with 50 antigens. The classification accuracy was around 76% for this dataset. These results provide a useful frame of reference as well as protocols and considerations for machine learning and dataset creation in this area. Both the dataset (in CSV format) and the machine learning program (coded in Python) are freely available for download.
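To illustrate what a weighted nearest neighbor classifier over sequences can look like, here is a minimal Python sketch. The abstract above does not describe the distance measure or weighting scheme used, so the k-mer Jaccard similarity, the similarity-weighted vote, and all names and toy sequences below are assumptions rather than the authors' method.

```python
from collections import defaultdict


def kmers(seq: str, k: int = 3) -> set[str]:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    # Jaccard similarity of k-mer sets; an assumed stand-in for whatever
    # sequence similarity a real predictor would use.
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def predict(query: str, training: list[tuple[str, str]],
            n_neighbors: int = 3) -> str:
    """Return the antigen label with the largest similarity-weighted vote."""
    if not training:
        raise ValueError("training set must not be empty")
    scored = sorted(
        ((similarity(query, seq), label) for seq, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:n_neighbors]
    votes: dict[str, float] = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim  # closer neighbors contribute more
    return max(votes, key=votes.get)


# Toy (sequence, antigen) pairs; not real antibody data.
training_set = [
    ("ACDEFGH", "antigen_1"),
    ("ACDEFGK", "antigen_1"),
    ("WWYYPPQ", "antigen_2"),
]
label = predict("ACDEFGA", training_set, n_neighbors=2)
```

Weighting votes by similarity, rather than counting neighbors equally, lets a single very close match outweigh several distant ones.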
Azad, I. u. h.; Sohail, M. S.; Quadeer, A. A.
Motivation: Interferon-gamma (IFN-γ) is a pivotal cytokine that coordinates various aspects of the immune response, notably enhancing T-cell activation, clearing intracellular pathogens, and providing long-term immune protection. Identification of IFN-γ-inducing peptides is essential for the advancement of peptide-based vaccines and immunotherapies; however, the experimental determination of these peptides is hampered by the large number of potential peptide candidates present in pathogen proteins. Results: In this study, we present IFNBoost, a machine learning model developed to accurately predict IFN-γ-inducing peptides by leveraging existing immunological datasets, including both peptide sequences and associated metadata. IFNBoost demonstrates impressive performance metrics, achieving an accuracy of 0.819, an F1 score of 0.798, and a Matthews correlation coefficient (MCC) of 0.634. Evaluation against independent datasets demonstrates that IFNBoost surpasses all current models for predicting IFN-γ-inducing peptides, highlighting the generalizability of the model. Our comprehensive analysis indicates that, in addition to peptide sequences, metadata features such as the source organism and host significantly enhance predictive accuracy. The predictions produced by IFNBoost have the potential to guide rational vaccine design, thereby improving vaccine efficacy via precise identification of peptides that elicit the desired cytokine responses. Availability and implementation: To improve the accessibility and utility of our model, we have developed a web application available at https://ifnboost.streamlit.app/.
Van Deuren, V. M. L.; Valkiers, S.; Laukens, K.; Meysman, P.
Motivation: The acquisition of T-cell receptor (TCR) repertoire sequence data has become faster and cheaper due to advancements in high-throughput sequencing. However, fully exploiting the diagnostic and clinical potential within these TCR repertoires requires a thorough understanding of the inherent repertoire structure. Hence, visualizing the full space of TCR sequences could be a key step towards enabling exploratory analysis of TCR repertoires, driving their enhanced interrogation. Nonetheless, current methods remain limited to rough profiling of TCR V and J gene distributions. Addressing this need, we developed RapTCR, a tool for rapid visualization and post-analysis of TCR repertoires. Approach: To overcome computational complexity, RapTCR introduces a novel, simple embedding strategy that represents TCR amino acid sequences as short vectors while retaining their pairwise alignment similarity. RapTCR then applies efficient algorithms for indexing these vectors and constructing their nearest neighbor network. It provides multiple visualization options to map and interactively explore a TCR network as a two-dimensional representation. Benchmarking analyses using epitope-annotated datasets demonstrate that these RapTCR visualizations capture TCR similarity features on a global level (e.g., J gene) and locally (e.g., epitope reactivity). RapTCR is available as a Python package, implementing the intuitive scikit-learn syntax to easily generate insightful, publication-ready figures for TCR repertoires of any size. Availability and implementation: RapTCR was written in Python 3. It is available as an Anaconda package (https://anaconda.org/vincentvandeuren/raptcr) and on GitHub (https://github.com/vincentvandeuren/RapTCR). Documentation and example notebooks are available at vincentvandeuren.github.io/rapTCR_docs/. Contact: pieter.meysman@uantwerpen.be
Dudzic, P.; Janusz, B.; Satlawa, T.; Chomicz, D.; Gawlowski, T.; Grabowski, R.; Jozwiak, P.; Tarkowski, M.; Mycielski, M.; Wrobel, S.; Krawczyk, K.
Antibodies are a cornerstone of the immune system, playing a pivotal role in identifying and neutralizing infections caused by bacteria, viruses, and other pathogens. Understanding their structure and function can provide insights into both the body's natural defenses and the principles behind many therapeutic interventions, including vaccines and antibody-based drugs. The analysis and annotation of antibody sequences, including the identification of variable, diversity, joining, and constant genes, as well as the delineation of framework regions and complementarity-determining regions, is essential for understanding their structure and function. Analyzing large volumes of antibody sequences is now routine in antibody discovery and requires fast and accurate tools. While there are existing tools designed for the annotation and numbering of antibody sequences, they often have limitations such as being restricted to either nucleotide or amino acid sequences, reliance on non-uniform germline databases, or slow execution times. Here we present Rapid Immunoglobulin Overview Tool (RIOT), a novel open-source solution for antibody numbering that addresses these shortcomings. RIOT handles nucleotide and amino acid sequence processing, comes with a free germline database, and is computationally efficient. We hope the tool will facilitate rapid annotation of antibody sequencing outputs for the benefit of understanding antibody biology and discovering novel therapeutics. Availability: RIOT is available at https://github.com/NaturalAntibody/riot_na.
Diallo, A. B.; Cavazzoni, C. B.; Sun, J. E.; Sage, P. T.
Motivation: T follicular regulatory (Tfr) cells are a specialized cell subset that controls humoral immunity. Despite a number of individual transcriptomic studies on these cells, core functional pathways have been difficult to uncover due to the substantial transcriptional overlap of these cells with other effector cell types, as well as transcriptional changes occurring due to disease settings. Developing a core transcriptional module for Tfr cells that integrates multiple cell type comparisons as well as diverse disease settings will allow a more accurate prediction of functional pathways. Researchers studying allergic reactions, immune responses to vaccines, autoimmunity and cancer could use this gene set to better understand the roles of Tfr cells in controlling disease progression. Additional cell types beyond Tfr cells that have similar features of transcriptomic complexity within diverse disease settings may also be studied using similar approaches. High-throughput sequencing technologies allow the generation of large datasets that require specific tools to best interpret the data. The development of a core transcriptional module for Tfr cells will allow investigators to determine if Tfr cells may have functional roles within their biological systems with little knowledge of Tfr biology. With this work, we have addressed the need for core gene modules to define specific subsets of immune cells. Results: We introduce an integrated "core Tfr cell gene module" that can be incorporated into GSEA analysis using various input sizes. The integrated core Tfr gene module was built using transcriptomic studies in Tfr cells from several different tissues, disease settings, and cell type comparisons. Random forest was used to integrate the transcriptomic studies to generate the core gene module. A GSEA gene set was formulated from the integrated core Tfr gene module for incorporation into end-user friendly GSEA.
The gene sets are presented along with random genes taken from the GTEx data set and are provided as GMT files. The user can upload the gene set to the GSEA website or any gene set tool which takes GMT files. We also present the full results of the model including p-values calculated by random forest. This provides users with more flexibility in choosing a p-value cutoff that is most appropriate for the experimental setting. Availability: The core Tfr gene sets are freely available at: https://github.com/alosdiallo/TFR_Model. We have also included all of the code and data used in developing these gene sets. The code and results are released under an MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
Vardaxis, I.; Simovski, B.; Anzar, I.; Stratford, R.; Clancy, T.
Background: The accurate computational prediction of B cell epitopes can vastly reduce the cost and time required for identifying potential epitope candidates for the design of vaccines and immunodiagnostics. However, current computational tools for B cell epitope prediction perform poorly and are not fit for purpose, leaving enormous room for improvement and a need for superior prediction strategies. Results: Here we propose a novel approach that improves B cell epitope prediction by encoding epitopes as binary molecular permutation vectors that represent the position and structural properties of the amino acids within a protein antigen sequence that interact with an antibody, rather than the traditional approach of defining epitopes as scores per amino acid on a protein sequence that pertain to their probability of partaking in a B cell epitope antibody interaction. In addition to defining epitopes as binary molecular permutation vectors, the approach also uses the 3D macrostructure features of the unbound 3D protein structures, and in turn uses these features to train another deep learning model on the corresponding antibody-bound protein 3D structures. We demonstrate that the strategy predicts B cell epitopes with improved accuracy compared to the existing tools. Additionally, we demonstrate that this approach reliably identifies the majority of experimentally verified epitopes on the spike protein of SARS-CoV-2 not seen by the model in training and generalizes in a very robust manner to dissimilar data not seen by the model in training. Conclusions: With the approach described herein, a primary protein sequence with the query molecular permutation vector alone is required to predict B cell epitopes in a reliable manner, potentially advancing the use of computational prediction of B cell epitopes in biomedical research applications.
Zhuang, N. Z.; Howells, J. M.
We describe a computational pipeline for stratifying autoimmune patient groups using exclusively binary autoantibody data. Our method addresses a methodological gap in computational immunology by providing a standardized framework for analyzing categorical serological data commonly found in electronic health records and resource-limited settings. The pipeline integrates three complementary analytical modules: Module 1, exploratory screening using statistical association tests; Module 2, quantification of overall immunological similarity and uncertainty; and Module 3, prediction modeling and validation against chance. We demonstrate the method's utility by applying it to two closely linked autoimmune disorders, for which we successfully recapitulated established clinical relationships. The pipeline is implemented in Python and includes detailed configuration options for custom disease groups, autoantibody panels and stratification variables. This method enables researchers to extract meaningful immunological patterns from underutilized binary clinical data, serving as a hypothesis-generation tool to help drive impactful exploration.
Olsen, T. H.; Abanades, B.; Moal, I. H.; Deane, C. M.
Antibodies with similar amino acid sequences, especially across their complementarity-determining regions, often share properties. Finding that an antibody of interest has a similar sequence to naturally expressed antibodies in healthy or diseased repertoires is a powerful approach for the prediction of antibody properties, such as immunogenicity or antigen specificity. However, as the number of available antibody sequences is now in the billions and continuing to grow, repertoire mining for similar sequences has become increasingly computationally expensive. Existing approaches are limited by being low-throughput, non-exhaustive, not antibody-specific, or only able to search against entire chain sequences. Therefore, there is a need for a specialized tool, optimized for a rapid and exhaustive search of any antibody region against all known antibodies, to better utilize the full breadth of available repertoire sequences. We introduce Known Antibody Search (KA-Search), a tool that allows for the rapid search of billions of antibody sequences by sequence identity across either the whole chain, the complementarity-determining regions, or a user-defined antibody region. We show KA-Search in operation on the ~2.4 billion antibody sequences available in the OAS database. KA-Search can be used to find the most similar sequences from OAS within 30 minutes using 5 CPUs. We give examples of how KA-Search can be used to obtain new insights about an antibody of interest. KA-Search is freely available at https://github.com/oxpig/kasearch.
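Searching by sequence identity over the whole chain or over a chosen region, as described in the abstract above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sequences are taken to be pre-aligned to a common length with `-` as the gap character, the region is given as a list of column indices, and the function names are hypothetical. It is not the KA-Search API and makes no attempt at its speed.

```python
from typing import Optional, Sequence


def sequence_identity(a: str, b: str,
                      positions: Optional[Sequence[int]] = None) -> float:
    """Fraction of identical residues over the compared alignment columns.

    Columns where both sequences have a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = range(len(a)) if positions is None else positions
    compared = 0
    matches = 0
    for i in columns:
        if a[i] == "-" and b[i] == "-":
            continue  # shared gap carries no information
        compared += 1
        if a[i] == b[i]:
            matches += 1
    return matches / compared if compared else 0.0


def top_hits(query: str, database: list[str], n: int = 5,
             positions: Optional[Sequence[int]] = None) -> list[str]:
    # Exhaustive ranking of every database entry by identity to the query.
    ranked = sorted(
        database,
        key=lambda seq: sequence_identity(query, seq, positions),
        reverse=True,
    )
    return ranked[:n]


# Toy aligned sequences; columns 0 and 1 play the role of a region of interest.
db = ["AC-DF", "GG-DE", "AC-DE"]
best_whole = top_hits("AC-DE", db, n=1)
region_identity = sequence_identity("AC-DE", "AC-DF", positions=[0, 1])
```

Restricting `positions` to a region changes the ranking criterion without changing the search itself, which is the flexibility the abstract highlights.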
Seifert, N.; Reinke, S.; Kurz, N. S.; Demmer, J.; Kornrumpf, K.; Beissbarth, T.; Klapper, W.; Altenbuchinger, M.
T cells are critical for immune responses, recognizing antigens via their unique T-cell receptors (TCRs). Analyzing the diverse TCR repertoires, especially the hypervariable CDR3 region, is essential for understanding immune function in health and disease. Current TCR analysis tools often require specialized expertise or computational resources, or sacrifice biological information for efficiency. To address these limitations, we developed TCRanalyzer, a fast and comprehensive TCR analysis pipeline within a user-friendly graphical interface. TCRanalyzer covers all steps from data loading, aggregation and optional sequence clustering, to the analysis of TCR diversity metrics, clonal expansion and antigen specificity. Applied to datasets from patients with either benign or malignant tumors, TCRanalyzer identified changes in TCR clonality, clonal expansion and shifts in antigen specificity across different cohorts or following immunotherapy, thereby demonstrating its potential to dissect critical immunological processes. TCRanalyzer provides a robust and user-friendly tool for TCR sequence analysis, enhancing research in immunology and related fields. Availability: TCRanalyzer is available at https://hub.docker.com/r/tcranalyzer/application. Contact: nicole.seifert@bioinf.med.uni-goettingen.de
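As an illustration of the kind of diversity and clonality metrics mentioned in the abstract above, the following Python sketch computes Shannon entropy and a normalized clonality score from CDR3 clonotype counts. The abstract does not say which metrics TCRanalyzer implements, so these commonly used definitions, the function names, and the toy repertoire are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy (natural log) of clonotype frequencies."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def clonality(counts: list[int]) -> float:
    """1 - normalized entropy: 0 for an even repertoire, 1 for a single clone."""
    n_clonotypes = sum(1 for c in counts if c > 0)
    if n_clonotypes <= 1:
        return 1.0  # convention chosen here for a single (or no) clonotype
    return 1.0 - shannon_entropy(counts) / math.log(n_clonotypes)


# Toy repertoire: CDR3 amino acid sequences, one entry per read.
reads = ["CASSLGTDTQYF"] * 8 + ["CASSPGQGYEQYF"] + ["CASRDRGNTIYF"]
clone_counts = list(Counter(reads).values())
score = clonality(clone_counts)
```

A rising clonality score between two samples from the same patient would indicate that a few clonotypes have expanded relative to the rest of the repertoire.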
An, H.; Park, J.
Currently, more than 33 million people have been infected by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and more than a million people have died from coronavirus disease 2019 (COVID-19), the disease caused by the virus. There have been multiple reports of autoimmune and inflammatory diseases following SARS-CoV-2 infections. Several mechanisms have been suggested for the development of autoimmune diseases, including cross-reactivity (molecular mimicry). A typical workflow for discovering cross-reactive epitopes (mimotopes) starts with a sequence similarity search between human and pathogen protein sequences. However, sequence similarity information alone is not enough to predict cross-reactivity between proteins, since proteins can share highly similar conformational epitopes whose amino acid residues are situated far apart in the linear protein sequences. Therefore, we used a hidden Markov model-based tool to identify distant viral homologs of human proteins. We also utilized experimentally determined and modeled protein structures of SARS-CoV-2 and human proteins to find homologous protein structures between them. Next, we predicted the binding affinity (IC50) of potentially cross-reactive T-cell epitopes to 34 MHC allelic variants that have been associated with autoimmune diseases using multiple prediction algorithms. Overall, from 8,138 SARS-CoV-2 genomes, we identified 3,238 potentially cross-reactive B-cell epitopes covering six human proteins and 1,224 potentially cross-reactive T-cell epitopes covering 285 human proteins. To visualize the predicted cross-reactive T-cell and B-cell epitopes, we developed a web-based application "Molecular Mimicry Map (3M) of SARS-CoV-2" (available at https://ahs2202.github.io/3M/).
The web application enables researchers to explore potential cross-reactive SARS-CoV-2 epitopes alongside custom peptide vaccines, allowing them to identify potentially suboptimal peptide vaccine candidates or less ideal parts of a whole-virus vaccine, and so to design safer vaccines for people with genetic and environmental predispositions to autoimmune diseases. Together, the computational resources and the interactive web application provide a foundation for the investigation of molecular mimicry in the pathogenesis of autoimmune disease following COVID-19.
Esponda, F.; Sulc, P.; Blattman, J.; Forrest, S.
Recent advances in biotechnology are beginning to generate whole immunome datasets, which will enable the comparison of immune repertoires between individuals, e.g., to assess immunocompetence. Existing algorithms cluster cell types based on the relative expression abundance of about 20 000 genes, but such algorithms have limited utility when comparing immunome datasets with many more orders of magnitude (>10^12) of variation, such as occurs in immunoreceptor sequences in highly polyclonal naive repertoires. In this paper we present a method for comparing immune repertoires by identifying macro-level features that are conserved between similar individuals. Our method allows us to detect some blind spots in naive populations and to assess whether a repertoire is likely complete by examining only a sample of its sequences. Author Summary: In this paper we present a method for comparing the immune repertoires of different individuals. Repertoires are represented by a sample of genetic sequences. Our technique coarse-grains each individual's data into groups, matches groups between individuals and finds significant differences.
Galanis, K. A.; Nastou, K. C.; Papandreou, N. C.; Petichakis, G. N.; Iconomidou, V. A.
Linear B-cell epitope prediction research has received steadily growing interest ever since the first method was developed in 1981. B-cell epitope identification with the help of an accurate prediction method can lead to an overall faster and cheaper vaccine design process, a crucial necessity in the COVID-19 era. Consequently, several B-cell epitope prediction methods have been developed over the past few decades, but without significant success. In this study, we review the current performance and methodology of some of the most widely used linear B-cell epitope predictors which are available via a command-line interface, namely BcePred, BepiPred, ABCpred, COBEpro, SVMTriP, LBtope, and LBEEP. Additionally, we attempted to remedy performance issues of the individual methods by developing a consensus classifier, which combines the separate predictions of these methods into a single output, accelerating epitope-based vaccine design. While the method comparison was performed with some necessary caveats and individual methods might perform much better for specialized datasets, we hope that this update on performance can help researchers choose a predictor for the development of biomedical applications such as designed vaccines, diagnostic kits, immunotherapeutics, immunodiagnostic tests, antibody production, and disease diagnosis and therapy.
Daouda, T.; Dumont-Lagacé, M.; Feghaly, A.; Benslimane, Y.; Panes, R.; Courcelles, M.; Benhammadi, M.; Harrington, L.; Thibault, P.; Major, F.; Bengio, Y.; Gagnon, E.; Lemieux, S.; Perreault, C.
MHC-I associated peptides (MAPs) play a central role in the elimination of virus-infected and neoplastic cells by CD8 T cells. However, accurately predicting the MAP repertoire remains difficult, because only a fraction of the transcriptome generates MAPs. In this study, we investigated whether codon arrangement (usage and placement) regulates MAP biogenesis. We developed an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), predicting MAP presentation solely from mRNA sequences flanking the MAP-coding codons (MCCs), while excluding the MCCs per se. CAMAP predictions were significantly more accurate when using original codon sequences than shuffled codon sequences which reflect amino acid usage. Furthermore, predictions were independent of mRNA expression and MAP binding affinity to MHC-I molecules and applied to several cell types and species. Combining MAP ligand scores, transcript expression level and CAMAP scores was particularly useful to increase MAP prediction accuracy. Using an in vitro assay, we showed that varying the synonymous codons in the regions flanking the MCCs (without changing the amino acid sequence) resulted in significant modulation of MAP presentation at the cell surface. Taken together, our results demonstrate the role of codon arrangement in the regulation of MAP presentation and support integration of both translational and post-translational events in predictive algorithms to improve modeling of the immunopeptidome. Author summary: MHC-I associated peptides (MAPs) are small fragments of intracellular proteins presented at the surface of cells and used by the immune system to detect and eliminate cancerous or virus-infected cells. While it is theoretically possible to predict which portions of the intracellular proteins will be naturally processed by the cells to ultimately reach the surface, current methodologies have prohibitively high false discovery rates.
Here we introduce an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), which integrates information from mRNA-to-protein translation with other factors regulating MAP biogenesis (e.g. MAP ligand score and transcript expression levels) to improve MAP prediction accuracy. While most MAP predictive approaches focus on MAP sequences per se, CAMAP's novelty is to analyze the MAP-flanking mRNA sequences, thereby providing completely independent information for MAP prediction. We show on several datasets that the integration of CAMAP scores with other known factors involved in MAP presentation (i.e. MAP ligand score and mRNA expression) significantly improves MAP prediction accuracy, and further validate CAMAP learned features using an in vitro assay. These findings may have major implications for the design of vaccines against cancers and viruses, and in times of pandemics could accelerate the identification of relevant MAPs of viral origins.
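The contrast between original and shuffled codon sequences in the abstract above can be illustrated with a small Python sketch of one plausible control: permuting synonymous codons among the positions that encode the same amino acid, so that the protein sequence and overall codon counts are unchanged but codon placement is scrambled. Whether this matches the shuffling procedure used for CAMAP is an assumption; the function names are hypothetical, and only the standard genetic code table is taken as given.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq: str) -> list[str]:
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq: str) -> str:
    return "".join(CODON_TABLE[c] for c in split_codons(seq))


def shuffle_synonymous(seq: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(seq)
    positions_by_aa: dict[str, list[int]] = {}
    for pos, codon in enumerate(codons):
        positions_by_aa.setdefault(CODON_TABLE[codon], []).append(pos)
    shuffled = list(codons)
    for positions in positions_by_aa.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for p, codon in zip(positions, pool):
            shuffled[p] = codon
    return "".join(shuffled)


original = "CTGTTAGCTGCCCTT"  # toy sequence encoding L L A A L
control = shuffle_synonymous(original, random.Random(0))
```

Because the control keeps the amino acid sequence fixed, any drop in predictive accuracy on shuffled input can be attributed to codon arrangement rather than to amino acid content.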