ImmunoInformatics — Latest Matching Preprints

1

HALPred-B: Host-Aware Linear B-Cell Epitope Prediction: Challenges, Limitations, and Variability Across Species

Gautam, P.; Mitra, P.; Sinha, I.

2026-06-26 bioinformatics 10.64898/2026.06.22.733770 medRxiv

Top 0.1%

28.6%

Predicting linear B-cell epitopes is a basic immunoinformatics task with a direct impact on vaccine design and antibody engineering. Recent advances in machine learning have improved predictive performance, but most existing approaches are trained on aggregated datasets and assume that antigenic patterns are conserved across host organisms. This assumption ignores host-dependent immunological variability and limits how well models generalize across species. We present what we describe as the first systematic host-wise evaluation: a machine learning-based analysis of host-aware linear B-cell epitope prediction using curated datasets from the Immune Epitope Database (IEDB). We build separate datasets for human, mouse, and non-human primate hosts and assess several classification models, including Random Forest, Support Vector Machine (SVM), Gradient Boosting, XGBoost, and K-Nearest Neighbors (KNN). The models use sequence-derived feature representations, such as AAIndex descriptors, biochemical properties from ExPASy, and dipeptide composition. Our results show that predictive performance differs substantially across hosts. Models achieve up to 86.07% accuracy and 0.93 ROC-AUC on human datasets but lower performance on mouse and non-human primate datasets. This gap points to dataset bias and sequence distribution differences, as well as the inability of existing features to capture host-specific immunological context. These results indicate that the prediction of linear B-cell epitopes is intrinsically host-specific, and a single global model does not generalize well across species. We propose incorporating host-aware modeling strategies and organism-specific features to improve predictive reliability and biological relevance.
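To make one of the feature types named in this abstract concrete, here is a minimal Python sketch of dipeptide composition: the frequency of each of the 400 ordered pairs of adjacent standard amino acids in a peptide. This is an illustration only, not the authors' code. The function name `dipeptide_composition` and the decision to skip pairs containing non-standard residues are assumptions.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered pairs of them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide: str) -> list[float]:
    """Return the 400 dipeptide frequencies for one peptide.

    Assumption: pairs containing a non-standard residue (e.g. 'X') are
    skipped and do not count toward the normalizing total.
    """
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Peptide too short or made only of non-standard residues.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

Each peptide maps to a fixed-length vector regardless of its length, which is what lets classical classifiers such as Random Forest or SVM consume variable-length epitope sequences.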

2

HAIRpred2: Human Host-Specific Prediction of Antibody-Interacting Residues Using Hybrid Physicochemical and Structural Features

Mehta, N. K.; Sahni, R.; Kumar, N.; Raghava, G. P. S.

2026-05-13 bioinformatics 10.64898/2026.05.09.723672 medRxiv

Top 0.1%

18.2%

Prediction of conformational B-cell epitopes is critical for vaccine design, immunotherapy, and antibody engineering. To date, several host-independent computational methods have been developed for predicting antibody-interacting residues in antigen structures. However, it is well established that antigen-antibody (Ag-Ab) interactions vary depending on the host immune system, indicating the importance of developing host-specific prediction models. In this study, we present, for the first time, a human host-specific method, HAIRpred2, that predicts antibody-interacting residues in an antigen from its tertiary structure. The dataset was derived from HAIRpred and comprises 277 human Ag-Ab complexes, with 221 structures used for training and 56 for independent testing. Preliminary analysis revealed that residues with a relative surface accessibility (RSA) below 0.05, corresponding to buried regions, are highly likely to be non-interacting, underscoring the importance of structural accessibility in antibody recognition. To identify the most informative features, we evaluated multiple feature representations, including RSA, large language model (LLM)-based embeddings, distance-based features, and physicochemical properties. A model trained on single-residue RSA features achieved an AUC of 0.72. Incorporating a sliding window of 15 residues to capture local structural context improved performance to an AUC of 0.75. The best performance (AUC = 0.78 on the independent test set) was achieved by integrating RSA with physicochemical descriptors. Benchmarking against existing antibody-interaction prediction methods on the same independent dataset demonstrated that HAIRpred2 outperforms current tools, further highlighting the advantage of host-specific modeling. HAIRpred2 is freely available as a web server at https://webs.iiitd.edu.in/raghava/hairpred2/.

Highlights

- Development of HAIRpred2, the first human host-specific method for predicting antibody-interacting residues.
- Analysis of 277 human antigen-antibody complexes to capture host-dependent interaction patterns.
- Relative surface accessibility (RSA) identified as a key determinant, with buried residues rarely participating in interactions.
- Integration of RSA with physicochemical features achieved the best performance (AUC = 0.78) on an independent dataset.
- HAIRpred2 outperforms existing methods and is available as a web server for epitope prediction.
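The sliding-window idea in this abstract can be sketched in a few lines of Python: each residue is described by the RSA values of itself and its neighbours, so a 15-residue window gives 15 features per residue. This is a hedged illustration, not the HAIRpred2 implementation. The function names, the padding value of 0.0 at the chain ends, and the strict `<` comparison at the 0.05 cutoff are assumptions.

```python
def rsa_windows(rsa, window=15, pad_value=0.0):
    """Build one fixed-length RSA window centred on each residue.

    Assumption: chain ends are padded with `pad_value` so that every
    residue, including the first and last, gets a full window.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    values = list(rsa)
    half = window // 2
    padded = [pad_value] * half + values + [pad_value] * half
    return [padded[i:i + window] for i in range(len(values))]


def is_buried(rsa_value, cutoff=0.05):
    """Flag a residue as buried using the 0.05 RSA cutoff from the abstract."""
    return rsa_value < cutoff
```

For a chain of N residues, `rsa_windows` returns N rows of 15 values each, which can be passed to any per-residue classifier.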

3

Benchmarking AlphaFold and related deep learning approaches for modeling antibody and TCR antigen recognition

Yin, R.; Saravanakumar, S.; Shi, S. Y.; Park, M.; Lin, V.; Lee, J.; Cheung, M.; Felbinger, N.; Kaufman, S.; Eisenberg, M.; Pierce, B.

2026-07-06 bioinformatics 10.64898/2026.07.04.736425 medRxiv

Top 0.1%

14.8%

Determining the structural basis of antigen recognition by antibodies and T cell receptors (TCRs) provides critical insights into effective immune targeting and can inform design of biotherapeutics and vaccines. Accurate computational modeling of antibodies and TCRs in complex with their targets poses a major challenge for predictive methods, including AlphaFold, which is generally accurate for modeling protein complexes but has shown limited success for immune recognition. In this study we assessed the performance of AlphaFold2, AlphaFold3, increased sampling protocols, and related deep learning methods for modeling antibody-protein, antibody-peptide, and TCR-peptide-major histocompatibility complex (pMHC) recognition. We show that increased sampling and AlphaFold3 generally improve performance relative to default sampling and AlphaFold2; however, predictive accuracy and improvement levels varied considerably among interface classes, with antibody-peptide complexes representing a challenge despite their small antigen size. Comparing per-case success across methods showed some complementarity, indicating opportunities for increased success through model pooling approaches, for instance increasing antibody-peptide near-native success from 41% to 59%. Analysis of AlphaFold confidence scores and modeling of a noncanonical complex provided further insights into predictive performance. These results highlight considerations for predictive antibody and TCR complex modeling efforts, while revealing key distinctions among protocols, scoring, and immune complex classes.
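The pooling arithmetic mentioned in this abstract can be illustrated with a short Python sketch. It assumes that a case counts as a pooled success when at least one method produces a near-native model for it. That reading, the function names, and the dictionary layout are assumptions, not the authors' protocol.

```python
def success_rate(outcomes):
    """Fraction of cases marked as successful (True)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no cases supplied")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def pooled_success_rate(per_method):
    """Success rate when a case succeeds if any method succeeds on it.

    `per_method` maps a method name to a list of booleans, one per
    benchmark case, with cases in the same order for every method.
    """
    lengths = {len(v) for v in per_method.values()}
    if len(lengths) != 1:
        raise ValueError("all methods must cover the same cases")
    pooled = [any(case) for case in zip(*per_method.values())]
    return success_rate(pooled)
```

Pooling can only help when methods succeed on different cases. If two methods succeed on exactly the same cases, the pooled rate equals the individual rate.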

4

Revised Adaptive Immune Receptor Data in the Immune Epitope Database

Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.

2026-06-06 bioinformatics 10.64898/2026.06.03.728549 medRxiv

Top 0.1%

13.4%

The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and - if available - the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.

5

Bioinf-Farma: supervised integration of epitope prediction and recombinant protein developability for automated vaccine candidate prioritization

Bondi, H.; Crespi, M.; Orlando, M.; Lescai, F.; Serapian, S. A.; Colombo, G.; Fasano, M.; Pollegioni, L.; Molla, G.

2026-06-18 bioinformatics 10.64898/2026.06.15.732271 medRxiv

Top 0.1%

8.8%

Vaccine antigen discovery requires prioritizing protein candidates according to both immunogenic potential and recombinant expression feasibility. These properties are typically evaluated using separate computational tools, requiring researchers to integrate heterogeneous outputs through ad hoc workflows. Here, we present BIOINF-farma, a modular platform integrating epitope prediction and developability assessment for rational antigen selection within a unified environment. Candidates can be submitted as amino acid sequences or three-dimensional structures. When experimental structures are unavailable, BIOINF-farma automatically searches for models in AlphaFold DB or performs structure prediction using Boltz-2, ensuring a standardized structural representation for downstream analyses. Antigenicity is quantified by combining structure-based conformational epitope signals (MLCE/REBELOT-BEPPE) and sequence-based linear epitope propensity scores (BepiPred 3.0) into a protein-level Antigenicity Score, with a classification threshold optimized on a manually curated validation dataset. Developability is evaluated through two supervised Random Forest meta-learners that integrate three solubility predictors (DeepSoluE, SoluProt, Protein-Sol) and three thermal stability predictors (TemStaPro, ProLaTherm, BertThermo), whose outputs are combined into an Expression Efficiency Score (EES). By integrating complementary predictive signals, the meta-learning framework achieves greater accuracy and robustness than individual predictors while maintaining performance across a broad range of sequence identities. The Antigenicity Score effectively discriminates antigenic from non-antigenic proteins with a large effect size, whereas EES successfully distinguishes soluble from insoluble outcomes on an independent panel of recombinant proteins expressed in Escherichia coli. BIOINF-farma jointly assesses antigenicity and expression feasibility within a single framework. Its modular architecture facilitates the incorporation of future predictive methods, while its web-based interface makes the full pipeline accessible to users without programming expertise, supporting rapid candidate triage in vaccine research and emerging pathogen responses.

Author Summary: Vaccine development begins with a critical step: identifying, among the many proteins encoded in a pathogen genome, those most suitable as candidate antigens. A promising candidate must satisfy two requirements that are rarely evaluated together. It must be recognized by the immune system, so that vaccination elicits a protective response; and it must be amenable to recombinant production, since antigens that cannot be obtained in sufficient quantity and quality are of limited practical use. Current computational tools typically address only one of these aspects, and researchers must integrate their outputs manually, through procedures that are time-consuming and prone to inconsistency. We developed BIOINF-farma, an automated platform that brings these two assessments into a single analytical framework. Starting from a protein sequence or an experimental structure, the platform retrieves or predicts a three-dimensional model, evaluates the protein's antigenic potential by combining complementary epitope predictors, and estimates its expression feasibility by integrating multiple solubility and stability predictors through supervised machine learning. A web-based interface makes the full workflow available to experimental immunologists and vaccine developers without requiring computational expertise, supporting rational candidate prioritization in routine vaccine research and during emerging pathogen responses.

6

EpiESM-GA: Resource-Efficient Protein Foundation Model Features for Equitable B-Cell Epitope Prediction

Gautam, P.; Mitra, P.

2026-06-26 bioinformatics 10.64898/2026.06.22.733745 medRxiv

Top 0.1%

3.9%

Prediction of B-cell epitopes can assist in reducing costly wet-lab screening in vaccine design, diagnostics, and antibody discovery. However, current predictors often suffer from noisy labels, weak generalization, and structure-dependent workflows. Here we present EpiESM-GA, an efficient sequence-only pipeline for linear B-cell epitope prediction. Positive and negative peptide examples are collected from IEDB, which provides experimentally tested epitopes and distinguishes positive and negative epitope records based on assay evidence (Vita et al., 2019). Each peptide is encoded with a frozen ESM-2 protein language model: a bidirectional transformer producing amino acid embeddings for downstream structure and function tasks (Lin et al., 2023). Mean-pooled embeddings are further compressed into a compact 420-feature representation with a genetic algorithm and classified with lightweight Random Forest, XGBoost, or MLP heads. This avoids foundation-model fine-tuning, reduces the number of trainable parameters, improves interpretability, and enables low-resource deployment. On an IEDB-derived benchmark, EpiESM-GA attains 0.880 ± 0.004 AUC-ROC, 0.852 ± 0.005 PR-AUC, 82.0 ± 0.6% accuracy, 0.79 ± 0.01 F1, and 0.74 ± 0.01 MCC, outperforming dense ESM-2 features and baselines LBCE-XGB, EpitopeVec, and BepiPred-2.0 (mean ± std over five independent random seeds). The framework shows how frozen protein foundation models can enable pandemic preparedness, peptide vaccine prioritization, diagnostic antigen screening, and equitable computational immunology.
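Two steps of the pipeline described in this abstract, mean pooling of per-residue embeddings and reduction to a selected feature subset, can be sketched in plain Python. This is an illustration under stated assumptions, not the EpiESM-GA code. The function names are hypothetical, the embeddings are stand-in lists of floats rather than real ESM-2 output, and the binary mask is assumed to be what a genetic algorithm would evolve. The search for the mask itself is omitted.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length peptide vector."""
    rows = list(residue_embeddings)
    if not rows:
        raise ValueError("peptide has no residues")
    dim = len(rows[0])
    if any(len(row) != dim for row in rows):
        raise ValueError("all residue vectors must have the same length")
    return [sum(row[j] for row in rows) / len(rows) for j in range(dim)]


def apply_mask(vector, mask):
    """Keep only the features whose mask entry is truthy.

    Assumption: `mask` is a 0/1 chromosome of the kind a genetic
    algorithm would optimize against a validation metric.
    """
    if len(vector) != len(mask):
        raise ValueError("mask and vector must have the same length")
    return [value for value, keep in zip(vector, mask) if keep]
```

Because the language model stays frozen, only the mask and the lightweight classifier head need to be fitted, which is where the low-resource claim in the abstract comes from.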

7

TRACE: a graph-based workflow for TCR-epitope prioritization and tumor-reactive T-cell identification

Chen, Y.; Giuliano, V.; Dacillo, I.; Lin, W.; Yan, Y.; Luo, P.

2026-05-31 bioinformatics 10.64898/2026.05.27.728217 medRxiv

Top 0.1%

3.3%

Accurate prioritization of T-cell receptor (TCR)-epitope interactions and identification of tumor-reactive T cells are important but difficult steps in immunotherapy-oriented bioinformatics workflows. Existing methods typically address these tasks separately and either model TCR-epitope pairs as independent observations or rely primarily on transcriptomic signatures. In this study, we present TRACE (TCR-epitope pRioritization And T-Cell idEntification), a graph-based computational workflow that unifies both applications within a single heterogeneous graph framework. The protocol represents TCRs, epitopes, and T cells as typed nodes connected by similarity and association edges, and combines pretrained sequence embeddings with edge-aware graph attention, Laplacian positional encoding, and bidirectional cross-domain attention. Applied to the IEDB and VDJdb benchmarks, TRACE achieved AUROC/AUPR values of 0.937/0.922 and 0.992/0.990, respectively, outperforming five state-of-the-art algorithms. In addition, on a single-cell RNA-seq dataset, the workflow achieved an AUROC of 0.984 and an AUPR of 0.984, substantially exceeding transcriptomic signature-based baselines for tumor-reactive T-cell identification. Ablation analysis showed that Laplacian positional encoding provided the largest performance gain, particularly in sparse graph settings. These results suggest that heterogeneous graph modeling can serve as a practical protocol for integrating receptor sequence, antigen context, and cellular phenotype in computational immunology.

8

IgGM2: An All-Atom Foundation Model for Adaptive Immune Receptor Design

Ma, J.; Wu, F.; Yao, L.; Gao, J.; Wang, R.; Li, Q.; Yang, N.; Jiang, S.; Huang, D.; Pan, X.; Zhu, Y.; Hou, T.; Yao, J.; Yan, J.

2026-07-09 bioinformatics 10.64898/2026.07.09.737510 medRxiv

Top 0.1%

3.1%

Accurate immune receptor design requires modeling the coupled variation of amino-acid sequence, full-atom conformation, and target-binding geometry across antibodies, nanobodies, and T-cell receptors (TCRs). Existing methods often address only part of this problem, either by separating structure generation from sequence design, relying on fixed-backbone inverse folding, or focusing on a single receptor class. We introduce IgGM2, a unified all-atom generative framework for immune receptor structure prediction and CDR sequence-structure co-design. IgGM2 follows a structure-to-design strategy: it first learns how immune receptors are positioned around fixed target structures, and then transfers this target-conditioned structural prior to CDR design. Unlike modular design pipelines, IgGM2 jointly generates CDR residue identities and full-atom receptor structures, allowing framework geometry to adapt to designed CDRs without separate inverse folding or external sidechain packing. Unlike continuous residue encodings based on virtual-atom geometry, IgGM2 keeps sequence prediction explicit while using atom14 placeholders only for full-atom representation. On structure prediction benchmarks, IgGM2 better captures receptor-target spatial relationships than AlphaFold3 on FoldBench and achieves strong performance on TCR-pMHC modeling. On sequence design benchmarks, IgGM2 improves amino-acid recovery and Rosetta-based interface preference metrics, suggesting more favorable generated binding interfaces. These results support IgGM2 as a unified all-atom framework for adaptive immune receptor structure prediction and design.

9

Engineering Endogenous T Cell Receptors to Recognize Cancer Neoantigens Using a Hybrid Physics-AI Approach

Weber, J.; Parajuli, G.; Wang, S.; Ratner, V.; Ma, X.; Shoshan, Y.; Zhang, L.; Morrone, J.; Raboh, M.; Hexter, E.; Parthasarathy, P. B.; Gaughan, C.; Makarov, V.; Chu, L.; Hasgur, S.; Juric, I.; Diaz, M.; Srivastava, R.; Knauf, J.; Hassan, K.; Cornell, W.; Alban, T.; Chan, T.

2026-05-19 immunology 10.64898/2026.05.15.725176 medRxiv

Top 0.1%

2.7%

T cell receptors (TCRs) are critical for immune surveillance and successful adaptive immune response against foreign antigens. TCRs drive this key arm of the immune system through recognition of peptide epitopes presented on MHC complexes. However, they are limited due to their stochastic nature and generation via genetic recombination. In silico design of functional TCRs that target defined peptide epitopes would be of considerable utility but has up until now been unsuccessful. Here, we develop an artificial intelligence (AI)-powered approach using a hybrid physics-based simulation and generative AI that successfully engineers TCRs against defined epitopes presented by MHC-I. We use this approach to design TCRs against two cancer antigens, a HERC1 neoantigen and an immunogenic neoepitope in mutant EGFR. We engineer multiple TCRs against the HERC1 neoantigen which activate T cells in response to exposure to peptide-MHC I and kill cancer cells more effectively than a patient-derived TCR. In addition, we used generative AI to design functional TCRs that target the EGFR T790M neoantigen, engineering greater specificity against the mutant sequence. We present an AI-based approach to TCR design with broad utility for efforts to engineer TCRs and for the development of new cell therapies.

One sentence summary: Artificial intelligence-based approach enables the directed engineering of functional TCRs with enhanced features that target cancer neoantigens.

10

Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches

GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.

2026-05-15 bioinformatics 10.64898/2026.05.13.724892 medRxiv

Top 0.1%

2.4%

Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.

11

Multi-Scale Machine Learning for Antibody-Antigen Binding Affinity Prediction Using Deep Mutational Scanning and Structural Features

Sivasubramani, S.

2026-06-23 bioinformatics 10.64898/2026.06.09.730151 medRxiv

Top 0.1%

2.4%

Predicting how mutations alter antibody-antigen binding affinity is essential for antibody engineering and vaccine design, yet current methods generalize poorly to unseen complexes. We present a multi-scale machine learning framework integrating 93 descriptors across four modalities: physicochemical, structural, ESM-2 protein language model, and solvent-accessible surface area (SASA)/ΔΔGfold features. Under leave-one-complex-out deep mutational scanning (LOCO-DMS) cross-validation on AbAgym (36,541 mutations, 68 experiments, 13 pathogens), gradient boosting achieved MCC = 0.206; a confidence-stratified ensemble reached MCC = 0.374 (83.5% accuracy, 25.5% coverage). No single modality exceeds the majority baseline alone; only multi-scale fusion succeeds. Boltzmann ceiling analysis shows 45.9% of mutations are near-neutral (|ΔΔG| < kBT), bounding theoretical maximum MCC at 0.473; our method achieves 79.1% of this limit. Five deep learning architectures benchmarked under LOCO-DMS showed self-attention matching gradient boosting (MCC = 0.200). Cross-pathogen transfer failed systematically (mean 46.7%), confirming universal binding predictors remain an open challenge.
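The near-neutral criterion and the "79.1% of this limit" figure in this abstract can be checked with a few lines of Python. The sketch below assumes ΔΔG is expressed in kcal/mol and that the temperature is 298.15 K. Neither is stated in the abstract, and the function names are hypothetical. The molar gas constant is used in place of the Boltzmann constant because the energies are per mole.

```python
# Molar gas constant in kcal/(mol*K); the per-mole counterpart of k_B.
GAS_CONSTANT_KCAL = 0.0019872


def thermal_energy(temperature_k=298.15):
    """Return kBT per mole in kcal/mol at the given temperature."""
    return GAS_CONSTANT_KCAL * temperature_k


def is_near_neutral(ddg_kcal_per_mol, temperature_k=298.15):
    """True when |ddG| is below the thermal energy, i.e. |ddG| < kBT."""
    return abs(ddg_kcal_per_mol) < thermal_energy(temperature_k)


def fraction_of_ceiling(achieved_mcc, ceiling_mcc):
    """Ratio of an achieved MCC to a theoretical maximum MCC."""
    return achieved_mcc / ceiling_mcc
```

Under these assumptions kBT is roughly 0.59 kcal/mol, and `fraction_of_ceiling(0.374, 0.473)` evaluates to about 0.791, matching the 79.1% quoted in the abstract for the confidence-stratified ensemble.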

12

Nanobodies versus canonical antibodies: an updated comparison of their binding modes

Hauser, A.; Dangla-Pelissier, G.; Cazals, F.

2026-06-04 bioinformatics 10.64898/2026.06.01.729307 medRxiv

Top 0.1%

2.1%

Heavy-chain-only antibodies, produced by the adaptive immune systems of camelids and cartilaginous fish, complement canonical antibodies that contain variable domains from both heavy and light chains. We refine previous studies by providing a detailed analysis of the binding modes of VHHs versus canonical antibodies, using a dataset with a ~20-fold increase in the number of cases. We show that VHHs exhibit a larger buried surface area than double-domain antibodies, despite relying on a single variable domain. This property can be attributed to contributions from both framework regions and CDR3. We further demonstrate that the binding modes of VHHs, characterized by the number of FR and CDR regions contacting the antigen, are more diverse than previously reported. In addition, we find that VHH and canonical antibody interfaces display similar solvation properties, although VHH interfaces are more tightly packed. Finally, we discuss the thermodynamic and kinetic implications of these findings for the design of high-affinity VHHs, an issue of particular importance in protein engineering and design.

13

OpenGerminal: an open-source implementation of the Germinal antibody design pipeline

Han, B.; Li, S.

2026-06-29 bioinformatics 10.64898/2026.06.25.734527 medRxiv

Top 0.1%

2.1%

Germinal is a recently described computational pipeline for de novo antibody design that combines AlphaFold-Multimer hallucination with antibody language model guidance to generate epitope-targeted antibodies. Germinal identified binders with nanomolar-to-low-micromolar affinities by testing only 43-101 designs per target across four diverse antigens, establishing it as a practical tool for epitope-directed antibody design accessible to standard academic laboratories. As this architecture is itself very recent, systematic replacement and benchmarking of its individual components remains largely unexplored, yet offers a valuable opportunity to probe the robustness of the underlying design. We present OpenGerminal, which replaces PyRosetta with a fully open-source stack comprising OpenMM 8.5.1, FreeSASA, FASPR, Biopython, and sc-rs v1.0.0, and adopts AbLang1 (ablang2 v0.2.1) as the sole antibody language model in place of IgLM. Benchmarking on two VHH targets (PD-L1 and IL-3) reveals that OpenGerminal achieves a markedly higher cofolding pass rate (PD-L1: 33.7% vs. 18.6%; IL-3: 24.6% vs. 8.0%) with equivalent or improved Chai-1 structural confidence metrics in accepted designs, at the cost of a modest increase in per-trajectory computation time (>=1.5x). Multi-chain target support is also extended and verified to run without error on the official insulin example. OpenGerminal provides the first systematic benchmarking of IgLM versus AbLang1 within the Germinal architecture, and its fully open-source component stack broadens the range of deployment contexts in which the pipeline can be used.

14

Immunoinformatics-Guided Design and In Silico Evaluation of a Multi-Epitope Vaccine Against Influenza A H10N5 and H3N2 Strains Based on Hemagglutinin and Neuraminidase Proteins

Shabbir, M. Z.; Kumar, P.; Rehman, M. A. U.; Kumar, J.; Urooj, U.; Batool, S. I.; Sourav, C.; Ghazanfar, R.; Nagari, Z.; Hameed, D.; Wahid, A.; Atique, A.; Siddique, M. D.

2026-07-08 bioinformatics 10.64898/2026.07.03.736294 medRxiv

Top 0.1%

1.8%

Influenza A viruses H3N2 and H10N5 represent, respectively, a persistently dominant seasonal pathogen and a newly documented zoonotic threat, with the latter strain variants responsible for the first confirmed human fatality in January 2024, yet no vaccine platform currently addresses co-protection against both subtypes within a unified immunogen. We report here the immunoinformatics-based vaccine design and multi-layered computational validation of a 419-amino-acid multi-epitope subunit vaccine construct targeting conserved hemagglutinin (HA) and neuraminidase (NA) antigens identified through multiple sequence alignment of the avian H10N5 (A/swine/Hubei/10/2008) and H3N2 human reference strain sequences to identify viral agents undergoing mammalian adaptations. Linear B-cell, cytotoxic T lymphocyte (CTL), and helper T lymphocyte (HTL) epitopes were predicted using ABCpred, BCEpred, BepiPred 2.0, NetMHCpan 2.1, and NetMHCpan 4.0, then filtered through VaxiJen 3.0, AllerTOP v2.1, and ToxinPred to retain only antigenic, non-allergenic, non-toxic candidates. The final construct, incorporating an avian β-defensin N-terminal adjuvant with GPGPG, AAY, and EAAAK linkers, exhibited a molecular weight of 43.9 kDa, instability index of 31.15, and SOLPro solubility probability of 0.763. Tertiary structure modeling via I-TASSER and GalaxyRefine achieved 84.4% Ramachandran-favored residues. Molecular docking against TLR3 and TLR7 yielded binding free energies of -16.1 and -16.8 kcal/mol with picomolar dissociation constants. Molecular dynamics simulations confirmed complex stability over extended trajectories. Furthermore, codon optimization produced a Codon Adaptation Index of 1.0 for E. coli K12 expression. In silico immune simulation demonstrated robust activation of humoral and cellular immunity including elevated IgG1, IgM, IFN-γ, IL-2, rapid NK cell expansion, and broad B-cell clonal diversity. These findings establish a computationally validated candidate capable of providing protection against influenza in multiple host organisms, warranting experimental advancement.

15

Computational design of a multi-epitope vaccine against M. tuberculosis

Buhari, A.; Okutu, P.; Oyeleke, U. A.; Sivakumar, A.; Hameed, S. A.

2026-07-15 bioinformatics 10.64898/2026.07.09.737463 medRxiv

Top 0.1%

1.8%

Background: Tuberculosis remains a leading global infectious killer, with BCG offering inconsistent adult protection and rising drug-resistant strains demanding novel vaccine strategies. We report the first multi-epitope vaccine construct simultaneously targeting three previously unexplored Mycobacterium tuberculosis virulence proteins, EccB3, MycP, and polyketide synthase, which collectively govern nutrient acquisition, ESX secretion integrity, and innate immune evasion. Methods: Using a reverse vaccinology pipeline, B-cell, CTL, and HTL epitopes were predicted, filtered for allergenicity, toxicity, and IFN-γ induction, then assembled into an 823-residue chimeric construct incorporating beta-defensin and PADRE adjuvants with AAY/GPGPG linkers, covering ~90% global HLA diversity. The construct underwent AlphaFold structure prediction, 3DRefine refinement, disulfide engineering, PROCHECK/ProSA validation, ClusPro 2.0 docking against TLR1/TLR2, and C-IMMSIM immune simulation. Results: The construct (82.3 kDa, instability index 32.48) showed strong structural quality (94.7% favoured Ramachandran residues), stable TLR1/TLR2 binding (weighted energy: -1,371.0 kcal/mol), robust in silico immune responses, and durable memory cell formation following booster simulation. Conclusion: This computationally validated construct represents a promising multi-target TB vaccine candidate warranting experimental advancement.

16

CCK* (Convex Closure K*): A Suite of Algorithms for De Novo L- and D-peptide Design

Childs, H.; McBride, A. C.; Donald, B. R.

2026-06-01 bioinformatics 10.1101/2025.11.21.689740 medRxiv

Top 0.1%

1.7%

The computational design of L-peptides and their mirror-image counterparts, D-peptides, is an active area in drug design. Peptide therapeutics offer exceptional structural diversity and high binding specificity, while D-peptides additionally confer critical advantages such as proteolytic resistance. Progress in de novo D-peptide design has been hindered by the absence of evolutionary context and limited structural data, both of which underpin the deep learning methods widely used in L-peptide design. Consequently, a robust framework capable of designing both L- and D-peptides should integrate data-driven inference with first-principles, physics-based modeling. Here, we introduce a unified computational framework that supports de novo design of both L- and D-peptides, thereby expanding the accessible design space across both chiral spaces. Convex Closure K* (CCK*) is a suite of chirality-agnostic algorithms: SCOPE, MONTAGE, and ARISE. SCOPE uses geometry as a proxy for chemical energetics, computing convex hull representations of rotameric states to rapidly generate multi-sequence protein contact maps. MONTAGE employs geometric hashing in conjunction with the K* algorithm to generate and rank backbone scaffolds according to their suitability for sequence design. ARISE is a K*-based sequence design algorithm that performs iterative residue assignment in an undirected graph to design high-affinity peptide sequences. We apply the full CCK* suite to six de novo design tasks, benchmarking chirality-preserving and chirality-inverting designs in both homochiral and heterochiral complexes.

17

GermRL: Alleviating The Germline Bias In Autoregressive Antibody Language Models Through Reinforcement Learning

Ludwig, L.; Chungyoun, M.; Gray, J. J.

2026-06-11 bioinformatics 10.64898/2026.06.08.730660 medRxiv

Top 0.1%

1.7%

Antibodies are powerful therapeutics whose antigen specificity arises from sequence diversity shaped during development. Recently, language models trained on large antibody repertoire datasets have enabled the generation and screening of novel candidates, but these models retain a strong germline bias. As AI adoption increases in therapeutic workflows, it is crucial to develop models that harness the diversity of antibodies necessary for the discovery of mutations that encode desirable properties. Previous work explored the germline bias in masked antibody language models, yet the bias in generative autoregressive language models has not yet been addressed. Here, we present GermRL, a lightweight and modular reinforcement learning (RL) framework capable of alleviating the germline bias in pre-trained antibody autoregressive language models through group relative policy optimization (GRPO). GermRL achieves consistent one-shot generation of antibodies that satisfy specified mutation thresholds from germline while maintaining structural plausibility. Under the lowest and highest mutation thresholds tested (5 and 35 mutations from germline), GermRL scores 0.992 and 0.950 pass@1, respectively, compared to 0.398 and 0.034 for the pre-trained language model. Within GermRL, we introduce a key pair of modifications to GRPO that increase training efficiency by discouraging reward hacking under our antibody application. Furthermore, comparison of RL-generated and natural antibody sequences reveals how RL-based optimization can explore alternative evolutionary mutational patterns and residue compositional strategies while preserving key global properties of natural antibodies, including identifiable germline assignments, embedding-level similarity, and comparable developability profiles. Thus, RL-trained generative models optimized to promote antibody mutations through diversity from germline provide a promising framework for navigating the antibody sequence landscape, enabling exploration of novel yet biologically plausible candidates for therapeutic design.
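
The quantities named in this abstract (mutations from germline, pass@1, and the group-relative advantage used by GRPO) can be illustrated with a minimal sketch. It shows standard GRPO advantage normalization only, not the authors' modified variant. The binary reward, the "at least N mutations" reading of the threshold, the equal-length alignment, and all sequences are assumptions made for illustration.

```python
from statistics import mean, pstdev

def mutations_from_germline(seq, germline):
    """Count substituted positions; assumes pre-aligned, equal-length sequences."""
    if len(seq) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq, germline))

def threshold_reward(seq, germline, threshold):
    """Hypothetical binary reward: 1.0 if the sample reaches the mutation threshold."""
    return 1.0 if mutations_from_germline(seq, germline) >= threshold else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: reward standardized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pass_at_1(rewards):
    """Fraction of single samples that satisfy the threshold."""
    return mean(rewards)

# Toy, made-up fragments (not real antibody data).
germline = "QVQLVQSG"
group = ["QVQLVQSG", "QVKLVESG", "EVQLLESG", "QVQLVQSA"]
rewards = [threshold_reward(s, germline, threshold=2) for s in group]
advantages = group_relative_advantages(rewards)
print(rewards, pass_at_1(rewards), advantages)
```

Samples that meet the threshold receive a positive advantage relative to the rest of their group, which is the signal the policy update would reinforce.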

18

Benchmarking Boltz-2 for Screening of Therapeutic Antibody-Antigen Interactions

Fieux-Castagnet, A.; Waton, J.; Glukhonemykh, A.; Snow, E.; Ashokkumar, R.; Fleming, J.; Champagne, D.; Devenyns, T.; Peluffo, A.; Anagnostopoulos, C.

2026-05-14 bioinformatics 10.64898/2026.05.13.724924 medRxiv

Top 0.1%

1.6%

Protein structure prediction models (such as AlphaFold, Chai, Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding from non-binding pairs at scale. To our knowledge, there has not been an exhaustive exploration of Boltz-2 inference settings on this high-impact problem, and in this paper we set out to describe and implement a novel benchmarking framework that can accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and a randomly assigned non-cognate antigen. We developed a novel evaluation framework that systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.5 < ROC-AUC < 0.60). Among evaluated metrics, the minimum value of the interface predicted TM-score (ipTM-min) across seed-samples captured the strongest signal. Interestingly, additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ~15-20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ~8 seconds per prediction compared to shallow MSA settings. Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization, while confirming that a modest amount of sampling helps but is not in itself sufficient to improve performance significantly, as gains plateau relatively quickly.
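
As a rough illustration of the ipTM-min metric and the ROC-AUC used to score it, the sketch below takes the minimum ipTM over seed-samples for each antibody-antigen pair and then computes a rank-based AUC between cognate and non-cognate pairs. The function names and every score are hypothetical; nothing here is output from Boltz-2 or the authors' framework.

```python
def iptm_min(seed_scores):
    """Aggregate one pair's per-seed ipTM values by taking the minimum."""
    return min(seed_scores)

def roc_auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up per-seed ipTM values; label 1 = cognate pair, 0 = non-cognate pair.
pairs = [
    (1, [0.62, 0.58, 0.70]),
    (1, [0.41, 0.55, 0.48]),
    (1, [0.80, 0.77, 0.79]),
    (0, [0.50, 0.45, 0.52]),
    (0, [0.66, 0.60, 0.63]),
    (0, [0.35, 0.40, 0.38]),
]
labels = [label for label, _ in pairs]
scores = [iptm_min(seeds) for _, seeds in pairs]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance, which is why values between 0.5 and 0.60 are described as weak discrimination.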

19

Constrained Evolutionary Design of Matrixyl Analogs: Balancing Permeability and Functional Preservation Through Computational Optimization

Komianos, N.; Prakash, P.

2026-05-14 bioinformatics 10.64898/2026.05.12.724473 medRxiv

Top 0.1%

1.6%

Matrixyl (palmitoyl pentapeptide-4, KTTKS core) is a collagen-stimulating peptide used in topical anti-ageing products, but its in-use efficacy is limited by poor permeation through the stratum corneum. We describe a deterministic computational workflow that combines a tournament genetic algorithm and NSGA-II with exact RDKit molecular descriptors to search the fixed-length, edit-distance-2 neighbourhood of KTTKS (3,706 candidate sequences) for analogs with descriptors more favourable for passive transdermal diffusion. The search returns a 9-member Pareto frontier that quantifies the trade-off between predicted permeability and motif preservation. Five of the nine frontier members carry the same substitution, lysine to proline at position 4 (K4P). This single change lowers the topological polar surface area by 25.6%, removes the +1 charge contributed by lysine, and reduces the functional-preservation score from 1.00 (KTTKS) to 0.67. The frontier ranking is unchanged by ±30% perturbations to the TPSA and Mw penalty weights and by a 30% increase in the LogP penalty; only a 30% reduction in the LogP penalty produces rank movement. The frontier matches the ground-truth Pareto set obtained by exhaustive enumeration of all 3,706 candidates (precision and recall both 100%). On the basis of these results we recommend three sequences for experimental validation: PTTPS (largest predicted gain), KTTPS (single-mutation, conservative), and KTTPP (backup). All code, results, and figures are released under MIT and CC BY 4.0.
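
The size of the search space can be checked with a short enumeration. The sketch assumes that "edit-distance-2" at fixed length means up to two substitutions over the 20 standard amino acids; under that reading the count is 1 + 5·19 + 10·19² = 3,706, matching the abstract. The `pareto_front` helper is a generic two-objective dominance filter, not the authors' NSGA-II implementation, and the scores passed to it are placeholders.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def neighbourhood(parent, max_subs=2):
    """All same-length sequences within max_subs substitutions of parent."""
    found = {parent}
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(parent)), k):
            for residues in product(AMINO_ACIDS, repeat=k):
                seq = list(parent)
                for pos, aa in zip(positions, residues):
                    seq[pos] = aa
                found.add("".join(seq))  # the set removes duplicates
    return found

def pareto_front(points):
    """Names of (name, a, b) entries not dominated when a and b are both maximized."""
    front = []
    for name, a, b in points:
        dominated = any(
            a2 >= a and b2 >= b and (a2 > a or b2 > b)
            for _, a2, b2 in points
        )
        if not dominated:
            front.append(name)
    return front

candidates = neighbourhood("KTTKS")
print(len(candidates))  # 3706

# Placeholder (permeability, preservation) scores, for illustration only.
toy = [("a", 1.0, 0.0), ("b", 0.5, 0.5), ("c", 0.4, 0.4), ("d", 0.0, 1.0)]
print(pareto_front(toy))  # ['a', 'b', 'd']
```

With a space this small, exhaustive enumeration is cheap, which is what makes the precision and recall check against a ground-truth Pareto set feasible.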

20

Benchmarking AI-Driven PTIm-mAb Across Eleven FDA-Approved Bispecific Antibodies: A Cross-Tool Validation Study

Addepalli, M. K.; Prattipati, M.

2026-07-10 bioinformatics 10.64898/2026.07.07.736933 medRxiv

Top 0.1%

1.5%

Background: Late-stage attrition in therapeutic antibody discovery is dominated by developability liabilities: aggregation, polyspecificity, charge-driven non-specific binding, and chain-mispairing artefacts. Bispecific antibodies amplify these risks because each additional binding arm adds a new biophysical envelope that must be jointly satisfied. The existing in-silico ecosystem addresses individual axes of this problem (humanization, structure prediction, single-metric developability scoring) but few platforms integrate them end-to-end. PTIm-mAb (SANSHI Bio Solutions Pvt Ltd) is a multi-objective, AI/ML-driven antibody design platform that jointly optimizes sequence liabilities, surface aggregation, charge balance, humanness, and predicted binding affinity, and recommends a bispecific architecture in a single workflow. Methods: We applied PTIm-mAb to the published sequences of eleven FDA-approved bispecific antibodies using the platform's default-parameter Pareto-acceptance optimization loop, run to convergence or to the internal iteration ceiling, with no human curation between the platform run and the external profiler. Both wild-type and platform-optimized sequences were profiled independently with three publicly available developability tools: Aggrescan, CamSol, and the Therapeutic Antibody Profiler (TAP). Paired-sample tests (Wilcoxon signed-rank, exact binomial sign test, McNemar exact test) evaluated the direction and significance of changes. Results: Across the 17 evaluable paired arms profiled by TAP, PTIm-mAb cleared four wild-type CDR-vicinity Positive Charge Patch (PPC) flags: Blinatumomab-Arm1 (1.9952 → 0.6885), Mosunetuzumab-Arm1 (1.3391 → 0.0568), Linvoseltamab-Arm2 (0.8060 → 0.0), and the headline Elranatamab-Arm1 case (1.7981 → 0.5799), achieved without trading off any other in-range metric and corroborated by Aggrescan and CamSol on the same arm. Total CDR length was significantly shortened across the cohort (Wilcoxon two-sided p = 0.0075, one-sided p = 0.0037, effect size r = 0.65), a significant improvement on the metric most directly under the optimizer's control. The directional shift on Aggrescan integrated aggregation propensity was also significant by sign test (24 of 36 chains improved, 2 unchanged, 10 worsened; p = 0.021). On the already-clean Zenocutuzumab profile the optimizer identified residual headroom (PPC 0.1191 → 0.0; SFvCSP 12.5 → 6.0), demonstrating that the platform's value extends to candidates that pass all flags. Three results (Teclistamab Arm-1, Emicizumab, and Talquetamab Arm-2) did not clear all flags and are presented as candidates for iterative re-invocation of the platform pipeline on the optimized output (planned follow-up; Section 5). The remaining TAP metrics (PSH, PPC magnitude, PNC, |SFvCSP|) trended in the improvement direction without reaching significance in this cohort, a pattern consistent with the expected statistical signature of a multi-objective optimizer applied to molecules already within the clinical-stage envelope. The platform reported a mean of 12.8 months and USD 723,889 of computational front-loading per project across the nine-project cohort (range 9.0-16.0 months; USD 510,000-960,000); the underlying cost assumptions are tabulated in Supplementary Table S3. Conclusion: PTIm-mAb produces externally verifiable, literature-aligned improvements on the metrics most directly under its control, clears CDR-vicinity charge-patch flags on a meaningful fraction of flagged candidates, and front-loads substantial design-iteration work. The cohort-level pattern is consistent with a calibrated multi-objective optimizer operating at the edge of detectable headroom on a deliberately hard benchmark. We position the platform as an early-stage triage and lead-optimization layer in bispecific antibody discovery. For molecules whose first-pass result does not clear all flags, iterative re-invocation of the pipeline on the optimized output is a natural follow-up direction.
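
For readers unfamiliar with the exact binomial sign test mentioned above, a minimal sketch follows. It applies the textbook two-sided test to the Aggrescan counts from the abstract (24 improved, 10 worsened). Dropping the 2 unchanged chains is an assumption, since the abstract does not state how ties were handled, so the value obtained here need not reproduce the reported p = 0.021 exactly.

```python
from math import comb

def sign_test_two_sided(improved, worsened):
    """Exact two-sided sign test under H0: improvement and worsening equally likely.

    Ties are assumed to have been discarded before calling this function.
    """
    n = improved + worsened
    k = max(improved, worsened)
    # Upper-tail probability of seeing k or more of the majority sign.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = sign_test_two_sided(improved=24, worsened=10)
print(f"{p_value:.4f}")
```

Under these assumptions the test gives a p-value of about 0.024, below the conventional 0.05 level and close to the reported figure.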