ImmunoInformatics
○ Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match ImmunoInformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Mehta, N. K.; Sahni, R.; Kumar, N.; Raghava, G. P. S.
Prediction of conformational B-cell epitopes is critical for vaccine design, immunotherapy, and antibody engineering. To date, several host-independent computational methods have been developed for predicting antibody-interacting residues in antigen structures. However, it is well established that antigen-antibody (Ag-Ab) interactions vary depending on the host immune system, indicating the importance of developing host-specific prediction models. In this study, we present, for the first time, a human host-specific method, HAIRpred2, that predicts antibody-interacting residues in an antigen from its tertiary structure. The dataset was derived from HAIRpred and comprises 277 human Ag-Ab complexes, with 221 structures used for training and 56 for independent testing. Preliminary analysis revealed that residues with a relative surface accessibility (RSA) below 0.05, corresponding to buried regions, are highly likely to be non-interacting, underscoring the importance of structural accessibility in antibody recognition. To identify the most informative features, we evaluated multiple feature representations, including RSA, large language model (LLM)-based embeddings, distance-based features, and physicochemical properties. A model trained on single-residue RSA features achieved an AUC of 0.72. Incorporating a sliding window of 15 residues to capture local structural context improved performance to an AUC of 0.75. The best performance (AUC = 0.78 on the independent test set) was achieved by integrating RSA with physicochemical descriptors. Benchmarking against existing antibody-interaction prediction methods on the same independent dataset demonstrated that HAIRpred2 outperforms current tools, further highlighting the advantage of host-specific modeling. HAIRpred2 is freely available as a web server at https://webs.iiitd.edu.in/raghava/hairpred2/.
Highlights
- Development of HAIRpred2, the first human host-specific method for predicting antibody-interacting residues.
- Analysis of 277 human antigen-antibody complexes to capture host-dependent interaction patterns.
- Relative surface accessibility (RSA) identified as a key determinant, with buried residues rarely participating in interactions.
- Integration of RSA with physicochemical features achieved the best performance (AUC = 0.78) on an independent dataset.
- HAIRpred2 outperforms existing methods and is available as a web server for epitope prediction.
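The windowing idea in the abstract above can be illustrated with a minimal sketch. It is not the HAIRpred2 implementation: the function names, the zero padding at chain termini, and the toy RSA profile are assumptions made for illustration. Only the window size (15) and the buried-residue threshold (RSA below 0.05) come from the abstract.

```python
from typing import List

BURIED_RSA = 0.05  # threshold quoted in the abstract
WINDOW = 15        # window size quoted in the abstract


def window_features(rsa: List[float], window: int = WINDOW,
                    pad: float = 0.0) -> List[List[float]]:
    """Return one feature vector per residue: the RSA values of the residue
    and its neighbours. Termini are padded with `pad` (an assumption)."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a residue")
    half = window // 2
    padded = [pad] * half + list(rsa) + [pad] * half
    # Slice i of the padded list is centred on residue i of the original chain.
    return [padded[i:i + window] for i in range(len(rsa))]


def is_buried(rsa_value: float, threshold: float = BURIED_RSA) -> bool:
    """Flag residues below the RSA threshold as buried."""
    return rsa_value < threshold


# Hypothetical RSA profile for a six-residue fragment.
rsa_profile = [0.02, 0.10, 0.45, 0.80, 0.03, 0.60]
features = window_features(rsa_profile)
buried_mask = [is_buried(v) for v in rsa_profile]
```

Each row of `features` could then be fed to any classifier; the centre element (index 7) is the residue's own RSA.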
GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
Aldas-Bulos, V. D.; Plisson, F.
Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RosettaFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.
Sen, E.; Steiger, S.; Basic, M.; Prokoph, N.; Syed, A. P.; Seufert, I.; Rehman, U.-U.; Schumacher, S.; Baumann, A.; Feuring, M.; Weinhold, N.; Lübbert, M.; Döhner, H.; Döhner, K.; Raab, M. S.; Mallm, J.-P.; Stegle, O.; Rippe, K.
Background: Single-cell multi-omics profiling of hematopoietic malignancies frequently involves pooling of patient samples before library preparation to reduce costs. Demultiplexing and quality control of the resulting sequencing data depend on experimental design, sequencing depth, and computational methods. Existing approaches benchmark individual tools, auto-select a single best method, or apply majority voting. However, none systematically exploit disagreement patterns among orthogonal strategies as a diagnostic signal for cell quality. Results: We introduce Split-flow, a modular Nextflow pipeline that runs hashing-based and SNP-based demultiplexing, and transcriptome-based doublet detection in parallel. It classifies cells into quality strata through a concordance-based decision framework. Validation on multiplexed CITE-seq data from 14 multiple myeloma patients across eight Chromium channels demonstrates high reproducibility and shows that discordant cells cluster within specific cell types and quality strata. TCR clonotype cross-referencing against VDJdb confirms that concordance-based classification enriches for biologically genuine immune receptor sequences, with a 5.3-fold enrichment of confirmed public TCR sequences in the high-confidence stratum. Downsampling analysis reveals that SNP-based methods are more depth-sensitive than hash-based approaches, supporting the recommendation to combine both strategies. The framework transfers to AML samples across three assay types (snMultiome-seq, scRNA-seq, scATAC-seq), where ATAC-based demultiplexing resolves donor assignment discordance under low hashing efficiency. Conclusions: Split-flow demonstrates that combining orthogonal preprocessing methods yields structured information about cell quality and offers a concordance-based framework that transforms disagreement among those methods into a diagnostic signal. It introduces a preprocessing approach that can be exploited beyond hematopoietic malignancies in multiplexed single-cell applications.
Highlights and main findings
- Introduces Split-flow, a modular Nextflow DSL2 pipeline for preprocessing of multiplexed single-cell multi-omics sequencing data from hematopoietic malignancy samples via a post hoc concordance-based decision framework.
- Provides practical guidance for the experimental design of multiplexed single-cell multi-omics experiments, including the recommendation to combine antibody-based hashing with a SNP genotype reference for orthogonal demultiplexing.
- Reveals that SNP-based demultiplexing is more sensitive to sequencing depth than hash-based approaches, and that the combined strategy mitigates depth-dependent biases in cell-type recovery.
- Demonstrates that disagreement between demultiplexing methods contains structured diagnostic information about cell quality, with concordance categories reflecting genuine quality gradients in multiple myeloma CITE-seq samples.
- Validates the concordance framework using T cell receptor sequences as an orthogonal biological readout, with a 5.3-fold enrichment of confirmed public TCR sequences in the high-confidence stratum.
- Applies the preprocessing framework to AML patient samples across three assay types (snMultiome-seq, scRNA-seq, and scATAC-seq) and demonstrates that ATAC-based demultiplexing can resolve donor-assignment discordance.
Garcia-Valiente, R.; Triantafyllou, C.; van Schaik, B.; Jongejan, A.; Pollastro, S.; Anang, D. C.; Guikema, J. E.; de Vries, N.; Hoefsloot, H. C.; van Kampen, A. H. C.
High-throughput sequencing of B-cell and T-cell immune receptor repertoires provides unprecedented insight into adaptive immune responses. The data produced are structured by clonal relationships and somatic mutation signatures, and yield extremely rich information in sequence-derived features, including physicochemical properties and compositional patterns. However, integrated analysis across datasets, conditions, and time points remains challenging. Current analytical tools typically focus only on certain features within individual repertoires, without enabling integrated, multivariable comparisons across datasets, conditions, and time points to address their diversity and variability. Here we present AbSolution, a user-friendly and flexible interactive application for comprehensive exploration of immune repertoires and their sequence-based properties. AbSolution enables multiscale analysis of thousands of sequence-derived features across receptor regions, while accounting for V(D)J usage, clonal composition and experimental groupings. We demonstrate its utility by identifying distinct sequence-based profiles associated with dominant (highly abundant) and non-dominant B-cell clones in peripheral blood BCR repertoires from patients with idiopathic inflammatory myopathies, and with antigen-responsive T-cell populations over time in a longitudinal in vitro antigen-stimulation dataset. Through interactive, interlinked visualizations, statistical feature selection and multi-sample comparisons, AbSolution facilitates integrated feature profiling that supports the interpretation of immune selection processes and enables systematic analysis of complex repertoire datasets.
Fang, H.; Tan, T.
Background: The development of personalised mRNA cancer vaccines holds considerable promise for oncology, yet a significant translational gap persists between neoantigen identification and the selection of therapeutically impactful targets. Current approaches predominantly prioritise human leukocyte antigen (HLA) binding affinity and immunogenicity, often overlooking the systems-level biological context of the target. This can inadvertently favour immunogenic but biologically peripheral peptides that exert limited influence on tumour signalling networks, thereby constraining vaccine efficacy. Furthermore, mRNA therapeutics must satisfy additional design requirements, including favourable codon usage and favourable secondary-structure stability, which directly affect in vivo translation and half-life. A unified computational framework that integrates neoantigen discovery with network biology is therefore critically needed. Results: Here, we present PimRNA, a Priority index (Pi)-centric computational medicine framework that bridges this gap by unifying neoantigen identification, mRNA sequence optimisation, and gene interaction network analysis. First, high-confidence tumour-specific HLA class I and II neoantigenic peptides are identified from paired tumour-normal genomic and tumour transcriptomic data using NeoDisc. Second, the coding sequences of these peptides are optimised for stability and translational efficiency with LinearDesign, yielding a core set of neoantigen-encoding mRNAs. Third, a random walk with restart algorithm is applied to a knowledgebase of gene interactions to identify peripheral genes exhibiting significant network connectivity to core genes, generating a gene-predictor matrix in which each gene is assigned an affinity score reflecting its network proximity to immunogenic neoantigens. 
These scores are consolidated into a single, unified priority rating (0-5) for each gene, followed by subnetwork analysis that reveals therapeutically relevant gene modules. Application of PimRNA to breast cancer and melanoma datasets demonstrates that it successfully selects high-confidence immunogenic neoantigen candidates embedded within biologically meaningful tumour-specific networks. Conclusion: PimRNA provides a systems biology foundation for mRNA vaccine design, moving beyond isolated immunogenicity to prioritise targets that are both highly presented and central to tumour-relevant biological networks. This framework offers a generalisable strategy for the rational discovery and prioritisation of mRNA therapeutics, significantly advancing the field of computational medicine towards personalised cancer vaccines.
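The random walk with restart step mentioned in the abstract above can be sketched generically as follows. This is a textbook formulation, not PimRNA's code: the four-node toy network, the restart probability of 0.5, and all function and variable names are hypothetical.

```python
from typing import List, Sequence


def random_walk_with_restart(adjacency: Sequence[Sequence[float]],
                             seeds: Sequence[int],
                             restart: float = 0.5,
                             tol: float = 1e-10,
                             max_iter: int = 1000) -> List[float]:
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change is below tol.

    W is the column-normalised adjacency matrix and p0 puts equal mass on the
    seed nodes. Assumes every node has at least one edge; isolated nodes
    would leak probability mass in this simple version.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            # Probability flowing into node i from each neighbour j.
            flow = sum(adjacency[i][j] / col_sums[j] * p[j]
                       for j in range(n) if col_sums[j] > 0)
            nxt.append((1.0 - restart) * flow + restart * p0[i])
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical path network 0 - 1 - 2 - 3, with node 0 as the only seed.
toy_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(toy_network, seeds=[0])
```

In this toy case the scores fall off with distance from the seed, which is the sense in which such a score reflects network proximity.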
Fieux-Castagnet, A.; Waton, J.; Glukhonemykh, A.; Snow, E.; Ashokkumar, R.; Fleming, J.; Champagne, D.; Devenyns, T.; Peluffo, A.; Anagnostopoulos, C.
Protein structure prediction models (such as AlphaFold, Chai, Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding from non-binding pairs at scale. To our knowledge, there has not been an exhaustive exploration of Boltz-2 inference settings on this high-impact problem, and in this paper we set out to describe and implement a novel benchmarking framework that can accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and a randomly assigned non-cognate antigen. We developed a novel evaluation framework that systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.5 < ROC-AUC < 0.60). Among evaluated metrics, the minimum value of the interface predicted TM-score (ipTM-min) across seed-samples captured the strongest signal. Interestingly, additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ~15-20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ~8 seconds per prediction compared to shallow MSA settings. 
Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization, while confirming that modest amounts of sampling can be helpful, but not in itself sufficient to improve performance significantly as gains plateau relatively quickly.
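As a rough illustration of the ipTM-min metric and the ROC-AUC evaluation described in the abstract above, the sketch below takes the minimum ipTM over seed-samples for each antibody-antigen pair and scores cognate versus non-cognate pairs with a rank-based AUC. The ipTM values are invented toy numbers and the function names are hypothetical; nothing here calls Boltz-2 or reproduces the authors' pipeline.

```python
from typing import List, Sequence


def min_over_seeds(iptm_by_seed: Sequence[float]) -> float:
    """Aggregate one pair's per-seed ipTM values by taking the minimum."""
    return min(iptm_by_seed)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC as the probability that a positive outscores a negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy per-seed ipTM values (hypothetical, three seeds per pair).
cognate = [[0.62, 0.58, 0.71], [0.40, 0.45, 0.39], [0.80, 0.77, 0.79]]
non_cognate = [[0.55, 0.35, 0.60], [0.50, 0.48, 0.52], [0.30, 0.42, 0.33]]

pair_scores: List[float] = [min_over_seeds(p) for p in cognate + non_cognate]
pair_labels: List[int] = [1] * len(cognate) + [0] * len(non_cognate)
auc = roc_auc(pair_scores, pair_labels)
```

The toy values separate far better than the weak discrimination reported in the abstract; they only show how the metric is assembled.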
Casals-Franch, R.; Nonell, L.; Villa-Freixa, J.; Lopez Garcia de Lomana, A.
Reconstructing dynamic immune cell state transitions from single-cell transcriptomic data requires coordinated analytical strategies that capture both phenotypic progression and underlying regulatory programs. This protocol describes a step-by-step computational workflow for analyzing human tumor-infiltrating T cells using the sequential application of dimensionality reduction, pseudotime trajectory inference, regulon activity analysis, and transcription factor-transcription factor network reconstruction. The workflow outlines data preprocessing and quality control, trajectory rooting and parameter selection, branch-specific differential analysis, and the integration of regulon inference to contextualize transcriptional programs along inferred trajectories. Regulon-based TF-TF network reconstruction is used as a downstream interpretive layer to identify regulatory modules associated with distinct cell-state transitions. Publicly available at GitHub repository https://github.com/rogercasalsfr/immuno-trajectory-grn-integrative-workflow, this protocol emphasizes practical considerations including parameter sensitivity, trajectory robustness, and consistency between phenotypic and regulatory outputs. The protocol supports reproducible analysis and interpretation of immune cell dynamics in human tumor microenvironment studies using single-cell RNA sequencing data.
Cisterna Garcia, A.; Gonzalez Lopez, A. M.; Vozi, A.; Esteban, M. A.; Egli, A.; Jutzeler, C.; Palma, J.; Sanchez-Ferrer, A.; Botia, J. A.
Antimicrobial resistance (AMR) has a profound impact on animal and human health and is associated with substantial morbidity, mortality and public health costs. There is a clear need to develop novel, effective antibiotic agents, which can overcome the current AMR crisis. Antimicrobial peptides (AMPs) may offer such a solution and have attracted growing attention for their potential to combat AMR. In parallel, the growing availability of peptide sequences in public databases has stimulated the development of numerous machine learning and deep learning tools to predict antimicrobial activity computationally. However, it remains unclear how reliably these tools can be compared, as existing studies often rely on heterogeneous datasets and inconsistent evaluation protocols that may lead to data leakage and inflated performance estimates. This raises a central question: what evaluation criteria and benchmark resources are needed to enable fair, reproducible, and biologically meaningful assessment of AMP prediction tools? We address this question by focusing specifically on antibacterial peptides (ABPs). We first provide an overview of AMP databases relevant to antibacterial activity and compare their content, redundancy, and experimental metadata. We then critically assess existing computational tools for ABP prediction, highlighting key limitations related to dataset construction, affinity to certain sequences, data leakage, and inconsistent performance reporting. Based on these limitations, we propose a reference evaluation framework designed to improve comparability, reproducibility, and practical utility in ABP prediction. Finally, we provide targeted recommendations for AMP databases and future tool development to support more robust progress in the computational discovery of ABPs.
Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved model performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
Author Summary: Protein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and its related characteristics, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features alone with hybrid features, across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
Raposo, P.; Martinez Marin, J. S.; Kim, G.; Insana, G.; Jyothi, D.; Luo, J.; Tunstall, T.; Consortium, U.; Orchard, S.; Steinegger, M.; Martin, M.
Motivation: The ongoing revolution in genome sequencing is delivering an unprecedented number of genome assemblies to global repositories, resulting in an overwhelming amount of data imported to UniProt in the form of proteomes. To manage this growth sustainably, there is a need for a systematic workflow to select the best proteomes. Results: We propose a novel pipeline for cellular organisms to select the best Reference Proteomes, i.e. those that best represent the protein space of a species. The pipeline uses a clustering algorithm based on MMseqs2 to select the minimum number of Reference Proteomes whilst maximising the representation of the protein space for each species. Additionally, we aligned our viral Reference Proteomes with the exemplar genome set defined by the International Committee on Taxonomy of Viruses. Because this method ensures that all species are represented with at least one Reference Proteome, the UniProt Knowledgebase increased the number of Reference Proteomes by 36%, covering 34% more species in the Tree of Life. The UniProt Knowledgebase will mainly retain proteins from Reference Proteomes and therefore this method reduces the overall number of proteins by 43%, leading to a more concise yet representative knowledgebase. Availability and Implementation: https://www.uniprot.org/proteomes Contact: raposo@ebi.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Sharma, S.; Das, R.; Pennati, A.; Hedican, C.; Barroilhet, L.; Patankar, M. S.; Galipeau, J.
Background: Cytokines are immunomodulatory proteins that play central roles in regulating immune responses and represent attractive targets for cancer therapy. However, as single agents, cytokines have shown limited clinical benefit due to systemic toxicities and a short in vivo half-life. Our group has focused on engineering fusion cytokines (fusokines) that couple two cytokines into a single biologic to reprogram immune cell responses by enforcing non-canonical receptor engagement and signaling. A chimeric IL-6/IL-1β fusokine was engineered to test the hypothesis that enforced co-engagement of IL-6 and IL-1β signaling pathways would confer a gain-of-function phenotype in T cells and promote robust anti-tumor immunity. Here, we describe the immunomodulatory properties of IL6/1 fusokine and a method to deliver this fusokine to produce inhibition of ovarian tumor growth in a pre-clinical mouse model. Methods: Lentiviral vectors encoding murine or human IL6/1 were designed using Vector Builder and expressed in either HEK293, CHO or ID8-F3 (p53-/-) cells depending on the downstream experiment to be conducted. IL6/1 expression was validated by ELISA and flow cytometry. Effects of human IL6/1 (hIL6/1) on T cell function (proliferation, memory phenotype, activation-induced apoptosis) were monitored by flow cytometry. For in vivo studies, ID8-F3 murine ovarian cancer cells expressing mouse IL6/1 (mIL6/1) were administered intraperitoneally (I.P.) as a cell-based therapy to C57BL/6 female mice bearing established ID8-F3 luciferase tumors. Tumor progression was monitored by bioluminescence (BLI) imaging, and overall survival was evaluated. Results: hIL6/1 significantly enhanced T cell survival and selectively promoted activation and expansion of CD45RO memory T cells. mIL6/1 expressing ID8-F3 cells (ID8IL6/1) demonstrated stable transduction and sustained cytokine secretion. In vivo, ID8IL6/1 cell therapy significantly reduced tumor growth and improved overall survival compared to control groups, with 2 of 8 mice achieving complete tumor clearance. Conclusion: These findings indicate that IL6/1 fusokine enhances T cell survival and proliferation while promoting memory responses. Engineered cancer cells (ID8-F3) expressing mIL6/1 fusokine induced a strong anti-tumor response when delivered as a therapeutic vaccine in an ovarian cancer mouse model.
What is already known on this topic
- Fusokines are a class of bifunctional proteins designed to achieve synergistic immune modulation. Previous studies in our lab have shown that fusokines exhibit gain-of-function immunomodulating activity. Individually, IL-6 and IL-1β are recognized for their roles in promoting T-cell proliferation and effector function. However, the potential for a fused IL-6/1 fusokine to reprogram the immune system and elicit a superior anti-tumor response in vivo in an ovarian cancer model has not yet been studied.
What this study adds
- This study develops a novel fusion cytokine (fusokine), combining IL-6 and IL-1β, and demonstrates robust activation of T cells. In a preclinical ovarian cancer model, engineered cancer cells expressing IL6/1 used as a therapeutic vaccine showed significant tumor reduction and improved overall survival.
How this study might affect research, practice or policy
- This study demonstrates that, in comparison to individual cytokines, fusokines have greater potential to activate T cell function and, when delivered as a cell therapy, achieve clear therapeutic efficacy in an ovarian cancer model. Further translational and clinical studies may enable the development of novel and more effective fusokine cell therapy approaches for patients with ovarian cancer.
Komianos, N.; Prakash, P.
Matrixyl (palmitoyl pentapeptide-4, KTTKS core) is a collagen-stimulating peptide used in topical anti-ageing products, but its in-use efficacy is limited by poor permeation through the stratum corneum. We describe a deterministic computational workflow that combines a tournament genetic algorithm and NSGA-II with exact RDKit molecular descriptors to search the fixed-length, edit-distance-2 neighbourhood of KTTKS (3,706 candidate sequences) for analogs with descriptors more favourable for passive transdermal diffusion. The search returns a 9-member Pareto frontier that quantifies the trade-off between predicted permeability and motif preservation. Five of the nine frontier members carry the same substitution, lysine to proline at position 4 (K4P). This single change lowers the topological polar surface area by 25.6%, removes the +1 charge contributed by lysine, and reduces the functional-preservation score from 1.00 (KTTKS) to 0.67. The frontier ranking is unchanged by ±30% perturbations to the TPSA and Mw penalty weights and by a 30% increase in the LogP penalty; only a 30% reduction in the LogP penalty produces rank movement. The frontier matches the ground-truth Pareto set obtained by exhaustive enumeration of all 3,706 candidates (precision and recall both 100%). On the basis of these results we recommend three sequences for experimental validation: PTTPS (largest predicted gain), KTTPS (single-mutation, conservative), and KTTPP (backup). All code, results, and figures are released under MIT and CC BY 4.0.
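The size of the search space quoted in the abstract above can be checked with a short enumeration. The sketch assumes that the "fixed-length, edit-distance-2 neighbourhood" means at most two substitutions over the 20 standard amino acids, with the parent sequence included. Under that assumption the count is 1 + 5·19 + C(5,2)·19² = 3,706, matching the abstract. The function name is hypothetical and this is not the authors' released code.

```python
from itertools import combinations, product
from typing import Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
PARENT = "KTTKS"


def substitution_neighbourhood(parent: str, max_subs: int = 2,
                               alphabet: str = AMINO_ACIDS) -> Set[str]:
    """All same-length sequences differing from `parent` at up to
    `max_subs` positions, including the parent itself."""
    neighbours = {parent}
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(parent)), k):
            # At each chosen position, allow every residue except the original.
            choices = [[aa for aa in alphabet if aa != parent[pos]]
                       for pos in positions]
            for replacement in product(*choices):
                seq = list(parent)
                for pos, aa in zip(positions, replacement):
                    seq[pos] = aa
                neighbours.add("".join(seq))
    return neighbours


candidates = substitution_neighbourhood(PARENT)
print(len(candidates))  # 3706
```

The three sequences recommended in the abstract (PTTPS, KTTPS, KTTPP) all fall inside this set, each being one or two substitutions away from KTTKS.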
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.
Background: The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous estimation of model accuracy (EMA) methodologies. Results: Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structure models. Our method employs a structure-sequence cross-consistency mechanism to evaluate the bidirectional compatibility between the predicted structure and the input sequence, enabling comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in Pearson correlation and 49.0% in Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced in the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Conclusions: Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
Gingrich, P. W.; Biswas, A.; Mica, I. L.; Brammer, K. M.; Shu, Z.; Maxwell, D. S.; Russell, K. P.; Al-Lazikani, B.
Summary: Reliable structure-based prediction of small-molecule druggability is hindered by a fundamental labeling problem. Experimentally confirmed liganded sites (positives) are observable, but credible "undruggable" pockets (negatives) are almost impossible to define. Standard supervised machine learning consequently relies on arbitrary definitions of undruggable, leading to bias and false negatives. Here we introduce PocketBagger, a positive-unlabeled (PU) learning framework for pocket druggability prediction trained exclusively on experimentally determined Protein Data Bank (PDB) structures. PocketBagger uses PU bagging to learn key features associated with reliable druggable pockets and considers all remaining pockets in the structurally characterized proteome as unlabeled. We demonstrate the capability of PocketBagger through the training of a simple Random Forest classifier and demonstrate its power in recall (0.804), even when challenged with increasingly difficult generalizability assessments and entire protein-family hold outs. We benchmark and demonstrate the added value of PU learning by comparing PocketBagger to a leading deep-learning predictor. However, PocketBagger is intended to be used as a framework for any model architecture. Along with the code, the data generated by PocketBagger are deployed in canSAR.ai, providing scalable, generalizable pocket druggability predictions to the drug discovery community.
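The PU bagging idea named in the abstract above can be sketched in general terms as follows. This is a minimal illustration under stated assumptions, not PocketBagger's implementation: the abstract reports a Random Forest base classifier, whereas this sketch swaps in a nearest-centroid rule so that it runs on the Python standard library alone. The function name `pu_bagging_scores` and its parameters are hypothetical.

```python
import random
from statistics import fmean


def centroid(rows):
    """Per-feature mean of a list of feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Score each unlabeled item by averaging out-of-bag votes.

    Each round draws a bootstrap sample of the unlabeled items, treats
    it as provisional negatives, fits a base learner (here a
    nearest-centroid rule, an assumption of this sketch), and lets that
    learner vote on the unlabeled items left out of the sample.
    """
    rng = random.Random(seed)
    pos_c = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Bootstrap: as many draws (with replacement) as there are positives.
        bag = {rng.randrange(len(unlabeled)) for _ in range(len(positives))}
        neg_c = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in bag:
                continue  # only out-of-bag items receive a vote this round
            totals[i] += 1.0 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0.0
            counts[i] += 1
    # None marks an item that was never out of bag.
    return [t / c if c else None for t, c in zip(totals, counts)]
```

Because no unlabeled item is ever fixed as a negative, an item that resembles the known positives keeps a high averaged score even though some rounds sampled it as a provisional negative.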
Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.
Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.
Lasch, P.
Over the last two decades, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-ToF MS) has become the standard method for identifying bacteria and has found a wide range of applications, especially in clinical microbiology. The method's high taxonomic resolution, minimal sample preparation, and complete, ready-to-use commercial systems, which include instrumentation, experimental protocols, spectral databases, and identification analysis software, were key factors in the success of MALDI-ToF MS as the standard for identifying microorganisms in routine diagnostic laboratories. However, despite the availability of these commercial solutions, there is also a growing need for efficient, cost-effective, vendor-neutral databases and analysis tools. These tools would enable the compilation of user-defined mass spectral databases and the testing of new analysis methods and algorithms, particularly in an academic context. To this end, MicrobeMS software has been developed to cover all stages of MALDI-ToF MS-based identification analysis. MicrobeMS is an easy-to-use desktop application for analyzing mass spectra from microorganisms and performing tasks related to spectrum database compilation. It includes routines for direct data import and export, biomarker peak searches, management of spectrum metadata, testing of spectrum quality, supervised and unsupervised identification analysis and intuitive result display. MicrobeMS is implemented in MATLAB and is freely available as MATLAB pcode for Windows and Linux, as well as a standalone application.
Over the last fifteen years, the software has undergone continuous development and is now used routinely in various settings at the Centre for Biological Threats and Special Pathogens (ZBS) at the Robert Koch Institute (RKI) in Berlin, Germany, for example in supporting spectrum database compilation, to identify special or rare pathogenic bacteria by advanced identification analysis concepts, or to test in silico MALDI-ToF MS databases derived from microbial genomes. In this software publication the versatility and capabilities of MicrobeMS are demonstrated using a test data set from highly pathogenic bacteria (HPB) which has been obtained as part of a published European Union (EU)-funded External Quality Assurance Exercise (EQAE). MicrobeMS and HPB test data can both be downloaded from https://wiki.microbe-ms.com/. The goal of this software publication is twofold: to raise awareness of MicrobeMS within the scientific community and to encourage the testing of the software and custom-developed MALDI-ToF MS databases of the RKI, which are published at the ZENODO data repository (https://doi.org/10.5281/zenodo.7702374).
Garcia-Valiente, R.; Langton, S. H.; van Kampen, A. H. C.
Reproducibility and transparency in computational analyses are essential in science, although achieving these goals often requires significant knowledge and systematic organization. Graphical interactive applications simplify the conduct of analyses and make them accessible to a broader audience. However, there is currently no consensus on how, and to what extent, to implement reproducibility in interactive applications. We recently developed AbSolution, a user-friendly and flexible interactive web-based R Shiny application for exploring immune repertoires and their sequence-based features, and we established the ENCORE framework to enhance transparency and reproducibility by guiding researchers in structuring and documenting computational projects. In this work, as a proof-of-concept, we integrate AbSolution, ENCORE and specific R packages to address reproducibility challenges. This enables a single-step export of raw, processed, and meta-data, the software environment, the underlying generated code and an HTML report containing results and figures, operating system, hardware and R session details, and researcher notes. Its reproducibility has been independently validated by the CODECHECK initiative. This paper demonstrates how the combination of several approaches can improve and automate reproducibility of interactive applications.
Pawar, P.; Samarasinghe, S.
Tuberculosis (TB) remains a formidable global health challenge, exacerbated by the emergence of drug-resistant Mycobacterium tuberculosis strains that threaten to render existing drug therapies and vaccines ineffective. Despite the availability of the Bacillus Calmette-Guerin (BCG) vaccine, its limited efficacy, primarily in infants and young children, falls short of reducing TB prevalence or offering adequate protection to adults. Therefore, developing a new TB vaccine with enhanced efficacy and the capability to generate a robust reservoir of memory cells is essential. Addressing the challenge of drug-resistant tuberculosis requires a deep understanding of bacterial evolution and developing robust countermeasures. This study aims to design a next-generation TB vaccine that provides broad-spectrum protection against various Mycobacterium tuberculosis strains, including drug-resistant ones. By conducting an in-depth investigation into pathogen-human interactions, the research proposes a holistic framework that leverages computational vaccinology to tackle challenges posed by pathogen polymorphism and overcome the limitations of conventional vaccines. By targeting conserved proteins across diverse TB strains and enhancing both humoral and cell-mediated immunity, this study proposes a new strategy for an epitope-based vaccine that provides long-lasting, universal coverage. An extensive proteomic, reverse vaccinology and immunoinformatics analysis of 159 TB strains yielded 27 highly conserved, immunogenic, non-toxic, and non-allergenic epitopes. These epitopes, consisting of 14 cytotoxic T-lymphocyte (CTL), 5 helper T-lymphocyte (HTL), and 8 B-cell epitopes, were used to construct a three-dimensional, multi-epitope TB vaccine designed based on a new concept introduced in this research for maximising vaccine efficacy.
Molecular docking and immune simulation studies demonstrated a significant affinity between the vaccine constructs and toll-like receptors, indicating a strong potential for effective immune system engagement. The crucial features of the epitope-based TB vaccine constructed in this research include sequence conservancy, robust antigenicity, exclusion of self-peptides and potential for diverse allelic interactions. The proposed epitope-based vaccine is poised to be highly effective, safe, and capable of providing universal coverage, potentially paving the way for global TB eradication. Validation in laboratory and clinical settings will be essential to confirm its efficacy and real-world applicability.
Weber, J.; Parajuli, G.; Wang, S.; Ratner, V.; Ma, X.; Shoshan, Y.; Zhang, L.; Morrone, J.; Raboh, M.; Hexter, E.; Parthasarathy, P. B.; Gaughan, C.; Makarov, V.; Chu, L.; Hasgur, S.; Juric, I.; Diaz, M.; Srivastava, R.; Knauf, J.; Hassan, K.; Cornell, W.; Alban, T.; Chan, T.
T cell receptors (TCRs) are critical for immune surveillance and successful adaptive immune response against foreign antigens. TCRs drive this key arm of the immune system through recognition of peptide epitopes presented on MHC complexes. However, they are limited due to their stochastic nature and generation via genetic recombination. In silico design of functional TCRs that target defined peptide epitopes would be of considerable utility but has up until now been unsuccessful. Here, we develop an artificial intelligence (AI)-powered approach using a hybrid physics-based simulation and generative AI that successfully engineers TCRs against defined epitopes presented by MHC-I. We use this approach to design TCRs against two cancer antigens, a HERC1 neoantigen and an immunogenic neoepitope in mutant EGFR. We engineer multiple TCRs against the HERC1 neoantigen which activate T cells in response to exposure to peptide-MHC I and kill cancer cells more effectively than a patient-derived TCR. In addition, we used generative AI to design functional TCRs that target the EGFR T790M neoantigen, engineering greater specificity against the mutant sequence. We present an AI-based approach to TCR design with broad utility for efforts to engineer TCRs and for the development of new cell therapies. One sentence summary: Artificial intelligence-based approach enables the directed engineering of functional TCRs with enhanced features that target cancer neoantigens.