
ImmunoInformatics

Elsevier BV

Preprints posted in the last 30 days, ranked by how well they match ImmunoInformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.

1
IMMREP25: Unseen Peptides

Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.

2026-04-01 bioinformatics 10.64898/2026.03.30.715276 medRxiv
Top 0.1%
14.0%

T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed multiple approaches can predict TCR-pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that especially for unseen peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.

2
Computational Development of a GluN1 Synthetic Peptide Mimetic for Neutralization of Autoantibodies in Anti-NMDAR Autoimmune Encephalitis

Misra, P.; Movva, N. S. V.; Shah, R.

2026-03-30 bioinformatics 10.64898/2026.03.26.714496 medRxiv
Top 0.1%
6.3%

Purpose/Objective: This study aimed to design and computationally evaluate a synthetic GluN1-mimetic peptide as a decoy to bind and neutralize pathogenic autoantibodies in anti-NMDA receptor (NMDAR) encephalitis, a severe autoimmune neurological disorder affecting approximately 1.5 per million individuals annually. Methods: Key GluN1 epitope residues (351-390 of the amino-terminal domain) were identified from crystallographic evidence and patient-derived antibody binding studies. Multiple peptide variants were rationally designed to mimic the antibody-binding interface. AlphaFold2 was used to predict peptide structures. Rigid-body docking simulations were conducted with HADDOCK 2.4 to model peptide-antibody complexes, and binding affinities were quantified using PRODIGY. A scrambled peptide control was included to establish docking specificity. Results: The top-performing peptide demonstrated favorable predicted binding (ΔG = -21.5 kcal/mol, Kd = 1.7 x 10-{superscript 1} M) with an average pLDDT score of 90%, a buried surface area of 3,255.5 Å², and 18 intermolecular hydrogen bonds. Relative to the scrambled control (ΔG = -8.3 kcal/mol), the designed peptide showed substantially stronger predicted binding. Conclusion/Implications: These results support the validity of an epitope-mimicry design strategy and establish a scalable computational framework for prioritizing peptide decoy candidates applicable to other antibody-mediated autoimmune disorders. Experimental validation remains necessary to confirm real-world efficacy.
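
The abstract reports both a predicted binding free energy and a dissociation constant. The two are linked by the standard thermodynamic relation ΔG = RT ln(Kd). The sketch below applies that relation to the two ΔG values quoted above. The temperature of 25 °C and the function name `kd_from_delta_g` are assumptions made for illustration; the abstract does not state the settings used in the study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g_kcal_per_mol: float, temp_celsius: float = 25.0) -> float:
    """Convert a binding free energy (kcal/mol) to a dissociation constant (M).

    Uses Kd = exp(dG / (R * T)); the 25 C default is an assumption.
    """
    temp_kelvin = temp_celsius + 273.15
    return math.exp(delta_g_kcal_per_mol / (R_KCAL * temp_kelvin))


designed_kd = kd_from_delta_g(-21.5)   # designed peptide, value from the abstract
scrambled_kd = kd_from_delta_g(-8.3)   # scrambled control, value from the abstract
fold_difference = scrambled_kd / designed_kd

print(f"designed:  {designed_kd:.2e} M")
print(f"scrambled: {scrambled_kd:.2e} M")
print(f"fold difference: {fold_difference:.2e}")
```

Because Kd depends exponentially on ΔG, the 13.2 kcal/mol gap between the designed and scrambled peptides translates into a difference of many orders of magnitude in predicted Kd.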

3
KyDab - a comprehensive database of antibody discovery selection campaigns.

Zhou, Q.; Chomicz, D.; Melvin, D.; Griffiths, M.; Yahiya, S.; Reece, S.; Le Pannerer, M.-M.; Krawczyk, K.

2026-03-27 bioinformatics 10.64898/2026.03.25.713450 medRxiv
Top 0.1%
4.9%

Preclinical antibody discovery relies on progressive screening and down-selection of candidate antibodies from large immune repertoires, yet this critical process is poorly represented in existing public databases. Here we introduce KyDab (Kymouse Antibody Database), a well-curated database of antibody discovery selection data generated using standardized workflows on the Kymouse humanized mouse platform. The current release includes 11 immunisation studies in Kymouse platform mice covering 51 immunogens, more than 120,000 paired heavy-light chain sequences, and binding measurements for a selected subset of experimentally characterized clones. By capturing full-funnel selection data with consistent metadata and both positive and negative experimental outcomes, KyDab provides a valuable data resource for the development and evaluation of artificial intelligence models for antibody discovery. KyDab is accessible at https://kydab.naturalantibody.com, and the database will be continuously updated as new datasets become available.

4
GRIMM-II: A Two-Stage Real-Time Algorithm for Nine-Locus HLA Imputation and Matching with Up to Three Mismatches

Kirshenboim, O.; Kabya, A.; Yehezkel-Imra, R.; Tshuva, Y.; Maiers, M.; Gragert, L.; Bashyal, P.; Israeli, S.; Louzoun, Y.

2026-03-31 bioinformatics 10.64898/2026.03.28.715027 medRxiv
Top 0.1%
1.6%

Background: The success of hematopoietic stem cell transplantation (HSCT) depends critically on human leukocyte antigen (HLA) matching between donor and recipient. While traditional matching focuses on five classical HLA loci (A, B, C, DRB1, DQB1), clinical practice increasingly considers extended typing at nine loci, including DPA1, DQA1, DPB1, and DRB3/4/5. Furthermore, emerging evidence supports transplantation with up to three HLA mismatches under post-transplant cyclophosphamide (PTCy) regimens. However, current donor search algorithms cannot efficiently identify donors with multiple mismatches across extended HLA loci in real-time. Methods: We developed GRIMM-II (GRaph IMputation and Matching, version II), which comprises two novel algorithms: ML-GRIM (Multi-Locus GRIM) for HLA imputation across multiple loci, and ML-GRMA (Multi-Locus GRMA) for real-time donor-patient matching with up to three mismatches. Both algorithms employ a two-stage approach that combines efficient candidate reduction through graph-theoretic frameworks with detailed genotype comparison. ML-GRIM partitions genotypes into class I (HLA-A, B, C) and class II (remaining loci) components, enabling memory-efficient storage and rapid candidate identification. ML-GRMA searches a pre-imputed donor graph composed of donor genotypes and their sub-components, then computes asymmetric graft-versus-host (GvH) and host-versus-graft (HvG) mismatch probabilities to provide clinically relevant compatibility assessments. Both imputation and matching tools are available as a web application at https://grimmard.math.biu.ac.il/ and through GitHub repositories at https://github.com/nmdp-bioinformatics/py-graph-imputation (imputation) and https://github.com/nmdp-bioinformatics/py-graph-match (matching).
Results: We validated ML-GRMA and ML-GRIM using the WMDA3 (World Marrow Donor Association) validation dataset, successfully reproducing all previously reported matches while identifying numerous additional candidate donors not detected by previous algorithms. Further validation of ML-GRMA using 3,000 patients with artificially introduced mismatches (0-3 allele substitutions) demonstrated 100% sensitivity and specificity in identifying matching donors at expected mismatch levels. We validated ML-GRIM using simulated nine-locus typings derived from 8,078,224 US donors in the NMDP registry. The algorithm successfully imputed genotypes across variable numbers of typed loci while incorporating multiethnic haplotype frequencies. The algorithm achieved real-time performance with typical imputation times under one second and matching times of 1-13 seconds per patient for up to three mismatches, even when searching databases exceeding 8 million donors. Notably, ML-GRMA identified substantially more potentially suitable donors than traditional algorithms by accounting for the biological reality that GvH and HvG mismatches often differ, particularly for donors homozygous at specific loci. To evaluate ML-GRIM performance with low-resolution typing, we tested it on simulated 3-locus typings from the same population. The resulting imputation accuracy correlated with the mutual information between typed loci and complete genotypes. Conclusions: GRIMM-II provides a scalable, memory-efficient solution for nine-locus HLA imputation and real-time identification of donors with up to three mismatches. The graph-based framework supports dynamic registry updates and can readily accommodate additional HLA loci and matching criteria as clinical knowledge evolves.
By expanding the pool of acceptable donors while maintaining computational efficiency, GRIMM-II addresses a critical need in contemporary transplantation practice, particularly for patients from underrepresented ethnic minorities who face lower probabilities of finding perfectly matched donors.
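
The abstract notes that GvH and HvG mismatch counts can differ, particularly when the donor is homozygous at a locus. The sketch below illustrates that asymmetry for fully resolved (non-probabilistic) genotypes. It is not the ML-GRMA algorithm, which works on imputed genotype probabilities and a donor graph. The function names are hypothetical, and the counting convention (distinct alleles present in one party and absent in the other) is an assumption.

```python
def directional_mismatches(patient_alleles, donor_alleles):
    """Count mismatches at one locus in each direction.

    GvH direction: patient alleles that the donor lacks.
    HvG direction: donor alleles that the patient lacks.
    """
    patient, donor = set(patient_alleles), set(donor_alleles)
    return len(patient - donor), len(donor - patient)


def genotype_mismatches(patient, donor):
    """Sum directional mismatches over all loci shared by both genotypes."""
    gvh_total = hvg_total = 0
    for locus in patient.keys() & donor.keys():
        gvh, hvg = directional_mismatches(patient[locus], donor[locus])
        gvh_total += gvh
        hvg_total += hvg
    return gvh_total, hvg_total


# Illustrative genotypes: the donor is homozygous at HLA-A.
patient = {"A": ("A*01:01", "A*02:01"), "B": ("B*07:02", "B*08:01")}
donor = {"A": ("A*01:01", "A*01:01"), "B": ("B*07:02", "B*08:01")}

gvh, hvg = genotype_mismatches(patient, donor)
print(f"GvH mismatches: {gvh}, HvG mismatches: {hvg}")
```

Here the patient carries an allele the homozygous donor lacks, so the pair is mismatched in the GvH direction only. A symmetric mismatch count would not distinguish this case from a bidirectional mismatch.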

5
immunoPETE: A DNA-based integrated B-cell and T-cell receptor profiling platform

Zhao, H.; Mirebrahim, H.; Telman, D.; Dannebaum, R.; McNamara, S.; Tabari, E.; Lin, H.; Rubelt, F.; Berka, J.; Luong, K.; Joseph, M.; Bryan, R.; Ward, D.; Hayday, A.; Utiramerur, S.; Kumar, D.; Asgharian, H.

2026-03-20 immunology 10.64898/2026.03.17.712532 medRxiv
Top 0.1%
1.5%

The vast diversity of B and T cell receptors generated through the recombination of Variable (V), Diversity (D), and Joining (J) gene segments plays a critical role in adaptive immunity. Profiling immune repertoires at the DNA level provides a robust and stable approach to capture the clonal composition of these receptors. immunoPETE is an assay designed to target recombined human T-cell Receptor Beta (TRB), T-cell Receptor Delta (TRD), and Immunoglobulin Heavy (IGH) chain genes directly from genomic DNA. Simultaneous profiling of B and T cell receptor chains in a single reaction provides internally normalized clone counts and facilitates the study of B-T cell interactions. Full-length amplicon consensus sequences representative of original template DNA molecules are accurately reconstructed using Unique Molecular Identifiers (UMIs). An in-house pipeline compiles VDJ rearrangements from the Complementarity-Determining Region 3 (CDR3) of TRB, TRD and IGH chains into comprehensive readouts at cell-level resolution. In this study, we describe the immunoPETE end-to-end workflow, followed by a comprehensive benchmarking of its performance in adaptive immune profiling. Where applicable, we used both natural and contrived samples and characterized the assay's accuracy, linearity, and reproducibility across several metrics: retrieving CDR3 sequences, determining B and T cell ratios, total cell count, yield, fraction of functional rearrangements, clonal diversity, composition of dominant clones, pairwise similarity, and V/J gene usage frequencies. Furthermore, we assessed its quantitative limits concerning the total number of lymphocytes and the detection of rare clones. As an example of its applications, we show that adding immune biomarkers extracted from immunoPETE data to clinical factors improves prediction of progression-free survival in a cohort of non-muscle invasive bladder cancer (NMIBC) patients.
Finally, we discuss the broad applications of immunoPETE in the study of aging, cancers, infections, and autoimmune disorders with reference to select published studies.
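
The abstract describes reconstructing consensus sequences from reads that share a Unique Molecular Identifier. The sketch below shows the general idea with a per-position majority vote. It is not the immunoPETE in-house pipeline: the function name `umi_consensus` is hypothetical, and a real workflow would typically also align reads, handle indels, and correct errors in the UMIs themselves.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse reads sharing a UMI into one consensus sequence per UMI.

    reads: iterable of (umi, sequence) pairs.
    Returns a dict mapping each UMI to its majority-vote consensus.
    """
    groups = defaultdict(list)
    for umi, sequence in reads:
        groups[umi].append(sequence)

    consensus = {}
    for umi, sequences in groups.items():
        # Keep only reads of the most common length so columns line up.
        common_length = Counter(len(s) for s in sequences).most_common(1)[0][0]
        same_length = [s for s in sequences if len(s) == common_length]
        # Majority vote at each position across the retained reads.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*same_length)
        )
    return consensus


# Toy reads: the third read for UMI "AAT" carries a single sequencing error.
reads = [
    ("AAT", "ACGTACGT"),
    ("AAT", "ACGTACGT"),
    ("AAT", "ACCTACGT"),
    ("GGC", "TTGACCAA"),
]
collapsed = umi_consensus(reads)
print(collapsed)
```

Each UMI then contributes one sequence regardless of how many times its template was amplified, which is what allows clone counts to reflect original DNA molecules rather than read depth.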

6
Effects of protein interface mutations on protein quality and affinity

de Kanter, J. K.; Smorodina, E.; Minnegalieva, A.; Arts, M.; Blaabjerg, L. M.; Frolenkova, M.; Rawat, P.; Wolfram, L.; Britze, H.; Wilke, Y.; Weissenborn, L.; Lindenburg, L.; Engelhart, E.; McGowan, K. L.; Emerson, R.; Lopez, R.; van Bemmel, J. G.; Demharter, S.; Spreafico, R.; Greiff, V.

2026-03-26 molecular biology 10.64898/2026.03.24.713863 medRxiv
Top 0.1%
1.3%

Accurately modeling antibody-antigen interactions requires distinguishing intrinsic binding affinity ("protein-interaction") from protein biophysical properties ("protein-quality"), including folding, stability, and expression. However, high-throughput mutational measurements commonly used to train and benchmark computational models often conflate these effects, obscuring the true determinants of molecular recognition. Here, we present an experimental and analytical framework to disentangle protein-interaction effects from protein-quality effects in single-domain antibody (VHH)-antigen binding. Using a large-scale deep mutational scanning (DMS) dataset spanning four VHH-antigen complexes, with single and double mutations in both partners, we introduce control binders to quantify protein-quality changes independently of protein-interaction. This enables decomposition of experimentally measured affinity into protein-interaction and protein-quality components at scale. Leveraging the disentangled dataset, we evaluated state-of-the-art structure- and sequence-based models for protein-quality and protein-interaction prediction and show that their performance largely reflects protein-quality rather than protein-interaction effects. Our results highlight a major confounder in current datasets and suggest that accounting for protein-quality will be essential for training next-generation affinity-prediction models.

Nomenclature

Antibody-related terms
- Primary VHH: The VHH of a VHH-antigen complex for which the paratope and the epitope were mutated.
- Control VHH: A second VHH that binds to the same antigen as the primary VHH but has non-overlapping epitope positions and therefore does not bind to any of the mutated antigen positions.

Affinity-related terms
- Real affinity: "The strength of the interaction between two [...] molecules that bind reversibly (interact)" [1]. In the context of antibody-antigen binding, it quantifies interactions between active proteins (which are expressed and correctly folded [2] and are therefore functionally and biologically active; see below). It is commonly quantified by the equilibrium dissociation constant, KD.
- Observed affinity (°KD): The interaction strength experimentally measured between two molecules. Unlike real affinity, this value is confounded by the biophysical properties of the individual binding partners, specifically their folding, stability, and expression levels. Consequently, the observed affinity often differs from the real/intrinsic affinity if a significant fraction of the protein population is inactive [3]. NOTE: Unless otherwise specified, °KD is reported in - log10 space. For example, a °KD of -9 corresponds to 10^-9 M or 1 nM.
- Change in observed affinity (Δ°KD): The shift in the observed affinity between two proteins upon mutation, reported as the log10-transformed fold change. A value of 1 reflects a 10-fold difference, a value of 2 a 100-fold difference, etc. This aggregate change resolves into two distinct biophysical components [2, 4]:
  - Protein-interaction change: The change in the intrinsic thermodynamic affinity between the two binding partners, each in its active state (i.e., the specific change in interface Gibbs free energy because both enthalpy and entropy are considered).
  - Protein-quality change: The change in the fraction of the mutated protein population that is biologically active - meaning it is expressed, correctly folded, and stable [2, 5].
    - Folding: The process that guides the polypeptide chain toward its native conformation, which is a prerequisite for forming a functional binding site.
    - Stability: The thermodynamic capacity to maintain the folded structure over time and under physiological conditions. Stability (decrease in Gibbs free energy from the unfolded to the folded state) ensures the binding interface remains intact and prevents competing processes such as aggregation [6].
    - Expression: The steady-state abundance of the protein. This is largely dependent on proper folding and stability, as cellular quality control mechanisms degrade proteins that fail to fold or remain stable at functional concentrations.
- Change in relative affinity (ΔΔ°KD): The difference between the Δ°KD of the primary VHH compared to the control VHH for a given epitope mutation.

Model-related terms
- ESM-IF1 sc: Single-chain (sc) structure-conditioned inverse folding model (ESM-IF1), using the isolated monomer structure of the mutated protein: either the VHH or the antigen [7].
- ESM-IF1 mc: Multi-chain (mc) structure-conditioned model (ESM-IF1), using the full complex structure (both antibody and antigen) [7].
- Stability prediction score: Score that represents the predicted change in stability based on a single mutation, normally represented as ΔΔG.
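
The nomenclature defines the change in relative affinity as the difference between the observed-affinity shifts of the primary and control VHH for the same epitope mutation. The sketch below works through that arithmetic. The function names, the example values, and the sign convention (mutant minus wild type, primary minus control, with values in log10 of KD in molar units) are assumptions for illustration, not the authors' code.

```python
def delta_observed_kd(mutant_log10_kd: float, wildtype_log10_kd: float) -> float:
    """Shift in observed affinity upon mutation, in log10 units.

    Assumed convention: positive values mean weaker observed binding.
    """
    return mutant_log10_kd - wildtype_log10_kd


def relative_affinity_change(primary_delta: float, control_delta: float) -> float:
    """Primary-VHH shift minus control-VHH shift for one epitope mutation."""
    return primary_delta - control_delta


# Hypothetical measurements for a single antigen mutation.
primary_delta = delta_observed_kd(mutant_log10_kd=-7.0, wildtype_log10_kd=-9.0)
control_delta = delta_observed_kd(mutant_log10_kd=-8.0, wildtype_log10_kd=-8.5)
relative_change = relative_affinity_change(primary_delta, control_delta)

print(f"primary shift:   {primary_delta} log10 units")
print(f"control shift:   {control_delta} log10 units")
print(f"relative change: {relative_change} log10 units")
```

Because the control VHH binds away from the mutated positions, its shift is read as the protein-quality effect of the mutation on the antigen. Subtracting it leaves the part of the primary shift attributed to the interface itself.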

7
Comprehensive characterization of V(D)J recombination from long-read transcriptomic data with VDJcraft

Hu, K.; Rosenberg, A. F.; Song, Y.; Fan, C.-H.; Peng, Z.; Gao, M.; Chong, Z.

2026-04-05 bioinformatics 10.64898/2026.04.01.715879 medRxiv
Top 0.1%
1.2%

V(D)J recombination generates antigen receptor diversity in developing B and T cells. Long-read transcriptome technologies (e.g., PacBio Iso-Seq, Nanopore RNA/cDNA) capture full-length transcripts and thus resolve V(D)J events more accurately than short-read platforms. However, existing short-read tools are not applicable to or optimized for long-read data. We developed VDJcraft, the first integrated pipeline designed for V(D)J recombination analysis using long-read transcriptome sequencing data. The workflow uses a two-pass alignment strategy: global alignment to the GENCODE reference with minimap2, followed by local realignment and annotation using the international ImMunoGeneTics information system (IMGT). A customized module enhances D-gene detection sensitivity and positional precision. Sequencing errors are reduced through consensus-based correction toward the predominant subclass. Antigen-binding regions are annotated using IMGT-defined motifs to characterize CDRs and binding site composition. VDJcraft was validated on simulated and Human Genome Structural Variation Consortium (HGSVC) datasets and applied to disease datasets. It accurately recovered full-length V(D)J-C sequences and outperformed existing methods in gene detection and recombination accuracy. Long-read calls also showed significantly higher concordance with high-confidence short-read calls (Mann-Whitney U test, p = 1.55 × 10^-4). Additionally, we identified 31 putative novel gene subclasses absent from the IMGT database from HGSVC datasets. Analyses of longitudinal blood samples from a COVID-19 patient revealed distinct V(D)J recombination patterns and segment enrichment, characterized by increased IGHV1-2 usage, enrichment of the IGHV3-7/IGHD6-9/IGHJ5_02 rearranged clonotype, and a transient peak in IgG2 levels at day 4 followed by a gradual return to baseline.
In conclusion, VDJcraft provides a robust framework for long-read V(D)J characterization and enables the discovery of disease-associated immune signatures.

8
MHCBind: A Pan- and Allele-Specific Model for Predicting Class I MHC-Peptide Binding Affinity

Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.

2026-03-23 bioinformatics 10.64898/2026.03.20.713120 medRxiv
Top 0.2%
0.8%

Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and implicitly model the interaction between the two molecules. To address these limitations, we introduce MHCBind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHCBind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a precise binding affinity score. In rigorous comparative benchmarks against recent variants, such as NetMHCpan, MHCFlurry, and MHCnuggets, MHCBind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides with a superior Spearman's correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
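
The abstract compares models using RMSE, Pearson correlation (PCC), and Spearman correlation (SCC). The sketch below gives plain-Python definitions of the three metrics so the distinction is concrete: RMSE and PCC depend on the predicted values themselves, while SCC depends only on their ranking. The affinity values are invented for illustration and are not from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation (SCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical normalized affinities for four peptides.
measured = [0.1, 0.4, 0.5, 0.9]
predicted = [0.2, 0.3, 0.6, 0.8]

print(f"RMSE: {rmse(measured, predicted):.4f}")
print(f"PCC:  {pearson(measured, predicted):.4f}")
print(f"SCC:  {spearman(measured, predicted):.4f}")
```

In this toy case every prediction is off by 0.1, yet the peptides are ranked in exactly the right order, so SCC is perfect while RMSE is not zero. That is why ranking quality is often reported separately for peptide prioritization tasks.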

9
Dynamic multimodal survival prediction in multiple myeloma integrating gene expression, longitudinal laboratories, and treatment history

JIA, S.; Lysenko, A.; Boroevich, K. A.; Sharma, A.; Tsunoda, T.

2026-04-01 bioinformatics 10.64898/2026.03.30.715136 medRxiv
Top 0.3%
0.5%

Prognostic stratification in multiple myeloma (MM) relies on staging systems that assign patients to fixed categories at diagnosis and discard the temporal information that accumulates during treatment. We developed a dynamic multimodal framework that predicts residual overall survival using observation windows ranging from 1 to 18 months post-diagnosis. The model integrates DeepInsight-transformed gene expression representation, longitudinal laboratory measurement trajectories across 10 analytes, and treatment history for three drug classes through an adaptive fusion mechanism that accounts for missing clinical observations. On the MMRF CoMMpass cohort (n = 752), five-fold cross-validation yielded a concordance index (C-index) of 0.773 ± 0.024 and a time-dependent AUC at a 1-year prediction horizon (tdAUC1yr) of 0.789 ± 0.021, outperforming all evaluated baseline methods including DeepSurv (0.633 ± 0.095) and random survival forests (0.636 ± 0.024) on matched cross-validation splits. Modality ablation identified longitudinal laboratory measurements as the strongest individual contributor (C-index 0.693); the DeepInsight spatial encoding of gene expression yielded higher discrimination than a multilayer perceptron (MLP) baseline operating on the same features (0.624 vs. 0.596). Kaplan-Meier analysis showed significant prognostic group separation at all primary landmarks (log-rank p < 0.001; hazard ratios 3.46-3.93). A distilled student model retaining only the DeepInsight representation and five baseline clinical features achieved C-index 0.672 and tdAUC1yr 0.740 on an independent microarray cohort (GSE24080, n = 507) without retraining. Interpretability analysis identified prognostic associations consistent with established myeloma biology, including ubiquitin-proteasome pathway genes, endoplasmic reticulum stress markers, and Interferon Alpha Response pathway enrichment.
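
The headline metric in this abstract is the concordance index. The sketch below is a minimal version of Harrell's C-index for right-censored data, shown to make the metric concrete. It is not the authors' evaluation code: the function name and toy data are hypothetical, and the choice to skip pairs with tied event times is a simplifying assumption.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: follow-up time per subject.
    events: 1 if the event (death) was observed, 0 if censored.
    risk_scores: model output where a higher score means higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on the subject with an observed event
        for j in range(n):
            if times[i] < times[j]:  # subject i failed before j was last seen
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third subject is censored at month 15.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
well_ordered = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
poorly_ordered = concordance_index(times, events, [0.1, 0.7, 0.4, 0.9])

print(f"well-ordered risks:   {well_ordered:.3f}")
print(f"poorly ordered risks: {poorly_ordered:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported values of 0.773 for the full model and roughly 0.63 for the baselines.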

10
Structure-Guided Computational Analysis of Linker effects in an scFv Targeting Guanylyl Cyclase C

Melo, R.; Viegas, T.

2026-04-01 bioinformatics 10.64898/2026.03.30.714862 medRxiv
Top 0.3%
0.4%

Single-chain variable fragments (scFvs) are widely used in diagnostic and therapeutic applications. These antibody fragments comprise two antibody variable domains connected by a flexible peptide linker whose properties critically influence folding, stability, oligomeric state, and antigen-binding. Therefore, careful linker selection represents a key step in scFv design. Guanylyl Cyclase C (GUCY2C) is a tumor-associated cell surface receptor expressed in gastrointestinal malignancies, including more than 90% of colorectal cancer (CRC) cases across all disease stages. Its restricted physiological expression pattern makes GUCY2C an attractive target for immunotherapy and precision oncology therapies. Here, we investigated the structural and functional consequences of incorporating alternative linker designs into an anti-GUCY2C scFv. Using molecular modeling, protein-protein docking, and molecular dynamics (MD) simulations, we evaluated the conformational stability, interdomain organization, and antigen-binding interactions of each construct. Our results provide a dynamic, structure-based assessment of how linker composition influences GUCY2C recognition and scFv structural behavior. Furthermore, this work establishes a computational framework for the rational optimization of GUCY2C-targeted antibody fragments.

11
ABAG-Rank: Improving Model Selection of AlphaFold Antibody-Antigen Complexes by Learning to Rank

Tadiello, M.; Ludaic, M.; Viliuga, V.; Elofsson, A.

2026-03-19 bioinformatics 10.64898/2026.03.17.712376 medRxiv
Top 0.4%
0.3%

Motivation: AlphaFold has transformed structural biology with an unprecedented accuracy in modeling protein structures and their interactions with biomolecules, with AlphaFold3 (AF3) achieving state-of-the-art performance. However, AF3 and other methods often struggle to accurately predict the structure of protein complexes that lack strong co-evolutionary information, such as antibody-antigen (Ab-Ag) complexes. One of the fundamental issues is that AF3 often generates accurate predictions, but fails to reliably distinguish them from the much larger set of incorrect ones. Results: To address this, we propose ABAG-Rank, a deep neural network that provides an efficient and robust solution for model selection of Ab-Ag interactions from a pool of structural ensembles predicted with AlphaFold. Built on the permutation-invariant DeepSets architecture, ABAG-Rank can process variable-sized ensembles of structural decoys and is directly applicable to prediction settings in which the number of candidates may vary. We train a model on a redundancy-reduced set of all known antibody-antigen complexes and find that simple geometric descriptors, along with confidence scores from AlphaFold, provide rich information about interface quality without requiring intensive physics-based calculations. Our experiments demonstrate that ABAG-Rank significantly outperforms AF3 internal scoring and the ranking performance of existing deep learning baselines. Implementation: Source code can be found at: https://github.com/tadteo/ABAG-Rank
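
The abstract attributes the ability to handle variable-sized decoy ensembles to the permutation-invariant DeepSets architecture. The toy sketch below illustrates that property only: each decoy is encoded independently, the encodings are mean-pooled into an order-independent context, and each decoy is scored relative to that context. It is not ABAG-Rank. The fixed `tanh` encoder, the weights, and the two-feature decoys are all hypothetical stand-ins for learned components.

```python
import math


def encode(features):
    """Per-decoy encoder (a fixed nonlinearity standing in for a learned network)."""
    return [math.tanh(x) for x in features]


def mean_pool(encoded):
    """Order-independent summary of the whole ensemble."""
    n = len(encoded)
    return [sum(column) / n for column in zip(*encoded)]


def score_decoys(decoys, w_self, w_context):
    """Score each decoy from its own encoding, shifted by the pooled context."""
    encoded = [encode(d) for d in decoys]
    context = mean_pool(encoded)
    shift = sum(w * c for w, c in zip(w_context, context))
    return [sum(w * e for w, e in zip(w_self, enc)) - shift for enc in encoded]


# Hypothetical decoys described by two features each (for example, a
# confidence score and a geometric descriptor); weights are illustrative.
weights = [1.0, 0.5]
decoys = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.1]]

scores = score_decoys(decoys, weights, weights)
reordered_scores = score_decoys(decoys[::-1], weights, weights)
best_index = max(range(len(scores)), key=scores.__getitem__)

print("scores:", [round(s, 3) for s in scores])
print("best decoy index:", best_index)
```

Reordering the input reorders the scores in the same way, and the same weights apply to an ensemble of any size, because nothing in the computation depends on the position or number of decoys.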

12
A Cross-Study Multi-Organ Cell Atlas of Macaca fascicularis Informed by Human Foundation Model Annotation: A Resource for Translational Target Assessment

Souza, T. M.; Gamse, J. T.; Moreno, L.; van Rumpt, M.; Nunez-Moreno, G.; Khatri, I.; van Asten, S. D.; Khusial, N. V.; Baltasar-Perez, E.; Adhav, R.; Abdelaal, T.; Wojtuszkiewicz, A.; Calis, J. J. A.; Csala, A.; Dahlman, A.; Fuller, C. L.; Thalhauser, C. J.; Kolder, I. C. R. M.

2026-03-19 bioinformatics 10.64898/2026.03.17.711997 medRxiv
Top 0.4%
0.3%

Non-human primates (NHPs), particularly Macaca fascicularis (cynomolgus macaque), represent an essential model for preclinical assessment of biologics due to their high genetic and physiological similarity to humans. However, mounting regulatory pressure to reduce NHP use and the lack of a unified, well-annotated single-cell atlas currently limit both target qualification and mechanistic interpretation of toxicity in this species. To address this gap, we assembled and harmonized the largest single-cell transcriptomic atlas of M. fascicularis to date, integrating 30 publicly available studies spanning 57 anatomical regions, 43 organs and 14 physiological systems. We implemented a scalable framework for cross-species cell type annotation by embedding both cynomolgus monkey and human (Tabula Sapiens V2) datasets into a shared reference space using Universal Cell Embeddings (UCE), enabling consistent harmonization of cell identities. In total, 27 organs were annotated using human reference labels, while the remaining sets retained author-provided annotations or labels transferred from other cynomolgus studies with available annotations. The resulting atlas comprises over 2.5 million cells and demonstrates concordance in cell-type-specific expression patterns between cynomolgus and humans, including tissue-specific markers and targets relevant for biologics development. Through translational use cases, we illustrate how this resource can be applied to assess target expression in tissues affected by concordant human-NHP toxicities, investigate ocular adverse events associated with antibody-drug conjugates (ADCs), and identify species-specific features of immune cell subtypes with known safety implications.
By enabling scalable, high-resolution, cross-species comparisons of gene expression across organs, tissues, and cell states, this atlas supports improved target qualification, more mechanistic interpretation of toxicities, and evidence-based decisions on the relevance and design of NHP studies. Collectively, this work provides a unified cross-species single-cell resource for cynomolgus monkey and a modular computational framework that advances new approach methodologies and contributes to the refinement and reduction of NHP use in preclinical research.

13
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings

Yang, Y.

2026-03-24 bioinformatics 10.64898/2026.03.20.713313 medRxiv
Top 0.5%
0.3%

The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation--where the reranking and ground truth metrics differ, providing the most informative test of generalization--TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors--the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. 
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
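
The two-stage idea described in this abstract (a cheap embedding-based shortlist followed by exact reranking) can be illustrated with a minimal sketch. This is not the TCRseek implementation: the plain k-mer count embedding stands in for the BLOSUM62-derived embedding, a brute-force cosine scan stands in for the FAISS index, and the function names, parameters, and example CDR3 strings are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_embedding(seq, k=2):
    """Sparse k-mer count vector (simplified stand-in for a learned/BLOSUM-based embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    """Exact edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]


def two_stage_search(query, corpus, shortlist_size=3, top_k=2):
    """Stage 1: shortlist by embedding similarity (brute force here, ANN index in practice).
    Stage 2: rerank the shortlist by exact Levenshtein distance."""
    q_vec = kmer_embedding(query)
    shortlist = sorted(corpus,
                       key=lambda s: cosine(q_vec, kmer_embedding(s)),
                       reverse=True)[:shortlist_size]
    return sorted(shortlist, key=lambda s: levenshtein(query, s))[:top_k]


# Illustrative CDR3-like strings, not taken from the paper's corpus.
corpus = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPDRGNTEAFF", "CAWSVGGNQPQHF"]
hits = two_stage_search("CASSLGQAYEQYF", corpus)
```

The exact distance is only computed for the shortlist, which is where the speedup over a full pairwise scan would come from; the shortlist size trades recall against cost.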

14
PhagePickr: A bacteria-centric computational tool for designing evolution-proof phage cocktails

Oneto, A.; Okamoto, K. W.

2026-03-23 microbiology 10.64898/2026.03.23.713575 medRxiv
Top 0.5%
0.3%

As antibiotic resistance poses a major threat to global health, phage therapy offers an alternative to antibiotic treatments in the face of multidrug-resistant bacteria. However, host resistance to phages is also well-documented. Current computational tools for phage cocktail design do not explicitly address the evolution of phage resistance, let alone through the profiling of bacterial receptors whose variability drives much of phage resistance. We introduce PhagePickr, a computational pipeline for the automated design of phage cocktails that minimize host resistance. Unlike other tools, PhagePickr selects phages based on bacterial surface receptor similarity and prioritizes phage diversity to prevent cross-resistance. The tool uses NCBI datasets, a Nearest Neighbors algorithm, and Multiple Sequence Alignment to identify phenotypically similar hosts and ensure phylogenetic diversity in the final cocktail. We evaluated the utility of PhagePickr on ESKAPE pathogens and two understudied bacterial species. The cocktails included candidate phages predicted to target diverse receptors, comprising both lytic phages with confirmed therapeutic potential and novel candidates from similar species. We demonstrate the tool's utility in generating cocktails and its capacity to scale as current databases are updated. PhagePickr provides a novel bacteria-centric framework for designing resistance-proof cocktails by exploring shared phenotypes.

Author Summary: We present PhagePickr, a novel computational tool to design bacteriophage cocktails against pathogenic bacteria. Antibiotic resistance poses a major threat to global health, and phage therapy, the use of viruses that kill bacteria, is a promising alternative treatment. However, bacteria are also under immense selective pressure to develop resistance to phages, and existing tools for automated cocktail design have yet to address this challenge. 
Our tool is designed to circumvent resistance through two steps: it uses bacterial receptor configurations as predictors of phage infection and constructs a cocktail that maximizes phage diversity to target multiple receptors, making it harder for bacteria to evolve simultaneous resistance. We demonstrated the utility of PhagePickr on two understudied species and the ESKAPE pathogens, a group of multidrug-resistant bacteria responsible for the majority of deaths associated with antibiotic resistance worldwide. The tool identified both well-characterized therapeutic phages and novel candidates, and is designed to scale as databases expand. Our approach represents a key step toward the rational design of evolution-proof phage cocktails for clinical use.
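
The diversity-maximizing selection step described above can be sketched as a greedy farthest-point procedure. This is a hypothetical illustration, not PhagePickr's algorithm: the distance measure (fraction of mismatched columns in pre-aligned sequences), the function names, and the toy phage names and sequences are all assumptions.

```python
def mismatch_fraction(a, b):
    """Fraction of differing columns between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_diverse_subset(aligned, size):
    """Farthest-point greedy selection: start from the first candidate, then
    repeatedly add the candidate whose smallest distance to the chosen set is largest."""
    names = list(aligned)
    chosen = [names[0]]  # arbitrary seed; a real pipeline would pick this deliberately
    while len(chosen) < min(size, len(names)):
        remaining = [n for n in names if n not in chosen]
        best = max(
            remaining,
            key=lambda n: min(mismatch_fraction(aligned[n], aligned[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


# Placeholder aligned sequences; names and residues are invented for illustration.
candidates = {
    "phiA": "ACGTACGT",
    "phiB": "ACGTACGA",
    "phiC": "TTGTCCGA",
    "phiD": "GGCATGCA",
}
cocktail = greedy_diverse_subset(candidates, size=3)
```

In this toy input the near-duplicate of the seed (phiB) is left out, which is the behavior one would want from a cocktail meant to avoid cross-resistance.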

15
Maturation of HIV-1 neutralizing antibodies in a germinal center conditional expression mouse model

Tian, M.; Davis, J.; Cheng, H.-L.; Thompson, L. M.; Tuchel, M.-E.; Williams, A. C.; Yin, A.; Wilder, B.; DiBiase, I.; Seaman, M.; Alt, F. W.

2026-04-01 immunology 10.64898/2026.03.30.715358 medRxiv
Top 0.5%
0.3%

In germinal centers, activated B cells modify their antigen receptors through somatic hypermutation (SHM), followed by antigenic selection that favors expansion of high affinity B cells. The affinity maturation process is critical for development of broadly neutralizing antibodies (bnAbs) against the human immunodeficiency virus-1 (HIV-1). BnAbs have been isolated from some people living with HIV-1. Because these antibodies target conserved epitopes of the HIV-1 Envelope (Env) protein, they inhibit a broad spectrum of viruses. Eliciting bnAbs by vaccination is a top priority for HIV-1 prevention, but reproducing the lengthy maturation of bnAbs is a major challenge. The problem is typified by VRC01 class antibodies, which recognize the CD4 binding site of HIV-1 Env protein. To reach the CD4 binding site, antibodies need to navigate through adjacent glycans. Accommodating the glycans requires multiple SHMs in germinal center (GC) B cells, including infrequent events. For this reason, VRC01 vaccine development often stalls at this point. We have generated a mouse model aimed at providing a potential solution for navigating this vaccine design impediment. To this end, we made a mouse model that expresses a stalled VRC01 intermediate conditionally in GC B cells. This system has three advantages: 1) direct expression of the intermediate obviates prior immunization steps, thereby shortening the immunization scheme; 2) the conditional expression system bypasses tolerance control checkpoints that sometimes delete B cells expressing bnAbs; 3) the intermediate responds to immunization in GCs, the physiological site of affinity maturation. With this model, we established an immunization method to mature the VRC01 intermediate into heterologous neutralizing antibodies against viruses with a native glycan shield. 
Since high mutation load is common among bnAbs, the germinal center conditional expression system could provide a general tool for boost immunogen design to overcome roadblocks in the maturation pathway.

Author summary: In response to antigenic stimulation, cognate B cells become activated and form germinal centers in lymphoid tissues. Germinal center B cells modify their antigen receptors through somatic hypermutation (SHM) of immunoglobulin variable region gene exons, with antigen selecting for high affinity B cells by providing a survival advantage. This mechanism accounts for antibody affinity maturation over the gradual course of an immune response. Affinity maturation is critical for generating potent, neutralizing antibodies against diverse strains of the human immunodeficiency virus-1 (HIV-1). These broadly neutralizing antibodies (bnAbs) are heavily mutated, reflecting lengthy affinity maturation over years of chronic infection. Recapitulating the affinity maturation process is a major challenge for bnAb induction by vaccination. In immunization experiments, bnAb development often stalls at rate-limiting steps that involve infrequent, but functionally important, mutational events. Overcoming such obstacles requires boost immunogens that can stimulate the stalled B cells to acquire the requisite mutations. To this end, we recapitulated the maturation arrest of a bnAb lineage by expressing a stalled antibody in mouse germinal center B cells. Using this mouse model, we developed boost immunization conditions that advanced the antibody maturation beyond a roadblock to attain neutralizing activities against heterogenous viruses.

16
AI-guided design of candidate BMPR1A-binding peptides for cartilage regeneration: a multi-tool computational benchmarking study

Ahmadov, A.; Ahmadov, O.

2026-03-25 bioinformatics 10.64898/2026.03.22.713519 medRxiv
Top 0.5%
0.3%

Bone morphogenetic protein receptor type IA (BMPR1A) is a key mediator of chondrogenesis and a validated therapeutic target for cartilage repair, yet existing BMP mimetic peptides suffer from low potency and the full-length protein (rhBMP-2) carries significant safety risks. Generative AI tools for protein design can now produce de novo peptide binders, but none have been applied to cartilage regeneration targets. Here, we benchmarked four architecturally distinct AI tools--RFdiffusion, BindCraft, PepMLM, and RFpeptides--to design candidate BMPR1A-binding peptides. We generated 192 candidates alongside 98 negative controls (290 total) and evaluated all complexes using AlphaFold 3 structure prediction, dual physics-based energy scoring (PyRosetta and FoldX), and contact recapitulation against the crystallographic BMP-2:BMPR1A interface (PDB: 1REW). A four-metric composite ranking identified a 15-residue PepMLM design (pepmlm_L15_0026) as the top candidate, combining favorable binding energy (PyRosetta dGseparated = -45.9 REU; FoldX ΔG = -19.4 kcal/mol) with the highest contact recapitulation among top-ranked peptides (11/30 gold-standard interface residues). Designed candidates significantly outperformed controls on ipTM (p = 0.002) and FoldX ΔG (p < 0.001). BindCraft candidates achieved the highest structural confidence (ipTM up to 0.81) but exhibited moderate contact recapitulation (mean 0.224), consistent with the computational hypothesis that they may engage alternative BMPR1A binding surfaces rather than the native BMP-2 interface. Physicochemical filtering yielded a shortlist of 54 candidates across all four tools. These results establish a reproducible computational framework for AI-guided peptide design targeting cartilage regeneration and identify specific candidates for future experimental validation via binding assays and chondrocyte differentiation studies. 
Author summary: Damaged cartilage has limited capacity to heal, and current biological therapies based on bone morphogenetic protein 2 (BMP-2) carry serious safety concerns including ectopic bone formation and inflammation. Short peptides that mimic BMP-2's interaction with its receptor BMPR1A could offer a safer, more targeted alternative, but designing such peptides from scratch is challenging. We used four different artificial intelligence tools--each employing a distinct computational strategy--to generate 192 candidate peptides designed to bind BMPR1A. We then evaluated all candidates using multiple independent computational methods to assess binding quality, energy favorability, and whether each peptide targets the correct site on the receptor. Our analysis identified a shortlist of 54 promising candidates, with a 15-residue peptide from the language model-based tool PepMLM emerging as the top-ranked design. We also found evidence that one tool (BindCraft) may produce peptides that bind BMPR1A at sites different from the natural BMP-2 interface, highlighting the importance of validating not just whether a peptide binds, but where it binds. Our computational framework and candidate peptides provide a foundation for future laboratory testing toward cartilage repair therapies.
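
The contact recapitulation metric reported above (for example 11 of 30 gold-standard interface residues) can be read as a simple set overlap. The sketch below assumes that interpretation; the function name and the placeholder residue indices are hypothetical and are not taken from the study or from PDB 1REW.

```python
def contact_recapitulation(predicted_contacts, reference_interface):
    """Fraction of reference interface residues that the designed peptide also contacts."""
    reference = set(reference_interface)
    if not reference:
        raise ValueError("reference interface must not be empty")
    return len(reference & set(predicted_contacts)) / len(reference)


# Placeholder residue indices: 30 reference positions, of which the design touches 11,
# plus two contacts that fall outside the reference interface.
reference_interface = range(1, 31)
predicted_contacts = list(range(1, 12)) + [101, 102]
score = contact_recapitulation(predicted_contacts, reference_interface)  # 11 / 30
```

Contacts outside the reference set do not lower the score under this definition, so a peptide can have high structural confidence and still score low here if it binds elsewhere on the receptor.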

17
emb2dis: a novel protein disorder prediction tool based on ResNets, dilated convolutions & protein language models

Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.

2026-04-01 bioinformatics 10.64898/2026.03.30.715414 medRxiv
Top 0.5%
0.3%

Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction has become an active area of research in the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives as input a protein sequence, calculates its pLM embedding and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to better capture an extended context of each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web-demo for emb2dis and a source code repository for local installation.

Weblink for the tool: https://sinc.unl.edu.ar/web-demo/emb2dis/

The importance of the emb2dis tool is that it provides a new deep learning approach and significant improvements in the prediction of protein disorder, with a simple web interface and graphical output detailing per-residue disorder.
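
To make the receptive-field claim concrete: for a stack of stride-1 one-dimensional convolutions, each layer with kernel size k and dilation d widens the receptive field by (k - 1) * d positions. The sketch below applies that standard formula. The kernel size and dilation schedule are illustrative assumptions and are not the emb2dis architecture.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in residues, of stacked stride-1 1D convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d positions of context.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Four layers with kernel size 3 (assumed values, for illustration only).
plain = receptive_field(3, [1, 1, 1, 1])    # undilated stack
dilated = receptive_field(3, [1, 2, 4, 8])  # exponentially growing dilation
```

With the same number of layers and parameters, the dilated stack sees 31 residues of context per position instead of 9, which is the sense in which dilation enlarges the receptive field.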

18
Claw4Science: A Dataset and Platform for the OpenClaw Scientific Agent Ecosystem

Xu, M.; Chen, J.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715118 medRxiv
Top 0.5%
0.3%

Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform at https://claw4science.org, which is built on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.

19
Cellector: A tool to detect foreign genotype cells in scRNAseq data with applications in leukemia and microchimerism.

Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714571 medRxiv
Top 0.5%
0.2%

Rare, genetically distinct cells can occur in various samples, such as those from transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign genotype cells in single cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells down to an exceedingly low percentage of these cells present (0.05% or lower).

20
Computational Prediction of Plasmodium falciparum Antigen-T-cell Receptor Interactions via Molecular Docking: Implications for Malaria Vaccine Design

Kipkoech, G.; Kanda, W.; Irungu, B.; Nyangi, M.; Kimani, C.; Nyangacha, R.; Keter, L.; Atieno, D.; Gathirwa, J.; Kigondu, E.; Murungi, E.

2026-03-20 bioinformatics 10.64898/2026.03.18.712575 medRxiv
Top 0.6%
0.2%

Malaria is one of the deadliest diseases in sub-Saharan Africa and Southeast Asia. Most fatalities occur in children under 5 years and pregnant women and are due to infection by Plasmodium spp., of which Plasmodium falciparum is the most virulent and is responsible for most of the morbidity and mortality. Despite various public health interventions such as the use of insecticide-treated bed nets, spraying of homes with insecticides, and the use of WHO-recommended artemisinin-based combination therapies (ACT), malaria prevention still faces major setbacks due to drug resistance in P. falciparum and insecticide resistance in mosquitoes. The study uses molecular docking and immunoinformatics to screen various Plasmodium spp. antigens and evaluate their antigenicity and suitability as vaccine candidates. The P. falciparum antigens and T-cell receptor (TCR) structures were obtained from the Protein Data Bank (PDB), selected based on a range of factors related to their role in the lifecycle of the parasite and their status as vaccine targets. Protein structures not available in the PDB were predicted using AlphaFold. The 3D structures of the selected P. falciparum antigens and TCRs were downloaded in PDB format, then all water molecules, heteroatom (HETATM) records, and bound ligands were deleted from the protein structures using BIOVIA Discovery Studio Visualizer. Subsequently, molecular docking was done using the ClusPro v2.0 server and the docked complexes were compared. The findings of this study gave valuable insights into the interaction of the human immune response with P. falciparum antigens. The three best-ranked antigen complexes are PfCyRPA, PfMSP10 and PfCSP, and this confirms their use as potential candidates for vaccine development. This study highlights the usefulness of computational docking in identifying P. falciparum antigens of excellent immunogenic potential as vaccine candidates.