Protein Engineering, Design and Selection — Latest Matching Preprints

1

FLIP2: Expanding Protein Fitness Landscape Benchmarks for Real-World Machine Learning Applications

Didi, K.; Alamdari, S.; Lu, A. X.; Wittmann, B.; Johnston, K. E.; Amini, A. P.; Madani, A. K.; Czeneszew, M.; Dallago, C.; Yang, K. K.

2026-02-24 bioengineering 10.64898/2026.02.23.707496 medRxiv

Top 0.1%

4.9%

Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across common conditions encountered in protein engineering. In practice, this leaves protein engineers uncertain about the real utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of thermostability, binding, and viral capsid viability. We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, protein-protein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of benchmark models across these datasets and splits reveals that simpler models often matched or outperformed fine-tuned protein language models on FLIP2, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded and we redistribute all data under CC-BY 4.0 to facilitate continued progress.

2

CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

Chen, Y.; Fu, L.; Lu, X.; Li, W.; Gao, Y.; Wang, Y.; Ruan, Z.; Si, T.

2026-03-25 synthetic biology 10.64898/2026.03.24.714074 medRxiv

Top 0.1%

3.8%

Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, while large-scale machine learning benchmarks exist for protein function prediction, they are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning 9 proteins with diverse functional properties including binding affinity, fluorescence, and enzymatic activities. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman's ρ and Normalized Discounted Cumulative Gain metrics. Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity through both in silico simulation (improving fluorescence brightness of an oxygen-independent fluorescent protein) and experimental validation (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website (https://www.combingym.org), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.
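The two evaluation metrics named in this abstract can be illustrated with a short sketch. This is not CombinGym's implementation: the function names, the linear-gain form of NDCG, and the toy fitness values are assumptions made here for illustration, and the NDCG variant shown assumes non-negative measured values.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(y_true, y_pred):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    n = len(rt)
    mt, mp = sum(rt) / n, sum(rp) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)

def ndcg(y_true, y_pred, k=None):
    """NDCG with linear gains (assumed variant; needs non-negative y_true)."""
    k = len(y_true) if k is None else k

    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

    # Measured values ordered by descending predicted score.
    by_pred = [t for _, t in sorted(zip(y_pred, y_true), key=lambda p: -p[0])]
    ideal = sorted(y_true, reverse=True)
    return dcg(by_pred) / dcg(ideal)

# Toy (made-up) measured and predicted fitness values for four variants.
measured = [0.1, 0.4, 0.9, 0.3]
predicted = [0.2, 0.5, 0.7, 0.1]
print(spearman_rho(measured, predicted), ndcg(measured, predicted))
```

Spearman's ρ scores the whole ranking equally, whereas NDCG weights the top of the predicted list more heavily, which is closer to how an engineering campaign picks a handful of variants to test.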

3

Staged heavy-chain filtering enables Fab discovery from combinatorially intractable library spaces

Kim, Y.; Kwon, H.; Hong, J.; Kang, C. K.; Park, W. B.; Kim, H.-R.; Lee, C.-H.

2026-05-13 bioengineering 10.64898/2026.05.10.724059 medRxiv

Top 0.1%

3.6%

Background: Combinatorial fragment antigen-binding (Fab) libraries encode an immense heavy-light chain pairing space, often exceeding 10{superscript 1} possible combinations, which far surpasses the diversity that can be experimentally constructed and screened in display systems. As a result, direct Fab screening samples only a small fraction of the theoretical search space, creating a practical bottleneck for functional binder discovery. Results: Here, we frame Fab discovery as a staged search problem by decoupling heavy-chain (HC) and light-chain (LC) exploration. We implemented a sequential HC preselection-remating workflow in yeast surface display, in which antigen-reactive HC variants are first enriched and subsequently recombined with a diverse LC repertoire to reconstruct a focused Fab library. In a SARS-CoV-2 spike-targeted campaign, HC and LC libraries of 2.05 x 10 and 2.33 x 10 members corresponded to a theoretical pairing space of approximately 4.8 x 10{superscript 1} combinations. Sequential HC enrichment followed by LC remating allowed recovery of multiple functional Fab clones from a tractable library scale of approximately 10, including clones that shared a common HC scaffold but carried distinct LC partners. A representative recombinant IgG output showed broad but heterogeneous spike/RBD binding, measurable pseudovirus neutralization activity (EC = 11.1 nM), and compatibility with standard early biophysical characterization after full-length IgG reformatting. Conclusions: These results provide proof of principle that combinatorial Fab discovery can be approached as a staged exploration problem under realistic library-size constraints. By focusing downstream Fab reconstruction on an antigen-compatible HC subspace, sequential HC preselection followed by LC remating offers a practical strategy for exploring otherwise intractable antibody pairing landscapes in eukaryotic display systems.

4

What comes after de novo? Automated lead optimization of proteins with CRADLE-1

Bixby, E.; Brunner, G.; Danciu, D.; Dela Rosa, R.; Deutschmann, N.; Ferragu, C.; Geiger, F.; Holberg, C.; Kidger, P.; Lindoulsi, A.; Lutz, N.; McColgan, T.; Milius, S.; Shah, J.; Vandeloo, M.; Vidas, P.; Ziegler, J. D.; van Rossum, H.; van der Vorm, D.; Baldi, N.; IJSpeert, C.; Monza, E.; Schriek, A.

2026-03-08 bioengineering 10.64898/2026.03.06.710001 medRxiv

Top 0.1%

3.1%

Lead optimization remains the longest and most expensive step in pre-clinical drug discovery, typically consuming 12-36 months whilst costing $5M-$15M per candidate. We introduce CRADLE-1, an automated framework for protein engineering. While CRADLE-1 supports the full process of drug discovery and industrial protein engineering pipelines, including hit identification and de novo binder design, this work focuses on its application to multi-property lead optimization across protein modalities (VHHs, scFvs, IgGs, peptides, enzymes, CRISPR systems, vaccines). We show it is 4-7x faster than rational design, as measured by the number of wet lab rounds required. We provide in-vitro validation across all of the above modalities, typically optimizing multiple properties simultaneously (single and polyspecific binding down to picomolar, activity, thermostability, ...). Technically, CRADLE-1 starts with pre-trained foundation protein language models (PLMs), which are fine-tuned in unsupervised fashion on evolutionary neighborhoods, in supervised fashion using lab-in-the-loop data, and then deployed in a multi-model workflow. Of additional interest, we find that (a) the end-to-end system may be run in automated fashion; (b) wet lab data may be consumed in black box fashion without knowledge of the underlying biochemical mechanisms; (c) structural data may largely be superseded by sequence-function pairs.

5

Benchmarking and Experimental Validation of Machine Learning Strategies for Enzyme Engineering

Zeng, Z.; Jin, J.; Xu, R.; Luo, X.

2026-03-30 bioengineering 10.64898/2026.03.29.715152 medRxiv

Top 0.1%

2.6%

Enzyme-directed evolution increasingly relies on computational tools to prioritize mutations, yet their practical value is difficult to assess because kinetic data are often aggregated across heterogeneous assay conditions, inflating apparent generalization. Here we introduce EnzyArena, a curated benchmark that groups kinetic parameters (kcat, Km, kcat/Km) into condition-matched experimental subsets to enable realistic evaluation. Using this resource, we benchmark 10 representative models from two emerging strategy families (zero-shot fitness prediction and supervised kinetic-parameter prediction) across BRENDA- and SABIO-RK-derived subsets and 25 independent mutagenesis datasets. Kinetic-parameter predictors perform strongly on database-derived subsets but lose their advantage on independent datasets, whereas zero-shot predictors show more consistent generalization. A simple consensus of multiple zero-shot models further improves the precision of identifying beneficial mutants. We prospectively validated these findings in a wet-lab campaign (150 mutants) comparing random mutants, UniKP-prioritized mutants and ESM-1v-prioritized mutants (representing supervised kinetic-parameter prediction and zero-shot fitness prediction, respectively), where ESM-1v achieved the highest utility and UniKP underperformed the random baseline. Together, this study establishes realistic baselines for computational mutant prioritization and highlights consensus zero-shot strategies as a practical starting point for enzyme engineering.
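One way to picture a "simple consensus of multiple zero-shot models" is mean-rank aggregation, sketched below. The abstract does not say which consensus rule the authors used, so the mean-rank rule, the function names, the model names, the mutant labels, and the scores are all hypothetical.

```python
def ranks_desc(scores):
    """Rank 1 = highest score. Ties are broken by input order (a simplification)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def consensus_top_k(model_scores, mutants, k):
    """Pick the k mutants with the best (lowest) mean rank across models.

    model_scores maps a model name to a list of scores aligned with `mutants`.
    Ranks are used instead of raw scores because different zero-shot models
    report scores on different scales.
    """
    per_model = [ranks_desc(scores) for scores in model_scores.values()]
    mean_rank = [sum(col) / len(per_model) for col in zip(*per_model)]
    order = sorted(range(len(mutants)), key=lambda i: mean_rank[i])
    return [mutants[i] for i in order[:k]]

# Hypothetical mutants and made-up scores from three unnamed models.
mutants = ["A12V", "G45D", "L78F", "S90T"]
model_scores = {
    "model_a": [0.9, 0.1, 0.5, 0.3],
    "model_b": [0.2, 0.1, 0.8, 0.7],
    "model_c": [0.6, 0.2, 0.7, 0.1],
}
print(consensus_top_k(model_scores, mutants, k=2))
```

A mutant only reaches the top of the consensus list if several models agree on it, which is one plausible reason a consensus would raise precision relative to any single predictor.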

6

GROQ-seq Datasets Across Transcription Factors (LacI, RamR, VanR), T7 RNA Polymerase and TEV Protease

Spinner, A.; Sreenivasan, S.; McLellan, J. R.; Ikonomova, S. P.; Cortade, D. L.; d'Oelsnitz, S.; Sheldon, K.; Vasilyeva, O. B.; Alperovich, N. Y.; Chadha, A.; Nematollahi, L.; Dhroso, A.; Sisson, Z.; Hudson, C. M.; DeBenedictis, E.; Kelly, P. J.; Reider Apel, A.; Ross, D.; Baranowski, C.

2026-04-18 bioengineering 10.64898/2026.04.15.718744 medRxiv

Top 0.1%

2.3%

Predicting any protein's function from its sequence alone would be a significant breakthrough in molecular biology. Although machine learning approaches have sought to tackle this, their limited generalizability reflects the absence of sufficiently large, open, diverse, and unified datasets. To address this data gap, we developed a high-throughput experimental platform called GROQ-seq (Growth-based Quantitative Sequencing). In GROQ-seq, a protein's function can be linked to a sequencing-based readout that enables scalable characterization of large variant libraries in Escherichia coli. Here, we present pilot datasets demonstrating its performance across three distinct protein function classes: transcription factors, polymerases, and proteases. The objective of this report is to present the datasets and to provide users with a clear and transparent characterization of their properties, including both the strengths and limitations.

7

Library docking for Cannabinoid-2 Receptor ligands

Rachman, M. M.; Iliopoulos-Tsoutsouvas, C.; Dominic Sacco, M.; Xu, X.; Wu, C.-G.; Santos, E.; Glenn, I. S.; Paris, L.; Cahill, M. K.; Ganapathy, S.; Tummino, T. A.; Moroz, Y. S.; Radchenko, D. S.; Okorie, M.; Tawfik, V. L.; Irwin, J. J.; Makriyannis, A.; Skiniotis, G.; Shoichet, B. K.

2026-03-21 biochemistry 10.64898/2026.03.19.713017 medRxiv

Top 0.1%

1.5%

Cannabinoid receptors are therapeutically promising GPCRs that are also interesting test systems for structure-based methods, which have targeted them previously. Here we used the CB2 receptor as a template to explore several topical questions in library docking. Whereas an earlier campaign against the CB1 receptor led to potent but relatively non-selective ligands, here we found that targeting interactions with polar, orthosteric site residues led to subtype-selective ligands. Docking hit rate and especially hit affinity improved in moving from a 7 million to a 2.6 billion molecule library. Similar to earlier studies, docking against active and inactive states of the receptor did not reliably bias toward the discovery of agonists or inverse agonists. Cryo-EM structures of two of the new agonists, each in a different chemotype, superposed well on the docking predictions. Correspondingly, structure-based optimization led to 10- to 140-fold improvements within three different series, also consistent with well-behaved ligand families. Hit rates with a fully enumerated 2.6 billion molecule library resembled those of an implied 11 billion molecule library from a building-block method, consistent with the latter's ability to explore this space, though higher affinities were discovered from the fully enumerated set. Overall, eight diverse families of ligands, with potencies <100 nM and mostly unrelated to previously known ligands, were found. Implications for future studies are considered.

8

Surface Display For Phage Assisted Continuous Evolution: A Platform For Evolving / Screening Nanobodies In Prokaryote Systems

Flores-Mora, F. E.; Brodsky, J.; Cerna, G. M.; Tse, A.; Hoover, R. L.; Bartelle, B. B.

2026-04-04 synthetic biology 10.64898/2026.04.03.716437 medRxiv

Top 0.1%

1.5%

Despite >50 years of methods development, specific antibodies are still generated at low throughput and remain in high demand across biotechnology. Most biologics and immunoprobes are monoclonal antibodies, developed using a combination of inoculating animals with a target antigen, engineered candidate libraries, and multiple rounds of selection using phage or yeast display. Here we introduce a synthetic biology scheme to eliminate the need for nearly all of these steps, by combining Surface display on E. coli and Phage display with the microvirus ΦX174, Assisting Continuous Evolution (SurPhACE). Instead of building libraries for screening, SurPhACE runs a closed evolutionary program. A typical experiment can have 10^11 mutant candidates under active selection, with complete turnover of the mutant population every 30 min, or >5 x 10^12 unique mutants per day, using less than 100 mL of bacterial culture media. We demonstrate SurPhACE for optimizing a nanobody to a related epitope, and develop novel nanobodies for an arbitrary target using a minimal starting library to establish a proof of concept and identify best practices for this scalable method for generating protein binders.

9

An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.

2026-03-05 bioinformatics 10.64898/2026.03.04.709378 medRxiv

Top 0.1%

1.3%

We present a general approach to find amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on using Monte Carlo (MC) simulations to sample an energy landscape where minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and with the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only the information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we further filter potential candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild-type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (based on active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability when including thermal fluctuations. For the most promising enzymes, these yield RMSF values in the active site below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline. These include designs for four representative enzymes of this study: PETase, subtilisin Carlsberg (serine protease), Taq DNA polymerase, and VioA.
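The idea of sampling sequences on an energy landscape with Monte Carlo can be sketched with a generic Metropolis loop. This is not the BAGEL API and not the authors' PLM-derived energy: `toy_energy`, its motif and weights, the move set, and the starting sequence are hypothetical stand-ins chosen only so that low energy means "short, but still containing the motif".

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_energy(seq, motif="SHD", length_weight=0.1, motif_penalty=10.0):
    """Hypothetical energy: shorter is better, losing the motif is costly."""
    penalty = 0.0 if motif in seq else motif_penalty
    return length_weight * len(seq) + penalty

def propose(seq, rng):
    """Randomly delete one residue or substitute one residue."""
    i = rng.randrange(len(seq))
    if rng.random() < 0.5 and len(seq) > 1:
        return seq[:i] + seq[i + 1:]                    # deletion
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]  # substitution

def metropolis(seq, energy, steps, temperature, rng):
    """Metropolis sampling; returns the lowest-energy sequence visited."""
    current, e_cur = seq, energy(seq)
    best, e_best = current, e_cur
    for _ in range(steps):
        candidate = propose(current, rng)
        e_cand = energy(candidate)
        delta = e_cand - e_cur
        # Always accept downhill moves; accept uphill with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_cur = candidate, e_cand
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best

start = "MKTASHDLLGAV"  # made-up starting sequence containing the toy motif
best, e_best = metropolis(start, toy_energy, steps=2000, temperature=0.2,
                          rng=random.Random(0))
print(best, e_best)
```

In the approach described above the energy would come from protein language model outputs rather than a string match, but the accept/reject logic of a Metropolis sampler is the same regardless of how the energy is defined.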

10

CARTiBASE: an interactive knowledge base for CAR sequence retrieval and similarity analysis

Le Compte, G.; Ceylan, H.; Meysman, P.; Laukens, K.

2026-02-26 immunology 10.64898/2026.02.25.707638 medRxiv

Top 0.1%

1.3%

Summary: Chimeric Antigen Receptors (CARs) are modular synthetic constructs that have transformed cellular immunotherapy, enabling targeted recognition and killing of malignant cells. Their clinical success has driven an explosive growth in new receptor designs, but these sequences are dispersed across heterogeneous sources such as publications, patents and supplementary files. This fragmentation and inconsistency limit comparative analysis, reproducibility and the reuse of existing constructs. To address this, we curated and standardized more than 10,000 CAR sequences into a single, harmonized resource. CARTiBASE is a web-based platform that provides standardized annotation, interactive browsing and fast similarity search across this curated collection. This unique database was leveraged to analyse the diversity in current CAR constructs within the public domain, revealing common design trends and lineages, as well as highlighting potential avenues for future CAR development. Availability and Implementation: CARTiBASE is freely available for non-commercial use at https://www.cartibase.org, without mandatory registration. The web server is implemented with a Python/Flask API backend and a Vue-based frontend and supports all major browsers. Users can search and filter thousands of CARs, inspect domain boundaries across signal peptide, antigen-binding domain, hinge, transmembrane, co-stimulatory and intracellular signaling regions, compare constructs and download sequences as FASTA files for downstream use.

11

GROQ-seq Enables Cross-site Reproducibility for High-Throughput Measurement of Protein Function

Spinner, A.; Ross, D.; Cortade, D.; Ikonomova, S.; Baranowski, C.; Dhroso, A.; Reider Apel, A.; Sheldon, K.; Duquette, C.; Kelly, P. J.; DeBenedictis, E.; Hudson, C.

2026-04-09 bioengineering 10.64898/2026.04.07.716961 medRxiv

Top 0.1%

1.0%

High-throughput functional assays are increasingly used to generate large-scale protein function datasets for protein engineering and machine learning applications. However, the utility of such datasets depends on the reproducibility of the underlying measurements. Here we report reproducible, quantitative measurements of protein sequence-to-function data at scale across two facilities. We analyze GROQ-seq (Growth-based Quantitative Sequencing) measurements of three bacterial transcription factors. Independent barcode measurements of the same sequence produce highly consistent functional estimates, demonstrating strong biological reproducibility (across all transcription factors the mean Root Mean Square Deviation [RMSD] ≈ 0.53 and mean Spearman ≈ 0.63). We also compared experiments performed at two facilities using a shared protocol, but with differing levels of automation and system integration. We observe strong agreement between measurements taken at the two sites (mean RMSD ≈ 0.41 and mean Spearman ≈ 0.730). Orthogonal tests further support this agreement: a classifier trained to distinguish data by site performs near random (AUC = 0.559), and top-ranking variants show strong statistical overlap between experiments. Together, these results demonstrate that GROQ-seq enables reproducible, scalable measurement of protein function suitable for large aggregated datasets.

12

Integrating Diffusion and Liquid AI Models for Predicting Peptide Affinity from mRNA Display Selections

Leaf, C. M.; Qi, P.; Gandhi, Y. P.; Jalali-Yazdi, F.; Ong, J. N.; Takahashi, T. T.; Kalia, R.; Roberts, R. W.

2026-05-11 bioengineering 10.64898/2026.05.05.723033 medRxiv

Top 0.1%

0.9%

In vitro selection and directed evolution technologies such as mRNA display explore large libraries (≥10^14 variants) and generate thousands to millions of functional polypeptide ligands to a variety of targets. Denoising diffusion implicit machine learning models (DDIMs) trained using display-derived deep sequencing data can greatly expand these functional sequences beyond what is accessible experimentally. However, methods are needed to predict peptide properties such as binding free energies (ΔG°). Here, we applied machine learning methods to predict binding free energies of both experimental and DDIM-generated peptide ligands against a target of interest, the oncogenic protein Bcl-xL. To do this, we trained a Closed-form Continuous (CfC) neural network using a dataset of 15,700 peptide ligands where pairs of sequences and their corresponding binding free energies (ΔG°) were used as inputs. This type of model was chosen due to its ability to represent irregular series. The resulting CfC model accurately predicts the rank order, within error, and binding free energies (ΔG°) for both experimental and DDIM-generated peptides, identifying five DDIM-generated peptides with single-digit picomolar affinities. Combining trained DDIM and CfC models offers a unified route to expand the scope of experimental ligand discovery, predict the molecular properties of both experimental and generated ligands, and highlights the utility of large quantitative datasets for making accurate in silico predictions of high-affinity peptide candidates.
Statement: High-throughput sequencing analysis of mRNA display libraries enables generating novel peptide ligands and expands the scope of functional sequences beyond what is accessible experimentally. Closed-form Continuous neural networks trained using sequences and their corresponding free energies accurately predict the binding free energies of both experimental and machine learning-generated peptides, enabling a route to quantitatively predict peptide properties using directed evolution data.

13

Teaching Diffusion Models Physics: Reinforcement Learning for Physically Valid Diffusion-Based Docking

Broster, J. H.; Popovic, B.; Kondinskaia, D.; Deane, C. M.; Imrie, F.

2026-03-27 bioinformatics 10.64898/2026.03.25.714128 medRxiv

Top 0.1%

0.8%

Molecular docking aims to predict the binding conformation of a small molecule to its protein target. Recent work has proposed diffusion models for this task, from rigid-body docking that diffuses over ligand degrees of freedom to co-folding approaches that jointly generate protein structure and ligand pose. However, diffusion-based docking models have been shown to frequently produce physically implausible poses and fail to consistently recover key protein-ligand interactions. To address this, we introduce a reinforcement learning framework for training diffusion-based docking models directly on non-differentiable objectives. Fine-tuning DiffDock-Pocket for physical validity with our approach substantially increases the number of generated poses that are physically valid and interaction-preserving, with no increase in inference-time compute. Importantly, this comes without sacrificing structural accuracy; in fact, our approach increases the proportion of structures with near-native poses. These effects are most pronounced for protein targets that are dissimilar to the training data. Our fine-tuned DiffDock-Pocket model outperforms both classical docking algorithms and machine learning-based approaches on the PoseBusters set. Our results demonstrate that reinforcement learning can teach diffusion-based docking models to better respect physical constraints and recover key interactions, without the requirement to rely on inference-time corrections.

14

An improved workflow for rapid, large-scale protein production in HEK293 cells via antibiotic enrichment after lentiviral transduction

Elegheert, J.; Behiels, E.; Nair, A.; Doridant, A.

2026-03-08 biochemistry 10.64898/2026.03.07.710266 medRxiv

Top 0.1%

0.7%

Lentiviral transduction of HEK293-derived expression cells provides a robust and scalable approach for large-scale protein production for structural and biochemical studies. Building on our previously reported platform, we introduce an improved workflow that decouples cell enrichment from target protein expression by enabling constitutive antibiotic selection of transduced cells prior to induction. The key advance is the use of orthogonal antibiotic-resistance cassettes to stringently enrich transduced cells, eliminate non-transduced cells, improve population homogeneity, and enable multi-vector co-selection for heteromeric assemblies and complexes. We provide two complementary transfer-vector suites. pHR-AB-CMV-TetO2 delivers maximal expression and supports inducible control in TetR-expressing lines while driving strong constitutive expression in non-TetR lines. pHR-AIO-AB ("all-in-one") encodes the transactivator, resistance marker, and gene of interest on a single construct to enable tightly controlled doxycycline-inducible expression in standard HEK293 lines, and is readily adaptable to other mammalian cell types. Both suites are available with puromycin, blasticidin, hygromycin, or zeocin markers, enabling straightforward co-infection and orthogonal multi-antibiotic selection of stable populations expressing multiple transgenes. They are well suited to demanding targets such as membrane proteins and multi-subunit assemblies. The protocol details the step-by-step generation of highly enriched, inducible HEK293 populations within 3-4 weeks.

15

Advances in protein function prediction from the fifth CAFA challenge

De Paolis Kaluza, M. C.; Ramola, R.; Joshi, P.; Piovesan, D.; Reade, W.; Orchard, S.; Martin, M. J.; Ignatchenko, A.; Kaggle Competition Participants; Rost, B.; Orengo, C. A.; Robinson-Rechavi, M.; Durand, D.; Brenner, S. E.; Greene, C. S.; Mooney, S. D.; Friedberg, I.; Radivojac, P.

2026-04-30 bioinformatics 10.64898/2026.04.27.716980 medRxiv

Top 0.1%

0.7%

The Critical Assessment of Functional Annotation (CAFA) is a long-standing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for the dissemination of results and exchange of ideas. In its fifth round (CAFA5) of triennial challenges, a partnership with Kaggle Inc. facilitated participation from a large community of data scientists and computational biologists through a competitive prospective challenge on the crowdsourcing platform. In this work, we present an in-depth analysis of the submitted predictions and report improvements in accuracy over all methods from the previous CAFA challenges. We further introduce a new evaluation setting for proteins with pre-existing (incomplete) annotations and identify the need for methods that better leverage existing annotations to predict those that will be discovered later. Finally, we characterize the prospective evaluation framework by examining performance on a strict set of unpublished annotations and across intermediate database releases. Our results indicate that recent developments in the field, such as the availability of protein language models and accurately predicted 3D structures, as well as the growth of experimental annotations through biocuration, have all contributed to performance improvements.

16

Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Ni, Z.; Li, Y.; Qiu, Z.; Schölkopf, B.; Guo, H.; Liu, W.; Liu, S.

2026-03-04 bioinformatics 10.64898/2026.03.02.708991 medRxiv

Top 0.1%

0.7%

Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available on this repository.

17

Stoic: Fast and accurate protein stoichiometry prediction

Litvinov, D.; Pantolini, L.; Skrinjar, P.; Tauriello, G.; McCafferty, C. L.; Engel, B. D.; Schwede, T.; Durairaj, J.

2026-03-16 bioinformatics 10.64898/2026.03.13.711535 medRxiv

Top 0.1%

0.7%

Motivation: Protein complexes are central to cellular function, but experimental determination of their structures remains challenging. Structure prediction methods require prior knowledge of stoichiometry - the number of copies of each protein entity within a complex. Current approaches rely on computationally expensive brute-force methods that run structure prediction on multiple stoichiometry combinations, often with limited accuracy. Results: We introduce Stoic, a method that uses protein language model embeddings to predict protein complex stoichiometry. Our approach learns to identify interface residues that participate in protein-protein interactions, rather than relying on global sequence features. By integrating these interface-aware embeddings into a graph neural network, Stoic achieves fast and accurate stoichiometry prediction for both homomeric and heteromeric targets. Availability: Source code for inference and training along with web versions are available in the repository at https://github.com/PickyBinders/stoic. Contact: janani.durairaj@unibas.ch

18

PDBe-SIFTS: an open-source tool for Structure Integration with Function, Taxonomy, and Sequences, featuring improved alignment, scoring scheme, and accelerated search

Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.

2026-05-04 bioinformatics 10.64898/2026.04.30.721839 medRxiv

Top 0.1%

0.7%

Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks Broader audience statementMatching protein sequences to their three-dimensional structures, and mapping annotations across both, is essential for understanding protein function, interactions, and molecular mechanisms. This integrated view enables richer interpretation of biological data and underpins advances in drug discovery, disease research, and protein engineering. 
PDBe-SIFTS provides an open and functional framework for structure-sequence mapping, allowing researchers and databases to run, inspect, and extend these mappings locally, while benefiting from faster searches, transparent scoring, and structurally informed residue-level alignments.

19

When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR-Peptide Binding Prediction

Qi, C.; Wang, W.; Fang, H.; Wei, Z.

2026-04-02 bioinformatics 10.64898/2026.03.31.715453 medRxiv

Top 0.1%

0.6%

Show abstract

Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR-peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
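
As a rough illustration of what "CLIP-style intra-entity contrastive alignment" can look like, the sketch below computes a generic symmetric InfoNCE loss between sequence-tower and graph-tower embeddings of the same entities. It is not taken from the TRACE code: the function names, the temperature value of 0.07, and the use of plain NumPy are assumptions made for illustration, and the abstract does not specify the exact objective.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_alignment_loss(seq_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE loss between two views of the same batch of entities.

    Row i of seq_emb and row i of graph_emb are assumed to describe the same
    entity (for example the same TCR); every other row in the batch acts as a
    negative. The temperature is a hypothetical default, not a reported value.
    """
    s = l2_normalize(seq_emb)
    g = l2_normalize(graph_emb)
    logits = s @ g.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))                 # matching pairs sit on the diagonal
    loss_seq_to_graph = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_graph_to_seq = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_seq_to_graph + loss_graph_to_seq)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(8, 16))
    # Consistent views of the same entities should score lower than mismatched ones.
    print(clip_style_alignment_loss(seq, seq))
    print(clip_style_alignment_loss(seq, np.roll(seq, 1, axis=0)))
```

In a setup like the one the abstract describes, a term of this kind would presumably be added to the binding-prediction loss so that the graph tower is pulled toward the sequence tower's geometry before the two are fused. How the terms are weighted is not stated in the abstract.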

20

RePaRank: An Efficient Architecture for Antibody-Antigen Interface Prediction by Proximity Ranking

Bednarek, J.; Janusz, B.; Krawczyk, K.

2026-03-05 immunology 10.64898/2026.03.03.708462 medRxiv

Top 0.1%

0.6%

Show abstract

The prediction of protein-protein interactions is central to structural biology, yet leading models are often computationally expensive, creating an accessibility gap for many high-throughput applications. Furthermore, common evaluation metrics such as binary contact prediction can be unreliable. In this work, we address both challenges. We introduce RePaRank, a computationally efficient deep learning architecture with 39.4 million parameters that predicts antibody-antigen interfaces by reframing the problem as a proximity ranking task in a learned embedding space. We also propose the Precision AUC, a robust, ranking-based metric that provides a more stable assessment of model performance than traditional binary methods. Our experiments show that RePaRank consistently outperforms benchmark models in paratope prediction and is highly competitive in epitope prediction among models that do not require external resources such as Multiple Sequence Alignments (MSA). RePaRank offers a practical and powerful tool for the immunoinformatics community.