Bioinformatics — Latest Matching Preprints

1

Ordered Gromov-Hausdorff Metric: A New Tool for Comparative Analysis of Protein Structures

Timofeev, A.; Anufriev, A.

2026-05-27 bioinformatics 10.64898/2026.05.23.727377 medRxiv

Top 0.1%

85.7%

Motivation: Classical protein structure comparison metrics such as RMSD and TM-score effectively assess geometric similarity but ignore the linear order of amino acid residues (Zhang and Skolnick, 2004). The Gromov-Hausdorff (GH) metric compares metric spaces by shape but also does not account for order (Gromov, 1981). This can lead to incorrectly identifying proteins with swapped domains as similar. We introduce the Ordered Gromov-Hausdorff (OGH) metric, defined on ordered metric spaces, to incorporate residue order into the comparison. Results: OGH combines coordinate normalization, an exponential penalty for order violations, and a monotonic alignment algorithm with computational complexity O(n·w), where w is the search window width. It is proven that OGH satisfies all metric axioms for positive values of its parameter. Analytical properties include invariance under isometries, upper boundedness, Lipschitz continuity under small coordinate perturbations, and concavity in the weight parameter. On the VAD dataset (28 viral proteins from HIV-1, SARS-CoV-2, MERS-CoV), OGH increases monotonically with residue shuffling (up to 0.363 at 100% shuffling) and correlates strongly with TM-score (r = 0.706). In the task of separating homologs at fixed global similarity (TM-score ≈ 0.5), OGH achieves AUC = 0.800, whereas TM-score gives AUC = 0.467, demonstrating that OGH detects conserved order even when global geometry is not conserved. Availability: The Python source code for OGH is freely available at https://github.com/andytimoffilim/OGH. The VAD dataset (PDB IDs listed in the paper) is publicly accessible from the RCSB Protein Data Bank (Berman et al., 2000; wwPDB, 2019).

2

Min-frame transformation enables more sensitive viral genome alignment

Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.

2026-05-22 bioinformatics 10.64898/2026.05.20.726535 medRxiv

Top 0.1%

69.9%

Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these gains could benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis. Funding: Tandy Warnow: NSF grant 2316233. Todd J. Treangen: NSF grants 2126387, 2239114; NIH grants U19-AI144297, P01-AI152999.
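
The transformation described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicographic k-mer ordering, the base-4 integer used as the transformed "character", the window convention (the w k-mers starting at each position, truncated at the sequence end), and the names kmer_rank and min_frame_transform are all assumptions made for illustration.

```python
from typing import List

# Assumed nucleotide codes; they make integer order equal lexicographic order.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_rank(kmer: str) -> int:
    """Rank of a k-mer under the assumed global ordering (base-4 integer)."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + BASE_CODE[base]
    return rank


def min_frame_transform(seq: str, k: int = 3, w: int = 4) -> List[int]:
    """Emit one symbol per k-mer start position.

    The symbol at position i is the smallest-ranked k-mer among the (up to) w
    k-mers starting at positions i .. i+w-1, so output coordinates line up
    with input coordinates.
    """
    ranks = [kmer_rank(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return [min(ranks[i:i + w]) for i in range(len(ranks))]


if __name__ == "__main__":
    print(min_frame_transform("ACGTACGT", k=2, w=2))
```

In this sketch a substitution changes the output only at positions whose selected k-mer changes, which is the sense in which a window-based selection can hide some mutations; how often that happens depends on k, w, and the ordering.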

3

HiCPotts: An R/Bioconductor package to identify significant interactions in chromosome conformation capture data and model sources of biases.

Osuntoki, I. G.; Harrison, A. P.; Dai, H.; Bao, Y.; Zabet, N. R.

2026-05-25 bioinformatics 10.64898/2026.05.21.726529 medRxiv

Top 0.1%

69.6%

Motivation: Chromosome Conformation Capture methods, including Hi-C, micro-C or Capture-C, are used to map chromatin interactions genome-wide. Most of the existing computational methods do not account for sources of biases (such as DNA accessibility, GC content or TE content) in the data. Results: We previously developed ZipHiC, a Bayesian method based on the hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC), which uses a zero-inflated Poisson distribution to model the noise, signal and false signal of the data, and showed that this approach was able to detect biases from DNA accessibility, GC content and TE content in both Hi-C and micro-C data. Here, we present HiCPotts, another Bayesian method based on the HMRF model and ABC that instead uses a zero-inflated Negative Binomial distribution to model the noise and signal of the data. We systematically show that HiCPotts reduces false positives and increases recovery of true interactions compared to ZipHiC, and also compared to other methods such as FastHiC, Juicer and HiCExplorer. Most importantly, we provide an R/Bioconductor package that allows modelling the noise, signal and false signal using various distributions such as the zero-inflated Negative Binomial (ZINB) and the zero-inflated Poisson (ZIP) distributions. Availability: https://bioconductor.org/packages/HiCPotts/
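
For readers unfamiliar with the two count distributions named above, the sketch below writes out the textbook ZIP and ZINB probability mass functions. It only illustrates the distributions; it is not code from the HiCPotts package, and the parameter names (pi for the zero-inflation probability, size and prob for the Negative Binomial) are assumptions chosen here.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def nb_pmf(k: int, size: float, prob: float) -> float:
    """Negative Binomial probability of k failures before `size` successes."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(prob) + k * math.log(1.0 - prob))
    return math.exp(log_p)


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: extra mass pi at zero, Poisson otherwise."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def zinb_pmf(k: int, pi: float, size: float, prob: float) -> float:
    """Zero-inflated Negative Binomial: extra mass pi at zero, NB otherwise."""
    base = (1.0 - pi) * nb_pmf(k, size, prob)
    return pi + base if k == 0 else base


if __name__ == "__main__":
    print(zinb_pmf(0, pi=0.3, size=2.0, prob=0.5))
```

Compared with ZIP, the ZINB form has an extra dispersion parameter, so it can fit counts whose variance exceeds their mean.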

4

ParaDISM: Precise mapping of short reads to genes with highly homologous regions

Tzimotoudis, D.; Farrugia, R.; Zammit, J.; Masini, M. C.; Balestrucci, A.; Carbott, F. B.; Wettinger, S. B.; Alexiou, P.; Ciach, M. A.

2026-05-21 bioinformatics 10.64898/2026.05.19.726275 medRxiv

Top 0.1%

68.5%

Background: Genes with highly similar genomic copies (paralogs, tandem duplications and pseudogenes) pose a major challenge for Short-Read High Throughput Sequencing (srHTS). High sequence similarity makes it difficult to unambiguously identify the sequences of origin of short reads. This results in misalignment artifacts which can propagate through bioinformatic pipelines and increase error rates in variant calling. Results: We present ParaDISM, a pipeline that refines standard alignments to improve read placement and reduce misalignment-driven false variant calls in highly homologous sequences. ParaDISM assigns a read/read pair to a sequence only when supported by unambiguous sequence-specific evidence by using a multiple sequence alignment of reference sequences to identify disambiguating positions. An optional iterative refinement procedure calls variants from confidently assigned reads, updates the reference sequences, and processes remaining non-assigned reads. We evaluated the performance of ParaDISM both in terms of read alignment and the resulting short variant calls using extensive computational simulation experiments and the Genome in a Bottle HG002 benchmark. We applied ParaDISM to reanalyze two case studies: five public tumour exomes at the GNAQ/GNAQP1 locus, and 18 short-read sequencing datasets of patients diagnosed with Autosomal Dominant Polycystic Kidney Disease (16 exomes and 2 panel sequencing datasets). Compared to the standard aligners (bowtie2, bwa-mem and minimap2), ParaDISM reduced the number of misalignment artifacts and false variant calls, resulting in an increased specificity and precision of the results. Conclusions: ParaDISM improves the precision of read placement and single-nucleotide variant calling in highly homologous reference sequences. By reducing the number of false variant calls caused by misalignment artifacts, ParaDISM provides a stronger level of evidence for the called variants compared to currently available approaches.
The pipeline is open source and available under the MIT license at github.com/BioGeMT/ParaDISM.

5

MucOneUp: A Simulation Framework for MUC1-VNTR Variant Benchmarking

Popp, B.; Saei, H.

2026-05-12 bioinformatics 10.64898/2026.05.08.723876 medRxiv

Top 0.1%

59.5%

Summary: Variable number tandem repeats (VNTRs) in the MUC1 gene cause autosomal dominant tubulointerstitial kidney disease when disrupted by frameshift variants, but the GC-rich 60-bp repeat structure (20-125 copies) challenges variant detection. While tools like VNtyper enable MUC1 variant calling, no gold-standard benchmarking datasets exist for systematic performance evaluation. We present MucOneUp, a specialized simulation framework for generating MUC1-VNTR reference sequences with targeted variants and platform-specific sequencing reads (Illumina, Oxford Nanopore, PacBio). MucOneUp employs Markov chain-based repeat generation, supports diploid simulation with customizable variant placement, and includes additional analysis modules for SNaPshot assay simulation and exploratory frameshift analysis. We validate MucOneUp through a multi-variant, cross-platform benchmark of six tool-platform combinations using 13 distinct frameshift variants and investigate VNTR length effects on detection. Availability and implementation: MucOneUp is accessible at no cost under the MIT License at https://github.com/berntpopp/MucOneUp and archived on Zenodo (DOI: 10.5281/zenodo.19740406). Contact: bernt.popp@charite.de. Supplementary information: Supplementary data are provided with this manuscript.

6

Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences

Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.

2026-05-18 bioinformatics 10.64898/2026.05.14.725168 medRxiv

Top 0.1%

59.3%

Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, predicting interactions, or read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but the existing approaches for the estimation of these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.
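
The link between alignment score and binary label can be sketched as follows. This is an illustration under stated assumptions, not the DiscrimAlign implementation: the global (Needleman-Wunsch) alignment with a linear gap penalty, the scale and offset parameters of the logistic function, and all function names are hypothetical choices made here.

```python
import math
from typing import Dict, List, Sequence, Tuple

SubMatrix = Dict[Tuple[str, str], float]


def simple_matrix(alphabet: str, match: float, mismatch: float) -> SubMatrix:
    """Toy substitution matrix: one score for matches, one for mismatches."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def alignment_score(x: str, y: str, sub: SubMatrix, gap: float) -> float:
    """Best global alignment score with a linear gap penalty (gap < 0)."""
    prev: List[float] = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + sub[(x[i - 1], y[j - 1])],
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pair_probability(score: float, scale: float = 1.0, offset: float = 0.0) -> float:
    """Logistic function mapping an alignment score to P(label = 1)."""
    return 1.0 / (1.0 + math.exp(-(scale * score + offset)))


def log_likelihood(pairs: Sequence[Tuple[str, str]], labels: Sequence[int],
                   sub: SubMatrix, gap: float,
                   scale: float = 1.0, offset: float = 0.0) -> float:
    """Bernoulli log-likelihood of the labels given the alignment scores."""
    total = 0.0
    for (x, y), label in zip(pairs, labels):
        p = pair_probability(alignment_score(x, y, sub, gap), scale, offset)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


if __name__ == "__main__":
    sub = simple_matrix("ACGU", match=1.0, mismatch=-1.0)
    print(log_likelihood([("ACGU", "ACGU"), ("ACGU", "UUUU")], [1, 0], sub, gap=-2.0))
```

Learning would then amount to searching for the substitution scores and gap penalty that maximize this log-likelihood; the optimization procedure itself is not sketched here.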

7

fourSynergy: Ensemble-based interaction calling on 4C-seq data using gradient-free optimization

Wind, S.-M.; Plagwitz, L.; Dix, J.; Heidtmann, G.; Heider, D.; Walter, C.

2026-06-01 bioinformatics 10.64898/2026.05.27.728108 medRxiv

Top 0.2%

58.1%

Motivation: Chromatin organization plays a crucial role in gene regulation and is associated with various severe diseases such as cancer. Since chromatin changes are potentially reversible, a deeper understanding of the alterations needs to be harnessed for the development of new therapies. Circular Chromosome Conformation Capture Sequencing (4C-seq) is a sequencing technique enabling the identification of chromatin interactions between genes and regulatory elements. This work aims to develop an ensemble algorithm that utilizes synergies among available 4C-seq tools, which in turn allows superior predictive performance in interaction calling. Results: We employed existing 4C-seq algorithms using a weighted-voting approach. By optimizing the tool weights according to various predictive metrics using gradient-free optimization strategies, we demonstrate the potential of combining multiple 4C-seq analysis tools for interaction calling. Our results indicate that a weighted-voting based ensemble approach can outperform individual algorithms in various datasets. Although the optimal solutions differ across the 4C-seq datasets, we successfully identified global solutions that outperform the individual algorithms for all datasets analyzed. Availability: https://github.com/sophiewind/fourSynergy, https://github.com/sophiewind/fourSynergy_pip. Contact: sophie.wind@uni-muenster.de. Supplementary information: Supplementary data are available at Journal Name online.
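
A weighted-voting ensemble tuned without gradients can be sketched in a few lines. The example below is only an illustration of the general idea: the 0.5 vote threshold, the F1 objective, the plain random search standing in for "gradient-free optimization", and the function names are assumptions, not details taken from fourSynergy.

```python
import random
from typing import List, Sequence, Tuple


def weighted_vote(calls: Sequence[Sequence[int]], weights: Sequence[float],
                  threshold: float = 0.5) -> List[bool]:
    """calls[t][r] is 1 if tool t calls region r an interaction, else 0."""
    total = sum(weights)
    return [sum(w * c for w, c in zip(weights, region)) / total >= threshold
            for region in zip(*calls)]


def f1_score(pred: Sequence[bool], truth: Sequence[int]) -> float:
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def random_search(calls: Sequence[Sequence[int]], truth: Sequence[int],
                  n_iter: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Keep the best-scoring weight vector among random candidates."""
    rng = random.Random(seed)
    best_w = [1.0] * len(calls)          # start from uniform weights
    best = f1_score(weighted_vote(calls, best_w), truth)
    for _ in range(n_iter):
        cand = [rng.random() + 1e-9 for _ in calls]  # strictly positive weights
        score = f1_score(weighted_vote(calls, cand), truth)
        if score > best:
            best_w, best = cand, score
    return best_w, best


if __name__ == "__main__":
    calls = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
    truth = [1, 0, 1, 0]
    print(random_search(calls, truth))
```

Because the search starts from uniform weights and only accepts improvements, the returned score is never worse than an unweighted majority vote on the data it was tuned on.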

8

ProtmRNA: Cross-Modal Knowledge Transfer from Proteins to Messenger RNA

Xu, G.; Wu, X.; Ma, J.

2026-05-19 bioinformatics 10.64898/2026.05.19.726141 medRxiv

Top 0.2%

53.9%

Motivation: According to the central dogma of molecular biology, messenger RNA (mRNA) sequences are directly translated into amino acid sequences, positioning mRNA as the fundamental intermediary between genetic information and functional proteins. This natural correspondence suggests that mRNA sequence analysis could greatly benefit from the rich evolutionary and functional representations learned by large-scale protein language models. Results: ProtmRNA repurposes the pre-trained ESM-2 protein language model for mRNA sequence processing via cross-modal transfer learning. Evaluated on mRNA- and protein-related datasets, along with eight additional benchmarks compiled in this study, ProtmRNA achieves performance comparable or superior to state-of-the-art mRNA language models while using less than half the pre-training computational resources. This work establishes the potential of cross-modal transfer learning between biological sequences by demonstrating that protein-derived knowledge can be efficiently transferred to mRNA, offering a resource-efficient paradigm for advancing mRNA sequence understanding. Availability and Implementation: The pre-trained ProtmRNA model and the eight CDS-region regression benchmarks curated in this study are publicly available at https://github.com/pesenteur/ProtmRNA.

9

Fast Set Operations for Compact k-mer Sets

Alanko, J.; Depuydt, L.; Marchet, C.; Puglisi, S. J.

2026-05-27 bioinformatics 10.64898/2026.05.24.727514 medRxiv

Top 0.2%

52.9%

The k-mer spectrum of a set of sequences is the set of k-length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k-spectra that also supports efficient k-mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k-spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k-spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k-mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k-mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k-mer set representations offered by the SBWT. 2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms. Digital Object Identifier: 10.4230/LIPIcs.WABI.2026. Supplementary Material: Software (Source Code): https://github.com/LoreDepuydt/sbwt-set-operations. Funding: This work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.
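
To make the definitions concrete, the sketch below builds k-mer spectra as ordinary Python sets and applies the three set operations named in the abstract. It deliberately uses uncompressed hash sets, not the SBWT, so it shows what the operations compute rather than how the paper computes them efficiently; the function name kmer_spectrum is an assumption.

```python
from typing import Iterable, Set


def kmer_spectrum(sequences: Iterable[str], k: int) -> Set[str]:
    """Set of all k-length substrings occurring in any of the sequences."""
    spectrum: Set[str] = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            spectrum.add(seq[i:i + k])
    return spectrum


if __name__ == "__main__":
    a = kmer_spectrum(["ACGTA"], k=3)
    b = kmer_spectrum(["CGTAC"], k=3)
    print(sorted(a & b))  # intersection: k-mers present in both spectra
    print(sorted(a | b))  # union: k-mers present in either spectrum
    print(sorted(a - b))  # difference: k-mers only in the first spectrum
```

The representation is lossy in the sense the abstract mentions: the set records which k-mers occur, not where or how often.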

10

Redesign selective protein binders using contrastive decoding

Xie, Z.; Xu, J.

2026-05-13 bioinformatics 10.64898/2026.05.09.722041 medRxiv

Top 0.2%

52.9%

Motivation: Fixed-backbone sequence design methods such as ProteinMPNN operate on backbone coordinates alone and cannot represent target side-chains at the binding interface. Their decoding algorithm also lacks a mechanism to balance binding affinity and folding stability or to improve selectivity against structurally similar off-targets. These gaps limit the computational design of protein binders with high affinity and specificity. Results: We present RedNet, a multiscale graph neural network that encodes side-chain information of the binding target. We further develop a contrastive decoding algorithm, motivated by the thermodynamic decomposition of binding free energy, that addresses two objectives: (1) balancing binding affinity and folding stability, and (2) improving selectivity against structurally similar off-targets. RedNet reaches 43% native sequence recovery on heterodimers, compared with 37% for ProteinMPNN and 33% for ESM-IF. With contrastive decoding, it matches native-sequence co-folding success (68%) on high-confidence AlphaFold3 targets, exceeding ProteinMPNN (59%) and ESM-IF (61%). On a new benchmark of structurally similar on-/off-target pairs, RedNet with contrastive decoding reaches 64.8% energetic selectivity, ahead of PiFold (55.6%), ProteinMPNN (53.7%), and ESM-IF (53.7%). Availability: Source code and datasets are released at https://github.com/zw2x/rednet_public. Contact: jinbo.xu@gmail.com

11

reComBat-seq: Regularized negative binomial regression for batch-effect correction in underdetermined transcriptomics datasets

Stoyanova, Z.; Malzl, D.; Menche, J.

2026-05-30 bioinformatics 10.64898/2026.05.27.728166 medRxiv

Top 0.2%

52.9%

Motivation: Batch effect correction is essential for the integration of large-scale transcriptomics datasets, such as single-cell RNA-seq or multi-study bulk RNA-seq datasets, because it reduces technical noise that may mask biological signal. Existing correction methods either do not produce count data output, which is crucial for state-of-the-art downstream analyses such as differential expression analysis, or fail to converge in underdetermined study designs. Results: We present reComBat-seq, a method that extends the Negative Binomial regression framework of ComBat-seq by incorporating Elastic Net regularization. This approach resolves problems with rank-deficient design matrices while also preserving the integer nature of count data. Benchmarking on simulated and real datasets such as single-cell RNA-seq data demonstrates that reComBat-seq successfully removes batch effects in complex study designs while maintaining compatibility with downstream differential expression tools. Availability and Implementation: reComBat-seq source code can be found at https://github.com/menchelab/reComBat-seq. All code to reproduce the presented analyses can be found at https://github.com/menchelab/reComBatseq_Studies. Data produced in this study are available at https://doi.org/10.5281/zenodo.19736515. The single-cell RNA-seq data used can be found at https://doi.org/10.5281/zenodo.14234956. Supplementary Information: Proofs and volcano plots of differential expression analysis.

12

S-IGTD: supervised tabular-to-image topology learning via between-group correlation for multiclass classification of biological data

Wu, H.-M.

2026-05-21 bioinformatics 10.64898/2026.05.19.726105 medRxiv

Top 0.2%

52.8%

Motivation: Tabular-to-image methods allow convolutional neural network (CNN)-based classifiers to analyse high-dimensional biological tables by mapping features onto a two-dimensional grid. Existing layouts are usually driven by unsupervised global correlation, which can place class-discriminative features far apart when nuisance or housekeeping covariation dominates the total covariance structure. Results: We present the Supervised Image Generator for Tabular Data (S-IGTD), a supervised extension of IGTD that optimizes tabular-to-image topology by replacing total-correlation distance with one minus the absolute between-group correlation, computed from class-wise feature means, under the Within-And-Between-Analysis (WABA) decomposition. We prove entrywise consistency of the supervised distance matrix under standard moment conditions and identify balanced-class settings in which S-IGTD improves a Signal Dispersion Score (SDS)-related topology objective. In controlled simulations targeting between-group signal, S-IGTD outperformed Euclidean- and correlation-distance IGTD variants in SDS, accuracy and macro-F1 score. Across five biological benchmarks ranging from 4- to 91-class classification, S-IGTD produced compact class-supervised layouts, with 24/35 Holm-adjusted significant SDS wins against seven non-reference layout controls. As a secondary downstream diagnostic, a CNN with batch normalization showed higher mean accuracy than random layouts and correlation-distance IGTD on all real datasets, and higher mean accuracy than Euclidean-distance IGTD on four of five datasets, with the clearest gains on large multiclass cancer and methylation benchmarks. Availability and implementation: Source code, datasets, configuration files and reproducibility scripts are freely available at https://github.com/hanmingwu1103/S-IGTD. Contact: wuhm@g.nccu.edu.tw
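
The supervised distance described in the Results, one minus the absolute between-group correlation, can be sketched for a single pair of features. This is one plausible reading of the abstract, not the S-IGTD code: each sample is replaced by its class mean before a Pearson correlation is taken (so classes are weighted by their size), a distance of 1.0 is returned when a feature has no between-class variation, and the function name is hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def between_group_distance(feature_a: Sequence[float],
                           feature_b: Sequence[float],
                           labels: Sequence[Hashable]) -> float:
    """1 - |correlation of class-wise means| for two feature columns."""

    def class_mean_per_sample(column: Sequence[float]) -> List[float]:
        sums, counts = defaultdict(float), defaultdict(int)
        for value, label in zip(column, labels):
            sums[label] += value
            counts[label] += 1
        # Replace every sample by the mean of its class.
        return [sums[label] / counts[label] for label in labels]

    a = class_mean_per_sample(feature_a)
    b = class_mean_per_sample(feature_b)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0.0 or var_b == 0.0:
        return 1.0  # assumed convention: no between-class signal, maximal distance
    return 1.0 - abs(cov / math.sqrt(var_a * var_b))


if __name__ == "__main__":
    labels = ["a", "a", "b", "b", "c", "c"]
    f1 = [-1.0, 1.0, 0.0, 2.0, 1.0, 3.0]      # class means 0, 1, 2
    f2 = [9.0, 11.0, 19.0, 21.0, 29.0, 31.0]  # class means 10, 20, 30
    print(between_group_distance(f1, f2, labels))
```

Features whose class means move together get a distance near 0 regardless of their within-class noise, which is what would let a layout algorithm place class-discriminative features close to each other.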

13

HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure

Cheng, Z.-R.; Chang, J.-M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725269 medRxiv

Top 0.2%

52.7%

Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.

14

StabCell: Stability selection for clustering and marker detection in single-cell RNA sequencing

Lück, N.; Rossi, A.; Staerk, C.

2026-05-12 bioinformatics 10.64898/2026.05.07.720061 medRxiv

Top 0.2%

52.0%

Motivation: Conventional pipelines for differential expression analysis in single-cell RNA sequencing (scRNA-seq) data first cluster individual cells and then test for differentially expressed genes between the resulting clusters. Using the same data for clustering and testing, however, poses a selective inference problem and can result in overconfidence in differences that may not reflect true biological variation. Results: We introduce StabCell, a stability selection framework which integrates clustering and detection of differentially expressed marker genes. By repeatedly performing clustering and differential expression analysis on complementary random subsamples, StabCell assesses clustering and marker stability, yielding a stable clustering with sets of stable marker genes. In simulations, we demonstrate that StabCell provides approximate empirical per-family error rate (PFER) control, selecting fewer false positive marker genes compared with conventional approaches, especially in cases with low signal-to-noise ratio and low sequencing depth. Applying the method to a cell differentiation dataset from induced pluripotent stem cells (IPSCs) to cardiomyocytes reveals that meaningful marker genes are consistently among the top-ranked genes. These results indicate that StabCell can improve the interpretability and robustness of scRNA-seq analyses. Availability and implementation: An implementation of StabCell in the statistical programming language R is available at https://github.com/LuckyLueck/StabCell. Code to reproduce the results is available at https://github.com/LuckyLueck/StabCell_paper.

15

VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv

Top 0.2%

51.5%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], are also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server that currently offers thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses, including signal processing and peak calling, quantification, variant analysis, alignment statistics, and interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.

16

MetworkPy: A Python Package for Graph- and Information-theoretic Investigation of Metabolic Networks

Griebel, B. T.; Ma, S.

2026-05-29 bioinformatics 10.64898/2026.05.26.727944 medRxiv

Top 0.3%

50.6%

Summary: We present MetworkPy, a Python package for investigating in silico genome-scale models of metabolism (GSMM). By using novel graph- and information-theoretic methods to explore the feasible reaction flux space, MetworkPy quantifies network context and simulates metabolic relationships between sets of enzyme-encoding genes without imposing assumptions of optimal growth. To demonstrate utility, we used MetworkPy to identify metabolic features perturbed by the transcription factor ArgR, a known regulator of arginine biosynthesis in Mycobacterium tuberculosis, based on published transcriptome data generated from an argR mutant strain. MetworkPy successfully linked reaction flux shifts in ArgR's transcriptome-constrained GSMM to arginine biosynthesis, a link that cannot be easily ascertained by conventional constraint-based optimization modeling approaches. MetworkPy offers a flexible toolbox for metabolic contextualization of genes-of-interest in microbial, eukaryotic, and multi-organism systems with potential applications for medicine and bioengineering. Availability and implementation: The MetworkPy package can be retrieved from PyPI (https://pypi.org/project/metworkpy/) and GitHub (https://github.com/Ma-Lab-Seattle-Childrens-CGIDR/metworkpy). Code for analyses performed in this paper can be retrieved from GitHub (https://github.com/Ma-Lab-Seattle-Childrens-CGIDR/metworkpy_application_note). Supplementary Information: Supplementary data are available online at bioRxiv.

17

Multiple versus pairwise sequence alignments for protein phylogenetics using foundation models

Alibutud, R. F.; Kumar, S.

2026-05-29 bioinformatics 10.64898/2026.05.26.727927 medRxiv

Top 0.3%

44.3%

Phylogenetic inference is a common task in molecular and evolutionary biology and has conventionally required a multiple sequence alignment (MSA), a statistical model of amino acid substitutions, and an optimality principle. Recently, global models of amino acid substitutions have been inferred from millions of MSAs using transformer-based deep learning, resulting in protein foundation models (pFMs), also known as protein language models (PLMs). Training pFMs on MSAs hypothetically enables them to encode residue dependencies and the phylogenetic structure of the MSA collection. In contrast, pFMs trained on individual sequences lack access to such phylogenetic structure. Here, we assess the phylogeny inference gains offered by the use of MSAs for training pFMs by comparing the relative accuracies of phylogenies inferred using two types of pFMs: one trained on a large collection of MSAs (msat-pFM, [1]) and the other trained using a collection of single sequences (esm-pFM, [2]). For the msat-pFM analysis, we inferred neighbor-joining trees using pairwise distances estimated directly from the sequence attention matrices. For the esm-pFM, pairwise distances were obtained using the correlation of attentions of homologous residues, where pairwise sequence alignments (PSA) were used to establish residue homologies. Surprisingly, MSA phylogenies inferred using the msat-pFM were less accurate than those inferred using the esm-pFM. This pattern was seen across datasets spanning both small and large numbers of species and proteins. Also, PSA phylogenies obtained using residue attentions from early esm-pFM layers were much more accurate. These results suggest that the multiple sequence alignment step, which is obligatory to establish residue homologies across multiple sequences, may not add information when using evolutionary distances based on attentions in pFMs.

18

PARiS: Probabilistic Assignment and Repartitioning of isomiR Sequences: A data-driven method for denoising isomiR read count data

Swan, H. K.; Baran, A. M.; Aparicio-Puerta, E.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-12 bioinformatics 10.64898/2026.05.09.723882 medRxiv

Top 0.3%

43.6%

MicroRNAs (miRNAs) are non-coding RNAs, approximately 18-24 nucleotides in length, with important gene regulatory functions. In small RNA sequencing (sRNA-seq), observed isoforms of miRNA, called isomiRs, arise from many biological and technical processes. Alterations in isomiR expression have been linked to a wide variety of human diseases, from cancers to neurological diseases. However, it is difficult to distinguish between technical and biological isomiRs. We present PARiS, an algorithm for the Probabilistic Assignment and Repartitioning of isomiR Sequences, that identifies technical error isomiRs in sRNA-seq data and reassigns them to their most likely biological source. We assess the ability of PARiS to identify and remove error isomiR sequences in a realistic simulation study. Additionally, we compare PARiS to alternative approaches, focusing on downstream miRNA-level differential expression analysis in a variety of settings, including a set of simulated datasets, an experimental benchmark dataset, and three colorectal adenocarcinoma cell lines.
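The abstract does not describe the PARiS model, so the following is not PARiS. It is a toy Python sketch of the general idea of reassigning reads from a suspected error sequence to its most likely source, under loudly stated assumptions: sequences of equal length, substitution errors only, a fixed per-base error rate, and a score proportional to source abundance times the probability of the observed mismatches. The function names, the `error_rate` parameter, and the example sequences are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def most_likely_source(read, source_counts, error_rate=0.01):
    """Pick the source maximizing abundance * P(read | source) (toy model)."""
    best, best_score = None, 0.0
    for src, count in source_counts.items():
        if len(src) != len(read):
            continue  # this sketch ignores length-changing variants
        d = hamming(read, src)
        score = count * error_rate ** d * (1 - error_rate) ** (len(read) - d)
        if score > best_score:
            best, best_score = src, score
    return best


def repartition(counts, error_seqs, error_rate=0.01):
    """Move counts of flagged error sequences onto their most likely source."""
    sources = {s: c for s, c in counts.items() if s not in error_seqs}
    cleaned = dict(sources)
    for seq in error_seqs:
        target = most_likely_source(seq, sources, error_rate)
        if target is None:
            cleaned[seq] = counts[seq]  # no candidate source: leave untouched
        else:
            cleaned[target] += counts[seq]
    return cleaned


# Toy read counts (made-up sequences, not real isomiRs).
counts = {"ACGTACGT": 100, "ACGTTTTT": 80, "ACGTACGA": 3}
cleaned = repartition(counts, error_seqs={"ACGTACGA"})
print(cleaned)
```

Total read count is preserved by construction, which is the property a downstream miRNA-level differential expression analysis would rely on.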

19

MIMOSA: A model-independent framework for transcription factor binding site motif similarity assessment

Tsukanov, A. V.; Levitsky, V. G.

2026-05-17 bioinformatics 10.64898/2026.05.13.725009 medRxiv

Top 0.3%

41.6%

Motivation: Transcription factors (TFs) regulate gene expression by binding specific DNA sequences, which are commonly represented by motif models. Although position weight matrices (PWMs) remain the dominant motif representation, alternative models, such as Markov models, can capture interpositional dependencies and may provide higher predictive performance. However, existing motif comparison tools are designed mainly for PWMs or require motifs to be reduced to PWM/PPM representations. This creates a major bottleneck for comparing motifs represented by different model architectures. This limitation complicates the interpretation of de novo motif discovery results and hinders the systematic integration of diverse motif models into genomic analyses. Results: We present MIMOSA (Model-Independent Motif Similarity Assessment), a model-independent framework for direct comparison of TF binding site (TFBS) motifs regardless of their mathematical representation. MIMOSA assesses motif similarity by comparing calibrated recognition profiles produced by motifs of different models on the same DNA sequence set, rather than by comparing the motifs themselves. In a cross-database benchmark on HOCOMOCO motifs, MIMOSA achieved retrieval performance comparable to established PWM-oriented tools, including Tomtom and MACRO-APE, with MRR and Recall@k close to the best-performing methods. Pairwise ranking comparisons showed that MIMOSA captures a similarity signal consistent with existing approaches while providing a representation-independent comparison strategy. Application to de novo motifs derived from ChIP-seq data for the ATF3 TF demonstrated that recognition-profile comparison distinguished alternative spacer variants represented as separate PWMs from their integration within more flexible models such as BaMM and Slim. Thus, MIMOSA enables quantitative cross-model motif comparison and supports interpretation of motif heterogeneity in TFBS analyses.
Availability and implementation: MIMOSA is implemented in Python and is freely available at https://github.com/ubercomrade/mimosa.
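The retrieval metrics named in this abstract, MRR and Recall@k, can be sketched in a few lines of Python. This is a generic illustration, not code from the MIMOSA repository: the function names, the query and motif labels, and the choice to define Recall@k as the per-query fraction of relevant items found in the top k (averaged over queries) are assumptions, since the abstract does not spell out its exact definitions.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is retrieved."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def mean_reciprocal_rank(rankings, truths):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(rankings[q], truths[q]) for q in rankings) / len(rankings)


def recall_at_k(rankings, truths, k):
    """Average fraction of each query's relevant items found in its top k."""
    total = 0.0
    for q, ranked in rankings.items():
        hits = len(set(ranked[:k]) & truths[q])
        total += hits / len(truths[q])
    return total / len(rankings)


# Hypothetical query motifs with ranked database hits and known true matches.
rankings = {"q1": ["a", "b", "c"], "q2": ["b", "c", "a"]}
truths = {"q1": {"a"}, "q2": {"a"}}

print(mean_reciprocal_rank(rankings, truths))
print(recall_at_k(rankings, truths, k=1))
print(recall_at_k(rankings, truths, k=3))
```

With a single true match per query, this Recall@k reduces to the fraction of queries whose match appears in the top k.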

20

Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models

Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.

2026-05-13 bioinformatics 10.64898/2026.04.15.718808 medRxiv

Top 0.4%

41.1%

Background: The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck: the lack of independent and rigorous methodologies for estimation of model accuracy (EMA). Results: Here we present DeepUMQA-Global, a single-model deep learning framework for estimating the accuracy of protein structure models. Our method employs a structure-sequence cross-consistency mechanism to evaluate the bidirectional compatibility between the predicted structure and the input sequence, enabling comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in Pearson correlation and 49.0% in Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced in the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Conclusions: Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
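The Pearson and Spearman correlations reported in this abstract measure how well predicted accuracy scores track true ones. The Python below is a minimal, dependency-free sketch of both statistics. The function names and the example scores are made up for illustration and are not DeepUMQA-Global outputs; Spearman is computed here as Pearson on average ranks, which is the usual tie-aware definition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted vs. true global accuracy scores for four models.
predicted = [0.2, 0.5, 0.9, 0.7]
true = [0.1, 0.4, 0.8, 0.9]
print(pearson(predicted, true), spearman(predicted, true))
```

Pearson rewards a linear relationship between predicted and true scores, while Spearman only requires that the ranking of models be preserved, which is why EMA benchmarks commonly report both.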