Bioinformatics — Latest Matching Preprints

1

Deciphering the links between metabolism and health by building small-scale knowledge graphs: application to endometriosis and persistent pollutants

Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.

2026-03-04 bioinformatics 10.64898/2026.03.02.709027 medRxiv

Top 0.1%

85.7%

Motivation: Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, and patients' symptoms). This limits their usability for addressing specific research questions.
Results: We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million compound-biological concept associations) that constructs local, keyword-based sub-graphs tailored to biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as yet unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy comparing the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations requiring further investigation. We also showed that removing duplicated nodes and edges from the KG improves the proportion of validated nodes (from 8.4% to 16%) and doubles the precision (from 0.085 to 0.197) while maintaining the recall (0.954 to 0.952), illustrating a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, this framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation. Applied to endometriosis, it highlights potential mechanisms linking exposure to POPs to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.
Availability: The code and data underlying this article are available in the MetExplore repository at https://forge.inrae.fr/metexplore/kg4j.
Contact: clement.frainay@inrae.fr
Key Messages:
- Kg4j builds targeted knowledge maps from large biomedical databases using simple keywords.
- Keyword-driven exploration reveals the most relevant disease-exposure relationships without navigating millions of connections.
- Applied to endometriosis, the method recovered known links with persistent organic pollutant exposure.
- Removing redundant information and formatting the knowledge graph as a Labeled Property Graph improves the reliability of extracted knowledge.
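The precision and recall figures quoted in this abstract can be illustrated with a small sketch. It assumes validation means comparing the set of node labels in a sub-graph with a set of concepts found in the literature; the toy labels and the case-folding used as a stand-in for deduplication are hypothetical and are not taken from Kg4j.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted node set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: the raw graph holds redundant spellings of the same concept.
literature = {"dioxin", "pcb", "estradiol"}
raw_nodes = ["dioxin", "Dioxin", "DIOXIN", "pcb", "PCB", "estradiol", "caffeine", "Caffeine"]

# Assumed deduplication rule (case-folding); the real framework may merge nodes differently.
deduplicated = {name.lower() for name in raw_nodes}

print(precision_recall(raw_nodes, literature))     # (0.375, 1.0)
print(precision_recall(deduplicated, literature))  # (0.75, 1.0)
```

In this toy case, merging redundant nodes shrinks the denominator of precision without removing any validated concept, which is the kind of effect the abstract reports.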

2

Fuzzifier*: Robust and Sensitive Multi-omics Data Analysis

Offensperger, F.; Pan, C.; Sinn, E.; Zimmer, R.

2026-02-09 bioinformatics 10.64898/2026.02.06.701074 medRxiv

Top 0.1%

83.1%

Motivation: Categorization is an important means for interpreting data and drawing conclusions. Often, the derived categories provide evidence for diagnostic or even therapeutic approaches. The standard pipelines for differential analysis of multi-omic high-throughput data, and single-cell data in particular, yield (ranked) lists of possibly differential features after applying appropriate effect-size or significance thresholds to computed p-values and/or fold changes.
Results: We propose the Fuzzifier* pipeline for the differential analysis of any type of high-throughput data, either raw input data or fold-change data of groups of a (small or large) number of replicates. In Fuzzifier*, categorization can be applied to any step of the analysis pipeline according to custom-designed fuzzy concepts (Fuzzifier). Thus, any (fuzzified) analysis option corresponds to a path in a commutative diagram specifying the Fuzzifier* pipeline. Fuzzifier* computes a user-defined set of paths and presents an overview of the results, thereby identifying both highly reliable (consensus) and sensitive (path-specific) features. Fuzzifier* is a method that can be applied to any analysis pipeline to obtain different views on the data and yield more reliable results. This is demonstrated by the identification of context-specific miRNAs for individual cancer types from TCGA data. Fuzzifier* could both validate known cancer-specific miRNAs and identify novel candidates. In comparison to statistical tests, Fuzzifier* focuses on value distributions of tumor and normal samples as well as paired fold-change distributions and thus identifies condition-specific features from a relatively small number of replicates.
Availability and Implementation: https://github.com/zimmerlab/fuzzifier
Contact: offensperger@bio.ifi.lmu.de and zimmer@ifi.lmu.de
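As a generic illustration of fuzzy categorization, the sketch below maps a log2 fold change to membership degrees in three fuzzy concepts using trapezoidal membership functions. The concept names and breakpoints are hypothetical choices for this example, and the code does not reproduce the Fuzzifier* package or its interface.

```python
import math


def trapezoid(x, a, b, c, d):
    """Membership that rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def fuzzify(log2_fold_change):
    """Membership degrees of one value in three hypothetical fuzzy concepts."""
    inf = math.inf
    return {
        # Open-ended shoulders are modelled with infinite breakpoints.
        "down": trapezoid(log2_fold_change, -inf, -inf, -2.0, -0.5),
        "unchanged": trapezoid(log2_fold_change, -2.0, -0.5, 0.5, 2.0),
        "up": trapezoid(log2_fold_change, 0.5, 2.0, inf, inf),
    }


print(fuzzify(0.0))   # {'down': 0.0, 'unchanged': 1.0, 'up': 0.0}
print(fuzzify(1.25))  # {'down': 0.0, 'unchanged': 0.5, 'up': 0.5}
```

Unlike a hard threshold, a value of 1.25 is counted half as "unchanged" and half as "up", so borderline features are not forced into a single category.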

3

Scalable computation of ultrabubbles in pangenomes by orienting bidirected graphs

Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.

2026-03-31 bioinformatics 10.64898/2026.03.28.714704 medRxiv

Top 0.1%

80.0%

Motivation: Pangenome graphs are increasingly used in bioinformatics, with applications ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case.
Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex (a common property of pangenome graphs) can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph that is, in practice, of the same size. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, while vg requires more than one hour and four times more RAM.
Availability: Our method is implemented in the BubbleFinder tool (github.com/algbio/BubbleFinder), via the new ultrabubbles subcommand.
Contact: alexandru.tomescu@helsinki.fi

4

Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment

Jimenez-Blanco, A.; Lopez-Villellas, L.; Moure, J. C.; Moreto, M.; Marco-Sola, S.

2026-02-14 bioinformatics 10.64898/2026.02.12.705572 medRxiv

Top 0.1%

79.4%

Motivation: Sequence-to-graph alignment is a central problem in bioinformatics, with applications in multiple sequence alignment (MSA) and pangenome analysis, among others. However, current algorithms for optimal affine-gap alignment impose high memory and computational requirements, limiting their scalability to aligning long sequences to complex graphs. Practical solutions partially address this problem using heuristic strategies that ultimately trade off optimality for speed.
Results: This work presents Theseus, a novel, fast, and optimal affine-gap sequence-to-graph alignment algorithm. Theseus leverages similarities between genomic sequences to accelerate the alignment computation and reduces the overall memory requirements without compromising optimality. To that end, Theseus exploits the diagonal transition property to process only a subset of the dynamic programming cells, combined with a sparse-data strategy that enables efficient sequence-to-graph alignment. Moreover, our algorithm supports optimal affine-gap alignment on arbitrary directed graphs, including those with cycles. We evaluate Theseus on two key problems: multiple sequence alignment (MSA) and pangenome read mapping. For MSA, we compare it against the state-of-the-art methods SPOA, abPOA, and POASTA. Theseus is 2.0x to 232.2x faster than the other two optimal aligners, SPOA and POASTA. Compared with abPOA, a heuristic aligner, Theseus is 3.3x faster on average, while ensuring optimality. For pangenome read mapping, we benchmark Theseus against the alignment stage of the popular mapping tool vg map, along with the alignment kernels of SPOA, abPOA, and POASTA. Theseus outperforms the other methods, showing a 1.9x to 16.9x speed improvement on short reads.
Availability: Theseus code and documentation are publicly available at https://github.com/albertjimenezbl/theseus-lib.
Contact: albert.jimenez.blanco@upc.es
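For readers unfamiliar with the affine-gap cost model, here is a minimal sketch of the classic three-matrix (Gotoh-style) dynamic program for two plain sequences. It fills every cell, so it is the baseline that diagonal-transition methods try to avoid, and it is not the Theseus algorithm or a sequence-to-graph aligner. The penalty values are hypothetical defaults chosen for the example.

```python
import math


def affine_gap_cost(a, b, mismatch=4, gap_open=6, gap_extend=2):
    """Minimum alignment cost of a vs b; a gap of length L costs gap_open + L * gap_extend."""
    n, m = len(a), len(b)
    inf = math.inf
    H = [[inf] * (m + 1) for _ in range(n + 1)]  # best cost of any alignment of a[:i], b[:j]
    E = [[inf] * (m + 1) for _ in range(n + 1)]  # best cost when the alignment ends with a gap in a
    F = [[inf] * (m + 1) for _ in range(n + 1)]  # best cost when the alignment ends with a gap in b
    H[0][0] = 0
    for j in range(1, m + 1):
        E[0][j] = H[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        F[i][0] = H[i][0] = gap_open + i * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from the best alignment or extend an existing gap.
            E[i][j] = min(H[i][j - 1] + gap_open + gap_extend, E[i][j - 1] + gap_extend)
            F[i][j] = min(H[i - 1][j] + gap_open + gap_extend, F[i - 1][j] + gap_extend)
            diagonal = H[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = min(diagonal, E[i][j], F[i][j])
    return H[n][m]


print(affine_gap_cost("ACGT", "AGT"))       # 8: one gap of length 1
print(affine_gap_cost("ACGTACGT", "ACGT"))  # 14: one gap of length 4
```

The second call shows why the model is called affine: a single four-base gap costs 6 + 4 * 2 = 14, which is cheaper than four separately opened one-base gaps.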

5

Evaluating transformer-based models for structural characterization of orphan proteins

Seckin, E.; Colinet, D.; Danchin, E.; Sarti, E.

2026-03-12 bioinformatics 10.64898/2026.03.10.709490 medRxiv

Top 0.1%

78.6%

Motivation: Transformer-based models (TBMs) are state-of-the-art deep learning architectures that predict protein structural features with high accuracy. Despite methodological differences, they all rely on large protein sequence datasets structured by homology, as homologous proteins typically share similar structures. However, 5-30% of eukaryotic proteomes consist of orphan proteins, that is, sequences without detectable similarity to known families. Although they may share structural traits with characterized proteins, their lack of homology makes them an ideal dataset for evaluating TBM generalization beyond familiar sequence space.
Results: We compared predictions from several widely used TBM architectures on an expert-curated set of orphan proteins from the Meloidogyne genus. None of these proteins has an experimentally determined structure. To assess model performance, we conducted consistency analyses, comparing predicted features with those observed in sets of known homologous proteins and across models. Multiple sequence alignment-based approaches such as AlphaFold2 performed poorly on orphan proteins, as did single-sequence or embedding-based language models including ESMFold, OmegaFold, and ProtT5. This limited performance cannot be fully attributed to intrinsic disorder, as confirmed by independent non-TBM disorder predictors. While accurate tertiary structure prediction remains out of reach, secondary structure is more reliably captured: predictors share about 70% of secondary structure elements on average, regardless of global fold similarity, and these elements are consistently identified by dedicated secondary structure tools.
Availability: All data and analysis scripts are available at https://doi.org/10.5281/zenodo.18788931
Contact: edoardo.sarti@inria.fr

6

Sassy2: Batch Searching of Short DNA Patterns

Beeloo, R.; Groot Koerkamp, R.

2026-03-12 bioinformatics 10.64898/2026.03.10.710811 medRxiv

Top 0.1%

77.6%

Motivation: Searching short DNA patterns such as barcodes, primers, or CRISPR spacers within sequencing reads or genomes is a fundamental task in bioinformatics. These problems are instances of multiple approximate string matching (MASM) [Baeza-Yates and Navarro, 1997], which requires locating all occurrences with up to k errors of multiple patterns of length m in a text of length n. Classical approaches based on seeding with exact matches become inefficient for short patterns (m ≤ 64 bp) as k increases, producing either many spurious hits or missing true matches. Our previous work, Sassy1, showed that careful hardware optimization drastically accelerates single-pattern searches in long texts by distributing chunks of the text across SIMD lanes.
Methods: Sassy2 distributes multiple patterns across SIMD lanes to maximize parallelism when searching batches of short patterns. When k is small, often only a short substring of the pattern of length O(k) is needed to reject a possible match. Thus, Sassy2 first examines short suffixes of the patterns (e.g., the last 16 bp of 32 bp patterns), allowing more (but smaller) parallel SIMD lanes. Only positions passing this suffix filter undergo full pattern verification.
Results: On synthetic data, Sassy2 achieves 10-50x speedups over Sassy1 for short texts (n ≤ 200 bp) and 2-4x for large texts (n ≥ 1 Mbp). On real-world tasks with 16 threads, Sassy2 reaches over 100 Gbp/s text throughput per guide when searching 312 gRNAs across the human genome and 116 Gbp/s throughput when demultiplexing Nanopore reads with 96 barcodes. In both cases, Sassy2 outperforms Sassy1 by 2-5x and Edlib by 20-45x.
Availability: Sassy2 is implemented in Rust and available at github.com/RagnarGrootKoerkamp/sassy.
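The filter-then-verify idea described in the Methods can be sketched in scalar Python. This is only an illustration under simplifying assumptions: it uses the textbook Sellers dynamic program for approximate matching, handles one pattern at a time, and has no SIMD, so it says nothing about how Sassy2 is actually implemented. The function names are hypothetical.

```python
def match_ends(pattern, text, k):
    """End positions j (exclusive) where pattern matches a substring of text ending at j with <= k edits."""
    m = len(pattern)
    col = list(range(m + 1))  # edit distances against the empty text prefix
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag, col[0] = col[0], 0  # cost 0 in row 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = min(prev_diag + (pattern[i - 1] != ch), col[i] + 1, col[i - 1] + 1)
            prev_diag, col[i] = col[i], cost
        if col[m] <= k:
            ends.append(j)
    return ends


def match_ends_filtered(pattern, text, k, suffix_len):
    """Same result as match_ends, but only verifies positions where a pattern suffix matches."""
    suffix = pattern[-suffix_len:]
    window = len(pattern) + k  # a match with <= k edits spans at most m + k text characters
    hits = []
    for j in match_ends(suffix, text, k):  # cheap filter on the short suffix
        start = max(0, j - window)
        if (j - start) in match_ends(pattern, text[start:j], k):  # full verification
            hits.append(j)
    return hits


text = "TTACGTACGATTTACCTACGT"
print(match_ends_filtered("ACGTACGA", text, 1, 4) == match_ends("ACGTACGA", text, 1))  # True
```

The filter is safe because any occurrence of the full pattern with at most k errors ending at a position also contains an occurrence of its suffix with at most k errors ending at that same position, so no true match is discarded.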

7

CLEAR: Concise List Enrichment Analysis Reducing Redundancy

Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.

2026-04-01 bioinformatics 10.64898/2026.03.30.715378 medRxiv

Top 0.1%

73.8%

Motivation: High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics.
Results: We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets.
Availability and implementation: The source code, data, and a brief tutorial are freely available at https://github.com/jiatuya/CLEAR

8

AniAnn's: alignment-free annotation of tandem repeat arrays using fast average nucleotide identity estimates

Sweeten, A. P.; Schatz, M.; Phillippy, A. M.

2026-01-28 bioinformatics 10.64898/2026.01.27.702063 medRxiv

Top 0.1%

73.6%

Motivation: Satellite DNA has long posed challenges for genome assembly and analysis due to its low sequence complexity and poor mappability. These large heterochromatic arrays of tandem repeats are ubiquitous across eukaryotic genomes, yet remain understudied. Current methods for annotating satellite regions, and other classes of tandem repeat arrays, are limited in their ability to annotate divergent or novel sequences.
Results: In this work, we introduce AniAnn's, an algorithm for annotating large blocks of tandemly repeating DNAs. AniAnn's exploits the high Average Nucleotide Identity (ANI) shared between repeat units of the same array to quickly and accurately infer the boundaries of such arrays. We show that AniAnn's improves the annotation of satellites and other tandem repeats within a variety of plant and animal genomes, while requiring only a fraction of the runtime compared to previous approaches. We conclude by exploring several use cases of AniAnn's as a lightweight method for masking repeats prior to whole-genome alignment as well as the de novo annotation and classification of satellite repeats.
Availability: AniAnn's is open source software and available at github.com/marbl/anianns

9

SpotGraphs: Graph-based analysis of spatially resolved transcriptional data in R

Lee, A. J.; Sanin, D. E.

2026-03-16 bioinformatics 10.64898/2026.03.12.711347 medRxiv

Top 0.1%

71.1%

Introduction: Common spatial transcriptomic analysis pipelines in R focus on pre-processing and visualization, while providing limited and indirect methods to leverage true spatially resolved quantification of transcripts. Often, x,y-coordinates in spatial transcriptomics (ST) data are integrated into analysis via "spatially aware" normalization (Salim et al., 2024), clustering methods (Zhao et al., 2021), or the identification of spatially variable genes (Yan et al., 2025). Though useful, these methods do not provide any opportunity for analysts to adjust or interrogate the underlying graphs that define adjacencies between spots in their data. Here, we present SpotGraphs, a package that gives the user a more direct and flexible way to interact with the x,y-coordinates of their ST data in R through the existing igraph infrastructure (Antonov et al., 2023; Csardi et al., 2025; Csardi & Nepusz, 2006). Similar functionality exists in Python through Squidpy's graph API (Palla et al., 2022), and we compare results obtained from both packages, demonstrating similar performance. Additionally, we provide a set of tools that are useful for ST data analysis, including a toolkit to filter low-quality spots lying on tissue debris without relying on arbitrary thresholds, to edit spot-level adjacencies based on spatial clusters, and to identify centers or boundaries of user-defined neighborhoods of interest.
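To make the notion of a spot adjacency graph concrete, the following sketch builds one from x,y-coordinates with a distance cutoff and then finds the boundary of a neighbourhood. It is a plain-Python stand-in written for this listing, not SpotGraphs (an R package), igraph, or Squidpy code; the function names, the radius rule, and the toy grid are all assumptions.

```python
import math


def radius_graph(coords, radius):
    """Adjacency dict linking spots whose Euclidean distance is <= radius (quadratic time)."""
    adjacency = {spot: set() for spot in coords}
    spots = list(coords)
    for index, a in enumerate(spots):
        for b in spots[index + 1:]:
            if math.dist(coords[a], coords[b]) <= radius:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def boundary_spots(adjacency, members):
    """Spots of a neighbourhood that have at least one neighbour outside it."""
    members = set(members)
    return {spot for spot in members if adjacency[spot] - members}


# Hypothetical 3 x 3 grid of spots named by their coordinates.
coords = {f"s{x}{y}": (x, y) for x in range(3) for y in range(3)}
graph = radius_graph(coords, radius=1.0)

print(sorted(graph["s11"]))                                             # ['s01', 's10', 's12', 's21']
print(sorted(boundary_spots(graph, {"s00", "s01", "s10", "s11"})))      # ['s01', 's10', 's11']
```

Because the graph is an ordinary data structure, an analyst can inspect or edit individual adjacencies (for example, removing an edge that crosses a tissue gap) before running downstream analyses.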

10

HORI-EN: Atomic-level energetic profiling and higher-order network identification in protein structures

Joshi, S.; Sowdhamini, R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715065 medRxiv

Top 0.1%

70.8%

Motivation: Characterizing atomic-level stability and cooperative interaction networks is essential for understanding protein function and evolution. However, existing tools often lack the precision to integrate detailed physicochemical energies with higher-order graph-theoretic analyses.
Results: We present HORI-EN, an updated implementation of the HORI framework, featuring hybrid energetic scoring (Physicochemical + Knowledge-Based) and a Normalized Interaction Score (NIS) based on cumulative distribution functions. HORI-EN identifies higher-order cliques of interacting residues, revealing cooperative stabilization networks. Validation on the SKEMPI v2 dataset demonstrates that HORI-EN shows discriminative performance in identifying mutational hotspots, achieving an ROC-AUC of 0.780 on the full dataset and 0.844 on a clean benchmark. Enrichment analysis indicates a 3.1-fold increase in precision for the top 1% of predictions. Furthermore, analysis of the residue interaction network recovers 77.4% of non-contacting hotspots by identifying one-hop bridging interactions to the partner chain. Beyond hotspot prediction, HORI-EN distinguishes native structures from decoys and captures conserved energetic signatures in evolutionary case studies of serine proteases and lipases.
Availability and Implementation: The web server is freely available at https://caps.ncbs.res.in/HORI-EN and source code is available at https://github.com/thesixeyedknight/HoriPy.
Contact: mini@ncbs.res.in

11

XPCLRS: fast selection signature detection using cross-population composite likelihood ratio

Talenti, A.

2026-02-27 genomics 10.64898/2026.02.27.708459 medRxiv

Top 0.1%

69.5%

Summary: The growing size of genomic datasets poses serious computational challenges, especially for laboratories and groups with limited access to high-performance compute facilities. This problem affects a broad range of analyses, including selection signature methods used to identify genomic regions undergoing selective pressure. Many of these methods were developed with SNP arrays in mind and are often not designed with scalability as a priority. Cross-population composite likelihood ratio (XP-CLR) is one such approach, intended to detect hard selective sweeps by comparing two groups of individuals. In this paper I introduce XPCLRS, a Rust implementation of the XP-CLR selection signatures method. It can be hundreds of times faster, supports multithreading natively, and produces results comparable to its original counterpart (R = 0.976). XPCLRS lowers the computational barrier to applying XP-CLR alongside other statistics, helping detect regions of interest, reduce false positives, and ultimately improve the robustness of genomic studies.
Availability and implementation: The source code for the software is freely accessible on GitHub (https://www.github.com/RenzoTale88/xpclrs). Additionally, the software is distributed via crates.io (https://crates.io/crates/xpclrs) and as a Docker container (https://hub.docker.com/r/tale88/xpclrs) for ease of installation. The software is released under the MIT open source license.
Contact: Andrea.Talenti@glasgow.ac.uk

12

FuzzyClusTeR: a web server for analysis of tandem and diffuse DNA repeat clusters with application to telomeric-like repeats

Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.

2026-03-23 bioinformatics 10.64898/2026.03.19.712643 medRxiv

Top 0.1%

68.1%

DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster

13

TSUMUGI: a platform for phenotype-driven gene network identification from comprehensive knockout mouse phenotyping data

Kuno, A.; Matsumoto, K.; Taki, T.; Takahashi, S.; Mizuno, S.

2026-02-19 bioinformatics 10.64898/2026.02.18.706720 medRxiv

Top 0.1%

67.7%

Summary: Deciphering complex organismal phenotypes requires elucidation of coordinated functions among multiple genes, yet this remains a fundamental challenge in functional genetics. The International Mouse Phenotyping Consortium (IMPC) has recently established a comprehensive phenotypic atlas based on systematic single-gene knockout mouse lines, providing an unprecedented resource for gene-phenotype associations in mammals. However, extracting phenotype-associated multiple gene relationships remains challenging. Here, we present TSUMUGI, a platform that identifies a phenotype-driven gene network by leveraging gene-phenotype associations from the IMPC. TSUMUGI enables users to initiate analyses from a user-specified phenotype or gene list and interactively explore a gene network. In addition, TSUMUGI highlights human disease-associated genes, supports flexible network filtering, and allows seamless export of a gene network for downstream analyses. By linking shared phenotypic signatures to putative functional gene modules, TSUMUGI provides a framework for systematic interpretation and hypothesis generation of complex organismal phenotypes.
Availability and implementation: The web application is available online at https://larc-tsukuba.github.io/tsumugi/. The command-line tools are distributed via PyPI (https://pypi.org/project/TSUMUGI) and Bioconda (https://bioconda.github.io/recipes/tsumugi/README.html). Source code and documentation are hosted at GitHub (https://github.com/akikuno/TSUMUGI-dev) and archived on Zenodo (https://doi.org/10.5281/zenodo.18464478) under the MIT license.

14

MetaTracer: A nucleotide alignment-based framework for high-resolution taxonomic and transcript assignment in metatranscriptomic data

Furstenau, T.; Shaffer, I.; Hsu, K.-L. C.; Pearson, T.; Ernst, R. K.; Fofanov, V.

2026-02-23 bioinformatics 10.64898/2026.02.20.707109 medRxiv

Top 0.1%

66.6%

Summary: MetaTracer is a nucleotide alignment-based tool for metatranscriptomic analysis of complex bacterial communities that assigns sequence reads to both taxonomic groups and expressed genes in a single pass. Full nucleotide-level alignment improves accuracy relative to k-mer-based classifiers and preserves species-level resolution that is often lost in protein-based approaches. By retaining alignment coordinates and mapping reads directly to annotated genomic features, MetaTracer enables direct attribution of gene expression to specific microbial species. On simulated datasets, MetaTracer achieves high accuracy for both taxonomic and gene assignment. Applied to real dental plaque metatranscriptomic datasets, MetaTracer resolves species-specific transcriptional activity and detects reproducible differences in microbial gene expression between children with early childhood caries and healthy controls.
Availability and implementation: MetaTracer is implemented as a Python-based workflow wrapper (metatracer v0.1.0) that depends on the mtsv-tools core engine (v2.1.0), which is written in Rust. The required functionality is supported by the v2.1.0 release of mtsv-tools. Both packages are open source under the MIT license and are available at github.com/FofanovLab/metatracer and github.com/FofanovLab/mtsv-tools. Versioned releases are archived at Zenodo (DOI: 10.5281/zenodo.18665766 and DOI: 10.5281/zenodo.18718002). Installation is supported via Bioconda.
Contact: Viacheslav.Fofanov@nau.edu

15

LinkDTI: Drug-Target Interactions prediction through a Link Prediction framework on Biomedical Knowledge Graph

Mondal, M.; Arunachalam, S.; Wu, S.; Datta, A.

2026-02-23 bioinformatics 10.64898/2026.02.21.707210 medRxiv

Top 0.1%

66.5%

Computational drug-target interaction (DTI) prediction serves as a valuable tool for drug discovery and repurposing by cost-effectively narrowing down the potential drug-target space. This paper presents LinkDTI, a computational framework that predicts DTIs by identifying connections within a heterogeneous knowledge graph of drugs, proteins, diseases, and side effects. Unlike methods that rely on mathematical techniques like matrix completion or similarity-based scoring, LinkDTI uses an advanced graph-based approach to capture relationships between biomedical entities. Specifically, LinkDTI applies a modified version of the multilayer GraphSAGE model that learns from the heterogeneous knowledge graph and predicts potential drug-target interactions. Our model incorporates negative sampling that balances the data to address the issue of having more negative than positive interactions. Our results show that LinkDTI consistently performs better in AUROC and AUPRC than baseline methods by at least 2.5% across different sampling ratios and conditions. Subsequently, it identifies approximately 945 new potential DTIs, marking a 49.14% increase over known DTIs. Overall, LinkDTI offers a simple yet effective method for integrating diverse biomedical data to identify potential drug-target interactions. The code and data can be found at https://github.com/hub2nature/LinkDTI_heterogenous_KG.git.
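Negative sampling at a chosen ratio, as mentioned in this abstract, can be sketched as follows. The example assumes that any drug-target pair absent from the known interactions may serve as a negative and that sampling is uniform; the function name, identifiers, and ratio are hypothetical and do not describe LinkDTI's actual sampler.

```python
import random


def sample_negatives(drugs, targets, positives, ratio=1, seed=0):
    """Sample ratio * len(positives) drug-target pairs that are not known interactions."""
    positives = set(positives)
    # Enumerating every non-interacting pair is only practical for small toy inputs.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positives]
    wanted = min(len(candidates), ratio * len(positives))
    return random.Random(seed).sample(candidates, wanted)


drugs = ["d1", "d2", "d3"]
targets = ["t1", "t2", "t3", "t4"]
known = [("d1", "t1"), ("d2", "t3")]

negatives = sample_negatives(drugs, targets, known, ratio=2)
print(len(negatives))  # 4
```

Training then uses the known pairs as positive edges and the sampled pairs as negative edges, so the ratio argument controls the class balance seen by the link predictor.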

16

BICEP: an extension to indels and copy number variants for rare variant prioritisation in pedigree analysis

Ormond, C.; Ryan, N. M.; Corvin, A.; Heron, E. A.

2026-03-11 bioinformatics 10.64898/2026.03.09.710467 medRxiv

Top 0.1%

66.2%

Summary: BICEP is a Bayesian inference model that evaluates how likely a rare variant is to be causal for a genomic trait in pedigree-based analyses. The original prior model in BICEP was designed for single nucleotide variants only. Here, we have developed an extension of the prior models for more comprehensive genomic analysis to include indels and copy number variants. We benchmark the performance of these new priors and show comparable performance accuracy with the existing single nucleotide variant prior model. For copy number variants we evaluate four different input predictors to the models and recommend the best performing ones as the default.
Availability and implementation: The updated prior models have been implemented in the current version of BICEP available from: https://github.com/cathaloruaidh/BICEP.

17

PoolParty: streamlined design of DNA sequence libraries in Python

Liu, Z.; Cordero, A.; Kinney, J. B.

2026-04-09 bioinformatics 10.64898/2026.04.06.716802 medRxiv

Top 0.1%

66.1%

Motivation: Computationally designed DNA sequence libraries are essential components of massively parallel reporter assays (MPRAs), deep mutational scanning (DMS) experiments, and other multiplex assays of variant effect (MAVEs). They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone due to the lack of purpose-built software.
Results: Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty automatically generates informative names for each sequence and provides "design cards" detailing how each sequence was generated. Visualization methods let users quickly audit library content and inspect the underlying graph. PoolParty thus transforms oligo pool design from a tedious task requiring custom functions and scripts into a structured, transparent, and reproducible process.
Availability and implementation: PoolParty is freely available and can be installed using pip. It is compatible with Python ≥ 3.10. Documentation is provided at https://poolparty.readthedocs.io; source code is available at https://github.com/jbkinney/poolparty-statetracker. A static release is archived at DOI 10.5281/zenodo.19445098.

18

Pinc: a simple probabilistic AlphaFold interaction score

Toth-Petroczy, A.; Badonyi, M.

2026-03-03 bioinformatics 10.64898/2026.03.02.708997 medRxiv

Top 0.1%

65.0%

Motivation: Screening of interacting proteins with AlphaFold has become widespread in biological research owing to its utility in generating and testing hypotheses. While several model quality and interaction confidence metrics have been developed, their interpretation is not always straightforward. Results: Here, building on a previously published method, we address this limitation by converting predicted aligned errors of an AlphaFold model into conditional contact probabilities. We show that, without additional parametrisation, the contact probabilities are readily calibrated to the fraction of native contacts observed across experimentally determined protein dimers. We find that the average contact probability for interacting chains, termed Pinc (probability of interface native contacts), is more sensitive to interactions involving smaller interfaces than many commonly used scores. We provide an R script to calculate Pinc for AlphaFold models, and propose its use as an alternative scoring metric for interaction screens and for prioritising interface residues for experimental validation. Availability and implementation: An R script and a Colab notebook are available at https://git.mpi-cbg.de/tothpetroczylab/Pinc

19

Scaling Variant-Aware Multiplex Primer Design

Han, Y.; Boucher, C.

2026-02-06 bioinformatics 10.64898/2026.02.03.703607 medRxiv

Top 0.1%

65.0%

Motivation: Robust primer design is essential for reliable multiplex PCR in diverse and evolving pathogen, microbial, and host genomes. Traditional methods optimized for a single reference often fail on emerging variants, leading to reduced efficiency. Variant-aware design seeks primers that remain effective across diverse targets, but this introduces two key challenges: identifying robust candidates and selecting an optimal subset of primers. Although there are methods for the first challenge, namely the Primer Design Region (PDR) optimization problem, existing approaches lack optimality guarantees. Results: We introduce a near-linear algorithm with provable guarantees for efficient PDR optimization. Complementing this, we introduce a reference-free risk model based on Gini impurity that provides a stable, biologically interpretable measure of site-specific variation and yields PDRs that are robust to sequence diversity across datasets without ad hoc smoothing. For the second challenge related to thermodynamic stability, we optimize predicted ΔG and cast subset selection as a k-partite maximum-weight clique problem (NP-hard). We then design an efficient local-search heuristic with linear-time updates. Together, these advances yield a principled, scalable framework for variant-aware primer design. Across Foot-and-Mouth Disease virus and Zika virus datasets, Δ-PRO produces more compact and robust PDR sets and multiplex panels with reduced predicted dimerization compared to existing tools, demonstrating the practical gains of principled and scalable variant-aware primer design for high-throughput multiplex PCR assays. Availability: The proposed methods are implemented in a software package. Our implementation and results are publicly available at https://github.com/yhhan19/variant-aware-primer-design. Supplementary information: Supplementary materials are available online.
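
To make the Gini impurity idea in the abstract above concrete, here is a minimal sketch of a per-site variation score computed from the columns of a multiple sequence alignment. It uses the standard definition, 1 minus the sum of squared base frequencies, so a fully conserved site scores 0 and more mixed sites score higher. The function names `gini_impurity` and `site_impurities`, the toy alignment, and the choice to skip gaps and `N` are all assumptions made for illustration; the abstract does not say how the authors treat gaps or how per-site scores are aggregated into PDR risk, and this is not taken from their implementation.

```python
from collections import Counter


def gini_impurity(column, ignore="-N"):
    """Gini impurity of one alignment column: 1 - sum(p_b ** 2).

    Characters listed in `ignore` (gaps and N here, an assumption)
    are excluded before frequencies are computed.
    """
    counts = Counter(base for base in column.upper() if base not in ignore)
    total = sum(counts.values())
    if total == 0:
        # No informative bases at this site; treat it as conserved.
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def site_impurities(alignment):
    """Per-site Gini impurity for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [gini_impurity("".join(col)) for col in zip(*alignment)]


# Hypothetical toy alignment: four sequences, four sites.
alignment = ["ACGT", "ACGA", "ATGA", "ACGC"]
scores = site_impurities(alignment)
```

In this toy case the first and third sites are fully conserved and score 0, while the last site, with three different bases, scores highest. A region-level risk could then be derived from these per-site values, though how that is done in the paper is not stated in the abstract.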

20

EMTscore infers divergent EMT pathways from omics data and enables rapid screening for correlated gene sets

Wen, H.; Bleris, L.; Hong, T.

2026-01-30 bioinformatics 10.64898/2026.01.27.702045 medRxiv

Top 0.1%

64.8%

Summary: Quantitative analyses of epithelial-mesenchymal transition (EMT) are widely used in several areas of the biomedical sciences because of the importance of EMT in development and cancer progression, but the multi-contextual nature of EMT requires standardization and implementation of gene set scoring methods beyond the capacities of conventional tools. We developed EMTscore, a package that provides an efficient implementation of unbiased scoring methods for multiple EMT pathways using individual single-cell or bulk omics data. The package also allows rapid screening for relationships between EMT and other cellular processes. Availability and implementation: EMTscore is available from GitHub at https://github.com/wenmm/EMTscore under the GNU General Public License, and it will be deposited to Zenodo upon acceptance. It is also under review at Bioconductor. Contact: Tian Hong (hong@utdallas.edu)