Bioinformatics
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match Bioinformatics's content profile, based on 1061 papers previously published here. The average preprint has a 0.95% match score for this journal, so anything above that is already an above-average fit.
Timofeev, A.; Anufriev, A.
Motivation: Classical protein structure comparison metrics such as RMSD and TM-score effectively assess geometric similarity but ignore the linear order of amino acid residues (Zhang and Skolnick, 2004). The Gromov-Hausdorff (GH) metric compares metric spaces by shape but also does not account for order (Gromov, 1981). This can lead to incorrectly identifying proteins with swapped domains as similar. We introduce the Ordered Gromov-Hausdorff (OGH) metric, defined on ordered metric spaces, to incorporate residue order into the comparison.
Results: OGH combines coordinate normalization, an exponential penalty for order violations, and a monotonic alignment algorithm with computational complexity O(n·w), where w is the search window width. It is proven that OGH satisfies all metric axioms for > 0. Analytical properties include invariance under isometries, upper boundedness, Lipschitz continuity under small coordinate perturbations, and concavity in the weight parameter. On the VAD dataset (28 viral proteins from HIV-1, SARS-CoV-2, MERS-CoV), OGH increases monotonically with residue shuffling (up to 0.363 at 100% shuffling) and correlates strongly with TM-score (r = 0.706). In the task of separating homologs at fixed global similarity (TM-score ≈ 0.5), OGH achieves AUC = 0.800, whereas TM-score gives AUC = 0.467, demonstrating that OGH detects conserved order even when global geometry is not conserved.
Availability: The Python source code for OGH is freely available at https://github.com/andytimoffilim/OGH. The VAD dataset (PDB IDs listed in the paper) is publicly accessible from the RCSB Protein Data Bank (Berman et al., 2000; wwPDB, 2019).
Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.
Motivation: Pangenome graphs are increasingly used in bioinformatics, ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case.
Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex (a common property of pangenome graphs) can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph of the same size, in practice. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, whereas vg requires more than one hour and four times more RAM.
Availability: Our method is implemented in the BubbleFinder tool (github.com/algbio/BubbleFinder), via the new ultrabubbles subcommand.
Contact: alexandru.tomescu@helsinki.fi
Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.
Motivation: High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics.
Results: We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets.
Availability and implementation: The source code, data, and a brief tutorial are freely available at https://github.com/jiatuya/CLEAR
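For readers unfamiliar with the baseline that this abstract contrasts with, classical Over-Representation Analysis can be sketched as an upper-tail hypergeometric test applied to one gene set at a time. The function name and arguments below are illustrative assumptions and are not part of CLEAR or MGSA; the sketch only shows why the test needs a binary "hit" list, which is the thresholding step that CLEAR is described as avoiding.

```python
from math import comb


def ora_p_value(universe: int, set_size: int, hits: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe : number of genes measured
    set_size : number of measured genes annotated to the gene set
    hits     : number of genes called significant (after thresholding)
    overlap  : number of significant genes that fall in the gene set
    """
    max_overlap = min(set_size, hits)
    total = comb(universe, hits)  # all ways to choose the hit list
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, hits - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 8 of 40 significant genes fall in a 100-gene set,
# out of 2000 measured genes (expected overlap under the null is 2).
print(ora_p_value(universe=2000, set_size=100, hits=40, overlap=8))
```

Because each gene set is tested separately, overlapping sets tend to be reported together, which is the redundancy that set-based models aim to reduce.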
Lee, A. J.; Sanin, D. E.
Introduction: Common spatial transcriptomic analysis pipelines in R focus on pre-processing and visualization, while providing limited and indirect methods to leverage true spatially resolved quantification of transcripts. Often, x,y-coordinates in spatial transcriptomics (ST) data are integrated into analysis via "spatially aware" normalization (Salim et al., 2024), clustering methods (Zhao et al., 2021), or the identification of spatially variable genes (Yan et al., 2025). Though useful, these methods do not provide any opportunity for analysts to adjust or interrogate the underlying graphs that define adjacencies between spots in their data. Here, we present SpotGraphs, a package that allows the user a more direct and flexible option to interact with the x,y-coordinates of their ST data in R through the existing igraph infrastructure (Antonov et al., 2023; Csardi et al., 2025; Csardi & Nepusz, 2006). Similar functionality exists in Python through SquidPy's graph API (Palla et al., 2022), and we compare results obtained from both packages, demonstrating similar performance. Additionally, we provide a set of tools that are useful for ST data analysis, including a toolkit to filter low-quality spots lying on tissue debris (beyond arbitrary thresholds), edit spot-level adjacencies based on spatial clusters, and identify centers or boundaries of user-defined neighborhoods of interest.
Joshi, S.; Sowdhamini, R.
Motivation: Characterizing atomic-level stability and cooperative interaction networks is essential for understanding protein function and evolution. However, existing tools often lack the precision to integrate detailed physicochemical energies with higher-order graph-theoretic analyses.
Results: We present HORI-EN, an updated implementation of the HORI framework, featuring hybrid energetic scoring (Physicochemical + Knowledge-Based) and a Normalized Interaction Score (NIS) based on cumulative distribution functions. HORI-EN identifies higher-order cliques of interacting residues, revealing cooperative stabilization networks. Validation on the SKEMPI v2 dataset demonstrates that HORI-EN shows discriminative performance in identifying mutational hotspots, achieving an ROC-AUC of 0.780 on the full dataset and 0.844 on a clean benchmark. Enrichment analysis indicates a 3.1-fold increase in precision for the top 1% of predictions. Furthermore, analysis of the residue interaction network recovers 77.4% of non-contacting hotspots by identifying one-hop bridging interactions to the partner chain. Beyond hotspot prediction, HORI-EN distinguishes native structures from decoys and captures conserved energetic signatures in evolutionary case studies of serine proteases and lipases.
Availability and Implementation: The web server is freely available at https://caps.ncbs.res.in/HORI-EN and source code is available at https://github.com/thesixeyedknight/HoriPy.
Contact: mini@ncbs.res.in
Queme, B.; Kakkar, A.; Muruganujan, A.; Thomas, P. D.; Gauderman, W. J.; Mi, H.
Motivation: Post-GWAS interpretation frequently requires translating variant lists (e.g., lead SNPs, clumped loci, credible sets, or curated panels) into pathway and functional hypotheses. In practice, obtaining pathway and functional over-representation results from SNP inputs often requires stitching together multiple tools for variant annotation, regulatory annotation, gene identifier handling, and statistical testing. This integration burden can reduce reproducibility and restrict end-to-end analysis to groups with dedicated bioinformatics support.
Summary: We present SNPWay, a web server and R package that performs end-to-end SNP-to-function and pathway over-representation analysis in a single standardized workflow. SNPWay accepts rsIDs, VCF files, or hg19/GRCh37 genomic coordinates. It queries Annotation Query (AnnoQ) to obtain SNP-to-gene mappings from ANNOVAR, SnpEff, and VEP under both Ensembl and RefSeq gene models, and incorporates enhancer-gene links via PEREGRINE to augment mappings for noncoding variants. SNPWay aggregates mapped genes into a single, non-redundant, combined gene list and submits it to PANTHER for over-representation testing against the Homo sapiens reference list, returning over-represented pathways and functional categories (e.g., Gene Ontology) with direct links for interactive exploration in PANTHER. SNPWay's modular architecture is designed for extensibility, enabling incorporation of additional analysis methods in future releases. A step-by-step walkthrough is provided in Supplementary Data.
Availability and implementation: Web server: https://snpway.annoq.org/. R package and source code: https://github.com/USCbiostats/Annoq_Overrepr_Workflow. Documentation: https://snpway.annoq.org/about. Examples: the website contains sample files for the input formats, also provided in Supplementary Data. SNPWay is free to use with no mandatory login.
Contact: huaiyumi@usc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.
Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity.
Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms.
Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these improvements could benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis.
Funding: Tandy Warnow: NSF grant 2316233. Todd J. Treangen: NSF grants 2126387 and 2239114; NIH grants U19-AI144297 and P01-AI152999.
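The Methods paragraph above describes a position-wise transformation, which can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the ordering is taken to be lexicographic over a 2-bit base encoding, the "predefined mapping" is taken to be the integer code of the selected k-mer, and the function name and default parameters are hypothetical. The abstract does not state the ordering, mapping, or window definition that the authors use.

```python
def min_frame_transform(seq: str, k: int = 3, w: int = 4) -> list[int]:
    """Map each position i to the code of the smallest k-mer that starts
    in the window of w consecutive k-mer start positions beginning at i.

    Output index i corresponds to input position i, so coordinates of the
    original sequence are preserved. Raises KeyError on non-ACGT symbols.
    """
    base_code = {"A": 0, "C": 1, "G": 2, "T": 3}

    def encode(kmer: str) -> int:
        # Base-4 integer; numeric order equals lexicographic order for A<C<G<T.
        value = 0
        for base in kmer:
            value = value * 4 + base_code[base]
        return value

    n_kmers = len(seq) - k + 1
    kmer_codes = [encode(seq[i:i + k]) for i in range(n_kmers)]
    # One output symbol per full window (simple O(n*w) scan for clarity).
    return [min(kmer_codes[i:i + w]) for i in range(n_kmers - w + 1)]


print(min_frame_transform("ACGTACGT", k=2, w=2))  # [1, 6, 11, 1, 1, 6]
```

Because the output is an ordinary sequence over a larger alphabet, it could in principle be passed to any suffix-array or suffix-tree construction, which matches the abstract's point that existing MUM-finding algorithms need no modification.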
Osuntoki, I. G.; Harrison, A. P.; Dai, H.; Bao, Y.; Zabet, N. R.
Motivation: Chromosome Conformation Capture methods, including Hi-C, micro-C or Capture-C, are used to map chromatin interactions genome-wide. Most of the existing computational methods do not account for sources of bias (such as DNA accessibility, GC content or TE content) in the data.
Results: We previously developed ZipHiC, a Bayesian method based on the hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) that uses a zero-inflated Poisson distribution to model the noise, signal and false signal of the data, and showed that this approach was able to detect biases from DNA accessibility, GC content and TE content in both Hi-C and micro-C data. Here, we present HiCPotts, another Bayesian method based on the HMRF model and ABC that instead uses a zero-inflated Negative Binomial distribution to model the noise and signal of the data. We systematically show that HiCPotts reduces false positives and increases recovery of true interactions compared to ZipHiC, and also compared to other methods such as FastHiC, Juicer and HiCExplorer. Most importantly, we provide an R/Bioconductor package that allows modelling the noise, signal and false signal using various distributions such as the zero-inflated Negative Binomial (ZINB) and the zero-inflated Poisson (ZIP) distributions.
Availability: https://bioconductor.org/packages/HiCPotts/
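As a reminder of what the zero-inflated Negative Binomial distribution named above looks like, the following sketch evaluates its probability mass function under the common mean/dispersion parameterization (variance = mu + mu^2 / size). The parameter names are assumptions for illustration; the abstract does not specify how HiCPotts parameterizes the distribution, and the HMRF and ABC inference steps are not shown.

```python
from math import exp, lgamma, log


def nb_pmf(y: int, mu: float, size: float) -> float:
    """Negative Binomial pmf with mean mu > 0 and dispersion size > 0."""
    log_p = (
        lgamma(y + size) - lgamma(size) - lgamma(y + 1)
        + size * log(size / (size + mu))
        + y * log(mu / (size + mu))
    )
    return exp(log_p)


def zinb_pmf(y: int, mu: float, size: float, pi: float) -> float:
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from NB(mu, size)."""
    base = (1.0 - pi) * nb_pmf(y, mu, size)
    return pi + base if y == 0 else base


# Extra mass at zero compared with the plain Negative Binomial:
print(nb_pmf(0, mu=3.0, size=2.0), zinb_pmf(0, mu=3.0, size=2.0, pi=0.3))
```

Compared with a zero-inflated Poisson, the additional dispersion parameter lets the variance exceed the mean, which is a common reason for preferring a Negative Binomial for sparse count data such as contact matrices.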
Tzimotoudis, D.; Farrugia, R.; Zammit, J.; Masini, M. C.; Balestrucci, A.; Carbott, F. B.; Wettinger, S. B.; Alexiou, P.; Ciach, M. A.
Background: Genes with highly similar genomic copies (paralogs, tandem duplications and pseudogenes) pose a major challenge for Short-Read High Throughput Sequencing (srHTS). High sequence similarity makes it difficult to unambiguously identify the sequences of origin of short reads. This results in misalignment artifacts which can propagate through bioinformatic pipelines and increase error rates in variant calling.
Results: We present ParaDISM, a pipeline that refines standard alignments to improve read placement and reduce misalignment-driven false variant calls in highly homologous sequences. ParaDISM assigns a read/read pair to a sequence only when supported by unambiguous sequence-specific evidence by using a multiple sequence alignment of reference sequences to identify disambiguating positions. An optional iterative refinement procedure calls variants from confidently assigned reads, updates the reference sequences, and processes remaining non-assigned reads. We evaluated the performance of ParaDISM both in terms of read alignment and the resulting short variant calls using extensive computational simulation experiments and the Genome in a Bottle HG002 benchmark. We applied ParaDISM to reanalyze two case studies: five public tumour exomes at the GNAQ/GNAQP1 locus, and 18 short-read sequencing datasets of patients diagnosed with Autosomal Dominant Polycystic Kidney Disease (16 exomes and 2 panel sequencing datasets). Compared to the standard aligners (bowtie2, bwa-mem and minimap2), ParaDISM reduced the number of misalignment artifacts and false variant calls, resulting in an increased specificity and precision of the results.
Conclusions: ParaDISM improves the precision of read placement and single-nucleotide variant calling in highly homologous reference sequences. By reducing the number of false variant calls caused by misalignment artifacts, ParaDISM provides a stronger level of evidence for the called variants compared to currently available approaches. The pipeline is open source and available under the MIT license at github.com/BioGeMT/ParaDISM.
Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.
DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster.
Loecker, J.; Bessell, B.; Puniya, B. L.; Helikar, T.
Summary: Improved accessibility of high-throughput RNA sequencing has increased the amount of data generated each year. This increase in data creates a need for reproducible pipelines that can process RNA-seq data consistently across experiments. AutoRNAseq addresses this need by providing a Snakemake-based workflow for bulk RNA-seq analysis that automates data retrieval, quality control, and gene quantification. Unlike existing RNA-seq workflows that require users to coordinate multiple pipelines and pre-configure reference data, AutoRNAseq provides a single, end-to-end workflow that automates data acquisition, reference preparation, quality control, alignment, and quantification with minimal user intervention. AutoRNAseq is applicable to any domain requiring consistently processed RNA-seq datasets, including bioinformatics, computational biology, and drug-response studies.
Availability and Implementation: AutoRNAseq is implemented in Snakemake and available at https://gitlab.com/unebraska/lagbh-public/autornaseq. Documentation and example configuration files are provided in the GitLab README file and this paper's Supplementary Information. The code to reproduce the statistics presented here is in the GitLab repository under the "publication" folder.
Liu, Z.; Cordero, A.; Kinney, J. B.
Motivation: Computationally designed DNA sequence libraries are essential components of massively parallel reporter assays (MPRAs), deep mutational scanning (DMS) experiments, and other multiplex assays of variant effect (MAVEs). They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone due to the lack of purpose-built software.
Results: Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty automatically generates informative names for each sequence and provides "design cards" detailing how each sequence was generated. Visualization methods let users quickly audit library content and inspect the underlying graph. PoolParty thus transforms oligo pool design from a tedious task requiring custom functions and scripts into a structured, transparent, and reproducible process.
Availability and implementation: PoolParty is freely available and can be installed using pip. It is compatible with Python ≥ 3.10. Documentation is provided at https://poolparty.readthedocs.io; source code is available at https://github.com/jbkinney/poolparty-statetracker. A static release is archived at DOI 10.5281/zenodo.19445098.
Wei, Y.; Huang, Z.; Zhang, P.; Tian, Q.; Li, Y.; Zou, Q.; Yu, L.
Multiple sequence alignment (MSA) is a fundamental problem in computational bioinformatics, playing a critical role in genome biology, especially in long-read sequencing and assembly. One solution for representing and solving MSA is Partial Order Alignment (POA), which employs Directed Acyclic Graphs (DAGs) to represent sequence relationships. However, when facing ultra-long, error-prone reads (e.g., >100 kbp), existing POA algorithms with quadratic space complexity become impractical due to excessive memory consumption. This paper introduces linearPOA, which uses a divide-and-conquer strategy to solve POA with less memory than quadratic-space algorithms such as SPOA, abPOA and TSTA. Notably, it reduces memory usage by up to 102.74-fold when aligning sequences with 100 kbp reads, compared with abPOA using non-heuristic methods. The algorithm is implemented in the linearPOA library, which provides functionality for POA and foundational support for sequencing analysis, such as error correction for reads. linearPOA thus provides memory-efficient algorithms for long-read sequencing, especially for directly assembling long reads such as 100 kbp reads.
Availability: The linearPOA library is freely available at https://github.com/malabz/linearPOA, and the data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.15637837.
Supplementary information: Supplementary information is available at bioRxiv online.
Schaffranke, A.; Kueken, A.; Nikoloski, Z.
Summary: Recent advances in the analysis of biochemical networks have contributed to the identification of their modular structure based on the concept of multi-reaction dependencies and kinetic coupling of reaction rates (Kuken et al., 2022; Langary et al., 2025). Existing implementations of the algorithms to study modular structure do not scale well with the size of the networks, prohibiting their application to genome-scale networks. Here, we introduce COCOA.jl, a multithreaded Julia package for identification of concordant and kinetic modules, with applications in the study of concentration robustness.
Availability and implementation: COCOA.jl is implemented in Julia 1.12.2 and is freely available under the MIT license at https://github.com/antoniofranky/COCOA.jl. It runs on Linux, macOS, and Windows; installation is supported via the Julia package manager. COCOA.jl can be called from Python via JuliaCall.
Contact: antonschaf@posteo.de; ankueken@uni-potsdam.de
Flanagan, K.; Xu, S.; Yeo, G. W.
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression-driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding.
Results: Here we present Flipper, an application purpose-built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression-driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal-to-noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
Alves Ferreira, I.; Zentgraf, J.; Schmitz, J. E.; Rahmann, S.
Motivation: Most existing workflows for quantifying bacterial gene expression from RNA-seq data rely on mapping reads to a (single) reference transcriptome, typically ignoring strain-level variation. When samples contain unknown or mixed strains, these workflows may introduce reference bias and fail to accurately capture strain-specific gene expression. Pan-transcriptomic approaches address this issue by using pan-transcriptomes as references, but existing solutions require multiple steps for pan-transcriptome construction, indexing, and expression quantification.
Results: We introduce PanXpress, a unified framework for bacterial pantranscriptomics that performs pantranscriptome construction and indexing directly from genomic FASTA and GFF annotation files, alignment-free mapping of reads to genes from FASTQ samples, and gene expression quantification. The index, a multi-way Cuckoo hash table storing gapped k-mers with associated genes, preserves diversity on the k-mer level. Using simulated RNA-seq data from a mixture of Pseudomonas aeruginosa strains, PanXpress achieves mapping recall comparable to alignment-based methods such as Bowtie2 with higher precision and obtains accurate gene expression and log fold change estimates. On real P. aeruginosa RNA-seq data, using PanXpress's pantranscriptomic reference increases the proportion of mapped reads and discovered expressed genes. The PanXpress index is smaller than that of other tools (Salmon, Kallisto, Bowtie2), and analysis is faster with consistent results. PanXpress is thus an accurate and efficient method for bacterial gene expression analysis in complex samples.
Availability: PanXpress is available at https://gitlab.com/rahmannlab/panxpress.
Contact: sven.rahmann@uni-saarland.de
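The idea of an index that maps gapped k-mers to genes can be illustrated with a small sketch. This is not PanXpress code: a plain Python dictionary stands in for the multi-way Cuckoo hash table, and the mask string, function names, and toy gene sequences are all hypothetical. In the mask, `#` marks a position that is kept and `_` a position that is ignored.

```python
from collections import defaultdict
from typing import Iterator


def gapped_kmers(seq: str, mask: str) -> Iterator[str]:
    """Yield the gapped k-mer at every start position of seq.

    Only positions where mask has '#' contribute to the k-mer, so a
    mismatch at a '_' position does not change the extracted key.
    """
    keep = [i for i, symbol in enumerate(mask) if symbol == "#"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def build_index(genes: dict[str, str], mask: str) -> dict[str, set[str]]:
    """Map each gapped k-mer to the set of gene names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in genes.items():
        for kmer in gapped_kmers(seq, mask):
            index[kmer].add(name)
    return index


# Two toy alleles differing by one base; the difference falls on a gap
# for the first window, so both alleles share the key "AG".
index = build_index({"g1": "ACGT", "g2": "AGGT"}, mask="#_#")
print(sorted(index["AG"]))  # ['g1', 'g2']
```

A read would then be assigned to genes by looking up its own gapped k-mers in the same index, which is what makes the mapping alignment-free.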
Wang, G.; Liu, F.; Chen, Z.; Davoli, T.
Summary: Association measurements, such as mutual information (MI), are fundamental in the analysis of cancer multi-omics data for identifying cancer-related genes, gene signatures, and gene regulatory networks, thereby shedding light on tumor development, progression, and treatment. Confounding factors, including tumor purity and mutation burden, can bias association measurements in MI, potentially leading to the misclassification of passenger events as drivers. Conditional mutual information (CMI) provides a robust framework for assessing both linear and non-linear associations while effectively accounting for different confounding factors. An R package called conMItion is introduced to estimate CMI and its statistical significance for multi-omics data, with flexibility to adjust for one or two confounding factors. We demonstrated the utilization of conMItion through two use cases. First, we identified co-occurring somatic alterations in bladder cancer genomic data. Second, we applied conMItion to a single-cell RNA sequencing dataset of lung cancer patients and identified positively or negatively associated cell types within the lung cancer tumor microenvironment.
Availability and Implementation: The conMItion package is freely available on CRAN at https://CRAN.R-project.org/package=conMItion. The two use cases described in the paper can be accessed at https://github.com/GJYWang/conMItion. A supplementary document is available online.
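To make the quantity concrete, conditional mutual information for discrete variables, I(X;Y|Z) = sum over (x,y,z) of p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], can be estimated by plugging in observed frequencies. The sketch below is only that plug-in estimator, written in Python for illustration; the abstract does not describe which estimator conMItion uses, and significance testing and continuous covariates are not covered here.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def conditional_mutual_information(
    x: Sequence[Hashable], y: Sequence[Hashable], z: Sequence[Hashable]
) -> float:
    """Plug-in estimate of I(X;Y|Z) in nats from paired discrete samples."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), count in n_xyz.items():
        # The ratio of probabilities simplifies to a ratio of counts.
        ratio = count * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)])
        cmi += (count / n) * log(ratio)
    return cmi


# X and Y agree only because both follow the confounder Z:
# conditioning on Z removes the association.
z = [0, 0, 1, 1]
print(conditional_mutual_information(z, z, z))  # 0.0
```

In the toy call, the unconditional mutual information between the first two arguments would be log 2, while the conditional value is zero, which is the kind of confounder-driven association the abstract aims to filter out.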
Cao, Y.; Ge, G.; Zhao, K.
Motivation: Sequencing-based epigenomic profiling methods are powerful but suffer from technical variability that complicates cross-sample comparisons and can obscure true biological signals. While existing normalization methods using spike-in controls or computational approaches have been proposed, they often rely on assumptions that may not hold across diverse experimental conditions or require additional data types.
Results: We present Ryder, a flexible and robust Python package for the normalization and differential analysis of epigenomic data. Ryder introduces a normalization strategy that leverages stable internal reference regions, such as invariant CTCF binding sites, to correct for technical artifacts genome-wide. Our results show that it effectively models and adjusts both background noise and signal intensity, ensuring accurate signal alignment across samples. We demonstrate that Ryder performs robust, genome-wide normalization (correcting signals in both peak and background regions) across a range of assays including DNase-seq, CUT&RUN, ATAC-seq, MNase-seq, and ChIP-seq, with or without spike-in controls. By reducing technical noise, we show that Ryder improves the detection of genuine biological changes, such as quantitative reduction of chromatin accessibility at key enhancer elements by depletion of BRG1, a key subunit of the chromatin remodeling BAF complexes.
Availability and Implementation: The Ryder source code and documentation are freely available at: https://github.com/YaqiangCao/ryder.
Popp, B.; Saei, H.
Summary: Variable number tandem repeats (VNTRs) in the MUC1 gene cause autosomal dominant tubulointerstitial kidney disease when disrupted by frameshift variants, but the GC-rich 60-bp repeat structure (20-125 copies) challenges variant detection. While tools like VNtyper enable MUC1 variant calling, no gold-standard benchmarking datasets exist for systematic performance evaluation. We present MucOneUp, a specialized simulation framework for generating MUC1-VNTR reference sequences with targeted variants and platform-specific sequencing reads (Illumina, Oxford Nanopore, PacBio). MucOneUp employs Markov chain-based repeat generation, supports diploid simulation with customizable variant placement, and includes additional analysis modules for SNaPshot assay simulation and exploratory frameshift analysis. We validate MucOneUp through a multi-variant, cross-platform benchmark of six tool-platform combinations using 13 distinct frameshift variants and investigate VNTR length effects on detection.
Availability and implementation: MucOneUp is accessible at no cost under the MIT License at https://github.com/berntpopp/MucOneUp and archived on Zenodo (DOI: 10.5281/zenodo.19740406).
Contact: bernt.popp@charite.de
Supplementary information: Supplementary data are provided with this manuscript.
Grabarczyk, D.; Kocikowski, M.; Parys, M.; Cohen, S. B.; Alfaro, J. A.
Motivation: Encoding antibodies (Abs) and nanobodies (Nbs) as mRNA enables in vivo production of therapeutic proteins. However, this approach requires meeting two species-dependent requirements: the mRNA encoding must support efficient expression in the host species, and the encoded protein sequence must resemble the natural Ab repertoire of the recipient species to minimize immunogenicity. These requirements motivate species-conditioned generative models for joint mRNA and protein design.
Results: We propose SpeciefAI, a transformer-based model for multi-species Ab and Nb sequence harmonisation through generation of novel Framework Regions (FRs) tailored to input Complementarity-Determining Regions (CDRs). Our model works directly in the mRNA space and learns the correspondence between FRs and CDRs in six species. The model is capable of generating sequences with a highly similar distribution to natural sequences and a mean absolute difference in codon adaptation index (CAI) of 0.013 and 0.033 for humans and dogs, respectively. We show that the generated human sequences are highly human (0.95 T20 score) and canine sequences highly canine (0.95 cT20 score). We furthermore demonstrate that we can generate diverse candidate sequences using our method.
Availability and Implementation: Source code is available at https://github.com/Dominko/SpeciefAI. OAS and COGNANO data are publicly available at https://opig.stats.ox.ac.uk/webapps/oas/ and https://cognanous.com/datasets/vhh-corpus (preprocessed versions available upon request). Canine data are available at https://zenodo.org/records/18301526.
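The codon adaptation index reported above is conventionally defined as the geometric mean of per-codon relative adaptiveness weights, where each weight is a codon's frequency divided by the frequency of the most used synonymous codon in a reference set. The sketch below assumes that standard definition; the weight table is a made-up two-codon example, not real human or canine data, and the abstract does not say how the authors computed their CAI values.

```python
from math import exp, log


def codon_adaptation_index(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of relative adaptiveness weights over the codons of cds.

    Codons missing from the weight table are skipped (conventionally Met,
    Trp and stop codons are excluded). All weights must be > 0.
    """
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    scored = [weights[c] for c in codons if c in weights]
    if not scored:
        raise ValueError("no codon in the sequence has a weight")
    return exp(sum(log(w) for w in scored) / len(scored))


# Hypothetical weights for two alanine codons (illustration only).
toy_weights = {"GCT": 1.0, "GCC": 0.5}
print(codon_adaptation_index("GCTGCC", toy_weights))  # sqrt(0.5), about 0.707
```

The "mean absolute difference in CAI" quoted in the abstract would then be the average of |CAI(generated) - CAI(natural)| over compared sequences, under whatever pairing the authors used.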