Bioinformatics
Oxford University Press (OUP)
All preprints, ranked by how well they match the content profile of Bioinformatics, based on 1061 papers previously published here. The average preprint has a 0.95% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Libbrecht, M. W.; Bayat, F.
Motivation: The genome-wide chromosome conformation capture assay Hi-C is widely used to study chromatin 3D structures and their functional implications. Read counts from Hi-C indicate the strength of chromatin contact between each pair of genomic loci. These read counts are heteroskedastic: that is, a difference between interaction frequencies of 0 and 100 is much more significant than a difference between interaction frequencies of 1000 and 1100. This property impedes visualization and downstream analysis because it violates the Gaussian variable assumption of many computational tools. Thus, heuristic transformations aimed at stabilizing the variance of signals, such as the shifted-log transformation, are typically applied to the data before visualization or input to models with Gaussian assumptions. However, such heuristic transformations cannot fully stabilize the variance because of their restrictive assumptions about the mean-variance relationship in the data.
Results: Here we present VSS-Hi-C, a data-driven variance stabilization method for Hi-C data. We show that VSS-Hi-C signals have unit variance, improving visualization of Hi-C data, for example in contact-map heatmaps. VSS-Hi-C signals also improve the performance of subcompartment callers that rely on Gaussian observations. VSS-Hi-C is implemented as an R package and can be used for variance stabilization of different genomic and epigenomic data types with two replicates available.
Availability: https://github.com/nedashokraneh/vssHiC
Contact: maxwl@sfu.ca
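The limitation of the shifted-log heuristic described above can be illustrated with a small simulation. This is not VSS-Hi-C itself; the negative-binomial counts and dispersion below are invented for illustration:

```python
import numpy as np

# Simulate overdispersed, Hi-C-like counts whose variance grows with
# the mean (heteroskedastic), then apply the shifted-log heuristic.
rng = np.random.default_rng(0)
means = np.array([10.0, 100.0, 1000.0])
n_disp = 5.0                               # negative-binomial dispersion
p = n_disp / (n_disp + means)
counts = rng.negative_binomial(n_disp, p, size=(20000, 3))

raw_var = counts.var(axis=0)               # grows steeply with the mean
log_var = np.log(counts + 1.0).var(axis=0) # far more uniform

print(raw_var)   # heteroskedastic
print(log_var)   # roughly constant, but not exactly 1 as VSS aims for
```

The shifted log flattens the mean-variance trend but does not reach exactly unit variance, which is the gap a data-driven transformation targets.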
Wang, T.; Yin, Q.; Liu, Y.; Chen, J.; Wang, Y.; Peng, J.
Motivation: Quantitative trait locus (QTL) analysis of multiomic molecular traits, such as gene transcription (eQTL), DNA methylation (mQTL) and histone modification (haQTL), has been widely used to infer the effects of genomic variation on multiple levels of molecular activity. However, the power of xQTL (various types of QTLs) detection is largely limited by missing association statistics due to missing genotypes and limited effective sample size. Existing hidden Markov model (HMM)-based imputation approaches require individual-level genotypes and molecular traits, which are rarely available. No implementation exists for the imputation of xQTL summary statistics when individual-level data are missing.
Results: We present xQTLImp, a C++ software package specifically designed for efficient imputation of xQTL summary statistics based on multivariate Gaussian approximation. Experiments on a single-cell eQTL dataset demonstrate that a considerable number of novel significant eQTL associations can be rediscovered by xQTLImp.
Availability: Software is available at https://github.com/hitbc/xQTLimp.
Contact: ydwang@hit.edu.cn or jiajiepeng@nwpu.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
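The general idea behind Gaussian summary-statistic imputation can be sketched in a few lines. This is an illustration of the standard conditional-Gaussian formula, not xQTLImp's code; the LD matrix and z-scores are invented:

```python
import numpy as np

# Under a multivariate Gaussian model, z-scores of typed variants (z_t)
# and an untyped variant share the LD (correlation) matrix R, so
#   E[z_u | z_t] = R_ut @ inv(R_tt) @ z_t.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])   # LD among 3 variants; variant 2 untyped
z_typed = np.array([4.0, 3.5])    # observed z-scores for variants 0 and 1

R_tt = R[:2, :2]
R_ut = R[2, :2]
z_imputed = R_ut @ np.linalg.solve(R_tt, z_typed)
# imputation quality ("info"): fraction of variance explained by z_t
info = R_ut @ np.linalg.solve(R_tt, R_ut)
print(z_imputed, info)
```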
Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.
Motivation: Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, and patient symptoms). This limits their usability for addressing specific research questions.
Results: We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million compound-biological concept associations) that constructs local, keyword-based sub-graphs tailored to address biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as-yet-unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy comparing the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations requiring further investigation. We also showed that removing duplicated nodes and edges from the KG improves the proportion of validated nodes (from 8.4% to 16%) and doubles the precision (from 0.085 to 0.197) while maintaining the recall (0.954 to 0.952), illustrating a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, this framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation. Applied to endometriosis, it highlights potential mechanisms linking POP exposure to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.
Availability: The code and data underlying this article are available in the MetExplore repository at https://forge.inrae.fr/metexplore/kg4j.
Contact: clement.frainay@inrae.fr
Key Messages:
- Kg4j builds targeted knowledge maps from large biomedical databases using simple keywords.
- Keyword-driven exploration reveals the most relevant disease-exposure relationships without navigating millions of connections.
- Applied to endometriosis, the method recovered known links with persistent organic pollutant exposure.
- Removing redundant information and formatting the knowledge graph as a Labeled Property Graph improves the reliability of extracted knowledge.
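The keyword-seeded sub-graph extraction described above can be sketched as a bounded breadth-first expansion. The mini knowledge graph below is invented for illustration and is not taken from FORVM:

```python
from collections import deque

# Toy knowledge graph: node -> list of associated concepts.
kg = {
    "endometriosis": ["estrogen receptor", "inflammation"],
    "inflammation": ["TNF", "dioxin"],
    "estrogen receptor": ["dioxin"],
    "dioxin": ["POP exposure"],
    "TNF": [],
    "POP exposure": [],
    "unrelated pathway": ["TNF"],
}

def subgraph(kg, seeds, max_depth=2):
    """Breadth-first expansion from keyword-matched seed nodes."""
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    edges = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue                      # stop expanding at the horizon
        for nbr in kg.get(node, []):
            edges.append((node, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen, edges

nodes, edges = subgraph(kg, ["endometriosis"])
print(sorted(nodes))
```

The depth bound is what keeps the local sub-graph small relative to the full KG.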
Pisarenco, V. A.; Vizueta, J.; Rozas, J.
Motivation: Gene clusters, defined as sets of genes encoding functionally related proteins, are abundant in eukaryotic genomes. Despite the increasing availability of chromosome-level genomes, the comprehensive analysis of gene family evolution remains largely unexplored, particularly for large and highly dynamic gene families or those including very recent family members. These challenges stem from limitations in genome assembly contiguity, particularly in repetitive regions such as large gene clusters. Recent advances in sequencing technology, such as long reads and chromatin contact mapping, hold promise for addressing these challenges.
Results: To facilitate the identification, analysis, and visualisation of physically clustered gene family members within chromosome-level genomes, we introduce GALEON, a user-friendly bioinformatic tool. GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density. The pipeline also enables the simultaneous analysis and comparison of two gene families, and allows the exploration of the relationship between physical and evolutionary distances. This tool offers a novel approach for studying the origin and evolution of gene families.
Availability and Implementation: GALEON is freely available from http://www.ub.edu/softevol/galeon and from https://github.com/molevol-ub/galeon
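The core idea of distance-based cluster calling can be sketched as grouping sorted gene coordinates whose neighbour gaps fall below a threshold. This is an illustration of the general technique, not GALEON's algorithm; the coordinates and the 100 kb threshold are invented:

```python
def call_clusters(starts, max_gap=100_000, min_size=2):
    """Group sorted gene start coordinates whose neighbour gap <= max_gap."""
    clusters, current = [], [starts[0]]
    for pos in starts[1:]:
        if pos - current[-1] <= max_gap:
            current.append(pos)
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = [pos]            # start a new candidate cluster
    if len(current) >= min_size:
        clusters.append(current)
    return clusters

gene_starts = [10_000, 55_000, 120_000, 900_000, 950_000, 5_000_000]
print(call_clusters(gene_starts))
# [[10000, 55000, 120000], [900000, 950000]]
```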
Pibiri, G. E.; Patro, R.
Motivation: Representing a set of k-mers (strings of length k) in small space while supporting fast lookup queries is a fundamental requirement for several applications in bioinformatics. A data structure based on sparse and skew hashing (SSHash) was recently proposed for this purpose [Pibiri, 2022]: it combines good space effectiveness with fast lookup and streaming queries. It is also order-preserving, i.e., consecutive k-mers (sharing a prefix-suffix overlap of length k-1) are assigned consecutive hash codes, which helps compress satellite data typically associated with k-mers, like abundances and color sets in colored De Bruijn graphs.
Results: We study the problem of accelerating queries under the sparse and skew hashing indexing paradigm without compromising its space effectiveness. We propose a refined data structure with less complex lookups and fewer cache misses. We give a simpler and faster algorithm for streaming lookup queries. Compared to indexes with similar capabilities based on the Burrows-Wheeler transform, like SBWT and FMSI, SSHash is significantly faster to build and query. SSHash is competitive in space with the fast (and default) modality of SBWT when both k-mer strands are indexed. While larger than FMSI, it is also more than one order of magnitude faster to query.
Availability and Implementation: The SSHash software is available at https://github.com/jermp/sshash, and also distributed via Bioconda. A benchmark of data structures for k-mer sets is available at https://github.com/jermp/kmer_sets_benchmark. The datasets used in this article are described and available at https://zenodo.org/records/17582116.
Contact: giulioermanno.pibiri@unive.it, rob@cs.umd.edu
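Why order-preserving codes help compress satellite data can be shown with a toy example: consecutive k-mers of a sequence get consecutive codes, so their abundances can be delta-encoded. This is only an illustration of the property; SSHash's actual minimizer-based layout is more involved:

```python
# Assign consecutive codes to consecutive k-mers of one sequence.
seq, k = "ACGGTAGGCTA", 4
codes = {seq[i:i + k]: i for i in range(len(seq) - k + 1)}

# Satellite data in code order: neighbouring k-mers have similar
# abundances, so the deltas are small and compress well.
abundances = [7, 7, 8, 8, 8, 7, 9, 9]   # one value per k-mer code
deltas = [abundances[0]] + [b - a for a, b in zip(abundances, abundances[1:])]
print(codes["CGGT"], deltas)
```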
Wang, M.; Jiang, L.; Snyder, M. P.
Motivation: Accurately detecting tissue specificity (TS) in genes helps researchers understand tissue functions at the molecular level, identify disease mechanisms, and discover tissue-specific therapeutic targets. The Genotype-Tissue Expression (GTEx) project (Consortium, 2015) and the Human Protein Atlas (HPA) project (Uhlen, et al., 2015) are two publicly available data resources providing large-scale gene expression across multiple tissue types. Multiple tissue comparisons, technical background noise and unknown variation factors make it challenging to accurately identify tissue-specific gene expression. Several methods measure overall TS in gene expression and classify genes into tissue-enrichment categories, but a robust method providing quantitative TS scores for each tissue is still lacking.
Methods: We recognized that the key to quantifying tissue-specific gene expression is to properly define a concept of expression population. We consider that, inside the population, the sample expressions from various tissues are more or less balanced, while outlier expressions outside the population may indicate tissue specificity. We then formulated the question as one of robustly estimating the population distribution. In a linear regression setting, we developed a novel data-adaptive robust estimation based on density-power weights under an unknown outlier distribution and a non-vanishing outlier proportion (Wang, et al., 2019). For quantifying TS, we focused on the Gaussian-population mixture model. We took gene heterogeneities into account and applied the robust data-adaptive procedure to estimate the population. With the robustly estimated population parameters, we constructed the AdaTiSS algorithm to obtain data-adaptive quantitative TS scores.
Results: The TS scores from the AdaTiSS algorithm are comparable across tissues and across genes, standardizing gene expression in terms of TS. Compared to categorical TS methods such as the HPA criterion, our method provides more information on the population fitting and shows advantages in quantitatively analyzing tissue-specific functions, making functional biological analysis more precise. We also discuss some limitations and possible future work.
Contact: mpsnyder@stanford.edu
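The "robust population plus standardized deviation" idea can be sketched in a few lines. AdaTiSS uses a data-adaptive density-power-weight estimator; the median/MAD pair below is a simple robust stand-in, and the expression values are invented:

```python
import numpy as np

# Expression of one gene across tissues; the last value mimics a
# tissue-specific outlier (e.g. a testis-enriched gene).
expr = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 12.0])

mu = np.median(expr)
sigma = 1.4826 * np.median(np.abs(expr - mu))  # MAD scaled to Gaussian sd
ts_scores = (expr - mu) / sigma

print(np.round(ts_scores, 2))  # only the outlier tissue scores high
```

Because the population estimate ignores the outlier, only the tissue-specific expression receives a large standardized score.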
Agret, C.; Cazaux, B.; Limasset, A.
Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable given the growing number of documents to index.
Results: We present NIQKI, a novel structure with well-designed fingerprints that leads to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, able to index any kind of fingerprint with optimal queries, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
Availability and implementation: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license, available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes with usage samples.
2012 ACM Subject Classification: Applied computing → Bioinformatics
Digital Object Identifier: 10.4230/LIPIcs.WABI.2022.25
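The MinHash fingerprints that NIQKI generalizes can be sketched as follows. This is a plain textbook MinHash, not NIQKI's (h,m)-HMH scheme; the sequences and hashing setup are invented:

```python
import hashlib

def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(kmer_set, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum."""
    return [min(hashlib.sha1(f"{salt}:{km}".encode()).hexdigest()
                for km in kmer_set)
            for salt in range(num_hashes)]

def jaccard_estimate(s1, s2):
    # Fraction of matching sketch slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

a = kmers("ACGTACGTGGTTACGTAAGGTTCC")
b = kmers("ACGTACGTGGTTACGTAAGGTTGG")
est = jaccard_estimate(minhash(a), minhash(b))
exact = len(a & b) / len(a | b)
print(round(est, 2), round(exact, 2))
```

Sketches make genome-vs-genome comparison cheap; the inverted index contributed by NIQKI is about finding, for a query sketch, all indexed sketches that share slots without scanning the whole collection.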
Mellina-Andreu, J. L.; Cisterna-Garcia, A.; Botia, J. A.
Motivation: Calculation of the semantic similarity of Gene Ontology (GO) term subsets is a fundamental task in functional genomics, comparative studies, and biomedical data integration. Existing tools, primarily in Python or R, often face severe performance limitations when scaling to large annotation datasets.
Results: We present go3, the first high-performance, Python-compatible library written in Rust that supports multiple semantic similarity metrics for GO terms and genes. go3 supports both pairwise and batch computations, optimized using Rust's parallelism and memory safety. Compared to GOATOOLS, the state of the art, it achieves up to a 5x speedup and a 25x lower memory footprint when loading the GO ontology and gene annotations, and up to a 10^3x speedup when calculating semantic similarities between genes, while preserving output compatibility.
Availability and implementation: go3 is implemented in Rust and accessible through Python 3. It is available on GitHub: https://github.com/Mellandd/GO3
Contact: joseluis.mellinaa@um.es
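One classic metric of the kind such libraries compute is Resnik similarity: the information content (IC) of the most informative common ancestor. The mini ontology and annotation probabilities below are invented for illustration; go3's supported metrics and real GO data are richer:

```python
import math

# Hand-made mini ontology: term -> parent terms.
parents = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "DNA binding": ["binding"],
}
# Fraction of annotations reaching each term (the root has probability 1).
prob = {"molecular_function": 1.0, "binding": 0.5,
        "protein binding": 0.2, "DNA binding": 0.1}
ic = {t: -math.log(p) for t, p in prob.items()}

def ancestors(term):
    out = {term}
    for p in parents[term]:
        out |= ancestors(p)
    return out

def resnik(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)

print(resnik("protein binding", "DNA binding"))  # IC of "binding"
```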
Offensperger, F.; Pan, C.; Sinn, E.; Zimmer, R.
Motivation: Categorization is an important means of interpreting data and drawing conclusions. Often, the derived categories provide evidence for diagnostic or even therapeutic approaches. The standard pipelines for differential analysis of multi-omic high-throughput data, and in particular single-cell data, yield (ranked) lists of possibly differential features after applying appropriate effect-size or significance thresholds on computed p-values and/or fold-changes.
Results: We propose the Fuzzifier* pipeline for the differential analysis of any type of high-throughput data, either raw input data or fold-change data of groups with a (small or large) number of replicates. In Fuzzifier*, categorization can be applied to any step of the analysis pipeline according to custom-designed fuzzy concepts (Fuzzifier). Thus, any (fuzzified) analysis option corresponds to a path in a commutative diagram specifying the Fuzzifier* pipeline. Fuzzifier* computes a user-defined set of paths and presents an overview of the results, thereby identifying both highly reliable (consensus) and sensitive (path-specific) features. Fuzzifier* can be applied to any analysis pipeline to obtain different views on the data and yield more reliable results. This is demonstrated by the identification of context-specific miRNAs for individual cancer types from TCGA data. Fuzzifier* could both validate known cancer-specific miRNAs and identify novel candidates. In comparison to statistical tests, Fuzzifier* focuses on value distributions of tumor and normal samples as well as paired fold-change distributions and thus identifies condition-specific features from a relatively small number of replicates.
Availability and Implementation: https://github.com/zimmerlab/fuzzifier
Contact: offensperger@bio.ifi.lmu.de and zimmer@ifi.lmu.de
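A custom fuzzy concept of the kind such a pipeline applies can be sketched for log2 fold-changes. The breakpoints below are invented and the function is only illustrative, not Fuzzifier*'s actual concept definitions:

```python
def fuzzify_lfc(lfc):
    """Return fuzzy memberships for 'down', 'unchanged', 'up'."""
    def ramp(x, lo, hi):
        # 0 below lo, 1 above hi, linear in between.
        return min(1.0, max(0.0, (x - lo) / (hi - lo)))
    up = ramp(lfc, 0.5, 1.5)
    down = ramp(-lfc, 0.5, 1.5)
    return {"down": down, "unchanged": 1.0 - max(up, down), "up": up}

print(fuzzify_lfc(1.0))   # partially 'up'
print(fuzzify_lfc(-2.0))  # fully 'down'
```

Unlike a hard fold-change cutoff, borderline features keep partial membership in several categories, which is what lets downstream paths agree or disagree gradually.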
Schindler, O.; Bucekova, G.; Svoboda, T.; Svobodova, R.
In recent years, the number of known protein structures has increased significantly. Predictive algorithms and experimental methods provide the positions of protein residues relative to each other with high accuracy. However, the local quality of the protein structure, including bond lengths, angles, and positions of individual atoms, often lacks the same level of precision. For this reason, protein structures are usually optimised by a force field prior to their application in research sensitive to structural quality. Protein structure optimisation, however, is computationally challenging. In this paper, we introduce a general method, Per-residue optimisation of protein structures: Rapid alternative to optimisation with constrained alpha carbons (PROPTIMUS RAPHAN). Rather than optimising the entire protein structure at once, PROPTIMUS RAPHAN divides the structure into overlapping residual substructures and optimises each substructure individually. This approach results in computational time that scales linearly with the size of the structure. Additionally, we present PROPTIMUS RAPHAN-GFN-FF, a reference implementation of our method employing a generic, almost QM-accurate force field, GFN-FF. (Figure 1) We tested PROPTIMUS RAPHAN-GFN-FF on 461 AlphaFold DB structures and demonstrated that our approach achieves results comparable to optimisation of the structure with constrained alpha carbons, in significantly less time.
Scientific Contribution: The main contribution of this work is the PROPTIMUS RAPHAN method and its reference parallelisable implementation, PROPTIMUS RAPHAN-GFN-FF. Because the time requirement increases linearly with the size of the structure, PROPTIMUS RAPHAN-GFN-FF optimises on average 5,000 atoms per hour on a common CPU. Therefore, prior to any research sensitive to protein structure quality, our method can be employed to obtain protein structures closer to QM accuracy.
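The per-substructure strategy can be illustrated on a toy 1-D chain: instead of relaxing the whole chain at once, overlapping windows are relaxed independently with pinned endpoints (loosely analogous to constraining alpha carbons). The "energy" here is just coordinate smoothing, invented for illustration; it is not GFN-FF:

```python
def relax_window(coords, lo, hi, rounds=50):
    """Locally relax coords[lo:hi]; the window endpoints stay pinned."""
    for _ in range(rounds):
        for i in range(lo + 1, hi - 1):
            coords[i] = 0.5 * (coords[i - 1] + coords[i + 1])

def relax_chain(coords, window=6, overlap=2):
    """Slide overlapping windows over the chain: cost grows linearly."""
    step = window - overlap
    for lo in range(0, max(1, len(coords) - window + 1), step):
        relax_window(coords, lo, min(lo + window, len(coords)))

chain = [0.0, 0.9, -0.7, 1.1, -0.2, 0.8, 0.1, 1.3, -0.5, 1.0]
relax_chain(chain)
print(chain)  # locally smoothed, global endpoints untouched
```

Because each window has constant size, total work is proportional to chain length, mirroring the linear scaling claimed above.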
Raineri, E.; Esteve Marco, M. A.; Esteve-Codina, A.
Motivation: Allele-specific methylation (ASM) refers to differential DNA methylation patterns between two alleles at a given locus. This phenomenon is often driven by genetic variants, such as single nucleotide variants (SNVs), which influence methylation in cis by affecting transcription factor or methylation regulator binding, leading to allele-specific differences. Understanding ASM is critical for elucidating gene regulation, as it impacts gene expression and contributes to normal biological variation as well as disease processes, including cancer [1] and autoimmune disorders [2],[3]. Another key driver of ASM is genomic imprinting, an epigenetic mechanism in which gene expression is regulated in a parent-of-origin-specific manner. Imprinted regions, marked during gametogenesis and maintained through cell divisions, are essential for growth, development, and metabolism. Dysregulation of imprinting is associated with developmental and metabolic disorders, such as Prader-Willi and Angelman syndromes, and certain cancers. Detecting ASM across the genome remains challenging due to its tissue- and cell-specific nature and the technical difficulty of phasing reads to assign methylation patterns to specific alleles. Current ASM detection pipelines (e.g., [4]) often require phasing via genetic variants, a computationally intensive process that is limited in regions with low heterozygosity.
Results: To address these limitations, we developed asms (Allele-Specific Methylation Scanner), a tool designed to detect ASM directly from methylation data without the need for prior phasing. asms offers a faster and complementary approach to uncovering the regulatory effects of ASM, particularly in genomic regions where genetic variants or imprinting play a critical role. For demonstration purposes we leverage the fact that reads generated by Oxford Nanopore (ONT) technology measure sequence and methylation status at once, but the same software can be used with other sequencing technologies. asms can check thousands of loci in a short time. The initial list of loci to examine can be given by the user or generated by asms through a genomic scan. The asms cluster subcommand separates the reads based on methylation. If phasing results are available, asms can use them to verify whether distinct alleles correspond to distinct base modification patterns. We benchmark our software using publicly available Ashkenazi trio data [5].
Implementation and availability: asms is implemented in Rust and Python. The software is available at https://github.com/ecmra/asms.
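The phasing-free clustering idea can be sketched as a 1-D two-means split of per-read methylation fractions: a clean bimodal separation at a locus suggests ASM. This mimics the concept behind clustering reads by methylation, not the actual asms implementation; the read fractions and the simple two-means loop are invented:

```python
# Per-read methylation fraction at one locus (invented values).
reads = [0.05, 0.10, 0.00, 0.08, 0.90, 0.95, 0.85, 0.92]

def two_means(values, iters=20):
    """Tiny 1-D k-means with k=2: returns the two cluster centres."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return lo, hi

lo, hi = two_means(reads)
print(round(lo, 3), round(hi, 3))  # well-separated centres => candidate ASM
```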
Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.
Motivation: Pangenome graphs are increasingly used in bioinformatics, with applications ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case.
Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex (a common property of pangenome graphs) can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph of, in practice, the same size. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, while vg requires more than one hour and four times more RAM.
Availability: Our method is implemented in the BubbleFinder tool (github.com/algbio/BubbleFinder), via the new ultrabubbles subcommand.
Contact: alexandru.tomescu@helsinki.fi
Chakravarty, S.; Logsdon, G.; Lonardi, S.
Complex repetitive regions (also called segmental duplications) in eukaryotic genomes often contain essential functional and regulatory information. Despite remarkable algorithmic progress in genome assembly over the last twenty years, modern de novo assemblers still struggle to accurately reconstruct these highly repetitive regions. If sequenced reads were long enough to span all repetitive regions, the problem would be solved trivially. However, even the third generation of sequencing technologies on the market cannot yet produce reads that are sufficiently long (and accurate) to span every repetitive region in large eukaryotic genomes. In this work, we introduce a novel algorithm called RAmbler to resolve complex repetitive regions based on high-quality long reads (i.e., PacBio HiFi). We first identify repetitive regions by mapping the HiFi reads to the draft genome assembly and detecting unusually high mapping coverage. Then, (i) we compute the k-mers that are expected to occur only once in the genome (i.e., single-copy k-mers, which we call unikmers), (ii) we barcode the HiFi reads based on the presence and location of their unikmers, (iii) we compute an overlap graph solely based on shared barcodes, and (iv) we reconstruct the sequence of the repetitive region by traversing the overlap graph. We present an extensive set of experiments comparing the performance of RAmbler against Hifiasm, HiCANU and Verkko on synthetic HiFi reads generated over a wide range of repeat lengths, numbers of repeats, heterozygosity rates and depths of sequencing (over 140 data sets). Our experimental results indicate that RAmbler outperforms Hifiasm, HiCANU and Verkko on the large majority of the inputs. We also show that RAmbler can resolve several long tandem repeats in Arabidopsis thaliana using real HiFi reads. The code for RAmbler is available at https://github.com/sakshar/rambler.
CCS Concepts: Applied computing → Bioinformatics; Computational genomics; Molecular sequence analysis; Theory of computation → Graph algorithms analysis.
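Steps (i)-(iii) above can be sketched on toy data: find k-mers unique in the draft ("unikmers"), tag each read with the unikmers it contains, and link reads that share a tag. The draft sequence and reads below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

k = 5
draft = "ACGTTGCAACGTTGCAGGGTACCAGT"   # "ACGTTGCA" occurs twice (a repeat)
unikmers = {km for km, c in kmer_counts(draft, k).items() if c == 1}

def barcode(read):
    # A read's barcode is the set of unikmers it contains.
    return {read[i:i + k] for i in range(len(read) - k + 1)} & unikmers

reads = ["TTGCAGGGTA", "GGGTACCAGT", "ACGTTGCAAC"]
tags = [barcode(r) for r in reads]
overlaps = [(i, j) for i, j in combinations(range(len(reads)), 2)
            if tags[i] & tags[j]]
print(overlaps)  # reads 0 and 1 share unikmer "GGGTA"
```

K-mers from the repeat never become barcodes, so only overlaps anchored in unique sequence enter the graph.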
Robinson, A. J.; Ross, E. M.
With the recent torrent of high-throughput sequencing (HTS) data, the need for highly efficient algorithms for common tasks is paramount. One task that forms the basis for all further analysis of HTS data is initial quality control, that is, the removal or trimming of poor-quality reads from the dataset. Here we present QuAdTrim, a quality control and adapter trimming algorithm for HTS data that is up to 57 times faster and uses less than 0.06% of the memory of other commonly used HTS quality control programs. QuAdTrim will reduce the time and memory required for quality control of HTS data and, in doing so, reduce the computational demands of a fundamental step in HTS data analysis. Additionally, QuAdTrim implements the removal of homopolymer Gs from the 3' end of sequence reads, a common error generated on the NovaSeq, NextSeq and iSeq100 platforms.
Availability and Implementation: The source code is freely available on Bitbucket under a BSD licence (see COPYING file for details): https://bitbucket.org/arobinson/quadtrim
Contact: Andrew Robinson, andrewjrobinson at gmail dot com
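Two of the operations a trimmer like this performs can be sketched as follows. The thresholds and the simple trailing-tail rules are illustrative, not QuAdTrim's actual algorithm:

```python
def trim_poly_g(seq, quals, min_run=3):
    """Strip a trailing poly-G run (NovaSeq/NextSeq/iSeq100 artifact)."""
    n = len(seq)
    while n > 0 and seq[n - 1] == "G":
        n -= 1
    if len(seq) - n >= min_run:       # only strip convincing runs
        return seq[:n], quals[:n]
    return seq, quals

def trim_quality(seq, quals, min_q=20):
    """Trim the low-quality 3' tail of a read."""
    n = len(quals)
    while n > 0 and quals[n - 1] < min_q:
        n -= 1
    return seq[:n], quals[:n]

seq  = "ACGTACGTGGGGG"
qual = [35, 34, 36, 33, 30, 12, 11, 10, 38, 37, 36, 35, 34]
seq, qual = trim_poly_g(seq, qual)    # removes the artificial G run
seq, qual = trim_quality(seq, qual)   # then trims the low-quality tail
print(seq)  # "ACGTA"
```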
Riccardi, C.; Passeri, I.; Cangioli, L.; Fagorzi, C.; Mengoni, A.; Fondi, M.
Motivation: DNA methylation is the most relevant form of epigenetic information, present in eukaryotes and prokaryotes, and is related to several biological phenomena, from cellular differentiation to control of gene flow, pathogenesis and virulence. The widespread use of third-generation sequencing technologies allows direct and easy detection of genome-wide methylation profiles, offering increasing opportunities to understand and exploit the epigenomic landscape.
Results: We introduce MeStudio, a pipeline for analysing and combining genome-wide methylation profiles with genomic features. Outputs report the presence of DNA methylation in coding sequences, noncoding sequences, intergenic sequences, and sequences upstream of CDSs. We show the usage and performance of MeStudio on a set of single-molecule real-time sequencing outputs from the bacterial species Sinorhizobium meliloti.
Availability and Implementation: MeStudio is written in Python, Bash and C and is freely available under an open-source GPLv3 license at https://github.com/combogenomics/MeStudio
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: combo.unifi@gmail.com
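The core cross-referencing step of such a pipeline can be sketched as classifying methylated positions against annotated features. The coordinates and the 200 bp upstream window are invented for illustration, not MeStudio's defaults:

```python
# Sorted, non-overlapping CDS intervals (start, end), invented.
cds = [(1000, 1900), (3000, 3800)]
UPSTREAM = 200   # bp window considered "upstream of CDS"

def classify(pos):
    """Assign a methylated position to CDS, upstream, or intergenic."""
    for start, end in cds:
        if start <= pos <= end:
            return "CDS"
        if start - UPSTREAM <= pos < start:
            return "upstream"
    return "intergenic"

meth_positions = [950, 1500, 2500, 2950, 3500]
print([classify(p) for p in meth_positions])
```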
Oguztuzun, C.; Gao, Z.; Xu, R.
Summary: One of the primary challenges in biomedical research is the interpretation of complex genomic relationships and the prediction of functional interactions across the genome. Tokenvizz is a novel tool for genomic analysis that enhances data discovery and visualization by combining GraphRAG-inspired tokenization with graph-based modeling. In Tokenvizz, genomic sequences are represented as graphs, where sequence k-mers (tokens) serve as nodes and attention scores as edge weights, enabling researchers to visually interpret complex, non-linear relationships within DNA sequences. Through a web-based visualization interface, researchers can interactively explore these genomic relationships and extract biologically meaningful insights about regulatory patterns and functional elements. Applied to promoter-enhancer interaction prediction tasks, Tokenvizz outperformed traditional sequential models while providing interpretable insights into genomic features, demonstrating the advantage of graph-based representations for biological discovery.
Availability and Implementation: Tokenvizz, along with its user guide, is freely accessible on GitHub at https://github.com/ceragoguztuzun/tokenvizz.
ACM Reference Format: Cerağ Oğuztuzun, Zhenxiang Gao, and Rong Xu. 2024. Tokenvizz: GraphRAG Inspired Tokenization Tool for Genomic Data Discovery and Visualization. In Proceedings of (Bioinformatics). ACM, New York, NY, USA, 7 pages. https://doi.org/XXXXXXX.XXXXXXX
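The graph representation described above can be sketched in a few lines: k-mer tokens as nodes, attention scores as edge weights. The attention values below are invented for illustration; in the real tool a trained model supplies them:

```python
seq, k = "ACGTACGGT", 3
tokens = [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical attention scores between token positions (i, j).
attention = {(0, 4): 0.91, (1, 2): 0.15, (2, 6): 0.78, (3, 5): 0.05}

# Keep only strong edges for visualization.
edges = [(tokens[i], tokens[j], w) for (i, j), w in attention.items() if w > 0.5]
for u, v, w in edges:
    print(f"{u} --{w:.2f}-- {v}")
```

Thresholding the weights is what turns a dense attention matrix into a sparse, inspectable graph.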
Fang, T.; Davydov, I.; Marbach, D.; Zhang, J. D.
Motivation: Canonical methods for gene-set enrichment analysis assume independence between gene-sets. In practice, heterogeneous gene-sets from diverse sources are frequently combined, resulting in gene-sets with overlapping genes. These overlaps compromise statistical modelling and complicate the interpretation of results.
Results: We rephrase gene-set enrichment as a regression problem. Given some genes of interest (e.g. a list of hits from an experiment) and gene-sets (e.g. functional annotations or pathways), we aim to identify a sparse list of gene-sets for the genes of interest. In a regression framework, this amounts to identifying a minimum set of gene-sets that optimally predicts whether any gene belongs to the given genes of interest. To accommodate redundancy between gene-sets, we propose regularized regression techniques such as the elastic net. We report that regression-based results are consistent with established gene-set enrichment methods but more parsimonious and interpretable.
Availability: We implement the model in gerr (gene-set enrichment with regularized regression), an R package freely available at https://github.com/TaoDFang/gerr and submitted to Bioconductor. Code and data required to reproduce the results of this study are available at https://github.com/TaoDFang/GeneModuleAnnotationPaper.
Contact: Jitao David Zhang (jitao_david.zhang@roche.com), Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland.
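The regression framing can be sketched as follows: y marks genes of interest, each column of X marks membership in one gene-set, and a penalized fit picks out the explanatory sets. gerr uses the elastic net; for a dependency-free illustration we use plain ridge regression (closed form), which shrinks rather than zeroes coefficients, and the gene-sets are invented:

```python
import numpy as np

n_genes = 200
sets = {"pathway_A": range(0, 40),
        "pathway_B": range(30, 70),    # overlaps pathway_A
        "pathway_C": range(120, 160)}
X = np.zeros((n_genes, len(sets)))     # gene x gene-set membership matrix
for j, members in enumerate(sets.values()):
    X[list(members), j] = 1.0

y = np.zeros(n_genes)
y[:40] = 1.0                           # the hits are exactly pathway_A

lam = 1.0                              # ridge penalty strength
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
for name, b in zip(sets, beta):
    print(name, round(float(b), 3))
```

Despite the A/B overlap, the fit assigns almost all the weight to pathway_A, which is the parsimony the abstract refers to.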
Sztyler, T.; Malone, B.
Motivation: We propose a system that learns consistent representations of biological entities, such as proteins and diseases, based on a knowledge graph and additional data modalities, like structured annotations and free text describing the entities. In contrast to similar approaches, we explicitly incorporate the consistency of the representations into the learning process. In particular, we use these representations to identify novel proteins associated with diseases; these novel relationships could be used to prioritize protein targets for new drugs.
Results: We show that our approach outperforms state-of-the-art link prediction algorithms for predicting unknown protein-disease associations. Detailed analysis demonstrates that our approach is most beneficial when additional data modalities, such as free text, are informative.
Availability: Code and data are available at https://github.com/nle-sztyler/research-doubler
Contact: timo.sztyler@neclab.eu
Jimenez-Blanco, A.; Lopez-Villellas, L.; Moure, J. C.; Moreto, M.; Marco-Sola, S.
Motivation: Sequence-to-graph alignment is a central problem in bioinformatics, with applications in multiple sequence alignment (MSA) and pangenome analysis, among others. However, current algorithms for optimal affine-gap alignment impose high memory and computational requirements, limiting their scalability when aligning long sequences to complex graphs. Practical solutions partially address this problem using heuristic strategies that ultimately trade optimality for speed.
Results: This work presents Theseus, a novel, fast, and optimal affine-gap sequence-to-graph alignment algorithm. Theseus leverages similarities between genomic sequences to accelerate the alignment computation and reduces the overall memory requirements without compromising optimality. To that end, Theseus exploits the diagonal transition property to process only a subset of the dynamic programming cells, combined with a sparse-data strategy that enables efficient sequence-to-graph alignment. Moreover, our algorithm supports optimal affine-gap alignment on arbitrary directed graphs, including those with cycles. We evaluate Theseus on two key problems: multiple sequence alignment (MSA) and pangenome read mapping. For MSA, we compare it against the state-of-the-art methods SPOA, abPOA, and POASTA. Theseus is 2.0x to 232.2x faster than the two optimal aligners, SPOA and POASTA. Compared with abPOA, a heuristic aligner, Theseus is 3.3x faster on average while ensuring optimality. For pangenome read mapping, we benchmark Theseus against the alignment stage of the popular mapping tool vg map, along with the alignment kernels of SPOA, abPOA, and POASTA. Theseus outperforms the other methods, showing a 1.9x to 16.9x speed improvement on short reads.
Availability: Theseus code and documentation are publicly available at https://github.com/albertjimenezbl/theseus-lib.
Contact: albert.jimenez.blanco@upc.es
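The diagonal-transition property can be illustrated on the much simpler sequence-to-sequence case via Myers' O(ND) greedy algorithm (insertions/deletions only). Theseus itself handles affine gaps and graphs; this sketch only shows why similar sequences need few dynamic programming cells: only the furthest-reaching point per diagonal k = x - y is kept:

```python
def diff_distance(a, b):
    """Myers' O(ND) algorithm: length of the shortest edit script
    (insertions/deletions) between a and b."""
    n, m = len(a), len(b)
    v = {1: 0}   # furthest-reaching x per diagonal
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # step down: insertion
            else:
                x = v[k - 1] + 1      # step right: deletion
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1   # free "snake" along matches
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

print(diff_distance("ABCABBA", "CBABAC"))  # classic example: 5
```

When the inputs are similar, the answer is found at a small d, so only O((n+m)·d) cells are ever touched instead of the full n×m table.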
Somerville, V.; Schmid, M.; Dreier, M.; Engel, P.
Summary (two sentences): For many publicly available bacterial genomes, the chromosome start is not set at the replication initiation protein. This is a burden for comparative genomics studies, as the synteny between such genomes cannot readily be assessed. Here, we present ReCycled, a tool for identifying and resetting the start of circular bacterial chromosomes.
Availability and implementation: Freely available on GitHub at https://github.com/Freevini/ReCycled under the GPL-3.0 license. Runs on all tested GNU/Linux systems.
Contact: Vincent Somerville, vincent@somerville.earth
Supplementary Information: The analyses, scripts and data to test the different pipelines are deposited on Zenodo (10.5281/zenodo.15170502).
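The re-setting operation itself is a rotation of the circular sequence. The sketch below shows the idea on an invented sequence, with a short motif standing in for the replication initiation gene (real tools locate dnaA by homology search, not exact string matching):

```python
def recycle(chromosome, marker):
    """Rotate a circular chromosome so it starts at the marker gene."""
    i = chromosome.find(marker)
    if i < 0:
        raise ValueError("marker not found")
    return chromosome[i:] + chromosome[:i]  # same circular genome, new start

chrom = "GGTTCCAATGGATCCTTAA"   # circular; "ATGGATCC" stands in for dnaA
print(recycle(chrom, "ATGGATCC"))
```

Because the genome is circular, the rotation changes no biological content, only the coordinate origin, which is exactly what makes synteny comparable across genomes.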