Bioinformatics — Latest Matching Preprints

1

Protocol for constructing correlation-based molecular networks from large-scale untargeted metabolomics data

Lin, H.; Zhang, L.; Lotfi, A.; Jarmusch, A.; Lee, I.; Kim, A.; Morton, J.; Aksenov, A. A.

2026-04-21 bioinformatics 10.1101/2025.04.26.649581 medRxiv

Top 0.8%

28.1%

This protocol describes a computational approach for constructing correlation-based molecular networks from untargeted metabolomics data using MetVAE, a variational autoencoder-based framework. Complementing spectral similarity networks, it captures functional relationships reflected in cross-sample correlations. The workflow imports metabolomics features and sample metadata, adjusts for compositionality, missingness, confounding, and high dimensionality, estimates sparse metabolite correlations, and exports GraphML files for network visualization. In a hepatocellular carcinoma mouse model, it links lipid classes in high-fat-diet animals, suggesting an endogenous "auto-brewery" route to lipotoxic metabolites.
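
To make the "adjust for compositionality, then correlate" idea concrete, here is a minimal sketch. It is not MetVAE: it assumes a centered log-ratio (CLR) transform as the compositional adjustment and a hard threshold on Pearson correlation as a stand-in for sparse estimation. The function names, pseudocount, and threshold are illustrative assumptions.

```python
import math

def clr(row, pseudocount=1.0):
    """Centered log-ratio transform of one sample's feature intensities."""
    logs = [math.log(x + pseudocount) for x in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_edges(samples, names, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold.

    `samples` is a list of rows (one per sample), each holding one
    intensity per feature, in the same order as `names`.
    """
    transformed = [clr(row) for row in samples]
    columns = list(zip(*transformed))  # one column per feature
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                edges.append((names[i], names[j], r))
    return edges
```

The resulting edge list could then be written to GraphML with a graph library of choice; that export step is omitted here.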

2

MutaPhy: A clade-based framework to detect genotype-phenotype associations on phylogenetic trees

Ngo, A.; Guindon, S.; Pedergnana, V.

2026-04-21 evolutionary biology 10.64898/2026.04.19.719535 medRxiv

Top 1%

22.9%

Understanding how genetic variation in pathogens influences clinical phenotypes observed in infected hosts is a fundamental challenge in evolutionary genomics and public health. Phenotypic traits such as infection severity are often non-randomly distributed within the pathogen's phylogeny, suggesting the existence of evolutionary determinants but also violating the independence assumption underlying classical genome-wide association studies and potentially leading to inflated false positive rates. We present MutaPhy, a phylogeny-based method aimed at detecting correlations between a binary host phenotype and the corresponding pathogen genome by directly utilizing the hierarchical structure of phylogenetic trees. MutaPhy encompasses three different scales: (i) a subtree scale, on which relevant clades over-representing the phenotype of interest are detected using permutation-based tests; (ii) a tree scale, which agglomerates local signals into a global association statistic; and (iii) a site scale, whereby candidate mutational events on branches leading to significant clades are examined using ancestral sequence reconstruction. We evaluate the statistical behavior and detection performance of MutaPhy using simulations under diverse evolutionary scenarios. We also compare this tool to several existing phylogenetic association methods. As illustrative applications, we apply MutaPhy to dengue virus and hepatitis C virus datasets associated with clinical phenotypes in human hosts. Our results highlight the ability of the proposed approach to detect viral lineages associated with over-represented phenotypes while revealing limited evidence for robust mutation-level associations in these particular datasets. Altogether, MutaPhy provides a framework for guiding genotype-phenotype association analyses by leveraging phylogenetic structure, thereby reducing false positive findings and improving the interpretability of association signals.
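
The subtree-scale step can be illustrated with a generic permutation test for phenotype over-representation in one clade. This is a sketch under simplifying assumptions, not MutaPhy's actual procedure: tip labels are shuffled uniformly across the whole tree, the clade is given as a set of tip indices, and the function name and defaults are hypothetical.

```python
import random

def clade_permutation_pvalue(phenotypes, clade_indices, n_perm=10000, seed=0):
    """One-sided permutation p-value for over-representation in a clade.

    phenotypes    : list of 0/1 labels, one per tip of the tree
    clade_indices : indices of the tips that belong to the clade under test
    """
    rng = random.Random(seed)
    clade = list(clade_indices)
    observed = sum(phenotypes[i] for i in clade)
    labels = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between tip position and label
        if sum(labels[i] for i in clade) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Testing many nested clades on the same tree creates a multiple-testing problem, which is presumably what the tree-scale aggregation in the abstract addresses; this sketch does not attempt it.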

3

PathPinpointR: Predicting the progression of sc-RNAseq samples through reference trajectories.

Nicholas, M. T.; Mehta, D.; Ouyang, J.; Dawoud, A.; Ellison, C.; Westendorf, J.; Green, L. A.; Skipp, P.; Rackham, O.

2026-04-21 bioinformatics 10.64898/2026.04.21.715327 medRxiv

Top 1%

22.3%

Single-cell RNA sequencing (scRNA-seq) has transformed our ability to analyse cellular heterogeneity, enabling detailed mapping of cellular progression. Trajectory inference tools construct trajectories from scRNA-seq data, facilitating the tracing of cellular progression through developmental pathways. PathPinpointR (PPR) is a lightweight and user-friendly R package developed to predict and compare the positions of scRNA-seq samples along reference biological trajectories, such as those created from large cell atlas projects. PPR utilises sets of switching-gene events from reference trajectories as indicators of cellular progression. By applying these positional indicators to query datasets, each cell can be accurately assigned a pseudo-time value, providing predictive insight into its position along a trajectory. This information can be used to stage cells within an established developmental process, or to evaluate how different patient samples compare when mapped onto reference disease or drug response trajectories. Availability: PathPinpointR is available at https://github.com/moi-taiga/PathPinpointR. Contact: o.j.l.rackham@soton.ac.uk

4

OpusTaxa: A Unified Workflow for Taxonomic Profiling, Assembly, and Functional Analysis of Shotgun Metagenomes

Chen, Y.-K.; Harker, C. M.; Pham, C. M.; Grundy, L.; Wardill, H. R.; Roach, M. J.; Ryan, F. J.

2026-04-19 bioinformatics 10.64898/2026.04.15.718825 medRxiv

Top 1%

22.0%

Shotgun metagenomics has become a cornerstone of microbiome research, yet the complexity of existing workflows remains a major barrier for life scientists without dedicated bioinformatics support. Manual database setup, detailed sample sheet preparation, and management of software dependencies can make routine analysis difficult and time-consuming. Cross-study comparisons are further hampered by inconsistent processing pipelines, database versions, and profiling strategies, limiting reproducibility and the potential for large-scale meta-analyses. We present OpusTaxa, an open-source Snakemake workflow that provides end-to-end processing of short paired-end shotgun metagenomic data with minimal configuration. Users provide either FASTQ files or Sequence Read Archive accessions; OpusTaxa automatically downloads required databases, performs quality control, removes host reads, and executes taxonomic profiling, metagenome assembly, and functional analysis. All analysis modules can be independently toggled, and per-sample outputs are automatically merged into harmonised, cross-sample tables ready for downstream exploration. Across two public datasets, we demonstrate how OpusTaxa can be used to compare consistency across complementary taxonomic profilers and to estimate microbial load in addition to standard metagenomic workflows. Availability: OpusTaxa is freely available at https://github.com/yenkaiC/OpusTaxa. Documentation, test data, and example configurations are included in the repository.

5

diagFDR: Verifiable False Discovery Rate Reporting in Proteomics via Scope, Calibration, and Stability Diagnostics

Chion, M.; Godmer, A.; Douche, T.; Matondo, M.; Giai Gianetto, Q.

2026-04-20 bioinformatics 10.64898/2026.04.16.718468 medRxiv

Top 1%

21.6%

In mass spectrometry-based proteomics, false discovery rate (FDR) control underpins the credibility of peptide and protein identifications. In contemporary workflows, including multi-run Data Independent Acquisition (DIA), deep learning-assisted scoring, library-free searches, and extensive post-processing, the statement "1% FDR" has become increasingly ambiguous, potentially referring to different statistical entities, multiple-testing scopes, and null models. We propose a standardized framework requiring explicit specification of three complementary properties: "scope", meaning which statistical universe is controlled; "calibration", meaning whether confidence measures behave consistently with their intended interpretation on the reported unit; and "stability", meaning whether acceptance thresholds and resulting identification lists remain robust to perturbations. Building on routine target/decoy outputs, we introduce pipeline-agnostic diagnostics that audit internal coherence of scores, q-values, and posterior error probabilities, quantify tail support and cutoff fragility, and test plausibility of target-decoy assumptions. We further complement internal checks with external validation via entrapment, which measures empirical false positives on known-absent sequences. We highlight a "granularity paradox": as scoring becomes more discriminative, decoy matches can become so sparse near stringent cutoffs that the numerical support for decoy-based estimation deteriorates, making reported FDR thresholds increasingly fragile despite improved separation between the distributions of target and decoy scores. Applications to DIA-NN and MS2Rescore show that scope and aggregation choices can materially alter both estimated error rates and list reproducibility. We provide a practical reporting checklist and an open-source R package (diagFDR, available from CRAN) that generates diagnostic reports from standard software outputs. As a minimal verifiable reporting standard, we recommend that any "FDR = %" claim specify the controlled unit and scope, report tail support at the operating cutoff, and make decoy-inclusive outputs available for independent verification.
Highlights:
- FDR claims can be misleading without explicit scope, calibration, and stability assessment.
- diagFDR introduces pipeline-agnostic diagnostics from standard software outputs.
- The granularity paradox shows sparse decoy tails can make stringent cutoffs numerically fragile.
- Case studies show that scope misuse and rescoring can affect both error rates and stability.
- diagFDR produces reviewer-ready reports and a practical reporting checklist.
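
The "tail support" and "granularity paradox" ideas can be seen in a few lines of arithmetic. The sketch below is not the diagFDR API: it assumes the widely used conservative target-decoy estimator (decoys + 1) / targets, and both function names are hypothetical.

```python
def decoy_fdr_at_cutoff(scores, is_decoy, cutoff):
    """Return (estimated FDR, accepted targets, accepted decoys) at a cutoff.

    Uses the conservative estimator (decoys + 1) / targets, capped at 1.0.
    The accepted-decoy count is the "tail support" behind the estimate.
    """
    targets = decoys = 0
    for score, decoy in zip(scores, is_decoy):
        if score >= cutoff:
            if decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 1.0, 0, decoys
    return min(1.0, (decoys + 1) / targets), targets, decoys

def one_decoy_fragility(decoys):
    """Factor by which the uncapped estimate grows if one more decoy passes."""
    return (decoys + 2) / (decoys + 1)
```

With 99 accepted decoys, one extra decoy changes the estimate by about 1%; with zero accepted decoys, it doubles it. That sensitivity at sparse decoy tails is the fragility the abstract describes.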

6

GNOMES: an integrated framework for genome-wide normalization and differential binding analysis of CUT&RUN and ChIP-seq data

Roule, T.; Akizu, N.

2026-04-21 bioinformatics 10.64898/2026.04.16.718722 medRxiv

Top 1%

19.7%

Background: Despite their use, quantitative comparison of epigenomic datasets such as ChIP-seq and CUT&RUN remains challenging, particularly due to difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient due to the high variability in signal-to-noise ratios across samples, even within the same experiment. While exogenous spike-in normalization can address some issues, robust spike-in controls are not always available and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding analysis are typically performed using separate bioinformatics tools. Indeed, most differential analysis frameworks operate on raw count matrices, preventing users from visually inspecting normalized signal tracks and evaluating how normalization influences the results. To overcome these challenges, we developed GNOMES (Genome-wide NOrmalization of Mapped Epigenomic Signals), a framework that integrates signal normalization, quality control, and differential binding analysis within a unified workflow. Results: GNOMES is a user-friendly tool able to process ChIP-seq and CUT&RUN datasets from aligned reads and generate normalized coverage profiles and differential binding results. The tool implements a robust genome-wide normalization strategy based on percentile scaling of signal local maxima, enabling stable normalization between biological replicates and conditions. GNOMES supports both single- and paired-end sequencing, does not require a negative control (input or IgG), and can be applied to both broad (histone marks) and narrow (transcription factor) enrichment patterns. The workflow includes normalization, optional consensus peak identification, and differential binding analysis. For each step, GNOMES generates extensive quality-control metrics and visual outputs, including normalized bigWig tracks, median signal tracks, BED files of regions with significant changes, and diagnostic plots such as heatmaps and PCA. GNOMES is highly configurable and integrates established tools such as MACS2 for identifying candidate peak regions for differential binding analysis, as well as DESeq2 and edgeR for statistical testing. Finally, GNOMES is organism-agnostic and can be applied to epigenomic datasets from any model system. Conclusions: GNOMES provides an integrated and highly customizable environment for normalization and differential binding analysis of epigenomic sequencing data. By integrating signal normalization, downstream statistical methods for differential binding analysis, and comprehensive quality control, GNOMES simplifies the analysis of ChIP-seq and CUT&RUN datasets for the identification of chromatin changes.
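
One way to read "percentile scaling of signal local maxima" is sketched below. It is an interpretation, not the GNOMES implementation: it assumes binned coverage as a plain list, a simple neighbour comparison to find local maxima, and a rescaling so that a chosen percentile of those maxima equals a common target. All names and defaults are assumptions.

```python
def local_maxima(signal):
    """Values of bins that are higher than the left and not lower than the right neighbour."""
    return [signal[i] for i in range(1, len(signal) - 1)
            if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]]

def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to take a percentile of")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def scale_to_reference(signal, q=99.0, target=1.0):
    """Rescale a coverage track so the q-th percentile of its local maxima equals `target`."""
    ref = percentile(local_maxima(signal), q)
    if ref <= 0:
        raise ValueError("reference percentile must be positive")
    factor = target / ref
    return [v * factor for v in signal]
```

Because the scaling factor comes from peak heights rather than total read count, two tracks that differ only by sequencing depth map to the same normalized profile.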

7

ProNA3D: Distance-Based Analysis of Nucleic Acid-Containing Interfaces

Genz, L. R.; Topf, M.

2026-04-21 bioinformatics 10.64898/2026.04.16.719043 medRxiv

Top 2%

18.1%

Biomolecular interactions are central to many essential cellular processes, but RNA-containing complexes remain challenging to resolve structurally, even as experimental methods and AI-based prediction have expanded structural coverage. Tools for the integrated analysis of complex interfaces remain limited. We present ProNA3D, a tool that provides a unified platform for analyzing protein-nucleic acid and nucleic acid-only complexes, bridging the gap between structure prediction and functional interpretation. ProNA3D supports both experimental and computationally predicted structures, incorporating scoring metrics for AlphaFold3 predictions. It also offers interactive two-dimensional interface visualization and secondary-structure topology plots for RNA and DNA. An interface-based density zoning feature facilitates structure analysis in cryo-EM maps, allowing evaluation of dynamic complexes in the context of heterogeneous density. We demonstrate ProNA3D on diverse complexes solved by X-ray crystallography or cryo-EM, as well as on computational models. For example, in a trimeric complex of HIV-1 RNA and a human antibody, ProNA3D identified a high-connectivity nucleotide with potential functional relevance. Applying ProNA3D to the entire Protein Data Bank revealed distinct interface connectivity trends and interaction modes characteristic of specific complexes (e.g., methyltransferase-DNA and CRISPR-associated complexes) in nucleic acid-containing interfaces. The method is available as both a UCSF ChimeraX plug-in for visualization and a command-line tool at https://gitlab.com/topf-lab/ProNA3D.

8

Pan1c: a pipeline to easily build chromosome-level pangenome graphs

Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.

2026-04-21 bioinformatics 10.64898/2026.04.17.719212 medRxiv

Top 2%

16.9%

Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species provide the opportunity to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference, called a pangenome graph. Because the process of building and analysing a pangenome graph is still complex, with related software packages under development, there is an important need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we showed the value of Pan1C for assembly and graph validation as well as for performing primary analyses.

9

Bi-level diversity optimisation for representative protein panel selection

Ou, Z.; James, K.; Charnock, S.; Wipat, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719243 medRxiv

Top 2%

14.7%

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
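
The combination of a MaxMin and a MaxSum objective with a swap-based local search can be sketched as follows. This is one plausible reading of "bi-level", assumed here to mean lexicographic ordering (minimum pairwise distance first, total pairwise distance as tie-breaker); the authors' exact formulation and move set may differ, and the function names are hypothetical.

```python
from itertools import combinations

def panel_score(dist, panel):
    """(minimum pairwise distance, sum of pairwise distances) for a panel."""
    pairs = [dist[i][j] for i, j in combinations(sorted(panel), 2)]
    return (min(pairs), sum(pairs))

def select_panel(dist, k):
    """First-improvement swap search for a panel of exactly k items.

    dist : symmetric matrix (list of lists) of pairwise distances
    Tuples compare lexicographically, so MaxMin dominates and MaxSum
    breaks ties. The result is a local optimum, not a guaranteed global one.
    """
    n = len(dist)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of items")
    panel = set(range(k))
    best = panel_score(dist, panel)
    improved = True
    while improved:
        improved = False
        for out in sorted(panel):
            for cand in range(n):
                if cand in panel:
                    continue
                trial = (panel - {out}) | {cand}
                score = panel_score(dist, trial)
                if score > best:
                    panel, best, improved = trial, score, True
                    break
            if improved:
                break
    return sorted(panel), best
```

Because the search only reads the distance matrix, it works equally with sequence-identity or structure-based distances, which matches the representation independence the abstract emphasises.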

10

Large-scale analysis of ligand binding mode similarities in the PDB using interaction fingerprints

Kunnakkattu, I. R.; Choudhary, P.; Midlik, A.; Fleming, J. R.; Balasubramaniyan, B.; Sasidharan Nair, S.; Velankar, S.

2026-04-21 bioinformatics 10.64898/2026.04.17.719144 medRxiv

Top 2%

14.6%

Three-dimensional structures of protein-ligand complexes are essential for insights into the molecular principles that govern ligand recognition and binding. With more than 180,000 ligand-bound entries in the Protein Data Bank (PDB), representing over two million individual complexes, the volume of available structural data offers unprecedented opportunities for large-scale analysis of interaction patterns. Analysis of interaction patterns across the PDB archive can help discover similarities and differences in the binding modes of ligands, assisting in drug discovery. However, large-scale analysis of up-to-date information remains a significant challenge due to the rapid growth of data. Here, we introduce the Extended Connectivity Interaction Fingerprint (ECIFP), an interaction-based fingerprint that simplifies 3D protein-ligand contact information into a fingerprint, while retaining key molecular and chemical features of the interacting fragments. The simpler fingerprint representation of the interaction data makes comparison of millions of protein-ligand complexes tractable. Benchmarking shows that ECIFP outperforms ligand-only Extended Connectivity Fingerprints in identifying similar binding sites across identical protein sequences occupied by chemically diverse ligands. Our analysis showed that similarities calculated using ECIFP can be used to compare macromolecular complexes with similar or different ligands. In this study, we demonstrate two large-scale applications of ECIFP: (1) identification of distinct binding modes for over 9,000 ligands across the entire PDB, and (2) detection of binding-mode similarities among structurally diverse ligands within the same binding site across 48,870 binding sites from over 21,000 proteins.

11

REPLAY: A reproducible and user-friendly application for DNA replication timing analysis from Repli-seq data

Dickinson, Q.; Yu, C.; Rivera-Mulia, J. C.

2026-04-21 genomics 10.64898/2026.04.16.719037 medRxiv

Top 2%

14.4%

Background: DNA replication timing (RT) is a fundamental feature of genome organization that is regulated in a cell-type-specific manner and frequently altered in disease. Repli-seq is the standard approach for genome-wide RT profiling; however, its analysis typically requires multiple independent tools and custom scripts, limiting reproducibility, portability, and accessibility, particularly for users without computational expertise. In addition, existing workflows often lack standardization and require substantial user intervention. Results: We developed REPLAY, a fully automated, reproducible, and user-friendly application for replication timing analysis. REPLAY is distributed as a standalone executable that enables end-to-end processing from compressed FASTQ files to genome-wide RT profiles without requiring software installation or programming experience. Through an intuitive graphical interface, users can configure analysis parameters, including input and output directories, reference genome, normalization strategy (quantile, median, or interquartile range), and smoothing. The application integrates all processing steps (quality control, trimming, alignment, binning, RT log2 calculation, normalization, smoothing, and visualization) within a single automated workflow. Application of REPLAY to publicly available datasets demonstrates accurate reconstruction of RT profiles and high reproducibility across samples. Conclusions: REPLAY offers a portable, reproducible, and accessible solution for the analysis of RT data. By eliminating the need for command-line tools and complex installations, it lowers the entry barrier, enabling standardized analysis across diverse research settings.
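
The "RT log2 calculation" step can be illustrated with the common two-fraction (early/late) Repli-seq form, in which each genomic bin gets log2(early / late) after matching library sizes. This is a sketch assuming that design; the abstract does not spell out REPLAY's exact formula, and the function name and pseudocount are assumptions.

```python
import math

def rt_log2_ratio(early_counts, late_counts, pseudocount=1.0):
    """Per-bin log2(early / late) after scaling late counts to the early library size.

    Positive values indicate earlier replication, negative values later.
    """
    if len(early_counts) != len(late_counts):
        raise ValueError("early and late tracks must have the same number of bins")
    early_total = sum(early_counts)
    late_total = sum(late_counts)
    if early_total == 0 or late_total == 0:
        raise ValueError("both fractions need at least one read")
    scale = early_total / late_total
    return [
        math.log2((e + pseudocount) / (l * scale + pseudocount))
        for e, l in zip(early_counts, late_counts)
    ]
```

Quantile, median, or interquartile-range normalization and smoothing, as listed in the abstract, would then be applied to these per-bin values.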

12

Single-cell hit calling in high-content imaging screens with Buscar

Serrano, E.; Li, W.-s.; Way, G. P.

2026-04-19 bioinformatics 10.64898/2026.04.15.718737 medRxiv

Top 2%

12.2%

High-content screening (HCS) enables the systematic quantification of single-cell morphology features across thousands of perturbations, capturing rich phenotypic heterogeneity. Image-based profiling is a critical bioinformatics processing step in this pipeline, as researchers use it to predict mechanisms of action, assess toxicity, perform hit calling, and more. However, current image-based profiling workflows rely on aggregate statistics, such as calculating mean or median feature values per well, implicitly assuming cell homogeneity. This limitation obscures subpopulation effects, reducing sensitivity to subtle or heterogeneous effects of perturbations. Here we present Buscar, a method that leverages the full heterogeneity of single-cell image-based profiles to call hits. Buscar requires two reference single-cell populations that define distinct morphology states: a reference state (e.g., disease cells) and a target state (e.g., healthy cells). Buscar then compares these two groups to define on- and off-morphology signatures, which it then uses to score every perturbation in a given screen. The scores quantify perturbation efficacy and off-target effects, or specificity, in an interpretable manner, clarifying which morphologies are appropriately altered and which may arise from off-target activity. We apply Buscar to three datasets. First, as a proof of concept, we applied Buscar to a Cell Painting dataset of cardiac fibroblasts from patients with heart failure. Buscar quantifies both morphology rescue and off-target morphology activity in these cells treated with a TGFβ receptor inhibitor. Second, we show that Buscar recovers biologically coherent gene-phenotype associations across 16 manually labeled phenotypes in the MitoCheck dataset. Lastly, applied to CPJUMP1, we show that Buscar is robust to technical replicates collected across plates in both small-molecule and CRISPR-Cas9 perturbations. Together, these results establish Buscar as a reproducible and interpretable hit calling method that overcomes aggregation bias, enabling the simultaneous quantification of compound efficacy and specificity to enhance hit calling in HCS. We release Buscar as an open-source Python package.

13

DNAharvester: A Nextflow Pipeline for Analysing Highly Degraded DNA from Ancient and Historical Specimens

Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; Heintzman, P. D.; Dalen, L.

2026-04-21 bioinformatics 10.64898/2026.04.20.719564 medRxiv

Top 3%

10.4%

Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. 
We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.
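
Of the variant calling strategies listed, pseudo-haploid random allele calling is the simplest to show: at each site, one read base is drawn at random to represent the individual, which avoids calling heterozygotes from very low coverage. The sketch below is a generic illustration, not DNAharvester's code; the function name and quality threshold are assumptions.

```python
import random

def pseudo_haploid_call(bases, quals, min_qual=30, rng=None):
    """Pick one random high-quality base from the reads covering a site.

    bases : string or list of observed read bases at the site
    quals : matching list of base qualities (Phred scale)
    Returns None when no base passes the filters.
    """
    rng = rng or random.Random()
    usable = [b for b, q in zip(bases, quals) if q >= min_qual and b in "ACGT"]
    if not usable:
        return None
    return rng.choice(usable)
```

Passing a seeded `random.Random` makes the draw reproducible, which matters when the same dataset is reprocessed.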

14

Structure-informed Siamese graph neural networks classify CirA missense variants with implications for cefiderocol susceptibility

Razavi, M.; Tellapragada, C.; Giske, C. G.

2026-04-21 bioinformatics 10.64898/2026.04.17.718272 medRxiv

Top 3%

9.9%

Cefiderocol uptake in Enterobacterales depends partly on TonB-dependent catecholate transporters, including CirA, yet the functional interpretation of CirA missense variation remains limited by an absence of large experimental phenotype datasets. Here we describe a structure-informed Siamese graph neural network (GNN) framework designed to prioritise CirA missense variants that are likely to impair transporter function and thereby contribute to reduced cefiderocol susceptibility. Because large experimental datasets of CirA missense phenotypes are not available, we trained the model on a synthetic mutant set generated from structurally motivated rules applied to the CirA reference structure (AlphaFold model, UniProt P17315). Each residue was represented using protein language model embeddings, backbone geometry, and amino-acid identity, and paired wild-type and mutant graphs were compared through a shared encoder. On synthetic held-out benchmarks, the model achieved strong classification performance on a position-held-out split (macro-F1 = 0.989 against synthetic labels). Applied to a collection of Escherichia coli CirA protein sequences, the framework prioritised a subset of variants as high-confidence non-benign candidates and assigned many others to review or abstain categories, reflecting predictive uncertainty outside the synthetic training distribution. A post-hoc severity-ranking scheme triages disruptive candidates for follow-up. This framework demonstrates that structure-informed synthetic data generation paired with Siamese GNN inference can bridge the gap between sequence-level genomic surveillance and mechanistic functional prediction of outer-membrane transporter variants.

15

KIR*BLOOM: Accurate KIR genotyping using a new copy number-aware integrated genotype likelihood framework

Gohar, Y.; Garcia, A. D.; Kichula, K. M.; Norman, P. J.; Dilthey, A. T.

2026-04-20 bioinformatics 10.64898/2026.04.15.718735 medRxiv

Top 3%

9.7%

Killer-cell immunoglobulin-like receptor (KIR) genes, key modulators of natural killer (NK) cell activity, play critical roles in immune response and disease susceptibility. Accurate KIR genotyping from short-read sequencing data remains challenging because of high sequence similarity among genes, extensive copy number variation, and substantial allelic diversity. Here, we present KIR*BLOOM, a likelihood-based approach for KIR genotyping from short-read data that models read depth and sequencing error across alternative genotype configurations. KIR*BLOOM first identifies KIR-relevant read pairs, maps them to a KIR allele database, and reduces the candidate allele space by excluding alleles unlikely to be present. It then infers gene copy number and selects alleles under the inferred copy-number constraints. Finally, variant calling is used to refine CDS sequences and identify potential novel alleles. We evaluated performance on 45 whole-genome sequencing samples with haplotype-resolved assemblies from the HPRC or HGSVC, using Immuannot-derived annotations as ground truth. KIR*BLOOM achieved 99.85% precision, 99.92% recall, and a Jaccard index of 99.77% for copy-number inference. At five-digit allele resolution, it achieved 92.73% precision, 92.69% recall, and an 87.29% Jaccard index, outperforming T1K, GraphKIR, and Geny. Together, these results demonstrate that KIR*BLOOM enables highly accurate KIR genotyping from short-read sequencing data.
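
For readers unfamiliar with how precision, recall, and a Jaccard index apply to genotype calls with copy number, the sketch below scores one sample by treating called and true alleles as multisets. The abstract does not define how its metrics are computed or aggregated, so this is an assumed definition, and the allele labels in the usage note are illustrative placeholders.

```python
from collections import Counter

def genotype_metrics(called, truth):
    """Precision, recall, and Jaccard index for copy-number-aware allele calls.

    `called` and `truth` are lists of allele labels; a label repeated twice
    means two copies, so multiset intersection counts matched copies.
    """
    c, t = Counter(called), Counter(truth)
    tp = sum((c & t).values())   # copies present in both
    fp = sum((c - t).values())   # called but not in truth
    fn = sum((t - c).values())   # in truth but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    union = tp + fp + fn
    jaccard = tp / union if union else 1.0
    return precision, recall, jaccard
```

For example, calling `["A*001", "A*001", "B*005"]` against a truth of `["A*001", "A*002", "B*005"]` matches two copies, with one false positive and one false negative, giving a Jaccard index of 0.5.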

16

Dissecting the coordinated progression of cell states in spatial transcriptomics with CoPro

Miao, Z.; Qu, Y.; Huang, S.; Laux, L.; Peters, S.; Aristel, A.; Zhang, Z.; Niedernhofer, L. J.; McMahon, A.; Kim, J.; Zhang, N.

2026-04-21 bioinformatics 10.64898/2026.04.17.719309 medRxiv

Top 3%

9.1%

Spatial transcriptomics enables the study of how cells coordinate their molecular states within tissue, providing insight into both normal function and disease processes. A key challenge is to identify gene expression programs that vary continuously across space and are coordinated between cell types. We present CoPro, a computational framework for detecting the spatially coordinated progression of cellular states. CoPro can operate in both supervised and unsupervised modes to identify gene programs that co-vary within or between cell types, and to disentangle multiple overlapping spatial patterns. CoPro can be applied to single-cell-level spatial transcriptomics datasets, including MERFISH, SeqFISH+, Xenium, and histology-imputed transcriptomic data. We demonstrate the utility of CoPro with data collected from colon, brain, liver, and kidney tissues. In the colon, CoPro separates epithelial differentiation along the crypt axis from spatially localized inflammatory signals. In the aging liver, it identifies multiple aging-associated cellular programs superimposed on anatomical zonation. In the brain, the flexible kernel design enables the decoupling of the gene expression gradient along the dorsal-ventral and medial-lateral axes. In the kidney, CoPro identifies tubule-vasculature coordination that is essential in nephron function. These results demonstrate CoPro's utility for analyzing spatial coordination of gene expression in complex tissues and disentangling overlapping biological processes, such as anatomical organization and disease-associated variation.

17

DOME Copilot: Making transparency and reproducibility for artificial intelligence methods simple

Farrell, G.; Attafi, O. A.; Fragkouli, S.-C.; Heredia, I.; Fernandez Tobias, S.; Harrison, M.; Hermjakob, H.; Jeffryes, M.; Obregon Ruiz, M.; Pearce, M.; Pechlivanis, N.; Lopez Garcia, A.; Psomopoulos, F.; Tosatto, S. C. E.

2026-04-19 bioinformatics 10.64898/2026.04.16.718888 medRxiv

Top 4%

6.8%

Unprecedented breakthroughs are being made in life science research through the application of artificial intelligence (AI). However, adherence to method reporting guidelines is necessary to support their reusability and reproducibility. The DOME Copilot solution extracts structured reports of AI methods using a large language model to help interpret manuscripts. It is a fast and efficient resource capable of scaling to annotate the corpus of global AI literature, unlocking value and trust in published methods.

18

Practical quantification of immunohistochemistry antigen concentrations and reaction-diffusion parameters

Peale, F. V.; Perng, W.; Mbiribindi, B.; Andrews, B. T.; Wang, X.; Dunlap, D.; Eastham, J.; Ngu, H.; Chernyshev, A.; Orlova, D.

2026-04-21 pathology 10.64898/2026.04.16.719078 medRxiv

Top 4%

6.4%

The immunohistochemistry (IHC) methods widely used in diagnostic medicine and biomedical research are kinetically complex reaction-diffusion processes that, ideally, produce stain intensities correlated with the local antigen concentration. Yet after 75 years of use, practical theoretical tools to rigorously plan and interpret IHC experiments are still lacking. Because modeling the reactions requires time-consuming computer simulation, impractical for regular use, most protocols are optimized empirically, without detailed knowledge of the reaction rates and antigen-antibody equilibria. The resulting stain intensities can be calibrated against standards with known antigen abundance, but they are typically not interpretable in terms of chemical antigen concentrations. To address these limitations, we developed a fast interpolation method to model reaction-diffusion behavior, and experimental methods to characterize IHC kinetic parameters in formalin-fixed paraffin-embedded (FFPE) samples. Used together, these allow experimental measurement of both the chemical concentration of antigen in the sample and the reaction-diffusion parameters consistent with the assay results. Results show 1) direct immunofluorescent detection has low nanomolar sensitivity with >1000-fold dynamic range, and 2) antibody diffusion rates in FFPE samples can be >1000-fold slower than in aqueous solutions, producing diffusion-limited conditions in which the IHC reaction time course may depend on the sample antigen concentration. Awareness of these details is necessary to avoid potential underestimation of both the absolute and relative antigen concentrations in different samples that may occur if staining is stopped before reaching equilibrium. Software tools are provided to allow users to rapidly model IHC reaction time courses and to fit experimental time course data with candidate reaction parameters. 
The principles described here apply equally to other tissue-based "spatial omics" analyses and should be considered when designing and interpreting experiments requiring any macromolecule to diffuse into and react in a tissue section.

SIGNIFICANCE: The theoretical and experimental framework described here advances IHC staining from a qualitative or semi-quantitative method towards a more rigorously quantitative assay. The practical ability to predict IHC reaction kinetics and fit reaction parameters to experimental data has the potential to advance IHC applications in diagnostic medicine and biomedical research in three ways: 1) interpretation of experimental and diagnostic samples stained under different conditions can be more objective, facilitating comparison of results from different protocols and different laboratories; 2) IHC staining can be interpreted as molar chemical antigen-antibody concentrations calculated from the reaction parameters measured in the studied sample; 3) the correlation between antigen concentration and biological behavior can be examined more reliably. Practical software tools are provided.
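To give a sense of scale for the diffusion result reported in this abstract, the relation below is the standard one-dimensional estimate of the time needed to diffuse a distance L with diffusion coefficient D. It is an order-of-magnitude sketch for free diffusion only and ignores antibody binding, so it is not the authors' reaction-diffusion model; the symbols are introduced here for illustration.

```latex
\tau \approx \frac{L^{2}}{2D},
\qquad
\frac{\tau_{\mathrm{FFPE}}}{\tau_{\mathrm{aq}}}
  = \frac{D_{\mathrm{aq}}}{D_{\mathrm{FFPE}}}
  \quad \text{at fixed } L .
```

Under this estimate, a 1000-fold lower diffusion coefficient implies a roughly 1000-fold longer time to reach the same depth, which is consistent with the abstract's caution about stopping staining before equilibrium.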

19

Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv

Top 4%

6.4%

Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies previously undetected, biologically coherent pathways (such as gamma-secretase pathways). The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
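For readers unfamiliar with conjugate normal priors, the sketch below shows the textbook update for a single coefficient: a normally distributed estimate is combined with a normal prior by precision weighting. This is only the standard full-Bayes building block, not the hybrid estimator derived in the preprint; the function name, the known-variance assumption, and the numeric values are illustrative assumptions.

```python
def normal_posterior(beta_hat, se2, prior_mean, prior_var):
    """Posterior mean and variance of a coefficient under a normal prior.

    Assumes beta_hat ~ N(beta, se2) with known variance se2,
    and a prior beta ~ N(prior_mean, prior_var).
    """
    data_precision = 1.0 / se2
    prior_precision = 1.0 / prior_var
    # Precisions add; the posterior mean is the precision-weighted average.
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * beta_hat
                            + prior_precision * prior_mean)
    return post_mean, post_var


# Illustrative values: an informative prior centered at zero.
post_mean, post_var = normal_posterior(
    beta_hat=1.0, se2=1.0, prior_mean=0.0, prior_var=1.0)

# A very diffuse prior leaves the estimate almost unchanged.
vague_mean, vague_var = normal_posterior(
    beta_hat=1.0, se2=1.0, prior_mean=0.0, prior_var=1e6)

print(post_mean, post_var)
print(vague_mean, vague_var)
```

The second call illustrates why prior strength matters in this setting: as the prior variance grows, the posterior mean approaches the data-only estimate, so a weak elicited prior has little influence on the result.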

20

3D Reconstruction of Nanoparticle Distribution in Tumor Spheroids with Volume Electron Microscopy

Bottone, D.; Gerken, L. R.; Habermann, S.; Mateos, J. M.; Lucas, M. S.; Riemann, J.; Fachet, M.; Resch-Genger, U.; Kissling, V. M.; Roesslein, M.; Gogos, A.; Herrmann, I. K.

2026-04-21 bioinformatics 10.64898/2026.04.17.719153 medRxiv

Top 4%

6.3%

Spatially resolved characterization of nanomaterial (NM) distribution within cellular ultrastructure is essential for understanding NM fate and activity in biological systems. Volume electron microscopy (vEM) is uniquely positioned to address this challenge, yet fully documented quantitative pipelines that simultaneously segment NMs and cellular structures remain scarce. Here, an end-to-end analytical pipeline is presented based on the example of serial block-face scanning electron microscopy (SBF-SEM) data of tumor spheroids containing nanoparticles (NPs). A hybrid segmentation strategy is adopted: a fine-tuned Cellpose-SAM model for cells and nuclei, and an empirical Bayes approach for AuNPs. The fine-tuned model outperforms both the pre-trained baseline and benchmark experiments in Amira, and shows good generalization to 2D EM datasets of varying sample types, suggesting potential as a general-purpose segmentation model for electron microscopy. Full 3D reconstruction of NP distributions reveals preferential clustering in the perinuclear region, with a median nucleus-to-NP distance of 2.57 µm and NM uptake spanning several orders of magnitude across cells. Furthermore, morphological analysis of segmented cells and nuclei using 3D shape descriptors and local curvature metrics provides quantitative access to features inaccessible from single sections. Together, these results establish a reproducible, open framework for the joint quantitative analysis of NM distribution and cellular morphology in vEM data.