GigaScience — Latest Matching Preprints

1

REPLAY: A reproducible and user-friendly application for DNA replication timing analysis from Repli-seq data

Dickinson, Q.; Yu, C.; Rivera-Mulia, J. C.

2026-04-21 genomics 10.64898/2026.04.16.719037 medRxiv

Top 0.1%

17.1%

Background: DNA replication timing (RT) is a fundamental feature of genome organization that is regulated in a cell-type-specific manner and frequently altered in disease. Repli-seq is the standard approach for genome-wide RT profiling; however, its analysis typically requires multiple independent tools and custom scripts, limiting reproducibility, portability, and accessibility, particularly for users without computational expertise. In addition, existing workflows often lack standardization and require substantial user intervention. Results: We developed REPLAY, a fully automated, reproducible, and user-friendly application for replication timing analysis. REPLAY is distributed as a standalone executable that enables end-to-end processing from compressed FASTQ files to genome-wide RT profiles without requiring software installation or programming experience. Through an intuitive graphical interface, users can configure analysis parameters, including input and output directories, reference genome, normalization strategy (quantile, median, or interquartile range), and smoothing. The application integrates all processing steps (quality control, trimming, alignment, binning, RT log2 calculation, normalization, smoothing, and visualization) within a single automated workflow. Application of REPLAY to publicly available datasets demonstrates accurate reconstruction of RT profiles and high reproducibility across samples. Conclusions: REPLAY offers a portable, reproducible, and accessible solution for the analysis of RT data. By eliminating the need for command-line tools and complex installations, it lowers the entry barrier, enabling standardized analysis across diverse research settings.
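
As a rough illustration of the "RT log2 calculation" and normalization steps named in the abstract above, the sketch below computes a per-bin log2 ratio of early to late read counts and then rescales it by median and interquartile range. The function names, the early/late input layout, the pseudocount, and the exact scaling are assumptions made for illustration; the abstract does not describe how REPLAY implements these steps.

```python
import math
import statistics


def log2_ratio(early, late, pseudocount=1.0):
    """Per-bin log2(early / late) from read counts.

    The pseudocount (an assumed choice) avoids taking the log of zero.
    """
    return [math.log2((e + pseudocount) / (l + pseudocount))
            for e, l in zip(early, late)]


def median_iqr_normalize(values):
    """Center values on their median and divide by the interquartile range."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    return [(v - med) / iqr for v in values]


# Hypothetical read counts for five genomic bins.
early_counts = [30, 12, 7, 3, 15]
late_counts = [6, 12, 15, 15, 7]

rt = log2_ratio(early_counts, late_counts)
rt_norm = median_iqr_normalize(rt)
```

Positive values in `rt` mark bins enriched in the early fraction and negative values mark bins enriched in the late fraction; a bin with equal counts gives exactly zero.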

2

Machine learning-based prediction of memory requirements for metagenomic assembly in high-performance computing environments

Zierep, P. F.; Faack, S.; Beracochea, M.; Sanchez, S.; Batut, B.; Finn, R. D.; Gruening, B. A.

2026-05-13 microbiology 10.64898/2026.05.12.724571 medRxiv

Top 0.1%

17.0%

Metagenomic assembly can be a computationally intensive step in microbiome analysis, with memory requirements that vary widely depending on input data characteristics. In workflow systems like Galaxy and large-scale platforms like MGnify, which run thousands of heterogeneous jobs, inaccurate memory allocation drives job failures and costly retries when underestimated, and reduces throughput when overestimated. Current approaches rely primarily on heuristic rules based on input file size or sample metadata, which often fail to generalize across diverse datasets. In this study, we present a machine learning-based framework for predicting memory requirements of metagenomic assembly using metaSPAdes. We analyzed 300 assembly jobs from diverse biomes and evaluated 18 predictive models using combinations of input file size, biome classification, and sequence-derived k-mer features. K-mer profiles were computed from raw sequencing data and summarized into statistical descriptors capturing sequence complexity and diversity. Model performance was assessed using both conventional regression metrics and a production-oriented cost function that accounts for retry policies and resource waste in high-performance computing environments. Our results show that machine learning models can outperform commonly used heuristics. In particular, models incorporating biome information achieved the best performance and can be tuned to favor conservative predictions that reduce job failure rates. Simpler models based solely on input file size also performed competitively, offering a practical alternative for systems with limited feature availability. When evaluated under realistic workload distributions, predictive approaches reduced total memory waste by several million gigabyte-hours per 1,000 jobs compared to static allocation strategies. These findings demonstrate that data-driven resource prediction can substantially improve efficiency in metagenomic workflows. 
The proposed framework is adaptable to different computational environments and provides a foundation for integrating predictive resource allocation into large-scale bioinformatics platforms beyond Galaxy.
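
To make the idea of a production-oriented cost function more concrete, here is a minimal sketch that scores a single memory prediction under a retry policy. The policy itself (a failed attempt wastes its whole allocation for the full runtime, and each retry doubles the allocation), the function name, and all parameters are assumptions for illustration only; the abstract does not specify the cost function used in the study.

```python
def allocation_cost(predicted_gb, peak_gb, runtime_h,
                    retry_factor=2.0, max_retries=3):
    """Return (wasted GB-hours, attempts used) for one job.

    Assumed policy: an attempt fails when its allocation is below the job's
    peak memory; the whole allocation of a failed attempt counts as waste,
    and the next attempt multiplies the allocation by retry_factor.
    """
    allocation = predicted_gb
    wasted = 0.0
    for attempt in range(1, max_retries + 2):
        if allocation >= peak_gb:
            # Successful attempt: only the unused headroom is wasted.
            wasted += (allocation - peak_gb) * runtime_h
            return wasted, attempt
        # Pessimistic assumption: the job fails only after its full runtime.
        wasted += allocation * runtime_h
        allocation *= retry_factor
    raise RuntimeError("job did not fit in memory within the retry budget")


over = allocation_cost(100, 80, 2)   # over-allocation, succeeds first time
under = allocation_cost(50, 80, 2)   # under-allocation, needs one retry
```

Under these assumptions, under-predicting by 30 GB costs more (140 GB-hours) than over-predicting by 20 GB (40 GB-hours), which is the kind of asymmetry that makes conservative predictions attractive.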

3

MOAflow: how re-design a pipeline with Nextflow streamlines data analysis

Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.

2026-03-30 bioinformatics 10.64898/2026.03.26.713914 medRxiv

Top 0.1%

15.3%

Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data of the original article was used to benchmark the new pipeline. Its outputs closely match those of the original study with minor variations. Conclusions: MOAflow demonstrates how the adoption of robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands.

4

Community needs for FAIR pathogen data

van Geest, G.; Thomas-Lopez, D.; Feitzinger, A. A.; Weissgold, L. A.; Halabi, S.; Cuesta, I.; Hjerde, E.; Gurwitz, K. T.; Arora, N.; Neves, A.; Palagi, P. M.; Williams, J. J.

2026-04-15 scientific communication and education 10.64898/2026.04.14.718420 medRxiv

Top 0.1%

14.5%

Background: Datasets related to infectious diseases are essential for public health decision-making, yet their reuse remains limited by persistent barriers to data sharing and integration. Achieving data that are Findable, Accessible, Interoperable, and Reusable (FAIR) is widely recognized as essential for accelerating scientific discovery and enabling coordinated responses to emerging threats, but the needs of the global pathogen data community have not been systematically characterized. Aim: This study, conducted by the Pathogen Data Network (PDN), aims to identify infrastructural and educational priorities among stakeholders working with infectious disease-related data in order to guide community-responsive support for data sharing and interoperability. Methods: A cross-sectional stakeholder survey was disseminated to a well-defined expert population within PDN networks and via open professional channels. A total of 136 responses from researchers, healthcare professionals, bioinformaticians, and educators were analyzed descriptively to identify prioritized barriers, training needs, and preferred support mechanisms. Results: Respondents consistently identified structural constraints as the primary impediments to effective data use, including limited funding (74%), data-aggregation challenges (68%), and a shortage of skilled personnel (52%). Respondents identified bioinformatics for infectious disease research (68%) as the highest priority for training, followed by guidance on using the integrated pathogen data and tools portal provided by the PDN, the Pathogens Portal (51%). The Pathogens Portal was also ranked as the most essential PDN resource (72%). Preferred training formats included virtual short courses (68%) and webinars (66%). Notably, while researchers emphasized technical subjects like machine learning, educators prioritized foundational case studies.
Conclusion: These findings provide an evidence-based diagnostic of community needs and suggest that barriers to FAIR pathogen data are predominantly systemic rather than purely technological. The survey framework and openly available dataset offer a reusable template for assessing needs in other communities and regions. By aligning training, infrastructure development, and outreach with empirically identified priorities, organizations supporting infectious disease research can strengthen the interoperability and reuse of data and establish a benchmark for future community-driven improvements.

5

Claw4Science: A Dataset and Platform for the OpenClaw Scientific Agent Ecosystem

Xu, M.; Chen, J.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715118 medRxiv

Top 0.1%

14.0%

Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform at https://claw4science.org, which is built on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.

6

REBEL, Reproducible Environment Builder for Explicit Library resolution

Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.

2026-04-07 bioinformatics 10.64898/2026.04.04.716498 medRxiv

Top 0.1%

10.7%

Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege and not a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared REBEL against the standard package manager by installing 1,000 randomly sampled CRAN packages in isolated Docker containers; REBEL resolved 149 of the 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core

7

Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study

Zhang, X.

2026-04-02 bioinformatics 10.64898/2026.03.28.715041 medRxiv

Top 0.1%

10.5%

Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
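
The repeatability figures quoted above are Jaccard indices between two runs of the same system. The sketch below shows one plausible way to compute a mean per-category Jaccard and a pooled "micro" Jaccard from sets of UniProt accessions. The abstract does not define micro-Jaccard, so pooling (category, accession) pairs across categories is an assumption here, and the category names and accessions are placeholders rather than data from the study.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_category_jaccard(run1, run2):
    """Average the per-category Jaccard index over all categories seen."""
    categories = sorted(set(run1) | set(run2))
    if not categories:
        raise ValueError("no categories to compare")
    scores = [jaccard(run1.get(c, ()), run2.get(c, ())) for c in categories]
    return sum(scores) / len(scores)


def micro_jaccard(run1, run2):
    """Pool (category, accession) pairs across categories, then compare once."""
    pooled1 = {(c, acc) for c, accs in run1.items() for acc in accs}
    pooled2 = {(c, acc) for c, accs in run2.items() for acc in accs}
    return jaccard(pooled1, pooled2)


# Placeholder accessions for two runs of the same hypothetical system.
run_a = {"transport": {"P1", "P2", "P3"}, "signaling": {"P4"}}
run_b = {"transport": {"P1", "P2"}, "signaling": {"P4", "P5", "P6", "P7"}}

macro = mean_category_jaccard(run_a, run_b)
micro = micro_jaccard(run_a, run_b)
```

The two summaries can diverge: the mean weights every category equally, while the pooled version is dominated by the largest categories, which may be one reason a system can score 0.795 on one and 0.571 on the other.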

8

OncoContour: An Interactive Platform for Geographic Visualization and Demographic Analysis of Cancer Incidence.

White, D.; Uzun, A.

2026-05-22 bioinformatics 10.64898/2026.05.20.726625 medRxiv

Top 0.1%

10.2%

Cancer incidence varies substantially across geographic regions and demographic groups, yet translating large-scale surveillance datasets into accessible, interpretable visualizations remains a challenge for researchers and public health professionals without computational expertise. We developed OncoContour, an interactive web-based platform that enables geographic visualization and demographic analysis of cancer incidence data through a browser-accessible interface. To demonstrate its capabilities, we analyzed publicly available cancer incidence data from the United States Cancer Statistics database via CDC WONDER, covering five major cancer types across four northeastern U.S. metropolitan statistical areas from 2017 through 2022, supplemented by demographic data from the U.S. Census Bureau American Community Survey. OncoContour integrates population distribution heatmaps, per-capita cancer incidence heatmaps, interactive multi-city temporal trend charts, structured cancer data tables, and demographic visualizations covering race, ethnicity, age, and sex distributions into a single dynamically generated HTML report. The platform is implemented in Python using Flask, Folium, Plotly, and Matplotlib, and is containerized using Docker for reproducible local deployment. Across all four metropolitan areas, breast and prostate cancers accounted for the highest incidence counts over the study period, while a decline in reported cases observed in 2020 is consistent with documented disruptions to cancer screening during the COVID-19 pandemic. By integrating geospatial mapping, temporal analysis, and demographic visualization within a unified, no-code interface, OncoContour aims to support cancer surveillance, epidemiological investigation, and targeted public health planning. OncoContour is freely available at https://github.com/alperuzun/oncocontour_docker.

9

Open neuroinformatics infrastructure ecosystem for federated multisite studies

Wang, M.; Bhagwat, N.; Cremonesi, F.; Dugre, M.; Pfarr, J.-K.; d'Angremont, E.; Dai, A.; Jahanpour, A.; Urchs, S.; Cansiz, S.; Chambon, L.; Dincer, A. T.; Torres, J.; Vesin, M.; Pinilla-Monsalve, G.; Song, Y.; Vriend, C.; Jeanson, F.; Monchi, O.; van der Werf, Y. D.; Lorenzi, M.; Poline, J.-B.

2026-05-05 neuroscience 10.64898/2026.04.30.721944 medRxiv

Top 0.1%

10.2%

Despite growing understanding of the benefits of having Findable, Accessible, Interoperable, and Reusable (FAIR) data, many datasets still cannot be shared. Federated analysis methods can enable multisite studies that do not require the sharing of participant-level information. However, there are many practical hurdles that prevent the large-scale adoption of federated methods. We discuss challenges related to cross-site data preparation for federated learning, present solutions offered by recent neuroinformatics projects, and showcase an example of tool integration applied to neurodegenerative disease data.

10

User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams

Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.

2026-03-12 bioinformatics 10.64898/2026.03.10.710813 medRxiv

Top 0.1%

10.0%

As biomedical knowledge keeps growing, resources storing available information multiply and grow in size and complexity. Such resources can be in the format of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially to novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototyping in the form of a hackathon and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.

11

VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv

Top 0.1%

10.0%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], is also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server that currently offers thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses, including signal processing and peak calling, quantification, variant analysis, alignment statistics, and interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.

12

Exploring genetic, expression and regulatory patterns of parental alleles in Muscovy duck (Cairina moschata) using haplotype-resolved assemblies

Li, T.; Wang, Y.; Zhang, Z.; Chen, C.; Zheng, N.; Wang, J.; Ning, M.; Wang, J.; Ai, H.; Huang, Y.

2026-03-07 genomics 10.64898/2026.03.04.709678 medRxiv

Top 0.1%

10.0%

Background: Although the biological mechanism for heterosis has been debated for a long time, heterosis is widely utilized to increase the global productivity of crops and livestock. Recently, the mechanism has been well characterized in crops and livestock with a male-heterogametic XY system due to genomic assembly advancements, especially the availability of haploid genomes. However, the biological mechanism for heterosis remains unclear in poultry possessing the female-heterogametic ZW system. Results: Here, we assembled chromosome-level diploid and haploid genomes of the Muscovy duck. We developed an efficient and cost-effective method to assemble 12 variation graph-haploid Muscovy duck genomes from three full-sibling pairs with high quality using short-read Illumina sequences. We further characterized genetic, expression and regulatory patterns of parental alleles at multiple scales. We found that maternal haploid genomes generally had more open chromatin organization and higher accessibility, and higher levels of gene expression, while showing similar DNA methylation levels when compared to paternal haploid genomes. In contrast, the female paternal Z chromosome showed the most, and the male paternal Z chromosome presented more, relaxed chromatin organization and chromatin accessibility, and gene expression compared to the male maternal Z chromosome. Thus, the ZW system largely relies on compensation and balance to regulate gene expression on the sex Z chromosome. Moreover, we identified non-Mendelian regions covering 0.26% of the genome (~3.18 Mb). These regions contained lower gene density, GC content, and repeat sequence frequency, but were enriched for DNA motifs bound by transcription factors, likely leading to a compacted chromatin structure and lower chromatin accessibility.
Conclusions: Our work here provides a comprehensive profile of parental alleles' genetic, expression and regulatory patterns in the female-heterogametic ZW system, and might be useful for the utilization of heterosis in poultry.

13

From expansion to consolidation: two decades of Gene Ontology evolution

Pitarch, B.; Pazos, F.; Chagoyen, M.

2026-03-06 bioinformatics 10.64898/2026.03.04.709507 medRxiv

Top 0.1%

9.9%

The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.

14

Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation

Zhang, X.

2026-04-21 bioinformatics 10.64898/2026.04.18.719404 medRxiv

Top 0.1%

9.1%

Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein sequences), and the same local evidence: Swiss-Prot BLASTP output, Pfam/HMMER domain hits, DeepTMHMM topology predictions, and SignalP secretion predictions. We audited the nine outputs for coverage, biological correctness, missing evidence, hallucinated or over-specific annotations, and within-method consistency, then merged the best-supported evidence into a final orthogroup annotation table. All nine runs covered all 73 orthogroups, indicating that the agents could retrieve and organize the complete input set. However, normalized calcification-relevance calls were only moderately reproducible: within-method exact tier agreement ranged from 0.397 to 0.685 for Claude App (mean 0.562), 0.342 to 0.740 for Claude Code (mean 0.516), and 0.411 to 0.630 for Codex App (mean 0.539), and the per-run number of high-confidence calls varied from 0 to 12 across the nine runs. The final curated table retained 3 high-confidence, 9 moderate, 18 watchlist, and 43 low-relevance orthogroups. The most robust direct candidates were sulfatase (OG0017138) and sulfotransferase (OG0020703) families and an FG-GAP/integrin-like surface protein family (OG0018986), whereas common error modes included elevating pentapeptide-repeat orthogroups on motif evidence alone, treating weakly secreted housekeeping enzymes as matrix proteins, and taking low-complexity BLAST labels at face value. 
Skill-enabled agents improved file handling, evidence traceability, and reproducibility of computational checking, but they did not eliminate biological overinterpretation. These results support a best-practice workflow in which LLM agents draft annotations only after deterministic evidence tables are generated, with explicit scoring rules, provenance columns, run-to-run replication, and expert review of high-impact claims.
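
The "within-method exact tier agreement" reported above can be read as the fraction of orthogroups that receive the same relevance tier in two runs, computed for every pair of runs. The sketch below follows that reading, which is an assumption since the abstract does not give the formula; the orthogroup labels and tier assignments are invented placeholders, not values from the study.

```python
from itertools import combinations


def exact_agreement(run_a, run_b):
    """Fraction of shared orthogroups assigned the identical tier in both runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("runs share no orthogroups")
    return sum(run_a[og] == run_b[og] for og in shared) / len(shared)


def pairwise_agreement(runs):
    """Exact agreement for every unordered pair of runs."""
    return [exact_agreement(a, b) for a, b in combinations(runs, 2)]


# Placeholder tier calls from three runs of one hypothetical configuration.
run1 = {"OG_A": "high", "OG_B": "low", "OG_C": "watchlist", "OG_D": "low"}
run2 = {"OG_A": "moderate", "OG_B": "low", "OG_C": "watchlist", "OG_D": "low"}
run3 = {"OG_A": "high", "OG_B": "low", "OG_C": "low", "OG_D": "moderate"}

scores = pairwise_agreement([run1, run2, run3])
summary = (min(scores), max(scores), sum(scores) / len(scores))
```

With three runs there are three pairs, so a range plus a mean, as quoted in the abstract, summarizes all of them.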

15

geneslator: an R package for comprehensive gene identifier conversion and annotation

Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714723 medRxiv

Top 0.1%

8.8%

Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it

16

Nipoppy: A framework for standardizing neuroimaging studies to facilitate international derived-data sharing

Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.

2026-05-21 bioinformatics 10.64898/2026.05.18.723593 medRxiv

Top 0.1%

8.7%

Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.

17

Dingent: An Easily Deployable Database Retrieval and Integration Agent framework

Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.

2026-03-20 bioinformatics 10.64898/2026.03.17.712026 medRxiv

Top 0.1%

8.5%

AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.

18

AI in Practice: A Multilingual Survey of 2025 BioHackathon Participants

Sriwichai, N.; Feriau, L.; Tongyoo, P.; Noda, Y.; Gyoji, H.; Noisagul, P.; Goto, S.; Steinberg, D.; Wangsanuwat, C.

2026-03-27 scientific communication and education 10.64898/2026.03.25.713611 medRxiv

Top 0.1%

8.4%

This dataset arises from a multilingual survey of AI use among participants and community members in the DBCLS BioHackathon 2025 in Japan. The questionnaire, offered in English, Japanese, and Thai, asked how often respondents use AI tools, what they use them for, obstacles they encounter, institutional support, satisfaction, and concerns. Additional items captured role, institution type, work country, and other demographics, totaling 105 responses. The dataset includes both raw anonymized responses and a cleaned, standardized English-only version suitable for quantitative analysis, along with the full questionnaire, a data dictionary for the cleaned dataset, and a translation lookup table. Free-text answers were screened and redacted to remove URLs, names, and other potentially identifiable information. Together, these materials provide a community-level view of AI practice in genomics, bioinformatics, software development, and related areas, and can support work on AI adoption, policy, and methods for analyzing survey data on AI use in science.

19

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv

Top 0.1%

8.4%

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h² and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.
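Two quantities in the heritability benchmarking abstract above can be illustrated briefly: the share of negative estimates (133 of 844) and a Pearson correlation between heritability and test AUC. The sketch below is not the authors' pipeline. The `pearson_r` helper and the toy `h2` and `auc` values are assumptions made for illustration; only the counts 133 and 844 come from the abstract.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Share of negative heritability estimates, using the counts in the abstract.
negative_share = 133 / 844
print(f"{negative_share:.1%}")  # prints 15.8%

# Toy values (hypothetical, not study data): heritability inputs and test AUCs.
h2 = [0.05, 0.10, 0.20, 0.40]
auc = [0.61, 0.60, 0.62, 0.61]
r = pearson_r(h2, auc)
print(round(r, 3))
```

A coefficient near zero, as the abstract reports for both PRS frameworks, would indicate that test AUC changes little with the heritability input.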

20

muat: portable transformer-based method for tumour classification and representation learning from somatic variants

Sanjaya, P.; Pitkänen, E.

2026-04-03 bioinformatics 10.64898/2026.04.01.715762 medRxiv

Top 0.1%

8.4%

Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, transformer-based software for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on the previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole-genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in the Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net