GigaScience — Latest Matching Preprints

1

Machine learning-based prediction of memory requirements for metagenomic assembly in high-performance computing environments

Zierep, P. F.; Faack, S.; Beracochea, M.; Sanchez, S.; Batut, B.; Finn, R. D.; Gruening, B. A.

2026-05-13 microbiology 10.64898/2026.05.12.724571 medRxiv

Top 0.1%

17.0%

Metagenomic assembly can be a computationally intensive step in microbiome analysis, with memory requirements that vary widely depending on input data characteristics. In workflow systems like Galaxy and large-scale platforms like MGnify, which run thousands of heterogeneous jobs, inaccurate memory allocation drives job failures and costly retries when underestimated, and reduces throughput when overestimated. Current approaches rely primarily on heuristic rules based on input file size or sample metadata, which often fail to generalize across diverse datasets. In this study, we present a machine learning-based framework for predicting memory requirements of metagenomic assembly using metaSPAdes. We analyzed 300 assembly jobs from diverse biomes and evaluated 18 predictive models using combinations of input file size, biome classification, and sequence-derived k-mer features. K-mer profiles were computed from raw sequencing data and summarized into statistical descriptors capturing sequence complexity and diversity. Model performance was assessed using both conventional regression metrics and a production-oriented cost function that accounts for retry policies and resource waste in high-performance computing environments. Our results show that machine learning models can outperform commonly used heuristics. In particular, models incorporating biome information achieved the best performance and can be tuned to favor conservative predictions that reduce job failure rates. Simpler models based solely on input file size also performed competitively, offering a practical alternative for systems with limited feature availability. When evaluated under realistic workload distributions, predictive approaches reduced total memory waste by several million gigabyte-hours per 1,000 jobs compared to static allocation strategies. These findings demonstrate that data-driven resource prediction can substantially improve efficiency in metagenomic workflows. 
The proposed framework is adaptable to different computational environments and provides a foundation for integrating predictive resource allocation into large-scale bioinformatics platforms beyond Galaxy.
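
To make the idea of a production-oriented cost function concrete, here is a minimal sketch of how memory waste under a retry policy might be scored for a single job. It does not reproduce the authors' cost function: the function name, the doubling retry policy, and the assumption that a failed attempt consumes its full allocation for the whole runtime are all illustrative assumptions.

```python
def retry_cost(predicted_gb, peak_gb, runtime_h, growth=2.0, max_retries=3):
    """Score one job under a hypothetical retry policy.

    Assumptions (not taken from the preprint):
      * an attempt fails when its allocation is below the true peak memory;
      * a failed attempt wastes its whole allocation for the full runtime;
      * each retry multiplies the allocation by `growth`.

    Returns (wasted_gb_hours, attempts, succeeded).
    """
    if predicted_gb <= 0 or peak_gb <= 0 or runtime_h <= 0 or growth <= 1:
        raise ValueError("memory, runtime must be positive and growth > 1")

    alloc = float(predicted_gb)
    wasted = 0.0
    attempts = 0
    for _ in range(max_retries + 1):      # first try plus the allowed retries
        attempts += 1
        if alloc >= peak_gb:
            # Success: only the unused headroom counts as waste.
            wasted += (alloc - peak_gb) * runtime_h
            return wasted, attempts, True
        # Failure: the whole allocation was held for nothing.
        wasted += alloc * runtime_h
        alloc *= growth
    return wasted, attempts, False


# An overestimate wastes headroom; an underestimate pays for a failed run.
print(retry_cost(100, 80, 2))   # over-allocation by 20 GB for 2 h
print(retry_cost(50, 80, 2))    # one failure, then success at 100 GB
```

Summing this score over a workload gives one number per model, which is why a predictor that slightly overestimates can still beat one with a lower regression error but more failed jobs.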

2

OncoContour: An Interactive Platform for Geographic Visualization and Demographic Analysis of Cancer Incidence.

White, D.; Uzun, A.

2026-05-22 bioinformatics 10.64898/2026.05.20.726625 medRxiv

Top 0.1%

10.2%

Cancer incidence varies substantially across geographic regions and demographic groups, yet translating large-scale surveillance datasets into accessible, interpretable visualizations remains a challenge for researchers and public health professionals without computational expertise. We developed OncoContour, an interactive web-based platform that enables geographic visualization and demographic analysis of cancer incidence data through a browser-accessible interface. To demonstrate its capabilities, we analyzed publicly available cancer incidence data from the United States Cancer Statistics database via CDC WONDER, covering five major cancer types across four northeastern U.S. metropolitan statistical areas from 2017 through 2022, supplemented by demographic data from the U.S. Census Bureau American Community Survey. OncoContour integrates population distribution heatmaps, per-capita cancer incidence heatmaps, interactive multi-city temporal trend charts, structured cancer data tables, and demographic visualizations covering race, ethnicity, age, and sex distributions into a single dynamically generated HTML report. The platform is implemented in Python using Flask, Folium, Plotly, and Matplotlib, and is containerized using Docker for reproducible local deployment. Across all four metropolitan areas, breast and prostate cancers accounted for the highest incidence counts over the study period, while a decline in reported cases observed in 2020 is consistent with documented disruptions to cancer screening during the COVID-19 pandemic. By integrating geospatial mapping, temporal analysis, and demographic visualization within a unified, no-code interface, OncoContour aims to support cancer surveillance, epidemiological investigation, and targeted public health planning. OncoContour is freely available at https://github.com/alperuzun/oncocontour_docker.

3

Open neuroinformatics infrastructure ecosystem for federated multisite studies

Wang, M.; Bhagwat, N.; Cremonesi, F.; Dugre, M.; Pfarr, J.-K.; d'Angremont, E.; Dai, A.; Jahanpour, A.; Urchs, S.; Cansiz, S.; Chambon, L.; Dincer, A. T.; Torres, J.; Vesin, M.; Pinilla-Monsalve, G.; Song, Y.; Vriend, C.; Jeanson, F.; Monchi, O.; van der Werf, Y. D.; Lorenzi, M.; Poline, J.-B.

2026-05-05 neuroscience 10.64898/2026.04.30.721944 medRxiv

Top 0.1%

10.2%

Despite growing understanding of the benefits of having Findable, Accessible, Interoperable, and Reusable (FAIR) data, many datasets still cannot be shared. Federated analysis methods can enable multisite studies that do not require the sharing of participant-level information. However, there are many practical hurdles that prevent the large-scale adoption of federated methods. We discuss challenges related to cross-site data preparation for federated learning, present solutions offered by recent neuroinformatics projects, and showcase an example of tool integration applied to neurodegenerative disease data.

4

VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv

Top 0.1%

10.0%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], is also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server that currently offers thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses, including signal processing and peak calling, quantification, variant analysis, alignment statistics, and interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.

5

Nipoppy: A framework for standardizing neuroimaging studies to facilitate international derived-data sharing

Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.

2026-05-21 bioinformatics 10.64898/2026.05.18.723593 medRxiv

Top 0.1%

8.7%

Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.

6

Benchmarking long-read simulators against Oxford Nanopore whole-genome sequencing data

Taouk, M. L.; Ingle, D. J.; Wick, R. R.

2026-05-11 bioinformatics 10.64898/2026.05.06.723380 medRxiv

Top 0.1%

8.2%

Background: Oxford Nanopore Technologies (ONT) sequencing is increasingly used for whole-genome sequencing (WGS) across a wide range of applications. However, the platform has evolved rapidly through updates to flow cell chemistry and basecalling algorithms, altering the characteristics of the resulting sequencing data. Read simulators provide synthetic datasets with known ground truth, enabling controlled development and evaluation of methods. However, many existing simulators were developed for earlier versions of ONT sequencing or use generic long-read assumptions, and their realism for contemporary ONT data is unclear. Results: We benchmarked six ONT-compatible read simulators (Badread, LongISLND, lrsim, NanoSim, PBSIM3 and SimLoRD) using a microbial genome reference and ONT R10.4.1 reads as the empirical standard. Each tool was configured to maximise realism, including training on empirical reads when supported. We compared simulated and real datasets with respect to read length, read accuracy, FASTQ quality scores and sequence error profiles. No simulator reproduced all metrics of the real data well. PBSIM3 most closely reproduced read length, read accuracy and FASTQ quality scores, making it a strong simulator for broad read-level realism. However, it did not capture important features of the real error profile, including context-dependent substitution rates and homopolymer-length errors. Badread and LongISLND better reproduced some aspects of the error profile, but showed other departures from the real data. Conclusion: PBSIM3 is a good general-purpose choice for many ONT WGS simulation tasks because it reproduced several key read-level properties well. However, Badread or LongISLND may be preferable for applications where error structure is more important. No evaluated tool was realistic across all tested metrics, highlighting a gap for improved long-read simulators.

7

Evaluation of MeaSeq: comprehensive analysis and reporting of measles virus whole genome sequences.

Hole, D. T.; Abdalla, A.; Zubach, V.; Pratt, M.; Van Driel, S.; Ashfaq, S.; Hiebert, J.; Duggan, A. T.

2026-05-14 bioinformatics 10.64898/2026.05.12.724559 medRxiv

Top 0.1%

7.9%

Although vaccine-preventable, measles virus (MeV) continues to pose a significant public health challenge, with a substantial resurgence of cases worldwide. As whole-genome sequencing (WGS) becomes increasingly affordable and routinely adopted in public health laboratories, reliable and accessible analysis of next-generation sequencing (NGS) data is critical for outbreak investigation and molecular surveillance. Here, we present MeaSeq, a fast, user-friendly, open-source bioinformatics pipeline for MeV analysis using Illumina or Oxford Nanopore Technologies (ONT) NGS data. MeaSeq performs quality control assessments, consensus genome assembly and variant detection, optional genotype-specific reference selection, Distinct Sequence Identifier (DSId) assignment via user-provided databases or hashing, sub-consensus variant visualization, genome quality assessment, and standardized HTML reporting. We compared the performance of MeaSeq on NGS data generated from multiple sequencing platforms and targeted enrichment strategies against gold-standard Sanger data, reference genomes, and publicly available comparative data. This validation demonstrates that MeaSeq provides an accurate, reproducible, and accessible solution for routine MeV WGS analysis, supporting genomic surveillance and outbreak response workflows in public health and research settings. Impact Statement: The recent surge in measles cases worldwide, causing several countries to lose their measles elimination status, underscores the urgent need for effective and accessible genomic surveillance. Our manuscript introduces MeaSeq, a comprehensive and open-source bioinformatics pipeline specifically designed for analyzing MeV NGS data.
MeaSeq includes MeV-specific analyses such as genotype prediction from sequencing reads with optional genotype-specific reference selection; DSId assignment; quality control checks such as genome rule-of-six divisibility and gene CDS validation; sub-consensus nucleotide analysis with mixed-site highlighting; and genomic plotting. By leveraging NGS technology, our pipeline can facilitate the identification of transmission chains and may provide critical insights into the dynamics of MeV outbreaks. This information is essential for public health officials and researchers to implement targeted interventions and optimize vaccine strategies. Additionally, the open-source nature of MeaSeq fosters collaboration and innovation within the scientific measles community and makes the pipeline accessible to a wider range of researchers. Data Summary: The MeaSeq pipeline code is available on GitHub (https://github.com/phac-nml/measeq). Comparative datasets of publicly available WGS data were accessed through the NCBI Sequence Read Archive under the following BioProjects: PRJNA869081 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA869081); PRJNA480551 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA480551); PRJNA1017431 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1017431); PRJNA1241325 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1241325); PRJNA1174053 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1174053); PRJNA1293457 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1293457); PRJNA843031 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA843031). Whole-genome sequences were included in the validation analysis if they consisted of paired-end data (Illumina) and achieved ≥95% genome completeness following trimming of the 5' and 3' untranslated regions (UTRs). This criterion ensured sufficient genome coverage for robust validation while allowing for limited missing data arising from regions of low sequencing depth or amplicon dropout.
A complete list of sequences included in the validation, along with their accession numbers, is provided in Supplementary Table 1.
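
The rule-of-six quality check named in the abstract reflects the observation that measles virus genomes are a multiple of six nucleotides long. A minimal sketch of such a check is shown below; how MeaSeq implements it is not described in this listing, and the function name and the decision to ignore whitespace are assumptions made for illustration.

```python
def passes_rule_of_six(genome: str) -> bool:
    """Return True if the genome length is a non-zero multiple of six.

    Hypothetical helper, not MeaSeq code. Whitespace and line breaks are
    ignored; every other character (including N) counts as one position.
    """
    length = len("".join(genome.split()))
    return length > 0 and length % 6 == 0


# A 15,894-nt sequence (6 x 2,649) passes; one base shorter does not.
print(passes_rule_of_six("A" * 15894))
print(passes_rule_of_six("A" * 15893))
```

A failed check on a consensus genome usually points to an insertion or deletion artifact worth inspecting before the sequence is used for surveillance.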

8

cran2crux: automatically create CRUX ports for R-packages

Petrov, P.; Izzi, V.

2026-05-13 bioinformatics 10.64898/2026.05.09.723963 medRxiv

Top 0.1%

7.4%

Motivation: R, together with CRAN and Bioconductor, provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.

9

Evaluating open LLMs for agentic analysis orchestration in a typical biomedical lab

Nekrutenko, A.

2026-05-18 bioinformatics 10.64898/2026.05.13.724985 medRxiv

Top 0.1%

7.2%

Agentic tools -- software environments where a large language model plans, calls external tools, executes code, and iterates with minimal human intervention -- will run a substantial share of routine biomedical data analysis within the next few years. However, per-call inference cost on frontier models is the bottleneck and can add up quickly. Here, we tested whether a free, locally runnable open-weight model could take over the repetitive execution steps at frontier accuracy. We used Claude Opus to author plans of increasing detail for per-sample variant calling, and ran six 2026-release open-weight implementer LLMs against those plans on a set of desktop GPUs. qwen3.6:27b reproduced frontier accuracy on every plan and matched Opus cell-for-cell on a 36-cell error-injection matrix. A sub-$2,000 Jetson or Apple Mac Mini sufficed for the implementer side. The open-weight model landscape evolves on the order of months, so the specific implementer recommended here will be superseded; we provide the plans, harness, scoring code, and per-cell artifacts at https://github.com/nekrut/LLM-eval-paper as a framework for re-evaluating future models.

10

Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2

Sato, Y.

2026-05-12 bioinformatics 10.64898/2026.05.06.723320 medRxiv

Top 0.2%

6.4%

Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.

11

Multi-Scale Tri-Modal Histology Dataset Integrating Tumor Morphology, Immune Patterns, and Clinical Outcomes

Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.

2026-05-19 bioinformatics 10.64898/2026.05.15.725535 medRxiv

Top 0.2%

6.3%

Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.

12

Building computational benchmarks: an Omnibenchmark reimplementation of a single-cell preprocessing pipeline evaluation

Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.

2026-05-05 bioinformatics 10.64898/2026.05.01.722166 medRxiv

Top 0.2%

6.2%

In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge has been accompanied by a large number of analysis frameworks and pipelines, each with a large parameter space and many researcher degrees of freedom. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets, metrics, or parameter sets. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arm's-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.

13

Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights

Karaman, I.; Payne, T.; Vizcaino, J. A.

2026-05-05 bioinformatics 10.64898/2026.05.01.722142 medRxiv

Top 0.2%

6.1%

Public data reuse is a key driver of progress in omics sciences, increasingly including metabolomics. In this study, we present a validated analysis of confirmed reuse of datasets from the MetaboLights data repository, one of the leading resources in the field. Candidate publications were collected via dataset identifiers (MTBLS#) using a Python-based retrieval pipeline across major publisher databases. They were then manually validated to distinguish active reuse from citation-only mentions. Overall, 272 unique publications were confirmed to have reused at least one MetaboLights dataset. Reuse is dominated by Method/Tool Development, with smaller contributions from Secondary Biological Analysis and Data Integration/Meta-analysis. LC-MS datasets account for the majority of reuse, whereas NMR and GC-MS also contribute but at a lower level. Data reuse has increased over time, with a noticeable acceleration in the most recent years. At the dataset level, reuse follows a long-tail distribution, where a small subset of datasets accounts for repeated reuse, mainly as community benchmarks. These results provide a conservative estimate of public metabolomics data reuse and show that public datasets are predominantly used for methodological and computational applications. They also indicate that reuse is under-detected when dataset identifiers are not consistently reported, highlighting the need for standardised dataset citation to improve traceability and recognition of reuse. Statement of significance of the study: The impact of public metabolomics repositories has been difficult to assess due to the lack of reliable evidence distinguishing true data reuse from simple literature citations. This study addresses that gap by providing a conservative, manually validated baseline for confirmed reuse of datasets from the MetaboLights data repository.
The analysis clarifies how MetaboLights datasets are used in practice, showing that reuse is concentrated in a limited number of datasets and is dominated by computational and methodological applications.

14

ToxCastLite: A portable semantic evidence graph linking in vitro bioactivity, in vivo toxicity, and exposure-use context

Dönmez, A.; Nosov, O.; Heck, K.; Mosig, A.; Fritsche, E.; Koch, K.

2026-05-19 bioinformatics 10.64898/2026.05.16.724895 medRxiv

Top 0.2%

5.0%

Motivation: The ToxCast database is a valuable resource for computational toxicology and new approach methodologies (NAMs), but the approximately 100 GB MySQL distribution is difficult to use for portable local analysis and cross-domain evidence mining. Many practical questions concern chemicals, in vitro bioactivity, in vivo toxicological evidence, and exposure-relevant product-use context rather than raw database keys. Results: We present ToxCastLite, a portable semantic evidence-access system that combines assay-scoped SQLite databases with a compact RDF layer for GraphDB-based querying. The system streams large ToxCast/invitrodb MySQL dumps into curated SQLite profiles, reducing the footprint to approximately 3 GB for focused use cases such as developmental neurotoxicity. Dense numerical evidence, including concentration-response rows, remains in SQLite, while the RDF projection exposes linked semantic entities such as chemicals, assays, endpoints, model results, potency parameters (AC50), and MC6 quality flags. We further extend the graph with CPDat v4.0 product-use and functional-use evidence and ToxRefDB v3.0 in vivo toxicity evidence, including processed studies, point-of-departure records, effect summaries, and observation summaries. These layers are linked through DSSTox Substance Identifiers, enabling integrated queries across NAM bioactivity, curated animal-study evidence, and exposure/use context. A Streamlit prototype supports exploration through a locally deployed LLM that translates natural-language questions into SPARQL, grounded by a versioned RDF schema to reduce hallucination risk. Case studies in developmental neurotoxicity demonstrate how ToxCastLite identifies concordance between high-confidence in vitro DNT activity and positive in vivo apical evidence, detects in vitro DNT activity beyond available DNT-specific in vivo evidence, and prioritizes chemicals where NAM signals, ToxRefDB evidence, and CPDat product-use context intersect.
For selected results, users can drill down from the semantic graph to the underlying SQLite records and retrieve concentration-response curves for expert inspection without manually writing SQL or SPARQL. Availability: Project website at toxcast-lite.github.io/. Contact: arif.doenmez@iuf-duesseldorf.de

15

cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

Campbell, J.; Lain, A. D.; Simpson, T. I.

2026-05-19 bioinformatics 10.64898/2026.05.16.725623 medRxiv

Top 0.2%

5.0%

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available. Availability and implementation: cadmus is a freely available package for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and released under the MIT License.
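
The fidelity test above compares paired documents by cosine similarity. The sketch below illustrates that measure using plain word counts from the Python standard library; it is a stand-in, not the ScispaCy embedding comparison the authors ran, and the function name and tokenisation rule are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Illustrative substitute for an embedding-based comparison: tokens are
    lowercase runs of letters and digits, and each text becomes a vector
    of token counts. Returns 0.0 when either text has no tokens.
    """
    counts_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))

    # Dot product over the tokens the two documents share.
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = "Variants in the gene were associated with developmental delay."
retrieved = "Variants in the gene were associated with developmental delay"
print(round(cosine_similarity(reference, retrieved), 3))
```

Scores near 1 indicate that the retrieved text carries essentially the same vocabulary as the reference copy, while parsing failures that drop sections or inject page furniture pull the score down.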

16

Building an open ecosystem for molecular neuroimaging: standards and tools from the OpenNeuroPET initiative

Ganz, M.; Norgaard, M.; Pernet, C.; Matheson, G. J.; Galassi, A.; Ceballos, E. G.; Wighton, P.; Bilgel, M.; Eierud, C.; Gonzalez-Escamilla, G.; Buckholtz, J.; Blair, R.; Markiewicz, C. J.; Hardcastle, N.; Greve, D. N.; Thomas, A. G.; Poldrack, R. A.; Calhoun, V. D.; Innis, R. B.; Knudsen, G. M.

2026-05-09 bioinformatics 10.64898/2026.05.06.722876 medRxiv

Top 0.3%

4.8%

Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.

17

A new method based on genome alignments provides a highly resolutive target enrichment set for weevils (Coleoptera, Curculionoidea)

Zelvelder, B.; Benoit, L.; Loiseau, A.; Haran, J.; Allio, R.

2026-05-13 evolutionary biology 10.64898/2026.05.09.724036 medRxiv

Top 0.3%

4.8%

Target enrichment methods have enabled unprecedented advances in phylogenomics. Targeting hundreds of conserved regions has proven to be a good trade-off between cost and efficiency, and is useful for museomics and for diverse non-model clades. Unfortunately, current methods for identifying such regions require high degrees of conservation within the targeted elements, usually pushing researchers to rely on flanking data with little guarantee of homology. The growing number of high-quality genomes available throughout the Tree of Life creates new opportunities to improve marker selection. In this study, we introduce GABBI, a new method for designing target capture probes that takes advantage of genome alignments and avoids the selection of a single reference genome, which can cause notable biases. We compare GABBI-derived markers to those derived with the most commonly used probe design method, PHYLUCE, at two taxonomic scales: the weevil superfamily Curculionoidea and the tribe Pachyrhynchini. At both scales, our new method identifies more variable loci that provide greater phylogenetic resolution than the PHYLUCE-derived ones. In doing so, we provide the first probe set specifically designed for weevils, targeting 4,255 shared homologous regions, which should encourage future research on the systematics and macroevolution of one of the most diverse and economically important groups of insects. By providing GABBI as an automated and open-access pipeline, we hope to open new probe design opportunities for other taxonomic groups that face similar phylogenetic obstacles.

18

PromptBio-Bench: Benchmarking LLM-based Bioinformatics Agents for End-to-End Data Analysis

Guo, W.; Zhang, M.; Han, B.; Ma, Y.; Leng, Y.; Hebbar, S.; Zhou, X.; Gu, W.; Yang, X.; Dhar, S.

2026-05-08 bioinformatics 10.64898/2026.05.05.723092 medRxiv

Top 0.3%

4.7%

Large language model (LLM)-based agents hold transformative potential for automating bioinformatics workflows; however, systematic evaluations of their capabilities remain limited, hindering a clear assessment of their readiness for real-world application. We introduce PromptBio-Bench, a comprehensive evaluation suite of 194 expert-curated tasks spanning bioinformatics and data science at varied difficulty levels, together with an evaluation framework for structured file comparison and scoring against expert reference answers. Benchmarking three state-of-the-art agents revealed that Biomni and ToolsGenie achieved comparable performance and that accuracy declined markedly at higher difficulty levels across all agents. As foundation models and agent frameworks continue to evolve, PromptBio-Bench provides a valuable benchmark infrastructure for the community to systematically track the progress of agentic bioinformatics.

19

New chromosome-level haplotyped genome assemblies and annotation for the Japanese Quail (Coturnix japonica)

Cabau, C.; Degalez, F.; Leroux, S.; Gourichon, D.; Serre, R.-F.; Vernette, C.; Donnadieu, C.; Iampietro, C.; Vandecasteele, C.; Pitel, F.; Klopp, C.

2026-05-14 genomics 10.64898/2026.05.12.724545 medRxiv

Top 0.3%

4.7%

The Japanese quail (Coturnix japonica) is a widely used model organism in developmental biology, genetics, and agriculture. Here, we present new, haplotyped, high-quality genome assemblies of the Japanese quail, generated using a combination of state-of-the-art sequencing technologies, including PacBio HiFi long reads, Oxford Nanopore sequencing, and Hi-C scaffolding. This assembly has a total length of 1.19 Gb, 80% of which is placed on chromosomes, and is highly complete (BUSCO score with aves_odb10: 97.3%). Assembly metrics show a marked improvement in contiguity, with a significantly higher scaffold N50 and fewer contigs than the reference genome assembly. Remarkably, the assembly extends previously truncated chromosome ends, with 31 telomeres detected. In addition, we merged the existing Ensembl and RefSeq annotations and obtained a combined set of 26,102 genes, of which 25,038 were successfully mapped to haplotype 1 of the improved assembly (Cjap1.hap1). Together, these new genome assemblies and their enriched annotation provide a robust genomic framework for future research. They enhance our ability to investigate developmental processes, genetic and epigenetic inheritance, and host-pathogen interactions. Furthermore, they offer valuable insights for conservation genetics and sustainable breeding programs. This resource represents a critical step forward in leveraging the full potential of the Japanese quail as a model species in both basic and applied research.

20

Chromosome-level genome assembly and annotation of the threatened marbled teal (Marmaronetta angustirostris)

Ortego, J.; Lopez-Luque, R.; Backstrom, N.; Green, A. J.

2026-05-14 genomics 10.64898/2026.05.12.723956 medRxiv

Top 0.3%

4.7%

The marbled teal (Marmaronetta angustirostris) is a widely distributed but declining waterfowl species, classified as Near Threatened globally and Critically Endangered in Spain. Despite ongoing conservation actions, including ex situ management and population reinforcement programmes, the genomic consequences of long-term captivity and inbreeding, as well as patterns of functional genetic variation, remain unknown due to the absence of a species-specific reference genome. Here, we present the first chromosome-level genome assembly for this species. The genome was generated using PacBio HiFi long reads and Omni-C data, yielding a 1.15 Gb assembly with a scaffold N50 of 76.95 Mb. A total of 97.16% of the assembly was anchored into 36 chromosome-scale scaffolds, including the Z and W sex chromosomes. BUSCO analysis recovered 99.2% of conserved avian genes. Gene prediction was performed using both ab initio and homology-based strategies, resulting in 16,048 protein-coding genes. This resource provides a foundation for genome-wide analyses of inbreeding, demographic history, and adaptive variation, and will support evidence-based in situ and ex situ conservation strategies for this threatened species.