GigaScience
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match GigaScience's content profile, based on 172 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit.
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data from the original article were used to benchmark the new pipeline. Its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands of large-scale sequence analysis.
Xu, M.; Chen, J.; Zhang, Z.
Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform at https://claw4science.org, which is built on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.
Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.
Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using the standard package manager versus REBEL: REBEL resolved 149 of 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL further generates fully reproducible Docker images from a plain-text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core
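The third heuristic, conservative dependency locking, can be sketched as a graph walk that pins every reachable transitive dependency to a known-good version. The function, package names, and dependency graph below are illustrative assumptions, not REBEL's actual implementation or data model:

```python
def lock_environment(direct: list, dep_graph: dict, installed: dict) -> dict:
    """Walk the dependency graph from the direct requirements and pin
    every reachable package to its currently installed version."""
    pinned, stack = {}, list(direct)
    while stack:
        pkg = stack.pop()
        if pkg in pinned:
            continue
        pinned[pkg] = installed[pkg]          # exact version, not a range
        stack.extend(dep_graph.get(pkg, []))  # follow transitive dependencies
    return pinned

# Toy R-package example: pinning only ggplot2 would leave rlang and scales
# floating; the lock captures all three.
graph = {"ggplot2": ["rlang", "scales"], "scales": ["rlang"]}
versions = {"ggplot2": "3.5.1", "rlang": "1.1.4", "scales": "1.3.0"}
lock = lock_environment(["ggplot2"], graph, versions)
```

The key property is that the lock records exact versions for the transitive closure, which is what makes a later offline rebuild deterministic.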
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
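The two repeatability metrics reported above can be computed as follows; the per-category accession sets here are toy stand-ins for the actual UniProt results, and the function names are mine:

```python
def mean_category_jaccard(run1: dict, run2: dict) -> float:
    """Average of per-category Jaccard indices |A∩B| / |A∪B|."""
    scores = []
    for cat in run1:
        a, b = run1[cat], run2[cat]
        scores.append(len(a & b) / len(a | b) if a | b else 1.0)
    return sum(scores) / len(scores)

def micro_jaccard(run1: dict, run2: dict) -> float:
    """Pooled Jaccard: sum intersections and unions over all categories first."""
    inter = sum(len(run1[c] & run2[c]) for c in run1)
    union = sum(len(run1[c] | run2[c]) for c in run1)
    return inter / union

# Two hypothetical runs of the same agent over two categories
r1 = {"transport": {"P1", "P2", "P3"}, "signaling": {"P4"}}
r2 = {"transport": {"P2", "P3"}, "signaling": {"P4", "P5"}}
```

Micro-Jaccard weights categories by size, which is why it can fall below the category mean when large categories are the unstable ones.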
Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.
Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it
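geneslator itself is an R package; as a language-neutral sketch of the identifier-conversion idea, here is a minimal Python version with a two-row toy mapping table (the table layout and `convert` function are my assumptions, not geneslator's API — the two human gene identifiers themselves are real):

```python
# (symbol, Ensembl GeneID, Entrez GeneID) for two well-known human genes
ID_TABLE = [
    ("TP53", "ENSG00000141510", "7157"),
    ("BRCA1", "ENSG00000012048", "672"),
]

def convert(ids, src, dst):
    """Map identifiers from one namespace to another via the lookup table."""
    cols = {"symbol": 0, "ensembl": 1, "entrez": 2}
    lookup = {row[cols[src]]: row[cols[dst]] for row in ID_TABLE}
    # Unmapped identifiers are reported as None rather than silently dropped,
    # preserving data integrity as the abstract emphasizes.
    return [lookup.get(i) for i in ids]
```

Returning an explicit `None` for unmappable inputs (rather than dropping rows) is the behavior that keeps downstream joins from silently shrinking.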
Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.
AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.
Sriwichai, N.; Feriau, L.; Tongyoo, P.; Noda, Y.; Gyoji, H.; Noisagul, P.; Goto, S.; Steinberg, D.; Wangsanuwat, C.
This dataset arises from a multilingual survey of AI use among participants and community members in the DBCLS BioHackathon 2025 in Japan. The questionnaire, offered in English, Japanese, and Thai, asked how often respondents use AI tools, what they use them for, obstacles they encounter, institutional support, satisfaction, and concerns. Additional items captured role, institution type, work country, and other demographics, totaling 105 responses. The dataset includes both raw anonymized responses and a cleaned, standardized English-only version suitable for quantitative analysis, along with the full questionnaire, a data dictionary for the cleaned dataset, and a translation lookup table. Free-text answers were screened and redacted to remove URLs, names, and other potentially identifiable information. Together, these materials provide a community-level view of AI practice in genomics, bioinformatics, software development, and related areas, and can support work on AI adoption, policy, and methods for analyzing survey data on AI use in science.
Muneeb, M.; Ascher, D.
Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.
Graphical Abstract (figure omitted)
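The weak coupling reported above is quantified with Pearson's r between heritability estimates and test AUC; a from-scratch sketch with made-up numbers (not the study's data):

```python
def pearson(x, y):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Perfectly linear toy data gives r = 1.0; an r near 0, as in the study,
# means the heritability input barely predicts downstream AUC.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```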
Sanjaya, P.; Pitkänen, E.
Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, a transformer-based software for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on the previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole-genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in the Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net
Albuja, D. S.; Maldonado, P. S.; Zambrano, P. E.; Olmos, J. R.; Vera, E. R.
Accurate fungal species identification is critical for microbial ecology, food safety, and plant pathology. However, morphological limitations and genomic complexity hinder this process. Molecular markers such as the ITS region, along with Oxford Nanopore long-read sequencing, offer a robust solution, albeit limited by error rates in homopolymeric regions and a high dependence on advanced computational resources (GPUs) to achieve high accuracy. This study benchmarks two bioinformatics workflows on a multiplexed dataset of complex fungal communities to address this technological gap: a CPU-based workflow optimized using a Bayesian machine learning engine and a GPU-accelerated workflow incorporating "super high accuracy" (SUP) models and refinement with neural networks. The results establish a scalable framework for evaluating the impact of computational architecture on final taxonomic resolution. GPU processing is shown to maximize data retention and species-level accuracy by correcting systematic errors. Alternatively, implementing automated hyperparameter optimization in CPU environments stabilizes sequence clustering and achieves high taxonomic concordance at the genus level. This conceptual advance validates the feasibility of performing ITS metabarcoding analysis in resource-constrained infrastructures, thus providing the scientific community with a reproducible protocol that balances the need for taxonomic precision with hardware availability.
Kawato, S.
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, while graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
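The GC content and GC skew tracks mentioned above follow standard definitions — GC content = (G+C)/window, GC skew = (G−C)/(G+C) — which a minimal windowed sketch makes concrete (the function and window size are illustrative, not gbdraw's code):

```python
def gc_metrics(seq: str, window: int = 4):
    """Return (gc_content, gc_skew) per non-overlapping window.
    Windows with no G or C get a skew of 0.0 by convention here."""
    out = []
    for i in range(0, len(seq) - window + 1, window):
        w = seq[i:i + window].upper()
        g, c = w.count("G"), w.count("C")
        content = (g + c) / window
        skew = (g - c) / (g + c) if g + c else 0.0
        out.append((content, skew))
    return out
```

Plotted along a circular genome, the sign change of cumulative GC skew is what makes replication origin/terminus regions visually apparent.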
Emissah, H. A.; Tecuatl, C.; Ascoli, G. A.
Background: The rapid expansion of large-scale neuroscience datasets has increased the need for automated, accurate, and standardized quality control (QC). Manual proofreading of 3-dimensional neural morphology (SWC files) remains labor-intensive, error-prone, and non-scalable. We developed and evaluated a fully automated, machine-learning driven QC pipeline to standardize neural reconstructions, detect and correct structural anomalies, and rectify dendritic labeling in pyramidal neurons. Methods: We developed an end-to-end, cloud-deployed pipeline for automated QC, correction, and standardization of SWC-formatted neural morphologies. The framework integrates deterministic structural normalization, topology repair, geometric correction, quantitative morphometric analysis, and graph-based dendritic relabeling within a containerized React/Flask architecture deployed on Amazon Web Services. Rule-based algorithms systematically detect, classify, and correct structural irregularities including overlapping nodes, spurious side branches, non-positive radii, disconnected components, and anomalously long parent-child connections. A graph convolutional network, trained on Sholl-derived features from 20,500 pyramidal neurons, performs dendritic relabeling. Model training employed an 80/10/10 train-validation-test split with adaptive learning-rate scheduling and distributed execution across ten runs to evaluate stability and reproducibility. The pipeline generates images of the final product and computes quantitative morphometrics using L-Measure. Results: All neuronal reconstructions were processed without manual intervention. Automated normalization and topology repair restored structurally coherent and biologically accurate morphologies suitable for quantitative analysis and visualization without data loss. 
Dendritic relabeling achieved a mean accuracy of 99.51%, consistent between validation and test sets, with class-weighted precision of 0.978, recall of 0.977, and F1-score of 0.977. Enforcing a single apical dendritic tree per neuron improved anatomical consistency without reducing classification performance. Distributed training completed all runs in approximately 25 hours, demonstrating scalability and reproducibility for large datasets. Conclusions: We present a fully automated and cloud-scalable open-source pipeline for standardizing neural reconstructions and performing biologically consistent dendritic classification with near-perfect accuracy. The automated correction and relabeling procedures do not alter or compromise the size or unaffected morphological detail of the original SWC files, ensuring geometric fidelity and compatibility with downstream analysis tools. This open-access framework provides a robust foundation for high-throughput neural morphology curation and large-scale neuroanatomical analysis.
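For context, the SWC format encodes one node per line (`id type x y z radius parent`, with parent −1 marking a root), so two of the irregularities listed in the Methods — non-positive radii and disconnected components — can be detected in a few lines. This is an illustrative check under that format assumption, not the pipeline's actual rule set:

```python
def swc_anomalies(lines):
    """Return (node ids with non-positive radius, has_disconnected_components)."""
    nodes = {}
    for ln in lines:
        i, t, x, y, z, r, parent = ln.split()
        nodes[int(i)] = (float(r), int(parent))
    bad_radius = [i for i, (r, _) in nodes.items() if r <= 0]
    # More than one root node (parent == -1) implies the reconstruction
    # is split into disconnected components.
    roots = [i for i, (_, p) in nodes.items() if p == -1]
    return bad_radius, len(roots) > 1

# Toy reconstruction: node 2 has a zero radius, node 3 is a second root
swc = ["1 1 0 0 0 1.0 -1", "2 3 1 0 0 0.0 1", "3 3 2 0 0 0.5 -1"]
```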
Moshiri, N.
Motivation: Viral surveillance from mixed samples (e.g. wastewater) has become critical in public health efforts to track and contain pathogens. However, existing open-source bioinformatics tools for viral consensus sequence generation are optimized for individual viruses (rather than multiple potential viruses of interest). Results: MultiVirusConsensus is an accurate and efficient open-source pipeline for identification and consensus sequence generation of multiple viruses from mixed samples. It utilizes the memory-efficient ViralConsensus tool via bash process substitution to simultaneously perform consensus sequence calling on all viruses of interest (1) completely in parallel, and (2) by piping datastreams between tools without writing/reading intermediate files (thus eliminating slowdowns related to slow disk accesses). Availability: MultiVirusConsensus is freely available as an open-source software project at: https://github.com/niemasd/MultiVirusConsensus
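The zero-intermediate-file streaming described above can be imitated with OS pipes; here is a Python sketch of the pattern using placeholder commands (`seq`, `wc`), not the actual ViralConsensus invocation:

```python
import subprocess

# The upstream tool writes to an in-memory pipe; the downstream tool reads
# it directly, so no intermediate file ever touches the disk.
producer = subprocess.Popen(["seq", "1", "3"], stdout=subprocess.PIPE)
consumer = subprocess.Popen(["wc", "-l"], stdin=producer.stdout,
                            stdout=subprocess.PIPE, text=True)
producer.stdout.close()  # let the producer receive SIGPIPE if wc exits early
out, _ = consumer.communicate()
line_count = int(out.strip())
```

Launching one such chain per virus of interest gives the "completely in parallel" behavior the abstract describes, since each chain runs as independent OS processes.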
Muneeb, M.; Ascher, D.
Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
Halpern, M.
Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.
Highlights
What is already known:
- Data extraction is the primary bottleneck in meta-analysis, with single-extractor error rates of 17.7%
- Existing LLM-based extraction systems achieve only 26-36% accuracy on continuous outcomes
- No study has validated AI extraction against multiple independent datasets using formal equivalence testing
What is new:
- A single AI agent achieves statistical equivalence with human-extracted data across five agricultural meta-analyses (1,149 observations, 136 papers)
- LLM-driven alignment resolves the previously underappreciated bottleneck of moderator matching, improving correlations from 0.377-0.812 to 0.984-0.997 without changing extracted values
- Table-sourced observations achieve 5.5x lower error than figure-sourced data
Potential impact for RSM readers:
- Provides a validated, reproducible workflow for AI-assisted data extraction in meta-analysis
- Demonstrates that most apparent "extraction error" in validation studies is actually alignment error
- Offers practical quality signals (source-type labeling) for downstream meta-analysts
Rikk, L.; Ghaffarinia, A.; Leigh, N. D.
Accurate genome annotation remains challenging as assembly quality often exceeds annotation reliability. Resolving ambiguities of gene presence, absence, and orthology typically requires integrating two complementary lines of evidence: sequence homology between species and the conservation of gene order (i.e., synteny). BLAST remains the standard for homology detection, yet its raw output can be difficult to interpret. Existing tools address this challenge but operate at opposing scales. Alignment viewers provide detailed pairwise statistics without genomic context, while synteny tools offer chromosome-scale perspectives without sequence-level resolution. To fill this intermediate gap, we developed Novabrowse, an interactive BLAST results interpretation framework featuring high-resolution multi-species synteny analysis, chromosomal re-arrangement investigation, ortholog detection, and gene signal discovery. Users define a genomic region of interest in a query species and/or use custom sequences, then select one or more subject species for comparison. The pipeline retrieves query gene sequences via NCBI API integration and performs BLAST searches against each subject transcriptome or genome. Results are presented via an interactive HTML file featuring alignment statistics, chromosomal maps, coverage visualizations, ribbon plots, and distance-based clustering of high-scoring segment pairs into putative gene units. We demonstrate these capabilities by investigating Foxp3, Aire, and Rbl1, three highly conserved vertebrate genes, in the recently assembled genome of the newt Pleurodeles waltl. Foxp3 and Aire have not been described in any salamander species to date, despite availability of multiple assemblies and extensive transcriptomic datasets. Using Novabrowse, we discovered conserved loci and gene signals for both genes in P. waltl, the presence of which was subsequently confirmed via Nanopore long-read RNA sequencing. 
In contrast, Rbl1 analysis uncovered a chromosomal rearrangement at its expected locus with no gene signal detected, indicating a gene loss specific to P. waltl despite the gene's retention in the closely related axolotl (Ambystoma mexicanum). Our findings demonstrate Novabrowse's capacity for evidence-based evaluation of annotation artifacts, an essential capability as high-quality assemblies become more available for phylogenetically diverse species. Novabrowse is open source (MIT license) and freely available at: https://github.com/RegenImm-Lab/Novabrowse.
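The final pipeline step mentioned above — distance-based clustering of high-scoring segment pairs into putative gene units — amounts to single-linkage grouping along the subject sequence; a sketch with illustrative coordinates and gap threshold (not Novabrowse's actual parameters):

```python
def cluster_hsps(starts, max_gap=10_000):
    """Group HSP start positions: neighbours within max_gap bp join the same
    putative gene unit; a larger gap starts a new unit."""
    positions = sorted(starts)
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_gap:
            current.append(pos)   # same putative gene unit
        else:
            clusters.append(current)
            current = [pos]       # gap too large: start a new unit
    clusters.append(current)
    return clusters

# Four HSPs collapse into two putative gene units ~48 kb apart
units = cluster_hsps([51_000, 100, 2_000, 50_000])
```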
Kuenkel, J. M.; Bornemann, T. L. V.; Xiu, W.; Starke, J.; Stach, T. L.; Rodrigues Soares, A.; Schloetterer, J.; Seifert, C.; Probst, A. J.
Most prokaryotic genomes in public databases are genomes reconstructed from metagenomes, forming a compendium of multiple contiguous sequences (contigs) assembled from shotgun sequencing data. Binning algorithms for assigning contigs to metagenome-assembled genomes (MAGs) are manifold and continuously improving in accuracy. However, binning errors, i.e., the incorrect assignment of a contig and coding sequence to a MAG, often propagate through various databases and confound taxonomic, metabolic, and/or evolutionary analyses. Here we present itBins, a fully automated Python-based software that enables ultra-fast refinement of metagenomic bins using a rule-based approach harnessing information from %GC content (%GC for brevity), coverage, and taxonomy of individual contigs. When applied to the low-, medium-, and high-complexity data of the Critical Assessment of Metagenome Interpretation (CAMI I) challenge [1], itBins produced higher F1 scores (the harmonic mean of precision and recall) for all levels compared to other automated refinement tools, i.e., MDMcleaner and Rosella. Compared to manual refinement via uBin, itBins performed similarly well across all three complexity levels of the CAMI I dataset. With an average speed of 61 ms per bin, itBins is faster than all other refinement tools by at least three orders of magnitude when the required input data (%GC, coverage, and taxonomy) is already available, and was similarly fast when input data preparation was included in the processing time. Application to 64 real-world metagenomes from highly complex river mesocosms resulted in 259 medium-quality and 19 high-quality MAGs refined by itBins, while the other automated refinement tools failed to generate output at all or within 5,000 hours of runtime. Finally, itBins also utilizes marker genes to determine the overall binning success for individual metagenomes, providing a crucial benchmark for the user to estimate the ecological relevance of their binned data.
The software introduced here, itBins, is broadly applicable to any type of metagenome data, integrates well with other software such as DASTool, and enables swift and reliable refinement of genomes from metagenomes along with estimation of the overall binning success. itBins is distributed under the EUPL 1.2 license and available at Codeberg (codeberg.org/JMK/itBins), GitHub (github.com/ProbstLab/itBins), and through Bioconda [2] (bioconda.github.io/recipes/itbins/README.html).
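The rule-based idea — drop contigs whose %GC or coverage deviates from the bin's consensus — can be illustrated in a few lines. The thresholds and function below are toy assumptions for exposition, not itBins's actual rules:

```python
from statistics import median

def flag_outliers(contigs, gc_tol=5.0, cov_fold=2.0):
    """contigs: {name: (gc_percent, coverage)}.
    Flag contigs whose %GC deviates from the bin median by more than gc_tol
    points, or whose coverage differs by more than cov_fold-fold."""
    gc_med = median(gc for gc, _ in contigs.values())
    cov_med = median(cov for _, cov in contigs.values())
    flagged = []
    for name, (gc, cov) in contigs.items():
        if (abs(gc - gc_med) > gc_tol
                or cov > cov_fold * cov_med
                or cov < cov_med / cov_fold):
            flagged.append(name)
    return flagged

# c3 deviates in %GC, c4 in coverage; c1 and c2 stay in the bin
bin1 = {"c1": (45.0, 20.0), "c2": (46.5, 22.0),
        "c3": (62.0, 21.0), "c4": (45.5, 3.0)}
```

Because each rule only touches per-contig summary statistics, the check is constant-time per contig, which is consistent with the millisecond-per-bin speeds reported above.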
Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.
Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10⁻⁶), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic. Database URL: https://togomcp.rdfportal.org/
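The two-stage separation described above, resolve the entity first, then fill a schema-guided query template, can be sketched as follows. Everything here is hypothetical: the lookup table stands in for TogoMCP's external REST APIs, and the `up:annotation` predicate is purely illustrative, not a real MIE-supplied schema element:

```python
from string import Template

# Stage 1 (hypothetical): resolve a natural-language entity name to a
# database-specific identifier. A dict stands in for the REST API call.
MOCK_RESOLVER = {"insulin": "uniprot:P01308"}

def resolve_entity(name):
    """Map an entity name to a curated identifier (toy stand-in for stage 1)."""
    return MOCK_RESOLVER[name.lower()]

# Stage 2 (hypothetical): fill a schema-guided SPARQL template with the
# resolved identifier. In TogoMCP the predicates would come from the MIE
# file for the target database; the one below is made up for illustration.
QUERY = Template("""SELECT ?annotation WHERE {
  $entity up:annotation ?annotation .
}""")

def build_query(entity_name):
    """Compose both stages: resolution, then template-based generation."""
    return QUERY.substitute(entity=resolve_entity(entity_name))

print(build_query("Insulin"))
```

Keeping the stages separate means a fabricated identifier can never enter the query: stage 2 only ever sees identifiers that stage 1 actually returned.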
Corval, H.; Ducrest, A.-L.; Bachmann Salvy, M.; Burns, A.; Topaloudis, A.; Simon, C.; Cora, E.; Cavaleri, D.; Almasi, B.; Roulin, A.; Iseli, C.; Guex, N.; Cumer, T.; Goudet, J.
Recent advances in long-read sequencing have enabled near telomere-to-telomere (T2T) assemblies across diverse taxa. However, avian genomes remain challenging due to their numerous microchromosomes: small elements (typically < 20 Mb) that are gene-, GC-, and repeat-rich. As a consequence, microchromosomes are often missing from genome assemblies. Here, we present a chromosome-level, haplotype-resolved genome assembly for the Western barn owl (Tyto alba). Using a trio-binning strategy with Illumina parental reads combined with PacBio HiFi and Oxford Nanopore Technologies data, we generated two phased contig sets. These were scaffolded into 40 linkage groups using a linkage map. Comparative analyses identified unplaced HiFi scaffolds corresponding to microchromosomes, which we integrated into six additional microchromosomes using long-read information. Both assemblies comprise 46 chromosomes, matching the karyotype of the species. They exhibit strong synteny between parental haplotypes, except for a ~38 Mb complex region on chromosome 7 containing nested inversions. This high-quality reference provides the first haplotype-resolved, chromosome-level genome for Strigiformes, enabling fine-scale studies of structural variation and avian genome evolution.
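The trio-binning strategy mentioned above classifies offspring reads by which parent's k-mer set they share more k-mers with. A minimal sketch of that idea, with toy sequences and a deliberately tiny k (real pipelines use k around 21 and parent-specific k-mers only; the function names are illustrative):

```python
def kmers(seq, k=4):
    """All k-mers of a sequence as a set (toy helper)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def trio_bin(read, paternal_kmers, maternal_kmers, k=4):
    """Assign a read to whichever parental haplotype shares more of its k-mers."""
    rk = kmers(read, k)
    p = len(rk & paternal_kmers)  # k-mers shared with the father
    m = len(rk & maternal_kmers)  # k-mers shared with the mother
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "unassigned"  # ties (e.g. homozygous regions) stay unbinned

# Toy parental k-mer sets built from short stand-in sequences
paternal = kmers("ACGTACGTACGT")
maternal = kmers("TTTTGGGGCCCC")
print(trio_bin("ACGTACG", paternal, maternal))  # -> paternal
```

Binning long reads this way before assembly is what yields the two phased contig sets that the linkage map then scaffolds independently.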
Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.
Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting the key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, which generates from a pangenome graph a best-fit synthetic reference (synref) genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders such as ntJoin to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC. GRAPHICAL ABSTRACT: Figure 1.
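The core of reference-guided scaffolding, ordering contigs by where they align on a reference and joining them with gaps, can be sketched in a few lines. This is a toy stand-in under stated assumptions (a fixed 10 bp N-gap, alignments given as dicts); real tools such as noHiC or ntJoin also handle orientation, overlaps, and chimeric contigs:

```python
def order_contigs(alignments, gap="N" * 10):
    """Order contigs by their alignment start on a reference chromosome
    and join them into one scaffold with a fixed N-gap between neighbors."""
    placed = sorted(alignments, key=lambda a: a["ref_start"])
    return gap.join(a["seq"] for a in placed)

# Toy alignments: each contig's start coordinate on the (synthetic) reference
aln = [
    {"name": "ctgB", "ref_start": 500, "seq": "GGGG"},
    {"name": "ctgA", "ref_start": 10,  "seq": "AAAA"},
    {"name": "ctgC", "ref_start": 900, "seq": "TTTT"},
]
scaffold = order_contigs(aln)
print(scaffold)  # AAAA, GGGG, TTTT separated by 10 Ns each
```

In this framing, a synref's value is simply that it gives better `ref_start` coordinates for the target's contigs than any single conventional reference would, because it is composed from the graph haplotypes genetically closest to the target.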