GigaScience
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match GigaScience's content profile, based on 172 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit.
Patel, H.; Crosslin, D.; Jarvik, G. P.; Hall, T.; Veenstra, D.; Xie, S.
The lack of user-centered design principles in the current landscape of commonly used bioinformatics software tools poses challenges for novice genomics researchers (NGRs) entering the genomics ecosystem. Comparing the usability of one analysis software tool to that of another is a non-trivial task and requires evaluation criteria that incorporate perspectives from both existing literature and a diverse, underrepresented user base of NGRs. To better characterize these barriers, we utilized a two-pronged approach consisting of a literature review of existing bioinformatics tools and semi-structured interviews on the needs of NGRs. From both knowledge sources, the key attributes that resulted in poor adoption and sustained use of most bioinformatics tools included poor documentation, lack of readily accessible informational content, challenges with installation and dependency coordination, and inconsistent error messages and progress indicators. Combining the findings from the literature review and the insights gained by interviewing the NGRs, an evaluation rubric was created that can be utilized to grade existing and future bioinformatics tools. This rubric acts as a summary of key components needed for software tools to cater to the diverse needs of both NGRs and experienced users. Due to the rapidly evolving nature of genomics research, it becomes increasingly important to critically evaluate existing tools and develop new ones that will help build a strong foundation for future exploration.
Grillo-Risco, R.; Kupchyk Tiurin, M.; Perpina-Clerigues, C.; Cordero Felipe, F. J.; Lozano, S.; de la Iglesia, M.; Garcia-Garcia, F.
The growing number of omics datasets in public repositories provides an opportunity to enhance data reusability through data integration; however, complex statistical barriers often hinder the effective combination of independent studies. To address this problem, we present MetaOmixTools, an interactive web-based suite that streamlines the meta-analysis of ranked feature lists and functional enrichment profiles. The platform integrates two primary modules, MetaRank and MetaEnrich, within a code-free environment. MetaRank generates robust consensus rankings from multiple lists by implementing weighted (e.g., Rank Product) and unweighted (e.g., Robust Rank Aggregation) strategies, while MetaEnrich performs functional meta-analyses by combining probability values from individual over-representation analyses using established statistical techniques. Using case studies, we established consensus rankings for acute spinal cord injury across heterogeneous platforms, identifying conserved inflammatory marker genes in the upregulated gene list (e.g., Slpi, Ccl2, Msr1) and synaptic loss genes in the downregulated gene list (e.g., Kcna2, Dao, Ppp1r1b), and also characterized inverse functional intersections between melanoma brain metastasis and neurodegenerative diseases. By providing intuitive, real-time visualization and reproducible workflows, MetaOmixTools empowers the research community to extract consistent biological insights from multi-study data. We have made MetaOmixTools freely available at https://bioinfo.cipf.es/metaomixtools/. A graphical abstract accompanies the preprint.
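MetaEnrich's combination of probability values can be illustrated with Fisher's method, one of the established statistical techniques for p-value meta-analysis; the sketch below is a generic stdlib implementation and makes no claim about MetaOmixTools' actual internals.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null hypothesis.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Survival function of chi-squared with even df = 2k has a closed
    # form (Erlang distribution): exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = stat / 2.0
    combined_p = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return stat, combined_p
```

For a single p-value the function returns that p-value unchanged, a useful sanity check; for two p-values of 0.05 it yields a combined p of about 0.017.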
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline; its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands of large-scale sequencing data analysis.
Roldan, A.; Duran, T. G.; Far, A. J.; Capa, M.; Arboleda, E.; Cancellario, T.
The era of Big Data has reshaped biodiversity research, yet the potential of this information is frequently constrained by data heterogeneity, incompatible schemas, and the fragmentation of resources. Whilst standards such as Darwin Core have improved interoperability, significant barriers persist in harmonising multi-typology datasets ranging from taxonomy and genetics to species distribution. Here, we present the Biodiversity Observatory System (BiOS), a comprehensive, open-source software stack designed to address these impediments through a modular, community-driven architecture. BiOS departs from monolithic database designs by decoupling the back-end data management from the front-end presentation layer. This architectural separation supports a dual-access model tailored to diverse stakeholder needs. For researchers and developers, the system offers a comprehensive Application Programming Interface (API) that exposes all back-end functionalities, enabling seamless programmatic access, automated data retrieval, and integration with external analytical workflows. Simultaneously, the platform features a user web interface designed to lower the technical barrier to entry. This interface facilitates intuitive data exploration through agile taxonomic navigation, advanced geospatial map viewers for species occurrence filtering, and dedicated dashboards for visualising genetic markers and legislative status. Strictly adhering to the FAIR principles (Findable, Accessible, Interoperable, Reusable), BiOS acts as a relational engine capable of integrating heterogeneous data streams. By providing a flexible, interoperable core that supports the "seven shortfalls" framework of biodiversity knowledge, BiOS offers a turnkey solution to overcome data fragmentation and enhance collaborative conservation efforts.
Pike, B.; Goncalves da Silva, A.; Teran, W.
We present fully-phased, chromosome-scale genome assemblies of 4 genotypes of Cannabis sativa. These assemblies were built from Oxford Nanopore R9.4.1 long reads, which previously have been considered insufficiently accurate for proper phasing. Contigs produced by the Phased Error Correction and Assembly Tool (PECAT), in combination with Hi-C libraries, were used by GreenHill to develop intermediate data structures that permit accurate phasing of the dual contigs, which were then scaffolded by the advanced algorithm of Yet another Hi-C Scaffolder (YaHS). These assemblies, while low in QV, are comparable to recent HiFi assemblies in their contiguity and gene content, and also show good macrosynteny with them. We compare these 8 haplotypes with 77 others recently produced and present a phylogenetic analysis, as well as a first draft of the Cannabis pan-NLRome. Core: We assembled four fully-phased and chromosome-scale diploid genomes of Cannabis sativa, using Oxford Nanopore Technology readsets. These new assemblies are comparable to recent PacBio HiFi assemblies in terms of contiguity and gene content. We present a phylogenomic analysis, using whole-genome alignments after including 77 other publicly available Cannabis genomes, as well as a draft pan-NLRome. Gene and Accession Numbers: Assemblies are archived at NCBI as BioProjects PRJNA1301983 (ANC), PRJNA1301963 (HAW), PRJNA1301984 (SRI), and PRJNA1301985 (TRC). Assemblies, annotations, and Supplemental Tables are also available on Zenodo: https://doi.org/10.5281/zenodo.16456638.
Anderson, J. K.; Zhang, J.; Ge, X.; Fan, H.; Leng, Y.; Silverstein, M.; Conrad, R.; Li, Z.; Holmes, E.; Joseph, S. S.; Lu, S.; Shinohara, R.; Li, T.; Johnson, W. E.; Alzheimers Disease Neuroimaging Initiative,
Batch effect correction is a common and often necessary step in data analysis to reduce bias due to technical and experimental factors when combining multiple batches of data. The severity of the batch effects dictates the correction strategy; therefore, a careful assessment of each dataset's batch effects is necessary. BatchQC is an R package that provides reproducible tools and visualizations for quantitatively and qualitatively addressing batch effects across a broad range of data types. BatchQC integrates with standardized Bioconductor data structures and features an object-oriented design, enabling the application of workflows that can freely evaluate and process data within and outside the package tools. Common batch evaluation methods, along with novel quantitative metrics, help determine the benefits of batch correction for each dataset and enable direct comparisons between methods. Here, we present BatchQC as the first comprehensive batch-correction R package, with independent tools, reproducible workflows, visualization, and novel statistics.
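As a minimal illustration of the kind of technical variation BatchQC helps assess, the sketch below applies a naive location-only batch adjustment (per-batch mean centering). This is a toy example, not BatchQC's API; real correction methods such as ComBat also model scale and apply empirical Bayes shrinkage.

```python
from statistics import mean

def center_batches(values, batches):
    """Naive batch correction: subtract each batch's mean, add the grand mean.

    values: list of measurements; batches: list of batch labels of the
    same length. Returns the location-adjusted measurements.
    """
    grand = mean(values)
    per_batch = {}
    for v, b in zip(values, batches):
        per_batch.setdefault(b, []).append(v)
    batch_means = {b: mean(vs) for b, vs in per_batch.items()}
    return [v - batch_means[b] + grand for v, b in zip(values, batches)]
```

With two batches shifted by a constant offset, the adjusted values align while within-batch differences are preserved.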
Xu, M.; Chen, J.; Zhang, Z.
Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform at https://claw4science.org, which is built on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.
Li, X.; Zhou, C.; Wu, H.; Xiao, K.; Hao, J.; Zhao, D.; Zhu, J.; Li, Y.; Peng, J.; Gu, J.; Deng, G.; Cai, W.; Li, M.; Liu, Y.; Shang, X.; Chen, H.; Kong, H.
Influenza A viruses continuously undergo antigenic evolution to escape host immunity induced by previous infections or vaccinations, consequently causing seasonal epidemics and occasional pandemics. Antigenic prediction and visualization of influenza A viruses are crucial for precise vaccine strain selection and robust pandemic preparedness. However, a user-friendly online platform for these capabilities remains notably absent, despite widespread demand. Here, we present FluNexus (https://flunexus.com), a first-of-its-kind, one-stop-shop web platform designed to facilitate the prediction and visualization of antigenic change in emerging variants. FluNexus features a data preprocessing module for hemagglutinin subunit 1 (HA1) and hemagglutination inhibition (HI) data across three major public health threat subtypes (H1, H3 and H5). Meanwhile, FluNexus provides an interactive interface for online antigenic prediction and offers practical guidance for researchers. Most notably, FluNexus offers visualization of influenza A virus antigenic evolution, providing intuitive insights into its antigenic dynamics. Specifically, FluNexus proposes a novel manifold-based method for positioning antigens and antisera, generating accurate antigenic cartographies even with sparse HI data. By alleviating the programming burden on biologists, FluNexus supports more informed decision-making in vaccine strain selection and strengthens surveillance and pandemic preparedness. Highlights: (i) FluNexus features a data preprocessing module for HA1 and HI data spanning the H1, H3, and H5 subtypes. (ii) FluNexus facilitates online antigenic prediction utilizing ten state-of-the-art antigenic prediction tools, and offers practical guidance based on a comparative evaluation of their performance. (iii) FluNexus provides a visualization module for mapping antigenic evolution of influenza A viruses, incorporating a novel manifold-based method for antigenic cartography.
Walter, J.; Kuenne, C.; Knoppik, N.; Goymann, P.; Looso, M.
Scientific research relies on transparent dissemination of data and its associated interpretations. This task encompasses accessibility of raw data, its metadata, details concerning experimental design, along with parameters and tools employed for data interpretation. Production and handling of these data represent an ongoing challenge, extending beyond publication into individual facilities, institutes and research groups, often termed Research Data Management (RDM). It is foundational to scientific discovery and innovation, and can be paraphrased as Findability, Accessibility, Interoperability and Reusability (FAIR). Although the majority of peer-reviewed journals require the deposition of raw data in public repositories in alignment with FAIR principles, metadata frequently lacks full standardization. This critical gap in data management practices hinders effective utilization of research findings and complicates sharing of scientific knowledge. Here we present a flexible design of a machine-readable metadata format to store experimental metadata, along with an implementation of a generalized tool named FRED. It enables (i) dialog-based creation of metadata files, (ii) structured semantic validation, (iii) logical search, (iv) an external programming interface (API), and (v) a standalone web front end. The tool is intended to be used by non-computational scientists as well as specialized facilities, and can be seamlessly integrated into existing RDM infrastructure.
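Structured semantic validation of the kind FRED provides can be sketched minimally as required-field and type checks against a schema; the schema fields below are hypothetical and the function is not FRED's actual interface.

```python
def validate_metadata(record, schema):
    """Minimal structured validation of a metadata record.

    schema: {field name: expected Python type}. Returns a list of
    human-readable problems; an empty list means the record passes.
    Extra fields in the record are ignored.
    """
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems
```

A dialog-based front end like FRED's can build the record interactively and re-run such checks until the problem list is empty.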
Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.
Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline, deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using the standard package manager versus REBEL: REBEL resolved 149 of 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain-text requirements file, making deterministic environment construction accessible without expertise in containerization. Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core
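Conservative dependency locking, the third of REBEL's heuristics, amounts to pinning the full transitive set of packages observed after a successful build, not just the direct dependencies. The sketch below emits a pip-style `pkg==version` lock for illustration; REBEL's actual store format is not specified in the abstract.

```python
def lock_environment(installed):
    """Pin every installed package to its observed version.

    installed: {package name: version string}, e.g. as reported by the
    package manager after a successful build. Sorting makes the lock
    file deterministic, so identical environments produce identical
    lock files.
    """
    return "\n".join(f"{pkg}=={ver}" for pkg, ver in sorted(installed.items()))
```

Rebuilding from such a lock against an archived local store is what makes the environment reproducible offline, independent of the live repository's state.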
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a single run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates.
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
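The repeatability metrics used here, mean per-category Jaccard and pooled (micro) Jaccard over two runs, are straightforward set operations; the category names and accessions in the example are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def repeatability(run1, run2):
    """Mean per-category Jaccard and pooled (micro) Jaccard of two runs.

    run1/run2: {category: set of UniProt accessions}. The mean treats
    every category equally; the micro value pools all accessions first,
    so large categories dominate it.
    """
    cats = sorted(set(run1) | set(run2))
    per_cat = [jaccard(run1.get(c, set()), run2.get(c, set())) for c in cats]
    pooled1 = set().union(*(run1.get(c, set()) for c in cats))
    pooled2 = set().union(*(run2.get(c, set()) for c in cats))
    return sum(per_cat) / len(cats), jaccard(pooled1, pooled2)
```

A system that returns nearly identical accession sets across runs scores close to 1.0 on both metrics, as Codex did (0.982 and 0.974).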
Biase, F. H.; Morozyuk, M.; Ezepha, C.
Background: Single-cell RNA sequencing (scRNA-seq) integration methods remove technical variation while preserving biological signal, yet systematic frameworks for evaluating how parameter choices influence biological interpretation remain limited. Traditional benchmarking approaches evaluate single-parameter configurations per method, potentially missing systematic patterns in functional outcomes and method convergence. A framework for systematic integration parameter evaluation was developed and applied to bovine embryo development. Results: Six integration methods (FastMNN, CCA, RPCA, scVI, Harmony, STACAS) combined with multiple parameters, including those for neighbor identification and clustering, yielded 8,232 combinations. The main outputs evaluated were specific cell counts and marker identification. After filtering for extremely poor cell and marker identification, 4,287 integration parameter combinations were retained for analysis. Three major patterns (clusters) emerged, with integration methods distributed non-randomly across clusters and with distinct biological outcomes. One pattern, composed of scVI and STACAS integrations, was dominated by a failure to identify epiblast cells. Cluster 2 (n=29), also composed of scVI and STACAS integrations, identified the most epiblast markers (n=7, 8, or 9) but had a limited number of epiblast cells (median=10). Cluster 1 (n=4,120 combinations) had the highest method diversity. Across clusters, trophoblast and mesoderm showed high functional distinctness, while epiblast and hypoblast showed moderate overlap in gene ontology classes. Conclusions: The approach reveals that parameter choices influence cell type classification, functional interpretation, and the degree of method convergence, with implications for identifying specific biological inferences for further orthogonal validation. A systematic approach to evaluating integration methods, along with other parameters, is advisable for accurate biological inference.
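The combinatorial sweep described above can be sketched as a Cartesian product over option lists; the option names and values below are hypothetical (the study's real grid over six methods and neighbor/clustering parameters reached 8,232 combinations).

```python
from itertools import product

def parameter_grid(**options):
    """Yield every combination of analysis options as a dict.

    options: keyword arguments mapping an option name to the list of
    values to sweep. The number of combinations is the product of the
    list lengths.
    """
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))

# Hypothetical grid: 6 methods x 3 neighbor settings x 3 resolutions = 54
grid = list(parameter_grid(
    method=["FastMNN", "CCA", "RPCA", "scVI", "Harmony", "STACAS"],
    n_neighbors=[10, 20, 30],
    resolution=[0.4, 0.8, 1.2],
))
```

Each dict can then drive one integration run, and the resulting outputs (cell counts, markers) can be clustered to look for the convergence patterns the study reports.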
Butt, M. Z.; Ahmad, R. S.; Fatima, E.; Tahir ul Qamar, M.
The application of Large Language Models (LLMs) for generating data visualizations through natural language interaction represents a promising advance in AI-assisted scientific analysis. However, existing LLM-based tools largely emphasize graph generation, while research workflows require not only visualization but also rigorous interpretation and validation against established scholarly evidence. Despite advances in visualization technologies, no single tool currently integrates literature references with visualization while also generating insights from graphical data. To address this gap, we present BioVix, a web-based LLM-driven framework that integrates interactive data visualization, natural-language querying, and automated retrieval of relevant academic literature. BioVix enables users to upload datasets, generate complex visualizations, interpret graphical patterns, and contextualize findings through literature references within a unified workflow. The system employs a multi-model architecture combining DeepSeek V3.1 for code and logic generation, Qwen2.5-VL-32B-Instruct for multimodal interpretation, and GPT-OSS-20B for conversational reasoning, coordinated through structured prompt engineering. BioVix was evaluated across diverse biological domains, including proteomic expression profiling, epigenomic peak annotation, and clinical diabetes data, demonstrating its flexibility in handling heterogeneous datasets and supporting exploratory, literature-aware analysis. While BioVix substantially streamlines exploratory research workflows, its LLM-generated outputs are intended to support, not replace, expert judgment, and users should independently verify results before scientific reporting. BioVix is openly available via public deployment on Hugging Face (https://huggingface.co/spaces/MuhammadZain10/BioVix), with source code provided through GitHub (https://github.com/MuhammadZain-Butt/BioVix).
Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.
As biomedical knowledge keeps growing, resources storing available information multiply and grow in size and complexity. Such resources can be in the format of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially for novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototype, in the form of a hackathon, and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.
Li, T.; Wang, Y.; Zhang, Z.; Chen, C.; Zheng, N.; Wang, J.; Ning, M.; Wang, J.; Ai, H.; Huang, Y.
Background: Although the biological mechanism of heterosis has long been debated, heterosis is widely exploited to increase the global productivity of crops and livestock. Recently, the mechanism has been well characterized in crops and in livestock with a male-heterogametic XY system, thanks to advances in genome assembly, especially the availability of haploid genomes. However, the biological mechanism of heterosis remains unclear in poultry, which possess the female-heterogametic ZW system. Results: Here, we assembled chromosome-level diploid and haploid genomes of the Muscovy duck. We developed an efficient and cost-effective method to assemble 12 high-quality variation-graph haploid Muscovy duck genomes from three full-sibling pairs using short-read Illumina sequences. We further characterized the genetic, expression, and regulatory patterns of parental alleles at multiple scales. We found that, compared to paternal haploid genomes, maternal haploid genomes generally had more open chromatin organization, higher accessibility, and higher levels of gene expression, while showing similar DNA methylation levels. In contrast, compared to the male maternal Z chromosome, the female paternal Z chromosome showed the most relaxed, and the male paternal Z chromosome more relaxed, chromatin organization, chromatin accessibility, and gene expression. Thus, the ZW system relies largely on compensation and balance to regulate gene expression on the Z sex chromosome. Moreover, we identified non-Mendelian regions covering 0.26% of the genome (~3.18 Mb). These regions contained lower gene density, GC content, and repeat sequence frequency, but were enriched for DNA motifs bound by transcription factors, likely leading to a compacted chromatin structure and lower chromatin accessibility. Conclusions: Our work provides a comprehensive profile of the genetic, expression, and regulatory patterns of parental alleles in the female-heterogametic ZW system, and may be useful for exploiting heterosis in poultry.
Pitarch, B.; Pazos, F.; Chagoyen, M.
The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.
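The basic operation behind such a temporal characterization is diffing term sets between two ontology releases; the sketch below shows a minimal version with hypothetical GO identifiers, tracking added, removed, and renamed terms.

```python
def release_diff(old_terms, new_terms):
    """Summarize term turnover between two ontology releases.

    old_terms/new_terms: {term id: term name}. Returns sorted lists of
    added ids, removed ids, and ids whose name changed between releases
    (a simple proxy for reorganization).
    """
    added = sorted(set(new_terms) - set(old_terms))
    removed = sorted(set(old_terms) - set(new_terms))
    renamed = sorted(t for t in set(old_terms) & set(new_terms)
                     if old_terms[t] != new_terms[t])
    return {"added": added, "removed": removed, "renamed": renamed}
```

Applying such a diff to every consecutive pair of releases yields the kind of longitudinal turnover series from which the study's post-2017 stabilization can be read off.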
Braga Apolinario, A.; Vieira, K. V.; Costa, A. K. M. M.; Freitas, L. C.; Pinheiro, I. S.; Vitral, R. W. F.; Campos, M. J. d. S.
Bibliometric analyses have become essential for understanding scientific production and innovation dynamics; however, large-scale applications remain limited by challenges related to data extraction, preprocessing, citation network reconstruction, and reproducibility, particularly when using PubMed-indexed records. This study presents a fully automated and reproducible computational workflow for large-scale bibliometric analyses based on the Disruption Index (DI). The pipeline enables systematic retrieval of PubMed data, standardized metadata processing, construction of citation networks, and calculation of DI values within a fixed post-publication citation window. Implemented in Python, the workflow integrates automated querying, XML parsing, data consolidation, and network-based citation classification, allowing scalable and transparent analyses that are infeasible through manual approaches. In a demonstrative application focused on orthodontic literature, the pipeline processed more than 67,000 articles and reconstructed over 300,000 citation relationships, resulting in a final analytical sample of 3,234 articles with indexed references and citations. The automated framework ensures methodological transparency, facilitates replication, and substantially reduces the time and technical barriers associated with advanced bibliometric studies. By providing an open and extensible solution for calculating the Disruption Index at scale, this workflow supports robust assessments of scientific innovation and consolidation and can be readily adapted to other biomedical research domains indexed in PubMed.
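The Disruption Index itself has a compact set-based formulation (the Wu, Wang & Evans variant commonly used in such pipelines): DI = (nF - nB) / (nF + nB + nR), where nF counts papers citing only the focal paper, nB papers citing both the focal paper and its references, and nR papers citing only the references. The sketch below ignores the fixed post-publication citation window the workflow applies, so treat it as the core formula only.

```python
def disruption_index(citers_of_focal, citers_of_refs):
    """Disruption Index of a focal paper from two citer sets.

    citers_of_focal: papers that cite the focal paper.
    citers_of_refs:  papers that cite at least one of its references.
    Positive values indicate disruption (citers bypass the references);
    negative values indicate consolidation (citers cite both).
    """
    citers_of_focal, citers_of_refs = set(citers_of_focal), set(citers_of_refs)
    n_b = len(citers_of_focal & citers_of_refs)   # cite both
    n_f = len(citers_of_focal - citers_of_refs)   # cite focal only
    n_r = len(citers_of_refs - citers_of_focal)   # cite references only
    total = n_f + n_b + n_r
    return (n_f - n_b) / total if total else 0.0
```

For example, a paper whose citers largely ignore its references scores near +1, while one always co-cited with its references scores near -1.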
Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.
Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it
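At its core, identifier conversion of the kind geneslator performs is a lookup with explicit handling of unmapped inputs; the function below is an illustrative Python sketch, not geneslator's R API (TP53's Ensembl GeneID in the example is the real one, ENSG00000141510).

```python
def convert_ids(genes, mapping, keep_unmapped=False):
    """Translate gene identifiers through a lookup table.

    mapping: {source id: target id}. Unmapped inputs are dropped by
    default so downstream joins stay consistent; pass
    keep_unmapped=True to keep them paired with None instead, which
    makes silent identifier loss visible.
    """
    out = []
    for g in genes:
        target = mapping.get(g)
        if target is not None:
            out.append((g, target))
        elif keep_unmapped:
            out.append((g, None))
    return out
```

Making the drop-or-flag policy explicit is exactly the kind of consistency issue fragmented annotation workflows get wrong.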
Vilain, M.; Aris-Brosou, S.
Background: The ever-growing amount of available biological data leads modern analyses to be performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to handle such large amounts of data efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, the Python tools currently available for this preprocessing step are not well suited for integration into pipelines, resulting in slow encoding speeds. Results: We introduce dna-parser, a Python library written in Rust to encode DNA and RNA sequences into numerical features. The combination of Rust and Python allows sequences to be encoded rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, the library implements many of the most widely used numerical feature schemes from bioinformatics and natural language processing. Conclusion: dna-parser is an easy-to-install Python library that offers Python wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open-source code is available on GitHub (https://github.com/Mvila035/dna_parser), along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
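One of the simplest numerical descriptors such libraries provide is one-hot encoding; the plain-Python sketch below illustrates the idea only and makes no claim about dna-parser's actual interface (its Rust backend exists precisely because pure Python like this is slow at scale).

```python
def one_hot(seq):
    """One-hot encode a DNA sequence, one row per base in A, C, G, T order.

    Ambiguous bases (e.g. N) map to an all-zero row, a common convention;
    lowercase input is accepted.
    """
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

The resulting length-by-4 matrix is the typical input shape for convolutional models over sequence data.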
Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.
AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.