GigaScience

Oxford University Press (OUP)

All preprints, ranked by how well they match GigaScience's content profile, based on 172 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Integrating Patient Metadata and Genetic Pathogen Data: Advancing Pandemic Preparedness with a Multi-Parametric Simulator

Bonjean, M.; Ambroise, J.; Connolly, M.; Hayes, J.; Hurel, J.; Sentis, A.; Orchard, F.; Gala, J.-L.

2023-08-23 bioengineering 10.1101/2023.08.22.554132 medRxiv
Top 0.1%
36.3%

Training and practice are needed to handle an unusual crisis quickly, safely, and effectively. Functional and table-top exercises simulate anticipated CBRNe (Chemical, Biological, Radiological, Nuclear, and Explosive) and public health crises with complex scenarios based on realistic epidemiological, clinical, and biological data from affected populations. For this reason, the use of anonymized databases, such as those from ECDC or NCBI, is necessary to run meaningful exercises. Creating a training scenario requires connecting different datasets that characterise the population groups exposed to the simulated event. This involves interconnecting laboratory, epidemiological, and clinical data, alongside demographic information. The sharing and connection of data among EU member states currently face shortcomings due to a variety of factors, including variations in data collection methods, standardisation practices, legal frameworks, privacy and security regulations, and disparities in resources and infrastructure. During the H2020 project PANDEM-2 (Pandemic Preparedness and Response), we developed a multi-parametric training tool to artificially link laboratory data and metadata. We used SARS-CoV-2 data and the ECDC and NCBI open-access databases to enhance pandemic preparedness. We developed a comprehensive training procedure encompassing guidelines, scenarios, and answers, all designed to assist users in effectively utilising the simulator. Our tool empowers training managers and trainees to enhance existing datasets by generating additional variables through data-driven or random simulations. Furthermore, it facilitates the augmentation of a specific variable's proportion within a given set, allowing scenarios to be customised to achieve desired outcomes. Our multi-parameter simulation tool is contained in the R package Pandem2simulator, available at https://github.com/maous1/Pandem2simulator. A Shiny application, developed to make the tool easy to use, is available at https://uclouvain-ctma.Shinyapps.io/Multi-parametricSimulator/. The tool runs in seconds despite using large datasets. In conclusion, this multi-parametric training tool can simulate any crisis scenario, improving pandemic and CBRN preparedness and response. The simulator serves as a platform to develop methodology and graphical representations for future database-connected applications.
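
The two augmentation modes described above (simulating a new variable from existing columns, and forcing a variable to a target proportion) can be sketched generically. The snippet below is a minimal Python illustration with invented column names ("age_group", "variant"); the published tool itself is the R package Pandem2simulator.

```python
# Minimal sketch of the two augmentation modes the abstract describes,
# with hypothetical column names; not the Pandem2simulator implementation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy line list of cases (stand-in for an ECDC-style dataset).
cases = pd.DataFrame({"age_group": rng.choice(["0-17", "18-64", "65+"], size=1000)})

# Mode 1: add a new variable by data-driven simulation, sampling a
# pathogen variant with probabilities conditioned on an existing column.
probs = {"0-17": [0.7, 0.3], "18-64": [0.5, 0.5], "65+": [0.3, 0.7]}
cases["variant"] = [
    rng.choice(["Alpha", "Delta"], p=probs[g]) for g in cases["age_group"]
]

# Mode 2: force a chosen category to a target proportion, e.g. make
# "Delta" 80% of cases for a worst-case training scenario (this
# deliberately overrides the data-driven assignment above).
target, share = "Delta", 0.80
n_target = int(share * len(cases))
idx = rng.permutation(cases.index)
cases.loc[idx[:n_target], "variant"] = target
cases.loc[idx[n_target:], "variant"] = "Alpha"

print(cases["variant"].value_counts(normalize=True))
```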

2
Reproducible and accessible analysis of transposon insertion data at scale

Lariviere, D.; Wickham, L.; Keiler, K. C.; Nekrutenko, A.

2020-05-20 microbiology 10.1101/2020.05.19.105429 medRxiv
Top 0.1%
28.3%

Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research, yet the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such "problem area" is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows perturbing the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on survival and growth provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data, and among the variety of options available, it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of optimal tool parameters for the analysis and the removal of contamination. Our work provides an assessment of the currently available tools for TIS data analysis and offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform (https://usegalaxy.org). To lower the entry barriers, we have also developed interactive tutorials explaining the details of TIS data analysis procedures at https://bit.ly/gxy-tis. Importance: A wide array of tools has been developed to analyze TIS data, and among the variety of options available, it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of optimal tool parameters for the analysis and the removal of contamination. Our work democratizes TIS data analysis by providing open workflows supported by public computational infrastructure.
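
Of the workflow steps listed above, the final one (read count repartition across the genome) reduces to tallying insertion sites and reads per annotated gene. A toy Python illustration follows, with made-up gene coordinates and insertion counts; the actual workflows run on Galaxy with dedicated TIS tools.

```python
# Toy illustration of read count repartition: tally transposon insertion
# sites and reads per gene. All coordinates and counts are invented.
from collections import defaultdict

genes = [("geneA", 0, 500), ("geneB", 500, 1200), ("geneC", 1200, 2000)]
insertions = {12: 40, 310: 7, 640: 120, 1190: 3, 1500: 55}  # position -> reads

sites = defaultdict(int)
reads = defaultdict(int)
for pos, n in insertions.items():
    for name, start, end in genes:
        if start <= pos < end:
            sites[name] += 1
            reads[name] += n
            break

for name, start, end in genes:
    density = sites[name] / (end - start)  # insertion sites per bp
    print(f"{name}: {sites[name]} sites, {reads[name]} reads, {density:.4f} sites/bp")
```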

3
FAIRSCAPE: An Evolving AI-readiness Framework for Biomedical Research

Al Manir, S.; Levinson, M. A.; Niestroy, J.; Churas, C.; Parker, J. A.; Clark, T.

2024-12-23 bioinformatics 10.1101/2024.12.23.629818 medRxiv
Top 0.1%
28.3%

Objective: Biomedical datasets intended for use in AI applications require packaging with rich pre-model metadata to support model development that is explainable, ethical, epistemically grounded, and FAIR (Findable, Accessible, Interoperable, Reusable). Methods: We developed FAIRSCAPE, a digital commons environment, using agile methods, in close alignment with the team developing the AI-readiness criteria and with the Bridge2AI data production teams. Work was initially based on an existing provenance-aware framework for clinical machine learning. We incrementally added RO-Crate data+metadata packaging and exchange methods, client-side packaging support, provenance visualization, and support for metadata mapped to the AI-readiness criteria, with automated AI-readiness evaluation. LinkML semantic enrichment and Croissant ML-ecosystem translations were also incorporated. Results: The FAIRSCAPE framework generates, packages, evaluates, and manages critical pre-model AI-readiness and explainability information with descriptive metadata and deep provenance graphs for biomedical datasets. It provides ethical, schema, statistical, and semantic characterization of dataset releases, licensing and availability information, and an automated AI-readiness evaluation across all 28 AI-readiness criteria. We applied this framework to successive, large-scale releases of multimodal datasets, progressively increasing dataset AI-readiness to full compliance. Conclusion: FAIRSCAPE enables AI-readiness in biomedical datasets using standard metadata components and has been used to establish this pattern across a major, multimodal NIH data generation program. It eliminates the early-stage opacity apparent in many biomedical AI applications and provides a basis for establishing end-to-end AI explainability.
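
For readers unfamiliar with the RO-Crate packaging that FAIRSCAPE builds on, a minimal ro-crate-metadata.json can be produced by hand. The sketch below follows the RO-Crate 1.1 conventions (metadata descriptor plus root dataset) with an invented dataset name; FAIRSCAPE itself adds far richer provenance graphs and AI-readiness fields.

```python
# Minimal hand-rolled RO-Crate metadata file per the RO-Crate 1.1 spec.
# Dataset name and file are illustrative placeholders only.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # metadata file descriptor required by the spec
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        },
        {   # the root dataset being packaged
            "@id": "./",
            "@type": "Dataset",
            "name": "example-multimodal-release",  # hypothetical
            "license": "https://creativecommons.org/licenses/by/4.0/",
            "hasPart": [{"@id": "measurements.csv"}],
        },
        {   # one data file inside the crate
            "@id": "measurements.csv",
            "@type": "File",
            "encodingFormat": "text/csv",
        },
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```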

4
Gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

Son, K. H.; Cho, J.-Y.

2024-07-22 bioinformatics 10.1101/2024.07.18.604168 medRxiv
Top 0.1%
28.0%

Motivation: With the daily submission of numerous new genome assemblies, associated annotations, and experimental sequencing data to genome archives for various species, the volume of genomic data is growing at an unprecedented rate. Major genomic databases are establishing new hierarchical structures to manage this data influx. However, tools that can efficiently access, download, and integrate genomic data from these diverse repositories are still lacking, making it challenging for researchers to keep pace. Results: We have developed Gencube, a command-line tool with two primary functions. First, it facilitates the use of genome assemblies, related annotations, gene set sequences, and cross-species data from various leading biodiversity databases. Second, it helps researchers intuitively explore experimental sequencing data that meets their needs and consolidates the metadata of the retrieved outputs. Availability and implementation: Gencube is a free and open-source tool, with its code available on GitHub: https://github.com/snu-cdrc/gencube.
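
Gencube's own command-line interface is documented in its repository; as a rough illustration of the kind of retrieval it wraps, the sketch below queries NCBI E-utilities directly for genome assemblies of one species. The JSON field names are those returned by E-utilities as we read the documentation; treat them as assumptions.

```python
# Sketch: find and summarize genome assemblies via NCBI E-utilities.
# This is not Gencube's implementation, just the kind of call it wraps.
import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# 1. Find assembly UIDs for the species.
r = requests.get(f"{BASE}/esearch.fcgi", params={
    "db": "assembly", "term": "Canis lupus familiaris[Organism]",
    "retmode": "json", "retmax": 5,
})
uids = r.json()["esearchresult"]["idlist"]

# 2. Pull summary metadata per UID (field names per E-utilities JSON,
#    an assumption worth verifying against live output).
r = requests.get(f"{BASE}/esummary.fcgi", params={
    "db": "assembly", "id": ",".join(uids), "retmode": "json",
})
docs = r.json()["result"]
for uid in uids:
    d = docs[uid]
    print(d["assemblyaccession"], d["assemblyname"], d["assemblystatus"])
```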

5
From Planning Stage To FAIR Data: A Practical Metadatasheet For Biomedical Scientists

Seep, L.; Grein, S.; Splichalova, I.; Ran, D.; Mikhael, M.; Hildebrand, S.; Lauterbach, M.; Hiller, K.; Ribeiro, D. J. S.; Sieckmann, K.; Kardinal, R.; Huang, H.; Yu, J.; Kallabis, S.; Behrens, J.; Till, A.; Peeva, V.; Strohmeyer, A.; Bruder, J.; Blum, T.; Soriano-Arroquia, A.; Tischer, D.; Kuellmer, K.; Li, Y.; Beyer, M.; Gellner, A.-K.; Fromme, T.; Wackerhage, H.; Klingenspor, M.; Fenske, W. K.; Scheja, L.; Meissner, F.; Schlitzer, A.; Mass, E.; Wachten, D.; Latz, E.; Pfeifer, A.; Hasenauer, J.

2024-01-30 bioinformatics 10.1101/2024.01.27.577552 medRxiv
Top 0.1%
23.7%

Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems, and contexts. However, relevant information arises at different stages across the data lifecycle. Often, this information is defined and standardized only at the publication stage, which can lead to data loss and increased workload. In this study, we developed the Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and a systematic screening of data repositories. It aligns with the data lifecycle, allowing synchronous metadata recording within Microsoft Excel, a widespread data-recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaptation, metadata integrity checks, and export options for various metadata standards. By design, and due to its extensive documentation, the proposed metadata standard simplifies the recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.
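
The Metadata Workbook performs its integrity checks inside Excel itself; the standalone Python sketch below shows the same idea under assumed conventions: a two-column key/value sheet named metadatasheet.xlsx and a hypothetical list of required fields, neither taken from the published standard.

```python
# Sketch: load a key/value metadata sheet and verify required fields
# before export. Layout and field names are assumptions, not the standard.
import pandas as pd

# Hypothetical required fields; the published standard defines its own.
REQUIRED = ["project", "organism", "method", "operator", "date"]

# Assumed layout: first sheet, two columns of key/value pairs, no header.
sheet = pd.read_excel("metadatasheet.xlsx", sheet_name=0, header=None,
                      names=["key", "value"])
meta = dict(zip(sheet["key"].astype(str).str.strip().str.lower(),
                sheet["value"]))

missing = [k for k in REQUIRED if k not in meta or pd.isna(meta[k])]
if missing:
    raise ValueError(f"metadata incomplete, missing fields: {missing}")
print("metadata OK:", {k: meta[k] for k in REQUIRED})
```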

6
Maggot: An ecosystem for sharing metadata within the web of FAIR Data

Jacob, D.; Ehrenmann, F.; David, R.; Tran, J.; Mirande-Ney, C.; Chaumeil, P.

2024-05-29 bioinformatics 10.1101/2024.05.24.595703 medRxiv
Top 0.1%
23.7%

Background: Descriptive metadata are crucial for the discovery, reporting, and mobilisation of research datasets. Addressing all metadata issues within the Data Management Plan often poses challenges for data producers. Organising and documenting data within data storage entails creating various descriptive metadata. Subsequently, data sharing involves ensuring metadata interoperability in alignment with FAIR principles. Given the tangible nature of these challenges, there is a real need for management tools that assist data managers to the fullest extent. Moreover, these tools have to meet data producers' requirements and be user-friendly, with minimal training as a prerequisite. Results: We developed Maggot, which stands for Metadata Aggregation on Data Storage, specifically designed to annotate datasets by generating metadata files to be linked into storage spaces. Maggot enables users to seamlessly generate and attach comprehensible metadata to datasets within a collaborative environment. This approach integrates smoothly into a data management plan, effectively tackling challenges related to data organisation, documentation, storage, and frictionless FAIR metadata sharing within the collaborative group and beyond. Furthermore, to enable metadata crosswalks, metadata generated with Maggot can be converted for a specific data repository or configured to be exported into a format suitable for data harvesting by third-party applications. Conclusion: The primary feature of Maggot is to ease metadata capture based on a carefully selected schema and standards. It thus greatly eases access to data through metadata, as now requested in projects funded by public institutions and entities such as the European Commission. Maggot can therefore be used, on the one hand, to promote good local as well as global data management with open data sharing in mind while respecting FAIR principles, and, on the other hand, to prepare for the future EOSC FAIR Web of Data within the framework of the European Open Science Cloud.
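
The metadata crosswalk mentioned above amounts to re-keying one locally captured record for a target repository schema. A minimal sketch, assuming an invented field mapping (Maggot ships its own definition files and export formats):

```python
# Sketch of a metadata crosswalk: re-key a local record for a target
# schema. The mapping below is invented for illustration only.
import json

record = {
    "title": "Soil metabolomics 2023",
    "authors": ["Doe, J.", "Roe, A."],
    "license": "CC-BY-4.0",
    "keywords": ["metabolomics", "soil"],
}

# Crosswalk: local key -> target-schema key (hypothetical DataCite-like).
crosswalk = {"title": "titles", "authors": "creators",
             "license": "rightsList", "keywords": "subjects"}

exported = {crosswalk[k]: v for k, v in record.items() if k in crosswalk}

# The metadata file is written alongside the data in the storage space.
with open("dataset_metadata.json", "w") as fh:
    json.dump(exported, fh, indent=2)
print(exported)
```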

7
Rif-Correct, a software tool for bacterial mRNA half-life estimation

Aretakis, J. R.; Schrader, J. M.

2022-02-25 microbiology 10.1101/2022.02.25.481994 medRxiv
Top 0.1%
23.6%

Background: Genome-wide measurement of bacterial mRNA lifetimes using the antibiotic rifampicin has provided new insights into the control of bacterial mRNA decay. However, for long polycistronic mRNAs, the estimation of mRNA half-life can be confounded by transcriptional runoff, caused by rifampicin's inhibition of initiating, but not elongating, RNA polymerases. Results: We present the Rif-correct software package, a free, open-source tool that uses transcriptome models of transcript architecture to provide more accurate mRNA half-life estimates that account for transcriptional runoff. Rif-correct is implemented as a customizable Python script that allows users to control all the analysis parameters to achieve improved mRNA half-life estimates. Conclusions: Rif-correct is the first free, open-source computational analysis pipeline for mRNA half-life estimation from Rif-seq datasets. It is simple, fast, and easy to run, with a detailed instruction manual and example datasets.
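
The core calculation behind any Rif-seq half-life estimate is an exponential-decay fit, with half-life ln(2)/k. The sketch below fits synthetic abundances with SciPy; Rif-correct's actual contribution, correcting the counts for transcriptional runoff using transcript architecture, is omitted here.

```python
# Fit exponential decay to post-rifampicin abundance and convert the
# rate to a half-life. Data are synthetic; no runoff correction applied.
import numpy as np
from scipy.optimize import curve_fit

t = np.array([0, 2, 4, 8, 16], dtype=float)   # minutes after rifampicin
y = np.array([1.00, 0.62, 0.40, 0.16, 0.03])  # normalized mRNA abundance

def decay(t, a, k):
    return a * np.exp(-k * t)

(a, k), _ = curve_fit(decay, t, y, p0=(1.0, 0.2))
print(f"decay rate k = {k:.3f}/min, half-life = {np.log(2) / k:.2f} min")
```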

8
Analyzing the Naming Conventions of Life Science Data Resources to Inform Human and Computational Findability

Imker, H. J.; Ou, H.

2025-10-04 bioinformatics 10.1101/2025.10.02.680112 medRxiv
Top 0.1%
23.6%

This study aimed to evaluate the names of life science data resources and consider the impacts on findability, a core feature of the FAIR (Findability, Accessibility, Interoperability, and Reusability) Principles. Utilizing a previously published list of unique data resources, we identified and validated data resources with both common and full names available (n = 1153). From this set, we analyzed characteristics of resource names to identify whether any naming conventions have emerged organically. Additionally, since common names are often used in the absence of a resource's full name, we performed a test to evaluate our ability to infer any meaning from common names. Our results highlight suboptimal naming practices and a widespread opaqueness in common names, which poses challenges to resource identification and retrieval by both human- and computationally-centric methods. These results are informative for those who establish and promote data resources, as well as for those who search for data to use in individual research projects, develop data discovery systems, analyze the scientific literature, or assess research infrastructure. The findings underscore the value of findability in the FAIR Principles and the current efforts to develop infrastructure that supports more efficient communication and global connectedness.
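
One way to operationalize the question of whether a common name carries recoverable meaning is to test whether it reads as an acronym of the resource's full name. The heuristic below is our own toy illustration, not the study's evaluation protocol:

```python
# Toy heuristic: does the common name match the initials of the full name?
def looks_like_acronym(common: str, full: str) -> bool:
    initials = "".join(w[0] for w in full.replace("-", " ").split())
    return common.replace("-", "").lower() == initials.lower()

pairs = [
    ("BLAST", "Basic Local Alignment Search Tool"),
    ("PDB", "Protein Data Bank"),
    ("Maggot", "Metadata Aggregation on Data Storage"),  # opaque: not initials
]
for common, full in pairs:
    print(f"{common:8s} acronym-like: {looks_like_acronym(common, full)}")
```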

9
Building a community-driven bioinformatics platform to facilitate Cannabis sativa multi-omics research

Mansueto, L.; Kretzschmar, T.; Mauleon, R.; King, G. J.

2024-10-03 bioinformatics 10.1101/2024.10.02.616368 medRxiv
Top 0.1%
23.5%

Global changes in Cannabis legislation after decades of stringent regulation, and heightened demand for its industrial and medicinal applications, have spurred recent genetic and genomics research. An international research community emerged and identified the need for a web portal to host Cannabis-specific datasets that seamlessly integrates multiple data sources and serves omics-type analyses, fostering information sharing. The Tripal platform was used to host public genome assemblies, gene annotations, QTL and genetic maps, gene and protein expression data, metabolic profiles, and their sample attributes. SNPs were called on three genomes using public resequencing datasets. Additional applications, such as SNP-Seek and MapManJS, were embedded into Tripal. A multi-omics data integration web-service API, developed on top of existing Tripal modules, returns generic tables of sample, property, and value. Use cases demonstrate the API's utility for various omics analyses, enabling researchers to perform multi-omics analyses efficiently.
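
An API that returns generic tables of sample, property, and value implies long-format records that are pivoted into a sample-by-property matrix before analysis. A minimal pandas sketch, with placeholder records rather than the portal's actual routes:

```python
# Pivot long-format (sample, property, value) records into a matrix.
# The records stand in for a JSON response from the integration API.
import pandas as pd

rows = [
    {"sample": "S1", "property": "THC_pct", "value": 0.8},
    {"sample": "S1", "property": "CBD_pct", "value": 12.1},
    {"sample": "S2", "property": "THC_pct", "value": 15.3},
    {"sample": "S2", "property": "CBD_pct", "value": 0.4},
]

long = pd.DataFrame(rows)
matrix = long.pivot(index="sample", columns="property", values="value")
print(matrix)
```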

10
VData: Temporally annotated data manipulation and storage

Bouvier, M.; Bonnaffoux, A.

2023-08-31 bioinformatics 10.1101/2023.08.29.555297 medRxiv
Top 0.1%
22.9%

Background: Recent advances in both single-cell sequencing technologies and gene expression simulation algorithms have led to the production of increasingly large datasets. Larger datasets (tens or hundreds of gigabytes) can no longer fit in a regular computer's RAM and thus pose important challenges for storage and manipulation. Existing solutions address these issues only partially: they do not explicitly handle the temporal dimension of simulated data and still require large amounts of RAM to run. Results: VData is a Python extension to the widely used AnnData format that solves these issues by extending 2D dataframes to 3 dimensions (cells, genes, and time). VData is built on top of Ch5mpy, a custom-built Python library for easily working with HDF5 files, which reduces the memory footprint to a minimum. Conclusions: VData allows users to store and manipulate very large datasets of (empirical or simulated) time-stamped data. Since it follows the original AnnData format, it is compatible with the scverse tools, and AnnData users will find it easy to use.
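
The storage problem VData targets can be miniaturized: keep a (time, cell, gene) expression cube on disk in HDF5 and slice it lazily so only the requested timepoint is loaded into RAM. The h5py sketch below shows this layout; it is our own illustration, not VData's or Ch5mpy's internal format.

```python
# Store a 3D (time, cell, gene) cube in HDF5 and read one timepoint
# lazily. Layout and names are illustrative, not VData internals.
import h5py
import numpy as np

n_time, n_cells, n_genes = 5, 5_000, 1_000
with h5py.File("expression.h5", "w") as f:
    d = f.create_dataset("X", shape=(n_time, n_cells, n_genes),
                         dtype="float32", chunks=(1, 1024, n_genes))
    for t in range(n_time):  # write one timepoint at a time, RAM stays small
        d[t] = np.random.poisson(1.0, (n_cells, n_genes)).astype("float32")

with h5py.File("expression.h5", "r") as f:
    t2 = f["X"][2]  # reads only timepoint 2 from disk (~20 MB here)
    print(t2.shape, t2.mean())
```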

11
Integrating the ENCODE blocklist for machine learning quality control of ChIP-seq with seqQscorer

Albrecht, S.; Krämer, C.; Röchner, P.; Mayer, J. U.; Rothlauf, F.; Andrade-Navarro, M. A.; Sprang, M.

2025-05-15 bioinformatics 10.1101/2025.05.12.653555 medRxiv
Top 0.1%
22.8%

Motivation: Quality assessment of next-generation sequencing data is a complex but important task to ensure correct conclusions from experiments in molecular biology, biomedicine, and biotechnology. We previously introduced seqQscorer, a quality assessment tool using machine learning to support this process. To improve seqQscorer in terms of accuracy and processing time, we integrated the ENCODE blocklist to derive a new type of quality-related features, designed to be more informative and faster to generate than those conventionally used by seqQscorer. Results: The novel seqQscorer extension, called seqBLQ, allows us to improve the quality assessment of ChIP-seq data derived from human tissues and cell lines. Furthermore, seqBLQ enhances the usability of the tool by simplifying the installation procedure and reducing the computational resources required for feature generation. Availability and implementation: https://github.com/salbrec/seqQscorer
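
A blocklist-derived quality feature can be as simple as the fraction of mapped reads falling inside ENCODE blocklist regions, with high fractions flagging problematic libraries. A self-contained sketch with inlined intervals follows; seqBLQ computes its features from real BED/BAM inputs, and its feature definitions may differ.

```python
# Fraction of reads inside blocklist intervals, via binary search.
# Intervals and read positions are toy data on a single chromosome.
import bisect

# Blocklist intervals, sorted and non-overlapping.
blocklist = [(100_000, 150_000), (900_000, 1_200_000)]
starts = [s for s, _ in blocklist]

def in_blocklist(pos):
    i = bisect.bisect_right(starts, pos) - 1
    return i >= 0 and blocklist[i][0] <= pos < blocklist[i][1]

read_positions = [5_000, 120_000, 950_000, 2_000_000, 1_150_000]
frac = sum(in_blocklist(p) for p in read_positions) / len(read_positions)
print(f"fraction of reads in blocklist: {frac:.2f}")  # 0.60
```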

12
Bio-DIA: A web-based tool for data and algorithms integration

Dantas Soares, T.; Lira da Silva, V.; Luis Fonseca Faustino, A.; de Azevedo Morais, D. A.; Signoretti, A.; Figuerola, W. B.

2019-12-19 bioinformatics 10.1101/2019.12.13.875666 medRxiv
Top 0.1%
22.7%

Data science is historically a complex field, not only because of the huge amount of data and its variety of formats, but also because of the need for collaboration between several specialists to retrieve valuable information. In this context, we created Bio-DIA, an online software tool for building data-science workflow processes, focused on the integration of data and algorithms. Bio-DIA also facilitates the reuse of information and results obtained in previous processes without requiring specific computer-science skills. The software was created with Angular at the front end and Django at the back end, together with Spark to handle and process a variety of big-data formats. The workflow/project is specified through an XML file. The Bio-DIA application facilitates collaboration among users, allowing research groups to share data, scripts, and information. Availability: https://ucrania.imd.ufrn.br/biodia-app/. Login: bioguest, password: welcome123.
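
Specifying a workflow through an XML file, as described above, can be illustrated with the standard library: parse a step list and dispatch each step to a handler. The schema and tool names below are invented; Bio-DIA's real executors sit on Django and Spark.

```python
# Parse a minimal, invented workflow XML and dispatch each step.
import xml.etree.ElementTree as ET

WORKFLOW = """
<workflow name="demo">
  <step tool="load_csv"  input="expr.csv"/>
  <step tool="normalize" method="zscore"/>
  <step tool="export"    output="result.csv"/>
</workflow>
"""

HANDLERS = {
    "load_csv":  lambda a: print("loading", a["input"]),
    "normalize": lambda a: print("normalizing with", a["method"]),
    "export":    lambda a: print("writing", a["output"]),
}

root = ET.fromstring(WORKFLOW)
for step in root.iter("step"):
    HANDLERS[step.attrib["tool"]](step.attrib)
```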

13
QuartPlotR: A quaternary phase diagram tool

Veluchamy, A.; Bowler, C.

2024-03-27 bioinformatics 10.1101/2024.03.22.586216 medRxiv
Top 0.1%
22.5%

Motivation: Large-scale studies involving exploratory data analysis and key discoveries require a platform that provides comprehensive visualization. Density distribution analysis across multiple datasets is intuitive, and its summarization and visualization can reveal biological information. Integrating and visualizing sequence and annotation features in the context of the composition of genomic mutations, microbiota, or populations is significantly challenging. Results: We propose a simple, novel strategy for visualizing multidimensional datasets involving multiple interconnected layers of data distribution. We have implemented this phase diagram in QuartPlotR, an easy-to-use tool for plotting charts from different genomic datasets. A generic data access and plotting framework has been designed and implemented as an R package. Availability: https://github.com/AlagurajVeluchamy/QuartPlotR. Contact: alaguraj.veluchamy@stjude.org. Supplementary information: Supplementary data are available at Bioinformatics online.
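
The geometric core of a quaternary phase diagram is barycentric interpolation: a four-part composition summing to 1 maps to a point inside a tetrahedron spanned by four component vertices. A numpy sketch of just that coordinate computation (QuartPlotR wraps the plotting in R):

```python
# Map 4-part compositions (rows summing to 1) to 3D points inside a
# regular tetrahedron via barycentric interpolation of its vertices.
import numpy as np

# Regular unit-edge tetrahedron vertices, one per component (A, B, C, D).
V = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.5, np.sqrt(3) / 2, 0.0],
              [0.5, np.sqrt(3) / 6, np.sqrt(6) / 3]])

comps = np.array([[0.25, 0.25, 0.25, 0.25],   # balanced -> centroid
                  [0.70, 0.10, 0.10, 0.10],
                  [0.00, 0.50, 0.50, 0.00]])
assert np.allclose(comps.sum(axis=1), 1.0)

points = comps @ V  # barycentric combination, one 3D point per row
print(points)
```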

14
VaxLLM: Leveraging Fine-tuned Large Language Model for automated annotation of Brucella Vaccines

Li, X.; Zheng, Y.; Hu, J.; Zheng, J.; Wang, Z.; He, Y.

2024-11-26 bioinformatics 10.1101/2024.11.25.625209 medRxiv
Top 0.1%
22.5%

Background: Vaccines play a vital role in enhancing immune defense and protecting hosts against a wide range of diseases. However, vaccine annotation remains a labor-intensive task due to the ever-increasing volume of scientific literature. This study explores the application of Large Language Models (LLMs) to automate the classification and annotation of scientific literature on vaccines, exemplified by Brucella vaccines. Results: We developed a pipeline to automatically classify and annotate Brucella vaccine-related articles using their titles and abstracts. The pipeline includes VaxLLM (Vaccine Large Language Model), a fine-tuned Llama 3 model. VaxLLM systematically classifies articles by identifying the presence of vaccine formulations and extracts key information about vaccines, including vaccine antigen, vaccine formulation, vaccine platform, host species used as animal models, and experiments used to investigate the vaccine. The model demonstrated high performance in classification (precision: 0.90, recall: 1.0, F1-score: 0.95) and annotation accuracy (97.9%), significantly outperforming the corresponding non-fine-tuned Llama 3 model. The outputs from VaxLLM are presented in a structured format to facilitate integration into databases such as the VIOLIN vaccine knowledgebase. To further enhance the accuracy and depth of the Brucella vaccine data annotations, the pipeline also incorporates PubTator, enabling cross-comparison with VaxLLM annotations and supporting downstream analyses like gene enrichment. Conclusion: VaxLLM rapidly and accurately extracts detailed, itemized vaccine information from publications, significantly outperforming traditional annotation methods in both speed and precision. VaxLLM also shows great potential for automating knowledge extraction in the domain of vaccine research. Availability: All data are available at https://github.com/xingxianli/VaxLLM, and the model has also been uploaded to HuggingFace (https://huggingface.co/Xingxian123/VaxLLM).
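
The reported classification metrics are internally consistent; the harmonic mean of the stated precision and recall reproduces the stated F1:

```python
# Consistency check of the reported metrics: F1 from precision/recall.
precision, recall = 0.90, 1.0
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # 0.947, which rounds to the reported 0.95
```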

15
Gexplora - user interface that highlights and explores the density of genomic elements along a chromosomal sequence

Nussbaumer, T.; Debnath, O.; Heidari, P.

2020-04-05 genomics 10.1101/2020.04.04.025379 medRxiv
Top 0.1%
22.1%

The density of genomic elements, such as genes or transposable elements, along a chromosomal sequence can provide an overview of a genome, while in the detailed analysis of candidate genes it may reveal enriched chromosomal hotspots harbouring genes that explain a certain trait. The Python-based graphical user interface Gexplora presented here allows users to obtain more information about a genome by combining sequence-intrinsic information with external databases such as Ensembl, OMA, and STRING, using REST API calls to retrieve protein-protein interaction datasets and orthologous groups. Gexplora is available at https://github.com/nthomasCUBE/Gexplora.
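
The kind of REST retrieval Gexplora performs can be reproduced directly against the Ensembl REST API, which exposes an overlap/region endpoint for features in a genomic window. The sketch below fetches genes in a 1 Mb human region and bins them into a crude density track; endpoint and field names follow the public Ensembl REST documentation as we read it.

```python
# Fetch genes in a genomic window from Ensembl REST and bin into a
# simple text density track (gene starts per 100 kb).
import requests

region = "7:140000000-141000000"
r = requests.get(
    f"https://rest.ensembl.org/overlap/region/human/{region}",
    params={"feature": "gene"},
    headers={"Content-Type": "application/json"},
)
r.raise_for_status()
genes = r.json()

start0, bin_size, n_bins = 140_000_000, 100_000, 10
counts = [0] * n_bins
for g in genes:
    b = (g["start"] - start0) // bin_size
    if 0 <= b < n_bins:
        counts[b] += 1

for i, c in enumerate(counts):
    print(f"{start0 + i * bin_size:>11,}  {'#' * c}")
```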

16
BEDMS: A metadata standardizer for genomic region attributes

Tambe, S.; Khoroshevskyi, O.; Park, S.-H.; LeRoy, N. J.; Campbell, D. R.; Zheng, G.; Zhang, A.; Sheffield, N. C.

2024-09-23 genomics 10.1101/2024.09.18.613791 medRxiv
Top 0.1%
21.6%

High-throughput sequencing technologies have generated vast omics data annotating genomic regions. A challenge arises in integrating these data because the associated metadata do not follow a uniform schema. This hinders data management, discovery, interoperability, and reusability. Existing tools that address metadata standardization are generally limited in scope, targeted toward specific datasets or data types, and not applicable to custom schemas. To improve the standardization of genomic interval metadata, we have developed BEDMS. We developed and evaluated several model architectures and trained models that achieved high performance on held-out data. With a trained model, BEDMS provides users with predicted standardized metadata attributes that follow a standardized schema. Furthermore, BEDMS provides the ability to train custom models. To demonstrate this, we trained BEDMS on three different schemas, allowing users to choose which schema to standardize into. We also deployed BEDMS on PEPhub, which provides a graphical user interface that allows users to standardize metadata without requiring any local training or software at all. In conclusion, BEDMS offers a practical one-stop solution for metadata management and standardization of genomic interval data.
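
What BEDMS learns can be caricatured as mapping free-form metadata column headers to standard schema attributes. The deliberately tiny classifier below (character n-grams plus logistic regression) illustrates the task only; it is not the paper's architecture, and the training pairs are invented.

```python
# Toy attribute-name standardizer: raw header -> standard schema key.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

raw = ["cell line", "cell_line_name", "CellLine", "tissue type",
       "tissue_source", "organ", "genome build", "assembly", "ref_genome"]
standard = ["cell_line", "cell_line", "cell_line", "tissue", "tissue",
            "tissue", "genome", "genome", "genome"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(raw, standard)

print(model.predict(["Cell-Line", "source tissue", "genome_assembly"]))
```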

17
Advanced Research Infrastructure for Experimentation in genomicS (ARIES): a lustrum of Galaxy experience

Knijn, A.; Michelacci, V.; Orsini, M.; Morabito, S.

2020-05-16 bioinformatics 10.1101/2020.05.14.095901 medRxiv
Top 0.1%
19.9%

Background: With the introduction of Next-Generation Sequencing (NGS) and Whole-Genome Sequencing (WGS) in microbiology and molecular epidemiology, the development of an information system for the collection of genomic and epidemiological data, and for subsequent transparent and reproducible data analysis, became indispensable. Further requirements for the system included accessibility and ease of use for bioinformaticians as well as for scientists unfamiliar with the command line. Findings: The ARIES (Advanced Research Infrastructure for Experimentation in genomicS, https://aries.iss.it) platform was implemented in 2015 as an instance of the Galaxy framework specific to the use of WGS in molecular epidemiology. Here, the experience with ARIES is reported. Conclusions: During its five years of existence, ARIES has grown into a well-established reality, not only as a web service but also as a workflow engine for the Integrated Rapid Infectious Disease Analysis (IRIDA) platform. In effect, an environment has been created in which complex bioinformatic tools are implemented in an easy-to-use context, allowing scientists to concentrate on what to do instead of how to do it.

18
Empirical study on software and process quality in bioinformatics tools

Ferenc, K.; Otto, K.; de Oliveira Neto, F. G.; Davila Lopez, M.; Horkoff, J.; Schliep, A.

2022-03-13 bioinformatics 10.1101/2022.03.10.483804 medRxiv
Top 0.1%
18.9%

Software quality in computational tools impacts research output in a variety of scientific disciplines. Biology is one of these fields; especially for High-Throughput Sequencing (HTS) data, such tools play an important role. This study therefore characterises the overall quality of a selection of tools that are frequently part of HTS pipelines, and analyses the maintainability and process quality of a selection of HTS alignment tools. Our findings highlight the most pressing issues and point to software engineering best practices for improving maintenance and process quality. To help future research, we share the tooling for the static code analysis with SonarCloud, which we used to collect data on the maintainability of different alignment tools. The results of the analysis show that the maintainability level is generally high but trends towards increasing technical debt over time. We also observed that development activities on alignment tools are generally driven by very few developers and do not utilise modern tooling to their advantage. Based on these observations, we recommend actions to improve both maintainability and process quality in open-source alignment tools. These actions include improvements in tooling, such as the use of linters, as well as better documentation of architecture and features. We encourage developers to use these tools in order to ease future maintenance efforts, improve user experience, support reproducibility, and ultimately increase the quality of research by increasing the quality of research software tools.

19
VirusDIP: Virus Data Integration Platform

Wang, L.; Chen, F.; Guo, X.; You, L.; Yang, X.; Yang, F.; Yang, T.; Gao, F.; Hua, C.; Ding, Y.; Cai, J.; Yang, L.; Huang, W.; Xu, Z.; Wan, B.; Tong, J.; Peng, C.; Yang, Y.; Zhang, L.; Liu, K.; Zhou, F.; Zhang, M.; Tan, C.; Zeng, W.; Wang, B.; Wei, X.

2020-06-09 bioinformatics 10.1101/2020.06.08.139451 medRxiv
Top 0.1%
18.9%

Motivation: The Coronavirus Disease 2019 (COVID-19) pandemic poses a huge threat to human public health. Viral sequence data play an important role in the scientific prevention and control of epidemics, and a comprehensive virus database is vital for virus data retrieval and deep analysis. To promote the sharing of virus data, several virus databases and related analysis tools have been created. Results: To facilitate virus research and promote the global sharing of virus data, we present VirusDIP, a one-stop service platform for the archiving, integration, access, and analysis of virus data. It accepts submissions of viral sequence data from all over the world and currently integrates data resources from the China National GeneBank DataBase (CNGBdb), the Global Initiative on Sharing All Influenza Data (GISAID), and the National Center for Biotechnology Information (NCBI). Moreover, on top of these comprehensive data resources, the BLAST sequence alignment tool and multi-party secure computing tools are deployed for multiple sequence alignment, phylogenetic tree building, and trusted global sharing. VirusDIP is gradually establishing cooperation with more databases, paving the way for analyses of virus origin and evolution. All public data in VirusDIP are freely available to all researchers worldwide. Availability: https://db.cngb.org/virus/ Contact: weixiaofeng@cngb.org
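
Deploying BLAST for sequence comparison, as the platform does, corresponds locally to building a nucleotide database and querying it. A sketch using the standard NCBI BLAST+ command-line tools, with placeholder file and database names (requires makeblastdb and blastn on PATH):

```python
# Build a nucleotide BLAST database and query it; file names are
# placeholders, flags are standard NCBI BLAST+ options.
import subprocess

# Build a database once from collected viral genomes.
subprocess.run(["makeblastdb", "-in", "viral_genomes.fasta",
                "-dbtype", "nucl", "-out", "virusdb"], check=True)

# Query it with a new isolate; tabular output, top hits only.
result = subprocess.run(
    ["blastn", "-query", "isolate.fasta", "-db", "virusdb",
     "-outfmt", "6 qseqid sseqid pident length evalue",
     "-max_target_seqs", "5"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```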

20
Large Language Models for Accessible Reporting of Bioinformatics Analyses in Interdisciplinary Contexts

Yu, L.; Kim, D.; Cao, Y.; Shu, M. W. S.; Shen, M.; Liang, X.; Gu, J.; Jayakumar, R.; Ding, W.; Yang, F.; Zhang, X.; Kim, J.; Yang, P.; Yang, J. Y. H.

2025-11-11 bioinformatics 10.1101/2025.11.09.687479 medRxiv
Top 0.1%
18.7%

Health and life scientists routinely collaborate with quantitative scientists for data analysis and interpretation, yet miscommunication often obscures the interpretation of complex results. Large Language Models (LLMs) offer a promising way to bridge this gap, but their cross-discipline interpretative skill remains limited on real-world bioinformatics analyses. We therefore benchmarked four state-of-the-art LLMs (GPT-4o, o1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) using automated and human evaluation frameworks to ensure holistic evaluation. The automated assessment employed multiple-choice questions designed using Bloom's taxonomy to assess multiple levels of understanding, while the human evaluation tasked scientists with scoring summaries for factual consistency, lack of harmfulness, comprehensiveness, and coherence. All models generally produced readable and largely safe summaries, confirming their value for first-pass translation of technical analyses; however, they frequently misinterpreted visualisations, produced verbose summaries, and rarely offered novel insights beyond what was already contained in the analyses. Our findings suggest that LLMs are best suited to easing interdisciplinary communication rather than replacing domain expertise, and that human oversight remains essential to guarantee accuracy, interpretative depth, and the generation of genuinely novel scientific insights.
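
The automated arm of such a benchmark reduces to grading multiple-choice answers and aggregating accuracy per Bloom's-taxonomy level. A skeleton with invented answer records (the paper's question bank and scores are in the preprint itself):

```python
# Aggregate MCQ accuracy per (model, Bloom level); records are invented.
from collections import defaultdict

# (model, bloom_level, correct?) stand-in results:
records = [
    ("gpt-4o", "remember", True),    ("gpt-4o", "analyze", False),
    ("gpt-4o", "analyze", True),     ("claude-3.7", "remember", True),
    ("claude-3.7", "analyze", True), ("claude-3.7", "analyze", True),
]

tally = defaultdict(lambda: [0, 0])  # (model, level) -> [correct, total]
for model, level, ok in records:
    tally[(model, level)][0] += ok
    tally[(model, level)][1] += 1

for (model, level), (c, n) in sorted(tally.items()):
    print(f"{model:12s} {level:10s} {c}/{n} = {c / n:.2f}")
```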