
SoftwareX

Elsevier BV

Preprints posted in the last 30 days, ranked by how well they match SoftwareX's content profile, based on 15 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
ARACRA: Automated RNA-seq Analysis for Chemical Risk Assessment

Sharma, S.; Kumar, S.; Brull, J. B.; Deepika, D.; Kumar, V.

2026-04-09 bioinformatics 10.64898/2026.04.07.716912 medRxiv
Top 0.1%
2.4%

Transcriptomic analysis is a powerful approach for biomarker discovery; however, extracting meaningful biological insights from large-scale omics datasets remains a challenge for biologists. To address this gap, we present ARACRA, a fully automated RNA-seq analysis pipeline covering the entire transcriptomics workflow, from raw FASTQ files to the transcriptomic Point of Departure (tPoD), with a human-in-the-loop review process. The analysis is performed in two phases: Phase 1 carries out acquisition of raw reads, pre-alignment quality control, alignment to a reference genome, and quantification of gene expression, while Phase 2 performs statistical analysis, including differential gene expression analysis and dose-response modelling. The two phases are separated by an extensive quality-control step that lets the user visually inspect the quality of the processed data and helps filter out noise and outlier samples. ARACRA facilitates end-to-end analysis of RNA-seq data through an interactive web-based application built on Nextflow and Streamlit, minimizing computational complexity while ensuring correct downstream processing. Availability and implementation: ARACRA is freely available on GitHub under the MIT License, together with a Streamlit-based web application. Researchers can use the demo data or upload their own data for analysis.

Fig 1: Overall Architecture of ARACRA

2
fishROI: A specialized workflow for semi-automated muscle morphometry analysis in teleosts

Lu, Y.; Pan, M.; Jamwal, V.; Locop, J.; Ruparelia, A. A.; Currie, P. D.

2026-03-30 cell biology 10.64898/2026.03.27.714781 medRxiv
Top 0.1%
2.4%

Quantitative histological analysis of skeletal muscle morphometry provides critical insights into muscle physiology but remains labor-intensive and technically demanding. While recent developments in machine-learning-based image segmentation techniques have facilitated large-scale tissue analysis, existing tools that automate muscle morphometry analysis are largely tailored to mammalian models, with limited applicability to teleosts. Moreover, there is a lack of effective tools for visualizing spatial organization and morphometric variability of teleost muscle fibers, a feature that is important for understanding hyperplastic muscle growth dynamics in teleosts. In this study, we show that cytoplasmic staining combined with deep learning-based cell segmentation offers a robust and accurate approach for automated muscle morphometry analysis in developing zebrafish. We also introduce a FIJI plugin, implemented in Jython, that streamlines both morphometric analysis and visualization. This tool accommodates shallow and deep learning-based segmentation techniques and incorporates novel quantification and visualization methods suited to teleost-specific muscle features, including mosaic hyperplasia dynamics. The plugin features an intuitive graphical user interface and is designed for flexibility, with minimal constraints regarding species, image quality, or staining protocol. Its modular architecture allows it to be used as a baseline for automated muscle morphometry analysis, while permitting integration with other tools and workflows.
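
The core morphometric readout described above, per-fiber cross-sectional area and its variability, can be sketched in a few lines. This is a generic illustration working from a labeled segmentation mask, not the plugin's Jython implementation; the mask layout and pixel size are hypothetical.

```python
from collections import Counter
import statistics

def fiber_areas(label_mask, pixel_size_um=1.0):
    """Compute per-fiber cross-sectional areas from a labeled mask.

    label_mask: 2D list of ints; 0 = background, k > 0 = fiber id.
    Returns {fiber_id: area in um^2}.
    """
    counts = Counter(px for row in label_mask for px in row if px > 0)
    scale = pixel_size_um ** 2
    return {fid: n * scale for fid, n in counts.items()}

def area_cv(areas):
    """Coefficient of variation of fiber areas, a simple proxy for the
    size heterogeneity produced by mosaic hyperplasia."""
    vals = list(areas.values())
    return statistics.pstdev(vals) / statistics.mean(vals)

mask = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]
areas = fiber_areas(mask, pixel_size_um=0.5)
# fiber 1 covers 4 px, fiber 2 covers 3 px, at 0.25 um^2 per pixel
print(areas)  # {1: 1.0, 2: 0.75}
```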

3
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays

Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.

2026-04-10 bioinformatics 10.64898/2026.04.08.717207 medRxiv
Top 0.2%
1.7%

Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.
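
The correlation analysis at the heart of FFS can be illustrated directly. The sketch below computes the normalized autocorrelation G(tau) = <dF(t) dF(t+tau)> / <F>^2 from a raw intensity trace by brute force; BrightEyes-FFS itself provides several (and far faster) correlation algorithms, so this is illustrative only.

```python
def autocorrelation(trace, max_lag):
    """Normalized fluorescence autocorrelation G(tau) for integer lags.

    G(tau) = <dF(t) * dF(t + tau)> / <F>^2, with dF = F - <F>.
    """
    n = len(trace)
    mean = sum(trace) / n
    dF = [f - mean for f in trace]
    out = []
    for lag in range(max_lag + 1):
        pairs = [dF[t] * dF[t + lag] for t in range(n - lag)]
        out.append(sum(pairs) / len(pairs) / mean ** 2)
    return out

# A perfectly alternating trace anti-correlates at lag 1:
g = autocorrelation([2, 4, 2, 4], max_lag=1)
print(g)  # [0.1111..., -0.1111...]
```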

4
BioDCASE: Using data challenges to make community advances in computational bioacoustics

Stowell, D.; Nolasco, I.; McEwen, B.; Vidana Vila, E.; Jean-Labadye, L.; Benhamadi, Y.; Lostanlen, V.; Dubus, G.; Hoffman, B.; Linhart, P.; Morandi, I.; Cazau, D.; White, E.; White, P.; Miller, B.; Nguyen Hong Duc, P.; Schall, E.; Parcerisas, C.; Gros-Martial, A.; Moummad, I.

2026-04-06 animal behavior and cognition 10.64898/2026.04.02.716062 medRxiv
Top 0.2%
1.7%

Computational bioacoustics has seen significant advances in recent decades. However, the rate of insights from automated analysis of bioacoustic audio lags behind the rate at which data are collected, due to key capacity constraints in data annotation and bioacoustic algorithm development. Gaps in analysis methodology persist not because they are intractable, but because of resource limitations in the bioacoustics community. To bridge these gaps, we advocate the open science method of data challenges, structured as public contests. We conducted a bioacoustics data challenge named BioDCASE within the format of an existing event (DCASE). In this work we report on the procedures needed to select and then conduct useful bioacoustics data challenges. We consider aspects of task design such as dataset curation, annotation, and evaluation metrics. We report the three tasks included in BioDCASE 2025 and the resulting progress made. Based on this, we make recommendations for open community initiatives in computational bioacoustics.

5
A Web Application for Exploring the Distribution of Academic Publications Across Geography and Institutions in India

Hou, Y.; Cohen, E.; Higginbottom, J.; Rountree, L.; Ren, Y.; Wahl, B.; Nyhan, K.; Mukherjee, B.

2026-03-20 health informatics 10.64898/2026.03.18.26348755 medRxiv
Top 0.2%
1.6%

India's national research capacity and infrastructure are unevenly distributed across states and union territories (UTs), contributing to geographic variation in academic publication output. We developed Indiapub, an open-access web application that quantitatively enumerates and visually displays geographic and temporal publication patterns for research products with at least one author affiliated with an Indian institution, using OpenAlex data. The app is designed for ease of use, with automated data retrieval, cleaning, and aggregation. Indiapub allows users to filter publications by topic, publication year range, author position, publication type, minimum citation count, state/UT, and population size of the state/UT where the author institution is located. The app also provides downloadable tables and ranked institution lists by publication count. Its interactive dashboard includes five modules: (i) a map of publication distribution, (ii) time trend plots for nation and state/UT, (iii) publication-share versus population-share plots highlighting over- and underrepresentation, (iv) stacked bar charts of state/UT contributions over time with population benchmarks, and (v) bubble plots relating the Human Development Index to publication volume over time. This tool may support resource prioritization and identification of institutional strengths for trainees, researchers, higher education administrators, and policymakers. To illustrate its utility, we present sample findings derived from the app. For publications across all topics from 2014 to 2025, the largest research participation footprints were observed in Tamil Nadu, Maharashtra, Delhi, Uttar Pradesh, and Karnataka. Tamil Nadu and Delhi were home to three of the highest-publishing institutions nationally: Vellore Institute of Technology, All India Institute of Medical Sciences, and Indian Institute of Technology Delhi. 
We also examined six curated case studies of broad scientific interest: electronic health records (EHR), genome-wide association studies (GWAS), artificial intelligence (AI), development economics, environmental science, and COVID-19. Findings from these case studies revealed over- and underrepresentation in publication output across states and UTs. For example, in EHR publications among high-population states, Tamil Nadu's publication share exceeded its population share by 31.3 percentage points (pp), whereas Bihar's was 12.8 pp lower. Our tool offers insights into India's research landscape across states and UTs with easy-to-digest visuals. Such interactive tools have the potential to serve as a starting point for fostering a more inclusive research ecosystem supporting targeted research policy and planning.
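
The over/underrepresentation statistic quoted in the case studies (e.g. Tamil Nadu's +31.3 pp in EHR publications) is simply the difference between a state's publication share and its population share, in percentage points. A minimal sketch with made-up numbers:

```python
def share_gap_pp(pubs, pops, state):
    """Publication share minus population share, in percentage points (pp).

    pubs, pops: dicts mapping state/UT -> publication count / population.
    Positive = overrepresented relative to population; negative = under.
    """
    pub_share = 100 * pubs[state] / sum(pubs.values())
    pop_share = 100 * pops[state] / sum(pops.values())
    return pub_share - pop_share

# Illustrative counts for three hypothetical states:
pubs = {"A": 600, "B": 300, "C": 100}
pops = {"A": 2_000_000, "B": 5_000_000, "C": 3_000_000}
print(round(share_gap_pp(pubs, pops, "A"), 1))  # 40.0  (60% of pubs, 20% of people)
```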

6
MOAflow: how re-designing a pipeline with Nextflow streamlines data analysis

Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.

2026-03-30 bioinformatics 10.64898/2026.03.26.713914 medRxiv
Top 0.2%
1.3%

Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analysis of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, is a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data from the original article were used to benchmark the new pipeline; its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting today's computational demands.

7
Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs

Rogers, E. B. T.; Lakkimsetty, S. S.; Bemis, K. A.; Schurman, C. A.; Angel, P. A.; Schilling, B.; Vitek, O.

2026-04-10 bioinformatics 10.64898/2026.04.08.717212 medRxiv
Top 0.2%
1.2%

Mass spectrometry imaging (MSI) characterizes the spatial heterogeneity of molecular abundances in biological samples. Experiments with complex designs, involving multiple conditions and multiple samples, provide particularly useful insight into differential abundance of analytes. However, analyses of these experiments require attention to details such as signal processing, selection of regions of interest, and statistical methodology. This manuscript contributes a statistical analysis workflow for detecting differentially abundant analytes in MSI experiments with complex designs. Using a case study of histologic samples of human tibial plateaus from knees of osteoarthritis patients and cadaveric controls, as well as simulated datasets, we illustrate the impact of the analysis decisions. We illustrate the importance of signal processing and feature aggregation for preserving biological relevance and alleviating the stringency of multiple testing. We further demonstrate the importance of selecting regions of interest in ways that are compatible with differential analysis. Finally, we contrast several common statistical models for differential analysis, showcase the appropriate use of replication, and demonstrate model-based calculation of sample size for follow-up investigations. The discussion is accompanied by detailed recommendations and an open-source R-based implementation that can be followed by other investigations.
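
One concrete piece of the statistical methodology discussed, multiple-testing correction, can be illustrated with the standard Benjamini-Hochberg procedure. This is a generic sketch, not the paper's R workflow (which pairs correction with feature aggregation):

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values).

    For p-values sorted ascending, q at rank r is p_(r) * n / r,
    made monotone by taking a running minimum from the largest rank down.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from largest p downward
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.002]))
```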

8
Introducing circStudio, a Python package for preprocessing, analyzing and modeling actigraphy data

Marques, D.; Barbosa-Morais, N. L.; Reis, C. C. P.

2026-04-01 bioinformatics 10.64898/2026.03.30.711342 medRxiv
Top 0.3%
1.1%

Actigraphy is a non-invasive and cost-effective method for monitoring behavioral rhythms under real-world conditions by collecting time-resolved measurements of locomotor activity, light exposure, and temperature. Although several open-source packages support specific aspects of actigraphy analysis, functionality such as preprocessing, metric calculation, and mathematical modeling is often distributed across separate software packages, limiting interoperability and increasing programming overhead. Here we introduce circStudio, a Python package that unifies actigraphy data processing and mathematical modeling of circadian rhythms within a single framework. Built from the pyActigraphy codebase and integrating circadian models from the Arcascope circadian package, circStudio provides flexible preprocessing tools, support for multiple actigraphy file formats through adaptor classes, standalone functions for computing commonly used actigraphy metrics, and implementations of several mathematical models of circadian rhythms. The package enables users to move efficiently from raw wearable data to physiologically interpretable circadian outputs. Ultimately, circStudio aims to facilitate reproducible workflows and to provide a flexible foundation for research applications across circadian biology, sleep science, and digital health.
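
As a flavor of the "commonly used actigraphy metrics" mentioned, here is a stdlib-only sketch of the standard nonparametric M10/L5 statistics (mean activity over the most active 10 and least active 5 consecutive hours). The function name and input layout are hypothetical, not circStudio's API:

```python
def most_least_active(hourly, window_active=10, window_rest=5):
    """M10 / L5: mean activity over the most active 10 and least active 5
    consecutive hours of a 24-h activity profile (wrapping past midnight)."""
    n = len(hourly)

    def window_mean(start, width):
        # circular window so a rest period spanning midnight is handled
        return sum(hourly[(start + k) % n] for k in range(width)) / width

    m10 = max(window_mean(s, window_active) for s in range(n))
    l5 = min(window_mean(s, window_rest) for s in range(n))
    return m10, l5

# Synthetic day: active 08:00-18:00, quiet otherwise (arbitrary counts)
hourly = [5] * 8 + [100] * 10 + [5] * 6
m10, l5 = most_least_active(hourly)
print(m10, l5)  # 100.0 5.0
```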

9
TRaP: An Open-source, Reproducible Framework for Raman Spectral Preprocessing across Heterogeneous Systems

Zhu, Y.; Lionts, M. M.; Haugen, E.; Walter, A. B.; Voss, T. R.; Grow, G. R.; Liao, R.; McKee, M. E.; Locke, A.; Hiremath, G.; Mahadevan-Jansen, A.; Huo, Y.

2026-03-27 bioengineering 10.64898/2026.03.26.714582 medRxiv
Top 0.3%
0.9%

Raman spectroscopy offers a uniquely rich window into molecular structure and composition, making it a powerful tool across fields ranging from materials science to biology. However, the reproducibility of Raman data analysis remains a fundamental bottleneck. In practice, transforming raw spectra into meaningful results is far from standardized: workflows are often complex, fragmented, and implemented through highly customized, case-specific code. This challenge is compounded by the lack of unified open-source pipelines and the diversity of acquisition systems, each introducing its own file formats, calibration schemes, and correction requirements. Consequently, researchers must frequently rely on manual, ad hoc reconciliation of processing steps. To address this gap, we introduce TRaP (Toolbox for Reproducible Raman Processing), an open-source, GUI-based Python toolkit designed to bring reproducibility, transparency, and portability to Raman spectral analysis. TRaP unifies the entire preprocessing-to-analysis pipeline within a single, coherent framework that operates consistently across heterogeneous instrument platforms (e.g., Cart, Portable, Renishaw, and MANTIS). Central to its design is the concept of fully shareable, declarative workflows: users can encode complete processing pipelines into a single configuration file (e.g., JSON), enabling others to reproduce results instantly without reimplementing code or reverse-engineering undocumented steps. Beyond convenience, TRaP integrates configuration management, x-axis calibration, spectral response correction, interactive processing, and batch execution into a workflow-driven architecture that enforces deterministic, repeatable operations. Every transformation is explicitly recorded, making the full processing history transparent, inspectable, and reproducible. This eliminates ambiguity in how results are generated and ensures that identical protocols can be applied consistently across datasets and experimental contexts. Through representative use cases, we show that TRaP enables seamless, reproducible preprocessing of Raman spectra acquired from diverse platforms within a unified environment. We hope TRaP can establish Raman data processing as a reproducible, shareable, and systematized scientific practice, aligning it with modern standards for computational research. TRaP is released as open-source software at https://github.com/hrlblab/TRaP.
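
The declarative-workflow idea is easy to sketch: a JSON config names processing steps, and a small runner applies them in order while recording the history. The step names and config schema below are hypothetical, not TRaP's actual format:

```python
import json

# Hypothetical step registry; names and parameters are illustrative only.
STEPS = {
    "subtract_baseline": lambda spec, p: [y - p["offset"] for y in spec],
    "normalize_max":     lambda spec, p: [y / max(spec) for y in spec],
}

def run_pipeline(spectrum, config_json):
    """Apply the steps listed in a declarative JSON config, recording each
    applied step so the processing history stays inspectable."""
    config = json.loads(config_json)
    history = []
    for step in config["pipeline"]:
        spectrum = STEPS[step["name"]](spectrum, step.get("params", {}))
        history.append(step["name"])
    return spectrum, history

config = """{
  "pipeline": [
    {"name": "subtract_baseline", "params": {"offset": 10}},
    {"name": "normalize_max"}
  ]
}"""
out, history = run_pipeline([10, 30, 50], config)
print(out, history)  # [0.0, 0.5, 1.0] ['subtract_baseline', 'normalize_max']
```

Because the config file fully determines the transformation sequence, sharing it is enough for someone else to reproduce the result on the raw data.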

10
NeoDBS: Open-Source Platform for Visualization and Analysis of Electrophysiological Recordings from Deep Brain Stimulation Systems

Rodrigues, L.; Ferreira, A.; Pereira, I.; Moreira, R.; Jacinto, L.

2026-03-30 bioengineering 10.64898/2026.03.27.714691 medRxiv
Top 0.3%
0.8%

Optimization of deep brain stimulation (DBS) therapy for neurological and neuropsychiatric disorders depends on objective quantitative biomarkers that can guide stimulation parameter adjustments. With the recent introduction of new-generation DBS systems capable of simultaneously stimulating brain activity and recording local field potentials (LFP), there is increasing demand for platforms that enable efficient visualization and analysis of these signals for electrophysiological biomarker identification. To address the limitations of currently available toolboxes, which require advanced signal processing skills and rely on proprietary software, we present NeoDBS, an open-source Python platform designed for ingestion, advanced visualization, and processing of LFP signals from DBS systems through an easy-to-use graphical interface. NeoDBS is a user-centered platform that offers predefined analysis pipelines with the aim of facilitating electrophysiological biomarker investigation for DBS across different brain disorders. Custom analysis pipelines are also available for users to tailor the signal analysis tools to their research needs. Critical functionalities for longitudinal biomarker research are featured in NeoDBS, such as batch file processing and event-locked analysis for in-clinic and at-home recordings. This combination of accessibility, user experience, and advanced signal processing tools makes NeoDBS an environment that enables easy and fast electrophysiological biomarker research for DBS across patients, sessions, and stimulation parameters.
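
As a flavor of the kind of LFP analysis such a platform automates, the sketch below estimates band-limited power (beta band, 13-30 Hz, a commonly studied DBS biomarker) with a naive stdlib-only DFT. This is purely illustrative; NeoDBS's actual pipelines are not shown here, and real analyses would use faster spectral estimators on much longer recordings:

```python
import math

def band_power(signal, fs, f_lo, f_hi):
    """Power in the [f_lo, f_hi] Hz band via a naive DFT (stdlib only).

    Fine for short illustrative traces; production pipelines would use
    Welch-style estimates instead of an O(n^2) DFT.
    """
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            power += (re * re + im * im) / n ** 2
    return power

fs = 250                                   # Hz, synthetic sampling rate
t = [i / fs for i in range(fs)]            # one second of samples
lfp = [math.sin(2 * math.pi * 20 * ti) for ti in t]  # pure 20 Hz "beta" tone
beta = band_power(lfp, fs, 13, 30)
alpha = band_power(lfp, fs, 8, 12)
print(beta > 10 * alpha)  # True
```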

11
Validation and optimisation of wearable accelerometer data pre-processing for digital measure implementation and development

Langford, J.; Chua, J. Y.; Long, I.; Williams, A. C.; Hillsdon, M.

2026-03-24 animal behavior and cognition 10.64898/2026.03.21.713324 medRxiv
Top 0.4%
0.8%

The increasing use of accelerometers as digital health technologies in clinical trials and clinical care is driving the need for data processing to meet medical standards. The aim of this study was to create and test a modular pipeline for the pre-processing of high-resolution accelerometry that assures the quality, transparency and traceability of digital measures from sensor-level data. The objective is for the pipeline to be a foundational layer in the development, implementation and comparison of measures. The study developed the open-source GENEAcore package to meet the requirements of regulators, verifying the engineering implementation and analytically validating outputs against reference datasets. Early stages included the optimisation of calibration and non-wear detection. Data-driven detection of behavioural transitions was then validated to give direct bout outputs without the need to identify rules for epoch aggregation and interruptions. The utility for measure development was shown by comparing two algorithms for the characterisation of activity intensity in both the epoch and bout paradigms. Non-wear was detected with a balanced accuracy of 92.3%, and the commonly used 13 mg acceleration standard deviation threshold was empirically validated for the first time. The detection of transitions proved reliable, with 99% detected, on average, within 2 seconds of their occurrence, giving a mean expected event duration of 68.6 s from a log-normal distribution. The different activity intensity algorithms were more than 99% concordant during movement, but their outputs diverged in low movement conditions. Importantly, variable-duration bouts produced 31% higher daily activity durations compared to 1-second epochs. This evaluation of pre-processing steps has confirmed the attention to detail required to create robust and reproducible results for later clinical validation, where small changes in an algorithm or its implementation may have clinically meaningful consequences.
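
The 13 mg standard-deviation criterion for non-wear mentioned above can be sketched simply: compute the acceleration SD per window and flag windows below threshold. This is an illustrative toy, not GENEAcore's implementation (which also considers temperature and multi-axis context):

```python
import statistics

def nonwear_windows(accel_g, fs, window_s=60, sd_threshold_g=0.013):
    """Flag windows whose acceleration standard deviation falls below a
    13 mg threshold, a common proxy for a stationary (unworn) device.

    accel_g: acceleration magnitude samples in g; fs: sampling rate in Hz.
    Returns one boolean per non-overlapping window (True = non-wear).
    """
    window = int(window_s * fs)
    flags = []
    for start in range(0, len(accel_g) - window + 1, window):
        sd = statistics.pstdev(accel_g[start:start + window])
        flags.append(sd < sd_threshold_g)
    return flags

fs = 10
worn = [1.0 + 0.05 * (-1) ** i for i in range(60 * fs)]  # +/-50 mg wiggle
unworn = [1.0] * (60 * fs)                               # flat at 1 g
print(nonwear_windows(worn + unworn, fs))  # [False, True]
```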

12
StrucTTY: An Interactive, Terminal-Native Protein Structure Viewer

Jang, L. S.-e.; Cha, S.; Steinegger, M.

2026-03-19 bioinformatics 10.64898/2026.03.17.712308 medRxiv
Top 0.4%
0.8%

Terminal-based workflows are central to large-scale structural biology, particularly in high-performance computing (HPC) environments and SSH sessions. Yet no existing tool enables real-time, interactive visualization of protein backbone structures directly within a text-only terminal. To address this gap, we present StrucTTY, a fully interactive, terminal-native protein structure viewer. StrucTTY is a single self-contained executable that loads multiple PDB and mmCIF files, normalizes three-dimensional coordinates, and renders protein structures as ASCII graphics. Users can rotate, translate, and zoom in on structures, adjust visualization modes, inspect chain-level features, and view secondary structure assignments. The tool supports simultaneous visualization of up to nine protein structures and can directly display structural alignments using Foldseek's output, enabling rapid comparative analysis in headless environments. The source code is available at https://github.com/steineggerlab/StrucTTY.

Key Messages:
- Real-time, interactive protein structure visualization directly within text-only terminals
- ASCII-based, depth-aware rendering of PDB and mmCIF backbone structures
- Multi-structure comparison with direct application of Foldseek alignment transformations
- Designed for headless workflows on remote servers and HPC systems
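
The depth-aware ASCII rendering idea can be sketched in plain Python: orthographically project 3D coordinates onto a character grid, keep the nearest point per cell (a one-deep z-buffer), and pick a denser glyph for nearer points. This is a toy illustration of the general technique, not StrucTTY's renderer:

```python
def render_ascii(points, width=24, height=8, palette=".:-=+*#%@"):
    """Orthographic projection of 3D points onto a character grid,
    shading each occupied cell by depth (nearer = denser glyph)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    zs = [p[2] for p in points]

    def norm(v, lo, hi, span):
        # map v in [lo, hi] onto an integer in [0, span - 1]
        return 0 if hi == lo else min(span - 1, int((v - lo) / (hi - lo) * span))

    grid = [[" "] * width for _ in range(height)]
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points:
        col = norm(x, min(xs), max(xs), width)
        row = norm(y, min(ys), max(ys), height)
        if depth[row][col] is None or z < depth[row][col]:  # keep nearest point
            depth[row][col] = z
            shade = norm(max(zs) - z, 0, max(zs) - min(zs), len(palette))
            grid[row][col] = palette[shade]
    return "\n".join("".join(r) for r in grid)

# A synthetic coiled trace stands in for backbone coordinates:
helix = [(i, i % 7, i % 5) for i in range(30)]
print(render_ascii(helix))
```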

13
HybridNet-XR: Efficient Teacher-Free Self-Supervised Learning for Autonomous Medical Diagnostic Systems in Resource-Constrained Environments

Mayala, S.; Mzurikwao, D.; Suluba, E.

2026-03-19 health informatics 10.64898/2026.03.16.26348570 medRxiv
Top 0.4%
0.8%

Deep learning model classification on large datasets is often limited in countries with restricted computational resources. While transfer learning can offset these limitations, standard architectures often maintain a high memory footprint. This study introduces HybridNet-XR, a memory-efficient and computationally lightweight hybrid convolutional neural network (CNN) designed to bridge the domain gap in medical radiography using autonomous self-supervised learning protocols. The HybridNet-XR architecture integrates depthwise separable convolutions for parameter reduction, residual connections for gradient stability, and aggressive early downsampling to minimize the video RAM (VRAM) footprint. We evaluated several training paradigms, including teacher-free self-supervised learning (SSL-SimCLR), teacher-led knowledge distillation (KD), and domain-gap (DG) adaptation. Each variant was pre-trained on ImageNet-1k subsets and fine-tuned on the ChestX6 multi-class dataset. Model interpretability was validated through gradient-weighted class activation mapping (Grad-CAM). The performance frontier analysis identified HybridNet-XR-150-PW (pre-warmed) as the optimal configuration, achieving 93.38% average accuracy and 99% AUC while utilizing only 814.80 MB of VRAM. Regarding class-wise accuracy, this variant significantly outperformed standard MobileNetV2 and teacher-led models in critical diagnostic categories, notably COVID-19 (97.98%) and emphysema (96.80%). Grad-CAM visualizations confirmed that the teacher-free pre-warming phase allows the model to develop a sharper, anatomically grounded focus on pathological landmarks compared to distilled models. Specialized pre-warming schedules offer a viable, computationally autonomous alternative to knowledge distillation for medical imaging. By eliminating the requirement for high-performance teacher models, HybridNet-XR provides a robust and trustworthy diagnostic foundation suitable for clinical deployment in resource-constrained environments.

Author summary: Traditional deep learning models for medical imaging are often too large for the low-power computers available in many global health settings. We developed a new model to bridge this computational gap. We designed HybridNet-XR, a highly efficient AI architecture, and trained it using a "teacher-free" method that doesn't require a massive supercomputer. We found a specific version (H-XR150-PW) that provides high accuracy while using very little memory. Our results show that high-performance diagnostic AI can be deployed on standard, low-cost hardware. Furthermore, using visual heatmaps (Grad-CAM), we showed that the AI correctly identifies medical landmarks like lung opacities, supporting its safe and reliable use in real-world clinical settings.
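
The parameter saving from depthwise separable convolutions, one of the architecture's stated efficiency levers, is simple arithmetic: a standard k x k convolution costs k*k*C_in*C_out weights, while the depthwise-plus-pointwise factorization costs k*k*C_in + C_in*C_out. A quick check with illustrative channel counts (not HybridNet-XR's actual layer configuration):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise mix."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)                 # 73,728 weights
dws = depthwise_separable_params(3, 64, 128)  # 8,768 weights
print(std, dws, round(std / dws, 1))          # roughly an 8x reduction
```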

14
EthoClaw: An Integrated AI Workflow Platform for Automated Analysis in Neuroethology

Chen, K.; Chen, Z.; Zheng, D.; Fang, X.; Liang, J.; Li, Z.; Chen, Y.; Zou, J.; Cai, B.; Chen, S.; Huang, K.

2026-03-27 animal behavior and cognition 10.64898/2026.03.25.714141 medRxiv
Top 0.4%
0.8%

Computational methods have advanced the analysis of animal behavior, yet significant challenges remain in data standardization, analytical reproducibility, and workflow integration. Existing computational solutions often demand extensive programming proficiency or compel users to navigate a highly fragmented ecosystem of disconnected tools for tracking, statistical analysis, and visualization. Here, we present EthoClaw, an open-source, artificial intelligence-driven workflow platform built upon the OpenClaw agentic framework, functioning as a locally deployable AI assistant for behavioral research. EthoClaw provides an integrated computational infrastructure that bridges the gap between raw behavioral video acquisitions and publishable scientific results. In this study, we demonstrate the platform's capacity to natively ingest video data via a dual-mode tracking architecture: ultra-fast image processing for rapid object detection, and the SuperAnimal methods for precise, markerless postural tracking. To ensure maximal interoperability, EthoClaw automatically converts various tracking data formats into DeepLabCut-compatible formats, enabling high-throughput phenotyping by generating publication-quality visualizations alongside rigorous multidimensional statistical profiling. Furthermore, the platform incorporates a large language model (LLM)-driven reporting module that dynamically synthesizes analytical documents, ensuring methodological transparency. Through an open field test, we validate the practical usability of EthoClaw while accelerating computational throughput by localizing heavy video processing to circumvent cloud bandwidth bottlenecks. Operating via an omnichannel natural language interface that integrates with ubiquitous instant messaging software, EthoClaw democratizes advanced computational behavioral analysis, offering a holistic, highly efficient ecosystem that supports experimental reproducibility and open science principles.

15
In-source fragmentation in mass spectrometry-based proteomics: prevalence, impact, and strategies for mitigation

Schramm, T.; Gillet, L.; Reber, V.; de Souza, N.; Gstaiger, M.; Picotti, P.

2026-03-30 biochemistry 10.64898/2026.03.27.714398 medRxiv
Top 0.4%
0.7%
Show abstract

Peptide-level analyses are becoming increasingly popular in mass spectrometry-based proteomics and are being applied, for example, in immunopeptidomics, structural proteomics, and analyses of post-translational modifications. In such analyses, peptides that are not biologically meaningful but instead arise as artifacts prior to mass spectrometry analysis pose the risk of data misinterpretation. Here, we describe an approach based on retention time analysis and precise chromatographic peak matching to identify peptides generated by in-source fragmentation (ISF), which occurs between chromatographic separation of peptide mixtures and the first mass filter of a tandem mass spectrometer (MS). To understand the prevalence and properties of ISF, we generated 13 proteomics datasets and analyzed them along with 25 previously published datasets spanning a broad range of sample types, instruments, and proteomics approaches, including classical bottom-up proteomics, immunopeptidomics, structural proteomics, and phosphoproteomics. We found that, in typical trypsin-digested samples, on average 1% of fully tryptic peptides and 22% of semi-tryptic peptides originated from ISF. However, we observed large variations between datasets, and in-source fragments exceeded, in some cases, a third of the total peptide identifications. The extent of ISF depended on the peptide sequence, the instrument, method parameters, and sample complexity. Although ISF did not impair relative quantification across samples, it generated peptides that could be misinterpreted qualitatively, inflated peptide identifications, and comprised up to 37% of peptides shorter than 9 amino acids in immunopeptidomics datasets. We propose that, for peptide-centric applications, our open-source ISF detection approach be used to re-annotate peptides generated by ISF and remove them to avoid misinterpretation of data. ISF is a growing concern as mass spectrometers improve, since they enable detection of an ever-increasing number of m/z features, including low-abundance features such as ISF products. Our work thus addresses a growing issue in proteomics and presents solutions to mitigate the impact of in-source fragment peptides. In the future, improved feature detection algorithms may enable elucidation of new ISF patterns affecting side chains that have been missed so far, which could contribute to explaining the vast space of as-yet unannotated proteomics data.
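
The detection idea described, that an ISF product co-elutes with its parent because fragmentation happens after chromatographic separation, can be reduced to a toy heuristic. The tolerance, data layout, and peptide sequences below are hypothetical and far simpler than the paper's precise peak-matching approach:

```python
def flag_in_source_fragments(peptides, rt_tol_min=0.05):
    """Flag peptides likely produced by in-source fragmentation (ISF).

    Heuristic sketch: a peptide is an ISF candidate if its sequence is a
    proper substring of a longer co-identified peptide AND the two share
    (nearly) the same retention time, since ISF occurs after separation
    and the fragment co-elutes with its parent.
    peptides: list of (sequence, retention_time_min) tuples.
    """
    flagged = set()
    for seq, rt in peptides:
        for parent_seq, parent_rt in peptides:
            if (seq != parent_seq
                    and seq in parent_seq
                    and abs(rt - parent_rt) <= rt_tol_min):
                flagged.add(seq)
    return flagged

ids = [
    ("AILSSQPGTPK", 42.31),  # hypothetical fully tryptic parent
    ("QPGTPK", 42.30),       # co-eluting internal fragment -> ISF candidate
    ("LDEGNPK", 18.77),      # unrelated peptide at a different RT
]
print(flag_in_source_fragments(ids))  # {'QPGTPK'}
```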

16
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.5%
0.7%
Show abstract

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.

HIGHLIGHTS
- Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
- Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
- Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
- Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.
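The core correlation step, ranking genes by how similar their dependency profiles are across cell lines, can be sketched as follows. The gene-by-cell-line matrix layout and function name are hypothetical assumptions for illustration, not Correlate's actual code:

```python
import numpy as np

def top_correlated_genes(effects, query_gene, genes, n=3):
    """Rank genes by Pearson correlation of their dependency profiles.

    effects: (n_genes, n_cell_lines) array of CRISPR gene-effect scores;
    genes: list of gene names, one per row of the matrix.
    """
    i = genes.index(query_gene)
    # Correlate the query gene's effect vector against every gene at once.
    corr = np.corrcoef(effects)[i]
    order = np.argsort(-corr)
    # Skip the query gene itself (self-correlation of 1.0).
    return [(genes[j], round(float(corr[j]), 3)) for j in order if j != i][:n]
```

Genes with highly correlated dependency profiles across many cell lines are candidates for shared pathways or complexes, which is the data-driven alternative to curated annotations described above.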

17
MartiniSurf: Automated Simulations of Surface-Immobilized Biomolecular Systems with Martini

Jimenez Garcia, J. C.; Lopez-Gallego, F.; Lopez, X.; De Sancho, D.

2026-03-30 biophysics 10.64898/2026.03.27.714767 medRxiv
Top 0.5%
0.7%
Show abstract

The rational design of biomolecule immobilization strategies requires molecular-level understanding of how surface properties, tethering geometry, and structural dynamics jointly influence stability and function. Recently, coarse-grained molecular dynamics simulations based on the Martini force field have emerged as an efficient framework for studying enzyme-surface interactions. However, the reproducible construction of immobilized systems with controlled orientations remains technically challenging, usually involving multiple computational tools. Here we present MartiniSurf, an open-source command-line framework for the preparation of protein and DNA systems immobilized on solid supports within the Martini paradigm. MartiniSurf integrates automated structure retrieval and cleaning, coarse graining via tools from the Martini force field software ecosystem, customizable surface generation, and biomolecule orientation based on user-defined anchoring residues, producing complete GROMACS-ready simulation systems. The framework supports both implicit restraint-based anchoring and explicit linker-mediated immobilization, including surfaces functionalized with user-defined ligands or linker-like moieties, enabling representation of mono- and multivalent attachment geometries at different modeling resolutions. Structure-based Gō-Martini potentials can be incorporated for proteins, while DNA systems are modeled using Martini 2. Optional substrate insertion, pre-coarse-grained complex handling, and automated solvation and ionization further extend system flexibility. By integrating these components into a unified workflow, MartiniSurf enables systematic and high-throughput in silico exploration of surface-tethered biomolecules and provides a robust computational platform for rational immobilization studies.
TOC Graphic (Figure 1)
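The orientation step described above, placing a biomolecule so that user-defined anchoring residues face the support, can be sketched geometrically: rotate the structure so the vector from the anchor centroid to the molecule centroid points away from the surface, then rest the lowest bead on the surface plane. This toy function and its names are assumptions, not MartiniSurf's implementation, which also handles coarse graining, linkers, and GROMACS file generation:

```python
import numpy as np

def orient_to_surface(coords, anchor_idx):
    """Rotate coordinates so anchor beads face the xy-plane surface (z=0).

    coords: (N, 3) bead positions; anchor_idx: indices of anchoring residues.
    Aligns the anchor-centroid -> molecule-centroid vector with +z, then
    shifts the lowest bead to z = 0.
    """
    coords = np.asarray(coords, dtype=float)
    v = coords.mean(axis=0) - coords[anchor_idx].mean(axis=0)
    v /= np.linalg.norm(v)
    z = np.array([0.0, 0.0, 1.0])
    # Rodrigues rotation taking v onto +z.
    axis = np.cross(v, z)
    s = np.linalg.norm(axis)
    c = float(v @ z)
    if s < 1e-12:
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        k = axis / s
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)
    out = coords @ R.T
    out[:, 2] -= out[:, 2].min()
    return out
```

After this step, the anchoring residues sit closest to the surface, ready for restraint-based or explicit linker-mediated attachment.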

18
emb2dis: a novel protein disorder prediction tool based on ResNets, dilated convolutions & protein language models

Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.

2026-04-01 bioinformatics 10.64898/2026.03.30.715414 medRxiv
Top 0.6%
0.6%
Show abstract

Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction has become an active area of research over the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives a protein sequence as input, calculates its pLM embedding, and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to capture an extended context around each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web demo for emb2dis and a source code repository for local installation. Weblink for the tool: https://sinc.unl.edu.ar/web-demo/emb2dis/. emb2dis thus provides a new deep learning approach and a significant improvement in protein disorder prediction, with a simple web interface and graphical output detailing per-residue disorder.
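The receptive-field enlargement from dilated convolutions is easy to quantify: with stride 1, each layer adds (kernel_size - 1) * dilation residues of context, so the receptive field grows geometrically with exponentially increasing dilations while depth grows only linearly. A minimal sketch of that arithmetic (not the emb2dis code):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in residues) of stacked stride-1 dilated convolutions.

    Each layer widens the receptive field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf
```

For instance, three kernel-3 layers with dilations 1, 2, 4 already see 15 residues of context, versus 7 residues for the same stack without dilation.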

19
Using activity data to estimate brown bear den exit and entry dates

Brault, B.; Clermont, J.; Zedrosser, A.; Friebe, A.; Kindberg, J.; Pelletier, F.

2026-04-01 animal behavior and cognition 10.64898/2026.03.30.715338 medRxiv
Top 0.6%
0.5%
Show abstract

Background: In hibernating mammals, the timing of den entry and exit reflects complex interactions among environment, physiology, and energetic constraints, with consequences for fitness. Consequently, shifts in denning phenology can affect population dynamics, particularly under climate change. Reliable estimation of denning timing is therefore critical, yet current methods often rely on GPS-derived movement data, which are limited by coarse sampling intervals, detection issues, and an inability to distinguish true inactivity from active presence at the den site. In this study, we developed and applied a method to estimate denning phenology in a brown bear population in south-central Sweden using accelerometer-derived activity data. Our approach employs adaptive, individual-specific thresholds to account for variation in baseline activity across bears, focusing on day-to-day changes to identify the start and end of inactivity periods. This method allows flexible and reproducible detection of den entry and exit dates, overcoming limitations associated with fixed thresholds and small sample sizes. Results: We compared activity-based estimates with GPS-derived den occupancy and examined variation in denning behavior across demographic groups. Analyzing 388 bear-winters, the method successfully identified inactivity periods in 360 cases. It failed to identify clear start and end dates of hibernation for the remaining 28 (7%) bear-winters, which were characterized by unusually high or low daily activity levels at the boundaries of the inactivity period. Den site occupancy ranged from September 5 to June 2, with durations of 112-260 days, whereas inactivity periods detected from activity data extended from September 6 to May 13, lasting 83-217 days. Our comparison of activity-based and GPS-based methods indicates that bears may arrive at the den site several weeks before the onset of inactivity, with timing varying among demographic groups.
Conclusion: We show that activity-based analysis provides a robust framework for estimating denning phenology, distinguishing actual inactivity from mere presence at the den site, and improving understanding of the timing and variability of bear denning behavior. Applying an individual-level activity-based method improves accuracy in assessing the ecological mechanisms underlying hibernation in bears and other hibernators, while also enhancing interpretation of environmental drivers and providing a reliable tool to monitor phenological shifts in response to climate change.
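The adaptive-threshold idea can be sketched as: derive an individual-specific threshold from each bear's own activity distribution, then take the longest sufficiently long run of days below it as the inactivity period. The threshold rule (a fraction of the individual's median daily activity) and the parameter values here are illustrative assumptions, not the authors' exact method:

```python
def detect_inactivity(daily_activity, frac=0.1, min_days=30):
    """Find the longest run of days below an individual-specific threshold.

    The threshold is a fraction of the individual's median daily activity,
    so baseline differences between animals are absorbed automatically.
    Returns (start_index, end_index) inclusive, or None if no run lasts
    at least min_days.
    """
    srt = sorted(daily_activity)
    threshold = frac * srt[len(srt) // 2]  # individual-specific threshold
    best, start = None, None
    for i, a in enumerate(list(daily_activity) + [threshold]):  # sentinel closes final run
        if a < threshold:
            if start is None:
                start = i
        elif start is not None:
            length = i - start
            if length >= min_days and (best is None or length > best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best
```

Bear-winters where activity near the boundaries is unusually high or low would yield fragmented or missing runs, mirroring the 7% of cases the method could not resolve.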

20
REBEL, Reproducible Environment Builder for Explicit Library Resolution

Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.

2026-04-07 bioinformatics 10.64898/2026.04.04.716498 medRxiv
Top 0.6%
0.5%
Show abstract

Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers against the standard package manager; REBEL resolved 149 of 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain-text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of technical background. REBEL is freely available at https://github.com/Rebel-Project-Core
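Conservative dependency locking, the third heuristic, amounts to freezing the fully resolved stack (including transitive dependencies) in a deterministic artifact so a later rebuild cannot drift. A minimal sketch, assuming a simple JSON lock format; REBEL's actual file format and APIs may differ:

```python
import hashlib
import json

def lock_payload(resolved):
    """Serialize a fully resolved dependency stack deterministically.

    resolved: dict mapping package name -> exact pinned version, including
    transitive dependencies. Sorted keys make the serialization independent
    of resolution order, so the digest identifies the environment itself.
    """
    payload = json.dumps(resolved, sort_keys=True, indent=2)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return payload, digest

def write_lockfile(resolved, path):
    """Write the lock file and return its content digest for integrity checks."""
    payload, digest = lock_payload(resolved)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(payload)
    return digest
```

Because the digest depends only on the resolved name-version pairs, two builds performed at different times can verify they target byte-identical environments before installing anything.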