
Database

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match Database's content profile, based on 51 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Clarified an rDNA Gene Unit Pattern with (CTTT)n and (CT)n Microsatellites Aggregation Ahead of and Behind the Gene in Human Genome

Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.

2026-03-24 genetics 10.64898/2026.03.22.713381 medRxiv
Top 0.2%
3.7%

Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.

2
MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies

Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.

2026-04-10 bioinformatics 10.64898/2026.04.07.716833 medRxiv
Top 0.2%
3.1%

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.

3
Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening

Kim, U.; Kwon, O.; Lee, D.

2026-04-09 bioinformatics 10.64898/2026.04.06.716861 medRxiv
Top 0.3%
2.1%

Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems, characterized by high context dependency and conflicting data, remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a model's pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The framework's performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.

4
geneslator: an R package for comprehensive gene identifier conversion and annotation

Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714723 medRxiv
Top 0.4%
1.8%

Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it

5
From Registration to Insight: How STRONG AYA Transforms Registry Data to Enhance Decision-Support Tools for Adolescent and Young Adult Oncology

Hughes, N.; Hogenboom, J.; Carter, R.; Norman, L.; Gouthamchand, V.; Lindner, O.; Connearn, E.; Lobo Gomes, A.; Sikora-Koperska, A.; Rosinska, M.; Pogoda, K.; Wiechno, P.; Jagodzinska-Mucha, P.; Lugowska, I.; Hanebaum, S.; Dekker, A.; van der Graaf, W.; Husson, O.; Wee, L.; Feltbower, R.; Stark, D.

2026-04-04 oncology 10.64898/2026.04.03.26350064 medRxiv
Top 0.4%
1.7%

Background: Population-based cancer registers (PBCR) are important for monitoring trends in cancer epidemiology, facilitating the implementation of effective cancer services. Adolescents and Young Adults (AYA) with cancer are a patient group with a unique set of needs. The utility of PBCR in AYA is limited by the lack of AYA-specific data items. STRONG AYA, an international multidisciplinary consortium, is addressing this through federated learning (FL) methodology and novel data visualisation concepts. A Core Outcome Set (COS) has been developed to measure outcomes of importance through clinical data and Patient Reported Outcomes (PROs). We describe how data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP), a PBCR in the UK, are being used within STRONG AYA and how the subsequent analyses can guide patient consultations. Methods: Data from the YSRCCYP were imported into a Vantage 6 node, from which FL analyses are performed along with data provided by other consortium members. The results are extracted into the PROMPT software and integrated into patient electronic healthcare records. Results: Healthcare professionals can view the results of individual PROs at various time points and in comparison to summary analyses carried out within the STRONG AYA infrastructure. Results can be filtered by age, disease, country and stage. Conclusion: We have demonstrated how a regional PBCR can contribute to a pan-European infrastructure and how the resulting analyses can be viewed to enhance patient consultations. Such analyses have the potential to be used for research and policy-making, improving outcomes for AYA.

6
Dingent: An Easily Deployable Database Retrieval and Integration Agent framework

Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.

2026-03-20 bioinformatics 10.64898/2026.03.17.712026 medRxiv
Top 0.4%
1.7%

AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.

7
A harmonized benchmarking framework for implementation-aware evaluation of 46 polygenic risk score tools across binary and continuous phenotypes

Muneeb, M.; Ascher, D.

2026-03-23 bioinformatics 10.64898/2026.03.22.713457 medRxiv
Top 0.4%
1.7%

Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
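For readers unfamiliar with the Friedman test mentioned above, the following is a minimal sketch of how the statistic can be computed when each phenotype-fold combination is treated as a block and each tool as a treatment (8 phenotypes × 5 folds gives the 40 blocks). The toy scores, the function names, and the choice to omit the tie correction are assumptions for illustration; this is not the authors' pipeline. The statistic is compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of tools ranked.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the tie group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based mean rank of the tie group
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_chi2(scores: Sequence[Sequence[float]]) -> float:
    """Friedman statistic for an n_blocks x k_tools table (no tie correction)."""
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:  # rank the tools within each block
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Toy data (hypothetical): 4 blocks x 3 tools, e.g. a per-fold performance metric.
toy_scores = [
    [0.61, 0.65, 0.70],
    [0.58, 0.63, 0.66],
    [0.72, 0.70, 0.75],
    [0.55, 0.60, 0.64],
]
chi2 = friedman_chi2(toy_scores)
print(f"Friedman chi-square = {chi2:.2f} with {len(toy_scores[0]) - 1} degrees of freedom")
```

Ranking within blocks is what makes the test robust to phenotypes whose absolute scores differ widely: only the ordering of tools inside each phenotype-fold combination contributes.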

8
Track Hub Quickload Translator: Convert Track Hub or Quickload data for viewing in the UCSC Genome Browser or the Integrated Genome Browser

Freese, N. H.; Raveendran, K.; Sirigineedi, J. S.; Chinta, U. L.; Badzuh, P.; Marne, O.; Shetty, C.; Naylor, I.; Jagarapu, S.; Loraine, A.

2026-03-30 bioinformatics 10.64898/2026.03.26.708838 medRxiv
Top 0.4%
1.5%

Summary: Track Hub Quickload Translator is a web application that interconverts University of California Santa Cruz (UCSC) Genome Browser track hub and Integrated Genome Browser (IGB) data repository formats by translating the track hub or Quickload configuration files to the other genome browser's required format. This new work enables researchers to work with tens of thousands of published genome assemblies for the first time using either browser. Availability and Implementation: Track Hub Quickload Translator is implemented using Python 3 and freely available to use at translate.bioviz.org. Integrated Genome Browser is available from BioViz.org. Track Hub Quickload Translator, GenArk Genomes, and the Integrated Genome Browser source code is available from github.org/lorainelab. Contact: aloraine@charlotte.edu

9
IEKB: a comprehensive knowledge base for inner ear genetics integrating curated associations, cochlear interactions, Bayesian candidate prioritisation, explainable dark-gene support relations, and a scientific entity network

Wang, H.; Chen, W.; Ning, H.; Cai, Y.; Xu, Y.; Hou, X.; Pang, L.; Luo, Z.; Tian, C.

2026-04-09 bioinformatics 10.64898/2026.04.06.716823 medRxiv
Top 0.5%
1.5%

Inner-ear genetics has expanded rapidly, yet the supporting evidence remains dispersed across a vast literature and across resources that typically emphasise loci, variants, or expression data rather than integrated biological interpretation. Here we present the Inner Ear Knowledge Base (IEKB; https://earkb.org), an open database that unifies curated associations, cochlear interaction evidence, candidate prioritisation, explainable support relations, and network exploration for inner-ear research. IEKB was built with an automated agent-assisted curation workflow that combines schema-constrained literature extraction, continuous human monitoring, and final expert review by inner-ear genetics researchers. By systematically analysing 250,696 PubMed-indexed records retrieved across 16,563 screened genes, IEKB curates 6,051 gene-phenotype-disease associations from 2,494 genes across 43 phenotype categories and 4,102 cochlear gene-gene interactions with pathway, cell-type, and experimental context. IEKB further includes a Bayesian "dark matter" module that prioritises 243,071 candidate gene-phenotype associations for 13,229 genes across all 43 phenotypes (global AUC-ROC = 0.8603; global AUC-PR = 0.1674), together with a supervised dark-relation layer that ranks phenotype-specific known-gene support for each candidate and a multi-entity scientific network containing nearly 4,000 entities, 28,616 deterministic edges, and 83,712 literature-derived relational links. The web resource supports interactive search, multi-parameter filtering, gene-detail pages, bibliometric exploration, domain-specific enrichment against IEKB phenotype and disease gene sets, network visualisation, bulk download in CSV, JSON, SQLite, and XLSX formats, and natural-language evidence-grounded question answering through a companion conversational interface (IEKB QA). 
To our knowledge, IEKB is the first openly accessible inner-ear resource that integrates curated associations, cochlear interactions, probabilistic candidate prioritisation, auditable known-gene support relations for novel candidates, and a multi-entity scientific network within a single database. All data are released without registration under the CC BY 4.0 license.

10
Evaluating FoldX5.1 for MAVISp Stability Data Collection

Vliora, A.; Tiberti, M.; Papaleo, E.

2026-04-02 bioinformatics 10.64898/2026.03.31.715598 medRxiv
Top 0.5%
1.5%

MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement, with a mean Pearson correlation of 0.933 and a mean Cohen's kappa coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
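The two agreement measures reported above can be sketched as follows. This assumes that the Cohen coefficient is Cohen's kappa computed on discretised stability classes; the ΔΔG values, the two-class scheme, and the 3.0 kcal/mol cutoff are hypothetical placeholders, not MAVISp's actual classification rules.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def classify(ddg, cutoff=3.0):
    """Hypothetical two-class rule; the cutoff (kcal/mol) is an assumption."""
    return "destabilizing" if ddg >= cutoff else "neutral"


# Made-up ddG predictions for the same six variants from two software versions.
old_version = [0.2, 1.1, 3.5, 4.8, 0.9, 2.9]
new_version = [0.3, 1.0, 3.9, 4.5, 1.2, 3.2]

r = pearson(old_version, new_version)
kappa = cohen_kappa([classify(v) for v in old_version],
                    [classify(v) for v in new_version])
print(f"Pearson r = {r:.3f}, Cohen's kappa = {kappa:.3f}")
```

The two measures answer different questions: the correlation tracks whether the continuous energies move together, while kappa tracks whether variants near a class boundary change label, which is why a variant shifting from 2.9 to 3.2 lowers kappa while barely affecting r.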

11
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics 10.64898/2026.03.21.713397 medRxiv
Top 0.5%
1.4%

Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
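The idea of a "union reference" and per-configuration coverage can be illustrated with a small sketch. The SNP identifiers, gene symbols, and the definition of coverage as the fraction of SNP-gene pairs recovered are assumptions chosen for illustration; the study's actual metric may differ.

```python
# Hypothetical annotation output: configuration label -> {SNP id: set of gene symbols}.
annotations = {
    "ANNOVAR+RefSeq": {"rs1": {"GENE_A"}, "rs2": {"GENE_B"}},
    "SnpEff+Ensembl": {"rs1": {"GENE_A"}, "rs3": {"GENE_C"}},
    "VEP+Ensembl": {"rs2": {"GENE_B", "GENE_D"}},
}


def union_reference(results):
    """Merge every configuration's SNP-to-gene assignments into one reference."""
    merged = {}
    for per_snp in results.values():
        for snp, genes in per_snp.items():
            merged.setdefault(snp, set()).update(genes)
    return merged


def coverage(per_snp, reference):
    """Fraction of reference SNP-gene pairs recovered by one configuration."""
    ref_pairs = {(s, g) for s, genes in reference.items() for g in genes}
    got_pairs = {(s, g) for s, genes in per_snp.items() for g in genes}
    return len(got_pairs & ref_pairs) / len(ref_pairs)


reference = union_reference(annotations)
for label, per_snp in annotations.items():
    print(f"{label}: {coverage(per_snp, reference):.2f}")
```

In this toy case every configuration recovers only half of the union, which mirrors the abstract's point that no single tool or gene model reaches full recovery.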

12
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.5%
1.4%

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.

Highlights
■ Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
■ Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
■ Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
■ Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.

13
CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data

Jayakumar, R.; Panwar, P.; Yang, J. Y. H.; Ghazanfar, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714643 medRxiv
Top 0.5%
1.3%

Motivation: Cell-cell interaction (CCI) underlies several fundamental mechanisms including development, homeostasis and disease progression. CCIs are known to be localised to specific subcellular regions, for example, within the cytoplasm of cells. With the emergence of subcellular spatial transcriptomics (sST) technologies, there is an opportunity to attribute CCI to subcellular regions. We aimed to deconvolve CCI into subcellular CCI (sCCI) in non-spatial single-cell transcriptomics (i.e. scRNA-seq) datasets using a modified CCI score from CellChat. Results: By calculating the sCCI score specific to cytoplasm and nucleus in nine publicly available sST datasets, we identified unique nucleus-nucleus and cytoplasm-cytoplasm sCCI. Then, we deconvolved the communication score to subcellular regions using a hierarchical classification and regression model, which we name CCIDeconv. We performed leave-one-dataset-out cross-validation across nine datasets over a range of different tissue types from human samples. We observed that training across many different tissue types resulted in robust deconvolution performance in an unseen dataset. As the number of training datasets increased, models trained without spatial features achieved performance similar to that of models including spatial features. This implies the potential for accurate prediction of sCCI events even from scRNA-seq data, given large numbers of training datasets. Overall, we offer a method for attributing CCI events to subcellular regions. This method can help researchers dissect sCCI patterns to gain insights into the underlying biology in a range of tissues covering health and disease.
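Leave-one-dataset-out cross-validation, as described above, holds out one whole dataset per round so that the model is always evaluated on a tissue source it never saw during training. A minimal sketch of the splitting logic is shown below; the dataset names are placeholders, and no model fitting is included because the abstract does not specify it.

```python
def leave_one_dataset_out(dataset_names):
    """Yield (training datasets, held-out dataset) pairs, one per dataset."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out


# Nine placeholder names standing in for the nine sST datasets.
names = [f"dataset_{i}" for i in range(1, 10)]
splits = list(leave_one_dataset_out(names))

for train, held_out in splits:
    print(f"train on {len(train)} datasets, evaluate on {held_out}")
```

Splitting at the dataset level rather than the cell level avoids leakage between training and evaluation from cells that share a tissue, donor, or platform.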

14
Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank

Smieja, P.; Zadrozna, M.; Syed, K.; Nelson, D.; Gront, D.

2026-03-19 bioinformatics 10.64898/2026.03.17.712328 medRxiv
Top 0.7%
0.9%

Cytochrome P450 monooxygenases (CYPs/P450s) form a highly diverse enzyme superfamily central to biotechnology, pharmacology, and environmental science. Despite the large number of available structures, identifying and comparing P450 entries in structural repositories remains challenging due to their extreme sequence divergence and inconsistent annotation practices. In particular, many deposits lack the standardized nomenclature (CYPid) and instead rely on legacy or author-defined common names (like P450cam, P450BM-3 and P450-PCN1), which are often inconsistent in formatting and specificity. This is particularly difficult for a superfamily as sequentially diverse as P450s. This hinders reliable retrieval and cross-referencing, making even the identification of all P450 structures in the database nontrivial. To overcome these obstacles, we developed a structure-guided discovery and validation workflow combining keyword search, Hidden Markov Models, and structural alignment, enabling robust detection and annotation. This strategy identified 1,513 deposits representing 674 unique sequences. All sequences were reannotated using the P450Atlas server and manually verified, confirming high assignment accuracy. In the process, we have also identified five new CYP subfamilies. The resulting dataset constitutes the first rigorously curated, structure-linked registry of P450 enzymes, integrated into a publicly accessible resource and supported by an automated pipeline that periodically scans newly released entries. By unifying structurally validated identification with standardized CYP nomenclature, this work establishes a reliable framework for accurate retrieval, comparison, and future large-scale analyses of P450 enzymes.

15
ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection

Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.

2026-03-19 bioinformatics 10.64898/2026.03.16.711958 medRxiv
Top 0.8%
0.8%

Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein-coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events. Availability: https://github.com/Candlelight-XYJ/ChiMER Contact: yujia.xiang@outlook.com, xielinhai@ncpsb.org.cn

16
Patient-Centred Communication in Lung Cancer Screening: A Clinically Focussed Evaluation of a Fine-Tuned Open-Source Model Against a Larger Frontier System

Khanna, S.; Chaudhary, R.; Narula, N.; Lee, R.

2026-04-11 oncology 10.64898/2026.04.10.26350595 medRxiv
Top 0.9%
0.8%

Lung cancer screening saves lives, yet uptake remains suboptimal and inequitable. Personalised communication can improve attendance and reduce anxiety, but scaling such support is a workforce challenge. We fine-tuned Google's Gemma 2 9B using QLoRA on 5,086 synthetic screening conversations and compared it against Google's Gemini 2.5 Flash (a larger frontier model) and an unmodified baseline across 300 multi-turn conversations with 100 patient personas spanning ten clinical categories. Evaluation combined automated natural language processing metrics with independent language model judgement in two complementary modes: structured clinical rubric and simulated patient persona. The fine-tuned model achieved the highest simulated patient experience score (3.71/5 vs 3.65 for the frontier model), recorded zero boundary violations after clinician review of all flagged instances, and led on the four most safety-critical categories. A composite Patient Adaptation Index showed that the fine-tuned model led overall (0.37 vs 0.35 vs 0.35), with its clearest advantage on the two clinically specific components: empathy calibration to patient distress and selective smoking cessation signposting. These findings suggest that targeted fine-tuning of open-source models can yield clinical communication quality comparable to larger proprietary systems, with advantages in safety-critical scenarios and suitability for NHS data governance constraints. Human clinician review of these conversations is ongoing.

17
Quantifying and Characterizing the Fiber in Hass Avocados During the Ripening Process

Sanabria-Veaz, M. G.; Fahey, G. C.; Bach-Knudsen, K. E.; Holscher, H. D.

2026-04-08 plant biology 10.64898/2026.04.05.716578 medRxiv
Top 0.9%
0.8%

Reports of avocado dietary fiber (DF) content and composition are inconsistent, particularly during ripening. Thus, this study aimed to characterize the amount and type of DF in Hass avocados and evaluate DF changes during ripening. Unripe (day 0), ripe (day 5), and overripe (day 12) Hass avocados were freeze-dried and defatted. DF was analyzed using non-starch polysaccharide (NSP) enzymatic-chemical methods. Per 100 g of as-is avocado, unripe fruit contained 3.96 g total DF, ripe 3.68 g, and overripe 3.26 g. In ripe avocados, DF comprised 43% soluble (SDF) and 57% insoluble dietary fiber (IDF). SDF consisted primarily of rhamnogalacturonan-1 and arabinan pectins, while IDF was predominantly cellulose (32%), hemicelluloses (23%), and lignin (2%). Total DF decreased with ripening, with pectin undergoing solubilization and depolymerization, while cellulose and hemicelluloses remained stable. These findings are important as dietary fibers differentially influence intestinal microbial fermentation and health benefits.

18
STRmie-HD enables interruption-aware HTT repeat genotyping and somatic mosaicism profiling across sequencing platforms

Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.

2026-03-25 bioinformatics 10.64898/2026.03.21.713334 medRxiv
Top 0.9%
0.8%

Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.

Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
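The regular expression-based, per-read counting described in this abstract can be illustrated with a minimal Python sketch. This is not the STRmie-HD implementation: the function names (`longest_run`, `profile_reads`), the two patterns, and the toy reads are hypothetical, and the sketch only measures the longest uninterrupted CAG and CCG runs in each read, without classifying LOI or DOI variants.

```python
import re
from collections import Counter

# Hypothetical patterns: two or more consecutive copies of each motif.
CAG_RUN = re.compile(r"(?:CAG){2,}")
CCG_RUN = re.compile(r"(?:CCG){2,}")


def longest_run(read: str, pattern: re.Pattern, unit_len: int = 3) -> int:
    """Return the repeat count of the longest uninterrupted run in a read."""
    lengths = [m.end() - m.start() for m in pattern.finditer(read.upper())]
    return max(lengths) // unit_len if lengths else 0


def profile_reads(reads):
    """Histogram of longest CAG-run lengths, one count per read."""
    return Counter(longest_run(r, CAG_RUN) for r in reads)


# Toy reads (made up for illustration, not real data).
interrupted = "TTT" + "CAG" * 5 + "CAACAG" + "CCGCCA" + "CCG" * 3 + "AAA"
uninterrupted = "TTT" + "CAG" * 7 + "CCGCCA" + "CCG" * 3 + "AAA"

cag_histogram = profile_reads([interrupted, interrupted, uninterrupted])
```

In a histogram like `cag_histogram`, peaks would correspond to candidate allele sizes, and the spread of read counts around each peak is the kind of signal a somatic mosaicism metric could summarize.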

19
A North American Collaborative Atlas of Oncology Data Visualization with R Statistical Software

Soltanifar, M.; Portuguese, A. J.; Jeon, Y.; Gauthier, J.; Lee, C. H.

2026-03-24 oncology 10.64898/2026.03.20.26348936 medRxiv
Top 0.9%
0.8%
Show abstract

Oncology research and clinical practice in North America increasingly rely on complex endpoints, heterogeneous study designs, and high-dimensional molecular data. In this landscape, data visualization serves as a critical analytic instrument for study design communication, model diagnostics, safety reporting, and real-time clinical decision support. Despite its importance, the oncology visualization ecosystem remains fragmented across commercial platforms and bespoke scripts, lacking a unified, code-first reference that emphasizes reproducibility and auditability in the R programming environment. This paper addresses this gap by presenting a North American collaborative atlas of 62 oncology visualization templates: 24 for clinical trials, 12 for real-world evidence (RWE), and 26 common to both settings. A core innovation of this atlas is its simulation-driven approach; each plot is illustrated using transparent, reproducible data-generating mechanisms. This allows users to deterministically recreate figures and easily adapt templates to alternative endpoints, censoring patterns, and subgroup structures. The paper provides foundational notation for oncology endpoints, an operational taxonomy based on data geometry, and a consolidated review of relevant R software. We further synthesize the practical utility of these methods through four representative case studies and provide a comparative analysis of the strengths, limitations, and future challenges of oncology data visualization. A detailed tutorial on fishplot is included to demonstrate a publication-ready workflow for clonal evolution.

20
MyGeneRisk Colon: A Web-Based Tool for Personalized Colorectal Cancer Risk Prediction Based on Genetics and Lifestyle

Zheng, J.; Steinfelder, R. S.; Yin, H.; Qu, C.; Thomas, M.; Thomas, S. S.; Andrews, C.; Augusto, B.; Corley, D. C.; Lee, J. K.; Berndt, S. I.; Chan, A. T.; Chanock, S. J.; Gignoux, C.; Goldberg, S. R.; Haiman, C. A.; Huyghe, J. R.; Iwasaki, M.; Le Marchand, L.; Lee, S. C.; Melendez, J.; Mesa, I.; Ogino, S.; Sifontes, V.; Um, C. Y.; Visvanathan, K.; White, L. L.; Williams, A.; Willis, W.; Wolk, A.; Yamaji, T.; Vadaparampil, S. T.; Jarvik, G. P.; Burnett-Hartman, A. N.; Milne, R. L.; Platz, E. A.; Figueiredo, J. C.; Zheng, W.; MacInnis, R. J.; Palmer, J. R.; Schmit, S. L.; Landorp-Vogelaar, I.

2026-04-06 gastroenterology 10.64898/2026.04.03.26349669 medRxiv
Top 0.9%
0.8%
Show abstract

Colorectal cancer (CRC) is a leading cause of cancer-related death, with incidence rising substantially among individuals under 50 years of age. Polygenic risk scores (PRS) hold promise for identifying high-risk individuals; when combined with lifestyle factors, they substantially improve prediction accuracy compared with models based on lifestyle factors alone. However, few clinical tools currently exist that facilitate this integrated, PRS-enhanced risk assessment. To bridge this gap, we developed MyGeneRisk Colon, a publicly accessible web portal that delivers individualized CRC risk prediction by incorporating genetic, demographic, family history, and lifestyle factors. This paper details the development of the underlying risk prediction model, the portal's architecture and data security, our reporting framework, and engagement with a community advisory panel. Designed as a user-friendly platform, MyGeneRisk Colon aims to effectively communicate personalized CRC risk profiles and educate users and healthcare providers about prevention strategies.