
Bioinformatics

24 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
FA-NIVA: A Nextflow framework for automated analysis of Nanopore based long-read sequencing data for genetic analysis in Fanconi anemia
2026-03-04 genetic and genomic medicine 10.64898/2026.02.27.26346867
Top 0.1% (5.7%)

Motivation: Fanconi anemia (FA) is a rare disease mainly caused by biallelic pathogenic variants, including structural variants such as large deletions and insertions in FA genes. Currently, variant detection is based on short-read sequencing and probe-based approaches. However, determining the exact genomic breakpoint or achieving allelic discrimination remains challenging. Nanopore-based long-read sequencing enables a comprehensive detection of FA variants, but a unified bioinformatic analysis p...

2
Gene Portals: A Framework for Integrating Clinical, Functional, and Structural Evidence into Rare Disease Variant Classification
2026-03-06 genetic and genomic medicine 10.64898/2026.03.05.26347086
Top 1% (1.2%)

Rare Mendelian disorders affect 300-400 million people globally. Although genetic testing has become widely adopted, gene-specific evidence for tailored variant interpretation remains scattered across resources. We present Gene Portals, a framework for gene-centered multimodal knowledge bases that co-localize expert-harmonized clinical data, functional assays, population variation, structural annotations and gene-specific ACMG/AMP specifications within a single resource. A modular interface inte...

3
Cancer genomic profiling predicts pathogenicity of BRCA1 and BRCA2 variants
2026-03-06 genetic and genomic medicine 10.64898/2026.03.05.26347746
Top 2% (1.1%)

Accurate classification of BRCA1 and BRCA2 variants is essential for cancer risk assessment and therapy selection, yet over one-third remain variants of uncertain significance (VUS). Here, using 120,660 real-world cancer genomic profiles with BRCA1 or BRCA2 variants from a >800,000-sample cohort, we develop machine learning models that predict pathogenicity using clinical and tumor-derived features, including a pan-cancer homologous recombination deficiency signature, co-mutated genes, zygosity,...

4
Pan-cancer tumour classification and risk stratification from whole-genome somatic variants via dual-task representation learning
2026-03-04 genetic and genomic medicine 10.64898/2026.03.02.26347318
Top 2% (1.0%)

Tumour typing from whole-genome sequencing is increasingly accurate, yet molecular subtyping from somatic variants remains challenging because of tumour heterogeneity and inconsistent clinical annotations. Here, we present Mutation-Attention Dual-Task (MuAt2), a Transformer model that jointly classifies histological tumour types and subtypes directly from somatic single-nucleotide variants, indels and structural variants. MuAt2 leverages encoders pre-trained on 2,587 pan-cancer whole genomes, an...

5
Prediction of incident coronary artery disease in individuals with zero coronary artery calcium using a novel multi-ancestry, label-free polygenic risk score framework
2026-03-04 genetic and genomic medicine 10.64898/2026.03.02.26347474
Top 2% (1.0%)

Background: A coronary artery calcium (CAC) score of 0 is widely considered to indicate low short- to intermediate-term risk for coronary artery disease (CAD) and is frequently used to defer lipid-lowering therapy. However, a subset of individuals with CAC=0 still experience events, highlighting residual risk not captured by imaging alone. Polygenic risk scores (PRS) quantify lifelong inherited susceptibility, but conventional approaches rely on predefined ancestry labels despite human genetic div...

6
Show Your Work: Verbatim Evidence Requirements and Automated Assessment for Large Language Models in Biomedical Text Processing
2026-03-04 health informatics 10.64898/2026.03.03.26346690
Top 3% (0.8%)

Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005-2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusio...

7
Enhancing Prediabetes Diagnosis from Continuous Glucose Monitoring Data via Iterative Label Cleaning and Deep Learning
2026-03-05 health informatics 10.64898/2026.03.04.26347604
Top 3% (0.7%)

As of early 2026, over 115 million US adults (more than 1 in 3) have prediabetes, a condition with an annual conversion rate of 5%-10% to type 2 diabetes. Total diabetes (diagnosed and undiagnosed) affects approximately 40.1 million Americans, or 12% of the population, with roughly 1.5 million new cases diagnosed annually. Continuous Glucose Monitoring (CGM) provides real-time, 24/7 insights into glycemic variability, detecting dangerous highs, lows, and trends that HbA1c (a 3-month average) mis...

8
Association of the FTO rs9939609 variant with glycemic control
2026-03-05 genetic and genomic medicine 10.64898/2026.03.05.26347689
Top 3% (0.7%)

Type 2 diabetes (T2D) affects 11.1% of the global population, underscoring the need for biomarkers that inform treatment response and glycemic outcomes. We evaluated the association between the FTO variant rs9939609-A and glycemic control in a Mexican population. A total of 174 individuals living with T2D from Merida and Sisal, Yucatan, were included, of whom 85% were receiving oral hypoglycemic agents as main treatment. Glycemic control was defined cross-sectionally as good (≤130 mg/dL, n=...

9
Integrative screening identifies functional variants and VNTRs underlying GWAS signals at the 5p15.33 multi-cancer susceptibility locus
2026-03-04 genetic and genomic medicine 10.64898/2026.03.03.26347427
Top 4% (0.7%)

Chromosome 5p15.33 harbors several independent association signals that demonstrate antagonistic pleiotropy across cancer types, with causal mechanisms largely unresolved. To identify functional variants and enhancer elements at this locus, we performed statistical fine-mapping followed by massively parallel reporter assays (MPRA) and proliferation-based CRISPRi screens. This approach identified eight multi-cancer functional variants (MCFVs) across three GWAS signals. Targeting rs421629 (part o...

10
Red-Teaming Medical AI: Systematic Adversarial Evaluation of LLM Safety Guardrails in Clinical Contexts
2026-03-05 health informatics 10.64898/2026.02.26.26347212
Top 5% (0.5%)

Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designe...

11
Trustworthy personalized treatment selection: causal effect-trees and calibration in perioperative medicine
2026-03-04 health informatics 10.64898/2026.03.03.26347440
Top 6% (0.5%)

Background: Personalized medicine promises to tailor treatments to the individual, but it carries a hidden risk: mistaking statistical noise for actionable clinical insight. Current machine learning approaches often provide predictions, but fail to inform clinicians when those predictions are unreliable. Objective: Develop a deployment-readiness framework that integrates causal inference, interpretable effect-trees, and calibration assessment to distinguish actionable signal from unreliable variati...

12
Medical concept understanding in large language models is fragmented
2026-03-05 health informatics 10.64898/2026.03.03.26347552
Top 7% (0.5%)

Large language models (LLMs) perform strongly across a wide range of medical applications, yet it remains unclear whether such success reflects genuine understanding of medical concepts. We present an ontology-grounded, concept-centered evaluation of medical concept understanding in LLMs. Using 6,252 phenotype concepts from Human Phenotype Ontology, we decompose concept understanding into three core dimensions--concept identity, concept hierarchy, and concept meaning--and design corresponding be...

13
Personalized Insights Derived from Wearable Device Data and Large Language Models to Improve Well-Being
2026-03-04 health informatics 10.64898/2026.03.03.26347299
Top 7% (0.5%)

Health behaviors such as physical activity and sleep affect mental health, but the effect of each health behavior varies substantially across individuals, limiting the usefulness of generic behavioral recommendations. We collected one year of continuous wearable and ecological momentary assessment data from 3,139 participants in the Intern Health Study (2018-2023), and examined individual-level associations between wearable-derived features and mood across the internship year. The behaviors asso...

14
Genetic liability to hip osteoarthritis confers neurovascular protection against Alzheimer's disease despite depression-mediated phenotypic comorbidity
2026-03-04 genetic and genomic medicine 10.64898/2026.03.04.26347509
Top 7% (0.5%)

Background: The relationship between hip osteoarthritis (hip OA) and Alzheimer's disease (AD) presents a critical paradox within the emerging "bone-brain axis": widespread phenotypic comorbidity sharply contradicts evolutionary theories of biological antagonism. This study integrates longitudinal and multi-omic analyses to determine whether this clinical overlap masks an underlying genetic neuroprotection. Methods: We analyzed longitudinal phenotypic data from 261,767 UK Biobank participants using C...

15
Too rare to be random: genetic finding suggests previously unrecognized path of mutagenesis
2026-03-04 genetic and genomic medicine 10.64898/2026.03.03.26346966
Top 7% (0.4%)

We report a previously undescribed genotypic configuration identified in twins with HNRNPU-related neurodevelopmental disorder. Both twins have two closely spaced mosaic variants on the same allele that never co-occur on any single DNA molecule, resulting in three distinct cell lineages within each individual. We define this genotypic configuration as clustered monoallelic mosaicism (cMoMa). Recognizing the extreme improbability of such a configuration, we systematically explore two potential me...

16
Thyroid Cancer Risk Prediction from Multimodal Datasets Using Large Language Model
2026-03-06 health informatics 10.64898/2026.03.05.26347766
Top 7% (0.4%)

Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Diagnostic methods that medical practitioners use at present depend on their personal judgment to evaluate both imaging results and separate clinical tests, which creates inconsistency that leads to incorrect medical evaluations. The combination of radiological imaging with clinical information syst...

17
Molecular characterisation of a Klebsiella pneumoniae neonatal sepsis outbreak in a rural Gambian hospital: a retrospective genomic epidemiology investigation
2026-03-04 genetic and genomic medicine 10.64898/2026.03.03.26347025
Top 9% (0.4%)

Background: Klebsiella pneumoniae is a common cause of neonatal sepsis in Africa, and is frequently hospital acquired. We recently reported an outbreak of multidrug-resistant K. pneumoniae sepsis amongst neonates at a rural hospital in The Gambia, West Africa, involving 57 cases and case fatality of 60%. Here we undertook a retrospective pathogen genomic epidemiology study of clinical and environmental K. pneumoniae isolated during the outbreak, to identify the outbreak strain, refine the epidemic...

18
Using the ECHILD Database to Explore Educational and Health Outcomes of Unaccompanied Asylum-Seeking Children living in England (2005 to 2021)
2026-03-04 health informatics 10.64898/2026.03.04.26347576
Top 9% (0.3%)

UK-based quantitative research on the health and education outcomes of Unaccompanied Asylum-Seeking Children (UASC) remains limited, especially at national level. Linked administrative data provide an unprecedented opportunity to study these outcomes among UASC. This paper lays a foundation for further research, particularly examining the influence of socio-demographic, legal and environmental factors on the health and educational outcomes of UASC. We described the UASC population with a first record...

19
Population differences in wearable device wear time: Rescuing data to address biases and advance health equity
2026-03-06 health informatics 10.64898/2026.03.06.26347799
Top 9% (0.3%)

Wearable devices present transformative opportunities for personalized healthcare through continuous monitoring of digital biomarkers; however, individual variations in device wear time could mask or otherwise impact signal identification. Despite the widespread adoption of wearable devices in research, no comprehensive framework exists for understanding how wear time varies across populations or for addressing wear time-related biases in analysis. Using Fitbit data from 11,901 participants in t...

20
BEGA-UNet: Boundary-Explicit Guided Attention U-Net with Multi-Scale Feature Aggregation for Colonoscopic Polyp Segmentation
2026-03-05 gastroenterology 10.64898/2026.03.04.26347608
Top 9% (0.3%)

Accurate polyp segmentation from colonoscopy images is critical for colorectal cancer prevention, yet the generalization of deep learning models under domain shift remains insufficiently explored. We propose Boundary-Explicit Guided Attention U-Net (BEGA-UNet), a boundary-aware segmentation architecture that introduces explicit edge modeling as a structural inductive bias to enhance both segmentation accuracy and cross-domain robustness. The framework integrates three components: an Edge-Guided ...