Bioinformatics

24 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
FA-NIVA: A Nextflow framework for automated analysis of Nanopore based long-read sequencing data for genetic analysis in Fanconi anemia
2026-03-04 genetic and genomic medicine 10.64898/2026.02.27.26346867
Top 0.1% (5.7%)

Motivation: Fanconi anemia (FA) is a rare disease mainly caused by biallelic pathogenic variants, including structural variants such as large deletions and insertions in FA genes. Currently, variant detection is based on short-read sequencing and probe-based approaches. However, determining the exact genomic breakpoint or achieving allelic discrimination remains challenging. Nanopore-based long-read sequencing enables comprehensive detection of FA variants, but a unified bioinformatic analysis p...

2
Learning lifetime disease liability reveals and removes genetic confounding in electronic health records
2026-02-22 genetic and genomic medicine 10.64898/2026.02.15.26346336
Top 0.2% (4.0%)

Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies [1], but factors including patterns of healthcare use shape which and how diagnoses are recorded, leading to confounding effects in genetic associations with EHR codes [2]. In this study we propose EDGAR, a deep learning framework that recovers lifetime disease liability from EHR by aligning diagnostic codes with clinically validated measures and disease labels in a set of individuals prioritized through a...

3
Statistical uncertainty explains the poor agreement in polygenic scoring for type 2 diabetes
2026-02-27 genetic and genomic medicine 10.64898/2026.02.25.26347015
Top 0.2% (3.8%)

Polygenic scores (PGS) have emerged as an important tool for genetic risk prediction in medicine to identify individuals at high risk for disease. A major limitation in their implementation is the apparent disagreement among scores for the same individual, which decreases their interpretability and utility in clinical settings. Here we show that the poor agreement across PGSes for type 2 diabetes (T2D) is fully explained by statistical uncertainty in PGS-based prediction; individual-level uncertainty ...

4
PHARMWATCH: A Multilayer Pharmacogenomics Safety System for Accurate Star Allele Interpretation
2026-02-28 genetic and genomic medicine 10.64898/2026.02.26.26347200
Top 0.2% (3.7%)

The Clinical Pharmacogenetics Implementation Consortium (CPIC) bases its drug-gene recommendations on the assignment of star alleles, which map known genotypes to defined functional categories and corresponding drug dosage guidelines. The star allele framework, first proposed in 1996 for the CYP gene family and later formalized with CPIC's establishment in 2010 [1, 2], remains foundational to pharmacogenomics. However, this system has notable limitations. Its dependence on a restricted set of ben...

5
Constructing a Literature-Derived Database for Benchmarking Polygenic Risk Score Construction Methods with Spectral Ranking Inferences
2026-03-03 genetic and genomic medicine 10.64898/2026.03.01.26347258
Top 0.3% (3.7%)

Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRS, leading to the development of numerous methods for PRS construction. Benchmarking these various methods thus becomes an essential task that is crucial for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenoty...

6
Deep Agentic Variant Prioritisation for Expert Level Genetic Diagnosis Fast at Scale
2026-02-18 genetic and genomic medicine 10.64898/2026.02.17.26346421
Top 0.3% (3.1%)

Genetic diagnosis remains a formidable challenge characterized by a diagnostic odyssey that spans years, with over half of rare disease patients remaining undiagnosed, affecting more than 300 million people on earth. Clinicians must navigate through thousands of candidate variants against a noisy and fragmented literature landscape, a task that overwhelms human cognitive capacity and conventional decision-making approaches. Recent advances in agentic artificial intelligence systems have demonstra...

7
A time-to-event heritability framework for inferring the genetic architecture of longitudinal traits
2026-02-22 genetic and genomic medicine 10.64898/2026.02.16.26346285
Top 0.4% (2.0%)

Biobanks with longitudinal measurements have advanced our understanding of time-to-event (TTE) traits including age-of-onset and disease progression. However, limited work has characterized the heritability of TTE traits, a key parameter for comparisons of total association and predictive power. Here, we present COXMM, a Cox proportional hazard mixed model for estimating TTE heritability. Simulations show our model achieves nearly unbiased results, whereas non-TTE approaches severely underestima...

8
Performance Characteristics of Reasoning Large Language Models for Evidence Extraction from Clinical Genomics Literature
2026-02-19 genetic and genomic medicine 10.64898/2026.02.18.26346543
Top 0.4% (2.0%)

BACKGROUND: Genetic variant curation, an important step in the implementation of Genomic Medicine, requires literature-guided comparison of variant prevalence in affected individuals versus healthy controls. This evidence is categorized as the PS4 evidence code by the AMP/ACMG variant interpretation guidelines, and its manual extraction is a major bottleneck in clinical variant curation. This study aimed to evaluate whether reasoning-capable large language models (LLMs) can support guideline-constr...

9
PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering
2026-03-02 health informatics 10.64898/2026.02.26.26347219
Top 0.4% (2.0%)

Objective: Systematic clinical phenotyping using Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization (ranking candidate diseases from HPO for a patient) methods face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We ...

10
An Integrated Deep Learning Framework for Small-Sample Biomedical Data Classification: Explainable Graph Neural Networks with Data Augmentation for RNA sequencing Dataset
2026-02-24 genetic and genomic medicine 10.64898/2026.02.22.26346827
Top 0.5% (1.9%)

Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic d...

11
Interpretable Fine-tuned Large Language Models Facilitate Making Genetic Test Decisions for Rare Diseases
2026-03-02 health informatics 10.64898/2026.02.26.26347223
Top 0.5% (1.9%)

Clinical decision making often relies on expert judgment guided by established guidelines, which can be challenging to standardize and abstract to implement. For example, selecting between gene panels and whole exome/genome sequencing (WES/WGS) for rare disease diagnosis frequently requires interpretation of evidence-based recommendations from the American College of Medical Genetics and Genomics (ACMG) guideline. Traditional machine learning (ML) models predicting suitable genetic tests often f...

12
Leveraging genome-wide effects on gene expression to identify disease-critical genes with trans-genetic components
2026-02-25 genetic and genomic medicine 10.64898/2026.02.23.26346922
Top 0.7% (1.6%)

Genome-wide association studies (GWAS) have implicated tens of thousands of genetic variants associated with complex traits and polygenic diseases. Colocalizing GWAS variants with variants that may regulate gene expression, via expression quantitative trait loci (eQTL) mapping, has successfully led to the identification of disease-critical genes and their cell types of action. Recent studies predominantly colocalize proximal cis-eQTLs, which are estimated to regulate ~10% of variance in gene e...

13
Language models reveal evidence gaps in variants of uncertain significance
2026-03-02 genetic and genomic medicine 10.64898/2026.02.28.26347206
Top 0.8% (1.6%)

Background: Most rare coding variants in monogenic disease genes remain classified as Variants of Uncertain Significance (VUS), limiting their use in clinical care. Many variant classifications have been submitted to ClinVar, often with rich free-text summaries of the evidence underlying each classification. These narratives are not standardized and are difficult to mine systematically, making it challenging to identify variants that might be reclassified as new evidence becomes available. Method...

14
Handling onset age inconsistencies in longitudinal healthcare survey data
2026-02-23 health informatics 10.64898/2026.02.20.26346741
Top 0.8% (1.6%)

Abstract: Longitudinal healthcare surveys frequently contain inconsistencies in self-reported onset ages, where participants report different ages for the same condition between enrollment and follow-up surveys. We propose two methods to handle this challenge. First, we introduce a procedure that aggregates inconsistency patterns to construct participant-level reliability scores, enabling researchers to stratify participants and prioritize analysis on high-reliability cohorts. Seco...

15
An LLM-assisted framework for accelerated and verifiable clinical hypothesis testing from electronic health records
2026-02-12 health informatics 10.64898/2026.02.10.26346008
Top 0.8% (1.5%)

Acquiring insights from electronic health records (EHRs) is slowed by manual analytical workflows that limit scalability and reproducibility. We present LATCH (LLM-Assisted Testing of Clinical Hypotheses), an agentic framework that converts natural language clinical hypotheses into fully auditable analyses on structured EHR data. LATCH integrates LLM-assisted semantic layers with deterministic execution pipelines to automate cohort construction, statistical analysis, and result reporting, while ...

16
Federated penalized piecewise exponential model for horizontally distributed survival data: FedPPEM
2026-02-12 health informatics 10.64898/2026.02.11.26346054
Top 0.8% (1.5%)

Cox proportional hazard regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data...

17
Combining phenotypic similarity and network propagation to improve performance and clinical consistency of rare disease diagnosis
2026-02-17 health informatics 10.64898/2026.02.15.26346357
Top 1.0% (1.4%)

Achieving timely diagnosis for rare diseases remains challenging due to, among other factors, phenotypic heterogeneity and incomplete clinical data. While the Solve-RD project developed a phenotype-based gene prioritisation method, this approach did not account for the clinical consistency among related diseases in Orphanet's hierarchical classifications. We present a phenotype-based computational pipeline that ranks candidate ORPHAcodes based on patient phenotypes. The pipeline computes patient-diseas...

18
Distinguishing causal from tagging enhancers using single-cell multiome data
2026-02-17 genetic and genomic medicine 10.64898/2026.02.15.26346353
Top 1% (1.4%)

Methods that analyze single-cell RNA-seq+ATAC-seq multiome data have shown promise in linking enhancers to target genes by correlating chromatin accessibility with gene expression across cells. However, correlations among ATAC-seq peaks may induce non-causal tagging peak-gene links (analogous to tagging associations in GWAS); indeed, we confirm that tagging effects induced by peak co-accessibility are pervasive in peak-gene linking. We defined two scores for each ATAC-seq peak: co-accessibility ...

19
A large deletion spanning multiple enhancers near PITX2 increases primary open-angle glaucoma risk
2026-03-02 ophthalmology 10.64898/2026.02.26.25342774
Top 1% (1.4%)

Importance: Genome-wide association studies have identified hundreds of common single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) associated with primary open-angle glaucoma (POAG) risk, though these variants have modest effect sizes and individually may have minor contributions to disease development. As whole-genome sequencing data is becoming more readily available, structural variants and other complex genomic features can be interrogated for contribution to disease...

20
Rare Coding Variant Associations With Primary Open-Angle Glaucoma In African Ancestry: A Multi-Cohort Exome-Wide Meta Analysis
2026-02-27 ophthalmology 10.64898/2026.02.25.26347141
Top 1% (1.3%)

Primary open-angle glaucoma (POAG) disproportionately affects individuals of African ancestry, yet rare coding variation in this population remains understudied. To address this gap, we performed a multi-cohort exome-wide meta-analysis across POAAGG, PMBB, All of Us, and UK Biobank, including 4,815 POAG cases and 22,922 controls of genetically inferred African ancestry. Although no gene reached exome-wide significance, we identified several suggestive gene-level associations driven by rare varia...