Bioinformatics
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
Motivation: Fanconi anemia (FA) is a rare disease mainly caused by biallelic pathogenic variants, including structural variants such as large deletions and insertions in FA genes. Currently, variant detection is based on short-read sequencing and probe-based approaches. However, determining the exact genomic breakpoint or achieving allelic discrimination remains challenging. Nanopore-based long-read sequencing enables comprehensive detection of FA variants, but a unified bioinformatic analysis p...
Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies [1], but factors including patterns of healthcare use shape which diagnoses are recorded and how, leading to confounding effects in genetic associations with EHR codes [2]. In this study we propose EDGAR, a deep learning framework that recovers lifetime disease liability from EHRs by aligning diagnostic codes with clinically validated measures and disease labels in a set of individuals prioritized through a...
Polygenic scores (PGS) have emerged as an important tool for genetic risk prediction in medicine to identify individuals at high risk for disease. A major limitation in their implementation is the apparent disagreement among scores for the same individual, which decreases their interpretability and utility in clinical settings. Here we show that the poor agreement across PGSes for type 2 diabetes (T2D) is fully explained by statistical uncertainty in PGS-based prediction; individual-level uncertainty ...
The Clinical Pharmacogenetics Implementation Consortium (CPIC) bases its drug-gene recommendations on the assignment of star alleles, which map known genotypes to defined functional categories and corresponding drug dosage guidelines. The star allele framework, first proposed in 1996 for the CYP gene family and later formalized with CPIC's establishment in 2010 [1, 2], remains foundational to pharmacogenomics. However, this system has notable limitations. Its dependence on a restricted set of ben...
Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRSs, leading to the development of numerous methods for PRS construction. Benchmarking these methods is therefore essential for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenoty...
Genetic diagnosis remains a formidable challenge, characterized by a diagnostic odyssey that spans years: rare diseases affect more than 300 million people worldwide, and over half of these patients remain undiagnosed. Clinicians must navigate through thousands of candidate variants against a noisy and fragmented literature landscape, a task that overwhelms human cognitive capacity and conventional decision-making approaches. Recent advances in agentic artificial intelligence systems have demonstra...
Biobanks with longitudinal measurements have advanced our understanding of time-to-event (TTE) traits including age-of-onset and disease progression. However, limited work has characterized the heritability of TTE traits, a key parameter for comparisons of total association and predictive power. Here, we present COXMM, a Cox proportional hazard mixed model for estimating TTE heritability. Simulations show our model achieves nearly unbiased results, whereas non-TTE approaches severely underestima...
BACKGROUND: Genetic variant curation, an important step in the implementation of Genomic Medicine, requires literature-guided comparison of variant prevalence in affected individuals versus healthy controls. This evidence is categorized as the PS4 evidence code by the AMP/ACMG variant interpretation guidelines and its manual extraction is a major bottleneck in clinical variant curation. This study aimed to evaluate whether reasoning-capable large language models (LLMs) can support guideline-constr...
Objective: Systematic clinical phenotyping using the Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization methods (which rank candidate diseases from a patient's HPO terms) face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We ...
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic d...
Clinical decision making often relies on expert judgment guided by established guidelines, which can be difficult to standardize and too abstract to implement directly. For example, selecting between gene panels and whole exome/genome sequencing (WES/WGS) for rare disease diagnosis frequently requires interpretation of evidence-based recommendations from the American College of Medical Genetics and Genomics (ACMG) guideline. Traditional machine learning (ML) models predicting suitable genetic tests often f...
Genome-wide association studies (GWAS) have implicated tens of thousands of genetic variants associated with complex traits and polygenic diseases. Colocalizing GWAS variants with variants that may regulate gene expression, via expression quantitative trait loci (eQTL) mapping, has successfully led to the identification of disease-critical genes and their cell types of action. Recent studies predominantly colocalize proximal cis-eQTLs, which are estimated to regulate ~10% of variance in gene e...
Background: Most rare coding variants in monogenic disease genes remain classified as Variants of Uncertain Significance (VUS), limiting their use in clinical care. Many variant classifications have been submitted to ClinVar, often with rich free-text summaries of the evidence underlying each classification. These narratives are not standardized and are difficult to mine systematically, making it challenging to identify variants that might be reclassified as new evidence becomes available. Method...
Abstract: Longitudinal healthcare surveys frequently contain inconsistencies in self-reported onset ages, where participants report different ages for the same condition between enrollment and follow-up surveys. We propose two methods to handle this challenge. First, we introduce a procedure that aggregates inconsistency patterns to construct participant-level reliability scores, enabling researchers to stratify participants and prioritize analysis on high-reliability cohorts. Seco...
Acquiring insights from electronic health records (EHRs) is slowed by manual analytical workflows that limit scalability and reproducibility. We present LATCH (LLM-Assisted Testing of Clinical Hypotheses), an agentic framework that converts natural language clinical hypotheses into fully auditable analyses on structured EHR data. LATCH integrates LLM-assisted semantic layers with deterministic execution pipelines to automate cohort construction, statistical analysis, and result reporting, while ...
Cox proportional hazard regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data...
Achieving timely diagnosis for rare diseases remains challenging due to, among other factors, phenotypic heterogeneity and incomplete clinical data. While the Solve-RD project developed a phenotype-based gene prioritisation method, this approach did not account for the clinical consistency among related diseases in Orphanet's hierarchical classifications. We present a phenotype-based computational pipeline that ranks candidate ORPHAcodes based on patient phenotypes. The pipeline computes patient-diseas...
Methods that analyze single-cell RNA-seq+ATAC-seq multiome data have shown promise in linking enhancers to target genes by correlating chromatin accessibility with gene expression across cells. However, correlations among ATAC-seq peaks may induce non-causal tagging peak-gene links (analogous to tagging associations in GWAS); indeed, we confirm that tagging effects induced by peak co-accessibility are pervasive in peak-gene linking. We defined two scores for each ATAC-seq peak: co-accessibility ...
Importance: Genome-wide association studies have identified hundreds of common single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) associated with primary open-angle glaucoma (POAG) risk, though these variants have modest effect sizes and individually may have minor contributions to disease development. As whole-genome sequencing data is becoming more readily available, structural variants and other complex genomic features can be interrogated for contribution to disease...
Primary open-angle glaucoma (POAG) disproportionately affects individuals of African ancestry, yet rare coding variation in this population remains understudied. To address this gap, we performed a multi-cohort exome-wide meta-analysis across POAAGG, PMBB, All of Us, and UK Biobank, including 4,815 POAG cases and 22,922 controls of genetically inferred African ancestry. Although no gene reached exome-wide significance, we identified several suggestive gene-level associations driven by rare varia...