Bioinformatics
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
The integration of causal effect estimates from multiple Mendelian Randomization studies has become increasingly popular. However, the presence of overlapping databases compromises traditional meta-analysis, leading to inflated variance and reduced statistical power. Here, we propose JointMR, a joint likelihood-based approach designed to integrate multiple GWAS summary databases while explicitly accounting for the covariance matrix of the Wald ratio estimates. Specifically, to accommodate potent...
Polygenic risk scores (PRSs) quantify an individual's genetic susceptibility to complex traits and diseases. Conventional PRSs, which are based on linear models, perform poorly for phenotypes with skewed distributions or with genetic effects that vary across the distribution. We propose a quantile regression-based PRS (QPRS) that can capture quantile-specific genetic effects. While existing PRSs provide only a single score, QPRS models genetic influences at multiple quantiles of the phenotype, th...
Motivation: Fanconi anemia (FA) is a rare disease mainly caused by biallelic pathogenic variants, including structural variants such as large deletions and insertions in FA genes. Currently, variant detection is based on short-read sequencing and probe-based approaches. However, determining the exact genomic breakpoint or achieving allelic discrimination remains challenging. Nanopore-based long-read sequencing enables comprehensive detection of FA variants, but a unified bioinformatic analysis p...
Advanced spatially resolved transcriptomic (SRT) technologies preserve the spatial context of gene expression within tissues, enabling the study of context-dependent transcriptional regulation. Here, we propose VISGP, a variational sparse Gaussian process method for analyzing spatially variable genes (SVGs) and cellular interactions in such data. VISGP utilizes variational inference and a sparse Gaussian process approximation, which efficiently models the posterior distribution with a set of indu...
Recent studies have shown that expression QTLs, even from trait-related tissues, explain only a small fraction of complex trait heritability. A natural strategy to close this gap is to incorporate molecular QTLs (molQTLs) beyond gene expression, across diverse tissue/cellular contexts. Yet, integrating such QTL data presents analytical challenges. Molecular traits often share QTLs or have QTLs in high LD, complicating the attribution of GWAS signals to specific molecular traits. Our simulations showed ...
Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies [1], but factors including patterns of healthcare use shape which diagnoses are recorded and how, leading to confounding effects in genetic associations with EHR codes [2]. In this study, we propose EDGAR, a deep learning framework that recovers lifetime disease liability from EHRs by aligning diagnostic codes with clinically validated measures and disease labels in a set of individuals prioritized through a...
Polygenic scores (PGS) have emerged as an important tool for genetic risk prediction in medicine to identify individuals at high risk for disease. A major limitation in their implementation is the apparent disagreement among scores for the same individual, which decreases their interpretability and utility in clinical settings. Here we show that the poor agreement across PGSes for type 2 diabetes (T2D) is fully explained by statistical uncertainty in PGS-based prediction; individual-level uncertainty ...
The Clinical Pharmacogenetics Implementation Consortium (CPIC) bases its drug-gene recommendations on the assignment of star alleles, which map known genotypes to defined functional categories and corresponding drug dosage guidelines. The star allele framework, first proposed in 1996 for the CYP gene family and later formalized with CPIC's establishment in 2010 [1, 2], remains foundational to pharmacogenomics. However, this system has notable limitations. Its dependence on a restricted set of ben...
Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRSs, leading to the development of numerous methods for PRS construction. Benchmarking these methods is therefore essential for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenoty...
Genetic diagnosis remains a formidable challenge characterized by a diagnostic odyssey that spans years, with over half of patients with rare diseases, which affect more than 300 million people worldwide, remaining undiagnosed. Clinicians must navigate through thousands of candidate variants against a noisy and fragmented literature landscape, a task that overwhelms human cognitive capacity and conventional decision-making approaches. Recent advances in agentic artificial intelligence systems have demonstra...
Structured phenotypic annotations linked to genetic data can drive diagnostic insight and therapeutic discovery in complex diseases. However, poor research access to the rich ...
Digenic alterations can produce phenotypes such as synthetic lethality or digenic disease that are not observed upon individual gene perturbation, often by disrupting compensatory or redundant biological mechanisms. We hypothesized that gene pairs underlying such phenotypes share, when considered jointly, biological network properties analogous to those of essential genes or monogenic Mendelian disease genes. To test this hypothesis, we developed PAGAN, a graph representation learning framework ...
Biobanks with longitudinal measurements have advanced our understanding of time-to-event (TTE) traits including age-of-onset and disease progression. However, limited work has characterized the heritability of TTE traits, a key parameter for comparisons of total association and predictive power. Here, we present COXMM, a Cox proportional hazards mixed model for estimating TTE heritability. Simulations show our model achieves nearly unbiased results, whereas non-TTE approaches severely underestima...
BACKGROUND: Genetic variant curation, an important step in the implementation of Genomic Medicine, requires literature-guided comparison of variant prevalence in affected individuals versus healthy controls. This evidence is categorized as the PS4 evidence code by the AMP/ACMG variant interpretation guidelines and its manual extraction is a major bottleneck in clinical variant curation. This study aimed to evaluate whether reasoning-capable large language models (LLMs) can support guideline-constr...
Objective: Systematic clinical phenotyping using Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization methods, which rank candidate diseases for a patient from HPO terms, face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We ...
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic d...
Clinical decision making often relies on expert judgment guided by established guidelines, which can be challenging to standardize and too abstract to implement directly. For example, selecting between gene panels and whole exome/genome sequencing (WES/WGS) for rare disease diagnosis frequently requires interpretation of evidence-based recommendations from the American College of Medical Genetics and Genomics (ACMG) guideline. Traditional machine learning (ML) models predicting suitable genetic tests often f...
Deep learning-based facial phenotyping represents a major paradigm shift in the diagnosis of rare and ultra-rare genetic disorders. By capturing disease-specific craniofacial "gestalts" that are often subtle, overlapping, and overlooked in routine clinical practice, these technologies surpass the traditional limits of dysmorphology assessment. Despite this, data scarcity and stringent privacy policies constrain centralized model training and its clinical translation. Swarm learning, a decentral...
Polygenic scores (PGSs) are widely used to summarize the joint genetic effects for disease-related traits. However, while age-dependent genetic effects are increasingly recognized, their integration into PGSs remains underexplored. Kidney function, assessed by estimated glomerular filtration rate (eGFR), has strong age-related genetic effects, and prediction of kidney function decline is an unmet need. We develop an age-informative PGS for quantitative traits by generating age-specific weights ...
Glaucoma is the leading cause of irreversible blindness. Vision loss is preventable with timely treatment, but early detection is challenging, leaving ~50% of cases undiagnosed and highlighting the need for improved risk assessment tools. We developed a polygenic risk score (PRS) using data from >6 million individuals. PRS performance was exceptional in European ancestries; top 10% PRS individuals had 10-fold increased risk (OR=10.0) relative to the remainder. Performance remained good across all major an...