Bioinformatics
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
Genome-wide association studies (GWAS) often aggregate data from millions of participants across multiple cohorts using meta-analysis to maximise power for genetic discovery. The increase in availability of genomic biobanks, together with a growing focus on phenotypic subgroups, genetic diversity, and sex-stratified analyses, has led GWAS meta-analyses to routinely produce hundreds of summary statistic files accompanied by detailed meta-data. Scalable infrastructures for data handling, quality c...
The integration of causal inference, artificial intelligence (AI), and multi-omics data represents a transformative frontier for unravelling the complex mechanisms underlying health and disease. Traditional observational epidemiology is limited by confounding and reverse causation, while statistical frameworks such as Mendelian randomization (MR) and structural equation modelling (SEM) enable more robust causal inference. Recent advances in AI and machine learning have further expanded these cap...
Summary: Electronic health records (EHRs) linked with a DNA biobank provide unprecedented opportunities for biomedical research in precision medicine. The phenome-wide association study (PheWAS) is a widely used technique for evaluating relationships between genetic variants and a large collection of clinical phenotypes recorded in EHRs. PheWAS analyses are typically presented as static tables and charts of summary statistics obtained from statistical tests of association between a genetic ...
Motivation: Genotype imputation is a powerful tool for inferring missing genotype data in large-scale genomic studies. Over the last two decades, multiple research groups have developed a number of imputation algorithms, which continue to improve in speed and overall accuracy. However, accurate imputation of rare and infrequent variants remains a challenge. Results: Here we present Selphi, a novel genotype imputation algorithm based on the Positional Burrows-Wheeler Transform (PBWT) and a new heurist...
Abstract: Complex disease genetics is a key area of research for reducing disease and improving human health. Genome-wide association studies (GWAS) help in this research by identifying regions of the genome that contribute to complex disease risk. However, GWAS are computationally intensive and require access to individual-level genetic and health information, which presents concerns about privacy and imposes costs on researchers seeking to study complex diseases. Publicly release...
Causal discovery is a powerful tool for uncovering underlying structures by analyzing purely observational data. Genetic variants can provide useful complementary information for structure learning. Here, we propose a novel algorithm, MRSL (Mendelian Randomization (MR)-based Structure Learning), which combines graph theory with univariable and multivariable MR to learn the true structure using only GWAS summary statistics. Specifically, MRSL also utilizes topological sorting to improve...
Summary: Ever-growing genetic cohorts have increased the scale of molecular quantitative trait loci (QTL) studies, creating opportunities for more extensive two-sample Mendelian randomization (MR) investigations that aim to identify causal relationships between molecular traits and diseases. This increase has led to the identification of multiple causal candidates and potential drug targets over time. However, the increase in scale of such studies and higher-dimensional multi-omic data come with ...
Understanding the role of genetic variants in disease is essential for diagnostics and the advancement of genomic medicine. While the advent of high-throughput sequencing has been matched by the development of sophisticated genomic analysis tools, these packages often involve complex analytical procedures that can be challenging for researchers with limited computational experience. Additionally, modern genomic datasets require high-performance computing (HPC) systems, which may be difficult to ...
In cross-cohort studies, integrating diverse datasets, such as electronic health records (EHRs), is both essential and challenging due to cohort-specific variations, distributed data storage, and data privacy concerns. Traditional methods often require data pooling or complex data harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed EHR datasets via summary statistics. ...
The advent of big data from GWAS consortia and biobanks led to remarkable improvements in polygenic score (PGS) prediction accuracy. However, most PGS were derived using data from Europeans (EU) and performed poorly when used to predict phenotypes of non-Europeans. Transfer Learning (TL) is a technique by which knowledge gained using data from one population is used to improve a model's performance in another population. Here, we present GPTL, an R package implementing three methods to build PGS ...
The large volume of research findings submitted to the GWAS catalog in the last decade is a clear indication of the exponential progress of these studies and association approaches. This success has, however, been dimmed by recurring concerns about disparity and the lack of population diversity. As a result, researchers are now responding, and GWAS extension to diverse populations is under way. Initial GWAS methods were calibrated using European populations with long-range regions of linkage dis...
Polygenic risk scores (PRS) are becoming increasingly vital for risk prediction and stratification in precision medicine. However, PRS model training presents significant challenges for broader adoption of PRS, including limited access to computational resources, difficulties in implementing advanced PRS methods, and availability and privacy concerns over individual-level genetic data. Cloud computing provides a promising solution with centralized computing and data resources. Here we introduce ...
Deciphering the relationships between genes and complex traits could help us better understand the biological mechanisms leading to phenotypic variation and disease onset. Univariate gene-based analyses are widely used to characterize gene-phenotype relationships, but are subject to the influence of confounders. Furthermore, while some genes directly contribute to trait variation, others may exert their effects through other genes. How to quantify individual genes' direct and indirect effects ...
Fine mapping aims to identify causal genetic variants with nonzero phenotypic effects. Leveraging genome-wide association study (GWAS) data from diverse ancestries enhances fine-mapping accuracy and resolution by exploiting differences in linkage disequilibrium (LD) and increasing sample sizes. However, existing multi-ancestry fine-mapping methods rely on discrete priors and assume that all causal variants are shared across ancestries - an assumption that may not hold in practice. Although MESuS...
Several gene-based tests, e.g., sequence kernel association test, have been developed for association testing of rare single nucleotide variants (SNVs) in genomic regions with disease traits. A common limitation of these aggregate methods is their inability to discriminate potentially causal variants from null variants within the tested regions. We propose a novel clustering method to classify rare variants into null and signal variant groups using summary statistics from the gene-based tests ba...
The rapid advancement in sequencing technologies has exponentially increased the availability of genomic data, heightening concerns about data privacy. Despite the perceived safety of publicly accessible genome-wide association study (GWAS) summary statistics, we demonstrate that their combination with less sensitive high-dimensional phenotype data can lead to significant leakage of confidential genomic information. By transforming a linear regression model into linear programming constraints, w...
Background: Mendelian randomization (MR) has emerged as a valuable tool for causal inference in genetic epidemiology. Existing MR methods have issues related to pleiotropy and offer limited comprehensiveness. Here, we introduce an integrated MR analysis pipeline designed for GWAS summary statistics data. Our pipeline integrates feature selection, harmonization, and checkpoint mechanisms to improve the accuracy and reliability of MR analysis. Methods: In classical GWAS, the p-value threshold usually...
The growing availability of pre-trained polygenic risk score (PRS) models has enabled their integration into real-world applications, reducing the need for extensive data labeling, training, and calibration. However, selecting the most suitable PRS model for a specific target population remains challenging, due to issues such as limited transferability, data heterogeneity, and the scarcity of observed phenotype in real-world settings. Ensemble learning offers a promising avenue to enhance the pr...
Phenome-wide association studies (PheWAS) have become widely used for efficient, high-throughput evaluation of the relationship between a genetic factor and a large number of disease phenotypes, typically extracted from a DNA biobank linked with electronic medical records (EMR). Phecodes, which encode disease case-control status derived from billing codes, are usually used as outcome variables in PheWAS, and logistic regression has been the standard choice of analysis method. Since the clinical diagnoses in EMR are...
Recent advances in genome-wide association study (GWAS) and sequencing studies have shown that the genetic architecture of complex diseases and traits involves a combination of rare and common genetic variants, distributed throughout the genome. One way to better understand this architecture is to visualize genetic associations across a wide range of allele frequencies. However, there is currently no standardized or consistent graphical representation for effectively illustrating these results. ...