Bioinformatics

24 training papers (2019-06-25 to 2026-03-07)

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
GWASHub: An Automated Cloud-Based Platform for Genome-Wide Association Study Meta-Analysis
2025-10-23 genetic and genomic medicine 10.1101/2025.10.21.25338463
#1 (21.1%)

Genome-wide association studies (GWAS) often aggregate data from millions of participants across multiple cohorts using meta-analysis to maximise power for genetic discovery. The increase in availability of genomic biobanks, together with a growing focus on phenotypic subgroups, genetic diversity, and sex-stratified analyses, has led GWAS meta-analyses to routinely produce hundreds of summary statistic files accompanied by detailed meta-data. Scalable infrastructures for data handling, quality c...

2
AutoMRAI: A Multi-Omics Causal Inference Platform Using Structural Equation Modelling
2025-11-19 health informatics 10.1101/2025.11.17.25340455
#1 (18.7%)

The integration of causal inference, artificial intelligence (AI), and multi-omics data represents a transformative frontier for unravelling the complex mechanisms underlying health and disease. Traditional observational epidemiology is limited by confounding and reverse causation, while statistical frameworks such as Mendelian randomization (MR) and structural equation modelling (SEM) enable more robust causal inference. Recent advances in AI and machine learning have further expanded these cap...

3
PheWAS-ME: A web-app for interactive exploration of multimorbidity patterns in PheWAS
2019-10-24 health informatics 10.1101/19009480
#1 (17.6%)

Summary: Electronic health records (EHRs) linked with a DNA biobank provide unprecedented opportunities for biomedical research in precision medicine. The phenome-wide association study (PheWAS) is a widely used technique for the evaluation of relationships between genetic variants and a large collection of clinical phenotypes recorded in EHRs. PheWAS analyses are typically presented as static tables and charts of summary statistics obtained from statistical tests of association between a genetic ...

4
Empowering GWAS Discovery through Enhanced Genotype Imputation
2023-12-19 genetic and genomic medicine 10.1101/2023.12.18.23300143
#1 (17.2%)

Motivation: Genotype imputation is a powerful tool for inferring missing genotype data in large-scale genomic studies. Over the last two decades, multiple research groups have developed a number of imputation algorithms, which continue improving in speed and overall accuracy. However, accurate imputation of rare and infrequent variants remains a challenge. Results: Here we present Selphi, a novel genotype imputation algorithm based on the Positional Burrows-Wheeler Transform (PBWT) and a new heurist...

5
WebGWAS: A web server for instant GWAS on arbitrary phenotypes
2024-12-12 genetic and genomic medicine 10.1101/2024.12.11.24318870
#1 (17.2%)

Abstract: Complex disease genetics is a key area of research for reducing disease and improving human health. Genome-wide association studies (GWAS) help in this research by identifying regions of the genome that contribute to complex disease risk. However, GWAS are computationally intensive and require access to individual-level genetic and health information, which presents concerns about privacy and imposes costs on researchers seeking to study complex diseases. Publicly release...

6
MRSL: A phenome-wide causal discovery algorithm based on GWAS summary data
2022-06-30 genetic and genomic medicine 10.1101/2022.06.29.22277051
#1 (16.9%)

Causal discovery is a powerful tool to disclose underlying structures by analyzing purely observational data. Genetic variants can provide useful complementary information for structure learning. Here, we propose a novel algorithm MRSL (Mendelian Randomization (MR)-based Structure Learning algorithm), which combines the graph theory with univariable and multivariable MR to learn the true structure using only GWAS summary statistics. Specifically, MRSL also utilizes topological sorting to improve...

7
Efficient molecular mendelian randomization screens with LaScaMolMR.jl
2024-08-30 genetic and genomic medicine 10.1101/2024.08.29.24312805
#1 (14.8%)

Summary: The ever-growing genetic cohorts lead to an increase in scale of molecular Quantitative Trait Loci (QTL) studies, creating opportunities for more extensive two-sample Mendelian randomization (MR) investigations aiming to identify causal relationships between molecular traits and diseases. This increase led to the identification of multiple causal candidates and potential drug targets over time. However, the increase in scale of such studies and higher dimension multi-omic data come with ...

8
Segpy: a streamlined, user-friendly pipeline for variant segregation analysis
2024-12-29 genetic and genomic medicine 10.1101/2024.12.26.24319616
#1 (14.6%)

Understanding the role of genetic variants in disease is essential for diagnostics and the advancement of genomic medicine. While the advent of high-throughput sequencing has been matched by the development of sophisticated genomic analysis tools, these packages often involve complex analytical procedures that can be challenging for researchers with limited computational experience. Additionally, modern genomic datasets require high-performance computing (HPC) systems, which may be difficult to ...

9
mixWAS: An efficient distributed algorithm for mixed-outcomes genome-wide association studies
2024-01-10 genetic and genomic medicine 10.1101/2024.01.09.24301073
#1 (14.3%)

In cross-cohort studies, integrating diverse datasets, such as electronic health records (EHRs), is both essential and challenging due to cohort-specific variations, distributed data storage, and data privacy concerns. Traditional methods often require data pooling or complex data harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed EHR datasets via summary statistics. ...

10
Improving Polygenic Score Prediction for Underrepresented Groups Through Transfer Learning
2025-10-09 genetic and genomic medicine 10.1101/2025.10.08.25337572
#1 (14.1%)

The advent of big data from GWAS consortia and biobanks led to remarkable improvements in polygenic score (PGS) prediction accuracy. However, most PGS were derived using data from Europeans (EU) and performed poorly when used to predict phenotypes of non-Europeans. Transfer Learning (TL) is a technique by which knowledge gained using data from one population is used to improve a model's performance in another population. Here, we present GPTL, an R-package implementing three methods to build PGS ...

11
JasMAP: A Joint Ancestry and SNP Association Method for a Multi-way Admixed Population
2023-10-27 genetic and genomic medicine 10.1101/2023.10.26.23297617
#1 (13.9%)

The large volume of research findings submitted to the GWAS catalog in the last decade is a clear indication of the exponential progress of these studies and association approaches. This success has, however, been dimmed by recurring concerns about disparity and the lack of population diversity. As a result, researchers are now responding, and GWAS extension to diverse populations is under way. Initial GWAS methods were calibrated using European populations with long-range regions of linkage dis...

12
PennPRS: a centralized cloud computing platform for efficient polygenic risk score training in precision medicine
2025-02-10 genetic and genomic medicine 10.1101/2025.02.07.25321875
#1 (13.9%)

Polygenic risk scores (PRS) are becoming increasingly vital for risk prediction and stratification in precision medicine. However, PRS model training presents significant challenges for broader adoption of PRS, including limited access to computational resources, difficulties in implementing advanced PRS methods, and availability and privacy concerns over individual-level genetic data. Cloud computing provides a promising solution with centralized computing and data resources. Here we introduce ...

13
A Bayesian network-based framework to uncover the causal effects of genes on complex traits based on GWAS data
2022-12-27 genetic and genomic medicine 10.1101/2022.12.25.22283943
#1 (11.6%)

Deciphering the relationships between genes and complex traits could help us better understand the biological mechanisms leading to phenotypic variations and disease onset. Univariate gene-based analyses are widely used to characterize gene-phenotype relationships, but are subject to the influence of confounders. Furthermore, while some genes directly contribute to trait variations, others may exert their effects through other genes. How to quantify individual genes' direct and indirect effects ...

14
MACHINE: a robust and scalable multi-ancestry fine-mapping method using a continuous global-local shrinkage prior
2025-09-29 genetic and genomic medicine 10.1101/2025.09.28.25336857
#1 (11.4%)

Fine mapping aims to identify causal genetic variants with nonzero phenotypic effects. Leveraging genome-wide association study (GWAS) data from diverse ancestries enhances fine-mapping accuracy and resolution by exploiting differences in linkage disequilibrium (LD) and increasing sample sizes. However, existing multi-ancestry fine-mapping methods rely on discrete priors and assume that all causal variants are shared across ancestries - an assumption that may not hold in practice. Although MESuS...

15
Clustering Of Rare Variants For Causal Variants Identification And Effect Direction Classification
2024-02-23 genetic and genomic medicine 10.1101/2024.02.22.24303151
#1 (11.2%)

Several gene-based tests, e.g., sequence kernel association test, have been developed for association testing of rare single nucleotide variants (SNVs) in genomic regions with disease traits. A common limitation of these aggregate methods is their inability to discriminate potentially causal variants from null variants within the tested regions. We propose a novel clustering method to classify rare variants into null and signal variant groups using summary statistics from the gene-based tests ba...

16
Genomic privacy risks in GWAS summary statistics
2025-09-12 genetic and genomic medicine 10.1101/2025.09.09.25335252
#1 (10.8%)

The rapid advancement in sequencing technologies has exponentially increased the availability of genomic data, heightening concerns about data privacy. Despite the perceived safety of publicly accessible genome-wide association study (GWAS) summary statistics, we demonstrate that their combination with less sensitive high-dimensional phenotype data can lead to significant leakage of confidential genomic information. By transforming a linear regression model into linear programming constraints, w...

17
From Genomics Data to Causality: An Integrated Pipeline for Mendelian Randomization
2023-11-04 epidemiology 10.1101/2023.11.04.23298053
#1 (10.3%)

Background: Mendelian randomization (MR) has emerged as a valuable tool for causal inference in genetic epidemiology. Existing MR methods have issues related to pleiotropy and offer limited comprehensiveness. Here, we introduce an integrated MR analysis pipeline designed for GWAS summary statistics data. Our pipeline integrates feature selection, harmonization, and checkpoint mechanisms to improve the accuracy and reliability of MR analysis. Methods: In classical GWAS, the p-value threshold usually...

18
Unsupervised Ensemble Learning for Efficient Integration of Pre-trained Polygenic Risk Scores
2025-01-06 genetic and genomic medicine 10.1101/2025.01.06.25320058
#1 (10.3%)

The growing availability of pre-trained polygenic risk score (PRS) models has enabled their integration into real-world applications, reducing the need for extensive data labeling, training, and calibration. However, selecting the most suitable PRS model for a specific target population remains challenging, due to issues such as limited transferability, data heterogeneity, and the scarcity of observed phenotype in real-world settings. Ensemble learning offers a promising avenue to enhance the pr...

19
Overcome the Limitation of Phenome-Wide Association Studies (PheWAS): Extension of PheWAS to Efficient and Robust Large-Scale ICD Codes Analysis
2024-04-19 health informatics 10.1101/2024.04.15.24305098
#1 (10.2%)

Phenome-wide association studies (PheWAS) have become widely used for efficient, high-throughput evaluation of the relationship between a genetic factor and a large number of disease phenotypes, typically extracted from a DNA biobank linked with electronic medical records (EMR). Phecodes, billing code-derived disease case-control status, are usually used as outcome variables in PheWAS and logistic regression has been the standard choice of analysis method. Since the clinical diagnoses in EMR are...

20
Trumpet plots: Visualizing The Relationship Between Allele Frequency And Effect Size In Genetic Association Studies
2023-04-23 genetic and genomic medicine 10.1101/2023.04.21.23288923
#1 (9.1%)

Recent advances in genome-wide association study (GWAS) and sequencing studies have shown that the genetic architecture of complex diseases and traits involves a combination of rare and common genetic variants, distributed throughout the genome. One way to better understand this architecture is to visualize genetic associations across a wide range of allele frequencies. However, there is currently no standardized or consistent graphical representation for effectively illustrating these results. ...