Biometrics
Oxford University Press (OUP)
All preprints, ranked by how well they match Biometrics's content profile, based on 22 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Desmet, L.; Venet, D.; Trotta, L.; Burzykowski, T.; Buyse, M.
Multivariate datasets with a clustered structure are the natural framework for, e.g., multicentre clinical trials. We propose a number of methods aimed at detecting clusters with outlying correlation coefficients. While the methods can be used in a variety of settings, we focus mainly on their application to central statistical monitoring of clinical trials. In particular, we consider the issue of detecting centers (or other clusters of patients, such as regions) with outlying correlation coefficients for bivariate data in a multicenter clinical trial. It appears that, in that context, the proposed methods perform well, as we show using a simulation study and a number of real-life datasets.
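As an illustration of this kind of center-level screening (a hypothetical sketch, not the authors' exact method), one can Fisher-transform each center's correlation coefficient and flag centers that deviate strongly from a robust pooled value; the function name and threshold are illustrative choices:

```python
import numpy as np

def flag_outlying_centers(data_by_center, z_thresh=3.0):
    """Flag centers whose bivariate correlation deviates from the pooled trend.

    data_by_center: dict mapping center id -> (n_i, 2) array of bivariate data.
    Fisher's z-transform makes sample correlations roughly normal with variance
    1/(n_i - 3); centers deviating from the median z by more than z_thresh
    standard errors are flagged as outlying.
    """
    ids, zs, ws = [], [], []
    for cid, xy in data_by_center.items():
        xy = np.asarray(xy, dtype=float)
        if len(xy) < 4:
            continue  # too few patients to estimate a correlation
        r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
        ids.append(cid)
        zs.append(np.arctanh(r))   # Fisher z-transform of the correlation
        ws.append(len(xy) - 3)     # inverse of the variance of z
    zs, ws = np.array(zs), np.array(ws)
    scores = (zs - np.median(zs)) * np.sqrt(ws)  # standardized deviations
    return [cid for cid, s in zip(ids, scores) if abs(s) > z_thresh]
```

The median is used as the pooled reference so that a single aberrant center does not mask itself by shifting the pooled estimate.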
Allorant, A.; Fullman, N.; Leslie, H. H.; Eliakimu, E.; Wakefield, J.; Dieleman, J. L.; Pigott, D.; Puttkammer, N.; Reiner, R. C.
Monitoring healthcare quality at a subnational resolution is key to identifying and resolving geographic inequities and ensuring that no sub-population is left behind. Yet, health facility surveys are typically not powered to report reliable estimates at a subnational scale. In this study, we present a framework to fill this gap and jointly analyse publicly available facility survey data, allowing exploration of temporal trends and subnational disparities in healthcare quality metrics. Specifically, our Bayesian hierarchical model includes random effects to account for differences between survey instruments; space-time processes to leverage correlations in space and time; and covariates to incorporate auxiliary information. We apply this framework to Kenya, Senegal, and Tanzania - three countries with at least four rounds of standardized facility surveys each - and estimate the readiness and process quality of sick-child care over time and across subnational areas. These estimates of readiness and process quality of care over time and at a fine spatial resolution show uneven progress in improving facility-based service provision in Kenya, Senegal, and Tanzania. For instance, while overall readiness of care improved at the national level in Tanzania, geographic inequities persisted; in contrast, Senegal and Kenya experienced stagnation in overall readiness at the national level, but disparities grew across subnational areas. Overall, providers adhered to about one-third of the clinical guidelines for managing sick-child illnesses at the national level. Yet across subnational units, such adherence varied greatly (e.g., 25% to 85% between counties of Kenya in 2020). Our new approach enables precise estimation of changes in the spatial distribution of healthcare quality metrics over time, at a programmatic spatial resolution, and with accompanying uncertainty estimates.
Use of our framework will provide new insights at a policy-relevant spatial resolution for national and regional decision-makers and international funders.
Park, S. W.; Sun, K.; Champredon, D.; Li, M.; Bolker, B. M.; Earn, D. J. D.; Weitz, J. S.; Grenfell, B. T.; Dushoff, J.
Generation intervals and serial intervals are critical quantities for characterizing outbreak dynamics. Generation intervals characterize the time between infection and transmission, while serial intervals characterize the time between the onset of symptoms in a chain of transmission. They are often used interchangeably, leading to misunderstanding of how these intervals link the epidemic growth rate r and the reproduction number R. Generation intervals provide a mechanistic link between r and R but are harder to measure via contact tracing. While serial intervals are easier to measure from contact tracing, recent studies suggest that the two intervals give different estimates of R from r. We present a general framework for characterizing epidemiological delays based on cohorts (i.e., a group of individuals that share the same event time, such as symptom onset) and show that forward-looking serial intervals, which correctly link R with r, are not the same as "intrinsic" serial intervals, but instead change with r. We provide a heuristic method for addressing potential biases that can arise from not accounting for changes in serial intervals across cohorts and apply the method to estimating R for the COVID-19 outbreak in China using serial-interval data; our analysis shows that using incorrectly defined serial intervals can severely bias estimates. This study demonstrates the importance of early epidemiological investigation through contact tracing and provides a rationale for reassessing generation intervals, serial intervals, and R estimates for COVID-19.
Significance Statement: The generation- and serial-interval distributions are key, but different, quantities in outbreak analyses. Recent theoretical studies suggest that the two distributions give different estimates of the reproduction number R from the exponential growth rate r; however, both intervals, by definition, describe disease transmission at the individual level.
Here, we show that the serial-interval distribution, defined from the correct reference time and cohort, gives the same estimate of R as the generation-interval distribution. We then apply our framework to serial-interval data from the COVID-19 outbreak in China. While our study supports the use of serial-interval distributions in estimating R, it also reveals necessary changes to the current understanding and applications of serial-interval distributions.
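The mechanistic r-to-R link that interval distributions provide is the Euler-Lotka relation, R = 1 / Σ_τ w(τ) e^{-rτ}, with w the interval distribution. A minimal discretized sketch of that standard relation (our illustration, not the paper's cohort-based framework):

```python
import numpy as np

def reproduction_number(r, interval_pmf):
    """R from the exponential growth rate r via the discretized Euler-Lotka
    equation, R = 1 / sum_tau w(tau) * exp(-r * tau), where w is the
    interval distribution over tau = 1, 2, ... time steps."""
    w = np.asarray(interval_pmf, dtype=float)
    taus = np.arange(1, len(w) + 1)
    return float(1.0 / np.sum(w * np.exp(-r * taus)))
```

At r = 0 this gives R = 1 regardless of w; for a growing epidemic (r > 0), longer intervals imply a larger R, which is why plugging a misdefined serial-interval distribution into this relation distorts R estimates.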
Jiang, C.; Fang, F.; Talbot, D.; Schnitzer, M.
The Test-Negative Design (TND), which involves recruiting care-seeking individuals who meet predefined clinical case criteria, offers valid statistical inference for Vaccine Effectiveness (VE) using data collected through passive surveillance, making it cost-efficient and timely. Infectious disease epidemiology often involves interference, where the treatment and/or outcome of one individual can affect the outcomes of others, rendering standard causal estimands ill-defined; ignoring such interference can bias VE evaluation and lead to ineffective vaccination policies. This article addresses the estimation of causal estimands for VE in the presence of partial interference using TND samples. Partial interference means that the vaccination of units within the same group/cluster may influence the outcomes of other members of the cluster. We define population direct, spillover, total, and overall effects using the geometric risk ratio, all of which are identifiable under TND sampling. We investigate various stochastic policies for vaccine allocation in a counterfactual scenario and identify policy-relevant VE causal estimands. We propose inverse-probability weighted (IPW) estimators for the policy-relevant VE causal estimands with partial interference under the TND, and explore the statistical properties of these estimators.
Yang, P.; Hubert, S. M.; Futreal, P. A.; Song, X.; Zhang, J.; Lee, J. J.; Wistuba, I.; Yuan, Y.; Zhang, J.; Li, Z.
Intratumor heterogeneity (ITH) of tumor-infiltrated leukocytes (TILs) is an important phenomenon of cancer biology with potentially profound clinical impacts. Multiregion gene expression sequencing data provide a promising opportunity that allows for explorations of TILs and their intratumor heterogeneity for each subject. Although several existing methods are available to infer the proportions of TILs, considerable methodological gaps exist for evaluating intratumor heterogeneity of TILs with multi-region gene expression data. Here, we develop ICeITH, immune cell estimation reveals intratumor heterogeneity, a Bayesian hierarchical model that borrows cell type profiles as prior knowledge to decompose mixed bulk data while accounting for the within-subject correlations among tumor samples. ICeITH quantifies intratumor heterogeneity by the variability of targeted cellular compositions. Through extensive simulation studies, we demonstrate that ICeITH is more accurate in measuring relative cellular abundance and evaluating intratumor heterogeneity compared with existing methods. We also assess the ability of ICeITH to stratify patients by their intratumor heterogeneity score and associate the estimations with the survival outcomes. Finally, we apply ICeITH to two multi-region gene expression datasets from lung cancer studies to classify patients into different risk groups according to the ITH estimations of targeted TILs that shape either pro- or anti-tumor processes. In conclusion, ICeITH is a useful tool to evaluate intratumor heterogeneity of TILs from multi-region gene expression data.
Duan, Y.; Guo, S.; Yan, H.; Wang, W.; Mueller, P.
We propose spatially aligned random partition (SARP) models for clustering multiple types of experimental units, incorporating dependence in a subvector of the cluster-specific parameters, e.g., a subvector of spatial information, as in the motivating application. The approach is developed for inference about co-localization of immune, stromal, and tumor cell sub-populations. The aim is to understand the recruitment of immune and stromal cell subtypes by tumor cells, formalized as spatial dependence of the corresponding homogeneous cell subpopulations. This is achieved by constructing Bayesian nonparametric random partition models for the different types of cells, with a hierarchically structured prior introducing the desired dependence. Specifically, we use Pitman-Yor priors and add dependence in the base measure for spatial features, while leaving the base measure corresponding to gene expression features a priori independent across different types of cells. Details of the model construction are designed to lead to a convenient MCMC algorithm for posterior inference. Simulation studies show favorable performance in identifying co-localization between types of cells. We apply the proposed approach to colorectal cancer (CRC) data and discover subtypes of immune and stromal cells that are spatially aligned with specific tumor regions.
Acharyya, S.; Kang, J.; Baladandayuthapani, V.
Modern spatial transcriptomic profiling techniques facilitate spatially resolved, high-dimensional assessment of cellular gene transcription across the tumor domain. The characterization of spatially varying gene networks enables the discovery of heterogeneous regulatory patterns and biological mechanisms underlying cancer etiology. We propose a spatial Graphical Regression (sGR) model to infer spatially varying graphs for high-resolution multivariate spatial data. Unlike existing graphical models, sGR explicitly incorporates spatial information to infer non-linear conditional dependencies through Gaussian processes. It conducts sparse estimation and selection of spatially varying edges, at both spatial and sub-spatial levels. Extensive simulation studies illustrate the effectiveness of sGR for spatial graph structure recovery and estimation accuracy. Our methods are motivated by and applied to two spatial transcriptomics data sets in breast and prostate cancer, to investigate spatially varying gene connectivity patterns across the tumor micro-environment. Our findings reveal several novel spatial interactions between genes related to immune activation and carcinogenesis regulation, such as CD19 in breast cancer and the ARHGAP family in prostate cancer. We also provide a modular software package for fitting and visualization of spatially varying graphs.
Joosten, R.; Abhishta, A.
We design a procedure (the complete Python code may be obtained at https://github.com/abhishta91/antibody_montecarlo) using Monte Carlo (MC) simulation to establish the point estimators described below and confidence intervals for the base rate of occurrence of an attribute (e.g., antibodies against Covid-19) in an aggregate population (e.g., medical care workers) based on a test. The requirements for the procedure are the test's sample size (N), the total number of positives (X), and data on the test's reliability. The modus is the prior which generates the largest frequency of observations in the MC simulation with precisely the number of test positives (maximum-likelihood estimator). The median is the upper bound of the set of priors accounting for half of the total relevant observations in the MC simulation with numbers of positives identical to the test's number of positives.
Our rather preliminary findings are:
- The median and the confidence intervals suffice universally.
- The estimator [Formula] may be outside of the two-sided 95% confidence interval.
- Conditions such that the modus, the median, and another promising estimator which takes the reliability of the test into account are quite close.
- Conditions such that the modus and the latter estimator must be regarded as logically inconsistent.
- Conditions inducing rankings among various estimators relevant for issues concerning over- or underestimation.
JEL codes: C11, C13, C63
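A compressed sketch of such a Monte Carlo procedure (our illustration; the authors' full implementation is at the linked repository, and the function name and grid here are our own choices): for each candidate base rate, simulate test campaigns with the given sensitivity and specificity and count how often exactly X positives occur.

```python
import numpy as np

def base_rate_estimators(N, X, sensitivity, specificity,
                         priors=None, sims=2000, seed=0):
    """Monte Carlo sketch of the modus and median estimators described above.

    For each candidate base rate p, simulate `sims` campaigns of N subjects
    and count how often exactly X test positives occur. The modus is the p
    with the highest count; the median splits the total relevant counts.
    """
    rng = np.random.default_rng(seed)
    if priors is None:
        priors = np.linspace(0.0, 1.0, 101)
    counts = np.zeros(len(priors))
    for i, p in enumerate(priors):
        infected = rng.binomial(N, p, size=sims)
        # positives = true positives among infected + false positives among the rest
        tp = rng.binomial(infected, sensitivity)
        fp = rng.binomial(N - infected, 1.0 - specificity)
        counts[i] = np.sum(tp + fp == X)
    modus = priors[np.argmax(counts)]
    cum = np.cumsum(counts)
    median = priors[np.searchsorted(cum, cum[-1] / 2.0)]
    return modus, median, counts
```

With N = 1000, X = 100, and a 95%/95% test, expected positives are 0.05 + 0.9p of N, so both estimators land near p ≈ 0.056 rather than the naive X/N = 0.10.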
Zhang, L.; Sun, L.
In a case-control association study, deviation from Hardy-Weinberg equilibrium (HWE), or Hardy-Weinberg disequilibrium (HWD), in the control group is usually considered as evidence for potential genotyping error, and the corresponding SNP is then removed from the study. On the other hand, assuming HWE holds in the study population, a truly associated SNP is expected to be out of HWE in the case group. Efforts have been made to combine association tests with tests of HWE in the cases to increase the power of detecting disease susceptibility loci (Song and Elston (2006), Wang and Shete (2010)). However, these existing methods are ad hoc and sensitive to model assumptions. Utilizing the recent robust allele-based (RA) regression model for conducting allelic association tests (Zhang and Sun (2020)), here we propose a joint RA test that naturally integrates association evidence from the traditional association test and a test that evaluates the difference in HWD between the case and control groups. The proposed test is robust to genotyping error, as well as to potential HWD in the population attributed to factors that are unrelated to the phenotype-genotype association. We provide the asymptotic distribution of the proposed test statistic so that it is easy to implement, and we demonstrate the accuracy and efficiency of the test through extensive simulation studies and an application.
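For reference, the basic building block such joint tests extend is the standard one-degree-of-freedom goodness-of-fit test of HWE from genotype counts (the textbook statistic, not the proposed RA joint test):

```python
import numpy as np

def hwe_chisq(n_AA, n_Aa, n_aa):
    """One-df chi-square goodness-of-fit statistic for Hardy-Weinberg
    equilibrium from genotype counts in a single group."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # sample frequency of allele A
    expected = np.array([p * p, 2 * p * (1 - p), (1 - p) ** 2]) * n
    observed = np.array([n_AA, n_Aa, n_aa])
    return float(np.sum((observed - expected) ** 2 / expected))
```

Applied separately to cases and controls, a large statistic in controls suggests genotyping error, while a case-only deviation is the association signal the joint test exploits.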
Ma, C.; Kingsford, C.
Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation of mutual information when the sample size is not the limiting factor for estimating the PMF or PDF accurately. We compare the uncorrected and corrected estimators on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators.
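The uncorrected plug-in step that the corrected estimator builds on can be sketched as follows; the paper's contribution is to replace the observed PMF with one deconvolved for measurement error, whereas this sketch shows only the standard uncorrected estimator:

```python
import numpy as np

def mutual_information(joint_pmf):
    """Plug-in mutual information (in nats) from a joint PMF matrix.

    I(X;Y) = sum_xy p(x,y) * log(p(x,y) / (p(x) p(y))), with zero cells
    skipped since lim p->0 of p log p is 0.
    """
    p = np.asarray(joint_pmf, dtype=float)
    p = p / p.sum()                        # normalize defensively
    px = p.sum(axis=1, keepdims=True)      # marginal of X (column vector)
    py = p.sum(axis=0, keepdims=True)      # marginal of Y (row vector)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))
```

When the PMF is estimated from measurements blurred by error, this plug-in value is biased, which is the problem the corrected estimator addresses.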
Wang, R.; Fang, L.; Wang, Y.; Jin, J.
Leveraging observational data to understand the associations between risk factors and disease outcomes and conduct disease risk prediction is a common task in epidemiology. While traditional linear regression and other machine learning models have been extensively implemented for this task, the associations between risk factors and disease outcomes are typically deemed fixed. In many cases, however, such associations may vary by some underlying features of the individuals, which may involve certain subpopulation characteristics and environmental factors. While data for these latent features may not be available, the observed data on risk factors may have captured some proportion of the variation in these features. Thus extracting latent factors from risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model with some coefficients varying as functions of latent features extracted from the risk factors. We have demonstrated the superiority of our approach in various data settings via simulation studies. An application on a dataset for lung cancer patients from The Cancer Genome Atlas (TCGA) Program showed that our approach led to a 6% - 118% increase in (AUC-0.5) for distinguishing between different lung cancer stages compared to the classic lasso and elastic net regressions and identified interesting latent effect modifications associated with certain gene pathways.
Feng, S.; Bilinski, A.
Researchers frequently employ difference-in-differences (DiD) to study the impact of public health interventions on infectious disease outcomes. DiD assumes that treatment and non-experimental comparison groups would have moved in parallel in expectation, absent the intervention (the "parallel trends assumption"). However, the plausibility of the parallel trends assumption in the context of infectious disease transmission is not well understood. Our work bridges this gap by formalizing the epidemiological assumptions required for common DiD specifications, positing an underlying Susceptible-Infectious-Recovered (SIR) data-generating process. We demonstrate that popular specifications can encode strict epidemiological assumptions. For example, DiD modeling incident case numbers or rates as outcomes will produce biased treatment effect estimates unless untreated potential outcomes for treatment and comparison groups come from a data-generating process with the same initial infection rate and equal transmission rates at each time step. Applying a log transformation or modeling log growth allows for different initial infection rates under an "infinite susceptible population" assumption, but invokes conditions on transmission parameters. We then propose alternative DiD specifications based on epidemiological parameters - the effective reproduction number and the effective contact rate - that are both more robust to differences between treatment and comparison groups and can be extended to complex transmission dynamics. Because the incidence and log incidence models show minimal power differences, we recommend the more robust log specification as a default. Our alternative specifications have lower power than incidence or log incidence models, but higher power than log growth models. We illustrate the implications of our work by re-analyzing published studies of COVID-19 mask policies.
Significance Statement: Difference-in-differences is a popular observational study design for policy evaluation.
However, it may not perform well when modeling infectious disease outcomes. Although many COVID-19 DiD studies in the medical literature have used incident case numbers or rates as the outcome variable, we demonstrate that this and other common model specifications may encode strict epidemiological assumptions as a result of non-linear infectious disease transmission. We unpack the assumptions embedded in popular DiD specifications assuming a Susceptible-Infected-Recovered data-generating process and propose more robust alternatives, modeling the effective reproduction number and effective contact rate.
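The role of the infinite-susceptible-population assumption can be seen in a minimal discrete-time SIR simulation (our illustration, with made-up parameters): two untreated groups that differ only in initial infections have non-parallel incidence levels, but near-parallel log incidence while susceptible depletion is negligible.

```python
import numpy as np

def sir_incidence(beta, gamma, i0, steps):
    """Discrete-time SIR; returns per-step incident infections as fractions
    of the population (susceptible depletion included)."""
    s, i = 1.0 - i0, i0
    incidence = []
    for _ in range(steps):
        new = beta * s * i        # incident infections this step
        s -= new
        i += new - gamma * i      # recoveries leave the infectious pool
        incidence.append(new)
    return np.array(incidence)
```

With beta = 0.5, gamma = 0.2, and initial prevalences of 1e-5 versus 1e-4, the level gap widens each step while the log-incidence gap stays near log(10) during the early exponential phase, matching the abstract's point that a levels DiD requires equal initial infections but a log specification does not.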
Zhang, J.; Gonzales, S.; Liu, J.; Gao, X. R.; Wang, X.
Gene-based analyses offer a useful alternative and complement to the usual single nucleotide polymorphism (SNP) based analysis for genome-wide association studies (GWASs). Using appropriate weights (pre-specified or eQTL-derived) can boost statistical power, especially for detecting weak associations between a gene and a trait. Because the sparsity level or association directions of the underlying association patterns in real data are often unknown and access to individual-level data is limited, we propose an optimal weighted combination (OWC) test applicable to summary statistics from GWAS. This method includes burden tests, the weighted sum of squared score (SSU) test, the weighted sum statistic (WSS), and the score test as special cases. We analytically prove that aggregating the variants in one gene is equivalent to using a weighted combination of Z-scores for each variant based on the score test method. We also numerically illustrate that our proposed test outperforms several existing comparable methods via simulation studies. Lastly, we use schizophrenia GWAS data and a fasting glucose GWAS meta-analysis dataset to demonstrate that our method outperforms existing methods in real data analyses. Our proposed test is implemented in the R program OWC, which is freely and publicly available.
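The burden-style special case of such weighted combinations can be sketched directly from summary statistics; the weights w and LD matrix R below are illustrative inputs, not the OWC optimal weights:

```python
import numpy as np

def weighted_z_stat(z, w, ld):
    """Burden-style test from summary statistics: T = w'z / sqrt(w'Rw),
    standard normal under the null when R is the correlation (LD) matrix
    of the per-variant Z-scores."""
    z, w, ld = (np.asarray(a, dtype=float) for a in (z, w, ld))
    return float(w @ z / np.sqrt(w @ ld @ w))
```

The denominator accounts for correlation between variants, so the statistic remains calibrated when variants in the gene are in LD.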
Augustin, D.; Lambert, B.; Wang, K.; Walz, A.-C.; Robinson, M.; Gavaghan, D.
Variability is an intrinsic property of biological systems and is often at the heart of their complex behaviour. Examples range from cell-to-cell variability in cell signalling pathways to variability in the response to treatment across patients. A popular approach to model and understand this variability is nonlinear mixed effects (NLME) modelling. However, estimating the parameters of NLME models from measurements quickly becomes computationally expensive as the number of measured individuals grows, making NLME inference intractable for datasets with thousands of measured individuals. This shortcoming is particularly limiting for snapshot datasets, common, e.g., in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. We extend earlier work by Hasenauer et al. (2011) to introduce a novel approach for the estimation of NLME model parameters from snapshot measurements, which we call filter inference. Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inference from snapshot measurements possible. Filter inference also scales well with the number of model parameters, using state-of-the-art gradient-based MCMC algorithms such as the No-U-Turn Sampler (NUTS). We demonstrate the properties of filter inference using examples from early cancer growth modelling and from epidermal growth factor signalling pathway modelling.
Author summary: Nonlinear mixed effects (NLME) models are widely used to model differences between individuals in a population. In pharmacology, for example, they are used to model the treatment response variability across patients, and in cell biology they are used to model the cell-to-cell variability in cell signalling pathways. However, NLME models introduce parameters, which typically need to be estimated from data.
This estimation becomes computationally intractable when the number of measured individuals - be they patients or cells - is too large. But the more individuals are measured in a population, the better the variability can be understood. This is especially true when individuals are measured only once. Such snapshot measurements are particularly common in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. In clinical pharmacology, datasets consisting of many snapshot measurements are less common, but are easier and cheaper to obtain than detailed time series measurements across patients. Our approach can be used to estimate the parameters of NLME models from snapshot time series data with thousands of measured individuals.
Park, S.; Kar, N.; Cheong, J.-H.; Hwang, T. H.
Accurate identification of pathways associated with cancer phenotypes (e.g., cancer sub-types and treatment outcome) could lead to discovering reliable prognostic and/or predictive biomarkers for better patient stratification and treatment guidance. In our previous work, we have shown that non-negative matrix tri-factorization (NMTF) can be successfully applied to identify pathways associated with specific cancer types or disease classes as a prognostic and predictive biomarker. However, one key limitation of non-negative factorization methods, including various non-negative bi-factorization methods, is that they cannot handle input data containing negative values. For example, many molecular data consist of real values that are both positive and negative (e.g., normalized/log-transformed gene expression data, where a negative value represents down-regulated expression of a gene) and are therefore not suitable input for these algorithms. In addition, most previous methods provide just a single point estimate and hence cannot deal with uncertainty effectively.
To address these limitations, we propose a Bayesian semi-nonnegative matrix tri-factorization method to identify pathways associated with cancer phenotypes from a real-valued input matrix, e.g., gene expression values. Motivated by semi-nonnegative factorization, we allow one of the factor matrices, the centroid matrix, to be real-valued so that each centroid can express either the up- or down-regulation of the member genes in a pathway. In addition, we place structured spike-and-slab priors (which are encoded with the pathways and a gene-gene interaction (GGI) network) on the centroid matrix so that even a set of genes that is not initially contained in the pathways (due to the incompleteness of the current pathway database) can be involved in the factorization in a stochastic way; specifically, if those genes are connected to the member genes of the pathways on the GGI network.
We also present update rules for the posterior distributions in the framework of variational inference. As a fully Bayesian method, our proposed method has several advantages over current NMTF methods, which we demonstrate using synthetic datasets in experiments. Using The Cancer Genome Atlas (TCGA) gastric cancer and metastatic gastric cancer immunotherapy clinical-trial datasets, we show that our method can identify biologically and clinically relevant pathways associated with the molecular sub-types and immunotherapy response, respectively. Finally, we show that the pathways identified by the proposed method can be used as prognostic biomarkers to stratify patients with distinct survival outcomes in two independent validation datasets. Additional information and code can be found at https://github.com/parks-cs-ccf/BayesianSNMTF.
Deek, R. A.; Li, H.
Longitudinal microbiome studies, in which data on a single subject are collected repeatedly over time, are becoming increasingly common in biomedical research. Such studies provide an opportunity to study the inherently dynamic nature of a microbiome in a way that cannot be done using cross-sectional studies. In this paper, we develop random-effects copula models with mixed zero-beta margins to identify biologically meaningful, temporally conserved co-variation between two bacterial taxa, while accounting for the excessive zeros seen in 16S rRNA and metagenomic sequencing data. The model assumes a random-effects model for the dependence parameter in the copulas, which captures the conserved microbial co-variation while allowing for time-specific dependence parameters. We develop a Monte Carlo EM algorithm for efficient estimation of model parameters and a corresponding Monte Carlo likelihood ratio test for the mean dependence parameter. Simulation studies show that our test controls the Type I error rate and provides an unbiased estimate of the mean dependence parameter. Additionally, we apply our method to a longitudinal pediatric cohort and identify changes in both local and global patterns of microbial co-variation networks in infants treated with antibiotics. Our analysis shows that the no-antibiotics network is less dependent on individual taxa, making it more stable than the antibiotics network and more robust to both targeted and random attacks.
Author summary: Identification of co-variation between two microbes in microbial communities provides important insights into community structure and stability. The commonly used measures of co-variation do not handle the excessive zeros observed in the data and cannot be applied to longitudinal microbiome data directly.
In this paper, we develop random-effects copula models with mixed zero-beta margins to identify biologically meaningful, temporally conserved co-variation between two bacterial taxa, while accounting for the excessive zeros seen in 16S rRNA and metagenomic sequencing data. The model captures the conserved microbial co-variation while allowing for time-specific dependence parameters. We develop an efficient Monte Carlo-based algorithm for parameter estimation and statistical inference. We analyze data from a longitudinal pediatric cohort and identify changes in both local and global patterns of microbial co-variation networks in infants treated with antibiotics.
Miao, J.; Song, G.; Wu, Y.; Hu, J.; Wu, Y.; Basu, S.; Andrews, J. S.; Schaumberg, K.; Fletcher, J. M.; Schmitz, L. L.; Lu, Q.
In this study, we introduce PIGEON, a novel statistical framework for quantifying and estimating polygenic gene-environment interaction (GxE) using a variance component analytical approach. Based on PIGEON, we outline the main objectives in GxE studies, demonstrate the flaws in existing GxE approaches, and introduce an innovative estimation procedure which only requires summary statistics as input. We demonstrate the statistical superiority of PIGEON through extensive theoretical and empirical analyses and showcase its performance in multiple analytic settings, including a quasi-experimental GxE study of health outcomes, gene-by-sex interaction for 530 traits, and gene-by-treatment interaction in a randomized clinical trial. Our results show that PIGEON provides an innovative solution to many long-standing challenges in GxE inference and may fundamentally reshape analytical strategies in future GxE studies.
Zhang, Z.; Lawless, J. F.; Paterson, A. D.; Sun, L.
In genome-wide association studies (GWAS), it is desirable to test for interactions (GxE) between single-nucleotide polymorphisms (SNPs, Gs) and environmental variables (Es). However, directly accounting for interaction is often infeasible because E is latent. For quantitative traits (Y) that are approximately normally distributed, it has been shown that indirect testing of GxE can be done by testing for heteroskedasticity of Y between genotypes. However, when traits are binary, the existing methodology based on testing the heteroskedasticity of the trait across genotypes cannot be generalized. In this paper, we propose an approach to indirectly test GxE for binary traits based on the non-additive effect of G, and subsequently propose a joint test that accounts for the main and interaction effects of each SNP during GWAS. We illustrate the statistical features, including type I error control and power, of the proposed method through extensive numerical studies. Applying our method to the UK Biobank dataset, we showcase its practical utility, revealing SNPs and genes with strong potential for latent interaction effects.
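For the quantitative-trait case referenced above, the heteroskedasticity screen is commonly a Levene/Brown-Forsythe-type statistic across genotype groups. A minimal sketch of that standard statistic (shown for contrast with the binary-trait setting; not the paper's proposed test):

```python
import numpy as np

def brown_forsythe_stat(y, genotype):
    """Brown-Forsythe (median-centred Levene) F statistic for trait-variance
    differences across genotype groups (0/1/2 copies of the minor allele).

    Computes a one-way ANOVA F on absolute deviations from each group's
    median; large values indicate variance heterogeneity, the indirect
    GxE signal for quantitative traits.
    """
    groups = [y[genotype == g] for g in np.unique(genotype)]
    z = [np.abs(gr - np.median(gr)) for gr in groups]
    k = len(z)
    n = sum(len(zi) for zi in z)
    zbar = np.concatenate(z).mean()
    between = sum(len(zi) * (zi.mean() - zbar) ** 2 for zi in z) / (k - 1)
    within = sum(((zi - zi.mean()) ** 2).sum() for zi in z) / (n - k)
    return between / within  # ~ F(k-1, n-k) under variance homogeneity
```

Median centring makes the statistic robust to non-normal traits, which matters since the screen is meant to detect variance differences rather than mean shifts.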
Su, Y.; Clark, M. E. J. Z.; Wang, C.
Show abstract
A major goal of many omics studies is to identify differential features, e.g. differentially expressed genes, between experimental groups. When performing differential analysis for a given dataset, relevant information from another platform or species is often available. Incorporating such prior information can help identify features that show consistent differential patterns across platforms or species, which are more likely to reflect shared biological processes, and thereby enhance the robustness and generalizability of the findings. However, existing differential analysis methods typically analyze only the data from the current study and do not leverage prior knowledge about the magnitude or direction of changes from other platforms or species. We address this challenge, and the associated multiple testing problem, using a Bayesian framework that enables the incorporation of prior knowledge obtained from different platforms or species. We propose a new test statistic, the Bayesian Credible Ratio (BCR), based on a heteroscedastic global-local shrinkage prior, and a new multiple testing criterion, the sign-adjusted FDR (SFDR), both of which emphasize information about the direction of the differential features. We prove that BCR achieves the largest count of sign-based true positives among all legitimate SFDR-controlling methods. Simulation results offer numerical evidence of its advantage over an empirical Bayes method. The approach is demonstrated through the analysis of RNA-seq and single-cell RNA-seq datasets.
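The sign-aware flavor of this framework can be sketched with a toy normal-normal empirical-Bayes calculation: rank features by the posterior probability that their sign agrees with the direction reported on the other platform. This is not the authors' BCR statistic or SFDR procedure, and the prior parameters below are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m = 1_000
# 10% of features are truly differential, all positive on the prior platform
signal = rng.random(m) < 0.1
effect = np.where(signal, rng.normal(2.0, 0.5, m), 0.0)
z = effect + rng.standard_normal(m)   # observed effect estimates, SE = 1

# Normal-normal posterior with an assumed prior N(0, tau2) on the effects
tau2 = 1.0
post_mean = z * tau2 / (tau2 + 1.0)
post_sd = np.sqrt(tau2 / (tau2 + 1.0))

# Posterior probability each effect is positive (the prior-platform direction)
p_pos = norm.sf(0.0, loc=post_mean, scale=post_sd)

# Sign-aware selection: keep features whose sign confidently matches
hits = p_pos > 0.95
```

Under this simulation, selected features are dominated by true positives whose direction matches the prior platform, which is the property a sign-adjusted error criterion is designed to reward.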
Kornilov, S. A.
Show abstract
Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through four diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h2 is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h2 is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only r_MZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar et al.'s 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.
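The variance-absorption mechanism has a simple arithmetic core. Assuming (our simplification, not the paper's calibrated model) that total log-frailty is u = θ + γ with intrinsic scale σ_θ, omitted extrinsic scale σ_γ, and intrinsic-extrinsic correlation ρ, a one-component fit that matches the total latent variance must set its single scale to Var(u), absorbing both the extrinsic variance and the cross term. The numbers below are illustrative values chosen from the ranges the abstract explores:

```python
import math

# Assumed toy parameters (within the abstract's explored ranges)
sigma_theta = 0.50   # true intrinsic log-frailty scale
sigma_gamma = 0.45   # omitted extrinsic scale, in [0.30, 0.65]
rho = 0.35           # intrinsic-extrinsic correlation, in [0.20, 0.50]

def fitted_scale(s_t, s_g, r):
    """One-component fit absorbing Var(gamma) and the cross term into theta:
    Var(u) = s_t**2 + s_g**2 + 2*r*s_t*s_g."""
    return math.sqrt(s_t**2 + s_g**2 + 2 * r * s_t * s_g)

inflated = fitted_scale(sigma_theta, sigma_gamma, rho)
inflation_pct = 100 * (inflated / sigma_theta - 1)   # upward bias when rho > 0

# Sufficiently negative rho makes the absorbed terms net negative,
# reversing the sign of the bias (cf. Corollary B4)
deflated = fitted_scale(sigma_theta, sigma_gamma, -0.6)
```

For positive ρ the fitted scale always exceeds the true σ_θ, and any downstream quantity monotone in it (such as a Falconer-style heritability) inherits the inflation; for ρ below -σ_γ/(2σ_θ) the bias flips sign.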