Biometrics
Oxford University Press (OUP)
All preprints, ranked by how well they match Biometrics's content profile, based on 22 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Desmet, L.; Venet, D.; Trotta, L.; Burzykowski, T.; Buyse, M.
Multivariate datasets with a clustered structure are the natural framework for, e.g., multicentre clinical trials. We propose a number of methods aimed at detecting clusters with outlying correlation coefficients. While the methods can be used in a variety of settings, we focus mainly on their application to central statistical monitoring of clinical trials. In particular, we consider the issue of detecting centers (or other clusters of patients, such as regions) with outlying correlation coefficients for bivariate data in a multicenter clinical trial. It appears that, in that context, the proposed methods perform well, as we show using a simulation study and a number of real-life datasets.
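As an illustration of this kind of center-level screening (a hypothetical sketch, not the authors' exact method), one can Fisher-transform each center's correlation coefficient and flag centers that deviate strongly from a robust pooled value; the function name and threshold are illustrative choices:

```python
import numpy as np

def flag_outlying_centers(data_by_center, z_thresh=3.0):
    """Flag centers whose bivariate correlation deviates from the pooled trend.

    data_by_center: dict mapping center id -> (n_i, 2) array of bivariate data.
    Fisher's z-transform makes sample correlations roughly normal with variance
    1/(n_i - 3); centers deviating from the median z by more than z_thresh
    standard errors are flagged as outlying.
    """
    ids, zs, ws = [], [], []
    for cid, xy in data_by_center.items():
        xy = np.asarray(xy, dtype=float)
        if len(xy) < 4:
            continue  # too few patients to estimate a correlation
        r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
        ids.append(cid)
        zs.append(np.arctanh(r))   # Fisher z-transform of the correlation
        ws.append(len(xy) - 3)     # inverse of the variance of z
    zs, ws = np.array(zs), np.array(ws)
    scores = (zs - np.median(zs)) * np.sqrt(ws)  # standardized deviations
    return [cid for cid, s in zip(ids, scores) if abs(s) > z_thresh]
```

The median is used as the pooled reference so that a single aberrant center does not mask itself by shifting the pooled estimate.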
Allorant, A.; Fullman, N.; Leslie, H. H.; Eliakimu, E.; Wakefield, J.; Dieleman, J. L.; Pigott, D.; Puttkammer, N.; Reiner, R. C.
Monitoring healthcare quality at a subnational resolution is key to identifying and resolving geographic inequities and ensuring that no sub-population is left behind. Yet, health facility surveys are typically not powered to report reliable estimates at a subnational scale. In this study, we present a framework to fill this gap and jointly analyse publicly available facility survey data, allowing exploration of temporal trends and subnational disparities in healthcare quality metrics. Specifically, our Bayesian hierarchical model includes random effects to account for differences between survey instruments; space-time processes to leverage correlations in space and time; and covariates to incorporate auxiliary information. We apply this framework to Kenya, Senegal, and Tanzania - three countries with at least four rounds of standardized facility surveys each - and estimate the readiness and process quality of sick-child care over time and across subnational areas. These estimates of readiness and process quality of care over time and at a fine spatial resolution show uneven progress in improving facility-based service provision in Kenya, Senegal, and Tanzania. For instance, while overall readiness of care improved at the national level in Tanzania, geographic inequities persisted; in contrast, Senegal and Kenya experienced stagnation in overall readiness at the national level, but disparities grew across subnational areas. Overall, providers adhered to about one-third of the clinical guidelines for managing sick-child illnesses at the national level. Yet across subnational units, such adherence varied greatly (e.g., 25% to 85% between counties of Kenya in 2020). Our new approach enables precise estimation of changes in the spatial distribution of healthcare quality metrics over time, at a programmatic spatial resolution, and with accompanying uncertainty estimates.
Use of our framework will provide new insights at a policy-relevant spatial resolution for national and regional decision-makers and international funders.
Park, S. W.; Sun, K.; Champredon, D.; Li, M.; Bolker, B. M.; Earn, D. J. D.; Weitz, J. S.; Grenfell, B. T.; Dushoff, J.
Generation intervals and serial intervals are critical quantities for characterizing outbreak dynamics. Generation intervals characterize the time between infection and transmission, while serial intervals characterize the time between the onset of symptoms in a chain of transmission. They are often used interchangeably, leading to misunderstanding of how these intervals link the epidemic growth rate r and the reproduction number R. Generation intervals provide a mechanistic link between r and R but are harder to measure via contact tracing. While serial intervals are easier to measure from contact tracing, recent studies suggest that the two intervals give different estimates of R from r. We present a general framework for characterizing epidemiological delays based on cohorts (i.e., a group of individuals that share the same event time, such as symptom onset) and show that forward-looking serial intervals, which correctly link R with r, are not the same as "intrinsic" serial intervals, but instead change with r. We provide a heuristic method for addressing potential biases that can arise from not accounting for changes in serial intervals across cohorts and apply the method to estimating R for the COVID-19 outbreak in China using serial-interval data; our analysis shows that using incorrectly defined serial intervals can severely bias estimates. This study demonstrates the importance of early epidemiological investigation through contact tracing and provides a rationale for reassessing generation intervals, serial intervals, and R estimates for COVID-19.
Significance Statement: The generation- and serial-interval distributions are key, but different, quantities in outbreak analyses. Recent theoretical studies suggest that the two distributions give different estimates of the reproduction number R from the exponential growth rate r; however, both intervals, by definition, describe disease transmission at the individual level.
Here, we show that the serial-interval distribution, defined from the correct reference time and cohort, gives the same estimate of R as the generation-interval distribution. We then apply our framework to serial-interval data from the COVID-19 outbreak in China. While our study supports the use of serial-interval distributions in estimating R, it also reveals necessary changes to the current understanding and applications of serial-interval distributions.
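The mechanistic r-to-R link that interval distributions provide is the Euler-Lotka relation, R = 1 / Σ_τ w(τ) e^{-rτ}, with w the interval distribution. A minimal discretized sketch of that standard relation (our illustration, not the paper's cohort-based framework):

```python
import numpy as np

def reproduction_number(r, interval_pmf):
    """R from the exponential growth rate r via the discretized Euler-Lotka
    equation, R = 1 / sum_tau w(tau) * exp(-r * tau), where w is the
    interval distribution over tau = 1, 2, ... time steps."""
    w = np.asarray(interval_pmf, dtype=float)
    taus = np.arange(1, len(w) + 1)
    return float(1.0 / np.sum(w * np.exp(-r * taus)))
```

At r = 0 this gives R = 1 regardless of w; for a growing epidemic (r > 0), longer intervals imply a larger R, which is why plugging a misdefined serial-interval distribution into this relation distorts R estimates.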
Jiang, C.; Fang, F.; Talbot, D.; Schnitzer, M.
The Test-Negative Design (TND), which involves recruiting care-seeking individuals who meet predefined clinical case criteria, offers valid statistical inference for Vaccine Effectiveness (VE) using data collected through passive surveillance, making it cost-efficient and timely. Infectious disease epidemiology often involves interference, where the treatment and/or outcome of one individual can affect the outcomes of others, rendering standard causal estimands ill-defined; ignoring such interference can bias VE evaluation and lead to ineffective vaccination policies. This article addresses the estimation of causal estimands for VE in the presence of partial interference using TND samples. Partial interference means that the vaccination of units within the same group/cluster may influence the outcomes of other members of the cluster. We define population direct, spillover, total, and overall effects using the geometric risk ratio, all of which are identifiable under TND sampling. We investigate various stochastic policies for vaccine allocation in a counterfactual scenario and identify policy-relevant VE causal estimands. We propose inverse-probability weighted (IPW) estimators for the policy-relevant VE causal estimands with partial interference under the TND, and explore the statistical properties of these estimators.
Yang, P.; Hubert, S. M.; Futreal, P. A.; Song, X.; Zhang, J.; Lee, J. J.; Wistuba, I.; Yuan, Y.; Zhang, J.; Li, Z.
Intratumor heterogeneity (ITH) of tumor-infiltrated leukocytes (TILs) is an important phenomenon of cancer biology with potentially profound clinical impacts. Multiregion gene expression sequencing data provide a promising opportunity that allows for explorations of TILs and their intratumor heterogeneity for each subject. Although several existing methods are available to infer the proportions of TILs, considerable methodological gaps exist for evaluating intratumor heterogeneity of TILs with multi-region gene expression data. Here, we develop ICeITH, immune cell estimation reveals intratumor heterogeneity, a Bayesian hierarchical model that borrows cell type profiles as prior knowledge to decompose mixed bulk data while accounting for the within-subject correlations among tumor samples. ICeITH quantifies intratumor heterogeneity by the variability of targeted cellular compositions. Through extensive simulation studies, we demonstrate that ICeITH is more accurate in measuring relative cellular abundance and evaluating intratumor heterogeneity compared with existing methods. We also assess the ability of ICeITH to stratify patients by their intratumor heterogeneity score and associate the estimations with the survival outcomes. Finally, we apply ICeITH to two multi-region gene expression datasets from lung cancer studies to classify patients into different risk groups according to the ITH estimations of targeted TILs that shape either pro- or anti-tumor processes. In conclusion, ICeITH is a useful tool to evaluate intratumor heterogeneity of TILs from multi-region gene expression data.
Duan, Y.; Guo, S.; Yan, H.; Wang, W.; Mueller, P.
We propose spatially aligned random partition (SARP) models for clustering multiple types of experimental units, incorporating dependence in a subvector of the cluster-specific parameters, e.g., a subvector of spatial information, as in the motivating application. The approach is developed for inference about co-localization of immune, stromal, and tumor cell sub-populations. The aim is to understand the recruitment of immune and stromal cell subtypes by tumor cells, formalized as spatial dependence of the corresponding homogeneous cell subpopulations. This is achieved by constructing Bayesian nonparametric random partition models for the different types of cells, with a hierarchically structured prior introducing the desired dependence. Specifically, we use Pitman-Yor priors and add dependence in the base measure for spatial features, while leaving the base measure corresponding to gene expression features a priori independent across different types of cells. Details of the model construction are designed to lead to a convenient MCMC algorithm for posterior inference. Simulation studies show favorable performance in identifying co-localization between types of cells. We apply the proposed approach to colorectal cancer (CRC) data and discover subtypes of immune and stromal cells that are spatially aligned with specific tumor regions.
Acharyya, S.; Kang, J.; Baladandayuthapani, V.
Modern spatial transcriptomic profiling techniques facilitate spatially resolved, high-dimensional assessment of cellular gene transcription across the tumor domain. The characterization of spatially varying gene networks enables the discovery of heterogeneous regulatory patterns and biological mechanisms underlying cancer etiology. We propose a spatial Graphical Regression (sGR) model to infer spatially varying graphs for high-resolution multivariate spatial data. Unlike existing graphical models, sGR explicitly incorporates spatial information to infer non-linear conditional dependencies through Gaussian processes. It conducts sparse estimation and selection of spatially varying edges, at both spatial and sub-spatial levels. Extensive simulation studies illustrate the effectiveness of sGR for spatial graph structure recovery and estimation accuracy. Our methods are motivated by and applied to two spatial transcriptomics data sets in breast and prostate cancer, to investigate spatially varying gene connectivity patterns across the tumor micro-environment. Our findings reveal several novel spatial interactions between genes related to immune activation and carcinogenesis regulation, such as CD19 in breast cancer and the ARHGAP family in prostate cancer. We also provide a modular software package for fitting and visualization of spatially varying graphs.
Joosten, R.; Abhishta, A.
We design a procedure (the complete Python code may be obtained at https://github.com/abhishta91/antibody_montecarlo) using Monte Carlo (MC) simulation to establish the point estimators described below and confidence intervals for the base rate of occurrence of an attribute (e.g., antibodies against Covid-19) in an aggregate population (e.g., medical care workers) based on a test. The requirements for the procedure are the test's sample size (N), the total number of positives (X), and data on the test's reliability. The modus is the prior which generates the largest frequency of observations in the MC simulation with precisely the number of test positives (maximum-likelihood estimator). The median is the upper bound of the set of priors accounting for half of the total relevant observations in the MC simulation with numbers of positives identical to the test's number of positives.
Our rather preliminary findings are:
- The median and the confidence intervals suffice universally.
- The estimator [Formula] may be outside of the two-sided 95% confidence interval.
- Conditions such that the modus, the median, and another promising estimator which takes the reliability of the test into account are quite close.
- Conditions such that the modus and the latter estimator must be regarded as logically inconsistent.
- Conditions inducing rankings among various estimators relevant for issues concerning over- or underestimation.
JEL codes: C11, C13, C63
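A compressed sketch of such a Monte Carlo procedure (our illustration; the authors' full implementation is at the linked repository, and the function name and grid here are our own choices): for each candidate base rate, simulate test campaigns with the given sensitivity and specificity and count how often exactly X positives occur.

```python
import numpy as np

def base_rate_estimators(N, X, sensitivity, specificity,
                         priors=None, sims=2000, seed=0):
    """Monte Carlo sketch of the modus and median estimators described above.

    For each candidate base rate p, simulate `sims` campaigns of N subjects
    and count how often exactly X test positives occur. The modus is the p
    with the highest count; the median splits the total relevant counts.
    """
    rng = np.random.default_rng(seed)
    if priors is None:
        priors = np.linspace(0.0, 1.0, 101)
    counts = np.zeros(len(priors))
    for i, p in enumerate(priors):
        infected = rng.binomial(N, p, size=sims)
        # positives = true positives among infected + false positives among the rest
        tp = rng.binomial(infected, sensitivity)
        fp = rng.binomial(N - infected, 1.0 - specificity)
        counts[i] = np.sum(tp + fp == X)
    modus = priors[np.argmax(counts)]
    cum = np.cumsum(counts)
    median = priors[np.searchsorted(cum, cum[-1] / 2.0)]
    return modus, median, counts
```

With N = 1000, X = 100, and a 95%/95% test, expected positives are 0.05 + 0.9p of N, so both estimators land near p ≈ 0.056 rather than the naive X/N = 0.10.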
Zhang, L.; Sun, L.
In a case-control association study, deviation from Hardy-Weinberg equilibrium (HWE), or Hardy-Weinberg disequilibrium (HWD), in the control group is usually considered as evidence for potential genotyping error, and the corresponding SNP is then removed from the study. On the other hand, assuming HWE holds in the study population, a truly associated SNP is expected to be out of HWE in the case group. Efforts have been made to combine association tests with tests of HWE in the cases to increase the power of detecting disease susceptibility loci (Song and Elston (2006), Wang and Shete (2010)). However, these existing methods are ad hoc and sensitive to model assumptions. Utilizing the recent robust allele-based (RA) regression model for conducting allelic association tests (Zhang and Sun (2020)), here we propose a joint RA test that naturally integrates association evidence from the traditional association test and a test that evaluates the difference in HWD between the case and control groups. The proposed test is robust to genotyping error, as well as to potential HWD in the population attributed to factors that are unrelated to the phenotype-genotype association. We provide the asymptotic distribution of the proposed test statistic so that it is easy to implement, and we demonstrate the accuracy and efficiency of the test through extensive simulation studies and an application.
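For reference, the basic building block such joint tests extend is the standard one-degree-of-freedom goodness-of-fit test of HWE from genotype counts (the textbook statistic, not the proposed RA joint test):

```python
import numpy as np

def hwe_chisq(n_AA, n_Aa, n_aa):
    """One-df chi-square goodness-of-fit statistic for Hardy-Weinberg
    equilibrium from genotype counts in a single group."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # sample frequency of allele A
    expected = np.array([p * p, 2 * p * (1 - p), (1 - p) ** 2]) * n
    observed = np.array([n_AA, n_Aa, n_aa])
    return float(np.sum((observed - expected) ** 2 / expected))
```

Applied separately to cases and controls, a large statistic in controls suggests genotyping error, while a case-only deviation is the association signal the joint test exploits.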
Ma, C.; Kingsford, C.
Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation of mutual information when the sample size is not the limiting factor for estimating the PMF or PDF accurately. We compare the uncorrected and corrected estimators on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators.
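The uncorrected plug-in step that the corrected estimator builds on can be sketched as follows; the paper's contribution is to replace the observed PMF with one deconvolved for measurement error, whereas this sketch shows only the standard uncorrected estimator:

```python
import numpy as np

def mutual_information(joint_pmf):
    """Plug-in mutual information (in nats) from a joint PMF matrix.

    I(X;Y) = sum_xy p(x,y) * log(p(x,y) / (p(x) p(y))), with zero cells
    skipped since lim p->0 of p log p is 0.
    """
    p = np.asarray(joint_pmf, dtype=float)
    p = p / p.sum()                        # normalize defensively
    px = p.sum(axis=1, keepdims=True)      # marginal of X (column vector)
    py = p.sum(axis=0, keepdims=True)      # marginal of Y (row vector)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))
```

When the PMF is estimated from measurements blurred by error, this plug-in value is biased, which is the problem the corrected estimator addresses.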
Wang, R.; Fang, L.; Wang, Y.; Jin, J.
Leveraging observational data to understand the associations between risk factors and disease outcomes and conduct disease risk prediction is a common task in epidemiology. While traditional linear regression and other machine learning models have been extensively implemented for this task, the associations between risk factors and disease outcomes are typically deemed fixed. In many cases, however, such associations may vary by some underlying features of the individuals, which may involve certain subpopulation characteristics and environmental factors. While data for these latent features may not be available, the observed data on risk factors may have captured some proportion of the variation in these features. Thus extracting latent factors from risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model with some coefficients varying as functions of latent features extracted from the risk factors. We have demonstrated the superiority of our approach in various data settings via simulation studies. An application on a dataset for lung cancer patients from The Cancer Genome Atlas (TCGA) Program showed that our approach led to a 6% - 118% increase in (AUC-0.5) for distinguishing between different lung cancer stages compared to the classic lasso and elastic net regressions and identified interesting latent effect modifications associated with certain gene pathways.
Feng, S.; Bilinski, A.
Researchers frequently employ difference-in-differences (DiD) to study the impact of public health interventions on infectious disease outcomes. DiD assumes that treatment and non-experimental comparison groups would have moved in parallel in expectation, absent the intervention (the "parallel trends assumption"). However, the plausibility of the parallel trends assumption in the context of infectious disease transmission is not well understood. Our work bridges this gap by formalizing the epidemiological assumptions required for common DiD specifications, positing an underlying Susceptible-Infectious-Recovered (SIR) data-generating process. We demonstrate that popular specifications can encode strict epidemiological assumptions. For example, DiD modeling incident case numbers or rates as outcomes will produce biased treatment effect estimates unless untreated potential outcomes for treatment and comparison groups come from a data-generating process with the same initial infection rate and equal transmission rates at each time step. Applying a log transformation or modeling log growth allows for different initial infection rates under an "infinite susceptible population" assumption, but invokes conditions on transmission parameters. We then propose alternative DiD specifications based on epidemiological parameters - the effective reproduction number and the effective contact rate - that are both more robust to differences between treatment and comparison groups and can be extended to complex transmission dynamics. Because the incidence and log incidence models show minimal power differences, we recommend the more robust log specification as a default. Our alternative specifications have lower power than incidence or log incidence models, but higher power than log growth models. We illustrate the implications of our work by re-analyzing published studies of COVID-19 mask policies.
Significance Statement: Difference-in-differences is a popular observational study design for policy evaluation.
However, it may not perform well when modeling infectious disease outcomes. Although many COVID-19 DiD studies in the medical literature have used incident case numbers or rates as the outcome variable, we demonstrate that this and other common model specifications may encode strict epidemiological assumptions as a result of non-linear infectious disease transmission. We unpack the assumptions embedded in popular DiD specifications assuming a Susceptible-Infected-Recovered data-generating process and propose more robust alternatives, modeling the effective reproduction number and effective contact rate.
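The role of the infinite-susceptible-population assumption can be seen in a minimal discrete-time SIR simulation (our illustration, with made-up parameters): two untreated groups that differ only in initial infections have non-parallel incidence levels, but near-parallel log incidence while susceptible depletion is negligible.

```python
import numpy as np

def sir_incidence(beta, gamma, i0, steps):
    """Discrete-time SIR; returns per-step incident infections as fractions
    of the population (susceptible depletion included)."""
    s, i = 1.0 - i0, i0
    incidence = []
    for _ in range(steps):
        new = beta * s * i        # incident infections this step
        s -= new
        i += new - gamma * i      # recoveries leave the infectious pool
        incidence.append(new)
    return np.array(incidence)
```

With beta = 0.5, gamma = 0.2, and initial prevalences of 1e-5 versus 1e-4, the level gap widens each step while the log-incidence gap stays near log(10) during the early exponential phase, matching the abstract's point that a levels DiD requires equal initial infections but a log specification does not.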
Zhang, J.; Gonzales, S.; Liu, J.; Gao, X. R.; Wang, X.
Gene-based analyses offer a useful alternative and complement to the usual single nucleotide polymorphism (SNP) based analysis for genome-wide association studies (GWASs). Using appropriate weights (pre-specified or eQTL-derived) can boost statistical power, especially for detecting weak associations between a gene and a trait. Because the sparsity level or association directions of the underlying association patterns in real data are often unknown and access to individual-level data is limited, we propose an optimal weighted combination (OWC) test applicable to summary statistics from GWAS. This method includes burden tests, the weighted sum of squared score (SSU) test, the weighted sum statistic (WSS), and the score test as special cases. We analytically prove that aggregating the variants in one gene is equivalent to using a weighted combination of Z-scores for each variant based on the score test method. We also numerically illustrate that our proposed test outperforms several existing comparable methods via simulation studies. Lastly, we use schizophrenia GWAS data and a fasting glucose GWAS meta-analysis dataset to demonstrate that our method outperforms existing methods in real data analyses. Our proposed test is implemented in the R program OWC, which is freely and publicly available.
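The burden-style special case of such weighted combinations can be sketched directly from summary statistics; the weights w and LD matrix R below are illustrative inputs, not the OWC optimal weights:

```python
import numpy as np

def weighted_z_stat(z, w, ld):
    """Burden-style test from summary statistics: T = w'z / sqrt(w'Rw),
    standard normal under the null when R is the correlation (LD) matrix
    of the per-variant Z-scores."""
    z, w, ld = (np.asarray(a, dtype=float) for a in (z, w, ld))
    return float(w @ z / np.sqrt(w @ ld @ w))
```

The denominator accounts for correlation between variants, so the statistic remains calibrated when variants in the gene are in LD.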
Augustin, D.; Lambert, B.; Wang, K.; Walz, A.-C.; Robinson, M.; Gavaghan, D.
Variability is an intrinsic property of biological systems and is often at the heart of their complex behaviour. Examples range from cell-to-cell variability in cell signalling pathways to variability in the response to treatment across patients. A popular approach to model and understand this variability is nonlinear mixed effects (NLME) modelling. However, estimating the parameters of NLME models from measurements quickly becomes computationally expensive as the number of measured individuals grows, making NLME inference intractable for datasets with thousands of measured individuals. This shortcoming is particularly limiting for snapshot datasets, common, e.g., in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. We extend earlier work by Hasenauer et al. (2011) to introduce a novel approach for the estimation of NLME model parameters from snapshot measurements, which we call filter inference. Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inference from snapshot measurements possible. Filter inference also scales well with the number of model parameters, using state-of-the-art gradient-based MCMC algorithms such as the No-U-Turn Sampler (NUTS). We demonstrate the properties of filter inference using examples from early cancer growth modelling and from epidermal growth factor signalling pathway modelling.
Author summary: Nonlinear mixed effects (NLME) models are widely used to model differences between individuals in a population. In pharmacology, for example, they are used to model the treatment response variability across patients, and in cell biology they are used to model the cell-to-cell variability in cell signalling pathways. However, NLME models introduce parameters, which typically need to be estimated from data.
This estimation becomes computationally intractable when the number of measured individuals - be they patients or cells - is too large. But the more individuals are measured in a population, the better the variability can be understood. This is especially true when individuals are measured only once. Such snapshot measurements are particularly common in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. In clinical pharmacology, datasets consisting of many snapshot measurements are less common, but are easier and cheaper to obtain than detailed time series measurements across patients. Our approach can be used to estimate the parameters of NLME models from snapshot time series data with thousands of measured individuals.
Park, S.; Kar, N.; Cheong, J.-H.; Hwang, T. H.
Accurate identification of pathways associated with cancer phenotypes (e.g., cancer sub-types and treatment outcome) could lead to discovering reliable prognostic and/or predictive biomarkers for better patient stratification and treatment guidance. In our previous work, we have shown that non-negative matrix tri-factorization (NMTF) can be successfully applied to identify pathways associated with specific cancer types or disease classes as a prognostic and predictive biomarker. However, one key limitation of non-negative factorization methods, including various non-negative bi-factorization methods, is that they cannot handle input data containing negative values. For example, many molecular data consist of real values that are both positive and negative (e.g., normalized/log-transformed gene expression data, where a negative value represents down-regulated expression of a gene) and are therefore not suitable input for these algorithms. In addition, most previous methods provide just a single point estimate and hence cannot deal with uncertainty effectively.
To address these limitations, we propose a Bayesian semi-nonnegative matrix tri-factorization method to identify pathways associated with cancer phenotypes from a real-valued input matrix, e.g., gene expression values. Motivated by semi-nonnegative factorization, we allow one of the factor matrices, the centroid matrix, to be real-valued so that each centroid can express either the up- or down-regulation of the member genes in a pathway. In addition, we place structured spike-and-slab priors (which are encoded with the pathways and a gene-gene interaction (GGI) network) on the centroid matrix so that even a set of genes that is not initially contained in the pathways (due to the incompleteness of the current pathway database) can be involved in the factorization in a stochastic way; specifically, if those genes are connected to the member genes of the pathways on the GGI network.
We also present update rules for the posterior distributions in the framework of variational inference. As a fully Bayesian method, our proposed method has several advantages over current NMTF methods, which we demonstrate using synthetic datasets in experiments. Using The Cancer Genome Atlas (TCGA) gastric cancer and metastatic gastric cancer immunotherapy clinical-trial datasets, we show that our method can identify biologically and clinically relevant pathways associated with the molecular sub-types and immunotherapy response, respectively. Finally, we show that the pathways identified by the proposed method can be used as prognostic biomarkers to stratify patients with distinct survival outcomes in two independent validation datasets. Additional information and code can be found at https://github.com/parks-cs-ccf/BayesianSNMTF.
Deek, R. A.; Li, H.
Longitudinal microbiome studies, in which data on a single subject are collected repeatedly over time, are becoming increasingly common in biomedical research. Such studies provide an opportunity to study the inherently dynamic nature of a microbiome in a way that cannot be done using cross-sectional studies. In this paper, we develop random-effects copula models with mixed zero-beta margins to identify biologically meaningful, temporally conserved co-variation between two bacterial taxa, while accounting for the excessive zeros seen in 16S rRNA and metagenomic sequencing data. The model assumes a random-effects model for the dependence parameter in the copulas, which captures the conserved microbial co-variation while allowing for time-specific dependence parameters. We develop a Monte Carlo EM algorithm for efficient estimation of model parameters and a corresponding Monte Carlo likelihood ratio test for the mean dependence parameter. Simulation studies show that our test controls the Type I error rate and provides an unbiased estimate of the mean dependence parameter. Additionally, we apply our method to a longitudinal pediatric cohort and identify changes in both local and global patterns of microbial co-variation networks in infants treated with antibiotics. Our analysis shows that the no-antibiotics network is less dependent on individual taxa, making it more stable than the antibiotics network and more robust to both targeted and random attacks.
Author summary: Identification of co-variation between two microbes in microbial communities provides important insights into community structure and stability. The commonly used measures of co-variation do not handle the excessive zeros observed in the data and cannot be applied to longitudinal microbiome data directly.
In this paper, we develop random-effects copula models with mixed zero-beta margins to identify biologically meaningful, temporally conserved co-variation between two bacterial taxa, while accounting for the excessive zeros seen in 16S rRNA and metagenomic sequencing data. The model captures the conserved microbial co-variation while allowing for time-specific dependence parameters. We develop an efficient Monte Carlo-based algorithm for parameter estimation and statistical inference. We analyze data from a longitudinal pediatric cohort and identify changes in both local and global patterns of microbial co-variation networks in infants treated with antibiotics.
Miao, J.; Song, G.; Wu, Y.; Hu, J.; Wu, Y.; Basu, S.; Andrews, J. S.; Schaumberg, K.; Fletcher, J. M.; Schmitz, L. L.; Lu, Q.
In this study, we introduce PIGEON, a novel statistical framework for quantifying and estimating polygenic gene-environment interaction (GxE) using a variance component analytical approach. Based on PIGEON, we outline the main objectives in GxE studies, demonstrate the flaws in existing GxE approaches, and introduce an innovative estimation procedure which only requires summary statistics as input. We demonstrate the statistical superiority of PIGEON through extensive theoretical and empirical analyses and showcase its performance in multiple analytic settings, including a quasi-experimental GxE study of health outcomes, gene-by-sex interaction for 530 traits, and gene-by-treatment interaction in a randomized clinical trial. Our results show that PIGEON provides an innovative solution to many long-standing challenges in GxE inference and may fundamentally reshape analytical strategies in future GxE studies.
Zhang, Z.; Lawless, J. F.; Paterson, A. D.; Sun, L.
In genome-wide association studies (GWAS), it is desirable to test for interactions (GxE) between single-nucleotide polymorphisms (SNPs, Gs) and environmental variables (Es). However, directly accounting for interaction is often infeasible because E is latent. For quantitative traits (Y) that are approximately normally distributed, it has been shown that indirect testing of GxE can be done by testing for heteroskedasticity of Y between genotypes. However, when traits are binary, the existing methodology based on testing the heteroskedasticity of the trait across genotypes cannot be generalized. In this paper, we propose an approach to indirectly test GxE for binary traits based on the non-additive effect of G, and subsequently propose a joint test that accounts for the main and interaction effects of each SNP during GWAS. We illustrate the statistical features, including type I error control and power, of the proposed method through extensive numerical studies. Applying our method to the UK Biobank dataset, we showcase its practical utility, revealing SNPs and genes with strong potential for latent interaction effects.
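For the quantitative-trait case referenced above, the heteroskedasticity screen is commonly a Levene/Brown-Forsythe-type statistic across genotype groups. A minimal sketch of that standard statistic (shown for contrast with the binary-trait setting; not the paper's proposed test):

```python
import numpy as np

def brown_forsythe_stat(y, genotype):
    """Brown-Forsythe (median-centred Levene) F statistic for trait-variance
    differences across genotype groups (0/1/2 copies of the minor allele).

    Computes a one-way ANOVA F on absolute deviations from each group's
    median; large values indicate variance heterogeneity, the indirect
    GxE signal for quantitative traits.
    """
    groups = [y[genotype == g] for g in np.unique(genotype)]
    z = [np.abs(gr - np.median(gr)) for gr in groups]
    k = len(z)
    n = sum(len(zi) for zi in z)
    zbar = np.concatenate(z).mean()
    between = sum(len(zi) * (zi.mean() - zbar) ** 2 for zi in z) / (k - 1)
    within = sum(((zi - zi.mean()) ** 2).sum() for zi in z) / (n - k)
    return between / within  # ~ F(k-1, n-k) under variance homogeneity
```

Median centring makes the statistic robust to non-normal traits, which matters since the screen is meant to detect variance differences rather than mean shifts.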
Su, Y.; Clark, M. E. J. Z.; Wang, C.
Show abstract
A major goal of many omics studies is to identify differential features, e.g. differentially expressed genes, between experimental groups. When performing differential analysis for a given dataset, relevant information from another platform or species is often available. Incorporating such prior information can help identify features that show consistent differential patterns across platforms or species, which are more likely to reflect shared biological processes, and thereby enhance the robustness and generalizability of the findings. However, existing differential analysis methods typically analyze only the data from the current study and do not leverage prior knowledge about the magnitude or direction of changes from other platforms or species. We address this challenge, and the associated multiple testing problem, using a Bayesian framework that enables the incorporation of prior knowledge obtained from different platforms or species. We propose a new test statistic, the Bayesian Credible Ratio (BCR), based on a heteroscedastic global-local shrinkage prior, and a new multiple testing criterion, the sign-adjusted FDR (SFDR), both of which emphasize information about the direction of the differential features. We prove that BCR achieves the largest count of sign-based true positives among all legitimate SFDR-controlling methods. Simulation results offer numerical evidence of its advantage over an empirical Bayes method. The approach is demonstrated through the analysis of RNA-seq and single-cell RNA-seq datasets.
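The sign-aware flavor of this framework can be sketched with a toy normal-normal empirical-Bayes calculation: rank features by the posterior probability that their sign agrees with the direction reported on the other platform. This is not the authors' BCR statistic or SFDR procedure, and the prior parameters below are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m = 1_000
# 10% of features are truly differential, all positive on the prior platform
signal = rng.random(m) < 0.1
effect = np.where(signal, rng.normal(2.0, 0.5, m), 0.0)
z = effect + rng.standard_normal(m)   # observed effect estimates, SE = 1

# Normal-normal posterior with an assumed prior N(0, tau2) on the effects
tau2 = 1.0
post_mean = z * tau2 / (tau2 + 1.0)
post_sd = np.sqrt(tau2 / (tau2 + 1.0))

# Posterior probability each effect is positive (the prior-platform direction)
p_pos = norm.sf(0.0, loc=post_mean, scale=post_sd)

# Sign-aware selection: keep features whose sign confidently matches
hits = p_pos > 0.95
```

Under this simulation, selected features are dominated by true positives whose direction matches the prior platform, which is the property a sign-adjusted error criterion is designed to reward.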
Kornilov, S. A.
Show abstract
Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through four diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h2 is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h2 is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only r_MZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar et al.'s 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.
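The variance-absorption mechanism has a simple arithmetic core. Assuming (our simplification, not the paper's calibrated model) that total log-frailty is u = θ + γ with intrinsic scale σ_θ, omitted extrinsic scale σ_γ, and intrinsic-extrinsic correlation ρ, a one-component fit that matches the total latent variance must set its single scale to Var(u), absorbing both the extrinsic variance and the cross term. The numbers below are illustrative values chosen from the ranges the abstract explores:

```python
import math

# Assumed toy parameters (within the abstract's explored ranges)
sigma_theta = 0.50   # true intrinsic log-frailty scale
sigma_gamma = 0.45   # omitted extrinsic scale, in [0.30, 0.65]
rho = 0.35           # intrinsic-extrinsic correlation, in [0.20, 0.50]

def fitted_scale(s_t, s_g, r):
    """One-component fit absorbing Var(gamma) and the cross term into theta:
    Var(u) = s_t**2 + s_g**2 + 2*r*s_t*s_g."""
    return math.sqrt(s_t**2 + s_g**2 + 2 * r * s_t * s_g)

inflated = fitted_scale(sigma_theta, sigma_gamma, rho)
inflation_pct = 100 * (inflated / sigma_theta - 1)   # upward bias when rho > 0

# Sufficiently negative rho makes the absorbed terms net negative,
# reversing the sign of the bias (cf. Corollary B4)
deflated = fitted_scale(sigma_theta, sigma_gamma, -0.6)
```

For positive ρ the fitted scale always exceeds the true σ_θ, and any downstream quantity monotone in it (such as a Falconer-style heritability) inherits the inflation; for ρ below -σ_γ/(2σ_θ) the bias flips sign.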