Biostatistics — Latest Matching Preprints

1

Penalized generalized estimating equations for relative risk regression with applications to brain lesion data

Kindalova, P.; Veldsman, M.; Nichols, T. E.; Kosmidis, I.

2021-11-03 neuroscience 10.1101/2021.11.01.466751 medRxiv

Top 0.1%

6.4%

Show abstract

Motivated by a brain lesion application, we introduce penalized generalized estimating equations for relative risk regression for modelling correlated binary data. Brain lesions can have varying incidence across the brain and result in both rare and high incidence outcomes. As a result, odds ratios estimated from generalized estimating equations with logistic regression structures are not necessarily directly interpretable as relative risks. On the other hand, use of log-link regression structures with the binomial variance function may lead to estimation instabilities when event probabilities are close to 1. To circumvent such issues, we use generalized estimating equations with log-link regression structures with identity variance function and unknown dispersion parameter. Even in this setting, parameter estimates can be infinite, which we address by penalizing the generalized estimating functions with the gradient of the Jeffreys prior. Our findings from extensive simulation studies show significant improvement over the standard log-link generalized estimating equations by providing finite estimates and achieving convergence when boundary estimates occur. The real data application on UK Biobank brain lesion maps further reveals the instabilities of the standard log-link generalized estimating equations for a large-scale data set and demonstrates the clear interpretation of relative risk in clinical applications.

2

The Rayleigh Quotient and Contrastive Principal Component Analysis II

Jackson, K. C.; Carilli, M. T.; Pachter, L.

2026-04-10 bioinformatics 10.64898/2026.04.08.717236 medRxiv

Top 0.1%

5.0%

Show abstract

Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-{rho}PCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-{rho}PCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.

3

Regression-based Modeling of Spearman's Rho for Longitudinal Metabolomics and Mental Wellness in Breast Cancer Patients

Chen, Y.; Gui, T.; Huang, Z.; Quach, N.; Tu, S.; Liu, J.; Garrett, T. J.; Starkweather, A. R.; Lyon, D. E.; Shepherd, B. E.; Tu, X. M.; Lin, T.

2026-04-16 cancer biology 10.64898/2026.04.13.718341 medRxiv

Top 0.1%

4.9%

Show abstract

SO_SCPLOWUMMARYC_SCPLOWChemotherapy in breast cancer (BC) can substantially affect mental wellness. Advances in metabolomics enable comprehensive profiling of metabolic changes over time during and after treatment, offering insights into biological mechanisms linking chemotherapy to mental health outcomes. To study the association between metabolite profiles and mental wellness, correlation-based analyses are particularly useful. Spearmans rho is a widely used correlation measure and popular alternative to Pearsons correlation, since it also applies to non-linear association between variables. However, existing methods are not designed for longitudinal data and do not allow for covariate adjustments. In this paper, we propose a novel regression-based framework grounded in a class of semiparametric models, the functional response models, to extend this popular correlation measure to longitudinal settings with missing data under the missing at random assumption. This framework facilitates inferences about temporal changes in correlations over time and association of explanatory variables for such changes. We use simulation studies to evaluate performance of the approach with moderate sample sizes. We apply the approach to a one-year longitudinal substudy of the EPIGEN study to examine the longitudinal association between metabolite profiles and mental wellness in BC patients undergoing chemotherapy. The identified metabolites may serve as candidates for future in-depth bioinformatics analyses and translational investigations.

4

Generalization of the minimum covariance determinant algorithm for categorical and mixed data types

Beaton, D.; Sunderland, K. M.; ADNI, ; Levine, B.; Mandzia, J.; Masellis, M.; Swartz, R. H.; Troyer, A. K.; ONDRI, ; Binns, M. A.; Abdi, H.; Strother, S. C.

2020-03-31 bioinformatics 10.1101/333005 medRxiv

Top 0.1%

4.9%

Show abstract

The minimum covariance determinant (MCD) algorithm is one of the most common techniques to detect anomalous or outlying observations. The MCD algorithm depends on two features of multivariate data: the determinant of a matrix (i.e., geometric mean of the eigenvalues) and Mahalanobis distances (MD). While the MCD algorithm is commonly used, and has many extensions, the MCD is limited to analyses of quantitative data and more specifically data assumed to be continuous. One reason why the MCD does not extend to other data types such as categorical or ordinal data is because there is not a well-defined MD for data types other than continuous data. To address the lack of MCD-like techniques for categorical or mixed data we present a generalization of the MCD. To do so, we rely on a multivariate technique called correspondence analysis (CA). Through CA we can define MD via singular vectors and also compute the determinant from CAs eigenvalues. Here we define and illustrate a generalized MCD on categorical data and then show how our generalized MCD extends beyond categorical data to accommodate mixed data types (e.g., categorical, ordinal, and continuous). We illustrate this generalized MCD on data from two large scale projects: the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimers Disease Neuroimaging Initiative (ADNI), with genetics (categorical), clinical instruments and surveys (categorical or ordinal), and neuroimaging (continuous) data. We also make R code and toy data available in order to illustrate our generalized MCD.

5

Estimating mutual information under measurement error

Ma, C.; Kingsford, C.

2019-11-23 bioinformatics 10.1101/852384 medRxiv

Top 0.1%

4.9%

Show abstract

Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation for mutual information when the sample size is not the limiting factor for estimating PMF or PDF accurately. We compare the uncorrected and corrected estimator on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators.

6

Strategies for addressing pseudoreplication in multi-patient scRNA-seq data

Malfait, M.; Gilis, J.; Van den Berge, K.; Assefa, A. T.; Verbist, B.; Clement, L.

2024-06-17 bioinformatics 10.1101/2024.06.15.599144 medRxiv

Top 0.1%

4.3%

Show abstract

The rapidly evolving field of single-cell transcriptomics has provided a powerful means for understanding cellular heterogeneity. Large-scale studies with multiple biological samples hold promise for discovering differentially expressed biomarkers with a higher level of confidence through a better characterization of the target population. However, the hierarchical nature of these experiments introduces a significant challenge for downstream statistical analysis. Indeed, despite the availability of numerous differential expression methods, only a select few can accurately address the within-patient correlation of single-cell expression profiles. Furthermore, due to the high computational costs associated with some of these methods, their practical use is limited. In this manuscript, we undertake a comprehensive assessment of different strategies to address the hierarchical correlation structure in multi-sample scRNA-seq data. We employ synthetic data generated from a simulator that retains the original correlation structure of multi-patient data while making minimal assumptions, providing a robust platform for benchmarking method performance. Our analyses indicate that neglecting within-patient correlation jeopardizes type I error control. We show that, in line with some previous reports and in contrast with others, Poisson Generalized Estimation Equations provide a useful and flexible framework for addressing these issues. We also show that pseudobulk approaches outperform single-cell level methods across the board. In this work, we resolve the conflicting results regarding the utility of GEEs and their performance relative to pseudobulk approaches. As such, we provide valuable guidelines for researchers navigating the complex landscape of gene expression modeling, and offer insights on choosing the most appropriate methods based on the specific structure and design of their datasets.

7

Adaptation of a Mutual Exclusivity Framework to Identify Driver Mutations within Biological Pathways

Wang, X.; Kostrzewa, C.; Reiner, A.; Shen, R.; Begg, C.

2023-09-22 cancer biology 10.1101/2023.09.19.558469 medRxiv

Top 0.1%

4.1%

Show abstract

Distinguishing genomic alterations in cancer genes that have functional impact on tumor growth and disease progression from the ones that are passengers and confer no fitness advantage has important clinical implications. Evidence-based methods for nominating drivers are limited by existing knowledge on the oncogenic effects and therapeutic benefits of specific variants from clinical trials or experimental settings. As clinical sequencing becomes a mainstay of patient care, applying computational methods to mine the rapidly growing clinical genomic data holds promise in uncovering novel functional candidates beyond the existing knowledge-base and expanding the patient population that could potentially benefit from genetically targeted therapies. We propose a statistical and computational method (MAGPIE) that builds on a likelihood approach leveraging the mutual exclusivity pattern within an oncogenic pathway for identifying probabilistically both the specific genes within a pathway and the individual mutations within such genes that are truly the drivers. Alterations in a cancer gene are assumed to be a mixture of driver and passenger mutations with the passenger rates modeled in relationship to tumor mutational burden. A limited memory BFGS algorithm is used to facilitate large scale optimization. We use simulations to study the operating characteristics of the method and assess false positive and false negative rates in driver nomination. When applied to a large study of primary melanomas the method accurately identified the known driver genes within the RTK-RAS pathway and nominated a number of rare variants with previously unknown biological and clinical relevance as prime candidates for functional validation.

8

Determining Optimal Placement of Copy Number Aberration Impacted Single Nucleotide Variants in a Tumor Progression History

Wu, C. H.; Joshi, S.; Robinson, W.; Robbins, P. F.; Schwartz, R.; Sahinalp, C.; Malikic, S.

2024-03-13 cancer biology 10.1101/2024.03.10.584318 medRxiv

Top 0.1%

3.9%

Show abstract

Intratumoral heterogeneity arises as a result of genetically distinct subclones emerging during tumor progression. These subclones are characterized by various types of somatic genomic aberrations, with single nucleotide variants (SNVs) and copy number aberrations (CNAs) being the most prominent. While single-cell sequencing provides powerful data for studying tumor progression, most existing and newly generated sequencing datasets are obtained through conventional bulk sequencing. Most of the available methods for studying tumor progression from multi-sample bulk sequencing data are either based on the use of SNVs from genomic loci not impacted by CNAs or designed to handle a small number of SNVs via enumerating their possible copy number trees. In this paper, we introduce DETOPT, a combinatorial optimization method for accurate tumor progression tree inference that places SNVs impacted by CNAs on trees of tumor progression with minimal distortion on their variant allele frequencies observed across available samples of a tumor. We show that on simulated data DETOPT provides more accurate tree placement of SNVs impacted by CNAs than the available alternatives. When applied to a set of multi-sample bulk exome-sequenced tumor metastases from a treatment-refractory, triple-positive metastatic breast cancer, DETOPT reports biologically plausible trees of tumor progression, identifying the tree placement of copy number state gains and losses impacting SNVs, including those in clinically significant genes.

9

Forward-looking serial intervals correctly link epidemic growth to reproduction numbers

Park, S. W.; Sun, K.; Champredon, D.; Li, M.; Bolker, B. M.; Earn, D. J. D.; Weitz, J. S.; Grenfell, B. T.; Dushoff, J.

2020-10-27 infectious diseases 10.1101/2020.06.04.20122713 medRxiv

Top 0.1%

3.9%

Show abstract

Generation intervals and serial intervals are critical quantities for characterizing outbreak dynamics. Generation intervals characterize the time between infection and transmission, while serial intervals characterize the time between the onset of symptoms in a chain of transmission. They are often used interchangeably, leading to misunderstanding of how these intervals link the epidemic growth rate r and the reproduction number[R] . Generation intervals provide a mechanistic link between r and[R] but are harder to measure via contact tracing. While serial intervals are easier to measure from contact tracing, recent studies suggest that the two intervals give different estimates of[R] from r. We present a general framework for characterizing epidemiological delays based on cohorts (i.e., a group of individuals that share the same event time, such as symptom onset) and show that forward-looking serial intervals, which correctly link[R] with r, are not the same as "intrinsic" serial intervals, but instead change with r. We provide a heuristic method for addressing potential biases that can arise from not accounting for changes in serial intervals across cohorts and apply the method to estimating[R] for the COVID-19 outbreak in China using serial-interval data -- our analysis shows that using incorrectly defined serial intervals can severely bias estimates. This study demonstrates the importance of early epidemiological investigation through contact tracing and provides a rationale for reassessing generation intervals, serial intervals, and[R] estimates, for COVID-19. Significance StatementThe generation- and serial-interval distributions are key, but different, quantities in outbreak analyses. Recent theoretical studies suggest that two distributions give different estimates of the reproduction number[R] from the exponential growth rate r; however, both intervals, by definition, describe disease transmission at the individual level. Here, we show that the serial-interval distribution, defined from the correct reference time and cohort, gives the same estimate of[R] as the generation-interval distribution. We then apply our framework to serial-interval data from the COVID-19 outbreak in China. While our study supports the use of serial-interval distributions in estimating[R] , it also reveals necessary changes to the current understanding and applications of serial-interval distribution.

10

Cycling and prostate cancer risk - Bayesian insights from observational study data.

Vincent, B. T.

2020-06-28 cancer biology 10.1101/2020.06.25.171546 medRxiv

Top 0.1%

3.8%

Show abstract

Men have a very high lifetime risk of developing prostate cancer, and so there is a pressing need to understand factors that influence this risk. One factor of interest is whether cycling increases of decreases prostate cancer lifetime risk. Two large observational studies of cyclists noted very low rates of prostate cancer amongst cyclists relative to the general population – neither however drew causal conclusions about risk based on this observational prevalence data alone. Here we explore if and how we can use such data to update our beliefs about whether cycling increases or decreases prostate cancer risk – we use probabilistic methods to quantify belief in risk given the observational data available. We examine whether there is a dose– response relationship, how we can make inferences about risks, and the impact upon selection bias upon these inferences. A simple analysis leads us to believe that cycling decreases risk, but we show how this is mistaken unless selection bias can be ruled out. If cyclists who develop prostate cancer are less likely to respond to these surveys, we may be mislead into believing that cycling decreases risk even if it actually increases risk. Overall we explore precisely why it is hard to draw conclusions about risk factors based upon observational prevalence data.Competing Interest StatementThe authors have declared no competing interest.View Full Text

11

Linear and partially linear models of behavioural trait variation using admixture regression

Connor, G.; Pesta, B. J.

2021-06-17 genomics 10.1101/2021.05.14.444173 medRxiv

Top 0.1%

3.7%

Show abstract

Admixture regression methodology exploits the natural experiment of random mating between individuals with different ancestral backgrounds to infer the environmental and genetic components to trait variation across racial and ethnic groups. This paper provides a statistical framework for admixture regression based on the linear polygenic index model and applies it to neuropsychological performance data from the Adolescent Brain Cognitive Development (ABCD) database. We develop and apply a new test of the differential impact of multi-racial identities on trait variation, an orthogonalization procedure for added explanatory variables, and a partially linear semiparametric functional form. We find a statistically significant genetic component to neuropsychological performance differences across racial identities, and find some possible evidence of nonlinearity in the link between admixture and neuropsychological performance scores in the ABCD data.

12

Simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumour sequence data

Tibely, G.; Schrempf, D.; Derenyi, I.; Szöllosi, G. J.

2021-02-16 bioinformatics 10.1101/2021.02.12.430830 medRxiv

Top 0.1%

3.7%

Show abstract

Tumors often harbor orders of magnitude more mutations than healthy tissues. The increased number of mutations may be due to an elevated mutation rate or frequent cell death and correspondingly rapid cell turnover, or a combination of the two. It is difficult to disentangle these two mechanisms based on widely available bulk sequencing data, where sequences from individual cells are intermixed and, thus, the cell lineage tree of the tumor cannot be resolved. Here we present a method that can simultaneously estimate the cell turnover rate and the rate of mutations from bulk sequencing data. Our method works by simulating tumor growth and finding the parameters with which the observed data can be reproduced with maximum likelihood. Applying this method to a real tumor sample, we find that both the mutation rate and the frequency of death may be high. Author SummaryTumors frequently harbor an elevated number of mutations, compared to healthy tissue. These extra mutations may be generated either by an increased mutation rate or the presence of cell death resulting in increased cellular turnover and additional cell divisions for tumor growth. Separating the effects of these two factors is a nontrivial problem. Here we present a method which can simultaneously estimate cell turnover rate and genomic mutation rate from bulk sequencing data. Our method is based on the estimation of the parameters of a generative model of tumor growth and mutations. Applying our method to a human hepatocellular carcinoma sample reveals an elevated per cell division mutation rate and high cell turnover.

13

Power analysis of transcriptome-wide association study

Ding, B.; Cao, C.; Li, Q.; Wu, J.; Long, Q.

2020-07-26 bioinformatics 10.1101/2020.07.19.211151 medRxiv

Top 0.1%

3.7%

Show abstract

The transcriptome-wide association study (TWAS) has emerged as one of several promising techniques for integrating multi-scale omics data into traditional genome-wide association studies (GWAS). Unlike GWAS, which associates phenotypic variance directly with genetic variants, TWAS uses a reference dataset to train a predictive model for gene expressions, which allows it to associate phenotype with variants through the mediating effect of expressions. Although effective, this core innovation of TWAS is poorly understood, since the predictive accuracy of the genotype-expression model is generally low and further bounded by expression heritability. This raises the question: to what degree does the accuracy of the expression model affect the power of TWAS? Furthermore, would replacing predictions with actual, experimentally determined expressions improve power? To answer these questions, we compared the power of GWAS, TWAS, and a hypothetical protocol utilizing real expression data. We derived non-centrality parameters (NCPs) for linear mixed models (LMMs) to enable closed-form calculations of statistical power that do not rely on specific protocol implementations. We examined two representative scenarios: causality (genotype contributes to phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype and expression), and also tested the effects of various properties including expression heritability. Our analysis reveals two main outcomes: (1) Under pleiotropy, the use of predicted expressions in TWAS is superior to actual expressions. This explains why TWAS can function with weak expression models, and shows that TWAS remains relevant even when real expressions are available. (2) GWAS outperforms TWAS when expression heritability is below a threshold of 0.04 under causality, or 0.06 under pleiotropy. Analysis of existing publications suggests that TWAS has been misapplied in place of GWAS, in situations where expression heritability is low. Author SummaryWe compared the effectiveness of three methods for finding genetic effects on disease in order to quantify their strengths and help researchers choose the best protocol for their data. The genome-wide association study (GWAS) is the standard method for identifying how the genetic differences between individuals relate to disease. Recently, the transcriptome-wide association study (TWAS) has improved GWAS by also estimating the effect of each genetic variant on the activity level (or expression) of genes related to disease. The effectiveness of TWAS is surprising because its estimates of gene expressions are very inaccurate, so we ask if a method using real expression data instead of estimates would perform better. Unlike past studies, which only use simulation to compare these methods, we incorporate novel statistical calculations to make our comparisons more accurate and universally applicable. We discover that depending on the type of relationship between genetics, gene expression, and disease, the estimates used by TWAS could be actually more relevant than real gene expressions. We also find that TWAS is not always better than GWAS when the relationship between genetics and expression is weak and identify specific turning points where past studies have incorrectly used TWAS instead of GWAS.

14

Extending Comparison Methods for Unsigned Networks to Signed Networks

Krinsman, W. E.

2022-09-26 bioinformatics 10.1101/2022.09.23.509251 medRxiv

Top 0.1%

3.7%

Show abstract

We can allow the edges of networks to have both negative and positive weights. For example, signed networks can describe the interactions of microbes. To evaluate the performance of estimators for signed networks, we need quantitative comparison methods for signed networks. Finding such comparison methods is done most easily by extending a comparison method for unsigned networks. Almost all methods reported in the literature for quantitatively comparing networks implicitly assume that edge weights are non-negative. Naive attempts to modify these methods to be applicable to signed networks can lead to nonsensical conclusions. Herein I identify requirements that should be satisfied by reasonable methods for comparing signed networks, most importantly the "double penalization principle". I extend several comparison methods for unsigned networks while satisfying these requirements. Finally, I give examples where these extensions behave reasonably but naive extensions do not.

15

A framework to efficiently smooth L1 penalties for linear regression

Hahn, G.; Lutz, S. M.; Laha, N.; Lange, C.

2020-09-19 bioinformatics 10.1101/2020.09.17.301788 medRxiv

Top 0.1%

3.7%

Show abstract

Penalized linear regression approaches that include an L1 term have become an important tool in statistical data analysis. One prominent example is the least absolute shrinkage and selection operator (Lasso), though the class of L1 penalized regression operators also includes the fused and graphical Lasso, the elastic net, etc. Although the L1 penalty makes their objective function convex, it is not differentiable everywhere, motivating the development of proximal gradient algorithms such as Fista, the current gold standard in the literature. In this work, we take a different approach based on smoothing in a fixed parameter setting (the problem size n and number of parameters p are fixed). The methodological contribution of our article is threefold: (1) We introduce a unified framework to compute closed-form smooth surrogates of a whole class of L1 penalized regression problems using Nesterov smoothing. The surrogates preserve the convexity of the original (unsmoothed) objective functions, are uniformly close to them, and have closed-form derivatives everywhere for efficient minimization via gradient descent; (2) We prove that the estimates obtained with the smooth surrogates can be made arbitrarily close to the ones of the original (unsmoothed) objective functions, and provide explicitly computable a priori error bounds on the accuracy of our estimates; (3) We propose an iterative algorithm to progressively smooth the L1 penalty which increases accuracy and is virtually free of tuning parameters. The proposed methodology is applicable to a large class of L1 penalized regression operators, including all the operators mentioned above. Although the resulting estimates are typically dense, sparseness can be enforced again via thresholding. Using simulation studies, we compare our framework to current gold standards such as Fista, glmnet, gLasso, etc. Our results suggest that our proposed smoothing framework provides predictions of equal or higher accuracy than the gold standards while keeping the aforementioned theoretical guarantees and having roughly the same asymptotic runtime scaling.

16

Optimal Experimental Design for Big Data: Applications in Brain Imaging

Bridgeford, E. W.; Wang, S.; Yang, Z.; Wang, Z.; Xu, T.; Craddock, C.; Kiar, G.; Gray-Roncal, W.; Priebe, C. E.; Caffo, B.; Milham, M.; Zuo, X.-N.; Consortium for Reliability and Reproducibility, ; Vogelstein, J. T.

2019-10-13 neuroscience 10.1101/802629 medRxiv

Top 0.1%

3.7%

Show abstract

Replicability, the ability to replicate scientific findings, is a prerequisite for scientific discovery and clinical utility. Troublingly, we are in the midst of a replicability crisis. A key to replicability is that multiple measurements of the same item (e.g., experimental sample or clinical participant) under fixed experimental constraints are relatively similar to one another. Thus, statistics that quantify the relative contributions of accidental deviations--such as measurement error--as compared to systematic deviations--such as individual differences--are critical. We demonstrate that existing replicability statistics, such as intra-class correlation coefficient and fingerprinting, fail to adequately differentiate between accidental and systematic deviations in very simple settings. We therefore propose a novel statistic, discriminability, which quantifies the degree to which an individuals samples are relatively similar to one another, without restricting the data to be univariate, Gaussian, or even Euclidean. Using this statistic, we introduce the possibility of optimizing experimental design via increasing discriminability and prove that optimizing discriminability improves performance bounds in subsequent inference tasks. In extensive simulated and real datasets (focusing on brain imaging and demonstrating on genomics), only optimizing data discriminability improves performance on all subsequent inference tasks for each dataset. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the replicability crisis, and more generally, mitigating accidental measurement error. Author SummaryIn recent decades, the size and complexity of data has grown exponentially. Unfortunately, the increased scale of modern datasets brings many new challenges. At present, we are in the midst of a replicability crisis, in which scientific discoveries fail to replicate to new datasets. Difficulties in the measurement procedure and measurement processing pipelines coupled with the influx of complex high-resolution measurements, we believe, are at the core of the replicability crisis. If measurements themselves are not replicable, what hope can we have that we will be able to use the measurements for replicable scientific findings? We introduce the "discriminability" statistic, which quantifies how discriminable measurements are from one another, without limitations on the structure of the underlying measurements. We prove that discriminable strategies tend to be strategies which provide better accuracy on downstream scientific questions. We demonstrate the utility of discriminability over competing approaches in this context on two disparate datasets from both neuroimaging and genomics. Together, we believe these results suggest the value of designing experimental protocols and analysis procedures which optimize the discriminability.

17

RobMixReg: an R package for robust, flexible and high dimensional mixture regression

Chang, W.; Wan, C.; Yu, C.; Yao, W.; Zhang, C.; Cao, S.

2020-08-04 bioinformatics 10.1101/2020.08.02.233460 medRxiv

Top 0.1%

3.6%

Show abstract

MotivationMixture regression has been widely used as a statistical model to untangle the latent subgroups of the sample population. Traditional mixture regression faces challenges when dealing with: 1) outliers and versatile regression forms; and 2) the high dimensionality of the predictors. Here, we develop an R package called RobMixReg, which provides comprehensive solutions for robust, flexible as well as high dimensional mixture modeling. Availability and ImplementationRobMixReg R package and associated documentation is available at CRAN: https://CRAN.R-project.org/package=RobMixReg.

18

On real-time calibrated prediction for complex model-based decision support in pandemics: Part 2

McKinley, T. J.; Williamson, D. B.; Xiong, X.; Salter, J. M.; Challen, R.; Danon, L.; Youngman, B. D.; McNeall, D.

2025-05-16 infectious diseases 10.1101/2025.05.16.25327744 medRxiv

Top 0.1%

3.6%

Show abstract

Calibration of complex stochastic infectious disease models is challenging. These often have high-dimensional input and output spaces, with the models exhibiting complex, non-linear dynamics. Coupled with a paucity of necessary data, this results in a large number of non-ignorable hidden states that must be handled by the inference routine. Likelihood-based approaches to this missing data problem are very flexible, but challenging to scale, due to having to monitor and update these hidden states. Methods based on simulating the hidden states directly from the model-of-interest have an advantage that they are often more straightforward to code, and thus are easier to implement and adapt in real-time. However, these often require evaluating very large numbers of simulations, rendering them infeasible for many large-scale problems. We present a framework for using emulation-based methods to calibrate a large-scale, stochastic, age-structured, spatial meta-population model of COVID-19 transmission in England and Wales. By embedding a model discrepancy process into the simulation model, and combining this with particle filtering, we show that it is possible to calibrate complex models to high-dimensional data by emulating the log-likelihood surface instead of individual data points. The use of embedded model discrepancy also helps to alleviate other key challenges, such as the introduction of infection across space and time. We conclude with a discussion of major challenges remaining and key areas for future work.

19

Inferring time-aware models of cancer progression using Timed Hazard Networks

Chen, J.

2022-10-24 cancer biology 10.1101/2022.10.23.513436 medRxiv

Top 0.1%

3.6%

Show abstract

Analysis of the sequential accumulation of genetic events, such as mutations and copy number alterations, is key to understanding disease dynamics and may provide insights into the design of targeted therapies. Oncogenetic graphical models are computational methods that use genetic event profiles from cross-sectional genomic data to infer the statistical dependencies between events and thereby deduce their temporal order of occurrence. Existing research focuses mainly on the development of graph structure learning algorithms. However, no algorithm explicitly links the oncogenetic graph with the temporal differences of samples in an analytic way. In this paper, we propose a novel statistical framework Timed Hazard Networks (TimedHN), that treat progression times as hidden variables and jointly infers oncogenetic graph and pseudo-time order of samples. We modeled the accumulation process as a continuous-time Markov chain and developed an efficient gradient computation algorithm for the optimization. Experiment results using synthetic data showed that our method outperforms the state-of-the-art in graph reconstruction. We highlighted the differences between TimedHN and competing methods on a luminal breast cancer dataset and illustrated the potential utility of the proposed method. Implementation and data are available at https://github.com/puar-playground/TimedHN

20

Efficient Agony Based Transfer Learning Algorithms for Survival Forecasting

Tamaskar, A.; Bannon, J.; Mishra, B.

2021-02-25 cancer biology 10.1101/2021.02.24.432695 medRxiv

Top 0.1%

3.5%

Show abstract

Progression modeling is a mature subfield of cancer bioinformatics, but it has yet to make a proportional clinical impact. The majority of the research in this area has focused on the development of efficient algorithms for accurately reconstructing sequences of (epi)genomic events from noisy data. We see this as the first step in a broad pipeline that will translate progression modeling to clinical utility, with the subsequent steps involving inferring prognoses and optimal therapy programs for different cancers and using similarity in progression to enhance decision making. In this paper we take some initial steps in completing this pipeline. As a theoretical contribution, we introduce a polytime-computable pairwise distance between progression models based on the graph-theoretic notion of "agony". Focusing on a particular progression model we can then use this agony distance to cluster (dis)similarities via multi-dimensional scaling. We recover known biological similarities and dissimilarities. Finally, we use the agony distance to automate transfer learning experiments and show a large improvement in the ability to forecast time to death.