Biostatistics
◐ Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Biostatistics's content profile, based on 21 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.
Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.
Show abstract
Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
Wei, M.; Peng, Q.
Show abstract
Human behavioral and mental health outcomes arise from interactions among genetic, environmental, and neurobiological systems. Existing frameworks often model these components jointly, but many treat variables independently or use static representations. This limits their ability to capture system-level dynamics and changes over time. To address this, we developed DynoSys, a unified framework that integrates these signals using three layers: predictive models, relationship exploration models, and mechanism-oriented explanation models. Building on this framework, we introduce DynoSys 2.0, a graph-based temporal modeling approach inspired by the free-energy principle by Karl Friston. In this framework, each individual is represented as a dynamic graph that evolves over time. We hypothesize that healthy development and adverse mental health outcomes correspond to different system states and trajectories. Using longitudinal data from the Adolescent Brain Cognitive Development (ABCD) Study, we construct time-indexed graphs that integrate polygenic risk scores (PRS), multi-domain environmental features, and neuroimaging-derived representations. We study six phenotypes: externalizing behavior, internalizing behavior, and sub-stance use initiation (alcohol, nicotine, cannabis, and any substance). In these graphs, nodes represent domain-level features, and edges capture relationships derived from data-driven feature selection and temporal dependencies. We model graph evolution using recurrent neural networks and graph-temporal learning methods. We also define system-level measures, including graph energy and state transitions, to quantify dynamic patterns. Our results show that DynoSys 2.0 can model behavioral development using longitudinal multi-domain data. The framework achieved meaningful prediction for both continuous behavioral symptoms and substance-use initiation outcomes, but performance differed by outcome type. Externalizing behavior was predicted more accurately than internalizing behavior, and alcohol and any substance initiation showed stronger prediction than cannabis and nicotine initiation. Graph-derived energy measures showed clearer separation for high-versus low-symptom externalizing and internalizing groups, suggesting that continuous behavioral symptoms may be linked to different latent system states over time. Overall, DynoSys 2.0 provides a flexible framework for studying behavioral risk as a dynamic developmental process, while rare-event prediction and detailed graph-level interpretation require further work.
Khosravi, S.; Francis, N. A.; Kanold, P. O.; Babadi, B.
Show abstract
Understanding how neuronal populations interact to encode and transform sensory information is a fundamental challenge in computational neuroscience. Most existing studies, however, study neural encoding, behavioral readout, and functional connectivity as disjoint problems. Two-photon calcium imaging enables simultaneous recording of large neuronal ensembles in vivo, driven by diverse stimuli and eliciting distinct behaviors. However, extracting directional functional connectivity metrics as well as encoding and readout properties of neurons from such data remains difficult due to indirect and noisy observations of spiking activity, slow temporal dynamics, and the latent interplay between external stimuli and endogenous neural processes. Here, we introduce a unified conceptual and operational modeling and inference framework for directly extracting functional Granger causal (GC) effects between neurons, from external stimuli to neurons, and from neurons to behavior, from two-photon imaging data, in the sense of Granger. Inspired by the intersection information framework, we also identify neurons that encode features of sensory stimuli that inform behavioral readout. The resulting GC networks together with the taxonomy of functional sensori-behavioral relevance, which we call G-taxonomy, provides a powerful statistical analysis framework, enabled by the integration of several techniques including state-space modeling and inference, variational inference, and point processes. We applied the proposed framework to simulated and experimentally-recorded two-photon imaging from the mouse auditory cortex (A1) during both passive listening and active tone discrimination. Our simulation studies reveal significant improvement of our proposed methodology over existing techniques. Analysis of experimental data from the mouse A1 identifies distinct groups of cells with diverse sensori-behavioral relevance, as well as changes in functional connectivity associated with correct vs. incorrect behavior. In summary, this work provides a principled and data-driven methodology for uncovering directional interactions among the neurons, sensory stimuli, and behavior, all within the same statistical framework, offering new insights into how distributed cortical populations transform sensory inputs into behaviorally relevant representations. Author SummaryThe brain processes sensory inputs through the coordinated activity of large networks of neurons and produces readouts that elicit behavior. Understanding how information flows and is processed through these networks is a central goal of neuroscience. In this study, we present a new computational framework that identifies directional interactions among neurons in an ensemble as well as from sensory stimuli to neurons and from neurons to behavior. Utilizing the Granger formalism to identify directional effects, as opposed to common correlational measures, our framework extracts said effects directly from two-photon calcium imaging data. We tested our proposed method on both simulated data and recordings from the auditory cortex of mice during passive listening and active tone discrimination tasks. Our method revealed diverse groups of neurons in the auditory cortex with distinct functional roles and relevance to sensori-behavioral integration. Our framework provides a new way to study the flow of information in the brain and can be broadly applied to uncover neural computations across sensory and cognitive systems.
Tang, Q.; Chi, E. C.; Wang, W.
Show abstract
We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.
Miao, X.; Edge, M. D.; Harpak, A.
Show abstract
Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects--causal effects of ones genotype on ones own phenotype-- from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps understand the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.
Nicol, P. B.; Shivakumar, S.; Irizarry, R.
Show abstract
The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
Putta, S.; Jensen, W.; Devakonda, S.; Pennell, L.; Croteau, J.
Show abstract
High-dimensional single-cell technologies, such as flow cytometry and CITE-Seq, typically rely on established lineage markers to define cell identities. Additional markers are commonly analyzed within the context of these predefined cell types. Nonlinear projection methods such as t-SNE and UMAP provide a visual framework for this analysis by enabling the overlay of cell types and marker expression. However, these methods frequently produce projections where distinct cell types substantially overlap, hindering interpretation of marker expression patterns relative to known cell types. In this study, we investigate the underlying causes of this phenomenon and demonstrate that such overlaps often stem from the inherent high-dimensional structure of the data rather than limitations in the dimensionality reduction algorithms themselves. To address this, we introduce Cell Type Weighted Dimensionality Reduction (CWDR), a novel approach that incorporates lineage-based information through a supervised weighting mechanism. By integrating both cell identity and marker expression, CWDR preserves the visual separation between predefined cell types while maintaining the local variance necessary for downstream analysis. We validate our method across multiple high-dimensional flow cytometry and proteogenomic datasets. Our results show that CWDR significantly reduces inter-cluster overlap compared to traditional methods, providing a clearer framework for visualizing marker expression within the context of specific cell lineages.
Brandulas Cammarata, A.; Fonseca Costa, S. S.; Rosikiewicz, M.; Roux, J.; Wollbrett, J.; Bastian, F. B.; Robinson-Rechavi, M.
Show abstract
RNA-Seq is a powerful technique to provide quantitative information on gene expression. While many applications focus on measuring expression levels, accurately distinguishing between actively and inactively transcribed genes is equally important for understanding gene function, development, and disease mechanisms. However, setting a biologically meaningful threshold for calling genes expressed is challenging due to variability in noise levels across different protocols, experiments or biological samples. We propose to define this threshold per sample relative to the background level observed in inactive genomic features, inferred by the amount of reads mapped to intergenic regions of the genome, and to call genes expressed if their level of expression is significantly higher than the estimated background noise. This approach can be applied to a single RNA-Seq library as well as to a combination of libraries from the same condition, in model and non-model organisms. We show that our method yields a more accurate prediction of expression state than existing methods, illustrated by consistent expression calls for biological replicates in the same tissue.
Pailozian, K.; Kohout, P.; Damborsky, J.; Mazurenko, S.
Show abstract
MotivationProtein melting temperature (Tm) prediction accelerates the discovery of thermostable enzymes which are crucial for industrial biotechnology often requiring harsh reaction conditions. Experimental determination of Tm remains labour-intensive and varies across techniques, motivating the development of in silico predictors. Mass-spectrometry datasets such as Meltome Atlas now enable large-scale Tm prediction with models based on deep learning, but model generalisation across diverse experimental datasets has not been systematically tested. ResultsWe evaluated the generalisability of state-of-the-art deep learning approaches and explored ESM-based embeddings for Tm prediction. To this end, we assembled the ProMelt training dataset (45 441 proteins) and five independent biophysics-based validation datasets. Our analysis revealed substantial differences between proteomics- and biophysics-based Tm measurements, highlighting the challenge of cross-domain generalisation. Existing state-of-the-art predictors trained on large-scale proteomics datasets showed reduced performance on biophysics-based validation sets. Our fine-tuned embedding-based models, particularly LoRA-adapted ESM-2 (TmProt 1.0), outperformed state-of-the-art predictors in identifying thermostable proteins (Tm[≥] 60 {degrees}C) across heterogeneous datasets, achieving AUC scores of 0.75-0.77. We also demonstrated that the available models could be used efficiently in the sequence prioritization task. AvailabilityThe TmProt web server is available at https://loschmidt.chemi.muni.cz/tmprot/. Source code and data are available at https://github.com/loschmidt/TmProt.
Lan, Y.; Wu, C.-Y.; Lin, H.-H.; Cohen, T.; Warren, J. L.
Show abstract
Pairwise analysis of genomic and spatial data offers opportunities to identify and estimate the associations between covariates and the transmission of pathogens between individuals. However, such pairwise analyses are computationally intensive, and may not be feasible to conduct given the high dyad count in even moderately sized datasets. Here we compare two approaches to increase the efficiency of pairwise analysis for large datasets. We quantify and compare the performance of divide-and-conquer Bayesian model fitting and pairwise case-control approaches for estimating associations between individual- and pair-level covariates and shared membership in a transmission cluster. We utilize a large dataset (n=4,154) of spatially-referenced, genomically-sequenced Mycobacterium tuberculosis isolates collected from a single city for this analysis. We find that the case-control approach produces unbiased estimates of effect sizes with expected credible interval coverage and is more robust than the divide-and-conquer method when effect sizes are large. Thus, we recommend using the case-control approach with at least three controls per case to downscale datasets for pairwise analysis when analysis of the entire dataset is not possible. This approach mitigates the computational challenges of pairwise Bayesian modeling on datasets that require significant computational resources while maintaining desired inferential properties. Author SummaryPairwise analyses of large datasets to study pathogen transmission are computationally demanding because they typically require simultaneous analysis of each possible pair of individuals in a dataset; as datasets become larger these analyses often are not feasible to conduct even with access to high-performance computing resources. In this work, we compare a case-control approach and divide-and-conquer approaches for more efficient pairwise analysis of large datasets. Using a large dataset of Mycobacterium tuberculosis isolates including genetic and spatial data, we investigate the performance of each method for estimating the associations between host covariates and genetic clustering of isolates. We find that the case-control approach is generally preferred over methods which first divide the data into subsets and then combine results. While additional extensions of these analyses are needed to test the generality of these findings to other data settings, this work provides a practical way forward for the pairwise analysis of large datasets to study pathogen transmission.
Pham, B. K.; Davenport, S.; Azriel, D.; Schwartzman, A.
Show abstract
LD Score Regression (LDSC) is a prominent method, which estimates whole-genome SNP heritability from summary statistics via the slope of a linear regression of GWAS test statistics corresponding to a trait of interest against LD scores. It was claimed by the LDSC authors that the free intercept in the regression accounts for confounding bias such as population stratification. In this study, we argue that the intercept in LDSC must be fixed to 1 for accurate SNP heritability estimation. We show both theoretically and with simulations that the estimated intercept does not accurately capture population stratification effects, and that it adversely affects the accuracy of the heritability estimate introducing bias and increasing variance. Fixing the intercept to 1 eliminates bias and reduces variance when no population stratification is present. On the other hand, under population stratification, LDSC is biased with both the free and the fixed intercept. Additionally, we show that estimated standard errors in LDSC are underestimated, potentially leading to false-positives in downstream GWAS analyses.
Ahmed, H. F.; Samiei, T.; Nozari, E.
Show abstract
IntroductionAlthough neural activity is organized across temporal and spatial scales, the principles that determine the accuracy and fidelity of neural information representation across scales remain unclear. In particular, while recent empirical results have reported mesoscopic optimality in neural decoding, no theoretical accounts exist that explain when and why such intermediate scales emerge as optimal. Here, we develop an analytical framework to study the optimal temporal scale of information representation and its dependence on the dynamic structure of signal and noise in neural data. Materials and MethodsWe formulate a multiscale theoretical model in which neural population activity is represented by temporally encoded trial vectors at microscopic, mesoscopic, and macroscopic resolutions. Neural responses are modeled as class-dependent mean activations (signal) corrupted by temporally correlated noise, and decay rates of correlations in both signal and noise are varied parametrically. Representational quality at each scale is quantified using the sensitivity index (d-prime) for decoding condition from neural activity. ResultsWe derive closed-form expressions for the sensitivity index at each temporal scale. These expressions reveal the key roles of signal and noise correlations as the main determinants of condition decodability at all scales. Comparing expressions under various combinations of signal and noise correlations reveals two regimes. First, when signal and noise correlations are absent or persistent over time, the optimal resolution falls at one of two extremes: macroscale (resp. microscale) if signal correlations are stronger (resp. weaker) than noise correlations. In contrast, when both signal and noise correlations decay with temporal separation, temporal integration produces a nontrivial trade-off: moderate integration improves decodability by suppressing noise while preserving coherent signal, whereas excessive integration degrades signal and decodability. Therefore, only in the latter regime, mesoscopic representations emerge as optimal across a broad range of biologically plausible parameters. DiscussionThis work provides a theoretical explanation for how the optimal temporal scale of neural information representation depends on the interplay between signal and noise correlations and their temporal decay. Broadly, the framework establishes temporal integration as a principled mechanism linking multiscale neural dynamics to information representation and offers testable predictions across recording modalities and neural systems.
Alrefae, T. A.; Pons-Salort, M.; Donnelly, C. A.; Lambert, B.; Kamau, E.
Show abstract
AO_SCPLOWBSTRACTC_SCPLOWSerological assays remain the standard experimental approach for estimating the cumulative incidence of a pathogen and monitoring population immunity. The predominant approach for analysing serum titration data from virus neutralisation assays uses a nearly century-old interpolation-based method which neglects inherent imperfections in the assay and produces estimates with no measure of uncertainty. We introduce a two-part Bayesian modelling framework to estimate the underlying antibody concentrations in the raw serum samples taken from serosurveyed individuals, to improve the interpretation of serological data over age. First, we develop a mechanistic Bayesian model for serum antibody titration data that estimates latent antibody concentrations while accounting for assay variability and quantifying uncertainty. Second, we propagate this uncertainty into an age-structured serocatalytic model by integrating over posterior draws of individual antibody concentrations, allowing joint inference on latent serostate membership, force of infection, and serological waning rate. We use this framework to explore the dynamics of infection and immunity for three enterovirus serotypes: enteroviruses A71 (EV-A71) and D68 (EV-D68) and coxsackievirus A6 (CVA6). These serotypes are leading causes of outbreaks of severe respiratory illness and hand, foot, and mouth disease. Applying these approaches to three cross-sectional serosurveys, we estimated consistently higher and more persistent antibody concentrations throughout life for EV-D68 compared to EV-A71 and CVA6. Our analysis suggests that the proportion of recently infected individuals (i.e. individuals with high estimated antibody concentration levels given their age) peaks around 25% by age 7 years for both EV-A71 and CVA6 before gradually declining with age. In contrast, for EV-D68 the inferred proportion of the population in the infected state exceeds 50% by age 9 years and continues to grow with age. We also estimate that EV-D68 antibody concentration levels are higher than those of the other two serotypes, with the force of infection estimated to be highest in early childhood and declining more gradually with age than for EV-A71 and CVA6. These estimates are different to previous estimates found in the literature. Our inferential framework uncovers the wide-ranging variation in antibody levels that are often obscured by conventional endpoint titre estimation methods. We demonstrate that our framework can infer infection rates without relying on predetermined seropositivity cut-offs and without making explicit assumptions of virus-specific infection mechanisms. Author summarySerological tests measure antibody levels in blood to show how widely a virus has spread and how well populations are protected. Titre-based tests dilute blood samples in steps, mix these dilutions with virus, and add the mixture to living cells; the titre is the highest dilution where antibodies still protect cells from infection. Traditional analyses overlook test imperfections. We present a new two-part Bayesian framework to estimate antibody levels and track age-related exposure to infection. First, we estimate underlying antibody concentrations while accounting for uncertainty, then use these estimates in another model to infer age-specific transmission of three common viruses - EV-A71, EV-D68, and CVA6. Our results show that EV-D68 infections may be more common, especially in children, compared to the other viruses. This new approach provides a clearer picture of the dynamics of seroconversion, without relying on arbitrary thresholds, helping to improve public health monitoring and responses.
Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.
Show abstract
Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameters exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyper-parameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly un-der suboptimal settings of numerical hyperparameters--the neighbor-hood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms. While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.
Lück, N.; Rossi, A.; Staerk, C.
Show abstract
MotivationConventional pipelines for differential expression analysis in single-cell RNA sequencing (scRNA-seq) data first cluster individual cells and then test for differentially expressed genes between the resulting clusters. Using the same data for clustering and testing, however, poses a selective inference problem and can result in overconfidence in differences that may not reflect true biological variation. ResultsWe introduce StabCell, a stability selection framework which integrates clustering and detection of differentially expressed marker genes. By repeatedly performing clustering and differential expression analysis on complementary random subsamples, StabCell assesses clustering and marker stability, yielding a stable clustering with sets of stable marker genes. In simulations, we demonstrate that StabCell provides approximate empirical per-family error rate (PFER) control, selecting fewer false positive marker genes compared with conventional approaches, especially in cases with low signal-to-noise ratio and low sequencing depth. Applying the method to a cell differentiation dataset from induced pluripotent stem cells (IPSCs) to cardiomyocytes reveals that meaningful marker genes are consistently among the top-ranked genes. These results indicate that StabCell can improve the interpretability and robustness of scRNA-seq analyses. Availability and implementationAn implementation of StabCell in the statistical programming language R is available at https://github.com/LuckyLueck/StabCell. Code to reproduce the results is available at https://github.com/LuckyLueck/StabCell_paper.
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Show abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance - not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor [≥]15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.
Show abstract
Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for modeling potential zero- or one-inflated pain scores while accounting for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying the patient-reported pain trajectories.
Fayette, L.; Brendel, K.; Mentre, F.
Show abstract
Joint modelling of longitudinal data using non-linear mixed effects models and time-to-event outcomes provides a suitable framework to account for informative censoring when estimating biomarker dynamics and quantifying event risk using covariates and longitudinal trajectories. Their usefulness in clinical research depends on data collection design, particularly to precisely estimate the association (link) parameter between longitudinal and survival processes. However, optimal design strategies have so far been addressed separately for longitudinal and survival endpoints and remain unexplored for joint models. We propose two Fisher Information Matrix (FIM) computation methods for joint models, relying on Monte-Carlo integration over observations combined with either Markov Chains Monte-Carlo or Adaptive Gaussian Quadrature to integrate random effects. Their accuracy is assessed against clinical trial simulations in an oncological example based on the HORIZON III study with a tumour-growth-survival model including discrete and continuous covariates. We apply these methods to quantify the impact of follow-up duration, sampling richness, sample size, and covariate distribution on parameter uncertainty and test power. In our example, longitudinal-parameter uncertainty is barely affected by follow-up duration or sampling richness, whereas survival-parameter uncertainty decreases substantially from 1-year to 2-year follow-up. The number of subjects needed (NSN) to achieve <15\% uncertainty on the link parameter is comparable for a 2-year rich design and a 3-year sparse design. Optimal covariate distributions are stable across designs and systematically improve test power, outperforming longer and richer but non-optimised designs. These FIM-based methods accurately predict uncertainty and test powers, enabling design evaluation and NSN computation for joint-model-based clinical studies.
Meduri, R.; Satish, A. L.; Singh, U.
Show abstract
Selective deployment of multiple transcription start sites is a major regulatory feature of human transcriptomes. FANTOM CAGE data exhibit a near-universal TSS deployment parsimony which is disrupted in cancers. We have recently shown that TSS deployment is sensitive to gene function, futile upstream transcription, and cellular biosynthetic states. Patterns in FANTOM CAGE data can reveal mechanisms underlying TSS co-deployments. We propose and test the possibility that some TSSs act like epromoters and act as co-varying hubs of transcriptional activities for multiple other promoters. Using deep analysis of CAGE data implemented through neural networks we show that non-cancers implement transcription co-deployments through cores of epromoter-like TSSs which are generally proximal to their start codons. These TSSs show enhancer-like TFBSs profiles. A comparison with cancer CAGE data shows that the concentrated epromoter core is disrupted in cancers with multiple distal TSSs replacing the proximal TSS cores. We provide evidence that the core TSSs are rich in YY1 and CTCF binding sites and associated with genes coding for transcription factors. Our findings show that covariance of TSS deployment is sensitive to transcriptional resource cost and a parsimonic design of TSS co-deployments depends on proximal TSSs in non-cancers, a mechanism grossly disrupted in cancers. HighlightsO_LIHeterogeneous FANTOM CAGE data contains universal patterns of TSSs co-deployments. C_LIO_LITSS co-deployments exhibit a parsimonious "core-covariant" scheme which is disrupted in cancers. C_LIO_LICore TSSs are enriched in transcription factor binding sites and gene functions which justify biological features of the samples. C_LIO_LIThe DL pipeline we present identifies the core-covariant TSS sets in an unbiased manner. C_LI
Chen, L.; Acharyya, S.; May, A. M.; Udager, A. M.; Keller, E. T.; Baladandayuthapani, V.
Show abstract
Advances in spatial transcriptomics (ST) technologies enable systematic molecular characterization of tumor microenvironment, tumor gradients and gene regulatory networks. Cancer progression is known to vary along pathological gradients, yet existing network approaches for gene network inference typically ignore hierarchical spatial organization across the tumor. We develop a Bayesian multi-resolution spatial graphical regression (mSGR) framework to infer spatially varying gene networks from multi-resolution ST data. The proposed model allows precision matrices to vary across hierarchically structured spatial domains, capturing both local and global organization within the tumor. To identify spatially varying regulatory relationships, we introduce a spatially structured edge selection strategy that borrows strength across regions according to spatial proximity and pathological gradients, while Gaussian-process priors flexibly model spatial variation in edge strengths. Scalable inference is achieved through an augmented mean-field variational Bayes algorithm with node-wise parallel regressions, enabling efficient estimation in high-dimensional settings. Simulation studies demonstrate improved recovery of network structures compared with competing approaches. Applying mSGR to multi-resolution ST data from kidney cancer reveals stronger regulatory connectivity in transitional regions of epithelial-mesenchymal transition pathway and identifies hub genes along the tumor gradient, illustrating how spatially resolved network analysis can provide key insights into tumor microenvironment organization.