Biometrics — Latest Matching Preprints

1

Robust Inference of Individualized Treatment Effect in Mendelian Randomization

Liang, M.; Wu, R.; Xiao, F.; Li, X.

2026-05-12 genetics 10.64898/2026.05.08.723855 medRxiv

Top 0.1%

8.3%

Show abstract

Mendelian randomization (MR) is widely used to draw causal conclusions in the presence of unmeasured confounding, but most MR analyses focus on average treatment effects and rely on strong assumptions. For precision medicine, the primary target is instead the individualized treatment effect (ITE); yet in MR, such effects are not point-identified under core IV assumptions, and valid inference is particularly challenging. We therefore propose a robust partial identification inference framework for ITE under MR allowing multiple instruments. Under minimal causal assumptions, we derive a sharp inference procedure for the intersection bounds of ITE by adopting a multiplier bootstrap procedure with data-adaptive bootstrap distribution shifting and heterogeneous variance adjustment. In theory, we prove that the proposed method achieves nominal coverage and asymptotic sharpness. Further, we extend the procedure to tolerate possible invalid IVs under a minimal proportion rule assumption by aggregating over instrument subsets while preserving coverage. Simulation studies demonstrate that the proposed methods attain nominal coverage and substantially shorter intervals than existing procedures. We illustrate the framework using data from the Alzheimers Disease Neuroimaging Initiative to assess heterogeneous causal effects of TREM2 expression on Alzheimers disease risk across education-defined subgroups.

2

Fisher information matrix computation for joint longitudinal and survival models to support clinical study design and covariate effect assessment

Fayette, L.; Brendel, K.; Mentre, F.

2026-06-01 pharmacology and therapeutics 10.64898/2026.05.28.26354340 medRxiv

Top 0.1%

1.9%

Show abstract

Joint modelling of longitudinal data using non-linear mixed effects models and time-to-event outcomes provides a suitable framework to account for informative censoring when estimating biomarker dynamics and quantifying event risk using covariates and longitudinal trajectories. Their usefulness in clinical research depends on data collection design, particularly to precisely estimate the association (link) parameter between longitudinal and survival processes. However, optimal design strategies have so far been addressed separately for longitudinal and survival endpoints and remain unexplored for joint models. We propose two Fisher Information Matrix (FIM) computation methods for joint models, relying on Monte-Carlo integration over observations combined with either Markov Chains Monte-Carlo or Adaptive Gaussian Quadrature to integrate random effects. Their accuracy is assessed against clinical trial simulations in an oncological example based on the HORIZON III study with a tumour-growth-survival model including discrete and continuous covariates. We apply these methods to quantify the impact of follow-up duration, sampling richness, sample size, and covariate distribution on parameter uncertainty and test power. In our example, longitudinal-parameter uncertainty is barely affected by follow-up duration or sampling richness, whereas survival-parameter uncertainty decreases substantially from 1-year to 2-year follow-up. The number of subjects needed (NSN) to achieve <15\% uncertainty on the link parameter is comparable for a 2-year rich design and a 3-year sparse design. Optimal covariate distributions are stable across designs and systematically improve test power, outperforming longer and richer but non-optimised designs. These FIM-based methods accurately predict uncertainty and test powers, enabling design evaluation and NSN computation for joint-model-based clinical studies.

3

The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics

Tang, Q.; Chi, E. C.; Wang, W.

2026-05-20 bioinformatics 10.64898/2026.05.18.725545 medRxiv

Top 0.1%

1.7%

Show abstract

We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.

4

A General Statistical Framework for Hardy-Weinberg Equilibrium Inference on the X Chromosome

Zhang, L.; Paterson, A. D.; Sun, L.

2026-05-20 genetics 10.64898/2026.05.17.725730 medRxiv

Top 0.1%

1.7%

Show abstract

Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson{chi} 2-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.

5

A Beta-Binomial Model for Estimating Zero- or One-inflated Pain Trajectories

Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.

2026-05-11 bioinformatics 10.64898/2026.05.07.721507 medRxiv

Top 0.1%

1.4%

Show abstract

Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for modeling potential zero- or one-inflated pain scores while accounting for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying the patient-reported pain trajectories.

6

Estimating uncertainty in family-based GWAS

Miao, X.; Edge, M. D.; Harpak, A.

2026-05-14 genetics 10.64898/2026.05.11.724392 medRxiv

Top 0.1%

0.9%

Show abstract

Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects--causal effects of ones genotype on ones own phenotype-- from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps understand the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.

7

Multi-resolution Spatial Graphical Regression Models for Hierarchical Spatial Transcriptomics Data

Chen, L.; Acharyya, S.; May, A. M.; Udager, A. M.; Keller, E. T.; Baladandayuthapani, V.

2026-05-15 genomics 10.64898/2026.05.12.724724 medRxiv

Top 0.1%

0.9%

Show abstract

Advances in spatial transcriptomics (ST) technologies enable systematic molecular characterization of tumor microenvironment, tumor gradients and gene regulatory networks. Cancer progression is known to vary along pathological gradients, yet existing network approaches for gene network inference typically ignore hierarchical spatial organization across the tumor. We develop a Bayesian multi-resolution spatial graphical regression (mSGR) framework to infer spatially varying gene networks from multi-resolution ST data. The proposed model allows precision matrices to vary across hierarchically structured spatial domains, capturing both local and global organization within the tumor. To identify spatially varying regulatory relationships, we introduce a spatially structured edge selection strategy that borrows strength across regions according to spatial proximity and pathological gradients, while Gaussian-process priors flexibly model spatial variation in edge strengths. Scalable inference is achieved through an augmented mean-field variational Bayes algorithm with node-wise parallel regressions, enabling efficient estimation in high-dimensional settings. Simulation studies demonstrate improved recovery of network structures compared with competing approaches. Applying mSGR to multi-resolution ST data from kidney cancer reveals stronger regulatory connectivity in transitional regions of epithelial-mesenchymal transition pathway and identifies hub genes along the tumor gradient, illustrating how spatially resolved network analysis can provide key insights into tumor microenvironment organization.

8

Granger Sensori-Behavioral Taxonomy of Neuronal Ensemble Activity from Two-Photon Calcium Imaging Data

Khosravi, S.; Francis, N. A.; Kanold, P. O.; Babadi, B.

2026-05-15 neuroscience 10.64898/2026.05.12.724603 medRxiv

Top 0.2%

0.8%

Show abstract

Understanding how neuronal populations interact to encode and transform sensory information is a fundamental challenge in computational neuroscience. Most existing studies, however, study neural encoding, behavioral readout, and functional connectivity as disjoint problems. Two-photon calcium imaging enables simultaneous recording of large neuronal ensembles in vivo, driven by diverse stimuli and eliciting distinct behaviors. However, extracting directional functional connectivity metrics as well as encoding and readout properties of neurons from such data remains difficult due to indirect and noisy observations of spiking activity, slow temporal dynamics, and the latent interplay between external stimuli and endogenous neural processes. Here, we introduce a unified conceptual and operational modeling and inference framework for directly extracting functional Granger causal (GC) effects between neurons, from external stimuli to neurons, and from neurons to behavior, from two-photon imaging data, in the sense of Granger. Inspired by the intersection information framework, we also identify neurons that encode features of sensory stimuli that inform behavioral readout. The resulting GC networks together with the taxonomy of functional sensori-behavioral relevance, which we call G-taxonomy, provides a powerful statistical analysis framework, enabled by the integration of several techniques including state-space modeling and inference, variational inference, and point processes. We applied the proposed framework to simulated and experimentally-recorded two-photon imaging from the mouse auditory cortex (A1) during both passive listening and active tone discrimination. Our simulation studies reveal significant improvement of our proposed methodology over existing techniques. Analysis of experimental data from the mouse A1 identifies distinct groups of cells with diverse sensori-behavioral relevance, as well as changes in functional connectivity associated with correct vs. incorrect behavior. In summary, this work provides a principled and data-driven methodology for uncovering directional interactions among the neurons, sensory stimuli, and behavior, all within the same statistical framework, offering new insights into how distributed cortical populations transform sensory inputs into behaviorally relevant representations. Author SummaryThe brain processes sensory inputs through the coordinated activity of large networks of neurons and produces readouts that elicit behavior. Understanding how information flows and is processed through these networks is a central goal of neuroscience. In this study, we present a new computational framework that identifies directional interactions among neurons in an ensemble as well as from sensory stimuli to neurons and from neurons to behavior. Utilizing the Granger formalism to identify directional effects, as opposed to common correlational measures, our framework extracts said effects directly from two-photon calcium imaging data. We tested our proposed method on both simulated data and recordings from the auditory cortex of mice during passive listening and active tone discrimination tasks. Our method revealed diverse groups of neurons in the auditory cortex with distinct functional roles and relevance to sensori-behavioral integration. Our framework provides a new way to study the flow of information in the brain and can be broadly applied to uncover neural computations across sensory and cognitive systems.

9

FAMES: Federated additive model using piecewise exponential survival data

Islam, N.; Luo, C.; Tong, J.; Weller, G.; Polleya, D. A.; Kent, A.; Bair, S.

2026-05-19 health informatics 10.64898/2026.05.15.26353335 medRxiv

Top 0.2%

0.7%

Show abstract

Introduction In analyses of time-to-event data, clinical characteristics can have non-linear impacts on survival outcomes, and understanding this dynamic behavior is crucial for producing real-world evidence (RWE). Nonetheless, estimating these dynamic effects is inherently challenging when utilizing real-world data (RWD), especially since sharing individual-level patient data (IPD) is heavily restricted due to regulatory limitations. Additionally, computational difficulties are exacerbated by the high dimensionality, inter-dependency, rarity, sparsity, and scarcity of features. While data augmentation through collaboration across multiple sites might address these challenges, such collaboration is often infeasible and hindered by regulatory measures that protect patient privacy, thereby preventing the sharing of IPD between sites. Objectives To address this challenge, we propose a privacy-preserving regularized algorithm that eliminates the necessity of aggregating any protected health information across sites. This algorithm employs a penalized federated additive model utilizing piecewise exponential survival (FAMES) data and estimates non-linear effects of features while accounting for non-varying confounding effects. The model is flexible and can accommodate both multiple and multivariate smooth effects simultaneously. Methods The proposed model transforms survival data into a piecewise exponential data (PED) structure and casts the semi-parametric optimization problem into a generalized additive modeling framework assuming Poisson distribution. The model uses orthonormal splines to approximate non-linear effects and incorporates L2-norm based penalty terms to control the smoothness and goodness-of-fit of these effects. The algorithm is optimized using site-specific aggregated summary statistics and is solved iteratively through the Newton-Raphson method. Results The model is employed to assess the smooth effects of clinical features, such as age and numeric laboratory values, on overall survival using RWD from approximately 874 newly diagnosed Acute Myeloid Leukemia (AML) patients treated at seven distinct sites in the United States. The model exhibited non-linear smooth effects for lactate dehydrogenase, platelets, and others underscoring their strong association with disease prognosis. The model demonstrates a lossless property, providing estimates of smooth and fixed effects that are comparable to those derived from the pooled PED. Additionally, the inference of parameters for testing the nullity of effects remains consistent. This model is communication-efficient, necessitating roughly twelve rounds of communication across sites. Conclusion We anticipate that this model can facilitate multisite collaboration and enable smaller sites to participate in generating and validating RWE, especially for rare diseases. While the model was applied within the context of AML, it is disease-agnostic and can be implemented in any other clinical context and across various sites globally without losing any generality.

10

Accounting for Uncertainty in the Null Benchmark in Two-Stage Phase II Trials

Irlmeier, R.; Jin, Z.; Ye, F.

2026-05-18 epidemiology 10.64898/2026.05.14.26353210 medRxiv

Top 0.2%

0.7%

Show abstract

Background Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments. Methods We propose the Interval-Null Robust (INR) two-stage design framework that accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: p[isin][p0L, p0U] for binary endpoints and {lambda}[isin][{lambda}0L, {lambda}0U] (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: sup{theta}[isin]{theta}0 Pr{theta}(Go) [≤] . Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration - p0U and {lambda}0L - but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size. Results Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control. Conclusions INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing reliance on a fixed benchmark assumption with a more realistic interval of clinically plausible null values, INR design reduces the risk of false-positive Go-decisions caused by benchmark misspecification. INR applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and accompanying interactive Shiny app.

11

A Bayesian modelling framework for inference of latent infection risk patterns from virus neutralisation assay titration data

Alrefae, T. A.; Pons-Salort, M.; Donnelly, C. A.; Lambert, B.; Kamau, E.

2026-05-21 bioinformatics 10.64898/2026.05.18.726027 medRxiv

Top 0.2%

0.7%

Show abstract

AO_SCPLOWBSTRACTC_SCPLOWSerological assays remain the standard experimental approach for estimating the cumulative incidence of a pathogen and monitoring population immunity. The predominant approach for analysing serum titration data from virus neutralisation assays uses a nearly century-old interpolation-based method which neglects inherent imperfections in the assay and produces estimates with no measure of uncertainty. We introduce a two-part Bayesian modelling framework to estimate the underlying antibody concentrations in the raw serum samples taken from serosurveyed individuals, to improve the interpretation of serological data over age. First, we develop a mechanistic Bayesian model for serum antibody titration data that estimates latent antibody concentrations while accounting for assay variability and quantifying uncertainty. Second, we propagate this uncertainty into an age-structured serocatalytic model by integrating over posterior draws of individual antibody concentrations, allowing joint inference on latent serostate membership, force of infection, and serological waning rate. We use this framework to explore the dynamics of infection and immunity for three enterovirus serotypes: enteroviruses A71 (EV-A71) and D68 (EV-D68) and coxsackievirus A6 (CVA6). These serotypes are leading causes of outbreaks of severe respiratory illness and hand, foot, and mouth disease. Applying these approaches to three cross-sectional serosurveys, we estimated consistently higher and more persistent antibody concentrations throughout life for EV-D68 compared to EV-A71 and CVA6. Our analysis suggests that the proportion of recently infected individuals (i.e. individuals with high estimated antibody concentration levels given their age) peaks around 25% by age 7 years for both EV-A71 and CVA6 before gradually declining with age. In contrast, for EV-D68 the inferred proportion of the population in the infected state exceeds 50% by age 9 years and continues to grow with age. We also estimate that EV-D68 antibody concentration levels are higher than those of the other two serotypes, with the force of infection estimated to be highest in early childhood and declining more gradually with age than for EV-A71 and CVA6. These estimates are different to previous estimates found in the literature. Our inferential framework uncovers the wide-ranging variation in antibody levels that are often obscured by conventional endpoint titre estimation methods. We demonstrate that our framework can infer infection rates without relying on predetermined seropositivity cut-offs and without making explicit assumptions of virus-specific infection mechanisms. Author summarySerological tests measure antibody levels in blood to show how widely a virus has spread and how well populations are protected. Titre-based tests dilute blood samples in steps, mix these dilutions with virus, and add the mixture to living cells; the titre is the highest dilution where antibodies still protect cells from infection. Traditional analyses overlook test imperfections. We present a new two-part Bayesian framework to estimate antibody levels and track age-related exposure to infection. First, we estimate underlying antibody concentrations while accounting for uncertainty, then use these estimates in another model to infer age-specific transmission of three common viruses - EV-A71, EV-D68, and CVA6. Our results show that EV-D68 infections may be more common, especially in children, compared to the other viruses. This new approach provides a clearer picture of the dynamics of seroconversion, without relying on arbitrary thresholds, helping to improve public health monitoring and responses.

12

Benchmarking foundation models for improving confounding control in target trial emulation

Kleper, S. L.; Melamed, R. D.

2026-05-13 epidemiology 10.64898/2026.05.09.26352820 medRxiv

Top 0.2%

0.7%

Show abstract

Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome, creating a spurious biased association. But, these methods are rarely empirically evaluated to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed in order to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.

13

Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.

2026-05-06 health systems and quality improvement 10.64898/2026.05.04.26352392 medRxiv

Top 0.3%

0.6%

Show abstract

ObjectiveStructured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review. MethodsMultiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates ResultsHuman error rates rose as LLM agreement fell, from 0% at scores 3-4 to 76% at score 0. The lowest-agreement stratum was only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987-0.994) and AUC-PR of 0.893 (95% CI 0.851-0.929) for error detection. DiscussionMulti-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review. ConclusionMulti-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.

14

Accelerating Mental Health Precision Trial : An Effective Visualization-Driven Tool for Power and Sample Size Estimation in Biomarker-based Study Designs

Chen, D. Z.; Xie, A.; Ma, C.

2026-05-08 health informatics 10.64898/2026.05.06.26352613 medRxiv

Top 0.3%

0.5%

Show abstract

Precision medicine has given rise to a spectrum of biomarker-guided trial designs, from simple enrichment and strategy designs to more complex adaptive frameworks. To address the need for user-friendly tools that span this spectrum, we developed a unified R Shiny platform that first implements three standard designs: the randomize-all design, the enrichment design, and the biomarker-strategy design, allowing researchers to perform power and sample size calculations under each framework with intuitive inputs and visual outputs. Building on this foundation, the platform further extends to support two-stage general randomized basket trial designs with interim analysis, which can be viewed as a generalization of the standard designs to multiple biomarker-defined subgroups. The tool was rigorously validated by comparison with established R pipelines and published formulas, and user testing confirmed its intuitive interface. By providing seamless integration from standard to advanced designs under a common input-output framework, our platform enables researchers to directly compare power and sample size requirements across different design choices using the same underlying assumptions. The result is a freely accessible tool offering effective visualizations for the full spectrum of biomarker-guided trial designs, available at https://ampt.obicloud.ca/. Future improvements may further expand the tools capabilities to accommodate the increasing complexity of trial designs needed by the research community.

15

Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv

Top 0.3%

0.5%

Show abstract

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.

16

On the Optimal Temporal Resolution for Information Representation in Neural Activity: A Theoretical Analysis

Ahmed, H. F.; Samiei, T.; Nozari, E.

2026-05-21 neuroscience 10.64898/2026.05.19.726394 medRxiv

Top 0.3%

0.5%

Show abstract

IntroductionAlthough neural activity is organized across temporal and spatial scales, the principles that determine the accuracy and fidelity of neural information representation across scales remain unclear. In particular, while recent empirical results have reported mesoscopic optimality in neural decoding, no theoretical accounts exist that explain when and why such intermediate scales emerge as optimal. Here, we develop an analytical framework to study the optimal temporal scale of information representation and its dependence on the dynamic structure of signal and noise in neural data. Materials and MethodsWe formulate a multiscale theoretical model in which neural population activity is represented by temporally encoded trial vectors at microscopic, mesoscopic, and macroscopic resolutions. Neural responses are modeled as class-dependent mean activations (signal) corrupted by temporally correlated noise, and decay rates of correlations in both signal and noise are varied parametrically. Representational quality at each scale is quantified using the sensitivity index (d-prime) for decoding condition from neural activity. ResultsWe derive closed-form expressions for the sensitivity index at each temporal scale. These expressions reveal the key roles of signal and noise correlations as the main determinants of condition decodability at all scales. Comparing expressions under various combinations of signal and noise correlations reveals two regimes. First, when signal and noise correlations are absent or persistent over time, the optimal resolution falls at one of two extremes: macroscale (resp. microscale) if signal correlations are stronger (resp. weaker) than noise correlations. In contrast, when both signal and noise correlations decay with temporal separation, temporal integration produces a nontrivial trade-off: moderate integration improves decodability by suppressing noise while preserving coherent signal, whereas excessive integration degrades signal and decodability. Therefore, only in the latter regime, mesoscopic representations emerge as optimal across a broad range of biologically plausible parameters. DiscussionThis work provides a theoretical explanation for how the optimal temporal scale of neural information representation depends on the interplay between signal and noise correlations and their temporal decay. Broadly, the framework establishes temporal integration as a principled mechanism linking multiscale neural dynamics to information representation and offers testable predictions across recording modalities and neural systems.

17

Gene co-expression networks reveal differential developmental modularity in Mammalian limbs

Howenstine, A. O.; Sears, K. E.

2026-05-08 developmental biology 10.64898/2026.05.07.723669 medRxiv

Top 0.3%

0.4%

Show abstract

Mammalian limb development is a complex system involving several signaling centers and coordinated cell behaviors to sculpt a functioning limb capable of the diverse locomotory strategies that mammals exhibit. To investigate the changes in development that facilitate the generation of the wide array of limb phenotypes across mammals, we take a correlation network approach to investigate the developing limbs of mice, bats, and opossums, which represent typical limb development, a novel limb phenotype, and a shift in developmental timing, respectively. Using transcriptomic data of early limb development across these taxa, we build module correlation networks and identify a difference in network connectivity and the distribution of limb development genes across bat limb development. We identify a unique signature of increased modularity in the bat forelimb that is not detected in mouse or opossum. This modularity is not associated with increased specialization of limb development modules, but rather is marked by target limb development genes being spread evenly across several modules. The opossum, with its standard phenotype but altered developmental timing, does not show a difference in modularity relative to mouse. This work points toward the benefit of a network-minded approach to transcriptomic networks, which reveals developmental modularity and potential gene targets for exploration of developmental system evolution.

18

Tolerance Regions For Compositional Data With Application To Reference Regions For Healthy Microbiome Profiles

Wickramasinghe, N.; Choudhary, P.

2026-05-07 microbiology 10.64898/2026.05.06.723285 medRxiv

Top 0.4%

0.4%

Show abstract

AO_SCPLOWBSTRACTC_SCPLOWImbalances in the human microbiome are associated with numerous diseases, highlighting the need for benchmarks that define healthy microbiome composition and identify abnormal deviations. Although the microbiome is increasingly studied as a potential clinical marker, statistical approaches for constructing reference regions of healthy microbiome composition remain relatively underexplored. This work develops statistical methods to construct reference regions for healthy microbiome data, addressing three main challenges. First, since microbiome data contain relative rather than absolute information, standard statistical methods are not directly appropriate. Therefore, microbiome profiles are treated as compositional data satisfying a sum constraint, and log-ratio transformations are used to analyze them in real space while preserving their relative structure. Second, reference regions are constructed as tolerance regions rather than confidence regions, so that they cover a pre-specified proportion of the healthy population with a given confidence level. The proposed framework incorporates both parametric and nonparametric approaches for constructing these tolerance regions. Parametric methods are considered when the ilr-transformed data approximately follow an elliptical distribution, where they can yield smaller regions while maintaining the desired coverage. Nonparametric approaches provide a flexible alternative by avoiding distributional assumptions. Third, because microbiome data are multidimensional and difficult to interpret, quantitative and graphical tools are introduced to assess atypicality and identify which microbial taxa contribute most to deviations from healthy profiles. Simulation studies are conducted to evaluate the performance of the proposed methods. The methodology is then demonstrated by constructing reference regions for healthy microbiome profiles using real-world data. Finally, the approach is applied to microbiome datasets comparing healthy and patient profiles to assess whether patient samples are identified as atypical and to examine which taxa contribute to these deviations. Overall, the proposed framework provides a clear and statistically robust approach for defining healthy microbiome reference regions and detecting atypical microbiome profiles.

19

Scalable deep-learning-based inference of time-varying transmission dynamics from outbreak phylogenies

XIE, R.; Zhukova, A.; Pena, P. G.; Iglesias, G.; Hu, S.; Wang, J.; Tsang, T. K.; Dhanasekaran, V.; Kraemer, M. U. G.; Pybus, O. G.; Gascuel, O.

2026-05-10 infectious diseases 10.64898/2026.05.07.26352673 medRxiv

Top 0.4%

0.4%

Show abstract

Infectious disease dynamics can be inferred from pathogen genomic data using phylodynamic methods, but the applicability of many such approaches to large data sets is constrained by computational cost. Recent deep-learning approaches to phylodynamics have improved scalability, yet challenges remain when genetic divergence is limited during fast spreading outbreaks. To address this, we use pathogen-specific models to show that deep-learning models trained on outbreak-like phylogenies can accurately estimate the reproductive number (R) when both the birth-death model and the expected phylogenetic resolution are matched to the target pathogen, highlighting the importance of realistic training conditions. Focusing on three major respiratory pathogens of public health importance (SARS-CoV-2, seasonal human influenza virus, and respiratory syncytial virus (RSV)), we introduce PhyloRt, a scalable framework for estimating the time-varying reproductive number (Rt) from large outbreak phylogenies. PhyloRt decomposes large trees into overlapping subtrees and applies a hierarchical deep-learning-based inference strategy to classify subtrees as exhibiting constant or time-varying reproduction numbers, enabling identifiable and computationally efficient estimation of Rt as a piecewise-constant trajectory through time. Applications to SARS-CoV-2 and influenza outbreaks show that PhyloRt recovers transmission dynamics consistent with estimates derived from mathematical epidemiological and Bayesian phylodynamic analyses. Our work enables scalable and rapid estimation of time-varying transmission dynamics from very large-scale outbreak genomic data sets, supporting real-time genomic epidemiology of emerging pathogens. SignificanceEstimating changes in transmission dynamics over time is important for responding to infectious disease outbreaks. Current methods mostly rely on reported case data from epidemiological surveillance, which can be biased or incomplete due to variable testing capabilities, particularly in resource-limited settings. A complementary approach is to use viral genomes as an alternative data source. However, inferences from genomic data can be computationally intensive and have mainly been applied retrospectively. We present PhyloRt, a scalable deep-learning-based phylodynamic framework that enables fast inference of the time-varying reproductive number (Rt) from large outbreak phylogenies. Our approach is widely applicable and provides a practical approach to monitoring epidemic dynamics, complementing traditional surveillance and supporting timely public health decision-making.

20

Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.

2026-05-14 bioinformatics 10.64898/2026.05.12.724515 medRxiv

Top 0.4%

0.4%

Show abstract

MotivationDNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model. ResultsHere we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.