Biostatistics — Latest Matching Preprints

1

The Rayleigh Quotient and Contrastive Principal Component Analysis II

Jackson, K. C.; Carilli, M. T.; Pachter, L.

2026-04-10 bioinformatics 10.64898/2026.04.08.717236 medRxiv

Top 0.1%

5.0%

Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-ρPCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-ρPCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.
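
As a rough illustration of the Rayleigh-quotient formulation mentioned in this abstract, the generic relationship can be written as follows. The symbols are assumptions made here for illustration and are not taken from the preprint: A stands for a symmetric target covariance matrix and B for a symmetric positive definite background covariance matrix.

```latex
\[
  v^\star \;=\; \arg\max_{v \neq 0} \frac{v^\top A v}{v^\top B v}
  \quad\Longrightarrow\quad
  A v^\star = \lambda_{\max} B v^\star ,
  \qquad
  \lambda_{\max} = \frac{v^{\star\top} A v^\star}{v^{\star\top} B v^\star}.
\]
```

Under these assumptions, the maximizer is the generalized eigenvector belonging to the largest generalized eigenvalue of the pair (A, B). The specific choices of A and B used for k-ρPCA and f-ρPCA are described in the preprint itself and are not reproduced here.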

2

Regression-based Modeling of Spearman's Rho for Longitudinal Metabolomics and Mental Wellness in Breast Cancer Patients

Chen, Y.; Gui, T.; Huang, Z.; Quach, N.; Tu, S.; Liu, J.; Garrett, T. J.; Starkweather, A. R.; Lyon, D. E.; Shepherd, B. E.; Tu, X. M.; Lin, T.

2026-04-16 cancer biology 10.64898/2026.04.13.718341 medRxiv

Top 0.1%

4.9%

Summary: Chemotherapy in breast cancer (BC) can substantially affect mental wellness. Advances in metabolomics enable comprehensive profiling of metabolic changes over time during and after treatment, offering insights into biological mechanisms linking chemotherapy to mental health outcomes. To study the association between metabolite profiles and mental wellness, correlation-based analyses are particularly useful. Spearman's rho is a widely used correlation measure and a popular alternative to Pearson's correlation, since it also applies to non-linear associations between variables. However, existing methods are not designed for longitudinal data and do not allow for covariate adjustments. In this paper, we propose a novel regression-based framework grounded in a class of semiparametric models, the functional response models, to extend this popular correlation measure to longitudinal settings with missing data under the missing at random assumption. This framework facilitates inferences about temporal changes in correlations and the association of explanatory variables with such changes. We use simulation studies to evaluate performance of the approach with moderate sample sizes. We apply the approach to a one-year longitudinal substudy of the EPIGEN study to examine the longitudinal association between metabolite profiles and mental wellness in BC patients undergoing chemotherapy. The identified metabolites may serve as candidates for future in-depth bioinformatics analyses and translational investigations.
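
For readers unfamiliar with the measure, the following is a minimal sketch of the ordinary cross-sectional Spearman's rho, computed as the Pearson correlation of ranks. It does not implement the longitudinal, regression-based framework proposed in the preprint; the function names and the toy data are hypothetical and chosen only for illustration.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): a monotone but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman(x, y)
r = pearson(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

On this toy example the rank-based measure captures the monotone relationship fully, while Pearson's correlation falls short of 1 because the relationship is not linear.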

3

Testing hypotheses about correlations between brain activation patterns

Diedrichsen, J.; Fu, X.; Shahbazi, M.; Bonner, S.

2026-03-24 neuroscience 10.64898/2026.03.21.713393 medRxiv

Top 0.1%

2.3%

Many functional magnetic resonance imaging (fMRI) studies conclude that two conditions engage "overlapping, yet partly distinct" patterns of activation. Yet, there is currently no commonly accepted method for determining the extent of this overlap. While correlations between activation patterns can serve as a measure of their correspondence, empirical correlations are strongly biased towards zero due to measurement noise, preventing their use in testing hypotheses about the actual degree of pattern correspondence. In this paper, we derive the maximum-likelihood estimate for the correlation of the true (noise-less) activation patterns and examine its behavior in the low signal-to-noise regime that is typical for fMRI studies. We show that although the maximum-likelihood estimate corrects for much of the influence of measurement noise, it is ultimately biased. We examine different ways of drawing inferences about the size of the underlying true correlations. We find that a subject-wise bootstrap on the maximum-likelihood group estimate performs best over the tested conditions. We extend the proposed method to test more general hypotheses about the representational geometry of activation patterns for more conditions, and highlight best practices, as well as common pitfalls and problems, in testing such hypotheses.
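
The bias towards zero described in this abstract can be illustrated with a small simulation. This is only a sketch under simple assumptions that are not taken from the preprint: independent Gaussian measurement noise with the same variance as the signal, one measurement per condition, and hypothetical values for the number of voxels and the true correlation. It does not implement the maximum-likelihood estimator or the bootstrap the authors propose.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is reproducible
n_voxels = 5000          # hypothetical number of voxels
true_r = 0.8             # hypothetical correlation of the noise-free patterns
noise_sd = 1.0           # assumed noise SD (equal to the signal SD)

# Noise-free patterns with unit variance and correlation true_r.
u = [rng.gauss(0, 1) for _ in range(n_voxels)]
w = [rng.gauss(0, 1) for _ in range(n_voxels)]
pattern_a = u
pattern_b = [true_r * a + math.sqrt(1 - true_r ** 2) * b for a, b in zip(u, w)]

# Measured patterns: true pattern plus independent measurement noise.
measured_a = [a + rng.gauss(0, noise_sd) for a in pattern_a]
measured_b = [b + rng.gauss(0, noise_sd) for b in pattern_b]

r_true = pearson(pattern_a, pattern_b)
r_measured = pearson(measured_a, measured_b)

# With unit signal variance, the expected attenuated correlation is
# true_r / (1 + noise_sd**2).
r_expected = true_r / (1 + noise_sd ** 2)
print(f"true: {r_true:.2f}  measured: {r_measured:.2f}  expected: {r_expected:.2f}")
```

In this setup the measured correlation is pulled to roughly half the true value, even though the underlying patterns are strongly correlated.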

4

Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv

Top 0.1%

1.7%

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramér-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.

5

DynoSys 2.0: Graph-Based Modeling of Dynamic Risk States and System Transitions in Human Behaviours Development

Wei, M.; Peng, Q.

2026-05-13 neuroscience 10.64898/2026.05.06.723259 medRxiv

Top 0.1%

1.6%

Human behavioral and mental health outcomes arise from interactions among genetic, environmental, and neurobiological systems. Existing frameworks often model these components jointly, but many treat variables independently or use static representations. This limits their ability to capture system-level dynamics and changes over time. To address this, we developed DynoSys, a unified framework that integrates these signals using three layers: predictive models, relationship exploration models, and mechanism-oriented explanation models. Building on this framework, we introduce DynoSys 2.0, a graph-based temporal modeling approach inspired by the free-energy principle by Karl Friston. In this framework, each individual is represented as a dynamic graph that evolves over time. We hypothesize that healthy development and adverse mental health outcomes correspond to different system states and trajectories. Using longitudinal data from the Adolescent Brain Cognitive Development (ABCD) Study, we construct time-indexed graphs that integrate polygenic risk scores (PRS), multi-domain environmental features, and neuroimaging-derived representations. We study six phenotypes: externalizing behavior, internalizing behavior, and substance use initiation (alcohol, nicotine, cannabis, and any substance). In these graphs, nodes represent domain-level features, and edges capture relationships derived from data-driven feature selection and temporal dependencies. We model graph evolution using recurrent neural networks and graph-temporal learning methods. We also define system-level measures, including graph energy and state transitions, to quantify dynamic patterns. Our results show that DynoSys 2.0 can model behavioral development using longitudinal multi-domain data. The framework achieved meaningful prediction for both continuous behavioral symptoms and substance-use initiation outcomes, but performance differed by outcome type. Externalizing behavior was predicted more accurately than internalizing behavior, and alcohol and any substance initiation showed stronger prediction than cannabis and nicotine initiation. Graph-derived energy measures showed clearer separation for high- versus low-symptom externalizing and internalizing groups, suggesting that continuous behavioral symptoms may be linked to different latent system states over time. Overall, DynoSys 2.0 provides a flexible framework for studying behavioral risk as a dynamic developmental process, while rare-event prediction and detailed graph-level interpretation require further work.

6

Unlocking Multi-Sample Differential Expression for Spatial Transcriptomics Data with TESSERA

Constantine, F.; Laszik, Z.; Dudoit, S.; Purdom, E.

2026-04-30 bioinformatics 10.64898/2026.04.27.720955 medRxiv

Top 0.1%

1.6%

Spatial transcriptomics allows the unprecedented examination of gene expression levels at the resolution of spatially-situated single cells in a high-throughput manner. As the technology is adopted more broadly, studies frequently collect data from multiple tissue samples, which leads to unique challenges that traditional spatial statistical methods are not equipped to handle. In particular, factors that differ across samples, such as different coordinate systems, different numbers and types of cells, different underlying tissue architectures, among others, preclude the application of traditional methods to our new setting. In this work, we propose a novel method, TESSERA, based on a spatial generalized linear model, for analyzing multi-sample spatial transcriptomics count data. Importantly, we provide a mathematical and computational framework for efficient and scalable model fitting and statistical inference to accompany the specification of our model. Our method for fitting the model enables the estimation of a common set of fixed effects across samples. This allows us to address a variety of differential expression questions, such as identification of which genes are differentially expressed between conditions (e.g., diseases, treatments), while accounting for spatial correlation between cells within a sample. We benchmark our proposed method on simulated data and apply it to a spatial transcriptomics dataset of human kidney samples. We find that our method provides a hitherto nonexistent extension to the multi-sample setting while remaining competitive with or outperforming existing algorithms in the single-sample setting.

7

Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv

Top 0.1%

1.2%

Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.

8

Omitted familial extrinsic risk inflates inferred intrinsic lifespan heritability

Kornilov, S. A.

2026-04-06 genetics 10.64898/2026.04.02.716222 medRxiv

Top 0.1%

1.1%

Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through separate diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h² is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h² is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only r_MZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar et al.'s 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.

9

Granger Sensori-Behavioral Taxonomy of Neuronal Ensemble Activity from Two-Photon Calcium Imaging Data

Khosravi, S.; Francis, N. A.; Kanold, P. O.; Babadi, B.

2026-05-15 neuroscience 10.64898/2026.05.12.724603 medRxiv

Top 0.1%

0.8%

Understanding how neuronal populations interact to encode and transform sensory information is a fundamental challenge in computational neuroscience. Most existing studies, however, treat neural encoding, behavioral readout, and functional connectivity as disjoint problems. Two-photon calcium imaging enables simultaneous recording of large neuronal ensembles in vivo, driven by diverse stimuli and eliciting distinct behaviors. However, extracting directional functional connectivity metrics as well as encoding and readout properties of neurons from such data remains difficult due to indirect and noisy observations of spiking activity, slow temporal dynamics, and the latent interplay between external stimuli and endogenous neural processes. Here, we introduce a unified conceptual and operational modeling and inference framework for extracting, directly from two-photon imaging data, functional Granger causal (GC) effects between neurons, from external stimuli to neurons, and from neurons to behavior. Inspired by the intersection information framework, we also identify neurons that encode features of sensory stimuli that inform behavioral readout. The resulting GC networks, together with the taxonomy of functional sensori-behavioral relevance, which we call G-taxonomy, provide a powerful statistical analysis framework, enabled by the integration of several techniques including state-space modeling and inference, variational inference, and point processes. We applied the proposed framework to simulated and experimentally recorded two-photon imaging data from the mouse auditory cortex (A1) during both passive listening and active tone discrimination. Our simulation studies reveal significant improvement of our proposed methodology over existing techniques. Analysis of experimental data from the mouse A1 identifies distinct groups of cells with diverse sensori-behavioral relevance, as well as changes in functional connectivity associated with correct vs. incorrect behavior. In summary, this work provides a principled and data-driven methodology for uncovering directional interactions among the neurons, sensory stimuli, and behavior, all within the same statistical framework, offering new insights into how distributed cortical populations transform sensory inputs into behaviorally relevant representations.
Author Summary: The brain processes sensory inputs through the coordinated activity of large networks of neurons and produces readouts that elicit behavior. Understanding how information flows and is processed through these networks is a central goal of neuroscience. In this study, we present a new computational framework that identifies directional interactions among neurons in an ensemble as well as from sensory stimuli to neurons and from neurons to behavior. Utilizing the Granger formalism to identify directional effects, as opposed to common correlational measures, our framework extracts said effects directly from two-photon calcium imaging data. We tested our proposed method on both simulated data and recordings from the auditory cortex of mice during passive listening and active tone discrimination tasks. Our method revealed diverse groups of neurons in the auditory cortex with distinct functional roles and relevance to sensori-behavioral integration. Our framework provides a new way to study the flow of information in the brain and can be broadly applied to uncover neural computations across sensory and cognitive systems.

10

The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics

Tang, Q.; Chi, E. C.; Wang, W.

2026-05-20 bioinformatics 10.64898/2026.05.18.725545 medRxiv

Top 0.1%

0.7%

We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.

11

Incorporating Uncertainty in Study Participants' Age in Serocatalytic Models

Chen, J.; Lambe, T.; Kamau, E.; Donnelly, C.; Lambert, B.; Bajaj, S.

2026-03-16 infectious diseases 10.64898/2026.03.14.26346885 medRxiv

Top 0.1%

0.7%

Abstract: Serological surveys measure the presence of antibodies in a population to infer past exposure to an infectious pathogen. If study participants' ages are known, serocatalytic models can be used to retrace the historical transmission strength of a pathogen within that population, quantified by the force of infection (FOI). These models rely on age information as a key variable since infection risks are interpreted in relation to how long individuals have been at risk. However, due to data constraints, participants' ages may be provided only within "age bins". A common approach is then to assign individuals' ages to be the midpoints of their respective age bins, ignoring uncertainty in this quantity. In this study, we quantify the bias introduced by this midpoint approach and develop a Bayesian framework that explicitly accounts for uncertainty in age. By comparing inference under constant, age-dependent, and time-dependent FOI scenarios, we show that incorporating uncertainty in age in serocatalytic models yields more reliable FOI estimates without sacrificing computational efficiency. These improvements support the interpretation of serological data and inform public health decisions, such as estimating disease burden and identifying targeted vaccination groups.
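
The effect of the midpoint approach can be sketched with the simplest serocatalytic model, in which the probability of being seropositive at age a under a constant FOI λ is 1 - exp(-λa). The example below is an illustration under assumptions that are not taken from the preprint: ages are uniformly distributed within the bin, and the FOI value and bin limits are hypothetical. It does not implement the authors' Bayesian framework.

```python
import math


def seroprevalence(age, foi):
    """Constant-FOI catalytic model: P(seropositive at a given age)."""
    return 1.0 - math.exp(-foi * age)


def bin_average_seroprevalence(lo, hi, foi):
    """Mean seroprevalence when ages are uniform on [lo, hi] (assumption).

    This is the closed-form integral of 1 - exp(-foi * a) over the bin,
    divided by the bin width.
    """
    return 1.0 - (math.exp(-foi * lo) - math.exp(-foi * hi)) / (foi * (hi - lo))


foi = 0.1            # hypothetical true force of infection per year
lo, hi = 0.0, 20.0   # hypothetical age bin in years
midpoint = (lo + hi) / 2

at_midpoint = seroprevalence(midpoint, foi)
bin_average = bin_average_seroprevalence(lo, hi, foi)

# FOI one would recover by matching the bin-average seroprevalence
# while pretending every participant is exactly at the midpoint age.
foi_from_midpoint = -math.log(1.0 - bin_average) / midpoint

print(f"seroprevalence at midpoint age: {at_midpoint:.4f}")
print(f"average seroprevalence in bin:  {bin_average:.4f}")
print(f"FOI recovered via midpoint:     {foi_from_midpoint:.4f} (true {foi})")
```

Because the seroprevalence curve is concave in age, the value at the midpoint exceeds the bin average, so in this toy setting the midpoint assignment understates the FOI.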

12

Dynamic Bayesian networks for neural information flow: evaluation of continuous and discrete scoring metrics

Thomas-Hegarty, J.; Pulver, S. R.; Smith, V. A.

2026-03-05 neuroscience 10.64898/2026.03.03.709276 medRxiv

Top 0.1%

0.7%

Neural information flow describes the movement of activity between neurons or brain areas. Advances in experimental methods have allowed production of large amounts of observational data related to neuronal activity from the single-neuron to population level. Most current methods for analysing these data are based on pairwise comparison of activity, and fall short of reliably extracting neural information flow network structure. Dynamic Bayesian networks may overcome some of these limitations. Here we evaluate the performance of a range of Bayesian network scoring metrics against the performance of multivariate Granger causality and LASSO regression for their ability to learn the connectivity underlying simulated single-neuron and neuronal population data. We find that discrete dynamic Bayesian networks are the best performing method for single-neuron data, and perform consistently for neural-population data. Continuous dynamic Bayesian networks have a tendency to learn overly dense structures for both data types, but may have utility in scoping studies on single-neuron data. Multivariate Granger causality is the most robust method for learning structure of neural information flow between neural populations, but performs poorly on single-neuron data. Significance testing within multivariate Granger causality produces variable results between data types. Overall, this work highlights how the analysis of neural information flow can vary depending on the type and structure of underlying data, and promotes discrete dynamic Bayesian networks as a useful and consistent tool for neural information flow analysis.

13

On the predictability of progression-free survival in ovarian cancer from NanoString gene expression data

Van Kleunen, L. B.; Bowman, G.; Stockman, S. E.; Townsend, H. A.; Barrios, L.; Jordan, K. R.; Wolsky, R. J.; Behbakht, K.; Sikora, M. J.; Richer, J. K.; Hu, J.; Bitler, B. G.; Clauset, A.

2026-04-24 cancer biology 10.64898/2026.04.22.719856 medRxiv

Top 0.2%

0.6%

In the treatment of high grade serous ovarian cancer (HGSC), patients initially diagnosed with unresectable tumors are first treated with neoadjuvant chemotherapy (NACT) to reduce tumor burden prior to surgery. Analysis of matched pre- and post-NACT samples from the same patients enables the investigation of chemotherapy impacts and the biomarkers of progression. Although the tumor immune microenvironment (TIME) has increasingly been recognized as critical in shaping the development and progression of HGSC, we lack a comprehensive understanding of how chemotherapy remodels the TIME. Previous studies have found evidence for a general inflammatory response post-NACT, despite inconsistencies regarding which differentially expressed genes and pathways are implicated. We combine matched NanoString gene expression data from multiple sources to create a large dataset of matched pre- and post-NACT samples (N=83, with 29 novel to this study) and investigate reproducibility. Further, we use machine learning methods to investigate whether patient progression-free survival (PFS) can be predicted from the observed impact of chemotherapy on the TIME as represented by the comprehensive set of NanoString features. We find overall low predictability of PFS from all NanoString features, suggesting that previous results may have been limited by small sample size effects and that larger datasets are needed to identify more generalizable and translatable findings. We identify a set of differential expression features that are the most important for predicting patient outcomes that can be validated in future computational and biological studies.
Author summary: A subset of patients with high grade serous ovarian cancer are treated with chemotherapy before surgery to reduce tumor burden. We investigate a large dataset of samples taken before and after chemotherapy. These matched samples enable an investigation of how the environment around tumors, for example immune cell infiltration, reacts to chemotherapy, providing insights into biomarkers for treatment response and treatments that could complement chemotherapy. This larger dataset only partially replicates results from previous studies, while also providing new insights. Machine learning models designed to predict the time to patient recurrence from available biomarkers indicate that they are not strongly predictive of patient outcomes, in contrast to past studies. These results suggest that larger datasets are needed. We identify a set of genes that change with chemotherapy and are indicative of and potentially useful for predicting time to disease recurrence and can be further investigated.

14

Estimating uncertainty in family-based GWAS

Miao, X.; Edge, M. D.; Harpak, A.

2026-05-14 genetics 10.64898/2026.05.11.724392 medRxiv

Top 0.2%

0.6%

Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects (causal effects of one's genotype on one's own phenotype) from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps understand the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.

15

Robust Random Forests for Genomic Prediction: Challenges and Remedies

Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.

2026-04-01 bioinformatics 10.64898/2026.03.30.715203 medRxiv

Top 0.2%

0.6%

Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. 
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.

16

Cellector: A tool to detect foreign genotype cells in scRNAseq data with applications in leukemia and microchimerism.

Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714571 medRxiv

Top 0.2%

0.5%

Show abstract

Rare, genetically distinct cells can occur in various settings, including transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is that of leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign genotype cells in single cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells even when they make up an exceedingly low percentage of the cells present (0.05% or lower).

17

Spurious correlation inflates performance in single-cell perturbation prediction

Nicol, P. B.; Shivakumar, S.; Irizarry, R.

2026-05-12 bioinformatics 10.64898/2026.05.07.723486 medRxiv

Top 0.2%

0.5%

Show abstract

The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
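
The bias described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' procedure or data: it assumes standard-normal noise per gene, no true perturbation effect, and a hypothetical non-informative "prediction" drawn from independent noise. Under these assumptions, subtracting the same control mean from both the observed and predicted profiles should induce a correlation of roughly 0.5 when control and perturbed groups are the same size, whereas using disjoint control halves for the two quantities should give a correlation near zero. All function and parameter names are illustrative.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_genes=2000, n_control=20, n_cells=20, seed=0):
    """Compare shared-control and split-control correlations.

    Assumption: every gene has zero true effect, so any correlation
    between observed and predicted differential expression is spurious.
    """
    rng = random.Random(seed)
    half = n_control // 2
    shared_obs, shared_pred, split_obs, split_pred = [], [], [], []
    for _ in range(n_genes):
        control = [rng.gauss(0, 1) for _ in range(n_control)]
        observed = mean([rng.gauss(0, 1) for _ in range(n_cells)])
        # Hypothetical non-informative predictor: independent noise.
        predicted = mean([rng.gauss(0, 1) for _ in range(n_cells)])
        # Shared control: the same control mean enters both quantities.
        shared_obs.append(observed - mean(control))
        shared_pred.append(predicted - mean(control))
        # Split control: disjoint halves define the two quantities.
        split_obs.append(observed - mean(control[:half]))
        split_pred.append(predicted - mean(control[half:]))
    return pearson(shared_obs, shared_pred), pearson(split_obs, split_pred)


shared_r, split_r = simulate()
print(f"shared control: r = {shared_r:.2f}")
print(f"split control:  r = {split_r:.2f}")
```

In this toy setting, lowering `n_control` relative to `n_cells` makes the shared-control correlation larger, which is in line with the abstract's remark that the inflation is most pronounced when control cells are limited.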

18

A biologically annotated neural network for proteomic discovery in Parkinson's disease

Vijayaraghavan, A.; Crawford, L.; Krishnakant, S.; Amini, A. P.; Conard, A. M.; Olsen, A. L.; Chahine, L. M.; Severson, K. A.

2026-04-30 neurology 10.64898/2026.04.29.26351681 medRxiv

Top 0.2%

0.5%

Show abstract

Machine learning models that can utilize high-dimensional data to make predictions and derive biological insights can improve understanding of diseases. Here, we develop a biologically annotated neural network model for proteomics data (P-BANN) which has several practical advantages: (1) it incorporates known relationships between proteins and signaling pathways into its architecture design; (2) it uses Bayesian principles to enable variable selection on the most important proteins for a disease of interest; and (3) it combines structured and black-box variational inference to analyze different classes of phenotypes at scale. To demonstrate the value of the approach, we apply P-BANN to one of the most common neurodegenerative disorders: Parkinson's disease (PD). We consider two biomarker-defined phenotypes within the PD population: presence of neuronal-predominate aggregated α-synuclein in cerebrospinal fluid, and changes in dopamine transporter binding in the striatum on imaging. By considering biomarkers of both neuropathological hallmarks of PD, we can examine the extent to which their underlying biology is connected. Using the P-BANN framework, we discover sparse, statistically-calibrated sets of proteins which map to pathways, enabling more straightforward interpretation and generation of testable hypotheses.

19

Cell Type Weighted Dimensionality Reduction

Putta, S.; Jensen, W.; Devakonda, S.; Pennell, L.; Croteau, J.

2026-05-05 bioinformatics 10.64898/2026.04.30.721796 medRxiv

Top 0.2%

0.5%

Show abstract

High-dimensional single-cell technologies, such as flow cytometry and CITE-Seq, typically rely on established lineage markers to define cell identities. Additional markers are commonly analyzed within the context of these predefined cell types. Nonlinear projection methods such as t-SNE and UMAP provide a visual framework for this analysis by enabling the overlay of cell types and marker expression. However, these methods frequently produce projections where distinct cell types substantially overlap, hindering interpretation of marker expression patterns relative to known cell types. In this study, we investigate the underlying causes of this phenomenon and demonstrate that such overlaps often stem from the inherent high-dimensional structure of the data rather than limitations in the dimensionality reduction algorithms themselves. To address this, we introduce Cell Type Weighted Dimensionality Reduction (CWDR), a novel approach that incorporates lineage-based information through a supervised weighting mechanism. By integrating both cell identity and marker expression, CWDR preserves the visual separation between predefined cell types while maintaining the local variance necessary for downstream analysis. We validate our method across multiple high-dimensional flow cytometry and proteogenomic datasets. Our results show that CWDR significantly reduces inter-cluster overlap compared to traditional methods, providing a clearer framework for visualizing marker expression within the context of specific cell lineages.

20

Cell DiffErential Expression by Pooling (CellDEEP) highlights issues in differential gene expression in scRNA-seq

Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.

2026-03-11 bioinformatics 10.64898/2026.03.09.710522 medRxiv

Top 0.2%

0.5%

Show abstract

Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To overcome the noise and bias of other tools, and to give the user more control over the DEG process, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods, and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential-expression analysis in single-cell transcriptomics.