Biostatistics
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match the content profile of Biostatistics, based on 21 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.
Diedrichsen, J.; Fu, X.; Shahbazi, M.; Bonner, S.
Many functional magnetic resonance imaging (fMRI) studies conclude that two conditions engage "overlapping, yet partly distinct" patterns of activation. Yet, there is currently no commonly accepted method for determining the extent of this overlap. While correlations between activation patterns can serve as a measure of their correspondence, empirical correlations are strongly biased towards zero due to measurement noise, preventing their use in testing hypotheses about the actual degree of pattern correspondence. In this paper, we derive the maximum-likelihood estimate for the correlation of the true (noise-less) activation patterns and examine its behavior in the low signal-to-noise regime that is typical for fMRI studies. We show that although the maximum-likelihood estimate corrects for much of the influence of measurement noise, it is ultimately biased. We examine different ways of drawing inferences about the size of the underlying true correlations. We find that a subject-wise bootstrap on the maximum-likelihood group estimate performs best over the tested conditions. We extend the proposed method to test more general hypotheses about the representational geometry of activation patterns for more conditions, and highlight best practices, as well as common pitfalls and problems, in testing such hypotheses.
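The core problem this abstract describes, noise attenuating empirical correlations toward zero, can be illustrated with the classical Spearman disattenuation correction (a simpler stand-in for the paper's maximum-likelihood estimator, shown only to make the bias concrete; the numbers are hypothetical):

```python
import math

def attenuation_corrected_r(r_obs, rel_a, rel_b):
    """Classical Spearman disattenuation: divide the observed correlation
    by the geometric mean of the two patterns' reliabilities.
    This is NOT the paper's ML estimator, just the textbook correction
    that motivates why raw correlations understate true pattern overlap."""
    return r_obs / math.sqrt(rel_a * rel_b)

# Noise attenuates a true correlation of 0.8: with split-half
# reliabilities of 0.5 per pattern, the expected observed r is
# 0.8 * sqrt(0.5 * 0.5) = 0.4; the correction recovers 0.8.
print(attenuation_corrected_r(0.4, 0.5, 0.5))  # -> 0.8
```

As the abstract notes, such corrections remain biased in the low signal-to-noise regime, which is why the authors turn to bootstrap inference on the group estimate.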
Kornilov, S. A.
Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through four separate diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h² is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h² is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp).
If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only rMZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar et al.'s 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.
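The Falconer estimator the abstract refers to is the standard twin-study formula h² = 2(rMZ − rDZ), so any inflation of the MZ correlation propagates directly into h². A minimal sketch with illustrative (hypothetical) correlation values:

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's twin-study estimate: h2 = 2 * (rMZ - rDZ)."""
    return 2.0 * (r_mz - r_dz)

# Hypothetical lifespan twin correlations (illustrative numbers only):
# inflating rMZ by 0.046 while rDZ is unchanged inflates h2 by 9.2 pp,
# the order of bias the abstract attributes to omitted extrinsic frailty.
baseline = falconer_h2(0.25, 0.125)
inflated = falconer_h2(0.296, 0.125)
print(baseline, inflated)
```

This is why the abstract calls Falconer h² "downstream of calibration": the bias lives in the fitted dependence (the twin correlations implied by σ_θ), and the formula merely transmits it.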
Sapoval, N.; Nakhleh, L.
Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies in the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational cost, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complementary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss, and deep coalescence scores. We also explore the empirical implications of this result by evaluating inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.
Gilbert, J.; Wu, C. H.; Knittel, H.; Schäffer, A. A.; Malikic, S.; Sahinalp, C.
Understanding and comparing tumor evolutionary histories is fundamental to cancer genomics. Clonal trees, used to model tumor progression, are rooted, unordered trees in which each node represents a subclone labeled by a set of distinct mutations. To compare two clonal trees, we introduce omlta, the optimal multi-label tree alignment, which removes the minimum number of mutation labels from the trees so that the remaining trees are isomorphic. Computing omlta is NP-hard. Here, we present an algorithm to compute the omlta, with a running time of [Formula] where L ≥ 1 is the total number of mutation labels occurring in the input trees and k is the minimum possible number of mutation labels that need to be removed for the alignment. Our implementation (https://github.com/algo-cancer/omlta) is the first computational tool for determining the optimal alignment between clonal trees. We applied omlta to 126 cases from the TRACERx study on non-small cell lung cancers and to melanoma single-cell data.
Ivanov, S.; Fosse, S.; dos Reis, M.; Duchene, S.
Bayesian inference of divergence times for extant species using molecular data is an unconventional statistical problem: divergence times and molecular rates are confounded, and only their product, the molecular branch length, is statistically identifiable. This means we must use priors on times and rates to break this confounding. As a consequence, there is a lower bound on the uncertainty that can be attained, even under infinite data, for estimates of evolutionary timescales using the molecular clock. With infinite data (i.e., an infinite number of sites and loci in the alignment), uncertainty in the ages of nodes in phylogenies increases proportionally with their mean age, such that older nodes have higher uncertainty than younger nodes. On the other hand, if extinct taxa are present in the phylogeny, and if their sampling times are known (i.e., heterochronous data), then times and rates are identifiable and the uncertainties of inferred times and rates go to zero with infinite data. However, in real heterochronous datasets (such as viruses and bacteria), alignments tend to be small, and how much uncertainty is present and how it can be reduced as a function of data size are questions that have not been explored. This is clearly important for our understanding of the tempo and mode of microbial evolution using the molecular clock. Here we conducted extensive simulation experiments and analyses of empirical data to develop the infinite-sites theory for heterochronous data. Contrary to expectations, we find that uncertainty in the ages of internal nodes scales positively with the distance to their closest tip with known age (i.e., calibration age), not with their absolute age. Our results also demonstrate that estimation uncertainty decreases with calibration age more slowly in data sets with more, rather than fewer, site patterns, although overall uncertainty is lower in the former.
Our statistical framework establishes the minimum uncertainty that can be attained with perfect calibrations and sequence data that are effectively infinitely informative. Finally, we discuss the implications for viral sequence data sets. In the vast majority of cases, viral data from outbreaks are not sufficiently informative to display infinite-sites behaviour, and thus all estimates of evolutionary timescales will be associated with a degree of uncertainty that depends on the size of the data set, its information content, and the complexity of the model. We anticipate that our framework will be useful for determining such theoretical limits in empirical analyses of microbial outbreaks.
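The rate-time confounding at the heart of this abstract can be demonstrated in a few lines: under a simple Poisson substitution model, the likelihood of the data depends only on the product of rate and time (the branch length), so two very different (rate, time) pairs with the same product are indistinguishable. This is a hedged toy illustration, not the paper's model:

```python
import math

def log_lik_substitutions(k, n, rate, time):
    """Poisson log-likelihood of observing k substitutions across n sites:
    only the product b = rate * time (the branch length) enters the model."""
    mu = n * rate * time
    return k * math.log(mu) - mu - math.lgamma(k + 1)

# A fast clock over a short time and a slow clock over a long time give
# identical likelihoods, because 0.01 * 5.0 == 0.001 * 50.0.
l1 = log_lik_substitutions(50, 1000, rate=0.01, time=5.0)
l2 = log_lik_substitutions(50, 1000, rate=0.001, time=50.0)
print(abs(l1 - l2) < 1e-9)  # -> True
```

Sampling times for extinct taxa break this symmetry by pinning some node ages in calendar units, which is why heterochronous data make rates and times identifiable.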
Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.
Over the past two decades, genome-wide association studies (GWAS) have enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models under the assumption that genetic effects are predominantly additive. In this work, we investigate the behavior of the test statistic when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive the mean and variance shift in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and that models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.
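The omitted-interaction mechanism is easy to reproduce in a toy simulation (illustrative generative model with hypothetical effect sizes, not the paper's Estonian Biobank analysis): a phenotype driven purely by a g1 × g2 interaction still produces a nonzero marginal slope when y is regressed on g1 alone, because the interaction leaks into the marginal effect via E[g2].

```python
import random
import statistics

random.seed(1)
n = 20000
maf1, maf2 = 0.3, 0.3

def genotype(maf):
    # Number of minor-allele copies (0/1/2) under Hardy-Weinberg.
    return sum(random.random() < maf for _ in range(2))

g1 = [genotype(maf1) for _ in range(n)]
g2 = [genotype(maf2) for _ in range(n)]

# Phenotype with NO additive effect of g1, only a g1 x g2 interaction.
y = [0.2 * a * b + random.gauss(0, 1) for a, b in zip(g1, g2)]

# Marginal linear regression of y on g1 alone: slope = cov(g1, y) / var(g1).
mean_g, mean_y = statistics.fmean(g1), statistics.fmean(y)
cov = sum((a - mean_g) * (b - mean_y) for a, b in zip(g1, y)) / (n - 1)
slope = cov / statistics.variance(g1)
print(f"marginal slope on g1: {slope:.3f}")  # near 0.2 * E[g2] = 0.12, not 0
```

The direction and size of the resulting shift in the test statistic is exactly what the abstract's conservative/anti-conservative boundary characterizes analytically.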
Ramirez-Lopez, L.; Kang, P.
Irritable Bowel Syndrome (IBS) affects a substantial proportion of university students, yet its risk factors remain incompletely characterised in South Asian populations. We reanalysed a publicly available dataset of 550 Bangladeshi students from Hasan et al. (2025), conducting a data audit that identified implausible records, including males reporting menstrual symptoms, and reduced the analytic sample to 506 observations. Using Explainable Boosting Machines (EBMs), which capture non-linear effects and pairwise interactions without sacrificing interpretability, we found that psychological distress, elevated BMI, and academic dissatisfaction were the strongest predictors of IBS (mean AUC = 0.852 across 100 stratified train-test splits). Critically, several findings diverged from the original logistic regression analysis: physical activity showed a non-linear risk pattern only at high intensity; the association with gender was substantially weaker once metabolic and psychological factors were also accounted for; and malnourishment did not have as strong an impact as in the original study. These divergences likely arise because the machine-learning model captures non-linear effects and interactions that were not represented in the original regression specification. Our findings underscore the value of reanalysing existing datasets with methods suited to capturing complexity, and highlight data quality verification as a necessary step in secondary analysis.
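The data-audit step described above amounts to a plausibility filter over records. A minimal sketch (field names and rules are hypothetical, not the study's actual schema):

```python
# Toy records illustrating the kind of cross-field plausibility check
# the reanalysis applied; the real dataset has 550 rows and many fields.
records = [
    {"id": 1, "sex": "male", "menstrual_symptoms": True},
    {"id": 2, "sex": "female", "menstrual_symptoms": True},
    {"id": 3, "sex": "male", "menstrual_symptoms": False},
]

def implausible(r):
    # Males reporting menstrual symptoms are flagged as implausible.
    return r["sex"] == "male" and r["menstrual_symptoms"]

clean = [r for r in records if not implausible(r)]
print([r["id"] for r in clean])  # -> [2, 3]
```

In the study this kind of rule reduced the analytic sample from 550 to 506 observations before modeling.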
Chen, J.; Lambe, T.; Kamau, E.; Donnelly, C.; Lambert, B.; Bajaj, S.
Serological surveys measure the presence of antibodies in a population to infer past exposure to an infectious pathogen. If study participants' ages are known, serocatalytic models can be used to retrace the historical transmission strength of a pathogen within that population, quantified by the force of infection (FOI). These models rely on age as a key variable, since infection risks are interpreted in relation to how long individuals have been at risk. However, due to data constraints, participants' ages may be provided only within "age bins". A common approach is then to assign individuals the midpoints of their respective age bins, ignoring the uncertainty in this quantity. In this study, we quantify the bias introduced by this midpoint approach and develop a Bayesian framework that explicitly accounts for uncertainty in age. By comparing inference under constant, age-dependent, and time-dependent FOI scenarios, we show that incorporating age uncertainty into serocatalytic models yields more reliable FOI estimates without sacrificing computational efficiency. These improvements support the interpretation of serological data and inform public health decisions, such as estimating disease burden and identifying targeted vaccination groups.
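The midpoint bias is visible already in the simplest constant-FOI catalytic model, P(seropositive by age a) = 1 − exp(−λa): because this curve is nonlinear in age, evaluating it at a bin's midpoint differs from averaging it over the ages in the bin. A hedged numerical sketch (hypothetical FOI and bin, assuming ages uniform within the bin):

```python
import math

def seroprevalence(age, foi):
    """Constant-FOI catalytic model: P(seropositive by age a) = 1 - exp(-foi * a)."""
    return 1.0 - math.exp(-foi * age)

foi = 0.08
lo, hi = 20.0, 40.0   # an "age bin" where only the bounds are reported

# Midpoint approach: evaluate the model at the bin centre.
p_mid = seroprevalence((lo + hi) / 2, foi)

# Averaging over a uniform age distribution within the bin (numerical
# integration) gives the quantity the binned data actually reflect.
steps = 10000
p_avg = sum(seroprevalence(lo + (hi - lo) * (i + 0.5) / steps, foi)
            for i in range(steps)) / steps

print(round(p_mid, 4), round(p_avg, 4))  # midpoint overstates the bin average
```

By Jensen's inequality (exp(−λa) is convex), the midpoint value always exceeds the bin average here; a Bayesian treatment of age, as the paper proposes, integrates over the bin rather than collapsing it to a point.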
Abdallah, H. H.; Kopchick, J.; Hadous, J.; Easter, P.; Rosenberg, D. R.; Stanley, J. A.; Salch, A.; Diwadkar, V. A.
Functional brain imaging data can provide a window into the task-driven network states that shape brain function (or dysfunction). Conventionally, these network states can be represented as bivariate correlation matrices (which are formed from fMRI time series from multiple brain regions/nodes within any task window). Here, we treat these conventional connectivity matrices as connectivity terrains in order to recover local structure. In principle, any such terrain can be traversed node by node, where from any node, one can move towards its nearest functional neighbor (i.e., its maximally correlated node). In terrains with meaningful structure, such traversals across multiple nodes should converge to attractor nodes; here, the nodes that flow into a shared attractor form an attractor basin, which effectively is a sub-network within the system. Extant methods (e.g., degree distribution and characteristic path length) can summarize global network properties but cannot identify attractor nodes and basins. Here, we construct a new relation, called transitive maximal correlation (TMC) that can recover attractors and attractor basins in connectivity terrains. Node A is said to be transitively maximally correlated to node B if and only if B is an attractor into which A flows. We first develop the mathematical basis for deriving a TMC matrix TMC(M) from a bivariate correlation matrix M (before explaining this with hypothetical data). We next apply the TMC relation to connectivity terrains derived from real fMRI time series data, where these data were acquired in two distinct task-domains (that varied in their extent of cross-cerebral demand): i) associative learning and ii) visually guided motor control. 
We show that TMC is remarkably sensitive to inter-hemispheric structure in the connectivity terrain: attractor pairs that were inter-hemispheric homologues were more likely to be observed for the cross-cerebral learning task than for the more circumscribed motor-control task. We confirm the condition-specific sensitivity of TMC by showing that observed attractor basins differed significantly across conditions of the learning task. Finally, we demonstrate that TMC complements graph-theoretic constructions like path length and betweenness centrality. We suggest that TMC is a mathematically sound and novel method for capturing functional properties of brain networks.
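The traversal described above can be sketched in a few lines on a tiny hypothetical correlation matrix (not real fMRI data, and a simplification of the TMC construction): from each node, repeatedly step to the maximally correlated neighbour; with a symmetric matrix and no ties, every walk terminates at a mutually maximal pair, and the nodes flowing into that pair form its basin.

```python
# Hypothetical 4-node connectivity terrain (symmetric correlation matrix).
R = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

def nearest_neighbour(i):
    """Index of the node maximally correlated with node i (excluding i)."""
    return max((j for j in range(len(R)) if j != i), key=lambda j: R[i][j])

def attractor(i):
    """Follow maximal-correlation steps until a node repeats; the walk
    then sits on a mutually maximal pair, reported as the attractor."""
    seen = []
    while i not in seen:
        seen.append(i)
        i = nearest_neighbour(i)
    return frozenset({i, nearest_neighbour(i)})

basins = {}
for node in range(len(R)):
    basins.setdefault(attractor(node), []).append(node)
print({tuple(sorted(a)): members for a, members in basins.items()})
# -> {(0, 1): [0, 1], (2, 3): [2, 3]}
```

Here the terrain decomposes into two basins, each anchored by a mutually maximal pair, which is the sub-network structure that global summaries like degree distribution cannot expose.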
Thomas-Hegarty, J.; Pulver, S. R.; Smith, V. A.
Neural information flow describes the movement of activity between neurons or brain areas. Advances in experimental methods have allowed production of large amounts of observational data related to neuronal activity from the single-neuron to population level. Most current methods for analysing these data are based on pairwise comparison of activity, and fall short of reliably extracting neural information flow network structure. Dynamic Bayesian networks may overcome some of these limitations. Here we evaluate the performance of a range of Bayesian network scoring metrics against the performance of multivariate Granger causality and LASSO regression for their ability to learn the connectivity underlying simulated single-neuron and neuronal population data. We find that discrete dynamic Bayesian networks are the best performing method for single-neuron data, and perform consistently for neural-population data. Continuous dynamic Bayesian networks have a tendency to learn overly dense structures for both data types, but may have utility in scoping studies on single-neuron data. Multivariate Granger causality is the most robust method for learning the structure of neural information flow between neural populations, but performs poorly on single-neuron data. Significance testing within multivariate Granger causality produces variable results between data types. Overall, this work highlights how the analysis of neural information flow can vary depending on the type and structure of underlying data, and promotes discrete dynamic Bayesian networks as a useful and consistent tool for neural information flow analysis.
Weir, B. S.; Goudet, J.
We review the rich literature on the estimation of measures of inbreeding, relatedness and population structure, beginning with Sewall Wright's F-statistics and moving on to the descriptive statistics of Masatoshi Nei and Clark Cockerham. The current availability of genome-level single nucleotide variant data is allowing for sophisticated treatments of inferred identity-by-descent segments and inferred ancestral recombination graphs. Underlying such disparate methods is an emphasis on characterizing the descent status of alleles within and between individuals and populations, and we have found allele-sharing statistics a convenient framework for examining the differences and similarities among different estimators. We have been able to resolve some long-standing reported differences among estimators, especially those involving the work of Nei. In the course of our algebraic and empirical treatment of descent measure estimation we have been able to formulate a set of five recommendations. Following the early work of Sewall Wright, we recommend 1. State that descent measures for pairs of alleles are relative to values in a reference set of allele pairs. With this view, we recommend 2. Use estimators that preserve descent measure rankings over different reference sets. Allele-sharing estimators satisfy this recommendation. Reducing genotypic data to allelic data has the benefit of reducing dimensionality, but we recommend 3. If genotypic data are available, avoid having to assume Hardy-Weinberg equilibrium by not reducing them to allelic data. Partly as a consequence of working with genotypic data, we recommend 4. Recognize that allele frequencies do not need to be estimated. Not estimating allele frequencies prevents the confounding of descent estimates for target pairs of alleles by the status of all pairs in a reference set. On the basis of both theoretical and empirical results, finally we recommend 5. Consider both inbreeding and kinship when estimating either one.
It is difficult to envisage a natural population with relatedness but no inbreeding, or vice versa.
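The allele-sharing framework and the "relative to a reference set" recommendation can be sketched together: compute the average proportion of matching alleles for each pair of individuals, then report relatedness relative to the mean sharing in a reference set of pairs. This is a hedged toy illustration in the spirit of those estimators, with made-up genotypes, not the exact estimator of the review:

```python
from itertools import combinations

# Genotypes at 4 biallelic loci for 3 individuals (alleles coded 0/1);
# toy data for illustration only.
genos = {
    "ind1": [(0, 0), (0, 1), (1, 1), (0, 0)],
    "ind2": [(0, 0), (0, 1), (1, 1), (0, 1)],
    "ind3": [(1, 1), (1, 1), (0, 0), (1, 1)],
}

def allele_sharing(ga, gb):
    """Average proportion of matching alleles across loci, over the four
    between-individual allele comparisons at each locus."""
    per_locus = []
    for (a1, a2), (b1, b2) in zip(ga, gb):
        matches = sum(x == y for x in (a1, a2) for y in (b1, b2))
        per_locus.append(matches / 4)
    return sum(per_locus) / len(per_locus)

# Report sharing relative to the average over a reference set of pairs
# (here, all pairs), so no allele frequencies need to be estimated.
pairs = list(combinations(genos, 2))
M = {p: allele_sharing(genos[p[0]], genos[p[1]]) for p in pairs}
M_ref = sum(M.values()) / len(M)
beta = {p: (m - M_ref) / (1 - M_ref) for p, m in M.items()}
for p, b in beta.items():
    print(p, round(b, 3))
```

Note how the rescaling uses only observed sharing values, illustrating recommendation 4: allele frequencies never enter the computation.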
Chatzis, C.; Horner, D.; Bro, R.; Schoos, A.-M. M.; Rasmussen, M. A.; Acar, E.
Motivation: Temporal multivariate data is ubiquitous in many domains, for instance, being collected over time at planned visits (every few months/years) in longitudinal cohorts, or every few minutes/hours in challenge tests. The analysis of such data often focuses on revealing the underlying temporal patterns common across subjects. However, there are subject-specific differences in temporal patterns, which hold the promise to enhance our understanding of underlying mechanisms and facilitate personalized approaches. Nevertheless, reliably extracting subject-specific temporal patterns from longitudinal multivariate data is an open challenge. Results: We introduce coupled matrix factorizations (CMF) as effective tools to capture subject-specific temporal patterns, focusing on two novel applications: analysis of longitudinal metabolomics data and sensitization data. Our analysis shows that CMF models reliably capture subject-specific (shape) differences in temporal patterns, revealing further insights compared to the state of the art. In metabolomics, CMF models reveal differences in the metabolic responses of individuals (in a postprandial meal challenge) according to anthropometric and insulin sensitivity measures. In sensitization data analysis, CMF-based methods capture differences in the temporal trajectories of children according to delivery/birth mode. We demonstrate the reliability of extracted patterns using reproducibility and replicability. Availability: The code is available at https://github.com/cchatzis/Revealing-Subject-specific-Temporal-Patterns-from-Longitudinal-Data. Clinical data are not publicly available for privacy reasons; data can be made available under a joint research collaboration by contacting COPSAC (administration@dbac.dk).
Hou, S.; Shen, H.; Zhang, Y.
Background: Biomolecular condensates formed via liquid-liquid phase separation (LLPS) play vital roles in cellular organization and function. Computational prediction of phase-separating proteins (PSPs) is increasingly used to prioritize candidates at proteome scale, making robust, well-designed benchmarks essential for fair evaluation and iterative improvement of PSP predictors. Results: We first show that a recently released PSP benchmark is substantially confounded by imbalances in taxonomic origin and intrinsic-disorder composition between positive and negative sets, allowing predictors to achieve high apparent performance by exploiting non-LLPS shortcuts and obscuring their true ability to distinguish PSPs. To minimize these effects, we construct a taxonomy-aware, disorder-matched PSP benchmark. Using this benchmark, we find that absolute sequence and biophysical feature values of PSPs differ markedly across taxa, whereas LLPS-associated feature shifts relative to taxon-specific proteome backgrounds are comparatively conserved. Benchmarking twenty PSP predictors under this framework reveals pronounced taxon-dependent variation in performance. Moreover, PSPs lacking intrinsically disordered regions (IDRs) consistently constitute a more challenging regime across methods, motivating routine disorder-stratified evaluation. Conclusions: Our taxonomy-aware, disorder-matched benchmarking framework reduces shortcut-driven biases, enables more interpretable evaluation of PSP predictors, and provides guidance for developing models that capture transferable LLPS-associated signals rather than dataset- or taxon-specific shortcuts.
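One simple way to realize the "disorder-matched" idea above is greedy 1:1 matching: pair each positive with the unused negative whose disorder fraction is closest, so the two sets cannot be separated by disorder content alone. A hedged sketch with hypothetical protein IDs and disorder fractions (the paper's actual matching procedure may differ):

```python
# Disorder fraction per protein (hypothetical values).
positives = {"P1": 0.82, "P2": 0.15, "P3": 0.55}
negatives = {"N1": 0.80, "N2": 0.10, "N3": 0.50, "N4": 0.95}

matched, used = {}, set()
# Greedy nearest-neighbour matching on disorder fraction, 1:1.
for pid, frac in sorted(positives.items(), key=lambda kv: kv[1]):
    nid = min((n for n in negatives if n not in used),
              key=lambda n: abs(negatives[n] - frac))
    matched[pid] = nid
    used.add(nid)
print(matched)  # -> {'P2': 'N2', 'P3': 'N3', 'P1': 'N1'}
```

After matching, a predictor can no longer score well merely by detecting disorder, which is exactly the shortcut the benchmark audit exposed.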
Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.
Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. 
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
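The ranking-based data transformation found most effective above has a simple core: replace the raw response by its (scaled) ranks before fitting, so a single grossly misrecorded phenotype is bounded rather than dominating the loss. A minimal stdlib sketch with simulated data (illustrative, not the paper's pipeline, and ignoring tie handling):

```python
import random
import statistics

random.seed(0)
y = [random.gauss(0, 1) for _ in range(50)]
y[0] = 1000.0  # one grossly misrecorded phenotype

def rank_transform(values):
    """Replace each value by its rank scaled to (0, 1] (no tie handling)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = (r + 1) / len(values)
    return ranks

print(statistics.stdev(y) > 100)      # the outlier dominates the raw scale
print(max(rank_transform(y)) == 1.0)  # but is bounded after ranking
```

Because the transformation acts only on the response, it slots in front of any learner unchanged, which is why the authors call it transferable to other ML methods.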
Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.
Rare, genetically distinct cells occur in a variety of samples, such as transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign-genotype cells in single-cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells down to exceedingly low proportions (0.05% or lower).
Kuijjer, M. L.; De Marzio, M.; Glass, K.
Analysis of biological networks can provide unprecedented insights into the mechanisms underlying disease. Although many methods have been developed to estimate biological networks, these approaches typically use multiple experimental samples to estimate a single aggregate network, which fails to capture population-level heterogeneity. Recently, several methods have been developed that overcome this limitation by inferring networks for individual samples, i.e., single-sample networks. However, each approach for inferring single-sample networks has been formulated differently, making it challenging to compare them. To address this issue, we re-cast the mathematics of several single-sample network methods using common variables. We then systematically explore the parameters, caveats, and underlying assumptions made by each method and examine how these differences impact single-sample network prediction. Our analyses point to a critical trade-off that occurs when trying to simultaneously predict network edges that are both shared across samples as well as edges that are specific to a given sample. For example, the mathematics of both SWEET and BONOBO includes a scale factor that drives the weights of edges in the predicted single-sample networks toward a background network. The result is that, although networks predicted by these methods tend to have the highest accuracy, this often comes at the cost of very low specificity, an important caveat since the primary goal of sample-specific network modeling is to obtain networks that are specific to each input sample. In contrast, SSN estimates the most specific but least accurate networks, while LIONESS straddles these domains, with an accuracy almost as high as SWEET and BONOBO and a specificity almost as high as SSN. Overall, our analyses highlight some of the broader challenges in this emerging field.
However, they also point to important methodological synergies, providing an opportunity to create a common framework that can be used to improve single-sample network inference.
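To make the single-sample idea concrete, the LIONESS approach mentioned above estimates a sample's edge from the difference between the aggregate network and the leave-one-out network: e_s = N·(e_all − e_without_s) + e_without_s. A small sketch with Pearson-correlation edges and toy expression values (my own illustration of that published identity, not the authors' code):

```python
import statistics

def pearson(x, y):
    """Pearson correlation, written out with stdlib only."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Expression of two genes across 5 samples (toy values; sample 4 is
# an outlier that breaks the otherwise tight gene1-gene2 relationship).
gene1 = [1.0, 2.0, 3.0, 4.0, 10.0]
gene2 = [1.1, 1.9, 3.2, 4.1, 2.0]

def lioness_edge(s):
    """Single-sample edge for sample s via the LIONESS identity:
    e_s = N * (e_all - e_without_s) + e_without_s."""
    n = len(gene1)
    e_all = pearson(gene1, gene2)
    x = [v for i, v in enumerate(gene1) if i != s]
    y = [v for i, v in enumerate(gene2) if i != s]
    e_wo = pearson(x, y)
    return n * (e_all - e_wo) + e_wo

for s in range(len(gene1)):
    print(s, round(lioness_edge(s), 3))
```

The outlying sample receives a strongly negative single-sample edge while the conforming samples stay positive, showing how the leave-one-out contrast isolates each sample's contribution to the aggregate edge.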
Coskun, M.; Lopes, F. B.; Kubilay Tolunay, P.; Chance, M. R.; Koyuturk, M.
Novel technologies for the acquisition of protein expression data at the single-cell level are emerging rapidly. Although there exists a substantial body of computational algorithms and tools for the analysis of single-cell gene expression (scRNAseq) data, tools for even basic tasks such as clustering or cell type identification for single-cell proteomic (scProteomics) data are relatively scarce. Adoption of algorithms that have been developed for scRNAseq into scProteomics is challenged by the larger number of drop-outs, missing data, and noise in single-cell proteomic data. Graph contrastive learning (GCL) on cell-to-cell similarity graphs derived from single-cell protein expression profiles shows promise in cell type identification. However, missing edges and noise in the cell-to-cell similarity graph require careful design of convolution matrices to overcome the imperfections in these graphs. Here, we introduce scProfiterole (Single Cell Proteomics Clustering via Spectral Filters), a computational framework to facilitate effective use of spectral graph filters in GCL-based clustering of single-cell proteomic data. Since clustering assumes a homophilic network topology, we consider three types of homophilic filters: (i) random walks, (ii) heat kernels, and (iii) beta kernels. Direct implementation of these filters is computationally prohibitive, so in practice the filters are either truncated or approximated. To overcome this limitation, scProfiterole uses Arnoldi orthonormalization to implement polynomial interpolations of any given spectral graph filter.
Our results on comprehensive single-cell proteomic data show that (i) graph contrastive learning with carefully initialized learnable polynomial coefficients improves the effectiveness and robustness of cell type identification, (ii) heat kernels and beta kernels improve clustering performance over adjacency matrices or random walks, and (iii) polynomial interpolation of spectral filters outperforms approximation or truncation. The source code for scProfiterole and the Supplementary Materials are available at https://github.com/mustafaCoskunAgu/scProfiterole.
Xu, Y.; Anderson, I. J.; McCord, R. P.; Shen, T.
Specific interchromosomal interactions indicate direct and nonrandom physical associations between pairs of genome positions on two different chromosomes. These contacts can reflect direct communication between non-homologous chromosomes and can enable coordinated activities. It is useful to annotate these complex contact interaction patterns and reduce them to a property associated with a single genome position, both for clean visualization of the patterns and for facilitating comparison with linear genomic annotations and underlying biological functions. We use abstract graphs to characterize interchromosomal interactions, as network analysis can succinctly summarize complex interaction structures. We built a graph representation of cross-chromosomal contact interactions derived from Hi-C data and implemented three network-based annotations that consistently indicate the interchromosomal interaction strength associated with specific genomic positions. Equipped with these metrics, we further investigate whether a chromosome relies on shared hot spots to communicate with other chromosomes. We found that half of the strong interaction positions of chromosome 19 are shared for interacting with chromosomes 17 and 22. We further found that lamina-associated domains (LADs) participate in fewer interchromosomal contacts. Overall, the network-based annotation framework reveals distinct chromosome regulation patches and provides insight into how chromosomes associate with each other and organize with respect to the nuclear envelope.
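The abstract does not specify the paper's three network-based annotations, so as a minimal sketch of the general idea, the simplest such metric is node strength: the summed interchromosomal contact weight attached to each genomic bin. The bin labels and toy contact block below are invented for illustration.

```python
import numpy as np

def interaction_strength(contact, row_bins, col_bins):
    """Per-bin interchromosomal interaction strength (weighted degree).

    contact  : Hi-C contact counts between bins of one chromosome (rows)
               and bins of another chromosome (columns).
    row_bins, col_bins : bin labels for rows and columns.
    Returns a dict mapping each bin label to its total contact weight.
    """
    strength = {}
    for i, b in enumerate(row_bins):
        strength[b] = strength.get(b, 0) + contact[i, :].sum()
    for j, b in enumerate(col_bins):
        strength[b] = strength.get(b, 0) + contact[:, j].sum()
    return strength

# Hypothetical 2x2 contact block between two chromosomes.
contact = np.array([[2.0, 0.0],
                    [1.0, 3.0]])
s = interaction_strength(contact, ["chrA:0", "chrA:1"], ["chrB:0", "chrB:1"])
```

Each bin's strength is a single number attached to a single genome position, which is what makes such annotations directly comparable with linear genomic tracks such as LAD calls.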
Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.
Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To overcome the noise and bias of existing tools and give the user more control over the DEG workflow, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods, and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential-expression analysis in single-cell transcriptomics.
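A minimal sketch of the metacell idea that CellDEEP builds on: pool a fixed number of cells from the same group and sum their counts, trading single-cell resolution for lower noise. CellDEEP's actual pooling strategies and parameterisation are configurable and not detailed in the abstract; `pool_size` and the random pooling rule here are assumptions for illustration.

```python
import numpy as np

def pool_metacells(counts, groups, pool_size=3, seed=0):
    """Aggregate single cells into metacells by summing counts within
    randomly formed pools of `pool_size` cells from the same group.

    counts : cells x genes count matrix.
    groups : per-cell group (e.g. cluster or sample) labels.
    Returns (metacell count matrix, metacell group labels); cells left
    over after filling complete pools are dropped.
    """
    rng = np.random.default_rng(seed)
    metacells, labels = [], []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        rng.shuffle(idx)
        for start in range(0, len(idx) - pool_size + 1, pool_size):
            pool = idx[start:start + pool_size]
            metacells.append(counts[pool].sum(axis=0))
            labels.append(g)
    return np.array(metacells), np.array(labels)

# Toy data: 9 cells x 5 genes, three groups of three cells each.
counts = np.arange(45).reshape(9, 5)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
mc, mc_labels = pool_metacells(counts, groups, pool_size=3)
```

With nine cells in groups of three, each group collapses to one metacell, so the total counts are preserved while the cell-level sampling noise is averaged out.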
Zhang, X.; Vandekar, S.; Chen, A. A.; Kang, K.; Seidlitz, J.; Alexander-Bloch, A.; Liu, J.
Large-scale neuroimaging studies often collect multiple modalities, such as task and resting-state functional MRI, diffusion MRI, and structural MRI. Joint inference across these modalities uses shared variation to improve statistical efficiency, increase replicability, and provide a more integrated view of brain-phenotype associations. In practice, however, such analyses are limited because complex cross-modality covariance cannot be flexibly modeled, which makes the resulting joint effects difficult to interpret. A recent distance-based ANOVA extension allows multimodal analysis and increases power for detecting group differences, but it cannot easily distinguish location from scale effects in distance space, offers only an omnibus pseudo-F test without interpretable parameters, and requires computationally intensive permutation inference. We propose a novel semiparametric, U-statistics-based Generalized Estimating Equation (UGEE) framework that unifies univariate and multivariate distance models. By regressing pairwise dissimilarities on covariates, this method yields interpretable regression coefficients that disentangle location and scale effects and quantify inter-modality differences, while flexibly accounting for correlations among modality distances. The estimator is based on efficient influence functions, ensuring asymptotic efficiency, robustness to misspecification, and computational scalability for large-scale data analysis. We evaluate the proposed method through extensive simulations and analyses of the Adolescent Brain Cognitive Development dataset. Results show that UGEE accurately estimates modality, group, and interaction effects and achieves a 100-fold speed-up compared with permutation-based approaches. This framework provides a general and computationally efficient tool for semiparametric inference on multimodal data, particularly suited for large neuroimaging applications.
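To make "regressing pairwise dissimilarities on covariates" concrete, here is a deliberately simplified OLS version on hypothetical toy data. The UGEE framework additionally uses U-statistics and efficient influence functions to handle the correlation among pairs that share a subject, which plain OLS ignores; this sketch shows only how a coefficient on a between-group indicator captures a location effect in distance space.

```python
import numpy as np

def distance_regression(X, group):
    """Regress pairwise Euclidean dissimilarities on a between-group
    indicator. Returns [intercept, slope]: the intercept estimates the
    mean within-group distance and the slope the between-minus-within
    difference, i.e. a location effect in distance space.
    """
    n = len(group)
    d, z = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d.append(np.linalg.norm(X[i] - X[j]))
            z.append(1.0 if group[i] != group[j] else 0.0)
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, np.array(d), rcond=None)
    return beta

# Two well-separated toy groups in 2D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
group = np.array([0, 0, 1, 1])
beta = distance_regression(X, group)
```

With a binary regressor, OLS reduces to the difference of group means of the distances, so `beta[0]` is the mean within-group distance (1.0 here) and `beta[1]` is the large between-group excess; inference on such coefficients, with valid standard errors under pair correlation, is what the UGEE machinery supplies.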