GENETICS
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match GENETICS's content profile, based on 189 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Waples, R. S.
Interest in quantifying linkage disequilibrium (LD, non-random associations of alleles at different loci) has skyrocketed in recent years as researchers have focused on use of LD in genome-wide association studies (GWAS), for studying historical demography, and for estimating effective population size (Ne). The most widely used LD metric is r2, the squared correlation of alleles at a pair of loci. Despite a half century of efforts, an unbiased expectation of r2 as a function of the many factors that can affect it (physical linkage, genetic drift, selection, migration, mutation, mating systems) remains elusive. Furthermore, even when all of these other factors are absent, empirical estimates of r2 are upwardly biased by sampling a finite number (S) of individuals, and that must be accounted for if one wants to focus on the desired signal of LD. Previous approaches to estimate [Formula] have been shown to be biased to greater or lesser degrees. The purpose of this short paper is to demonstrate that a simple and apparently exact expression for [Formula] does exist for the special case where sampling error is the only factor contributing to r2, in which case [Formula] = 1/(S - 1). When other factors contribute heavily to LD, [Formula] shrinks toward 0 as empirical r2 → 1. However, for estimating contemporary Ne with unlinked markers, empirical r2 will generally be small and 1/(S - 1) will provide a robust estimate of [Formula].
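The 1/(S - 1) result stated in this abstract can be checked numerically with a small simulation. The sketch below is illustrative only and is not the author's code: it assumes r2 is computed as the squared Pearson correlation of genotype allele counts (0, 1, 2) at two independent loci, and the function name `mean_sampling_r2` and its parameters are hypothetical.

```python
import numpy as np


def mean_sampling_r2(sample_size, reps=20000, freq=0.5, seed=0):
    """Average r2 between two independent loci across simulated samples.

    Assumption: r2 is the squared Pearson correlation of genotype allele
    counts, so sampling error is the only source of non-zero r2.
    """
    rng = np.random.default_rng(seed)
    # Each row is one replicate sample of `sample_size` diploid genotypes.
    a = rng.binomial(2, freq, size=(reps, sample_size)).astype(float)
    b = rng.binomial(2, freq, size=(reps, sample_size)).astype(float)
    # Center each replicate so that sums of products act as (co)variances.
    a -= a.mean(axis=1, keepdims=True)
    b -= b.mean(axis=1, keepdims=True)
    ss_a = (a * a).sum(axis=1)
    ss_b = (b * b).sum(axis=1)
    # r2 is undefined when either locus is monomorphic in the sample.
    keep = (ss_a > 0) & (ss_b > 0)
    cross = (a * b).sum(axis=1)
    r2 = cross[keep] ** 2 / (ss_a[keep] * ss_b[keep])
    return r2.mean()


S = 50
print("simulated mean r2:", mean_sampling_r2(S))
print("1 / (S - 1):      ", 1.0 / (S - 1))
```

Under these assumptions the two printed values should agree closely, and subtracting 1/(S - 1) from an empirical mean r2 gives the sampling-corrected value when r2 is small.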
Percival-Smith, A.; Brabrook, C.
An expectation of a hypothesis that proposes cell-to-cell signalling pathways are redundant due to the redundancy of pathway terminal transcription factors (TFs) was tested by screening 35 signalling ligands (SLs) for rescue of a decapentaplegic (dpp) hypomorphic wing growth phenotype. The screen identified three examples of partial rescue: Hedgehog (HH), Semaphorin 1a (SEMA1A) and Wnt ortholog 2 (WNT2). HH overexpression with dppGAL4 may increase the expression of DPP activity from the hypomorphic dpp alleles. However, SEMA1A and WNT2 did not phenocopy ectopic expression of HH or DPP, and neither SEMA1A nor WNT2 was required for wing growth, suggesting that they substitute for DPP to partially restore wing growth. The WNT2 rescue was dependent on the Frizzled 4 (FZ4) WNT receptor, excluding the possibility that WNT2 weakly binds the DPP receptor. Although examples of phenotypic nonspecificity of SL function were identified, this is an expectation, and not direct proof, of the hypothesis of TF redundancy. Screen Report Summary: An expectation of a hypothesis proposing that cell-to-cell signalling pathways are redundant due to the redundancy of the pathway terminal transcription factors was tested by screening for replacement of one signalling ligand (DPP; SLa) with another (SLb) for wing growth. Three non-DPP SLs were identified in the screen of 35 SLs: HH, SEMA1A and WNT2. Genetic analysis of Sema1a and Wnt2 suggests functional complementation of dpp, indicating that SEMA1A and WNT2 partially replace DPP for wing growth. Therefore, an expectation of the hypothesis is met.
Clo, J.
Whole genome duplication is a common mutation in eukaryotes with far-reaching phenotypic effects. The resulting morphological, physiological, and fitness consequences and how they affect the survival probability of newly polyploid lineages are intensively studied, but very little is known about the effect of genome doubling on the short-term evolvability of populations. Understanding the effect of polyploidization on the adaptive potential of populations is of crucial importance to predict the future of polyploid populations. In this paper, I investigate the immediate consequences of genome doubling on the genetic variance of populations. To do so, I performed numerical iterations and simulations of how the genetic variance of a quantitative trait changes after polyploidization, under different genetic architectures (additivity, dominance, and epistasis). I found that genetic variance generally decreases after genome doubling. Non-additive gene actions can make autotetraploid populations genetically more diverse than their diploid progenitors in rare cases, notably with overdominance and directional epistasis. By collecting estimates from the agronomic literature, I found that both dominance and epistatic variance contribute to the genetic variance of polyploid populations. These results bring new insights into the adaptive potential of newly formed tetraploid populations, and call for further experimental investigations of how polyploidization is associated with a short-term decrease in evolvability.
Lopez-Cortegano, E.; Charlesworth, B.
A sudden reduction in population size increases the rate of genetic drift, reducing variability and increasing the mean level of homozygosity. The resulting increased exposure of recessive or partially recessive, strongly deleterious alleles to selection against homozygotes may lead to their being purged from the population, potentially allowing mean fitness to increase after an initial decline, and accelerating the decline in inbreeding depression associated with reduced variability. However, detailed population genetic theory on the effects of population bottlenecks on mean fitness and inbreeding depression remains limited. We develop a theoretical framework for small, randomly mating populations founded from a large population near mutation-selection-drift equilibrium, using both simulations and approximate analytical predictions. These provide quantitative predictions for the dynamics of the population's mean fitness and level of inbreeding depression following a bottleneck. In particular, we derive an approximate expression for the time needed for mean fitness to recover after an initial decline; such a recovery requires selection to be sufficiently strong relative to drift and mutations to be sufficiently recessive. In contrast, weakly deleterious mutations cause reductions in mean fitness and inbreeding depression that are similar in size to those predicted from increases in neutral homozygosity.
Haque, T.; Siddiq, M. A.; Duveau, F. M.; Wittkopp, P.
Genetically identical cells grown in the same environment show variation in gene expression known as expression noise. Expression noise can be heritable and impact fitness, making it subject to natural selection. Increasing expression noise for the Saccharomyces cerevisiae TDH3 gene was shown to be beneficial in glucose-based media when mean TDH3 expression was far from the fitness optimum but deleterious when it was close to this optimum. Here, we show that growth on different carbon sources alters the effects of new mutations on TDH3 expression noise and examine the fitness effects of changing expression noise. In galactose-based media, we observed the same relationship between expression noise and fitness seen in glucose-based media, but in glycerol- and ethanol-based media, we observed the opposite relationship or no significant relationship, respectively. Using simulations of single-cell organisms, we found that these differences were most likely explained by environment-specific relationships between gene expression and fitness. We also found that, far from the optimum, the fitness effects of noise were greatest when expression was highly heritable between mother and daughter cells. The empirical observations and simulations reported in this study show how environments influence both the production of expression noise and its impacts on fitness.
Martinez-Rodriguez, L. E.; Bell, S. P.
The origin recognition complex (ORC) selects origins of replication and directs the loading of the Mcm2-7 replicative helicase at these sites. Five of the six ORC subunits are related to the AAA+ family of ATPases. Although functions for ATP hydrolysis by Cdc6 and the Mcm2-7 complex have been described, the essential role of ORC ATP hydrolysis remains unclear. We performed a genetic screen in Saccharomyces cerevisiae for suppressors of the lethal phenotype of the orc4-R267A allele, which disrupts ORC ATP hydrolysis in vitro. We identified six causative mutations, five of which are distributed across different ORC subunits. The suppressor mutations in Orc1 and Orc4, but not the other ORC subunits, increase the in vitro helicase loading activity of ATPase-defective ORC (ORC4R). Allele specificity studies showed the alleles specifically suppress defects at ATPase interfaces within the ORC-Cdc6 complex. The sixth allele is a mutation in TOA2, a subunit of the TFIIA general transcription factor. Mutations in the general transcription factors TBP and TFIIB, and the large subunit of RNA Polymerase II also suppressed the orc4-R267A lethality, suggesting that reducing transcription is sufficient for suppression. Our study identifies multiple ways to suppress the lethal phenotype of an ATPase defective ORC allele and reveals a connection between ORC ATP hydrolysis and transcription.
Cannon, M. J.; Bratulin, A.; Kuzma, K.; Puthawala, D.; Corsmeier, D.; Schieffer, K.; Kelly, B.; Cottrell, C.; Wagner, A. H.
Genomic medicine relies on expert evaluation of genomic variants, but this process is dramatically slowed by a lack of readily accessible genomic knowledge. Although genomic knowledge resources such as ClinVar and CIViC support structured data sharing and provide interfaces for adding structure, much of the variant interpretation data generated upstream of these resources is not readily interoperable with them, limiting the ability of clinical labs to share data and creating knowledge silos. Here we evaluate a strategy for breaking down these knowledge silos in a pilot study that transforms semi-structured variant classification knowledge into computable clinical assertions using the Global Alliance for Genomics and Health (GA4GH) Genomic Knowledge Standards specifications. We programmatically mapped previously captured somatic cancer clinical significance classifications from spreadsheets to the GA4GH Variant Annotation specification. For diagnostic classification data, this approach enabled reuse of standards-aware submission tooling to share 1,499 records to ClinVar. We then studied how AI-assisted approaches for overcoming gaps in unstructured text enabled scalable curation of prior classifications. Using this approach, we were able to accurately classify clinical significance for 71.8% (117/163) of randomly sampled prognostic evidence statements. We conclude with an overview of how this work may be generalized to make computationally inaccessible variant evidence from other clinical laboratories broadly reusable in downstream knowledgebases such as CIViC and ClinVar.
Pham, B. K.; Davenport, S.; Azriel, D.; Schwartzman, A.
LD Score Regression (LDSC) is a prominent method that estimates whole-genome SNP heritability from summary statistics via the slope of a linear regression of GWAS test statistics for a trait of interest against LD scores. The LDSC authors claimed that the free intercept in the regression accounts for confounding bias such as population stratification. In this study, we argue that the intercept in LDSC must be fixed to 1 for accurate SNP heritability estimation. We show both theoretically and with simulations that the estimated intercept does not accurately capture population stratification effects, and that it adversely affects the accuracy of the heritability estimate by introducing bias and increasing variance. Fixing the intercept to 1 eliminates bias and reduces variance when no population stratification is present. On the other hand, under population stratification, LDSC is biased with both the free and the fixed intercept. Additionally, we show that standard errors in LDSC are underestimated, potentially leading to false positives in downstream GWAS analyses.
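The difference between a free and a fixed intercept can be made concrete with a small sketch. It assumes the commonly stated LDSC expectation, E[chi2_j] = N * h2 * l_j / M + intercept, where l_j is the LD score of SNP j, N the GWAS sample size, and M the number of SNPs. It uses ordinary unweighted least squares rather than the weighted regression of the LDSC software, and the function name `ldsc_h2` is hypothetical.

```python
import numpy as np


def ldsc_h2(chi2, ld_scores, n, m, fix_intercept=True):
    """Estimate SNP heritability from the slope of chi2 on LD scores.

    Assumed model: E[chi2_j] = n * h2 * ld_j / m + intercept.
    This is plain least squares, not the weighted LDSC estimator.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld = np.asarray(ld_scores, dtype=float)
    if fix_intercept:
        # Regression through the origin of (chi2 - 1) on the LD scores.
        intercept = 1.0
        slope = np.sum(ld * (chi2 - intercept)) / np.sum(ld * ld)
    else:
        # Two-parameter fit: the intercept is estimated from the data.
        design = np.column_stack([np.ones_like(ld), ld])
        coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
        intercept, slope = float(coef[0]), float(coef[1])
    return slope * m / n, intercept


# Toy data generated from the assumed model with noise added.
rng = np.random.default_rng(1)
n, m, true_h2 = 50_000, 2_000, 0.4
ld = rng.uniform(1.0, 200.0, size=m)
chi2 = 1.0 + n * true_h2 * ld / m + rng.normal(0.0, 5.0, size=m)

print("fixed intercept:", ldsc_h2(chi2, ld, n, m, fix_intercept=True))
print("free intercept: ", ldsc_h2(chi2, ld, n, m, fix_intercept=False))
```

In this toy setting without stratification both fits should recover a heritability near 0.4; the abstract's argument concerns how the extra free parameter affects bias and variance in realistic settings, which this sketch does not reproduce.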
Oubninte, S.; Ruczinski, I.; Yanek, L. R.; Mathias, R.; Bureau, A.
Few studies have assessed the performance of population-based phasing combined with parental genotypes to infer recombination on whole genome sequence (WGS) data. In this study, our objective was to evaluate whether Shapeit2 duoHMM, a Hidden Markov Model using parental genotypes, infers recombination events reliably on WGS and with narrower intervals than on SNP arrays. We based our analysis on the overlap between recombination events inferred by Merlin on SNP genotypes and by Shapeit2 on WGS and SNP genotypes. We used a sample of 61 extended families from the GeneSTAR study with TOPMed freeze 8 WGS on 580 sequenced subjects (60% of the sample). Shapeit2 was run with a window size of 500 kilobases and 200 states on WGS. To mimic a SNP array, we extracted genotypes of 355,112 autosomal markers on the Illumina OmniExpress array. The number of recombination events per meiosis inferred by Shapeit2 on the WGS data (36.8) was close to the expected number over autosomes (35.7), whereas Merlin overestimated this number (115.0). 73% of Shapeit2 recombination events on WGS were detected by Merlin, a proportion rising to 91% when restricting to events also inferred by Shapeit2 on OmniExpress genotypes. Furthermore, Shapeit2 recombination intervals were narrower on WGS than on OmniExpress genotypes (median of 4,530 bp vs. 49,458 bp). This suggests that Shapeit2 on WGS is a reliable and accurate method for inferring recombination events.
Redhuis, A. C.; Wittkopp, P. J.
Organisms cope with environmental changes by modifying gene expression. To understand how regulatory networks controlling expression plasticity evolve, we analyzed RNAseq data from Saccharomyces cerevisiae, Saccharomyces paradoxus, and their F1 hybrids at multiple timepoints after transferring cells from standard laboratory conditions to five environments (low phosphorus, low nitrogen, hydroxyurea shock, heat stress, and cold stress) and during the diauxic shift. In each of the six datasets, we identified genes that changed expression following the transition to the new environment and used hierarchical clustering to identify genes that increased or decreased in expression. We then compared these classifications between orthologs to identify genes with divergent plasticity. For some genes, plasticity was more extreme in one species than the other, and for others, expression of orthologs changed in opposite directions when acclimating to the same environment. Most cases of plasticity divergence were seen only in one environment and were attributable primarily to trans-regulatory divergence. Using environment-specific regulatory networks inferred from data in Yeastract, we found that divergent plasticity of environment-specific transcription factors generally did not predict divergent plasticity of their target genes. We also found that, as a group, genes with conserved plasticity tended to have more regulatory interactions than genes with divergent plasticity. Interesting patterns of expression divergence were also observed for five transcription factors in the pleiotropic drug resistance network and their target genes that might contribute to phenotypic divergence. Together, these findings show how environment-specific trans-regulatory divergence and combinatorial gene regulation shape the evolution of expression plasticity.
Miao, X.; Edge, M. D.; Harpak, A.
Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects (causal effects of one's genotype on one's own phenotype) from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps clarify the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.
Kocik, R. A.; Ahrens, J.; Gasch, A. P.
Yeast responding to acute stress reallocate cellular resources, in part via the Environmental Stress Response (ESR) that induces stress-defense genes while repressing ribosome-biogenesis and growth genes. The purpose and regulation of coordinated induction and repression is incompletely understood, but both responses are influenced by ESR transcription factors Msn2 and Msn4 (Msn2/4). Here we used single-cell microscopy and transcriptomic analysis to investigate the role of upstream regulator Pde2 in ESR regulation and post-stress fitness. Loss of PDE2 weakened and shortened Msn2 activation following salt stress and produced muted induction of Msn2/4 targets, similar to a msn2Δ msn4Δ strain. In contrast, Pde2 had at most a minor impact on ESR repressor Dot6, yet was important for repression of its targets beyond Msn2/4 influence. Consistent with our recent resource-reallocation model, pde2Δ cells had normal or faster post-stress growth rates, despite weaker activation of the ESR. We discuss implications for ESR regulation and function.
Lesturgie, P.; Blanckaert, A.; Sousa, V. C.
Most species are geographically structured, leaving characteristic signatures in neutral regions of the genome. These signatures can be distorted when neutral regions are linked to deleterious mutations. In such regions, purifying selection can reduce genetic diversity through Background Selection (BGS) or, for recessive mutations, increase diversity through Associative Overdominance (AOD). While the effects of BGS and AOD are well characterized in panmictic populations, they remain largely unexplored in structured populations. Here, we investigated an Isolation with Migration model using forward simulations across a range of migration, selection, dominance, and recombination parameters. We first used a genotype-based approach to quantify the effects of deleterious mutations on standard summary statistics (π, dxy, FST, DAFi). We then showed that an Ancestral Recombination Graph-based (ARG) approach, tracking tree sequences from a sample of one diploid per deme, recovers the same patterns while directly relating genetic variation to the underlying coalescent processes. When recombination is sufficiently low, we found a BGS-driven regime for weakly codominant mutations, characterized by lower diversity and increased genetic differentiation (FST). For recessive mutations, we first identified an AOD-driven regime, characterized by increased diversity and lower FST values, followed by a transition to a subsequent BGS-driven regime. Genealogies were similarly impacted by deleterious mutations: BGS shrank coalescent times and produced a shift towards lineage sorting topologies, while AOD stretched coalescent times and produced a shift toward incomplete lineage-sorting topologies. These patterns were weakened by gene flow, with FST and topologies remaining close to expected under neutrality, while diversity and coalescence times remained robust to demography. 
Our results provide clear evidence of BGS, AOD, and of their transition in a structured model with gene flow. Importantly, these processes leave distinct and interpretable signatures on gene trees, highlighting the potential of ARG-based approaches for inferring linked selection and dominance in structured populations. Author summary: Characterizing how demography and selection jointly shape genomic variation is a central question in population genetics. As deleterious mutations reduce fitness, they are continuously removed from populations by purifying selection. Through linkage, this affects nearby regions of the genome, leaving signatures of selection on linked neutral genetic diversity. While these effects are well understood in random mating populations, much less is known in structured populations. Specifically, the occurrence of Background Selection (BGS), which reduces diversity, and Associative Overdominance (AOD), which increases diversity, remains underexplored. Here, we used simulations to investigate how deleterious mutations shape genomic variation in a structured two-population isolation with migration model. By combining standard population genetic analyses with a genealogical approach based on Ancestral Recombination Graphs (ARGs), we showed that BGS and AOD leave distinct and interpretable signatures on common summary statistics and the underlying genealogies. We identified clear signatures of BGS and AOD when recombination was low and revealed a transition from AOD to BGS for recessive mutations, as the strength of selection increased. Our results highlight the importance of jointly considering demography and linked selection when interpreting genomic data and demonstrate the potential of ARGs to jointly infer demography, selection, and dominance from genomic data.
Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.
Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient. Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature. Results. Per-type pairwise coverage was strikingly asymmetric. 
Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g. acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences. Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks. Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. 
We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
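The cosine-similarity grouping step described in this abstract (threshold >= 0.98) can be illustrated with a toy sketch. The abstract does not say how pairwise matches are turned into groups, so the sketch assumes single-linkage grouping via union-find; the function name `group_by_cosine` and the two-dimensional vectors standing in for ClinicalBERT embeddings are hypothetical.

```python
import math

import numpy as np


def group_by_cosine(embeddings, threshold=0.98):
    """Group row vectors whose cosine similarity reaches the threshold.

    Assumption: single linkage, i.e. two items share a group if they are
    connected by any chain of pairs above the threshold. Rows must be
    non-zero vectors.
    """
    x = np.asarray(embeddings, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    parent = list(range(len(x)))

    def find(i):
        # Walk to the root of the set, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if sim[i, j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(x)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def unit_vector(degrees):
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


# Items 0 and 2 are 20 degrees apart (cosine about 0.94, below threshold),
# but each is within 10 degrees of item 1 (cosine about 0.985).
chain = [unit_vector(0), unit_vector(10), unit_vector(20)]
print(group_by_cosine(chain))
```

Under the single-linkage assumption, the three toy items end up in one group even though the outer pair falls below the threshold. This is one way a chaining effect can merge distinct concepts; whether it explains the failure modes reported in the abstract is not something this sketch can establish.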
Valinejad, J.; Moon, S.; Xu, Y.; Zhu, Q.
The significant challenges associated with rare diseases in the medical and research domains include the scarcity of information, which is often confined to unstructured formats. Although existing approaches provide valuable insights, there is a need to develop effective methods to identify information pertinent to rare diseases for advancing rare disease research. We identified mentions of rare diseases in relevant texts and assessed their relevance using two derived scores: a confidence score and semantic similarity from a fine-tuned BioMedBERT encoder. This encoder was fine-tuned using rare disease-related text from Online Mendelian Inheritance in Man (OMIM), Orphanet, a manually validated dataset, and STS benchmark datasets. The process of identifying meaningful rare disease mentions is presented through two case studies that retrieved relevant NIH-funded projects, using a knowledge graph generated in Neo4j to host data on 2,067 GARD diseases with over 320,000 NIH-funded projects. Through these case studies, we demonstrated the effectiveness of our approach in systematically providing rare disease-related data to enhance our understanding of rare diseases for future investigations.
Hill, J. L.; Ellis, J. P.; Williams, R. T.; Apodaca, A.; Basu, A.; Moore, A.; Osborne Nishimura, E.
At a mere 20 cells, the Caenorhabditis elegans intestine regulates metabolism, energy homeostasis, host defense, yolk production, and genetic aging, all while dynamically responding to its environment. How the intestine develops to carry out these disparate functions is unknown, and how cells differ along the length of the intestine is unclear. To address these questions, we performed single-cell RNA sequencing (scRNA-seq) on FACS-enriched intestinal cells from mixed-stage C. elegans embryos. The resulting single-cell transcriptomes of 974 cells organized into 13 clusters, suggesting a diversity of cell types and states. We used two post hoc approaches to ascribe identities to each cluster. First, genes with known developmental timing in early, mid, and late stages were used to place clusters in time, and smiFISH microscopy was used to fine-tune the assignments. Second, the eight late-stage clusters were assessed for their region of origin. To assign these clusters to anatomical regions, we identified marker genes for each cluster and assessed their expression along the anterior-to-posterior length of the intestine using smiFISH microscopy. Genes associated with growth and cell division were expressed in early stages, whereas genes associated with immune responses and metabolism were expressed later. Genes associated with biotic responses and RNA metabolism were the most likely to vary across the intestine's anterior-posterior axis. Finally, perturbation of anterior-localized intestinal transcripts affected intestinal function more robustly than perturbation of central or posterior-localized transcripts. Overall, this research illustrates the intrinsic heterogeneity across the 20 cells of the embryonic intestine and sets the stage for future work aimed at understanding cell-specific intestinal responses to diet and the environment. ARTICLE SUMMARY: We investigate how the Caenorhabditis elegans intestine develops specialized functions on a spatiotemporal scale. 
We used single-cell RNA-sequencing to analyze embryonic intestinal cells and identify 13 distinct clusters. Combining gene expression analysis with microscopy, we assigned clusters to developmental stages and anatomical regions. Clusters associated with early intestine development express genes linked to growth and cell division, while later-stage clusters express genes involved in metabolism and immune responses. Genes varied across the intestine's anterior-to-posterior axis, and disrupting anterior-specific genes produced stronger functional effects. These findings reveal previously unrecognized intestinal diversity and provide insight into how intestinal cells specialize during development.
Zhang, L.; Paterson, A. D.; Sun, L.
Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ2-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
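As background for the Pearson chi-square tests this abstract refers to, the textbook autosomal version can be sketched in a few lines. This is not the allele-based regression framework the authors propose and it does not handle the X chromosome or sdMAF; the function name `hwe_chi_square` is hypothetical, and the sketch assumes a biallelic locus at which both alleles are observed.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE for one autosomal biallelic locus.

    Takes the three genotype counts and returns (chi2, p_value) with one
    degree of freedom. Assumes both alleles are present in the sample.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# A sample with fewer heterozygotes than HWE predicts.
print(hwe_chi_square(30, 40, 30))
```

With allele frequency 0.5 and 100 individuals the expected counts are 25, 50, and 25, so the example sample shows a heterozygote deficit relative to HWE.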
Chang, X.; Hou, S.; Zhou, X.
Calibrated prediction intervals for polygenic scores (PGS) are essential for communicating individual-level uncertainty in genomic medicine. We present updated comparisons of two methods for constructing such intervals: CalPred, a parametric approach, and PredInterval, a non-parametric approach. Our results show that both methods can achieve calibrated coverage, although CalPred additionally requires a sufficiently large calibration set. The two methods also exhibit complementary trade-offs with respect to dataset size and risk identification. We further show that contextual calibration, as introduced in Hou et al. and followed in Shi et al., is most naturally achieved through appropriate phenotype normalization and data preprocessing. Apparent miscalibration can arise from inadequate normalization or from providing contextual information to some methods but not others. In UK Biobank, standard GWAS phenotype normalization procedures are sufficient to achieve contextual calibration for the traits analyzed. In the extreme simulations of Hou et al. and Shi et al., supplying contextual covariates to PredInterval restores contextual calibration without normalization, and appropriate normalization can achieve contextual calibration without supplying covariates, while also substantially improving upstream tasks including association power and PGS accuracy. Together, these results underscore the central role of phenotype normalization and data preprocessing in GWAS analyses, including reliable uncertainty quantification for PGS.
Acharya, S. R.; Garcia-Abadillo, J.; Lyerly, J.; Brown-Guedira, G.; Jarquin, D.; Bandillo, N.
Genomic prediction models that account for genotype-by-environment (GxE) interactions have the potential to accelerate the rate of genetic gain for yield and agronomic performance, yet relatively few studies have applied GxE prediction in public soft red winter wheat (Triticum aestivum) breeding programs. In this study, we extended a reaction norm-based genomic prediction framework by integrating weather-based environmental covariates to more effectively capture genotype-environment interactions. Key agronomic traits, including seed yield, plant height, test weight, and heading date, were evaluated across 33 environments (location-year) using over 3,200 breeding lines from the North Carolina State University small grains breeding program. Multiple genomic prediction models were compared using several cross-validation (CV) schemes representing common breeding scenarios. Across traits, the reaction norm M5 model, which incorporates both GxE and genotype-by-environmental covariate interactions (GxO), achieved the highest prediction accuracy (PA) in CV2 (predicting incomplete field trials) and CV1 for yield and test weight (predicting new lines). The highest PA was observed for test weight under CV2 (0.54) and for yield under CV1 (0.41). Under CV0 (predicting new environments), the M3 model incorporating GxE produced the highest PA across traits, with the greatest accuracy for plant height (0.45), although differences among M2, M3, and M4 were small. Prediction under CV00 (predicting new lines in new environments) remained more challenging, with PA values of 0.10 to 0.20 across traits. Overall, our results demonstrate that integrating environmental covariates into genomic prediction models can improve predictive performance across diverse wheat-growing environments in North Carolina, supporting their utility for applied breeding efforts.
CORE IDEAS
- Integrating genotype-by-environment (GxE) interactions with environmental covariates improves prediction accuracy across environments.
- Model performance varies by prediction scenario, with different approaches performing best for new lines, incomplete trials, or new environments.
- Prediction of new lines in new environments remains challenging.
PLAIN LANGUAGE SUMMARY
This study explores how adding environmental information to genomic prediction models can improve prediction accuracy in a public winter wheat breeding program. Using data from multi-environment trials conducted across diverse conditions in North Carolina, we evaluated statistical models that capture how different wheat lines respond to changing environments. By incorporating weather data, we improved the ability to predict performance across locations and years. These findings provide practical insights for refining selection strategies and accelerating genetic gain in wheat breeding.
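The four cross-validation scenarios named in the abstract (CV1, CV2, CV0, CV00) differ only in which line-by-environment records are withheld from training. The sketch below shows one common way such splits are constructed; it is an assumption about the general scheme, not the authors' code, and the function `cv_split`, its parameters, and the toy records are hypothetical.

```python
import random


def cv_split(records, scheme, held_lines=(), held_envs=(), frac=0.2, seed=0):
    """Return (train_idx, test_idx) for a list of (line, environment) records.

    Hypothetical sketch of the four scenarios:
      CV1  - new lines: every record of the held-out lines is tested.
      CV2  - incomplete trials: a random fraction of records is tested.
      CV0  - new environments: every record of the held-out environments is tested.
      CV00 - new lines in new environments: test records have both a held-out
             line and a held-out environment; training uses neither.
    """
    held_lines, held_envs = set(held_lines), set(held_envs)
    idx = list(range(len(records)))

    if scheme == "CV2":
        rng = random.Random(seed)
        test = set(rng.sample(idx, int(round(frac * len(idx)))))
        return [i for i in idx if i not in test], sorted(test)
    if scheme == "CV1":
        in_test = lambda line, env: line in held_lines
        in_train = lambda line, env: line not in held_lines
    elif scheme == "CV0":
        in_test = lambda line, env: env in held_envs
        in_train = lambda line, env: env not in held_envs
    elif scheme == "CV00":
        in_test = lambda line, env: line in held_lines and env in held_envs
        in_train = lambda line, env: line not in held_lines and env not in held_envs
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    train = [i for i in idx if in_train(*records[i])]
    test = [i for i in idx if in_test(*records[i])]
    return train, test


if __name__ == "__main__":
    # Toy design: three lines evaluated in two environments.
    toy = [(line, env) for line in ("A", "B", "C") for env in ("E1", "E2")]
    print(cv_split(toy, "CV00", held_lines=["A"], held_envs=["E1"]))
```

Under CV00 some records belong to neither set (for example, a held-out line in a training environment), which is one reason that scenario leaves the model with the least information.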
Kalra, S.; Sanchez, G.; Stubin, A.; Le, A.; Bakshian, A.; Ortiz Diaz, B.; Mark, B. M.; Pena, C.; Parker, E.; Johnston, E.; Hsu, E.; Brangham, G.; Bala-Mehta, I.; Perez, L.; Milrod, M.; Stanten, M.; Nakamura, M.; Hwang, P.; Ptaszynska, S.; Cander, S.; Park, S.; Tan, T. L.; Zhou, Y.; Coolon, J.
Gene-by-environment (GxE) interactions play a major role in shaping both phenotypic and molecular variation, with important implications for human health and disease. In this study, we used the doxycycline (Dox)-regulated, tetracycline-responsive (Tet-Off) promoter system to sequentially reduce or titrate gene expression levels of the essential yeast transcription factor Repressor Activator Protein 1 (RAP1), similar to a hypomorphic allelic series, across three distinct environments: Yeast Peptone Dextrose (YPD) media, YPD media with Heat Shock (HS), and Yeast Peptone Acetate (YPAC) media. We then performed RNA sequencing (RNA-seq) to assess global transcriptional responses to RAP1 reduction in these different growth environments. Our analysis first focused on the independent effects of varying RAP1 expression levels within and across environments. We then explored GxE interactions, revealing a subset of genes with significant consequences of reduced levels of RAP1 and environment-specific expression patterns. Notably, many genes exhibited opposite effects of RAP1 titration on gene expression when yeast were grown in YPAC media compared to YPD media and/or HS, suggesting environment-dependent regulatory architecture. This design reveals how cells integrate internal transcriptional and regulatory changes with external environmental cues, providing a deeper view of GxE architecture. Using Weighted Gene Co-expression Network Analysis (WGCNA), we identified co-regulated gene modules, and by combining this with transcription factor motif enrichment tests, our study identified candidate regulators driving their dynamics. Our findings demonstrate that gene regulatory networks can vary dramatically depending on the environmental context an organism experiences, which can then influence the specific phenotypes produced by a particular genetic perturbation. 
This illustrates the complexity of genotype-environment interactions and the importance of studying gene function in multiple environments to gain a truly comprehensive understanding of a gene's sometimes numerous and diverse functions.