Genomics — Latest Matching Preprints

1

Evolutionary Stratification of Codon Usage Bias In Plants Arises from GC3 Composition and Translational Optimization

Mohanta, T. K.

2026-07-01 genomics 10.64898/2026.06.26.734692 medRxiv

Top 0.2%

4.3%

Codon usage bias, the non-random, preferential use of synonymous codons, is a fundamental genomic characteristic and a major determinant of translational efficiency, gene regulation, and molecular evolution. However, the evolutionary bias and functional relevance of codon usage bias across the plant lineage are poorly defined, and it remains unclear which factors chiefly determine relative synonymous codon usage (RSCU) in genomes and how codon usage bias influences gene regulation and the molecular evolution of genomes. A genome-wide codon usage bias study of the coding DNA sequences of 262 plant genomes was conducted, encompassing more than 4.6 billion codons from >11 million coding sequences. Relative synonymous codon usage, codon adaptation index, codon-anticodon mapping, effective number of codons (ENC)-GC3, GC1,2-GC3, parity rule 2 (PR2-bias), molecular economy, and machine learning approaches were used for the study. Codon usage bias was strongly non-random and exhibited clear phylogenetic structuring. Higher plants favoured A/T-ending codons, whereas early-diverging lineages were enriched in G/C-ending codons. Analysis of RSCU, codon adaptation index, and codon-anticodon pairing indicated that translational selection is mediated by tRNA availability, which helps sustain these molecular patterns. Machine-learning approaches identified a small subset of codons with an outsized influence on genome-wide codon usage landscapes. Further analysis revealed robust inverse relationships between the effective number of codons and GC content at synonymous third positions. Neutrality analysis revealed that approximately 61% of the variation was driven by mutational pressure, tempered by selective constraints. Phylogenetic reconstruction showed a progressive relaxation of codon bias from algae to angiosperms while maintaining a conserved molecular economy cost of ~30 ATP per codon across the lineages. The study revealed that codon usage bias is a lineage-specific, evolutionarily conserved trait governed by mutation, selection, and translational optimization.
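
As a rough illustration of two measures named in the abstract above, RSCU and GC3, the sketch below computes both for a toy sequence. It is not the author's pipeline: the function names, the toy sequence, and the restriction to the alanine codon family are assumptions made for illustration, and GC3 is simplified here to count every codon rather than only synonymous third positions.

```python
from collections import Counter

# Alanine's four synonymous codons in the standard genetic code.
# Other codon families are omitted to keep the sketch short.
ALA_CODONS = ("GCT", "GCC", "GCA", "GCG")


def count_codons(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(counts, family):
    """RSCU = observed codon count / mean count across its synonymous family."""
    total = sum(counts.get(codon, 0) for codon in family)
    if total == 0:
        return {codon: 0.0 for codon in family}
    expected = total / len(family)
    return {codon: counts.get(codon, 0) / expected for codon in family}


def gc3(counts):
    """Fraction of codons with G or C at the third position (simplified)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for codon, n in counts.items() if codon[2] in "GC") / total


if __name__ == "__main__":
    # Hypothetical toy coding sequence made only of alanine codons.
    toy_cds = "GCT" * 3 + "GCC" + "GCA" * 4 + "GCG" * 2
    toy_counts = count_codons(toy_cds)
    print(rscu(toy_counts, ALA_CODONS))
    print(gc3(toy_counts))
```

An RSCU above 1 marks a codon used more often than expected under equal usage within its family; a value below 1 marks an under-used codon.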

2

A telomere-to-telomere (T2T) pig genome assembly reveals Y chromosome diversity and structural variations of Wuzhishan pigs

Ren, Y.; Wang, F.; Li, X.; Liu, G.; Sun, R.; Zheng, X.; Zhang, Y.; Lin, R.; Lu, X.; Chen, L.; Xin, W.; Fei, Y.; Chao, Z.

2026-04-27 genomics 10.64898/2026.04.23.720499 medRxiv

Top 0.2%

4.2%

Background: Wuzhishan (WZS) pigs are native to Hainan Province, China, and serve as both important agricultural resources and biomedical models. Although a published WZS pig genome (T2T-pig1.0) has already achieved telomere-to-telomere (T2T) completeness, substantial genetic diversity still exists within the breed, so another WZS pig genome, named WZS-T2T, was assembled in this study. Results: Multiple types of sequencing data were used to assemble the genome, yielding a ~2.68 Gb telomere-to-telomere assembly with an N50 length of ~142.87 Mb and 23,100 annotated protein-coding genes. Compared with T2T-pig1.0, WZS-T2T had higher QV and BUSCO values and a longer Y chromosome (ChrY). The ChrY of the two WZS pigs shared 11 genes, including the sex differentiation-related genes SHOX, PRKX, DDX3X, and SRY; however, the energy metabolism gene SLC25A4 and the macrophage-related receptor gene CSF2RA were specific to the ChrY of WZS-T2T. A ~33.86 Mb inversion on chromosome 10 was identified between the two WZS pigs, and three lines of evidence were presented to support the sequence orientation of WZS-T2T. In the population analysis, genetic diversity was consistent with the rate of LD decay. WZS pigs exhibited higher genetic diversity than the other four pig populations examined in this study (Tunchang pigs, Yuxi black pigs, Large White pigs, and Duroc pigs) and showed slower LD decay than those four breeds. Conclusions: WZS-T2T provides a higher-quality assembly and offers potential advantages for both agricultural production and biomedical targets in WZS pigs.
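
For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions and have nothing to do with the WZS-T2T data.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in Mb.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why abstracts usually report it alongside measures such as QV and BUSCO.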

3

Deciphering the limitations of immortalized hepatocyte cell lines for the study of liver cis-regulatory elements

Bellesis, A.; Li, X.; Moore-Frederick, D.; Xu, D.; Delbridge, K.; Ma, J.; Vaccaro, G.; Edward, B. A. A.; Kellogg, M.; Creeger, Y.; Okamoto, A. S.; Kaplow, I. M.

2026-06-09 genomics 10.64898/2026.06.05.730479 medRxiv

Top 0.2%

4.0%

Immortalized cell lines are widely used in biological research despite their known differences from their tissues and cell types of origin. Such cell lines are especially popular for testing hypotheses regarding the activity of cis-regulatory elements (CREs) that regulate gene expression. Previous investigations of blood and skin cell lines revealed many differences between the transcriptional regulatory networks of the cell lines and the associated primary cells. Similar comparisons for other tissues have been limited. Here, we used ATAC-seq to profile CREs in four immortalized liver cell lines and found many differences between each cell line's CREs and those of primary liver tissue, including differences in the transcription factors that are likely to bind them and differences in the genes that they are likely to regulate. Modifying cell culture conditions based on recommendations in the literature did not improve the similarity with primary liver tissue. Our results suggest that differences between the transcriptional regulatory networks in cell lines and primary tissue should be considered when designing and interpreting cell line experiments.

4

Rubus armeniacus genome sequence reveals the secrets of blackberry anthocyanin biosynthesis

Wolff, K.; Nowak, M. S.; Thoben, C.; Beuerle, T.; Pucker, B.

2026-05-10 genomics 10.64898/2026.05.05.723051 medRxiv

Top 0.2%

3.6%

Here, we present a comprehensive multiomics analysis of anthocyanin biosynthesis in Rubus armeniacus, known for its dark fruits. A phased genome sequence of the tetraploid blackberry was generated from Oxford Nanopore Technologies (ONT) sequencing, achieving an N50 of 34 Mb with an assembly size of 1.2 Gbp. The BUSCO score for the total assembly shows a high completeness of 99.1%. The assembly was separated into 4 pseudohaplophases, with pseudohaplophase A representing the R. armeniacus genome in 7 chromosome-scale contigs, with an N50 of 46 Mbp and 98.8% conserved BUSCO genes. A total of 118,183 protein-coding genes were annotated within the genome assembly, and all relevant genes encoding enzymes and transcriptional regulators of the anthocyanin biosynthesis pathway were identified within each pseudohaplophase. To further understand the underlying cause of dark pigmentation, gene expression was analysed during different stages of berry development, revealing a strong induction of anthocyanin biosynthesis genes, including the anthocyanin-activating subgroup 6 MYB transcription factors, during berry ripening. Further, quantification of cyanidin-3-O-glucoside in methanolic berry extract by UHPLC-HRAM-MS analysis revealed an approximately 500-fold increase of cyanidin-3-O-glucoside from green to black fruit, indicating that dark pigmentation in R. armeniacus results from high anthocyanin accumulation. Significance statement: This study provides a multiomics analysis of the dark pigmentation of Rubus armeniacus, including a high-quality phased assembly and an in-depth analysis of the anthocyanin biosynthesis pathway. Transcriptional and metabolomic analyses revealed that dark berry pigmentation is caused by a high accumulation of cyanidin-3-O-glucoside during fruit ripening.

5

A Foundational Exome Resource for Jordan: Dual Ancestry Admixture and Population-Specific Variants to Improve Clinical Variant Interpretation

Froukh, T.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353895 medRxiv

Top 0.2%

3.5%

Currently, the genetic architecture of Middle Eastern populations is underrepresented in global genomic databases. This gap increases the rate of variants of uncertain significance (VUSs) and of clinical misinterpretation of genomic data, especially in Middle Eastern populations. Whole exome sequencing was conducted on 90 healthy individuals from Jordan, and the data were analysed using Principal Component Analysis (PCA) and multi-computational filtering. PCA revealed a double ancestry (EUR-AFR) admixture rather than a triple admixture (EUR-AFR-AMR). More than 3,500 population-specific variants (PSVs) were identified, of which 72% were singletons. Additionally, 19 variants were significantly enriched compared to the maximum allele frequencies in public global databases (Fisher's exact test with Benjamini-Hochberg false discovery rate correction, p-value < 0.05). Consequently, the results suggest reclassifying the VUSs that reside in the ECE2 gene as likely benign and the variants with conflicting classifications of pathogenicity in the genes IL1RN and THPO as benign, based on the significant allele frequency (AF=0.0389, p-value < 0.05). Furthermore, a pathogenic ClinVar variant was identified in a healthy individual, warranting careful interpretation. The findings underscore the importance of identifying PSVs in order to minimize or even prevent clinical misdiagnosis and highlight the unique genetic signature in Jordan. The study serves as a foundational resource for precision medicine in the region.
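
The abstract above mentions Benjamini-Hochberg false discovery rate correction. The sketch below shows the standard step-up adjustment in plain Python; it is an illustration only, not the author's code, and the function name and example p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from four independent tests.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A variant would then be called significant when its adjusted value falls below the chosen threshold, for example 0.05.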

6

The genetic architecture of milk urea concentration in dairy cattle differs across the lactation cycle

He, Q.; Vasiljevic, S.; Kadri, N.; Watson, N.; Stratz, P.; Mapel, X. M.; Leonard, A. S.; Seefried, F. R.; Pausch, H.

2026-04-24 genomics 10.64898/2026.04.22.719978 medRxiv

Top 0.3%

3.3%

Milk urea concentration (MUC) is an indicator of dietary protein utilization and nitrogen use efficiency in dairy cows. We performed genome-wide association studies (GWAS) on MUC in early, mid, and late lactation in the Holstein (HOL) and Brown Swiss (BSW) dairy cattle breeds using imputed sequence variants. We identified 11 and 17 independent quantitative trait loci (QTL) for MUC across the three lactation stages in BSW and HOL, respectively. While many of these QTL have previously been reported for MUC and other dairy traits, our study provides evidence that some QTL exert lactation-stage specific effects. Our findings suggest that variants at the DGAT1 locus on BTA14 have pleiotropic effects on MUC and other dairy traits. This QTL showed an early lactation-specific association with MUC but impacted milk and fat yield across the entire lactation. We fine-mapped two QTL for MUC in early and mid-lactation in BSW on BTA9 (lead SNP: 9:21392941, Pcorrected = 1.1E-17) and BTA28 (lead SNP: 28:6518357; Pcorrected = 3E-11). We identified lncRNA ENSBTAG00000058688 and IBTK as positional and functional candidate genes for the BTA9 QTL, and KCNK1 as positional and functional candidate gene that harbors a highly significant missense variant for the BTA28 QTL. In conclusion, our results shed light on the genetic architecture of MUC and highlighted QTL harboring potential functional variants underpinning milk urea variation within and across breeds.

7

Annotation-Based Gene-Peak Links Improve Regulatory Network Prediction of Gene Expression in Human Kidney Multi-Omics

Wang, X.; Siegmund, K.; Goodrich, J. A.; Nelson, J.; Mi, H.; Zhang, L.; Gazal, S.; Queme, B.; Shibata, D.; Street, K.

2026-06-17 genomics 10.64898/2026.06.12.731741 medRxiv

Top 0.3%

3.2%

Background: Linking distal regulatory elements to their target genes is a central problem for interpreting chromatin accessibility and other non-coding genomic data. Proximity-based mapping is convenient but ignores three-dimensional enhancer-promoter architecture and can misassign long-range regulatory effects. Correlation-based approaches can also miss regulatory links because of limited statistical power and restrictive distance or significance thresholds. Single-cell and single-nucleus multi-omic datasets, such as 10x Multiome profiles that jointly measure chromatin accessibility and gene expression in the same cells or nuclei, now provide a way to evaluate gene-peak linkage strategies by testing how well linked accessibility features predict gene expression. Existing methods often focus on scoring individual enhancer-gene pairs. In this study, we proposed and constructed a fast annotation-based candidate gene-peak network and tested whether it improves downstream prediction of gene expression. Methods: We first built a unified gene-peak regulatory network by integrating enhancer-based, promoter-based, and proximity-based linkage strategies. We then used single-cell multiome data from the Kidney Precision Medicine Project (KPMP) 10x Multiome cohort to evaluate whether the proposed links captured regulatory signals. We aggregated RNA expression and ATAC accessibility at the cell-type cluster level and trained predictive models to evaluate how well different linkage strategies could explain gene expression based on accessibility. Model performance was compared between annotation-based (including both enhancer- and promoter-based links) and proximity-based gene-peak links using testing R² and mean squared error (MSE) in a strict set of 1,704 genes and an adaptive set of 7,973 genes with more relaxed requirements.
Results: In the strict regime (1,704 genes with ≥20 peaks assigned by the nearest-TSS rule and ≥10 annotated enhancer-based peaks), the annotation-based model consistently achieved higher testing R² and lower testing MSE than the proximity-based model. In the larger adaptive regime (7,973 genes with ≥5 proximity-based and ≥2 enhancer-based peaks), we defined for each gene a balanced number of selected peaks based on its available closest and enhancer links; the annotation-based model again showed globally higher testing R² and lower testing MSE. These improvements were observed over a broad range of candidate-linked peak numbers. Conclusions: Using human kidney 10x Multiome data, we show that a fast annotation-based gene-peak linkage framework can improve prediction of gene expression from chromatin accessibility compared with conventional approaches. These results support the use of biologically informed enhancer and promoter annotations when constructing candidate gene regulatory networks. Our framework also showed concordance with the correlation-based Signac LinkPeaks method while providing broader coverage and greater computational efficiency. We have implemented these annotation-based linkage methods in the GPlinksR R package, providing a fast and scalable tool for constructing regulatory networks.
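
The two evaluation metrics used in the abstract above, testing R² and MSE, can be sketched as follows. This is a generic illustration under stated assumptions: the function names and toy values are hypothetical, and it does not reproduce the GPlinksR package or the authors' models.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical held-out expression values and model predictions.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(r_squared(observed, predicted), mse(observed, predicted))
```

A linkage strategy that captures more real regulatory signal should give higher R² and lower MSE on held-out data, which is the comparison the abstract describes.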

8

Mapping non-coding functional elements in allotetraploid Cyprinus carpio embryo development reveals subgenome variation of transcription regulation

Jimenez-Gonzalez, A.; Madrero Pardo, A.; Hadzhiev, Y.; Blasweiler, A.; Zunar, B.; Csenki-Bakos, Z.; Muller, T.; Megens, H.-J.; Wiegertjes, G. F.; Lenhard, B.; Baranasic, D.; Mueller, F.

2026-06-26 genomics 10.64898/2026.06.22.733679 medRxiv

Top 0.3%

3.2%

Common carp (Cyprinus carpio) is an important freshwater species for ornamental and aquaculture purposes, and a key cyprinid model for studying allotetraploidy. Its two chromosomally-separated subgenomes show distinct gene expression profiles, but how their regulatory landscapes control gene expression dynamics during development remains unknown. We generated a regulatory atlas by combining transcriptomes across 12 developmental stages with chromatin accessibility maps, transcription start sites and gene regulation-associated histone post-translational modifications. Subgenome-specific annotation and comparison of 254,276 developmental regulatory elements (PADREs) revealed that regulatory subgenome divergence is most prominent during early development, converging toward the phylotypic period, mirroring expression convergence between subgenomes at the same stages. This dynamic was driven by enhancers, while promoters maintained a more stable subgenome bias, extending the hourglass model of developmental constraint to allotetraploid subgenome regulation. Subgenome-specific enhancers were preferentially retained in subgenome B, whereas subgenome A shifted toward homeologous enhancer activity near the phylotypic stage, indicating directional regulatory divergence between subgenomes. Comparison with zebrafish revealed high concordance with sequence conservation and that subgenome B retained more ancestral cyprinid regulatory elements than subgenome A. This developmental regulatory atlas provides a foundational resource for investigating cis-regulatory evolution following the fourth round of vertebrate genome duplication.

9

A Single-Nucleus Transcriptomic Atlas Reveals Cell-Type-Specific Responses to OsHV-1 Infection in the Pacific Oyster

Dewari, P. S.; Regan, T.; Chapuis, A. F.; Florea, A.; Furniss, J. J.; Clark, T. C.; Taylor, R. S.; Bean, T. P.

2026-05-18 genomics 10.64898/2026.05.15.723513 medRxiv

Top 0.3%

3.1%

Background: The Pacific oyster (Crassostrea/Magallana gigas) is increasingly recognised as a model marine invertebrate. Valued for both ecological and commercial importance, Pacific oysters are farmed widely, supporting global food security by providing a sustainable nutrient-rich source of protein. Despite the significant and recurring economic losses caused by Ostreid herpesvirus (OsHV-1) outbreaks, only a limited number of studies have examined host-pathogen interplay at single-cell resolution. The few available studies largely focus on circulating immune cells (haemocytes), thereby overlooking the complexity of host responses across different tissues and organs. Results: We present a detailed single-nucleus transcriptomic atlas of whole Pacific oysters, including during OsHV-1 infection. A total of 18 distinct transcriptomic clusters were resolved, capturing major cell populations from the gill, mantle, hepatopancreas, adductor muscle, and haemocytes. Notably, three populations (gill ciliary cells, hepatopancreas cells, and an immune-enriched cluster 1) exhibited pronounced transcriptomic responses to OsHV-1 infection. Across the 6, 24, 72, and 96 hours post-infection (hpi) time course, viral transcripts were detected almost exclusively at 72 hpi, with enrichment primarily in adductor muscle cells and two immune cell populations, immature haemocytes and hyalinocytes. Conclusions: Our findings suggest potential entry portals and tissue-specific replication sites for the OsHV-1 virus in Pacific oysters. This atlas resource provides a high-resolution cellular framework for understanding host-virus interactions and establishes a foundation for future investigations into herpesvirus pathogenesis in marine invertebrates.

10

Genetic and epigenetic diversity of Salmonella enterica isolates from Kazakhstan from clinical and veterinary sources

Yessimseit, D. T.; Rysbekova, A. K.; Zhumadilova, Z. B.; Abdeliyev, B. Z.; Kassenova, A. K.; Tukhanova, N. B.; Abdrakhmanova, A. K.; Mereke, A.; Agzam, S. D.; Nurpeisova, A. S.; Nissanova, R.; Maksatova, A. M.; Reva, O. N.; Abdirassilova, A. A.

2026-06-04 genomics 10.64898/2026.06.01.729238 medRxiv

Top 0.4%

2.8%

Background: Salmonella enterica is a major cause of foodborne and invasive infections worldwide. Increasing antimicrobial resistance and adaptation to diverse ecological niches require an improved understanding of the genetic and epigenetic diversity of circulating strains. This study investigated the genomic and epigenetic diversity of S. enterica isolates collected in Kazakhstan from clinical, animal, and environmental sources. Methods: Whole-genome sequencing was performed using the Illumina sequencing platform. Several selected strains were additionally sequenced using PacBio SMRT technology for DNA methylation profiling. Genome assembly, plasmid reconstruction, MLST genotyping, and analyses of virulence genes, antimicrobial resistance determinants, and genome methylation associated with restriction-modification (RM) systems and orphan methyltransferases were performed using established bioinformatics tools. Results: The ST11 genotype predominated among clinical isolates, but these strains formed distinct clusters differing in plasmid composition, virulence-associated genes, and resistance determinants. Most strains carried two large plasmids associated with environmental persistence and virulence, whereas the recent hospital isolate 19S, belonging to the ST11 group, carried two alternative plasmids enriched in virulence and antibiotic resistance genes. All genomes demonstrated conserved DAM-associated adenine methylation at GATC motifs, partial DCM-mediated cytosine methylation at CCWGG motifs, and widespread adenine methylation at CAGAG motifs linked to the type III RM system. In contrast, the type I RM system present in the majority of sequenced strains was suppressed under laboratory growth conditions and remained active only in strain 19S, possibly due to mutations identified in the hsdM gene that may have released this methyltransferase from suppression. Novel epigenetic modification signals involving cytosine and guanine in replichore-biased tandem repeats were also identified. Conclusions: S. enterica strains circulating in Kazakhstan exhibit substantial genomic and epigenetic diversity associated with different survival and transmission strategies. DNA methylation profiling provided additional insights beyond conventional MLST genotyping and identified strain 19S as a promising model for future studies of epigenetic regulation in bacterial virulence and adaptation mediated through genomic DNA methylation.

11

Integrative Bioinformatics Approach to Identify Prognostic Gene Signatures for Risk Stratification in Thyroid Carcinoma

Malik, S.; Raghava, G. P. S.

2026-04-27 bioinformatics 10.64898/2026.04.23.720344 medRxiv

Top 0.4%

2.7%

Thyroid cancer is a heterogeneous malignancy with variable outcomes, highlighting the need for reliable biomarkers and effective risk stratification. In this study, we implemented a multi-step integrative framework to identify distinct prognostic biomarker sets using transcriptomic data from 572 thyroid cancer patients. Correlation analysis followed by false discovery rate (FDR) correction revealed significant associations between gene expression and overall survival (OS). Notably, MAFF (r = 0.25, p = 1.34x10-, FDR = 2.46x10-), NR4A3 (r = 0.24, p = 1.26x10-, FDR = 9.25x10-), and SRF showed strong positive correlations, whereas LOC728264 (r = -0.21, p = 7.39x10-, FDR = 6.36x10-) and VAMP1 (r = -0.20, p = 1.20x10-, FDR = 1.3x10-) exhibited negative correlations with OS. Univariate Cox regression identified several survival-associated genes, including TMEM90B (HR = 10.66, p = 2.88x10-) and PTH1R (HR = 9.88, p = 5.55x10-). LASSO regression further identified 31 key prognostic genes, including 13 potential drug targets predominantly functioning as inhibitors. Machine learning models based on seven independent 20-gene biomarker sets effectively predicted Class 0 (0-1 years), Class 1 (1-3 years), Class 2 (3-5 years), and Class 3 (>5 years), achieving AUC values of 0.91-0.94 and Kappa up to 0.76. An ensemble model further improved prediction (AUC = 0.95, Kappa = 0.72). Incorporating clinical variables (age, gender, stage) enhanced model performance (AUC = 0.96, Kappa = 0.80). Reduced 10- and 5-gene subsets demonstrated consistent yet slightly lower performance (AUC = 0.90 and 0.86, respectively). Collectively, the 20-gene set exhibited the strongest predictive and prognostic potential, highlighting the importance of integrating molecular and clinical features for risk stratification in thyroid cancer. All data and code are openly available (https://github.com/raghavagps/THCA_prognostic_biomarkers), supporting future research in thyroid cancer prediction.
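
The Kappa values reported above measure agreement between predicted and true classes beyond what chance alone would give. A minimal sketch of Cohen's kappa follows; the function name and the toy labels are assumptions for illustration, and the authors' own code in the linked repository may compute it differently.

```python
from collections import Counter


def cohen_kappa(truth, predicted):
    """Cohen's kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(truth)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Fraction of samples where prediction and truth agree.
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_freq = Counter(truth)
    pred_freq = Counter(predicted)
    # Agreement expected if labels were assigned independently
    # with the same marginal class frequencies.
    expected = sum(truth_freq[c] * pred_freq[c] for c in truth_freq) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical survival classes (0 to 3) for six patients.
    print(cohen_kappa([0, 1, 2, 3, 3, 1], [0, 1, 2, 3, 2, 1]))
```

Unlike raw accuracy, kappa stays near zero for a classifier that only reproduces the class proportions, which is useful when survival classes are imbalanced.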

12

A gapless Landrace pig genome resolves centromeres and telomeres and highlights telomere repeat structures in different pig breeds

Grove, H.; Stenlokk, K. S. R.; Lien, S.; Gjuvsland, A. B.; Arnyasi, M.; van Son, M.; Kent, M.

2026-06-30 genomics 10.64898/2026.06.25.734473 medRxiv

Top 0.4%

2.7%

The Duroc-derived reference genome Sscrofa11.1 has provided a critical foundation for pig genomics, offering a high-quality reference for accurate variant detection and comparative genomics, but it does not capture breed-specific variation. Here, we present a near-complete, gap-free genome assembly for the Landrace pig (Landrace_v1, GCA_963921485.1), spanning all 20 chromosomes and totaling 2.6 Gb, including 176 Mb of sequence absent from Sscrofa11.1. Comparative analyses with recently published high-quality pig genomes reveal a conserved centromere organization across breeds, accompanied by substantial variation in repeat composition and length, and identify a pig-specific pattern of telomere variant repeats across eight pig breeds. The improved resolution of repetitive regions in Landrace_v1 enables more complete reconstruction of complex gene families, including olfactory receptors, and uncovers structural variation at the KIT proto-oncogene receptor tyrosine kinase locus not represented in the Duroc reference. Together, these findings highlight the limitations of single-reference genomes and demonstrate the value of breed-specific assemblies for capturing genomic diversity and improving downstream analyses.

13

Single-nucleus multiome sequencing identifies candidate regulators of mouse gastric epithelial homeostasis

Monteiro de Barros, M. R.; Bosch, K.; Soualhi, S.; Issa Bhaloo, S.; Chu, T.; Hemrajani, T.; Cho, J.; Ozuner, K.; Fu, R.; Geiger, H.; Robine, N.; Carter, J. E. B.; Maniatis, S.; Ryeom, S.; Tavare, S.; Nowicki-Osuch, K.

2026-04-27 genomics 10.64898/2026.04.23.720450 medRxiv

Top 0.4%

2.5%

Background & Aims: Gastric epithelial cells maintain homeostasis through dynamic self-renewal mechanisms involving stem and progenitor cells; however, identifying them has been challenging. This study aims to identify stem cells of healthy gastric epithelium and cell type-specific regulators defining gastric epithelial homeostasis via single-nucleus multiome analysis. Methods: Ten unique gastric samples were collected from 8-12 week old wildtype mice. Isolated nuclei were subjected to simultaneous profiling of gene expression and chromatin accessibility. After quality control, 31,598 cells were analyzed with Seurat and Signac using weighted-nearest neighbors analysis for joint RNA and ATAC clustering. Furthermore, SCENIC+, MultiVelo, EpiCHAOS and Cell plasticity score were used to uncover gene regulatory networks, cell state dynamics and lineage trajectories. Results: Our analyses were validated by the identification of known regulators of stem-cell differentiation into mature cell types. More importantly, they revealed previously uncharacterized regulatory networks comprising novel transcription factor combinations that define cell identities, including Ppara, Pparg, Arid5b and Sox5 as candidate regulators of parietal, foveolar, chief and neck cells, respectively. Further, our data support the identity of isthmus cells as stem-like cells of healthy gastric epithelium, as evidenced by epigenetic plasticity that simultaneously contains open chromatin states of all differentiated cell types in the absence of transcriptional reprogramming. Conclusion: Consistent with Waddington's epigenetic landscape hypothesis, gastric epithelial homeostasis is controlled by orchestrated epigenetic and transcriptional programs. Contrary to the prevailing hypothesis, stem cells can be defined not by a separate epigenetic state but by epigenetic superposition of differentiated cell states. Future work is needed to define the universality of these results.

14

Chromosome-level genome assembly of the Northeast China Brown Frog (Rana dybowskii)

Zhang, Y.; Wang, D.; Zhao, R.; Li, S.; Zheng, X.; Hu, G.

2026-06-15 genomics 10.64898/2026.06.11.731602 medRxiv

Top 0.5%

2.4%

Rana dybowskii is distributed across Northeast Asia and represents a valuable medical resource. A high-quality genome assembly has not yet been reported. This species has 2n=24 chromosomes but a large genome, estimated at 3.5-4.6 Gb in previous studies. The relatively large chromosome size, exceeding hundreds of megabases, may make it difficult to obtain a complete chromosome-level genome. Here, we constructed a chromosome-level genome assembly of R. dybowskii by integrating PacBio HiFi long-read sequencing for de novo assembly and CiFi (3C coupled with HiFi sequencing) for scaffolding. The final assembly consists of 12 chromosomes with a total of 3.95 Gb and a scaffold N50 length of 455 Mb. BUSCO assessment using the tetrapoda_odb12 database identified 94.2% complete and 0.5% fragmented orthologs, suggesting a high level of completeness of the assembly. Genomic annotation revealed that repetitive sequences comprise over 53% of the assembly, with retroelements and DNA transposons accounting for 22% and 25%, respectively. A total of 43,999 protein-coding genes were predicted with the assistance of RNA-seq reads from four tissues (muscle, eye, testis and skin). This high-quality chromosome-level reference genome provides a valuable genomic resource for advancing genetic studies of the species.

15

Linear plasmid prevalence and linezolid resistance gene carriage in vancomycin-resistant Enterococcus in Canada from 2009-2024

Lerminiaux, N.; McCracken, M.; Bartoszko, J. J.; Grewal, G.; Ahmed, S.; Johnstone, J.; Golding, G. R.; CNISP VRE working group

2026-05-12 genetic and genomic medicine 10.64898/2026.05.08.26352429 medRxiv

Top 0.5%

2.2%

The incidence of vancomycin-resistant Enterococcus (VRE) is rising in hospitals in Canada, and resistance to last-resort antimicrobials including linezolid complicates treatment options for multidrug-resistant isolates. Recent reports from around the globe indicate that both linezolid and vancomycin resistance genes can be co-carried and mobilized by linear plasmids (named pELF) in Enterococcus species, often on the same backbone. We aimed to investigate linezolid resistance and linear plasmid prevalence in VRE bloodstream infection isolates collected by the Canadian Nosocomial Infection Surveillance Program from 2009 to 2024. We found that screening for pELF linear plasmid ends in short reads was a reliable way to predict linear plasmid presence in large-scale surveillance data (100 % accuracy on 85 reference samples). Almost half of the isolates in our collection were predicted to carry pELF plasmids (45.4 %, 941/2071) and we found that this proportion has increased from 2018 (32.2 %, 59/183) to 72 % of isolates between 2021 and 2024 (2021: 68.5 % (115/168); 2022: 71.6 % (146/204); 2023: 72.8 % (166/228); 2024: 71.6 % (235/328)). This trend of increasing linear plasmid carriage is evident from 2018 to 2024 across the dominant emerging sequence types (ST80, ST17, ST117). Linezolid resistance based on phenotypic antimicrobial susceptibility testing was low (1.0 %, 21/2071). Using long read sequencing, we characterized the linezolid resistant isolates and confirmed pELF plasmid presence in 13/21 (61.9 %) isolates. Six isolates harboured pELF plasmids encoding linezolid resistance genes (optrA, cfr(D), poxtA) and five of these also encoded vancomycin resistance genes (vanA). We compared these six plasmids to 39 public plasmid sequences and clustered them using MOB-suite and pling. 
Overall, this study provides further examples of the co-carriage of vancomycin and linezolid resistance genes on mobile linear plasmids and shows that linear plasmid prevalence is detectable and increasing across VRE in Canada.
IMPACT STATEMENT: Given the increasing prevalence of multidrug-resistant hospital-acquired pathogens, resistance to last-resort antibiotics is a global public health threat. Linezolid is a last-resort antibiotic used to treat vancomycin-resistant Enterococcus isolates, and the dissemination of linezolid resistance genes is significantly facilitated by mobile elements that can transfer between unrelated strains and species. Linezolid resistance genes have recently been described on linear plasmids and are often co-localized with other resistance genes on the same plasmid backbone. Consequently, understanding the features and distribution of linear plasmids, including those harbouring linezolid resistance genes, is crucial for pathogen surveillance and mitigation of resistance. In this work, we used long-read and short-read sequencing to characterize the genomic epidemiology of linear plasmids across 16 years of Enterococcus surveillance data in Canada. This study furthers knowledge of linear plasmids by demonstrating that they are relatively common across vancomycin-resistant Enterococcus blood isolates and by providing more examples of co-localized vancomycin and linezolid resistance genes on the same linear plasmid backbone.
DATA SUMMARY: Sequencing data and genome sequences were deposited in National Center for Biotechnology Information (NCBI) BioProject PRJNA1279082, and accessions are listed in Table S1. Supplementary materials for this study are available at the Figshare portal through DOI: XXX.
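The carriage percentages quoted in this abstract can be re-derived from the isolate counts it reports. The sketch below is only an arithmetic check, not the authors' analysis: the dictionary layout and the `percent` helper are illustrative, and the pooled 2021 to 2024 figure is computed here from the yearly counts rather than taken from the abstract.

```python
# (pELF-positive isolates, total isolates), copied from the abstract.
counts = {
    "overall": (941, 2071),
    "2018": (59, 183),
    "2021": (115, 168),
    "2022": (146, 204),
    "2023": (166, 228),
    "2024": (235, 328),
}


def percent(positive, total):
    """Return the share of positive isolates as a percentage, to one decimal."""
    return round(100 * positive / total, 1)


# Per-label carriage, e.g. carriage["2018"] is the 2018 percentage.
carriage = {label: percent(pos, tot) for label, (pos, tot) in counts.items()}

# Pooled carriage for 2021-2024, derived here by summing the yearly counts.
recent_years = ["2021", "2022", "2023", "2024"]
pooled_positive = sum(counts[year][0] for year in recent_years)
pooled_total = sum(counts[year][1] for year in recent_years)
pooled_recent = percent(pooled_positive, pooled_total)

for label, value in carriage.items():
    print(f"{label}: {value} %")
print(f"2021-2024 pooled: {pooled_recent} %")
```

The yearly values reproduce the percentages given in the abstract. The pooled 2021 to 2024 value comes out slightly below the rounded "72 %" quoted in the text, which is closer to the individual yearly figures.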

16

Comparative analysis of Illumina and Ultima-Genomics sequencing for plasma cell-free small RNA profiling in pancreatic cancer

Levon, A.; Volkov, H.; Shlayem, R.; Shomron, N.

2026-06-25 genomics 10.64898/2026.06.21.733585 medRxiv

Top 0.5%

2.2%

Plasma-derived cell-free small non-coding RNAs are promising non-invasive biomarkers for cancer detection and monitoring. However, variability in sequencing output limits standardization, and cross-platform performance for plasma small RNA profiling has not been systematically evaluated. Illumina short-read sequencing is the current standard, whereas the newer Ultima Genomics platform has been less extensively studied for circulating small RNA in plasma. To directly compare platform performance, we sequenced plasma cell-free RNA from 39 patients with pancreatic cancer and 39 matched controls on both platforms. After filtering, Ultima Genomics retained more mature microRNA reads, whereas Illumina achieved slightly higher enrichment efficiency and mapping rates. Despite these technical differences, both platforms produced concordant expression profiles, with strong cross-platform correlations for shared microRNAs and clear separation of cases and controls within each dataset. Differential expression analysis identified 14 microRNAs that were significant on both platforms with concordant directions of change, most of which are supported by pancreatic cancer databases. Pathway enrichment analysis highlighted signaling pathways implicated in pancreatic cancer, supporting the biological relevance of both shared and platform-specific signatures. These findings indicate that both the Illumina and Ultima Genomics platforms are suitable for plasma small RNA profiling and capture biologically relevant signals in pancreatic cancer.

17

Germline polygenic score for prostate cancer aggressiveness

Xu, G. J.; Karunamuni, R.; Dornisch, A. M.; Brunette, C. A.; Danowski, M. E.; Desai, H.; Dochtermann, D.; Garraway, I. P.; Hauger, R. L.; Kibel, A. S.; Lynch, J. A.; Pyarajan, S.; Rose, B. S.; Teerlink, C. C.; Andreassen, O. A.; Dale, A. M.; Donovan, J. L.; Hamdy, F.; Kachuri, L.; Lane, A.; Martin, R. M.; Mills, I. G.; Neal, D. E.; Turner, E. L.; Witte, J. S.; Schleutker, J.; Pashayan, N.; Batra, J.; Australian Prostate Cancer BioResource (APCB); Nordestgaard, B. G.; Hamilton, R. J.; Wolk, A.; Albanes, D.; Atkins, J.; Blot, W. J.; Mucci, L. A.; Nielsen, S. F.; Cussenot, O.; Berndt, S. I.; K

2026-05-10 genetic and genomic medicine 10.64898/2026.05.07.26352488 medRxiv

Top 0.6%

2.1%

Background: Risk stratification for prostate cancer (PCa) progression or aggressiveness is often based on clinicopathologic features, some of which may be influenced by genetic factors. We developed a novel, germline polygenic risk score (PRSagg) to predict likelihood of developing aggressive PCa. Methods: PRSagg was developed using data from 38,688 patients with PCa (case-only analysis) from the Million Veteran Program (MVP) through a genome-wide search for variants associated with PCa grade group at diagnosis. We tested associations of PRSagg with grade group using the entire MVP dataset using the .632 bootstrap method. In an MVP cohort with localized PCa that was initially monitored without treatment, we tested PRSagg for association with unfavorable outcomes (subsequent development of grade group 4-5, metastasis, and/or biochemical recurrence after definitive treatment). We performed external validation in data from patients in the PRACTICAL Consortium (n=45,214) and from participants in the ProtecT randomized trial who underwent active monitoring (n=316). Odds ratios (ORs) were calculated per standard deviation (SD) increase with 95% confidence intervals, while adjusting for age, genetic ancestry, a previously developed polygenic score for risk of PCa (PHS601), and a polygenic score for benign elevated prostate-specific antigen (PRSPSA). For the outcome of metastasis, we additionally adjusted for PSA at diagnosis. Results: In the MVP training dataset, PRSagg (172 variants) was associated with higher grade group at diagnosis (OR = 1.53 [1.51-1.56]) and with increased risk of unfavorable outcomes during monitoring (OR = 1.13 [1.09-1.18]). These findings were confirmed in the external datasets. PRSagg was associated with greater odds of higher grade group at diagnosis (OR = 1.09 [1.06-1.11]). Among ProtecT participants undergoing active monitoring, PRSagg was associated with higher risk of metastasis (OR = 2.15 [1.02-3.88]).
Among MVP participants with high polygenic risk of developing any PCa, the risk of aggressive disease was highest in men with high PRSagg and low genetic risk of PSA elevation. Conclusions: Among men who develop PCa, a weighted sum of common germline variants (PRSagg) is independently associated with PCa aggressiveness. These findings may inform future study of germline influence on tumor evolution and risk-stratified intensity of active surveillance.
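The abstract describes PRSagg as a weighted sum of common germline variants and reports odds ratios per standard deviation of the score. A minimal sketch of that general construction follows. The three variant weights and the genotype dosages are hypothetical placeholders (the real score uses 172 variants whose weights are not given here), and the helper `odds_ratio_for_shift` assumes a logistic model in which log-odds are linear in the standardized score, which the abstract does not state explicitly.

```python
import math
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def standardize(scores):
    """Convert raw scores to SD units relative to the supplied cohort."""
    mu = mean(scores)
    sd = pstdev(scores)
    return [(s - mu) / sd for s in scores]


def odds_ratio_for_shift(or_per_sd, sd_units):
    """OR implied by a shift of sd_units SDs, assuming linear log-odds."""
    return math.exp(math.log(or_per_sd) * sd_units)


# Hypothetical per-variant weights and dosages for three individuals.
weights = [0.10, -0.05, 0.20]
cohort = [
    [2, 0, 1],
    [0, 1, 0],
    [1, 2, 2],
]

raw_scores = [polygenic_score(person, weights) for person in cohort]
z_scores = standardize(raw_scores)

# Under the linearity assumption, a per-SD OR of 1.53 (the MVP grade-group
# estimate in the abstract) compounds multiplicatively across SDs.
two_sd_or = odds_ratio_for_shift(1.53, 2.0)
print(raw_scores, z_scores, two_sd_or)
```

Standardizing against a reference cohort is what makes a "per SD" odds ratio interpretable, so the choice of cohort matters when comparing estimates across datasets.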

18

Identification of genes important for response of Pseudomonas aeruginosa biofilms to ciprofloxacin exposure

Wang, M.; Holden, E. R.; Yasir, M. R.; Bastkowski, S.; Turner, K.; Sims, L. P.; Gilmour, M. W.; Charles, I. G. W.; Webber, M. A.

2026-05-29 genomics 10.64898/2026.05.27.728104 medRxiv

Top 0.6%

2.1%

Pseudomonas aeruginosa is an opportunistic pathogen that can cause severe infections in immunocompromised individuals, such as patients with cystic fibrosis, where it commonly forms biofilms. Ciprofloxacin is used extensively to treat P. aeruginosa infections, but its effectiveness can be significantly reduced by biofilm formation. Although many individual genes associated with biofilm formation or ciprofloxacin resistance have been characterised, the genetic basis of P. aeruginosa biofilm fitness under antibiotic challenge remains incompletely understood. In this study we employed a whole-genome screen to assay the impact of gene disruptions or altered gene expression on survival of P. aeruginosa biofilms exposed to different concentrations of ciprofloxacin. Genes impacting fitness in the biofilm context were identified by comparing the biofilm samples to planktonic samples harvested at 12 h, 24 h and 48 h with and without ciprofloxacin. Genes associated with c-di-GMP regulation and Gac/Rsm signalling were identified as primary regulators of biofilm formation in the presence and absence of ciprofloxacin. In addition, genes involved in respiration, metabolism (especially polyamine metabolism), and various transporter and efflux systems were identified as important for biofilm fitness. Ciprofloxacin specifically imposed a selective pressure on flagellar function and Psl production, which were essential for survival in early biofilms. Moreover, transposon insertions within the CPA gene clusters (PA5448-PA5451 and PA5455-PA5456) and the salvage peptidoglycan recycling pathway reduced fitness in late biofilms at high ciprofloxacin concentrations, indicating that cell envelope integrity is beneficial for mature biofilms. This study identifies important determinants of survival for biofilms at different stages of maturity in the presence and absence of ciprofloxacin and points to potential therapeutic targets for antibiofilm drug development.

19

Chromosome-level genome assembly of macroalgae Gracilariopsis lemaneiformis

Hu, Y.; Huang, Y.; Yong, Y.; Shang, E.; Zhang, B.; Sui, Z.

2026-04-30 genomics 10.64898/2026.04.28.721235 medRxiv

Top 0.6%

2.1%

As an important cultivated red alga, Gracilariopsis lemaneiformis has great economic and ecological value. However, its existing genome assembly is highly fragmented and inadequately annotated. In this study, we constructed the first high-quality chromosome-level genome of Gp. lemaneiformis using PacBio long reads, Illumina short reads and Hi-C sequencing data. The assembled genome was approximately 86.66 Mb, and the assembled sequences were anchored to 28 pseudo-chromosomes with lengths ranging from 1.70 to 7.81 Mb. Of the PacBio reads, 99.91% could be mapped to our assembly. In total, 8,664 genes were annotated, and the repeat elements identified in Gp. lemaneiformis constituted 65.04% of the whole genome, including 2.24% tandem repeat sequences and 62.81% interspersed repeats. We also built a high-confidence phylogenetic tree from 19 representative algal species, mainly to estimate their divergence times. This high-quality genome of Gp. lemaneiformis provides a crucial foundation for understanding its genetic characteristics, investigating its genomic evolution, and facilitating molecular breeding.

20

Population analysis and host-disease associations of Shiga toxin-producing Escherichia coli from various sources across eleven European countries using whole genome sequencing

Tozzoli, R.; Schadron, T.; Knijn, A.; De Sabato, L.; Morabito, S.; Montalbano Di Filippo, M.; Fiskebeck, E.; Johannessen, G.; Antony-Samy, J. K.; Good, L.; Soderlund, R.; van Hoek, A.; Mughini Gras, L.; Franz, E.; Wieczorek, K.; Scavia, G.; Moro, O.; Chiani, P.; Michelacci, V.; Burgess, C. M.; Duffy, G.; Rodgers, J.; Kirchner, M.; Pista, A.; Silveira, L.; Amaro, A.; Clemente, L.; Chattaway, M. A.; Jenkins, C.; Dallman, T.; Schjorring, S.; Scheutz, F.; Byrne, B.; Gutierrez, M.; Lopez-Chavarrias, V.; Ugarte-Ruiz, M.; Brandal, L.; Naseer, U.; Kolackova, I.; Zomer, A. L.; Wagenaar, J. A.; Pires, S

2026-04-28 genomics 10.64898/2026.04.27.721056 medRxiv

Top 0.6%

2.1%

Shiga toxin-producing Escherichia coli (STEC) are important foodborne pathogens, able to cause severe disease in humans. In the DiSCoVeR project (https://onehealthejp.eu/jrp-discover/) a STEC inventory from human and non-human sources from 11 European countries was set up and ≥3,500 strains were sequenced to perform comparative genomics analysis. We used this dataset to assess STEC population structure and to investigate potential associations between genomic features, host reservoirs and symptoms. Most STEC isolates analysed by Whole Genome Sequencing (WGS) in this study were collected between 2010 and 2020. An ad hoc pipeline was deployed for a harmonised characterization of the STEC in the database, allowing serotyping, stx gene subtyping, 7-locus MLST, virulotyping and cgMLST. The results were analysed with Principal Coordinates Analysis (PCoA) in relation to isolation source to assess clustering of STEC subpopulations. When human STEC data were analysed, the PCoA revealed three distinct human STEC subpopulations (STEC_1, STEC_2 and STEC_3), which were further analysed for associations between genomic features, symptoms and variance. The non-human STEC showed a more dispersed distribution, except for one subpopulation with genes linked to specific host species, and some virulence profiles overlapping with the STEC_1 population. In conclusion, our analysis identified distinct STEC subpopulations from human cases, each characterized by specific genetic features and associated with varying proportions of severe disease outcomes. These findings provide novel insights supporting the risk assessment of STEC.
Impact statement: This study is based on the establishment of a One Health STEC genomes database, including sequences from isolates of different sources. Most of the isolates were collected over the ten-year span 2010-2020, in 11 different countries, for surveillance and monitoring activities or specific surveys and research purposes. The final dataset included the whole genome sequences of 3,418 STEC isolates, mainly from human cases of infection. The metadata included the host symptoms, where available, for human STEC strains and the animal source from which the strains had been isolated. We set up a pipeline for the harmonized analysis of STEC WGS, called Discover, which is available through the ARIES webserver or GitHub. The analysis allowed a deep characterization of STEC strains circulating in Europe. We used this resource to assess STEC population structure and to investigate, by PCoA, potential associations between genomic features, host reservoirs, and various symptoms associated with STEC infection. This analysis highlighted the presence of subpopulations of human STEC associated with specific features. We provide new information useful for risk characterization, as well as a large genome database and associated metadata compiled from STEC strains, representing a valuable resource for the scientific community, enabling further investigations into STEC diversity, evolution, source attribution and public health relevance.
Data summary: The authors confirm all supporting data, including sequence data accession numbers, code and protocols have been provided within the article or through supplementary data files. One supplementary method and five supplementary tables are available with the online version of this article.
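PCoA, as used in this abstract, operates on a matrix of pairwise distances between isolates rather than on the raw feature table. The sketch below shows the classical (Torgerson) form of the method with NumPy. It is not the Discover pipeline: the abstract does not say which distance was used, so the proportion of differing loci between allele profiles is an assumption made for illustration, and the four toy profiles are invented.

```python
import numpy as np


def hamming_distance_matrix(profiles):
    """Pairwise proportion of loci at which two allele profiles differ."""
    profiles = np.asarray(profiles)
    differs = profiles[:, None, :] != profiles[None, :, :]
    return differs.mean(axis=2)


def pcoa(distances, n_axes=2):
    """Classical PCoA: double-centre squared distances, then eigendecompose."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can give small negative eigenvalues; clip them.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals


# Invented allele profiles: isolates 0-1 and isolates 2-3 form two groups.
toy_profiles = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 2],
    [3, 3, 3, 3, 3],
    [3, 3, 3, 3, 4],
]

toy_distances = hamming_distance_matrix(toy_profiles)
coords, eigvals = pcoa(toy_distances, n_axes=2)
print(coords[:, 0])  # first axis separates the two groups (sign is arbitrary)
```

Because the method needs only a distance matrix, the same routine applies whether distances come from cgMLST allele differences, virulence gene presence or absence, or another dissimilarity measure; the choice of distance determines which subpopulations become visible.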