Fast, accurate and scalable normalization for RNA sequencing data with the RUVprps software package

Trussart, M.; Foroutan, M.; Milton, M.; Beltran, H.; Speed, T. P.; Molania, R.

2026-01-29 bioinformatics
10.64898/2026.01.28.702384 bioRxiv
Unwanted variation refers to any source of variability in the data that can compromise downstream analysis. Effective removal of such variation from gene expression data is essential to derive accurate and meaningful biological results. We refer to this process as normalization. Data may come from a single study or from multiple studies with different sources of unwanted variation. We have previously developed the RUV-III method for normalizing omics data, with a strong focus on transcriptomics. Initially, we introduced RUV-III for the normalization of NanoString nCounter gene expression data, utilizing genuine technical replicates and pseudo-replicates as control samples. Subsequently, we proposed RUV-III with pseudo-replicates of pseudo-samples (PRPS), which demonstrated its potential in mitigating the effects of different sources of unwanted variation in large and complex RNA-seq studies. To enhance the accessibility and performance of this method, we present a new comprehensive R package named RUVprps. The package offers over 100 functions, including functions for assessing variation in both biological and unwanted variables, an automated RUV-III normalization process, and metrics for evaluating the effectiveness of the resulting normalizations. Further, it introduces several new features, such as ways of identifying unknown sources of unwanted variation, strategies for identifying suitable negative control genes, and methods for generating PRPS when information on the biological and unwanted variation is unavailable. The package also implements a faster approach to RUV-III normalization, streamlining its application to large RNA-seq datasets. Our freely available R package and normalization assessment pipeline can help find effective data normalization methods for new data and help benchmark new methods.
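
The abstract does not spell out the RUV-III computation, so the following is only a minimal NumPy sketch of the estimator as described in the earlier RUV-III publications, not the RUVprps API and not the faster approach the package implements. The function name `ruv3_sketch`, its arguments, and the toy data are all hypothetical. It assumes a samples-by-genes matrix of log expression `Y`, an indicator matrix `M` assigning samples to replicate sets, a boolean mask `ctl` of negative control genes, and a chosen number `k` of unwanted factors. Under PRPS, the pseudo-samples would presumably enter as extra rows of `Y` with matching columns in `M`.

```python
import numpy as np

def ruv3_sketch(Y, M, ctl, k):
    """Illustrative RUV-III-style normalization (hypothetical helper).

    Y   : (samples x genes) log-expression matrix
    M   : (samples x replicate sets) 0/1 indicator matrix
    ctl : boolean mask over genes marking negative controls
    k   : number of unwanted factors to remove
    """
    Y = np.asarray(Y, dtype=float)
    M = np.asarray(M, dtype=float)
    # Remove replicate-set means: what remains is variation between
    # replicates, which should be unwanted by construction.
    Y0 = Y - M @ np.linalg.solve(M.T @ M, M.T @ Y)
    # Leading left singular vectors of the residuals span the
    # replicate-derived unwanted directions in sample space.
    U, _, _ = np.linalg.svd(Y0, full_matrices=False)
    alpha = U[:, :k].T @ Y                      # (k x genes) gene loadings
    ac = alpha[:, ctl]                          # loadings on control genes
    # Regress control genes on their loadings to estimate the
    # per-sample unwanted factors W.
    W = Y[:, ctl] @ ac.T @ np.linalg.inv(ac @ ac.T)
    return Y - W @ alpha

# Toy, noise-free data (entirely made up for illustration):
# 3 replicate pairs, 20 genes, the first 5 genes are negative controls.
rng = np.random.default_rng(0)
M = np.repeat(np.eye(3), 2, axis=0)             # 6 samples, 3 replicate sets
B = rng.normal(size=(3, 20))                    # biology per replicate set
B[:, :5] = 0.0                                  # controls carry no biology
W_true = rng.normal(size=(6, 2))                # unwanted factors per sample
alpha_true = rng.normal(size=(2, 20))           # unwanted loadings per gene
Y = M @ B + W_true @ alpha_true
ctl = np.arange(20) < 5

normalized = ruv3_sketch(Y, M, ctl, k=2)
```

In this idealized setting, with no noise and control genes free of biological signal, the normalized matrix should match the biological component `M @ B` up to floating-point error. Real data offer no such guarantee, which is why the abstract emphasizes metrics for assessing each normalization.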

Matching journals

The top 4 journals account for 50% of the predicted probability mass.