
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset

Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-13 bioinformatics
10.64898/2026.05.08.723923 bioRxiv

MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. The cross-sample methods sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
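To make the within-sample scaling mentioned in the abstract concrete, the following is a minimal sketch of RPM normalization, assuming raw counts for one sample are held in a plain dictionary. The function name `rpm_normalize` and the miRNA names and counts are hypothetical and are not taken from the study or its dataset. Each count is divided by the sample's own total and multiplied by one million, so no information from other samples enters the calculation.

```python
def rpm_normalize(counts):
    """Scale one sample's raw read counts to Reads Per Million (RPM).

    counts: dict mapping a miRNA name to its raw read count in ONE sample.
    Only this sample's total is used, which is what makes RPM a
    within-sample method.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads to normalize")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Hypothetical counts for illustration only (not data from the benchmark).
sample = {"miR-A": 500, "miR-B": 300, "miR-C": 200}
rpm = rpm_normalize(sample)

for name, value in rpm.items():
    print(f"{name}\t{value:.1f}")
```

By construction, the RPM values of a sample sum to one million, whereas cross-sample methods such as TMM derive their scaling factors by comparing samples with each other.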

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | BMC Bioinformatics | 383 | Top 0.2% | 22.9%
2 | Briefings in Bioinformatics | 326 | Top 0.5% | 8.5%
3 | Scientific Reports | 3102 | Top 13% | 6.9%
4 | NAR Genomics and Bioinformatics | 214 | Top 0.2% | 6.5%
5 | Nucleic Acids Research | 1128 | Top 3% | 6.4%
50% of probability mass above
6 | Bioinformatics Advances | 184 | Top 0.4% | 6.4%
7 | Frontiers in Genetics | 197 | Top 1% | 4.9%
8 | Bioinformatics | 1061 | Top 5% | 4.4%
9 | PeerJ | 261 | Top 2% | 4.0%
10 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 3.6%
11 | PLOS ONE | 4510 | Top 43% | 2.8%
12 | PLOS Computational Biology | 1633 | Top 12% | 2.5%
13 | GigaScience | 172 | Top 0.9% | 2.1%
14 | RNA Biology | 70 | Top 0.2% | 1.9%
15 | BMC Genomics | 328 | Top 3% | 1.4%
16 | Genomics, Proteomics & Bioinformatics | 171 | Top 5% | 0.9%
17 | RNA | 169 | Top 0.4% | 0.8%
18 | iScience | 1063 | Top 31% | 0.8%
19 | Genome Biology | 555 | Top 7% | 0.8%
20 | Frontiers in Bioinformatics | 45 | Top 1% | 0.5%
21 | BMC Biology | 248 | Top 7% | 0.5%
22 | International Journal of Molecular Sciences | 453 | Top 19% | 0.5%
23 | Molecular Therapy Nucleic Acids | 32 | Top 1% | 0.5%