The impact of distributional assumptions in gene-set and pathway analysis: how far can it go wrong?

Ho, C.-H.; Huang, Y.-J.; Lai, Y.-J.; Mukherjee, R.; Hsiao, C. K.

2021-02-02 bioinformatics
10.1101/2021.02.01.429279 bioRxiv
Gene-set analysis (GSA) has been one of the standard procedures for exploring potential biological functions once a group of differentially expressed genes has been derived. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance, with a common implicit assumption that the multivariate expression values are normally distributed. The validity of this assumption has been disputed in several studies, but no systematic analysis has been carried out to assess its influence. Our goal in this study is not to propose a new GSA method but, first, to examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of twenty-two datasets of expression data from studies involving tumor and normal tissues, with ten signaling pathways chosen as the gene sets. Second, we evaluated the influence of non-normality on the performance of current GSA tools, including parametric and non-parametric methods. Specifically, the scenario of mixture distributions representing the case of different tumor subtypes was considered. Our first finding suggests that the MVN assumption should be handled carefully: it does not hold in many of the applications tested here. The second investigation demonstrates that non-normality does affect the performance of these GSA methods, especially when subtypes exist. We conclude that the inherent multivariate normality assumption should be assessed with care when evaluating new GSA tools, since it cannot be guaranteed and it strongly affects the performance of GSA methods. If a newly proposed GSA method is to be evaluated, we recommend incorporating multivariate non-normal distributions or sampling from large databases, if available.
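
To make the idea of an MVN test on a gene set concrete, here is a minimal Python sketch. The abstract does not name the six tests used, so Mardia's skewness and kurtosis tests are chosen here purely as an illustrative assumption. The simulated data (3 hypothetical genes, 300 samples, a 75/25 mixture of two "subtypes" with shifted means) are likewise invented for illustration and are not taken from the study.

```python
import numpy as np
from scipy import stats


def mardia_test(X):
    """Return p-values of Mardia's skewness and kurtosis tests for an (n, p) sample."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    # Maximum-likelihood covariance estimate (divide by n), as used by Mardia.
    S = centered.T @ centered / n
    # D[i, j] = (x_i - mean)^T S^{-1} (x_j - mean) for every pair of samples.
    D = centered @ np.linalg.solve(S, centered.T)
    skewness = (D ** 3).sum() / n ** 2       # multivariate skewness b_{1,p}
    kurtosis = (np.diag(D) ** 2).mean()      # multivariate kurtosis b_{2,p}
    # Under MVN, n * b_{1,p} / 6 is asymptotically chi-square distributed.
    skew_df = p * (p + 1) * (p + 2) / 6.0
    p_skew = stats.chi2.sf(n * skewness / 6.0, skew_df)
    # Under MVN, the standardized kurtosis is asymptotically standard normal.
    z = (kurtosis - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    p_kurt = 2.0 * stats.norm.sf(abs(z))
    return p_skew, p_kurt


rng = np.random.default_rng(0)
n_samples, n_genes = 300, 3  # hypothetical sizes, for illustration only

# Homogeneous group: every sample drawn from one multivariate normal.
normal_data = rng.normal(size=(n_samples, n_genes))

# Two hypothetical subtypes: 25% of samples have all gene means shifted by 3.
mixture_data = rng.normal(size=(n_samples, n_genes))
mixture_data[: n_samples // 4] += 3.0

p_skew_normal, p_kurt_normal = mardia_test(normal_data)
p_skew_mix, p_kurt_mix = mardia_test(mixture_data)
print("single MVN: skewness p =", p_skew_normal, "kurtosis p =", p_kurt_normal)
print("mixture:    skewness p =", p_skew_mix, "kurtosis p =", p_kurt_mix)
```

In this sketch the unbalanced mixture is strongly skewed along the direction separating the two subtypes, so the skewness test should reject normality, whereas the homogeneous sample is normal by construction. Both tests rely on large-sample approximations, so small real datasets may call for a different test or a resampling-based p-value.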

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. PeerJ | 261 papers in training set | Top 0.1% | 19.2%
2. BMC Bioinformatics | 383 papers in training set | Top 0.4% | 15.2%
3. PLOS ONE | 4510 papers in training set | Top 21% | 8.7%
4. PLOS Computational Biology | 1633 papers in training set | Top 7% | 5.0%
5. Scientific Reports | 3102 papers in training set | Top 28% | 4.3%
50% of probability mass above
6. Bioinformatics | 1061 papers in training set | Top 5% | 3.7%
7. Briefings in Bioinformatics | 326 papers in training set | Top 2% | 2.8%
8. Frontiers in Genetics | 197 papers in training set | Top 3% | 2.1%
9. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 3% | 1.9%
10. Statistics in Medicine | 34 papers in training set | Top 0.1% | 1.9%
11. Computers in Biology and Medicine | 120 papers in training set | Top 2% | 1.7%
12. Journal of Computational Biology | 37 papers in training set | Top 0.3% | 1.4%
13. Frontiers in Bioinformatics | 45 papers in training set | Top 0.3% | 1.4%
14. Cancers | 200 papers in training set | Top 3% | 1.3%
15. Biostatistics | 21 papers in training set | Top 0.1% | 1.3%
16. Applied Sciences | 24 papers in training set | Top 0.5% | 1.0%
17. npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.9%
18. Physical Biology | 43 papers in training set | Top 2% | 0.8%
19. Entropy | 20 papers in training set | Top 0.3% | 0.8%
20. Journal of Bioinformatics and Systems Biology | 14 papers in training set | Top 0.5% | 0.8%
21. Computational Biology and Chemistry | 23 papers in training set | Top 0.4% | 0.8%
22. Informatics in Medicine Unlocked | 21 papers in training set | Top 1% | 0.8%
23. Expert Systems with Applications | 11 papers in training set | Top 0.4% | 0.8%
24. iScience | 1063 papers in training set | Top 32% | 0.7%
25. BioMed Research International | 25 papers in training set | Top 3% | 0.7%
26. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 1% | 0.7%
27. BMC Genomics | 328 papers in training set | Top 6% | 0.7%
28. Gene | 41 papers in training set | Top 2% | 0.8%
29. International Journal of Molecular Sciences | 453 papers in training set | Top 19% | 0.5%
30. Heliyon | 146 papers in training set | Top 9% | 0.5%
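
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below assumes that the last value listed for each journal is its predicted probability in percent; the variable names are illustrative only.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-30, copied from the ranking above.
probabilities = [
    19.2, 15.2, 8.7, 5.0, 4.3, 3.7, 2.8, 2.1, 1.9, 1.9,
    1.7, 1.4, 1.4, 1.3, 1.3, 1.0, 0.9, 0.8, 0.8, 0.8,
    0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.5, 0.5,
]

# Running total of probability mass by rank.
cumulative = list(accumulate(probabilities))

# First rank at which the running total reaches at least 50%.
cutoff_rank = next(
    rank for rank, total in enumerate(cumulative, start=1) if total >= 50.0
)

print(cutoff_rank)                            # prints 5
print(f"{cumulative[cutoff_rank - 1]:.1f}")   # prints 52.4
```

The first four journals sum to 48.1%, so the 50% threshold is crossed only at rank 5, where the cumulative mass is 52.4%.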