Housekeeping Gene Expression Normalization in Transcriptomics Mitigates Data Leakage in Machine Learning Models

Ribas, G. T.; Riella, C. V.; Guizelini, D.; Menegatti Rigo, M.; Riella, L. V.; Borges, T. J.

2026-04-24 bioinformatics
10.64898/2026.04.24.720637 bioRxiv
Background: Inappropriate normalization can lead to data leakage and overfitting in machine learning models. Accurately identifying housekeeping genes (HKGs) can overcome this problem and is crucial for normalizing gene expression data, particularly in RNA-Seq experiments.

Results: First, we demonstrate that the expression of commonly used HKGs changes significantly over time in transplant recipients as a result of immunosuppressive treatment. Using large public transcriptomic datasets of kidney transplantation, we developed a pipeline based on each gene's coefficient of variation, stability, and Gini coefficient, and identified nine stable, more suitable HKG candidates. Our results demonstrate that these HKGs improve the robustness and generalizability of machine learning models by minimizing data leakage, as evidenced by superior performance compared with benchmark methods such as median ratio normalization and trimmed mean of M values.

Conclusions: This approach enables more accurate comparison of gene expression datasets across different clinical scenarios, improving the reliability of biomarker identification and enhancing personalized treatment strategies.
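The abstract names the coefficient of variation and the Gini coefficient as selection criteria but does not give the pipeline's details. The Python sketch below is only an illustration of how such per-gene scores could be computed and how reference-gene scaling might be applied; it is not the authors' implementation. The function names, the dictionary layout (gene name mapped to a list of per-sample expression values), the ranking rule (sort by CV, then Gini), and the geometric-mean scaling are all assumptions, and the abstract's "stability" criterion is not modeled here.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (lower = more stable)."""
    mean = statistics.fmean(values)
    if mean <= 0:
        return math.inf  # undefined for non-positive means; rank such genes last
    return statistics.stdev(values) / mean


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even across samples)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def rank_candidates(expr):
    """Rank genes from most to least stable (hypothetical rule: CV first, then Gini).

    expr: dict mapping gene name -> list of expression values, one per sample.
    """
    scored = [(g, coefficient_of_variation(v), gini(v)) for g, v in expr.items()]
    return sorted(scored, key=lambda s: (s[1], s[2]))


def normalize_by_reference(expr, reference_genes):
    """Divide each sample by the geometric mean of its own reference genes.

    Assumes strictly positive reference-gene values. Each sample's scaling
    factor depends only on that sample, so no statistic is shared across samples.
    """
    n_samples = len(next(iter(expr.values())))
    factors = []
    for j in range(n_samples):
        logs = [math.log(expr[g][j]) for g in reference_genes]
        factors.append(math.exp(sum(logs) / len(logs)))
    return {g: [v / f for v, f in zip(vals, factors)] for g, vals in expr.items()}
```

In this sketch the scaling factor for a sample is computed from that sample alone, which is one way a normalization step can avoid sharing information between training and test samples; methods that estimate a reference across the whole cohort do not have this property by default.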

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics | 383 papers in training set | Top 0.8% | 10.1%
2. Scientific Reports | 3102 papers in training set | Top 10% | 8.4%
3. Bioinformatics | 1061 papers in training set | Top 3% | 8.4%
4. PLOS ONE | 4510 papers in training set | Top 25% | 6.8%
5. PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.7%
6. Bioinformatics Advances | 184 papers in training set | Top 1% | 3.6%
7. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.8% | 3.6%
8. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 3.6%
9. Computers in Biology and Medicine | 120 papers in training set | Top 1% | 2.6%
50% of probability mass above
10. Biology Methods and Protocols | 53 papers in training set | Top 0.4% | 2.6%
11. Communications Biology | 886 papers in training set | Top 7% | 1.8%
12. Frontiers in Pharmacology | 100 papers in training set | Top 2% | 1.8%
13. Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.8%
14. Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.7%
15. Frontiers in Physiology | 93 papers in training set | Top 3% | 1.7%
16. BMC Genomics | 328 papers in training set | Top 3% | 1.5%
17. Kidney360 | 22 papers in training set | Top 0.4% | 1.3%
18. Frontiers in Genetics | 197 papers in training set | Top 6% | 1.3%
19. Cytometry Part A | 30 papers in training set | Top 0.2% | 1.2%
20. Clinical Chemistry | 22 papers in training set | Top 0.5% | 1.2%
21. Journal of Translational Medicine | 46 papers in training set | Top 1% | 1.2%
22. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 1.2%
23. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
24. PeerJ | 261 papers in training set | Top 12% | 0.9%
25. Computational Biology and Chemistry | 23 papers in training set | Top 0.4% | 0.8%
26. BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.7%
27. Transplantation | 13 papers in training set | Top 0.4% | 0.7%
28. BMC Research Notes | 29 papers in training set | Top 0.8% | 0.6%
29. Clinical Pharmacology & Therapeutics | 25 papers in training set | Top 0.9% | 0.6%
30. Nature Communications | 4913 papers in training set | Top 66% | 0.6%