Housekeeping Gene Expression Normalization in Transcriptomics Mitigates Data Leakage in Machine Learning Models
Ribas, G. T.; Riella, C. V.; Guizelini, D.; Menegatti Rigo, M.; Riella, L. V.; Borges, T. J.
Background: Inappropriate normalization can lead to data leakage and overfitting in machine learning models. Accurately identifying housekeeping genes (HKGs) can overcome this problem and is crucial for normalizing gene expression data, particularly in RNA-Seq experiments.
Results: First, we demonstrate that the expression of commonly used HKGs changes significantly over time under immunosuppressive treatment in transplant recipients. Using large public transcriptomic datasets from kidney transplantation, we developed a pipeline based on each gene's coefficient of variation, stability, and Gini coefficient, and identified nine stable, better-suited HKG candidates. Our results demonstrate that these HKGs improve the robustness and generalizability of machine learning models by minimizing data leakage, as evidenced by superior performance compared to benchmark methods such as median ratio normalization and trimmed mean of M values.
Conclusions: This approach enables more accurate comparison of gene expression datasets across different clinical scenarios, improving the reliability of biomarker identification and enhancing personalized treatment strategies.
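The abstract names the coefficient of variation and the Gini coefficient as selection criteria but does not give formulas or cutoffs. The following is a minimal sketch of how those two metrics could be computed per gene and used to shortlist HKG candidates. It is not the authors' pipeline: the function names, the thresholds `max_cv` and `max_gini`, and the toy gene labels are all hypothetical, and the separate "stability" criterion is omitted because the abstract does not define it.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (lower = more stable)."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    if mean <= 0:
        return float("inf")  # unexpressed genes cannot serve as references
    return float(x.std(ddof=1) / mean)

def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

def rank_candidates(expression, max_cv, max_gini):
    """Return genes passing both cutoffs, sorted by ascending CV.

    `expression` maps gene name -> expression values across samples.
    The cutoffs are illustrative assumptions, not values from the paper.
    """
    scored = []
    for gene, values in expression.items():
        cv = coefficient_of_variation(values)
        g = gini(values)
        if cv <= max_cv and g <= max_gini:
            scored.append((cv, gene))
    return [gene for _, gene in sorted(scored)]

# Toy data: GENE_A is nearly constant across samples, GENE_B is not.
toy = {
    "GENE_A": [100, 102, 98, 101],
    "GENE_B": [5, 300, 20, 900],
}
candidates = rank_candidates(toy, max_cv=0.1, max_gini=0.1)
```

A per-sample normalization does not use information from other samples, which is the property relevant to leakage. Once a reference set is fixed, each sample can be scaled by the summary expression of its own reference genes.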