
Baseline Acute Myeloid Leukemia Prognosis Models using Transcriptomic and Clinical Profiles by Studying the Impacts of Dimensionality Reductions and Gene Signatures on Cox-Proportional Hazard

Sauve, L.; Hebert, J.; Sauvageau, G.; Lemieux, S.

2022-12-10 bioinformatics
10.1101/2022.12.06.519415 bioRxiv

Extracting gene markers to evaluate risk in cancer can refine diagnosis and lead to adapted therapies and better survival. Such survival analyses can be performed with machine learning (ML) algorithms such as the Cox proportional hazards (CPH) model applied to gene expression (GE) RNA-Seq data. However, optimal tuning of CPH on genome-wide GE data is challenging and has been poorly assessed so far. In this work, we interrogate an Acute Myeloid Leukemia (AML) dataset (Leucegene) to identify the key factors that degrade CPH performance and to characterize its sensitivity to them, with the aim of improving the system. We compare projection and selection data reduction techniques, mainly PCA and the LSC17 gene signature, in combination with CPH in an ML framework. Results reveal that CPH performs better with a combination of clinical and gene expression features. We determine that projections perform better than selections without clinical information. We ascertain that CPH is affected by overfitting and that this overfitting is linked to the number and the content of input covariables. We show that PCA links clinical features via its ability to learn from the input data directly and generalizes better than LSC17 on Leucegene. We postulate that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene. We extrapolate that these findings apply in the more general context of risk detection via machine learning in cancer. We see that higher-capacity models such as CPH-DNN systems can be improved via survival-derived projections and can combat overfitting through heavy regularization.

Author summary

This study investigates the feasibility of using gene expression to evaluate risk in cancer and compares projection and selection data reduction techniques. The study used the Leucegene dataset to compare PCA and a previously published 17-gene signature in combination with the Cox proportional hazards model in a machine learning framework. Results showed that CPH was affected by overfitting and that this overfitting was linked to the number and the content of input covariables. The study found that PCA links clinical features via its ability to learn from the input data directly and generalizes better than LSC17 on Leucegene. The study concluded that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene, and that they can be tuned to improve their performance.

Data availability statement

Source code for pipelines and algorithms, as well as gene expression matrices, are available here: https://github.com/lemieux-lab/dimensions_coxph. Access to the Leucegene cohort survival times can be granted upon request and following ethical review.
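The comparison described in the abstract (a projection such as PCA versus a fixed gene selection, each feeding a regularized Cox model scored on held-out data) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the data are synthetic, the 17 selected columns are an arbitrary stand-in and not the LSC17 genes, and the ridge-penalized gradient-descent fit, the function names, and the hyperparameters are hypothetical rather than taken from the authors' pipeline.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and top-k principal axes of X (rows are samples)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def fit_cox_ridge(X, time, event, l2=0.01, n_iter=300):
    """Minimise the ridge-penalised negative Cox partial log-likelihood by
    gradient descent with backtracking (assumes no tied event times)."""
    order = np.argsort(-time)  # descending time: cumulative sums = risk-set sums
    X, ev = X[order], event[order].astype(bool)
    n = len(time)

    def loss_grad(b):
        eta = X @ b
        shift = eta.max()  # subtract the max for numerical stability
        w = np.exp(eta - shift)
        cum_w = np.cumsum(w)
        cum_wx = np.cumsum(w[:, None] * X, axis=0)
        loglik = (eta[ev] - shift - np.log(cum_w[ev])).sum()
        score = (X[ev] - cum_wx[ev] / cum_w[ev, None]).sum(axis=0)
        return -loglik / n + 0.5 * l2 * b @ b, -score / n + l2 * b

    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        loss, grad = loss_grad(beta)
        step = 1.0
        # Halve the step until the loss decreases sufficiently (Armijo rule).
        while step > 1e-8 and loss_grad(beta - step * grad)[0] > loss - 0.5 * step * grad @ grad:
            step *= 0.5
        beta = beta - step * grad
    return beta

def concordance_index(risk, time, event):
    """Fraction of comparable pairs in which the earlier event has higher risk."""
    comparable = (time[:, None] < time[None, :]) & event[:, None].astype(bool)
    concordant = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    return (concordant[comparable].sum() + 0.5 * tied[comparable].sum()) / comparable.sum()

# Synthetic cohort (assumption): one latent factor drives 50 of 200 "genes" and the hazard.
rng = np.random.default_rng(0)
n, p = 300, 200
z = rng.normal(size=n)
loadings = np.zeros(p)
loadings[:50] = 1.0
X = z[:, None] * loadings + rng.normal(size=(n, p))
true_time = rng.exponential(1.0 / np.exp(1.5 * z))
censor = rng.exponential(5.0, size=n)
event = true_time <= censor
obs = np.minimum(true_time, censor)
tr, te = np.arange(200), np.arange(200, 300)

def evaluate(f_train, f_test):
    """Standardise with training statistics, fit on train, score on test."""
    mu, sd = f_train.mean(axis=0), f_train.std(axis=0)
    beta = fit_cox_ridge((f_train - mu) / sd, obs[tr], event[tr])
    return concordance_index(((f_test - mu) / sd) @ beta, obs[te], event[te])

# Projection: PCA axes learned on the training split only.
mean, axes = pca_fit(X[tr], 10)
c_pca = evaluate((X[tr] - mean) @ axes.T, (X[te] - mean) @ axes.T)

# Selection: a hypothetical 17-column signature (not the real LSC17 genes).
signature = np.arange(40, 57)
c_sig = evaluate(X[tr][:, signature], X[te][:, signature])

print(f"test c-index, PCA projection: {c_pca:.3f}")
print(f"test c-index, 17-column selection: {c_sig:.3f}")
```

Raising `l2` or lowering the number of components reduces the gap between training and test concordance in this toy setting, which is one way to probe the link between overfitting and the number of input covariables that the abstract describes. The relative ranking of the two approaches on synthetic data says nothing about their ranking on Leucegene.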

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Biology Methods and Protocols, 53 papers in training set, Top 0.1%, 12.5%
2. PLOS Computational Biology, 1633 papers in training set, Top 3%, 10.5%
3. Frontiers in Artificial Intelligence, 18 papers in training set, Top 0.1%, 8.5%
4. BMC Bioinformatics, 383 papers in training set, Top 2%, 6.4%
5. PLOS ONE, 4510 papers in training set, Top 31%, 4.9%
6. Frontiers in Genetics, 197 papers in training set, Top 1%, 4.9%
7. Bioinformatics, 1061 papers in training set, Top 5%, 4.2%
50% of probability mass above
8. Frontiers in Bioinformatics, 45 papers in training set, Top 0.1%, 3.6%
9. PeerJ, 261 papers in training set, Top 4%, 2.6%
10. Scientific Reports, 3102 papers in training set, Top 47%, 2.4%
11. Computational Biology and Chemistry, 23 papers in training set, Top 0.1%, 2.1%
12. BioData Mining, 15 papers in training set, Top 0.2%, 2.1%
13. BMC Medical Informatics and Decision Making, 39 papers in training set, Top 1%, 1.7%
14. Computers in Biology and Medicine, 120 papers in training set, Top 2%, 1.7%
15. Computational and Structural Biotechnology Journal, 216 papers in training set, Top 5%, 1.5%
16. GigaScience, 172 papers in training set, Top 2%, 1.3%
17. BMC Genomics, 328 papers in training set, Top 3%, 1.2%
18. Artificial Intelligence in Medicine, 15 papers in training set, Top 0.4%, 1.2%
19. Cancers, 200 papers in training set, Top 4%, 1.2%
20. Physical Biology, 43 papers in training set, Top 1%, 1.2%
21. Journal of Computational Biology, 37 papers in training set, Top 0.3%, 1.2%
22. Frontiers in Physiology, 93 papers in training set, Top 4%, 1.1%
23. F1000Research, 79 papers in training set, Top 3%, 0.9%
24. Archives of Clinical and Biomedical Research, 28 papers in training set, Top 2%, 0.9%
25. Computer Methods and Programs in Biomedicine, 27 papers in training set, Top 0.7%, 0.9%
26. npj Systems Biology and Applications, 99 papers in training set, Top 2%, 0.8%
27. Informatics in Medicine Unlocked, 21 papers in training set, Top 1%, 0.8%
28. Expert Systems with Applications, 11 papers in training set, Top 0.5%, 0.7%
29. NAR Genomics and Bioinformatics, 214 papers in training set, Top 4%, 0.6%
30. Life, 27 papers in training set, Top 0.8%, 0.5%
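
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked from the listed percentages with a cumulative sum. The sketch below is illustrative only: the function name is hypothetical, and the values are the first ten probabilities copied from the list above.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1 to 10, copied from the list.
probs = [12.5, 10.5, 8.5, 6.4, 4.9, 4.9, 4.2, 3.6, 2.6, 2.4]

def journals_to_reach(probs, threshold=50.0):
    """Return the smallest rank at which the cumulative probability reaches the threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None  # threshold never reached within the listed entries

top_n = journals_to_reach(probs)
print(top_n)
```

The cumulative total is about 47.7% after six journals and about 51.9% after seven, so the function returns 7, in agreement with the stated cutoff.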