
A Novel Machine Learning Systematic Framework and Web Tool for Breast Cancer Survival Rate Assessment

Ji, J. M.; Shen, W. H.

2022-09-17 oncology
10.1101/2022.09.16.22280052 medRxiv

Cancer research, including breast cancer research, has increasingly relied on molecular profiling enabled by advances in genomic technology. Although these techniques have allowed scientists to unravel how cancer develops, translating the vast amounts of patient data into clinically meaningful results remains difficult. As a result, tasks such as predicting patient response to different treatments remain a major challenge in cancer treatment. Many studies have attempted to determine survival indicators for breast cancer patients. However, most of these analyses relied on traditional statistical methods, which are inadequate for handling large volumes of data or unstructured data on human breast cancer. With the exponential progress in computing power and artificial intelligence approaches, we believe that machine learning offers an opportunity to go beyond current capabilities in understanding the correlations between gene set alterations, drug responses, and the prognosis of breast cancer patients. This information would greatly benefit scientists and physicians in developing clinical therapeutic strategies, such as personalized treatment.
This project employs multiple machine learning approaches, including a novel deep learning algorithm, to build models for detecting and visualizing significant prognostic indicators of breast cancer patient survival. The clinical and genomic data of 1,980 primary breast cancer samples used in this project were obtained from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database of cBioPortal. The data were preprocessed and then split to train eight classical machine learning models and the aforementioned deep learning Convolutional Neural Network (CNN) model. These models were evaluated using recall, accuracy, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC) on the training dataset, and confirmed using the remaining data. Both the deep learning and the classical machine learning methods produced desirable prediction accuracies. However, the deep learning model noticeably outperformed all other classifiers and achieved the best performance (AUC = 0.900). The project was built in the Google Colab environment using Python and its data visualization libraries, TensorFlow, and Keras. The CNN model shows strong potential as a systematic framework for real-time prediction by end users. A web application for breast cancer survival rate prediction was designed and developed using Streamlit, TensorFlow, Keras, and Python libraries so that end users can interact with the model easily and obtain quick and accurate predictions.
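
The evaluation metrics named in the abstract (accuracy, recall, and AUC) can be illustrated with a minimal sketch. This is not the authors' code: the abstract reports a TensorFlow/Keras pipeline, whereas the sketch below uses plain Python, and the labels, scores, the 0.5 threshold, and the function names are all made-up assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were predicted as positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        raise ValueError("recall is undefined without positive samples")
    return sum(positives) / len(positives)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels (1 = event, 0 = no event) and model scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("recall:  ", recall(y_true, y_pred))
print("AUC:     ", roc_auc(y_true, scores))
```

Accuracy and recall depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of metric are usually reported together.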

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology | 1633 papers in training set | Top 4% | 8.4%
2. Frontiers in Bioinformatics | 45 papers in training set | Top 0.1% | 6.8%
3. BMC Cancer | 52 papers in training set | Top 0.3% | 6.3%
4. Scientific Reports | 3102 papers in training set | Top 28% | 4.3%
5. PLOS ONE | 4510 papers in training set | Top 34% | 4.3%
6. Biology Methods and Protocols | 53 papers in training set | Top 0.2% | 4.0%
7. Heliyon | 146 papers in training set | Top 0.3% | 4.0%
8. Frontiers in Oncology | 95 papers in training set | Top 1% | 3.7%
9. Frontiers in Genetics | 197 papers in training set | Top 2% | 3.6%
10. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.1% | 3.6%
11. PeerJ | 261 papers in training set | Top 3% | 3.6%
50% of probability mass above
12. iScience | 1063 papers in training set | Top 6% | 3.1%
13. Computers in Biology and Medicine | 120 papers in training set | Top 1% | 2.9%
14. Cancers | 200 papers in training set | Top 2% | 2.7%
15. International Journal of Molecular Sciences | 453 papers in training set | Top 5% | 2.4%
16. Briefings in Bioinformatics | 326 papers in training set | Top 3% | 1.9%
17. BMC Bioinformatics | 383 papers in training set | Top 4% | 1.7%
18. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.3% | 1.5%
19. Frontiers in Neuroscience | 223 papers in training set | Top 5% | 1.3%
20. Frontiers in Pharmacology | 100 papers in training set | Top 3% | 1.2%
21. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.6% | 1.2%
22. European Journal of Cancer | 10 papers in training set | Top 0.4% | 0.9%
23. eLife | 5422 papers in training set | Top 53% | 0.9%
24. Database | 51 papers in training set | Top 0.7% | 0.9%
25. Frontiers in Immunology | 586 papers in training set | Top 6% | 0.9%
26. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 6% | 0.8%
27. BMC Research Notes | 29 papers in training set | Top 0.5% | 0.8%
28. Annals of Biomedical Engineering | 34 papers in training set | Top 1% | 0.7%
29. Cancer Medicine | 24 papers in training set | Top 1% | 0.7%
30. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 3% | 0.7%
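
The statement that the top 11 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below is only an illustration: the function name `journals_for_mass` is hypothetical, and the inputs are the rounded percentages shown in the list, so the cumulative total is approximate.

```python
def journals_for_mass(probabilities, target=50.0):
    """Return (count, cumulative mass) for the smallest ranked prefix reaching target.

    `probabilities` must already be sorted from highest to lowest, in percent.
    Returns None if the target is never reached.
    """
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= target:
            return count, total
    return None


# Rounded predicted probabilities (percent) for ranks 1 to 12 in the list.
top_probabilities = [8.4, 6.8, 6.3, 4.3, 4.3, 4.0, 4.0, 3.7, 3.6, 3.6, 3.6, 3.1]

count, mass = journals_for_mass(top_probabilities)
print(f"{count} journals cover about {mass:.1f}% of the probability mass")
```

With these rounded values the first ten journals sum to about 49.0%, so the eleventh is the one that crosses the 50% line.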