
Machine Learning Approach to Integrate and Analyse Multiomics data to Identify Actionable Biomarkers for Head and Neck Squamous Cell Carcinoma (HNSCC)

Panchal, K.; Arockia Rajesh Packiam, K.; Majumdar, S.

2025-10-13 genetic and genomic medicine
10.1101/2025.10.09.25335922 medRxiv

Head and neck squamous cell carcinoma (HNSCC) is the sixth most common cancer worldwide and a major cause of death. A molecular understanding of disease progression can aid timely diagnosis and therapy. This study aims to identify potential HNSCC biomarkers using a machine learning-based approach to integrate and analyse multiomics data, namely the publicly available datasets for Human Papillomavirus (HPV)-negative patients from the CPTAC-HNSCC project, covering transcriptomics, methylomics, proteomics, and phosphoproteomics. A three-step feature selection method was used to identify potential molecular biomarkers. The top 1000 features (genes) were first filtered using Mutual Information, then ranked by random forest-based feature importance, and finally reduced by Recursive Feature Elimination with cross-validation coupled with a Support Vector Machine (SVM-RFECV) to obtain a minimal gene set for the tumor-normal classification task. To benchmark the selected features, Logistic Regression (LogR), Random Forest (RF), Multi-layer Perceptron (MLP), and Support Vector Machines (SVC) were used. The prediction performance of classifiers trained on the selected gene sets was evaluated using accuracy and compared against that of models trained on randomly selected gene sets. The entire workflow was repeated 100 times with different random states to establish statistical confidence in the pipeline and the selected gene set. Our integrative approach identified both omics-specific and cross-omics candidate genes with very high classification accuracy, ranging from approximately 95% to 100%. These genes reveal convergent biological processes central to HNSCC pathogenesis, which reinforces the robustness of the methodology. The approach can be adopted to analyse similar multiomics datasets for other pathologies and for foundational biological questions.
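The three-step selection and the benchmark against random gene sets described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the cutoffs (`n_mi=50`, `n_rf=20`) are far smaller than the 1000-feature filter in the abstract, the helper names `select_features` and `benchmark` are hypothetical, the train/test split is an assumption about how accuracy might be measured, and a single random state is used rather than 100 repeats.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def select_features(X, y, n_mi=50, n_rf=20, seed=0):
    """Return the column indices that survive the three filtering steps."""
    # Step 1: keep the n_mi features with the highest mutual information.
    mi = mutual_info_classif(X, y, random_state=seed)
    mi_idx = np.argsort(mi)[::-1][:n_mi]

    # Step 2: rank the survivors by random forest importance and keep n_rf.
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X[:, mi_idx], y)
    rf_idx = mi_idx[np.argsort(rf.feature_importances_)[::-1][:n_rf]]

    # Step 3: recursive feature elimination with cross-validation,
    # wrapped around a linear SVM (which exposes coef_ for ranking).
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rfecv = RFECV(SVC(kernel="linear"), step=1, cv=cv, scoring="accuracy")
    rfecv.fit(X[:, rf_idx], y)
    return rf_idx[rfecv.support_]


def benchmark(X_tr, y_tr, X_te, y_te, selected, seed=0):
    """Test accuracy per classifier: (selected set, same-size random set)."""
    rng = np.random.default_rng(seed)
    random_idx = rng.choice(X_tr.shape[1], size=len(selected), replace=False)
    models = {
        "LogR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
        "SVC": SVC(),
    }
    out = {}
    for name, model in models.items():
        acc_sel = model.fit(X_tr[:, selected], y_tr).score(X_te[:, selected], y_te)
        acc_rnd = model.fit(X_tr[:, random_idx], y_tr).score(X_te[:, random_idx], y_te)
        out[name] = (acc_sel, acc_rnd)
    return out


# Synthetic stand-in for one omics matrix (samples x genes); an assumption.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Features are selected on the training split only, to avoid leakage.
selected = select_features(X_tr, y_tr)
results = benchmark(X_tr, y_tr, X_te, y_te, selected)
for name, (acc_sel, acc_rnd) in results.items():
    print(f"{name}: selected={acc_sel:.2f} random={acc_rnd:.2f}")
```

Wrapping the last block in a loop over several values of `seed` would approximate the repeated-run design mentioned in the abstract; how the authors aggregate results across runs is not stated here.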

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Heliyon | 146 papers in training set | Top 0.1% | 10.1%
2. Frontiers in Molecular Biosciences | 100 papers in training set | Top 0.1% | 10.1%
3. Scientific Reports | 3102 papers in training set | Top 8% | 9.2%
4. Computers in Biology and Medicine | 120 papers in training set | Top 0.3% | 6.3%
5. Frontiers in Bioinformatics | 45 papers in training set | Top 0.1% | 4.9%
6. International Journal of Molecular Sciences | 453 papers in training set | Top 3% | 3.3%
7. Frontiers in Oncology | 95 papers in training set | Top 1% | 3.1%
8. PLOS ONE | 4510 papers in training set | Top 44% | 2.7%
9. Briefings in Bioinformatics | 326 papers in training set | Top 3% | 2.6%

50% of probability mass above

10. Genomics | 60 papers in training set | Top 0.7% | 2.1%
11. Cancers | 200 papers in training set | Top 2% | 2.1%
12. Frontiers in Microbiology | 375 papers in training set | Top 4% | 2.1%
13. Journal of Translational Medicine | 46 papers in training set | Top 0.6% | 1.9%
14. Journal of Proteomics | 27 papers in training set | Top 0.2% | 1.7%
15. Biosensors and Bioelectronics | 52 papers in training set | Top 0.8% | 1.7%
16. iScience | 1063 papers in training set | Top 18% | 1.5%
17. Biology | 43 papers in training set | Top 0.9% | 1.5%
18. Journal of Proteome Research | 215 papers in training set | Top 1% | 1.3%
19. Frontiers in Bioengineering and Biotechnology | 88 papers in training set | Top 2% | 1.3%
20. Biomedicines | 66 papers in training set | Top 2% | 1.2%
21. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.2%
22. Viruses | 318 papers in training set | Top 4% | 1.0%
23. Virus Research | 36 papers in training set | Top 0.9% | 1.0%
24. Cancer Research Communications | 46 papers in training set | Top 0.9% | 0.9%
25. Journal of Clinical Medicine | 91 papers in training set | Top 5% | 0.9%
26. Frontiers in Cell and Developmental Biology | 218 papers in training set | Top 7% | 0.9%
27. Analytical Chemistry | 205 papers in training set | Top 2% | 0.9%
28. Data in Brief | 13 papers in training set | Top 0.3% | 0.9%
29. BMC Medical Genomics | 36 papers in training set | Top 1% | 0.8%
30. Informatics in Medicine Unlocked | 21 papers in training set | Top 1% | 0.8%
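
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below copies the first ten displayed values from the ranking; since those values are rounded to one decimal place, the result is approximate, and the variable names are illustrative only.

```python
from itertools import accumulate

# Displayed probabilities (percent) for ranks 1 to 10, copied from the list.
probs = [10.1, 10.1, 9.2, 6.3, 4.9, 3.3, 3.1, 2.7, 2.6, 2.1]

# Running total of probability mass by rank.
cumulative = list(accumulate(probs))

# Smallest rank k whose cumulative mass reaches 50%.
k = next(rank for rank, total in enumerate(cumulative, start=1) if total >= 50.0)

print(f"top {k} journals cover about {cumulative[k - 1]:.1f}% of the mass")
```

With these rounded values, the first eight journals sum to about 49.7% and the ninth brings the total to about 52.3%, which agrees with the cutoff shown after rank 9.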