
Evaluating computational approaches for comparison of protein expression across cancer indications

Wang, J.; Tian, X.; Yu, W.; Pullman, B.; Bullen, J.; Hurt, E.; Zhong, W.

2024-08-27 bioinformatics
10.1101/2024.08.26.609731 bioRxiv

Background: The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) recently generated harmonized genomic, transcriptomic, proteomic, and clinical data for over 1,000 tumors across 10 cohorts to facilitate pan-cancer discovery research. However, comparing protein expression across CPTAC cohorts remains challenging because missing data are non-uniform and protein expression distributions vary across tumor types. Here, we evaluate several missing-data handling and normalization strategies for creating a normalized pan-cancer protein expression dataset.

Results: First, we developed a novel algorithm to select proteins that are robustly expressed in tumors within any CPTAC cohort. Second, we applied a cohort hybrid imputation approach to protein abundance values from FragPipe within each cohort, based on protein expression distribution patterns. Third, we calculated intensity-based absolute quantification (iBAQ) from the protein abundance values and applied both global and smooth quantile normalization. Global quantile normalization ensured identical distributions across cohorts for both tumor and normal tissues, while smooth quantile normalization preserved distribution differences between biological conditions. We assessed our method by comparing differential protein expression analysis results with and without normalization. Additionally, for selected proteins with high protein-to-RNA expression correlation across CPTAC cohorts, we examined their protein expression ranks in the normalized CPTAC dataset and compared them with their RNA expression ranks across the corresponding cohorts in The Cancer Genome Atlas (TCGA). Differential protein expression analysis revealed a high level of agreement in the tumor-versus-normal fold change within cohorts before and after normalization. Furthermore, global quantile normalization yielded the highest cohort rank correlation between CPTAC and TCGA for the selected proteins.

Conclusions: Global quantile normalization outperformed both smooth quantile normalization and no normalization, as evidenced by its higher rank correlation across cancer cohorts between CPTAC and TCGA for selected proteins. The findings suggest that combining cohort hybrid imputation with global quantile normalization is an effective method for creating a normalized CPTAC pan-cancer protein dataset, which can facilitate the study of protein expression across different cancer types.
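As a rough illustration of the global quantile normalization mentioned in the abstract, the following is a minimal sketch of the generic textbook technique, not the authors' pipeline. It assumes a proteins-by-samples matrix with no missing values (that is, already imputed); the function name `global_quantile_normalize` is hypothetical, and ties are broken by sort order rather than averaged.

```python
import numpy as np

def global_quantile_normalize(x):
    """Force every column (sample) of x to share one distribution.

    x: 2-D array, rows = proteins, columns = samples, no NaNs (assumed).
    """
    x = np.asarray(x, dtype=float)
    # Row indices that would sort each column independently.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_x.mean(axis=1)
    # Write the reference value for each rank back to its original position.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Small toy matrix (illustrative values only, not CPTAC data).
toy = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
normalized = global_quantile_normalize(toy)
```

After this step every column contains the same set of values, so within-sample ranks are preserved while between-sample distribution differences are removed. That is the property the abstract contrasts with smooth quantile normalization, which retains differences between biological conditions.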

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | PROTEOMICS | 35 papers in training set | Top 0.1% | 22.8%
2 | Journal of Proteome Research | 215 papers in training set | Top 0.2% | 18.8%
3 | Analytical Chemistry | 205 papers in training set | Top 0.6% | 4.9%
4 | BMC Bioinformatics | 383 papers in training set | Top 2% | 4.9%
50% of probability mass above
5 | PLOS ONE | 4510 papers in training set | Top 31% | 4.9%
6 | Molecular & Cellular Proteomics | 158 papers in training set | Top 0.5% | 4.9%
7 | Bioinformatics | 1061 papers in training set | Top 5% | 4.4%
8 | Scientific Reports | 3102 papers in training set | Top 35% | 3.6%
9 | PLOS Computational Biology | 1633 papers in training set | Top 15% | 1.8%
10 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 4% | 1.7%
11 | PeerJ | 261 papers in training set | Top 7% | 1.7%
12 | ACS Omega | 90 papers in training set | Top 2% | 1.3%
13 | Journal of Proteomics | 27 papers in training set | Top 0.2% | 1.3%
14 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 1.0%
15 | SoftwareX | 15 papers in training set | Top 0.4% | 0.8%
16 | Journal of the American Society for Mass Spectrometry | 33 papers in training set | Top 0.5% | 0.8%
17 | BMC Genomics | 328 papers in training set | Top 6% | 0.8%
18 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.8%
19 | GigaScience | 172 papers in training set | Top 3% | 0.7%
20 | Molecular Omics | 21 papers in training set | Top 0.5% | 0.7%
21 | BMC Medical Genomics | 36 papers in training set | Top 2% | 0.7%
22 | Cancer Research Communications | 46 papers in training set | Top 1% | 0.7%
23 | Briefings in Bioinformatics | 326 papers in training set | Top 8% | 0.5%
24 | Biochimica et Biophysica Acta (BBA) - Bioenergetics | 17 papers in training set | Top 0.3% | 0.5%
25 | Journal of Clinical Medicine | 91 papers in training set | Top 8% | 0.5%
26 | ImmunoInformatics | 11 papers in training set | Top 0.3% | 0.5%
27 | Frontiers in Bioinformatics | 45 papers in training set | Top 1% | 0.5%