
Comparative evaluation of imputation and batch-effect correction for proteomics/peptidomics differential-expression analysis

Gonidaki, C.; Vlahou, A.; Stroggilos, R.; Mischak, H.; Latosinska, A.

2025-08-16 · health informatics · medRxiv preprint · DOI: 10.1101/2025.08.14.25333694

Mass spectrometry (MS)-based proteomics offers powerful opportunities for biomarker discovery; nevertheless, it faces technical challenges, among them missing values and batch effects. Both can obscure biological signal and bias results. Although imputation and batch-correction methods are well established in transcriptomics, their impact, particularly on large-scale, real-world clinical proteomics datasets, remains unclear. In this study, we systematically compared two popular imputation methods (1/2 LOD replacement and KNN) in combination with three batch-effect correction approaches (ComBat, ComBat with a disease covariate, and MNN) with respect to differential expression analysis in a CE-MS urine peptidomics dataset of 1,050 samples across 13 batches, collected for early detection of chronic kidney disease (CKD) and separated into discovery (n = 525) and validation (n = 525) sets. Our results show that the choice of imputation method (1/2 LOD versus KNN) had minimal impact on the final list of differentially expressed peptides (DEPs). In contrast, batch-effect correction had a much stronger influence on the results: ComBat without covariate adjustment removed most DEPs, suggesting loss of true biological signal, whereas incorporating disease status into the model preserved most of this information. MNN yielded a moderate to low number of validated DEPs overall, especially when paired with KNN imputation. These findings show that imputation and batch correction are not entirely independent processes and can influence downstream results. Overall, preprocessing methods should be selected based on the characteristics of each dataset, especially batch severity and the available biological covariates.

Statement of significance of the study

Finding reliable biomarkers in clinical proteomics first requires addressing the technical noise that can hide true biological signals. In this work, we investigate how different imputation and batch-correction methods influence the list of peptides that emerge as differentially expressed. Instead of relying on simulations or small datasets, we examine a large, real-world urine peptidomics cohort of more than 1,000 samples screened for early-stage chronic kidney disease. The results show that no preprocessing pipeline is universally optimal and that the best choice depends on the characteristics of the dataset. This study offers practical guidance for improving reproducibility in urine-based peptide studies and supports more confident identification of disease-associated molecular signatures.
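To make the two imputation strategies concrete, here is a minimal, hypothetical pure-Python sketch of half-LOD replacement and a simple KNN imputation on a small samples-by-peptides matrix. Function names, the toy data, and the choice of Euclidean distance over jointly observed columns are illustrative assumptions, not the study's actual implementation (which in practice would use established library routines on a much larger matrix).

```python
def half_lod_impute(matrix):
    """Replace missing values (None) in each column with half the smallest
    observed value in that column (a proxy for half the limit of detection)."""
    n_cols = len(matrix[0])
    result = [row[:] for row in matrix]
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        fill = min(observed) / 2.0  # half-LOD surrogate for this peptide
        for row in result:
            if row[j] is None:
                row[j] = fill
    return result


def knn_impute(matrix, k=2):
    """Fill each missing entry with the mean of that column over the k nearest
    samples, using Euclidean distance on jointly observed columns."""
    result = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for i2, other in enumerate(matrix):
                if i2 == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = sum((a - b) ** 2 for a, b in shared) ** 0.5
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            neighbours = [v for _, v in candidates[:k]]
            result[i][j] = sum(neighbours) / len(neighbours)
    return result


# Toy 3-samples x 3-peptides matrix; None marks a missing intensity.
data = [[1.0, None, 3.0],
        [2.0, 5.0, None],
        [1.5, 4.0, 2.5]]
print(half_lod_impute(data))  # missing entries become half the column minimum
print(knn_impute(data, k=2))  # missing entries become the mean of the k nearest samples
```

Half-LOD fills every gap in a column with the same small constant, which compresses low-abundance variance; KNN instead borrows information from similar samples, which is why the study could observe the two methods interacting differently with downstream batch correction.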

Published in PROTEOMICS (predicted rank #7) · training set

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Probability
1 | Journal of Proteome Research | 215 | Top 0.2% | 19.1%
2 | Molecular & Cellular Proteomics | 158 | Top 0.1% | 14.8%
3 | Analytical Chemistry | 205 | Top 0.3% | 9.4%
4 | Bioinformatics | 1061 | Top 4% | 6.5%
5 | Clinical Proteomics | 10 | Top 0.1% | 3.7%
6 | Journal of Proteomics | 27 | Top 0.1% | 3.7%
7 | PROTEOMICS (published here) | 35 | Top 0.2% | 3.7%
8 | Scientific Reports | 3102 | Top 34% | 3.7%
9 | PLOS ONE | 4510 | Top 44% | 2.7%
10 | Heliyon | 146 | Top 2% | 1.5%
11 | Advanced Biology | 29 | Top 0.5% | 1.4%
12 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.3% | 1.4%
13 | Frontiers in Microbiology | 375 | Top 6% | 1.4%
14 | Nature Communications | 4913 | Top 56% | 1.3%
15 | Frontiers in Immunology | 586 | Top 6% | 1.0%
16 | Analytical and Bioanalytical Chemistry | 17 | Top 0.3% | 1.0%
17 | The Analyst | 15 | Top 0.4% | 0.9%
18 | Frontiers in Plant Science | 240 | Top 5% | 0.8%
19 | iScience | 1063 | Top 31% | 0.8%
20 | Data in Brief | 13 | Top 0.4% | 0.8%
21 | Frontiers in Neurology | 91 | Top 5% | 0.7%
22 | SoftwareX | 15 | Top 0.5% | 0.7%
23 | Analytica Chimica Acta | 17 | Top 0.7% | 0.7%
24 | Computational and Structural Biotechnology Journal | 216 | Top 10% | 0.7%
25 | Biomedicines | 66 | Top 4% | 0.5%
26 | JMIR Public Health and Surveillance | 45 | Top 5% | 0.5%