A unified framework for batch correction and missing data handling in large-scale and single-cell mass spectrometry proteomics
Anwar, A. M.; Bayoumi, S.; Lahti, L.; Coffey, E.
Large-scale mass spectrometry (MS)-based proteomics, including single-cell proteomics, is routinely affected by technical variation arising from discrete batch effects, inter-laboratory differences and continuous signal drift during data acquisition. Current correction strategies typically address these sources of unwanted variation independently and often require either removal of proteins with missing values or imputation before correction, both of which may lead to information loss and amplify technical bias. Here we present NMFBatch, a unified statistical framework that simultaneously models discrete and continuous unwanted variation in bulk and single-cell proteomics data. NMFBatch integrates non-negative matrix factorization with generalized additive modelling and directly accommodates missing values, thereby enabling both on-the-fly imputation during correction and optional post-correction imputation. Benchmarking against six batch-correction methods on multi-laboratory reference datasets and a large plasma proteomics cohort shows that NMFBatch consistently reduces batch-associated variation while preserving biological structure under both balanced and confounded experimental designs. Application to single-cell proteomics data further shows effective reduction of tandem mass tag (TMT)- and acquisition-associated variation while retaining biologically meaningful clustering. Together, these results establish NMFBatch as a flexible framework for modelling unwanted variation in proteomics experiments, with potential applications in cross-cohort harmonization and integrative proteomics analysis.
Graphical Abstract: Created in BioRender. Youssef, A. (2026) https://BioRender.com/c1q1yxt
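To illustrate how a non-negative matrix factorization can accommodate missing values directly, the following is a minimal sketch of masked (weighted) NMF with multiplicative updates, in which unobserved entries are simply excluded from the loss. It is not the NMFBatch implementation: the function name `masked_nmf`, its parameters, and the toy data are assumptions made for illustration, and the generalized additive modelling of batch and drift described in the abstract is not shown.

```python
import numpy as np

def masked_nmf(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Hypothetical helper: factorize X ~ W @ H with W, H >= 0.

    NaN entries of X are treated as missing and do not contribute to
    the squared-error loss (weighted multiplicative updates).
    Assumes the observed entries of X are non-negative.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)              # True where a value was measured
    M = observed.astype(float)           # 0/1 weight matrix
    X0 = np.where(observed, X, 0.0)      # zero-fill so matrix products are defined
    n, p = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        # Update H, comparing only the observed entries of X and W @ H
        H *= (W.T @ X0) / (W.T @ (M * (W @ H)) + eps)
        # Update W in the same way
        W *= (X0 @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H

def masked_loss(X, W, H):
    """Squared reconstruction error over observed entries only."""
    observed = ~np.isnan(X)
    return float(np.sum((np.where(observed, X, 0.0) - observed * (W @ H)) ** 2))

# Toy example: a rank-3 non-negative matrix with about 20% of values missing
rng = np.random.default_rng(1)
truth = rng.random((40, 3)) @ rng.random((3, 25))
X = truth.copy()
X[rng.random(X.shape) < 0.2] = np.nan

W, H = masked_nmf(X, rank=3)
# "On-the-fly" imputation: keep measured values, fill gaps from the low-rank fit
X_filled = np.where(np.isnan(X), W @ H, X)
```

Because the missing entries never enter the updates, no proteins need to be dropped and no imputation is required before fitting; imputed values fall out of the fitted factors afterwards.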
Matching journals
The top 2 journals account for 50% of the predicted probability mass.