Classification with Missing Data - A NIFty Pipeline for Single-Cell Proteomics
Nitz, A. A.; Echarry, B.; McGee, B.; Payne, S. H.
Single-cell proteomics (SCP) is uniquely suited for cell-type characterization, trajectory-based inference, and microenvironment mapping. Evaluating biological hypotheses in these experiments requires labeled cells. When cells carry no pre-measurement label, machine learning is used to identify features that characterize the cell types and to classify unlabeled samples. Current implementations of annotation methods have several statistical and computational disadvantages. First, machine-learning methods require complete data, which leads to large amounts of missing-value imputation in SCP. Second, some machine-learning methods select features and classify samples via cross-sample comparisons; this double dipping invalidates downstream cross-sample comparisons such as differential expression. Finally, measurements from different proteomic experiments are not directly comparable because of batch effects, which significantly limits the accuracy of classifiers trained on external data. Here we present NIFty, a feature selection method based on top-scoring pairs and implemented in a full classification pipeline. NIFty does not require pre-imputed data as input, does not employ circular analysis techniques, and overcomes batch effects without batch correction. When tested on imputed versus unimputed data, data with large batch effects, and multiclass data, NIFty overcame the targeted classification challenges and classified the samples in the varied datasets with comparable or better accuracy.
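To make the top-scoring pairs idea concrete, the following is a minimal sketch of the generic two-class technique, not NIFty's actual implementation. It scores each feature pair by how differently the within-sample ordering `x_i < x_j` behaves in the two classes. The function names, the toy data, and the choice to skip samples where either feature of a pair is missing (rather than impute) are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations


def tsp_scores(X, y):
    """Score each feature pair (i, j) by |P(x_i < x_j | y=0) - P(x_i < x_j | y=1)|.

    X: (n_samples, n_features) array; NaN marks a missing value.
    y: (n_samples,) array of 0/1 labels.
    Assumption: a sample contributes to a pair only if both features are observed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    scores = {}
    for i, j in combinations(range(X.shape[1]), 2):
        both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
        probs = []
        for c in (0, 1):
            mask = both & (y == c)
            if not mask.any():
                break  # pair never co-observed in this class; skip it
            probs.append(np.mean(X[mask, i] < X[mask, j]))
        else:
            scores[(i, j)] = abs(probs[0] - probs[1])
    return scores


def predict_with_pair(x, pair, class_if_less):
    """Classify one sample from a single pair; None if either value is missing."""
    i, j = pair
    if np.isnan(x[i]) or np.isnan(x[j]):
        return None
    return class_if_less if x[i] < x[j] else 1 - class_if_less


# Hypothetical toy data: 6 cells, 3 proteins, some values missing.
X = np.array([
    [1.0, 2.0, 5.0],
    [1.5, np.nan, 4.0],
    [0.5, 3.0, 6.0],
    [4.0, 1.0, 5.0],
    [5.0, 2.0, np.nan],
    [6.0, 0.5, 7.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

scores = tsp_scores(X, y)
best_pair = max(scores, key=scores.get)

# A new sample measured on a 10x larger intensity scale keeps the same ordering.
shifted_sample = np.array([10.0, 20.0, np.nan])
label = predict_with_pair(shifted_sample, best_pair, class_if_less=0)
```

Because the rule depends only on the ordering of two values within one sample, any per-sample monotone rescaling leaves the prediction unchanged, which is one intuition for why pair-based rules can tolerate batch effects without explicit correction.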