Classification with Missing Data - A NIFty Pipeline for Single-Cell Proteomics

Nitz, A. A.; Echarry, B.; McGee, B.; Payne, S. H.

2026-03-09 bioinformatics
10.64898/2026.03.06.710179 bioRxiv
Single-cell proteomics (SCP) is uniquely suited for cell-type characterization, trajectory-based inference, and microenvironment mapping. Evaluating biological hypotheses in these experiments requires labeled cells. When cells carry no pre-measurement label, machine learning is used to identify features that characterize the cell types and to classify unlabeled samples. Current implementations of annotation methods have several statistical and computational disadvantages. First, machine-learning methods require complete data, which leads to large amounts of missing-value imputation in SCP. Second, some machine-learning methods select features and classify samples via cross-sample comparisons, which invalidates downstream cross-sample analyses, such as differential expression, through double dipping. Finally, measurements from different proteomic experiments are not directly comparable because of batch effects, which significantly limits the accuracy of classifiers trained on external data. Here we present NIFty, a top-scoring-pairs-based feature selection method, implemented in a full classification pipeline, that does not require pre-imputed data as input, does not employ circular analysis techniques, and overcomes batch effects without batch correction. When tested on imputed vs. unimputed data, data with large batch effects, and multiclass data, NIFty overcame the targeted classification challenges and classified the samples in the varied datasets with comparable or higher accuracy.
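
The general top-scoring-pairs idea that the abstract builds on can be illustrated with a small sketch. This is not NIFty's implementation: it is the classic two-class top-scoring-pair rule, and the function names (`tsp_scores`, `fit_top_pair`, `predict`), the toy data, and the choice to skip samples in which either protein of a pair is missing are all assumptions made for illustration. Because the rule only asks whether one protein is lower than another within the same sample, it needs no imputation and is unchanged by a per-sample shift in measured intensities.

```python
import numpy as np

def tsp_scores(X, y):
    """Score each feature pair (i, j) by |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.

    Samples in which either feature is NaN are left out of that pair's
    estimate (an illustrative assumption, not taken from the paper).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    probs = []
    for c in classes:
        Xc = X[y == c]
        observed = ~np.isnan(Xc)
        # both[s, i, j]: sample s has a value for feature i and for feature j
        both = observed[:, :, None] & observed[:, None, :]
        with np.errstate(invalid="ignore", divide="ignore"):
            less = (Xc[:, :, None] < Xc[:, None, :]) & both
            # NaN where no sample of this class has both features observed
            probs.append(less.sum(axis=0) / both.sum(axis=0))
    scores = np.nan_to_num(np.abs(probs[0] - probs[1]), nan=0.0)
    return scores, probs, classes

def fit_top_pair(X, y):
    """Return (i, j, class predicted when x_i < x_j, class predicted otherwise)."""
    scores, probs, classes = tsp_scores(X, y)
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    if probs[0][i, j] > probs[1][i, j]:
        return int(i), int(j), classes[0], classes[1]
    return int(i), int(j), classes[1], classes[0]

def predict(sample, rule):
    """Apply the pair rule to one sample; None if either value is missing."""
    i, j, if_less, otherwise = rule
    if np.isnan(sample[i]) or np.isnan(sample[j]):
        return None
    return if_less if sample[i] < sample[j] else otherwise

# Toy data: rows are cells, columns are proteins, NaN marks a missing value.
X_train = np.array([
    [1.0, 2.0, 5.0],
    [1.5, np.nan, 4.0],
    [0.5, 3.0, 6.0],
    [3.0, 1.0, 5.5],
    [np.nan, 0.5, 4.5],
    [4.0, 2.0, 6.5],
])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

rule = fit_top_pair(X_train, y_train)
# A new cell measured on a shifted intensity scale still classifies,
# because only the within-sample ordering of the pair matters.
shifted_cell = np.array([10.0, 12.0, np.nan])
label = predict(shifted_cell, rule)
```

A full pipeline would typically combine many pairs and extend the rule to more than two classes; this sketch stops at a single pair to keep the ordering logic visible.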

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|------|---------|------------------------|------------|-----------------------|
| 1 | Nature Communications | 4913 | Top 2% | 22.8% |
| 2 | Molecular & Cellular Proteomics | 158 | Top 0.1% | 18.9% |
| 3 | Journal of Proteome Research | 215 | Top 0.3% | 10.2% |
| 4 | Analytical Chemistry | 205 | Top 0.5% | 6.4% |
| 5 | Cell Systems | 167 | Top 3% | 4.4% |
| 6 | Nature Methods | 336 | Top 2% | 4.0% |
| 7 | Bioinformatics | 1061 | Top 5% | 4.0% |
| 8 | PLOS ONE | 4510 | Top 46% | 2.4% |
| 9 | PROTEOMICS | 35 | Top 0.3% | 2.1% |
| 10 | Genome Biology | 555 | Top 4% | 1.8% |
| 11 | PLOS Computational Biology | 1633 | Top 17% | 1.5% |
| 12 | Peer Community Journal | 254 | Top 3% | 1.2% |
| 13 | eLife | 5422 | Top 53% | 0.9% |
| 14 | Nature Biotechnology | 147 | Top 6% | 0.9% |
| 15 | Genomics, Proteomics & Bioinformatics | 171 | Top 5% | 0.9% |
| 16 | Nature Machine Intelligence | 61 | Top 3% | 0.8% |
| 17 | Scientific Reports | 3102 | Top 74% | 0.8% |
| 18 | Nature Chemical Biology | 104 | Top 3% | 0.8% |
| 19 | Advanced Science | 249 | Top 20% | 0.7% |
| 20 | Cell Reports Methods | 141 | Top 5% | 0.7% |
| 21 | Proceedings of the National Academy of Sciences | 2130 | Top 45% | 0.7% |
| 22 | Molecular Systems Biology | 142 | Top 2% | 0.7% |
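
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the probabilities listed for the top-ranked journals. This is a minimal sketch: the helper name `journals_to_reach` is hypothetical, and reading "50%" as "the first rank at which the running total reaches at least 50%" is an assumption about how the page defines the cutoff.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1 to 5, copied from the listing.
probabilities = [22.8, 18.9, 10.2, 6.4, 4.4]

def journals_to_reach(probs, threshold):
    """Return (rank, running total) at the first rank whose cumulative
    probability is at least `threshold`, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None

rank, total = journals_to_reach(probabilities, 50.0)
print(f"cutoff reached at rank {rank} with {total:.1f}% cumulative probability")
```

With these figures the running total is 41.7% after two journals and 51.9% after three, so the cutoff falls at rank 3.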