
A Context-Aware Single-Cell Proteomics Analysis pipeline.

Salomo Coll, C.; Makar, A. N.; Brenes, A. J.; Inns, J.; Trost, M.; Rajan, N.; Wilkinson, S.; von Kriegsheim, A.

2026-04-07 bioinformatics
10.64898/2026.04.03.716382 bioRxiv

Single-cell proteomics (SCP) by mass spectrometry can now quantify hundreds to thousands of proteins per cell, but the field still lacks standardised analytical pipelines that accommodate the diversity of instruments, sample preparation workflows and biological contexts encountered in practice. Existing workflows, largely adapted from single-cell transcriptomics, do not account for the informative missingness, pervasive ambient protein contamination and limited feature space that distinguish proteomic from transcriptomic data. In addition, cell type annotation remains a manual bottleneck that is subjective, difficult to reproduce and hard to scale. Here we present an end-to-end pipeline that integrates adaptive quality control, entropy-guided iterative batch correction, multi-modal marker discovery that exploits detection patterns unique to proteomics, and context-aware annotation by large language models (LLMs) coupled to structured contradiction reasoning and orthogonal data-driven validation. Benchmarking on published single-cell proteomic datasets from developing human brain and glioblastoma-associated neutrophils revealed systematic LLM failure modes, including context-insensitive marker vocabulary and misinterpretation of phagocytic or lytic cell states. We addressed these errors using a three-round prompt architecture that combines general biological principles with auto-generated dataset-specific constraints. In held-out validation on a skin tumour dataset, the pipeline showed high concordance with FACS-sorted ground truth. In the caerulein-injured pancreas, orthogonal immunohistochemistry further supported annotations of macrophage, stellate and immune populations.
The pipeline is fully automated under fixed settings and is available as Context-Aware Single-Cell Proteomics Analysis (CASPA), providing SCP laboratories and facilities with a reproducible workflow that delivers interpretable, confidence-quantified annotations suitable for downstream expert review.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 1%, 28.7%
2. Molecular & Cellular Proteomics: 158 papers in training set, Top 0.1%, 13.2%
3. Nature Methods: 336 papers in training set, Top 0.9%, 10.5%
50% of probability mass above
4. Journal of Proteome Research: 215 papers in training set, Top 0.5%, 6.5%
5. Cell Systems: 167 papers in training set, Top 3%, 3.7%
6. Nature Biotechnology: 147 papers in training set, Top 3%, 2.8%
7. Bioinformatics: 1061 papers in training set, Top 6%, 2.8%
8. Analytical Chemistry: 205 papers in training set, Top 1%, 2.2%
9. Genome Biology: 555 papers in training set, Top 3%, 2.2%
10. Nature Machine Intelligence: 61 papers in training set, Top 2%, 2.0%
11. Advanced Science: 249 papers in training set, Top 10%, 1.8%
12. PLOS ONE: 4510 papers in training set, Top 52%, 1.8%
13. Peer Community Journal: 254 papers in training set, Top 2%, 1.5%
14. Nature Chemical Biology: 104 papers in training set, Top 2%, 1.5%
15. Cell Reports Methods: 141 papers in training set, Top 4%, 1.0%
16. Molecular Systems Biology: 142 papers in training set, Top 1%, 0.8%
17. PROTEOMICS: 35 papers in training set, Top 0.6%, 0.8%
18. Alzheimer's & Dementia: 143 papers in training set, Top 3%, 0.8%
19. PLOS Computational Biology: 1633 papers in training set, Top 24%, 0.8%
20. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 7%, 0.7%
21. Nucleic Acids Research: 1128 papers in training set, Top 20%, 0.5%
22. Communications Biology: 886 papers in training set, Top 31%, 0.5%
23. Communications Chemistry: 39 papers in training set, Top 2%, 0.5%
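
The 50% cutoff stated for the top 3 journals can be checked by accumulating the listed probabilities in rank order. The sketch below is only an illustration of that arithmetic: the function name `journals_to_reach` is hypothetical, and the values are copied from the first five entries of the list above.

```python
# Predicted probabilities (%) for the five top-ranked journals,
# copied from the list above. The helper name is hypothetical.
ranked = [
    ("Nature Communications", 28.7),
    ("Molecular & Cellular Proteomics", 13.2),
    ("Nature Methods", 10.5),
    ("Journal of Proteome Research", 6.5),
    ("Cell Systems", 3.7),
]


def journals_to_reach(threshold, ranked_probs):
    """Return (count, cumulative mass) for the shortest ranked prefix
    whose summed probability reaches the threshold."""
    cumulative = 0.0
    for count, (_, prob) in enumerate(ranked_probs, start=1):
        cumulative += prob
        if cumulative >= threshold:
            return count, round(cumulative, 1)
    # Threshold not reached within the supplied entries.
    return len(ranked_probs), round(cumulative, 1)


count, mass = journals_to_reach(50.0, ranked)
print(count, mass)  # 3 52.4
```

The first three entries sum to 52.4%, so the marker sits at the first rank where the cumulative mass crosses 50% rather than at exactly 50%.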