
Benchmarking single-cell foundation models for real-world RNA-seq data integration

Han, S.; Sztanka-Toth, T.; Senel, E.; Elnaggar, A.; Patel, J.; Mansi, T.; Smirnov, D.; Greshock, J.; Javidi, A.

2026-04-21 bioinformatics
10.64898/2026.04.17.719314 bioRxiv

Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented rankings to summarize metric trade-offs and quantify performance consistency across datasets and evaluation settings. Our findings show that fine-tuning improved technical correction performance; among the foundation models, fine-tuned scGPT_CP performed best. However, the baseline scVI was the top overall performer, ranking first by our multi-metric Leximax ranking and achieving the highest Pareto Front-1 hit. Collectively, our study provides practical insights for adapting foundation models to real-world drug design and development.
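
The abstract names a Pareto front and a Leximax ranking without defining them. The sketch below illustrates the textbook versions of both ideas and is not the authors' implementation: it assumes higher metric scores are better, that "Front-1" means the set of non-dominated methods, and that Leximax compares methods by their worst per-metric rank first. The method names and numbers are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Methods that no other method dominates (assumed meaning of 'Front-1')."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


def leximax_order(ranks: Dict[str, List[int]]) -> List[str]:
    """Order methods by worst rank, then second-worst rank, and so on (lower is better)."""
    return sorted(ranks, key=lambda name: sorted(ranks[name], reverse=True))


# Hypothetical scores: [technical correction, biological preservation].
scores = {
    "method_a": [0.9, 0.6],
    "method_b": [0.7, 0.8],
    "method_c": [0.6, 0.5],
}
# Hypothetical per-metric ranks (1 = best) across three metrics.
ranks = {
    "method_a": [1, 3, 2],
    "method_b": [2, 2, 2],
    "method_c": [3, 1, 3],
}

front = pareto_front(scores)
order = leximax_order(ranks)
print(front)
print(order)
```

In this toy example, method_c is dominated by method_a and drops off the front, while the worst-rank-first ordering favours method_b because it is never ranked last on any metric.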

Matching journals

The top 8 journals account for 50% of the predicted probability mass.
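
The 50% cutoff can be checked by accumulating the listed probabilities in rank order. This is a minimal sketch, assuming the cutoff is the first rank at which the running total reaches 50%; the function name `journals_to_reach` is hypothetical, and the percentages are copied from the ten highest-ranked journals in the listing.

```python
# Predicted probabilities (%) for ranks 1-10, copied from the listing.
probs = [12.0, 8.9, 6.2, 6.2, 4.7, 4.7, 4.2, 3.9, 3.9, 3.5]


def journals_to_reach(probs, threshold=50.0):
    """Return (rank, running total) at the first rank where the total meets the threshold."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total  # threshold never reached with the values given


k, mass = journals_to_reach(probs)
print(k, round(mass, 1))
```

The running total is 46.9% after seven journals and passes 50% at the eighth.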

Rank | Journal | Papers in training set | Percentile | Probability
1 | Briefings in Bioinformatics | 326 | Top 0.3% | 12.0%
2 | Nature Communications | 4913 | Top 21% | 8.9%
3 | Advanced Science | 249 | Top 3% | 6.2%
4 | Bioinformatics | 1061 | Top 4% | 6.2%
5 | Nature Biotechnology | 147 | Top 2% | 4.7%
6 | Nucleic Acids Research | 1128 | Top 4% | 4.7%
7 | Genomics, Proteomics & Bioinformatics | 171 | Top 1% | 4.2%
8 | Nature Methods | 336 | Top 3% | 3.9%
(50% of probability mass above)
9 | Cell Systems | 167 | Top 3% | 3.9%
10 | Journal of Chemical Information and Modeling | 207 | Top 1% | 3.5%
11 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 3.2%
12 | PLOS Computational Biology | 1633 | Top 12% | 2.8%
13 | Nature Machine Intelligence | 61 | Top 1% | 2.7%
14 | Genome Medicine | 154 | Top 3% | 2.4%
15 | Bioinformatics Advances | 184 | Top 2% | 2.0%
16 | PLOS ONE | 4510 | Top 55% | 1.7%
17 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.7%
18 | Patterns | 70 | Top 1.0% | 1.7%
19 | GigaScience | 172 | Top 1% | 1.7%
20 | Genome Biology | 555 | Top 6% | 1.2%
21 | BMC Bioinformatics | 383 | Top 6% | 0.9%
22 | Scientific Reports | 3102 | Top 74% | 0.8%
23 | npj Systems Biology and Applications | 99 | Top 2% | 0.8%
24 | Communications Biology | 886 | Top 22% | 0.8%
25 | npj Digital Medicine | 97 | Top 3% | 0.8%
26 | Nature Biomedical Engineering | 42 | Top 2% | 0.8%
27 | iScience | 1063 | Top 34% | 0.7%
28 | Cell Reports Methods | 141 | Top 5% | 0.7%
29 | Cell Genomics | 162 | Top 7% | 0.7%
30 | Communications Chemistry | 39 | Top 2% | 0.6%