Transcriptome-based cell type assignment for kidney cell culture models

Schobert, M.; Boehm, S.; Borisov, O.; Li, Y.; Greve, G.; Edemir, B.; Woodward, O. M.; Jung, H. J.; Koettgen, M. M.; Westermann, L.; Schlosser, P.; Hutter, F.; Kottgen, A.; Haug, S.

2026-04-01 bioinformatics
10.64898/2026.03.30.715265 bioRxiv
Background: Kidney cell lines are widely used to model kidney physiology and disease; however, their gene expression profiles may differ from primary cells due to immortalization, culture conditions, or experimental treatments. Determining whether a cell line resembles its native cell type is critical for interpreting in vitro findings. We developed a transcriptome-based approach that matches bulk RNA-seq data from kidney cell lines, primary cells, or tissues to reference cell types derived from single-cell RNA-seq (scRNA-seq) datasets.

Methods: Reference transcriptomic profiles were generated from two human and two murine kidney scRNA-seq datasets by pseudobulk aggregation. Bulk RNA-seq data from microdissected kidney tissue, non-kidney negative controls, and kidney cell lines were matched to these references using three statistical similarity measures (Spearman correlation, Euclidean distance, Poisson distance) and three machine learning classifiers (Random Forest, XGBoost, TabPFN). Each was assessed with global gene expression, curated kidney marker gene lists, and the most variable genes. Matching accuracy was evaluated through a three-step validation strategy: within-dataset matching, cross-reference comparison, and validation against primary kidney tissue and negative controls.

Results: Gene expression rank-based Spearman correlation and TabPFN, a foundation model for tabular data, emerged as the most accurate and specific approaches, particularly with curated kidney marker gene lists. Both methods correctly identified microdissected kidney tubule segments and were robust against non-kidney negative controls. Applied to commonly used kidney cell lines, OK cells retained proximal tubule identity, particularly under shear stress, while other proximal tubule lines (HK-2, HKC-8, HKC-11) showed inconsistent matching. Collecting duct-derived mIMCD-3 maintained stable similarity across passages, culture conditions, and genetic modifications.

Conclusion: We provide two complementary implementations: CellMatchR, an accessible web-based tool using Spearman correlation for routine use, and comprehensive scripts for TabPFN-based matching (link will be added after peer-reviewed publication). Together, these resources enable researchers to make informed decisions about kidney cell culture model selection, interpretation, and stability.

Translational Statement: Kidney cell lines are fundamental tools in nephrology research, yet their transcriptomic similarity to native cell types is rarely validated systematically. We demonstrate that combining bulk RNA-seq data with single-cell reference datasets enables robust assessment of cell line identity using gene expression rank-based correlation and machine learning approaches. By providing a comprehensive evaluation of matching methods, curated kidney marker gene lists, and reference datasets, our study serves as both a practical resource and a methodological framework for the kidney research community, facilitating informed selection of cell culture models, quality control of experimental conditions, development of new experimental cell culture models, and more reliable translation of in vitro findings to kidney physiology and disease.
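The matching idea described in the abstract (pseudobulk reference profiles, then rank-based Spearman correlation against a bulk sample) can be illustrated with a minimal sketch. This is not the authors' implementation and may differ from what CellMatchR does: the function names, the four-gene panel, the two cell type labels, and all counts below are hypothetical toy values chosen only to show the mechanics, and the sketch skips normalization and marker gene curation entirely.

```python
from math import sqrt

def pseudobulk(cells, labels):
    """Sum per-cell count vectors within each cell type label."""
    sums = {}
    for counts, label in zip(cells, labels):
        if label not in sums:
            sums[label] = list(counts)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], counts)]
    return sums

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:  # constant profile: correlation undefined
        return 0.0
    return cov / sqrt(vx * vy)

def match(bulk, references):
    """Rank reference cell types by Spearman correlation with the bulk sample."""
    scores = {name: spearman(bulk, prof) for name, prof in references.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data; gene order: LRP2, SLC34A1, AQP2, UMOD
cells = [[9, 7, 0, 1], [11, 5, 1, 0], [0, 1, 12, 2], [1, 0, 8, 3]]
labels = ["PT", "PT", "CD", "CD"]  # illustrative labels only
references = pseudobulk(cells, labels)

bulk = [150, 90, 5, 10]  # hypothetical bulk RNA-seq counts for one sample
ranking = match(bulk, references)
print(ranking[0][0])  # best-matching reference label
```

Because only ranks enter the score, the bulk sample and the reference profiles do not need to be on the same scale, which is one plausible reason a rank-based measure suits comparisons between bulk and pseudobulk data.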

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS ONE | 4510 papers in training set | Top 13% | 14.5%
2. BMC Nephrology | 13 papers in training set | Top 0.1% | 12.6%
3. Kidney360 | 22 papers in training set | Top 0.1% | 12.5%
4. Scientific Reports | 3102 papers in training set | Top 23% | 4.9%
5. Journal of the American Society of Nephrology | 52 papers in training set | Top 0.3% | 3.6%
6. JCI Insight | 241 papers in training set | Top 1% | 3.6%
50% of probability mass above
7. Frontiers in Physiology | 93 papers in training set | Top 2% | 2.6%
8. Bioinformatics | 1061 papers in training set | Top 6% | 2.4%
9. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
10. PLOS Computational Biology | 1633 papers in training set | Top 16% | 1.7%
11. Wellcome Open Research | 57 papers in training set | Top 1% | 1.5%
12. Bioinformatics Advances | 184 papers in training set | Top 3% | 1.5%
13. Biology Methods and Protocols | 53 papers in training set | Top 1% | 1.3%
14. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
15. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.3%
16. Cytometry Part A | 30 papers in training set | Top 0.2% | 1.3%
17. Frontiers in Pharmacology | 100 papers in training set | Top 3% | 1.2%
18. Journal of Translational Medicine | 46 papers in training set | Top 2% | 1.0%
19. Transplantation | 13 papers in training set | Top 0.3% | 0.9%
20. Frontiers in Veterinary Science | 30 papers in training set | Top 0.8% | 0.8%
21. Nature Communications | 4913 papers in training set | Top 61% | 0.8%
22. Frontiers in Genetics | 197 papers in training set | Top 9% | 0.8%
23. Kidney International | 25 papers in training set | Top 0.3% | 0.8%
24. Database | 51 papers in training set | Top 0.9% | 0.8%
25. BMC Genomics | 328 papers in training set | Top 6% | 0.8%
26. Journal of Immunological Methods | 24 papers in training set | Top 0.2% | 0.8%
27. PeerJ | 261 papers in training set | Top 15% | 0.8%
28. American Journal of Physiology-Renal Physiology | 25 papers in training set | Top 0.3% | 0.7%
29. iScience | 1063 papers in training set | Top 37% | 0.7%
30. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 3% | 0.7%