Large Scale Foundation Model on Single-cell Transcriptomics

Hao, M.; Gong, J.; Zeng, X.; Liu, C.; Guo, Y.; Cheng, X.; Wang, T.; Ma, J.; Song, L.; Zhang, X.

2023-05-31 bioinformatics
10.1101/2023.05.29.542705 bioRxiv
Large-scale pretrained models have become foundation models leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science for deciphering the "languages" of cells and facilitating biomedical research is promising yet challenging. We developed a large-scale pretrained model scFoundation with 100M parameters for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomics data, which contain high-throughput observations on the complex molecular features in all known types of cells. scFoundation is currently the largest model in terms of the size of trainable parameters, dimensionality of genes and the number of cells used in the pre-training. Experiments showed that scFoundation can serve as a foundation model for single-cell transcriptomics and achieve state-of-the-art performances in a diverse array of downstream tasks, such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, and single-cell perturbation prediction.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Machine Intelligence; 61 papers in training set; Top 0.1%; 23.2%
2. Nature Communications; 4913 papers in training set; Top 9%; 15.2%
3. Advanced Science; 249 papers in training set; Top 2%; 6.6%
4. Genome Medicine; 154 papers in training set; Top 1%; 4.7%
5. Briefings in Bioinformatics; 326 papers in training set; Top 1%; 4.5%
50% of probability mass above
6. Genome Biology; 555 papers in training set; Top 2%; 4.1%
7. Nucleic Acids Research; 1128 papers in training set; Top 5%; 3.7%
8. Bioinformatics; 1061 papers in training set; Top 6%; 2.4%
9. Nature Methods; 336 papers in training set; Top 4%; 2.1%
10. Cell Systems; 167 papers in training set; Top 6%; 1.9%
11. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 3%; 1.7%
12. Communications Biology; 886 papers in training set; Top 8%; 1.7%
13. Genome Research; 409 papers in training set; Top 2%; 1.7%
14. Patterns; 70 papers in training set; Top 1%; 1.3%
15. Frontiers in Genetics; 197 papers in training set; Top 7%; 1.1%
16. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 39%; 1.0%
17. Scientific Reports; 3102 papers in training set; Top 69%; 1.0%
18. iScience; 1063 papers in training set; Top 25%; 0.9%
19. Cell; 370 papers in training set; Top 16%; 0.8%
20. Cell Genomics; 162 papers in training set; Top 6%; 0.8%
21. BMC Bioinformatics; 383 papers in training set; Top 7%; 0.8%
22. Nature Biotechnology; 147 papers in training set; Top 7%; 0.8%
23. Science Advances; 1098 papers in training set; Top 30%; 0.7%
24. Cell Research; 49 papers in training set; Top 3%; 0.7%
25. Nature; 575 papers in training set; Top 17%; 0.7%
26. Journal of Genetics and Genomics; 36 papers in training set; Top 3%; 0.7%
27. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 10%; 0.7%
28. PLOS ONE; 4510 papers in training set; Top 73%; 0.5%
29. Bioinformatics Advances; 184 papers in training set; Top 5%; 0.5%
30. npj Systems Biology and Applications; 99 papers in training set; Top 3%; 0.5%
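
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch, not the site's own method: the helper name `journals_to_reach` is hypothetical, and the values are the percentages of the six top-ranked journals copied from the list.

```python
# Predicted probabilities (in percent) of the six top-ranked journals,
# copied from the list above, in rank order.
top_probabilities = [
    ("Nature Machine Intelligence", 23.2),
    ("Nature Communications", 15.2),
    ("Advanced Science", 6.6),
    ("Genome Medicine", 4.7),
    ("Briefings in Bioinformatics", 4.5),
    ("Genome Biology", 4.1),
]


def journals_to_reach(rows, threshold):
    """Hypothetical helper: return (count, cumulative) for the shortest
    rank-ordered prefix whose cumulative probability reaches `threshold`.
    Returns (None, total) if the threshold is never reached."""
    cumulative = 0.0
    for count, (_, probability) in enumerate(rows, start=1):
        cumulative += probability
        if cumulative >= threshold:
            return count, cumulative
    return None, cumulative


count, cumulative = journals_to_reach(top_probabilities, 50.0)
print(count, round(cumulative, 1))  # prints: 5 54.2
```

The first four journals sum to 49.7%, just short of the threshold, so the fifth is needed to cross it; the top five together hold 54.2%.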