The Paipu framework enables creation of a large-scale mammalian cancer transcriptomics atlas

Smith, B. S.; Smith, L. A.; Lee, J.-H.; Cahill, J. A.; Graim, K.

2026-05-18 bioinformatics
10.64898/2026.05.14.725161 bioRxiv

Many studies have identified shared molecular mechanisms involved in tumor development across humans and other mammalian species. While these two-species analyses advance understanding of human disease, extending them across many species would provide evolutionary insight into the molecular mechanisms driving human cancers. However, this expansion requires knowledge transfer and harmonization across species. Genomic differences between species, including variation in genome annotation quality, have historically hindered the creation of large-scale multi-species atlases. To overcome these challenges, we present Paipu, a comprehensive pipeline designed to streamline querying, preprocessing, harmonization, and retrieval of large-scale RNA-seq data and associated metadata from the NCBI Sequence Read Archive (SRA). Paipu facilitates multi-species analysis by creating a harmonized atlas from user-defined search terms and species. It consists of three components: reference genome preparation, SRA metadata retrieval, and RNA-seq data processing. We apply Paipu to 188 cancer-related terms in 239 non-human mammalian species, creating a harmonized atlas of 3,484 RNA-seq samples spanning 17 species and 35 cancers. This pan-mammalian pan-cancer atlas enables a wide range of comparative genomics analyses that leverage genetic variation to better understand rare human cancers. As such, Paipu serves as a resource for cross-species cancer genomics and supports atlas creation for any set of species and search terms.
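
To make the idea of "user-defined search terms and species" concrete, the following is a minimal sketch of how such inputs might be expanded into SRA search strings. It is not taken from Paipu: the function name `build_sra_queries`, the example terms and species, the one-query-per-pair design, and the choice of field tags (`[All Fields]`, `[Organism]`, `[Strategy]`) are all assumptions made for illustration.

```python
from itertools import product


def build_sra_queries(terms, species):
    """Return one Entrez-style search string per (term, species) pair.

    Hypothetical helper: the field tags used here are an assumption about
    how a term/species search could be phrased, not Paipu's actual queries.
    """
    queries = []
    for term, organism in product(terms, species):
        queries.append(
            f'"{term}"[All Fields] AND "{organism}"[Organism] '
            f'AND "rna seq"[Strategy]'
        )
    return queries


# Illustrative inputs only; the real term and species lists are not given here.
terms = ["lymphoma", "osteosarcoma"]
species = ["Canis lupus familiaris", "Felis catus", "Equus caballus"]

queries = build_sra_queries(terms, species)
print(len(queries))  # number of term-species pairs
print(queries[0])
```

Under this one-query-per-pair assumption, the 188 terms and 239 species mentioned in the abstract would correspond to 188 × 239 = 44,932 pairs, which suggests why automated querying and harmonization matter at this scale.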

Matching journals

The top 5 journals are the smallest set whose combined predicted probability reaches 50% (55.1% in total).

1. Bioinformatics | 1061 papers in training set | Top 1% | 22.1%
2. Briefings in Bioinformatics | 326 papers in training set | Top 0.4% | 9.9%
3. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 0.6% | 9.9%
4. BMC Bioinformatics | 383 papers in training set | Top 1% | 7.0%
5. Genome Medicine | 154 papers in training set | Top 1% | 6.2%

50% of probability mass above

6. Genome Biology | 555 papers in training set | Top 2% | 4.8%
7. Bioinformatics Advances | 184 papers in training set | Top 0.8% | 4.8%
8. Nucleic Acids Research | 1128 papers in training set | Top 5% | 3.9%
9. GigaScience | 172 papers in training set | Top 0.4% | 3.9%
10. Nature Communications | 4913 papers in training set | Top 44% | 2.8%
11. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.7%
12. Advanced Science | 249 papers in training set | Top 9% | 2.0%
13. Cell Genomics | 162 papers in training set | Top 3% | 1.8%
14. Nature Methods | 336 papers in training set | Top 4% | 1.7%
15. Nature Biotechnology | 147 papers in training set | Top 6% | 1.3%
16. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.3%
17. Database | 51 papers in training set | Top 0.6% | 1.2%
18. PLOS Computational Biology | 1633 papers in training set | Top 21% | 1.1%
19. Journal of Genetics and Genomics | 36 papers in training set | Top 2% | 0.9%
20. Molecular Plant | 36 papers in training set | Top 1% | 0.9%
21. Scientific Reports | 3102 papers in training set | Top 73% | 0.8%
22. iScience | 1063 papers in training set | Top 30% | 0.8%
23. Frontiers in Genetics | 197 papers in training set | Top 10% | 0.7%
24. Plant Communications | 35 papers in training set | Top 2% | 0.6%
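
The 50% cutoff in the ranking above can be checked with a running total of the listed probabilities. The sketch below is only an illustration of that arithmetic; the function name `journals_to_reach` is hypothetical, and the input is just the first seven probabilities copied from the list.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (count, total): how many top-ranked entries are needed for the
    running total to reach the threshold, and the total at that point.

    Returns (None, total) if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


# Predicted probabilities (in percent) for ranks 1-7, copied from the list.
top_probabilities = [22.1, 9.9, 9.9, 7.0, 6.2, 4.8, 4.8]

count, total = journals_to_reach(top_probabilities)
print(count, round(total, 1))
```

With these values the running total is about 48.9% after four journals and about 55.1% after five, so the cutoff falls after the fifth entry.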