Basal Contamination of Sequencing: Lessons from the GTEx dataset

Nieuwenhuis, T. O.; Yang, S.; Verma, R. X.; Pillalamarri, V.; Arking, D.; Rosenberg, A. Z.; McCall, M. N.; Halushka, M. K.

2020-01-02 genomics
10.1101/602367 bioRxiv
One of the challenges of next-generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata, along with RNA-Seq data from other studies for validation. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p=9.5e-237 and p=5e-260, respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination (defined as ≥10 PRSS1 TPM) affected 4,497 (39.6%) samples. It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note that this type of contamination occurs widely, affecting both bulk and single-cell dataset analyses. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
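
The low-level contamination criterion in the abstract (at least 10 PRSS1 TPM) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the function name `flag_low_level_contamination`, the toy values, and the choice to skip pancreas samples (where PRSS1 is natively expressed) are all assumptions made for illustration.

```python
from typing import Dict, List, Union

# Threshold taken from the abstract: low-level contamination is >= 10 PRSS1 TPM.
PRSS1_TPM_THRESHOLD = 10.0

Sample = Dict[str, Union[str, float]]


def flag_low_level_contamination(
    samples: List[Sample],
    native_tissue: str = "Pancreas",
    threshold: float = PRSS1_TPM_THRESHOLD,
) -> List[str]:
    """Return IDs of non-native samples whose PRSS1 TPM meets the threshold.

    Skipping the native tissue is an assumption of this sketch; the abstract
    does not say how pancreas samples were handled when counting.
    """
    return [
        str(s["sample_id"])
        for s in samples
        if s["tissue"] != native_tissue and float(s["prss1_tpm"]) >= threshold
    ]


# Hypothetical toy records, not GTEx data.
toy_samples: List[Sample] = [
    {"sample_id": "S1", "tissue": "Pancreas", "prss1_tpm": 25000.0},
    {"sample_id": "S2", "tissue": "Lung", "prss1_tpm": 12.4},
    {"sample_id": "S3", "tissue": "Lung", "prss1_tpm": 0.3},
    {"sample_id": "S4", "tissue": "Heart", "prss1_tpm": 10.0},
]

flagged = flag_low_level_contamination(toy_samples)
non_native_total = sum(1 for s in toy_samples if s["tissue"] != "Pancreas")
fraction_flagged = len(flagged) / non_native_total
print(flagged, round(fraction_flagged, 3))
```

A fixed TPM cutoff like this only flags candidate samples; the abstract's stronger evidence comes from same-day sequencing associations and genotype mismatches, which this sketch does not attempt to reproduce.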

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. BMC Genomics | 328 papers in training set | Top 0.1% | 37.3%
2. Scientific Reports | 3102 papers in training set | Top 8% | 9.0%
3. PLOS ONE | 4510 papers in training set | Top 22% | 8.3%
(50% of probability mass above)
4. Genome Biology | 555 papers in training set | Top 2% | 4.8%
5. Genome Medicine | 154 papers in training set | Top 2% | 3.5%
6. Microbial Genomics | 204 papers in training set | Top 0.6% | 3.5%
7. GigaScience | 172 papers in training set | Top 0.7% | 2.7%
8. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.6%
9. Frontiers in Genetics | 197 papers in training set | Top 5% | 1.7%
10. Genome Research | 409 papers in training set | Top 3% | 1.5%
11. PLOS Computational Biology | 1633 papers in training set | Top 18% | 1.5%
12. F1000Research | 79 papers in training set | Top 2% | 1.3%
13. PeerJ | 261 papers in training set | Top 10% | 1.2%
14. BMC Bioinformatics | 383 papers in training set | Top 6% | 1.2%
15. Communications Biology | 886 papers in training set | Top 16% | 1.1%
16. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 0.9%
17. Nature Biotechnology | 147 papers in training set | Top 7% | 0.9%
18. Cell Genomics | 162 papers in training set | Top 7% | 0.7%
19. The American Journal of Human Genetics | 206 papers in training set | Top 4% | 0.7%
20. npj Genomic Medicine | 33 papers in training set | Top 1% | 0.7%
21. International Journal of Molecular Sciences | 453 papers in training set | Top 17% | 0.7%
22. BMC Biology | 248 papers in training set | Top 6% | 0.6%
23. PLOS Genetics | 756 papers in training set | Top 17% | 0.6%
24. Cancer Research Communications | 46 papers in training set | Top 2% | 0.6%
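
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below is only an arithmetic illustration; the helper name `journals_to_reach_mass` is hypothetical, and the input is the first five listed probabilities.

```python
from typing import List, Tuple


def journals_to_reach_mass(
    probabilities: List[float], target: float = 50.0
) -> Tuple[int, float]:
    """Count how many leading entries are needed for the running sum to reach target."""
    total = 0.0
    for count, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= target:
            return count, total
    # Target not reached: report everything that was summed.
    return len(probabilities), total


# Predicted probabilities (percent) of the five highest-ranked journals in the list.
top_probabilities = [37.3, 9.0, 8.3, 4.8, 3.5]

count, mass = journals_to_reach_mass(top_probabilities)
print(count, round(mass, 1))
```

With these values the first two journals sum to 46.3%, so the third is the one that carries the running total past 50%.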