
Bio informatics: Integrate negative controls to get the good data

van Nues, R. W.

2024-10-09 bioinformatics
10.1101/2024.10.08.617225 bioRxiv

High-throughput datasets, like any experimental output, can be full of noise. Negative controls, i.e. mock experiments not providing information concerning the biological system under study, visualise background. Overlooking this training set of wrong examples in publicly available datasets can seriously undermine validity of bioinformatics analyses. We present a program, COALISPR, for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results. This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms. We have re-analysed small RNA datasets for mouse and fungus Cryptococcus neoformans, leading to consistent identification of miRNAs and of fungal transcripts targeted by siRNAs. Cryptococcal Argonautes are directed to spliced transcripts indicating that RNAi must be triggered by events downstream of intron removal. Negative control datasets contain large amounts of ribosomal RNA (rRNA) fragments (rRFs). These differ from small RNAs associated with RNAi, making a biological role for rRFs in association with Argonautes unlikely. Background signals enabled identification of cryptococcal genes for RNase P, U1 snRNA, 37 H/ACA and 63 Box C/D snoRNAs, including U3 and U14 essential for pre-rRNA processing. To gain meaning, high-throughput RNA-Seq analyses need to incorporate negative data.
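
The general idea of using a negative control to separate specific signal from background can be illustrated with a minimal sketch. This is not COALISPR's actual algorithm or interface: the function name `specific_regions`, the fold-enrichment rule, the `min_fold` and `pseudocount` parameters, and the example coordinates and counts are all assumptions made up for illustration.

```python
def specific_regions(experiment, control, min_fold=4.0, pseudocount=1.0):
    """Return regions whose read count in the experiment exceeds the
    negative-control count by at least `min_fold` (hypothetical rule)."""
    kept = []
    for region, exp_count in experiment.items():
        # Regions absent from the control are treated as having zero reads.
        ctrl_count = control.get(region, 0)
        # The pseudocount avoids division by zero for such regions.
        fold = (exp_count + pseudocount) / (ctrl_count + pseudocount)
        if fold >= min_fold:
            kept.append(region)
    return sorted(kept)


# Invented read counts per mapped region: (chromosome, start, end) -> reads.
experiment = {
    ("chr1", 100, 122): 250,
    ("chr1", 900, 960): 400,
    ("chr2", 50, 72): 30,
}
control = {
    ("chr1", 900, 960): 380,  # about equally abundant in the mock: background
    ("chr2", 50, 72): 2,
}

print(specific_regions(experiment, control))
```

In this toy input, the region that is about as abundant in the mock library as in the experiment is discarded, while the two regions enriched over the control are kept.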

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.1%, 23.0%
2. BMC Bioinformatics: 383 papers in training set, Top 0.6%, 12.9%
3. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.1%, 10.3%
4. Bioinformatics: 1061 papers in training set, Top 4%, 6.5%
50% of probability mass above
5. Nucleic Acids Research: 1128 papers in training set, Top 3%, 6.5%
6. PLOS Computational Biology: 1633 papers in training set, Top 7%, 4.9%
7. RNA: 169 papers in training set, Top 0.1%, 4.0%
8. Bioinformatics Advances: 184 papers in training set, Top 1%, 3.7%
9. PLOS ONE: 4510 papers in training set, Top 42%, 3.1%
10. GigaScience: 172 papers in training set, Top 0.7%, 2.9%
11. PeerJ: 261 papers in training set, Top 7%, 1.7%
12. Genome Biology: 555 papers in training set, Top 5%, 1.5%
13. BMC Genomics: 328 papers in training set, Top 3%, 1.4%
14. Frontiers in Genetics: 197 papers in training set, Top 7%, 1.2%
15. Molecular Biology and Evolution: 488 papers in training set, Top 3%, 1.1%
16. RNA Biology: 70 papers in training set, Top 0.4%, 1.0%
17. Journal of Open Source Software: 22 papers in training set, Top 0.2%, 0.9%
18. BMC Biology: 248 papers in training set, Top 3%, 0.8%
19. Scientific Reports: 3102 papers in training set, Top 76%, 0.7%
20. Cell Reports Methods: 141 papers in training set, Top 5%, 0.7%
21. Journal of Molecular Biology: 217 papers in training set, Top 5%, 0.5%
22. Journal of Genetics and Genomics: 36 papers in training set, Top 3%, 0.5%
23. GENETICS: 189 papers in training set, Top 2%, 0.5%
24. iScience: 1063 papers in training set, Top 40%, 0.5%
25. Genetics: 225 papers in training set, Top 5%, 0.5%