Bioinformatics: Integrate negative controls to get the good data
van Nues, R. W.
High-throughput datasets, like any experimental output, can be full of noise. Negative controls, i.e. mock experiments that provide no information about the biological system under study, visualise background. Overlooking this training set of wrong examples in publicly available datasets can seriously undermine the validity of bioinformatics analyses. We present a program, COALISPR, for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results. This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms. We have re-analysed small RNA datasets for mouse and the fungus Cryptococcus neoformans, leading to consistent identification of miRNAs and of fungal transcripts targeted by siRNAs. Cryptococcal Argonautes are directed to spliced transcripts, indicating that RNAi must be triggered by events downstream of intron removal. Negative control datasets contain large amounts of ribosomal RNA (rRNA) fragments (rRFs). These differ from small RNAs associated with RNAi, making a biological role for rRFs in association with Argonautes unlikely. Background signals enabled identification of cryptococcal genes for RNase P, U1 snRNA, 37 H/ACA and 63 Box C/D snoRNAs, including U3 and U14, which are essential for pre-rRNA processing. To gain meaning, high-throughput RNA-Seq analyses need to incorporate negative data.
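The general idea of using a negative control to separate specific signal from background can be illustrated with a minimal sketch. It is not COALISPR's implementation: the function names, the fixed 50-nt bins, the pseudocount and the 10-fold threshold are all assumptions chosen for illustration, and the read positions are made up.

```python
from collections import Counter

def count_reads(read_starts, bin_size=50):
    """Count reads per (chromosome, bin) from (chromosome, position) tuples."""
    return Counter((chrom, pos // bin_size) for chrom, pos in read_starts)

def specific_bins(sample, control, min_fold=10.0, pseudocount=1.0):
    """Return bins where the sample exceeds the negative control by min_fold.

    The pseudocount avoids division by zero for bins absent from the control.
    Both the threshold and the pseudocount are illustrative assumptions.
    """
    kept = []
    for key, n in sample.items():
        fold = (n + pseudocount) / (control.get(key, 0) + pseudocount)
        if fold >= min_fold:
            kept.append(key)
    return sorted(kept)

# Toy data: one locus seen only in the sample, one seen in both libraries.
sample_reads = [("chr1", 120)] * 40 + [("chr1", 910)] * 30
control_reads = [("chr1", 905)] * 25

sample = count_reads(sample_reads)
control = count_reads(control_reads)
print(specific_bins(sample, control))  # [('chr1', 2)]
```

In this toy case the locus near position 910 is discarded because the mock library covers it almost as deeply as the sample, which is the kind of background a negative control makes visible.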
Matching journals
The top 4 journals account for 50% of the predicted probability mass.