Reproducible and accessible analysis of transposon insertion data at scale
Lariviere, D.; Wickham, L.; Keiler, K. C.; Nekrutenko, A.
Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research, yet the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such "problem area" is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows perturbing the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on survival and growth provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data, and among the variety of options available it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of the optimal tool parameters for the analysis and the removal of contamination. Our work provides an assessment of the currently available tools for TIS data analysis and offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform (https://usegalaxy.org). To lower the entry barriers, we have also developed interactive tutorials explaining details of TIS data analysis procedures at https://bit.ly/gxy-tis.
Importance
A wide array of tools has been developed to analyze TIS data, and among the variety of options available it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of the optimal tool parameters for the analysis and the removal of contamination. Our work democratizes TIS data analysis by providing open workflows supported by public computational infrastructure.