Reproducible and accessible analysis of transposon insertion data at scale

Lariviere, D.; Wickham, L.; Keiler, K. C.; Nekrutenko, A.

2020-05-20 microbiology
10.1101/2020.05.19.105429 bioRxiv
Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research, yet the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such "problem area" is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows perturbing the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on survival and growth provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data, and among the variety of options available it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of the optimal tool parameters for the analysis and removal of contamination. Our work provides an assessment of the currently available tools for TIS data analysis and offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform (https://usegalaxy.org). To lower the entry barriers, we have also developed interactive tutorials explaining details of TIS data analysis procedures at https://bit.ly/gxy-tis.

Importance

A wide array of tools has been developed to analyze TIS data, and among the variety of options available it is often difficult to identify which one can provide a reliable and reproducible analysis. Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures such as the determination of the optimal tool parameters for the analysis and removal of contamination. Our work democratizes TIS data analysis by providing open workflows supported by public computational infrastructure.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. GigaScience: 172 papers in training set; Top 0.1%; 28.3%
2. Microbial Genomics: 204 papers in training set; Top 0.3%; 7.0%
3. PLOS Computational Biology: 1633 papers in training set; Top 5%; 7.0%
4. PLOS ONE: 4510 papers in training set; Top 34%; 4.3%
5. mSystems: 361 papers in training set; Top 3%; 3.8%

50% of probability mass above

6. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 1%; 3.7%
7. PeerJ: 261 papers in training set; Top 2%; 3.7%
8. F1000Research: 79 papers in training set; Top 0.6%; 3.3%
9. Gigabyte: 60 papers in training set; Top 0.3%; 3.1%
10. Scientific Reports: 3102 papers in training set; Top 49%; 2.1%
11. Access Microbiology: 22 papers in training set; Top 0.1%; 2.1%
12. Bioinformatics Advances: 184 papers in training set; Top 3%; 1.7%
13. Bioinformatics: 1061 papers in training set; Top 7%; 1.7%
14. NAR Genomics and Bioinformatics: 214 papers in training set; Top 2%; 1.5%
15. BMC Genomics: 328 papers in training set; Top 3%; 1.4%
16. mSphere: 281 papers in training set; Top 4%; 1.3%
17. BMC Bioinformatics: 383 papers in training set; Top 6%; 1.1%
18. Frontiers in Bioinformatics: 45 papers in training set; Top 0.5%; 1.0%
19. iScience: 1063 papers in training set; Top 24%; 1.0%
20. Biology Methods and Protocols: 53 papers in training set; Top 2%; 1.0%
21. Nucleic Acids Research: 1128 papers in training set; Top 15%; 0.9%
22. Journal of Visualized Experiments: 30 papers in training set; Top 0.7%; 0.8%
23. Genome Biology: 555 papers in training set; Top 7%; 0.8%
24. Frontiers in Microbiology: 375 papers in training set; Top 9%; 0.8%
25. Viruses: 318 papers in training set; Top 5%; 0.8%
26. Wellcome Open Research: 57 papers in training set; Top 3%; 0.7%
27. Mobile DNA: 27 papers in training set; Top 0.2%; 0.7%
28. Peer Community Journal: 254 papers in training set; Top 5%; 0.5%
29. G3 Genes|Genomes|Genetics: 351 papers in training set; Top 3%; 0.5%
30. Journal of Proteome Research: 215 papers in training set; Top 3%; 0.5%