Identifying widespread and recurrent variants of genetic parts to improve annotation of engineered DNA sequences

McGuffie, M. J.; Barrick, J. E.

2023-04-10 synthetic biology
10.1101/2023.04.10.536277 bioRxiv
Engineered plasmids have been workhorses of recombinant DNA technology for nearly half a century. Plasmids are used to clone DNA sequences encoding new genetic parts and to reprogram cells by combining these parts in new ways. Historically, many genetic parts on plasmids were copied and reused without routinely checking their DNA sequences. With the widespread use of high-throughput DNA sequencing technologies, we now know that plasmids often contain variants of common genetic parts that differ slightly from their canonical sequences. Because the exact provenance of a genetic part on a particular plasmid is usually unknown, it is difficult to determine whether these differences arose due to mutations during plasmid construction and propagation or due to intentional editing by researchers. In either case, it is important to understand how the sequence changes alter the properties of the genetic part. We analyzed the sequences of over 50,000 engineered plasmids using depositor metadata and a metric inspired by the natural language processing field. We detected 217 uncatalogued genetic part variants that were especially widespread or were likely the result of convergent evolution or engineering. Several of these uncatalogued variants are known mutants of plasmid origins of replication or antibiotic resistance genes that are missing from current annotation databases. However, most are uncharacterized, and 3/5 of the plasmids we analyzed contained at least one of the uncatalogued variants. Our results include a list of genetic parts to prioritize for refining engineered plasmid annotation pipelines, highlight widespread variants of parts that warrant further investigation to see whether they have altered characteristics, and suggest cases where unintentional evolution of plasmid parts may be affecting the reliability and reproducibility of science. 

Author Summary

Plasmids are used in molecular biology and biotechnology for a wide variety of tasks such as cloning DNA, expressing recombinant proteins, and creating vaccines. One challenge in working with plasmids is the long and often lost history of pieces of plasmids being copied and remixed by researchers to create new plasmids. Current databases used for annotating key genetic parts in plasmids are incomplete, especially with respect to cataloguing closely related versions of parts that can have very different characteristics. Some genetic part variants have arisen due to purposeful editing, while others are the result of unplanned mutations and evolution. When researchers find differences between a database sequence and a genetic part in a newly constructed plasmid, it is often unclear how and when those differences arose and whether they will affect experiments. We identified 217 genetic part variants that are either widespread or have likely arisen independently more than once on plasmids due to convergent evolution or engineering. We propose that these variants should be prioritized for inclusion in curated databases of engineered DNA sequences and for functional characterization to improve the reliability and reproducibility of science.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. ACS Synthetic Biology (256 papers in training set): Top 0.2%, 23.0%
2. Synthetic Biology (21 papers in training set): Top 0.1%, 8.6%
3. Nucleic Acids Research (1128 papers in training set): Top 3%, 6.9%
4. Nature (575 papers in training set): Top 4%, 6.4%
5. Molecular Systems Biology (142 papers in training set): Top 0.1%, 4.4%
6. Nature Communications (4913 papers in training set): Top 34%, 4.4%

50% of probability mass above

7. Cell (370 papers in training set): Top 6%, 3.7%
8. Philosophical Transactions of the Royal Society B (51 papers in training set): Top 2%, 2.8%
9. eLife (5422 papers in training set): Top 31%, 2.8%
10. Cell Systems (167 papers in training set): Top 5%, 2.1%
11. PLOS ONE (4510 papers in training set): Top 47%, 2.1%
12. G3: Genes, Genomes, Genetics (222 papers in training set): Top 0.3%, 1.8%
13. Nature Biotechnology (147 papers in training set): Top 4%, 1.8%
14. mSystems (361 papers in training set): Top 5%, 1.7%
15. Proceedings of the National Academy of Sciences (2130 papers in training set): Top 36%, 1.4%
16. PLOS Computational Biology (1633 papers in training set): Top 18%, 1.4%
17. Science (429 papers in training set): Top 16%, 1.4%
18. Structure (175 papers in training set): Top 2%, 1.2%
19. Protein Science (221 papers in training set): Top 1%, 1.2%
20. NAR Genomics and Bioinformatics (214 papers in training set): Top 3%, 0.8%
21. Bioinformatics (1061 papers in training set): Top 9%, 0.8%
22. The CRISPR Journal (33 papers in training set): Top 0.2%, 0.8%
23. Journal of Molecular Biology (217 papers in training set): Top 3%, 0.8%
24. Frontiers in Bioengineering and Biotechnology (88 papers in training set): Top 3%, 0.8%
25. Mobile DNA (27 papers in training set): Top 0.2%, 0.8%
26. Cell Genomics (162 papers in training set): Top 6%, 0.8%
27. Nature Methods (336 papers in training set): Top 6%, 0.7%
28. Scientific Reports (3102 papers in training set): Top 76%, 0.7%
29. BMC Genomics (328 papers in training set): Top 7%, 0.5%
30. Protein Engineering, Design and Selection (14 papers in training set): Top 0.1%, 0.5%