Ratatoskr: A tool for automated retrieval of taxonomic type strain sequences and metadata

Turkington, C.; Bastiaanssen, F.; Nezam-Abadi, N.; Shkoporov, A. N.; Hill, C.

2026-01-27 bioinformatics
10.64898/2026.01.26.700362 bioRxiv

Bacterial taxonomic type strains anchor species names to physical and genomic reference material, making them essential for reproducible and comparable prokaryotic research. While reference strains are often well-characterised through curated metadata, nomenclature histories, and sequence records, no single database holds up-to-date information on all of these aspects, so the information is fragmented across sources. Gathering the complete set of information for a type strain is further complicated by inconsistencies in nomenclature between sources, because a single strain can be described by numerous synonyms. As a result, collecting type strain data for taxonomic proposals and emendations can be an onerous task requiring extensive manual curation. To address this issue, we introduce Ratatoskr, a Python-based tool that automates the retrieval of sequences and metadata for bacterial taxonomic type strains. Ratatoskr does this by collecting the latest type strain information from the List of Prokaryotic names with Standing in Nomenclature (LPSN) and using this information to query the BacDive and NCBI databases. By applying known taxonomic synonym information, Ratatoskr resolves cross-database inconsistencies and streamlines the retrieval process. We show that Ratatoskr can obtain metadata and sequence data for bacterial type strains within seconds to minutes, depending on the number of members within the requested taxon. By automating this retrieval, Ratatoskr provides fast, accurate, and readily shareable starting points for studies that use taxonomic type strains and their data, such as new taxonomic proposals or emendations.

Data summary: Ratatoskr was developed using Python 3 and is freely available at https://github.com/Fabian-Bastiaanssen/Ratatoskr under a GPL-3.0 licence.
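The synonym-resolution idea described in the abstract can be illustrated with a minimal sketch. This is not Ratatoskr's implementation and it does not call LPSN, BacDive, or NCBI. The function names (`build_synonym_index`, `merge_records`), the dictionary layout, and the placeholder record values are all hypothetical and chosen only for illustration. The sketch assumes a synonym table is already available and shows how records filed under different names in two sources could be grouped under one canonical name.

```python
def build_synonym_index(synonyms):
    """Map every known name (lower-cased) to its canonical name.

    `synonyms` is a hypothetical structure: {canonical_name: [other_names]}.
    """
    index = {}
    for canonical, alternatives in synonyms.items():
        index[canonical.lower()] = canonical
        for name in alternatives:
            index[name.lower()] = canonical
    return index


def merge_records(index, *sources):
    """Group records from several sources under one canonical name.

    Each source is a dict {name_as_used_by_that_source: record_dict}.
    Names missing from the index are kept unchanged.
    """
    merged = {}
    for source in sources:
        for name, record in source.items():
            canonical = index.get(name.strip().lower(), name)
            merged.setdefault(canonical, {}).update(record)
    return merged


# Illustrative data only: the record values are placeholders, not real entries.
synonyms = {"Cutibacterium acnes": ["Propionibacterium acnes"]}
metadata_source = {"Cutibacterium acnes": {"strain": "PLACEHOLDER_STRAIN"}}
sequence_source = {"Propionibacterium acnes": {"accession": "PLACEHOLDER_ACCESSION"}}

index = build_synonym_index(synonyms)
merged = merge_records(index, metadata_source, sequence_source)
print(merged)
```

In this sketch both records end up under the single key `Cutibacterium acnes`, even though one source used the older name. A real tool would also need to handle strain designations, spelling variants, and conflicting fields, which are left out here.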

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics, 1061 papers in training set, Top 0.5%, 34.9%
2. Nucleic Acids Research, 1128 papers in training set, Top 2%, 8.6%
3. BMC Bioinformatics, 383 papers in training set, Top 1%, 7.3%

50% of probability mass above

4. Bioinformatics Advances, 184 papers in training set, Top 0.6%, 4.9%
5. Genome Biology, 555 papers in training set, Top 2%, 4.0%
6. NAR Genomics and Bioinformatics, 214 papers in training set, Top 0.6%, 3.7%
7. PLOS ONE, 4510 papers in training set, Top 38%, 3.7%
8. Journal of Open Source Software, 22 papers in training set, Top 0.1%, 3.1%
9. GigaScience, 172 papers in training set, Top 0.8%, 2.4%
10. PLOS Computational Biology, 1633 papers in training set, Top 13%, 2.1%
11. Nature Biotechnology, 147 papers in training set, Top 4%, 1.7%
12. Scientific Reports, 3102 papers in training set, Top 66%, 1.2%
13. Genome Research, 409 papers in training set, Top 3%, 0.9%
14. Microbial Genomics, 204 papers in training set, Top 2%, 0.9%
15. Frontiers in Microbiology, 375 papers in training set, Top 8%, 0.9%
16. Cell Reports Methods, 141 papers in training set, Top 4%, 0.9%
17. Molecular Biology and Evolution, 488 papers in training set, Top 4%, 0.8%
18. Database, 51 papers in training set, Top 1%, 0.7%
19. Microbiome, 139 papers in training set, Top 3%, 0.7%
20. Nature Methods, 336 papers in training set, Top 7%, 0.7%
21. mSystems, 361 papers in training set, Top 8%, 0.7%
22. Nature Computational Science, 50 papers in training set, Top 2%, 0.5%
23. Nature Communications, 4913 papers in training set, Top 67%, 0.5%
24. PeerJ, 261 papers in training set, Top 19%, 0.5%
25. iScience, 1063 papers in training set, Top 40%, 0.5%
26. G3 Genes|Genomes|Genetics, 351 papers in training set, Top 3%, 0.5%
27. mSphere, 281 papers in training set, Top 7%, 0.5%