dna-parser: a Python library written in Rust for fast encoding of DNA and RNA sequences
Vilain, M.; Aris-Brosou, S.
Background: The ever-growing amount of available biological data means that modern analyses are performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to handle such large amounts of data efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, the Python tools currently available for this preprocessing step are not well suited to integration into pipelines, which results in slow encoding speeds.

Results: We introduce dna-parser, a Python library written in Rust to encode DNA and RNA sequences into numerical features. The combination of Rust and Python allows sequences to be encoded rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, this library implements many of the most widely used numerical feature schemes from bioinformatics and natural language processing.

Conclusion: dna-parser is an easy-to-install Python library that offers many Python wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open-source code is available on GitHub (https://github.com/Mvila035/dna_parser) along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
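To make the idea of a numerical descriptor concrete, the following is a minimal pure-Python sketch of two commonly used encodings: one-hot encoding and k-mer counting. It does not use dna-parser and should not be read as the library's API or as a list of the schemes it implements; the function names `one_hot_encode` and `kmer_counts`, the `ALPHABET` constant, and the choice to map unknown symbols to all-zero rows are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Assumption: a plain DNA alphabet; an RNA version would use "ACGU".
ALPHABET = "ACGT"


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Return one row per base; symbols outside the alphabet (e.g. N) become all zeros."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2, alphabet=ALPHABET):
    """Count overlapping k-mers and return the counts in a fixed lexicographic order."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # The vocabulary holds every possible k-mer, so all sequences share one vector length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


encoded = one_hot_encode("ACGTN")    # 5 rows of length 4; the row for N is all zeros
counts = kmer_counts("ACGTAC", k=2)  # vector of length 16 (4**2 possible 2-mers)
```

Both functions walk each sequence base by base in interpreted Python, which is the kind of per-sequence loop that becomes slow on large datasets and that a compiled, multithreaded implementation is meant to replace.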