
LoReMINE: Long Read-based Microbial genome mining pipeline

Agrawal, A. A.; Bader, C. D.; Kalinina, O. V.

2026-02-04 bioinformatics
10.64898/2026.02.02.703231 bioRxiv

Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.

Author summary

For decades, microbial natural products have been a major source of medicines, with most of the clinically used antibiotics being their derivatives. Recent advances in DNA sequencing technologies now allow the reconstruction of more complete and continuous microbial genomes, revealing a vast and largely untapped diversity of biosynthetic gene clusters responsible for natural product biosynthesis. Despite these advances, large-scale natural product discovery remains difficult because current genome mining approaches rely on many separate tools and lack an integrated workflow to dereplicate known compounds and prioritize novel biosynthetic pathways.
To address these limitations, we introduce LoReMINE, an automated pipeline designed to simplify microbial genome mining directly from long-read sequencing data. LoReMINE integrates genome assembly, taxonomic classification, identification of biosynthetic gene clusters, and their clustering into gene cluster families within a single, reproducible workflow. This streamlined approach enables scalable analysis across diverse microbial taxa and facilitates comprehensive exploration of microbial biosynthetic potential. The pipeline is designed for both experimental and computational researchers, helping to advance natural product research and contribute towards the discovery of new therapeutic drugs.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Biotechnology: 147 papers in training set; Top 0.1%; 33.4%
2. Bioinformatics: 1061 papers in training set; Top 3%; 7.3%
3. Cell Systems: 167 papers in training set; Top 2%; 6.9%
4. Nature Methods: 336 papers in training set; Top 2%; 6.4%

50% of probability mass above

5. Nucleic Acids Research: 1128 papers in training set; Top 4%; 4.9%
6. Cell Reports Methods: 141 papers in training set; Top 0.7%; 4.0%
7. Nature Communications: 4913 papers in training set; Top 37%; 4.0%
8. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 19%; 3.6%
9. Advanced Science: 249 papers in training set; Top 12%; 1.5%
10. Nature: 575 papers in training set; Top 12%; 1.3%
11. Cell Host & Microbe: 113 papers in training set; Top 3%; 1.3%
12. Genome Biology: 555 papers in training set; Top 5%; 1.3%
13. Nature Machine Intelligence: 61 papers in training set; Top 2%; 1.2%
14. Bioinformatics Advances: 184 papers in training set; Top 4%; 1.1%
15. mSystems: 361 papers in training set; Top 6%; 1.0%
16. Scientific Reports: 3102 papers in training set; Top 71%; 0.9%
17. NAR Genomics and Bioinformatics: 214 papers in training set; Top 3%; 0.9%
18. eLife: 5422 papers in training set; Top 57%; 0.8%
19. Nature Biomedical Engineering: 42 papers in training set; Top 2%; 0.8%
20. Communications Biology: 886 papers in training set; Top 23%; 0.8%
21. Nature Genetics: 240 papers in training set; Top 8%; 0.7%
22. iScience: 1063 papers in training set; Top 34%; 0.7%
23. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 11%; 0.7%
24. GigaScience: 172 papers in training set; Top 4%; 0.7%
25. Cell Chemical Biology: 81 papers in training set; Top 4%; 0.7%
26. ISME Communications: 103 papers in training set; Top 2%; 0.7%
27. Molecular Systems Biology: 142 papers in training set; Top 2%; 0.7%
28. Briefings in Bioinformatics: 326 papers in training set; Top 7%; 0.7%
29. Microbiome: 139 papers in training set; Top 4%; 0.5%
30. Cell Genomics: 162 papers in training set; Top 8%; 0.5%
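
The 50% cutoff in the ranking can be checked by accumulating the listed probabilities from the top. The sketch below assumes the cutoff marks the smallest number of top-ranked journals whose cumulative probability reaches the threshold; that reading, and the helper name `journals_to_reach`, are assumptions for illustration and not a description of how the site computes it.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the five top-ranked journals,
# copied from the ranking above.
probabilities = [
    ("Nature Biotechnology", 33.4),
    ("Bioinformatics", 7.3),
    ("Cell Systems", 6.9),
    ("Nature Methods", 6.4),
    ("Nucleic Acids Research", 4.9),
]


def journals_to_reach(entries, threshold):
    """Hypothetical helper: return (count, cumulative mass) for the smallest
    number of top entries whose summed probability reaches the threshold,
    or None if the threshold is never reached."""
    running_totals = accumulate(prob for _, prob in entries)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count, total
    return None


count, total = journals_to_reach(probabilities, 50.0)
print(count, round(total, 1))
```

Under this reading the first three journals sum to 47.6%, below the threshold, and adding the fourth brings the total to 54.0%, so the cutoff falls after the fourth journal.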