
Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator

Kuddar, O. S.; Meiklejohn, K. A.; Callahan, B. J.

2026-04-10 bioinformatics
10.64898/2026.04.07.717010 bioRxiv

Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by providing guidance through a systematic comparison of three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator), using each to generate a trnL reference sequence database and evaluating classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess the classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although evaluation was focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.
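To make the curation and query-simulation ideas in the abstract concrete, here is a minimal Python sketch of two of the steps it names: filtering out ambiguous or redundant reference sequences, and deriving a mutated query from a base sequence. It is not the authors' workflow and does not use OBITools3, RESCRIPt, MetaCurator, or DADA2. The function names `curate` and `mutate`, the `max_ambiguous` and `rate` parameters, and the toy records are all assumptions made for illustration.

```python
import random

# IUPAC codes that stand for more than one possible base.
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")


def curate(records, max_ambiguous=0):
    """Drop ambiguous and redundant reference records (hypothetical helper).

    records: iterable of (seq_id, taxonomy, sequence) tuples.
    A record is dropped if it has more than `max_ambiguous` ambiguity
    codes, or if an identical (taxonomy, sequence) pair was already kept.
    """
    seen = set()
    kept = []
    for seq_id, taxonomy, seq in records:
        seq = seq.upper()
        if sum(base in IUPAC_AMBIGUOUS for base in seq) > max_ambiguous:
            continue  # too much sequence ambiguity
        key = (taxonomy, seq)
        if key in seen:
            continue  # redundant copy of a record already kept
        seen.add(key)
        kept.append((seq_id, taxonomy, seq))
    return kept


def mutate(seq, rate, rng):
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Toy records; the sequences are invented, not real trnL data.
records = [
    ("ref1", "Quercus alba", "ACGTACGT"),
    ("ref2", "Quercus alba", "acgtacgt"),  # duplicate of ref1 after uppercasing
    ("ref3", "Acer rubrum", "ACGTNCGT"),   # contains an ambiguous base
    ("ref4", "Acer rubrum", "ACGTACGT"),   # same sequence, different taxon
]
curated = curate(records)

# Seeded generator so the simulated queries are reproducible.
rng = random.Random(0)
queries = [(seq_id, mutate(seq, 0.05, rng)) for seq_id, _, seq in curated]
```

In this sketch, identical sequences are kept when they belong to different taxa, so dereplication does not silently discard taxonomic information. Whether the three benchmarked tools behave the same way is not stated in the abstract.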

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1 | The Plant Journal | 197 papers in training set | Top 0.4% | 9.9%
2 | Applications in Plant Sciences | 21 papers in training set | Top 0.1% | 8.3%
3 | Plant Direct | 81 papers in training set | Top 0.2% | 6.7%
4 | Molecular Ecology Resources | 161 papers in training set | Top 0.3% | 4.2%
5 | Frontiers in Plant Science | 240 papers in training set | Top 2% | 4.2%
6 | Plant Physiology | 217 papers in training set | Top 1% | 3.9%
7 | New Phytologist | 309 papers in training set | Top 2% | 3.5%
8 | PLOS ONE | 4510 papers in training set | Top 41% | 3.5%
9 | The Plant Cell | 141 papers in training set | Top 0.9% | 3.5%
10 | Scientific Data | 174 papers in training set | Top 0.5% | 3.5%
50% of probability mass above
11 | Methods in Ecology and Evolution | 160 papers in training set | Top 0.9% | 3.2%
12 | Plant Biotechnology Journal | 56 papers in training set | Top 0.4% | 2.8%
13 | Plant Communications | 35 papers in training set | Top 0.5% | 2.8%
14 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.6%
15 | The Plant Genome | 53 papers in training set | Top 0.3% | 2.0%
16 | Scientific Reports | 3102 papers in training set | Top 54% | 1.9%
17 | PeerJ | 261 papers in training set | Top 6% | 1.9%
18 | BMC Bioinformatics | 383 papers in training set | Top 4% | 1.9%
19 | Horticulture Research | 43 papers in training set | Top 0.9% | 1.8%
20 | BMC Genomics | 328 papers in training set | Top 3% | 1.7%
21 | Plant Methods | 39 papers in training set | Top 0.4% | 1.5%
22 | GigaScience | 172 papers in training set | Top 2% | 1.5%
23 | Genome Biology | 555 papers in training set | Top 5% | 1.5%
24 | Bioinformatics Advances | 184 papers in training set | Top 3% | 1.3%
25 | Physiologia Plantarum | 35 papers in training set | Top 0.4% | 0.9%
26 | Genetics | 225 papers in training set | Top 4% | 0.9%
27 | Nature Communications | 4913 papers in training set | Top 60% | 0.9%
28 | Bioinformatics | 1061 papers in training set | Top 9% | 0.9%
29 | International Journal of Molecular Sciences | 453 papers in training set | Top 16% | 0.7%
30 | eLife | 5422 papers in training set | Top 58% | 0.7%
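
The statement that the top 10 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does this for the ten highest-ranked entries; the function name `journals_covering` is an assumption for illustration, and the inputs are the rounded percentages shown in the list, so the total is approximate.

```python
def journals_covering(probabilities, threshold):
    """Return (count, cumulative) for the first prefix reaching `threshold`.

    probabilities: predicted probabilities in percent, in rank order.
    Returns None if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Rounded percentages for ranks 1 to 10, copied from the list above.
top_probabilities = [9.9, 8.3, 6.7, 4.2, 4.2, 3.9, 3.5, 3.5, 3.5, 3.5]

count, cumulative = journals_covering(top_probabilities, 50.0)
print(count, round(cumulative, 1))
```

With these rounded values the running total stays below 50% through rank 9 (about 47.7%) and passes it at rank 10 (about 51.2%), which agrees with where the list places the 50% marker.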