
Exon Targeted Retrieval and Classification Toolbox (ExTRaCT): a gene search pipeline to find APOBEC3 Z-domains in novel bat genomes

Delamonica, B.; Bat1K 21-Families Group; Larijani, M.; MacCarthy, T.; Davalos, L. M.

2026-03-18 genomics
10.64898/2026.03.15.711917 bioRxiv

Motivation: Several computational gene search tools exist to identify and annotate genes in an ever-growing body of newly sequenced genomes from different species. Many annotation tools, however, fall short when the target species diverges from well-studied model organisms and when searching for short genes with multiple copies.

Results: We have developed the Exon Targeted Retrieval and Classification Toolbox, ExTRaCT, an automated pipeline to identify any gene exon with conserved structure in novel species genome assemblies. In the use cases presented here, we applied our search tool to 102 bat genomes to find APOBEC3 gene family members. We show that our homolog search algorithm is efficient (average run time of 5 hours for over 100 genomes), works well with reference sequences distantly related to the target (1 misclassification out of 498, 0 false positives, and 2 false negatives), and is easy to use. As genomic sequencing becomes faster and more accessible, ExTRaCT has downstream applications in phylogenetic, biochemical, and genomic studies. It is a simple computational tool that provides a solution to target gene identification, requiring neither whole-genome-assembly annotations nor prior knowledge of closely related species.

Availability: https://doi.org/10.5281/zenodo.15769018

Contact: Brenda.delamonica@stonybrook.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.6%; 34.5%
2. Bioinformatics Advances; 184 papers in training set; Top 0.1%; 10.5%
3. BMC Bioinformatics; 383 papers in training set; Top 1%; 7.2%
50% of probability mass above
4. GigaScience; 172 papers in training set; Top 0.3%; 4.9%
5. Genome Biology; 555 papers in training set; Top 2%; 4.3%
6. Nucleic Acids Research; 1128 papers in training set; Top 6%; 3.6%
7. PLOS Computational Biology; 1633 papers in training set; Top 10%; 3.6%
8. G3: Genes, Genomes, Genetics; 222 papers in training set; Top 0.2%; 3.3%
9. BMC Genomics; 328 papers in training set; Top 1%; 2.7%
10. Nature Methods; 336 papers in training set; Top 3%; 2.5%
11. Methods in Ecology and Evolution; 160 papers in training set; Top 1%; 2.4%
12. NAR Genomics and Bioinformatics; 214 papers in training set; Top 1%; 2.1%
13. Genome Research; 409 papers in training set; Top 2%; 1.7%
14. Nature Communications; 4913 papers in training set; Top 53%; 1.5%
15. Virus Evolution; 140 papers in training set; Top 0.9%; 1.5%
16. Nature Biotechnology; 147 papers in training set; Top 6%; 1.0%
17. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 2%; 0.9%
18. Molecular Ecology Resources; 161 papers in training set; Top 1%; 0.8%
19. Molecular Biology and Evolution; 488 papers in training set; Top 4%; 0.8%
20. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 45%; 0.7%
21. PLOS ONE; 4510 papers in training set; Top 69%; 0.7%
22. PeerJ; 261 papers in training set; Top 16%; 0.7%
23. Genome Medicine; 154 papers in training set; Top 9%; 0.6%
24. Genome Biology and Evolution; 280 papers in training set; Top 2%; 0.6%
25. Peer Community Journal; 254 papers in training set; Top 5%; 0.5%