AI-based prediction of herbarium sequencing success across the plant tree of life

Ranjbaran, Y.; Maurin, O.; Canadelli, E.; Morosinotto, T.; Weech, M.-H.; Kersey, P.; Antonelli, A.; Baker, W. J.; Sales, G. J.; Dal Grande, F.

2025-02-07 plant biology
10.1101/2025.02.03.636220 bioRxiv
DNA recovered from herbarium specimens represents a vital asset in botanical research, playing a pivotal role in unravelling the evolution, diversity, and ecological dynamics of plants. Despite its importance, challenges such as fragmented DNA and insufficient sequencing yields render molecular data retrieval a high-risk and costly endeavour involving the use of non-replaceable herbarium specimens. Here, we propose a framework based on Artificial Intelligence (AI) to forecast the success of genomic DNA extraction suitable for sequencing from herbarium samples. Our model integrates morphological characteristics and sample colour derived from scanned herbarium images, metadata including sample age and locality, and DNA quantity measurements of samples. We train a deep learning algorithm with ca. 2,000 specimens that have been digitized and sequenced in the framework of the Plant and Fungal Trees of Life (PAFTOL) Project, spanning from year 1832 to the present. As training datasets increase with ongoing digitization and genomic sequencing efforts, our AI predictive model can support researchers in selecting the herbarium samples with the highest likelihood of yielding high-quality genomic DNA from amongst a vast array of globally distributed candidate specimens. Our approach enhances the contribution of herbarium-derived DNA in large-scale studies and facilitates the utilisation of historical collections for a deeper understanding of plant evolution and ecology, with implications for conservation.

Matching journals

The top 5 journals together account for more than 50% of the predicted probability mass (56.6% by the values listed).

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Applications in Plant Sciences | 21 | Top 0.1% | 22.5% |
| 2 | New Phytologist | 309 | Top 0.4% | 12.5% |
| 3 | PLOS ONE | 4510 | Top 24% | 7.2% |
| 4 | PLANTS, PEOPLE, PLANET | 21 | Top 0.1% | 7.2% |
| 5 | The Plant Journal | 197 | Top 0.6% | 7.2% |
| | (50% of probability mass above) | | | |
| 6 | Scientific Data | 174 | Top 0.4% | 4.0% |
| 7 | Scientific Reports | 3102 | Top 37% | 3.6% |
| 8 | Methods in Ecology and Evolution | 160 | Top 0.8% | 3.6% |
| 9 | GigaScience | 172 | Top 0.8% | 2.6% |
| 10 | Frontiers in Plant Science | 240 | Top 3% | 2.1% |
| 11 | Proceedings of the National Academy of Sciences | 2130 | Top 32% | 1.7% |
| 12 | Systematic Biology | 121 | Top 0.3% | 1.2% |
| 13 | BMC Genomics | 328 | Top 4% | 1.2% |
| 14 | eLife | 5422 | Top 50% | 1.1% |
| 15 | Nature Communications | 4913 | Top 58% | 0.9% |
| 16 | Horticulture Research | 43 | Top 1% | 0.9% |
| 17 | The Plant Phenome Journal | 14 | Top 0.2% | 0.9% |
| 18 | Journal of Computational Biology | 37 | Top 0.5% | 0.8% |
| 19 | Communications Biology | 886 | Top 21% | 0.8% |
| 20 | BMC Biology | 248 | Top 4% | 0.8% |
| 21 | Plant Phenomics | 17 | Top 0.3% | 0.7% |
| 22 | Genome Biology | 555 | Top 7% | 0.7% |
| 23 | Molecular Biology and Evolution | 488 | Top 5% | 0.6% |
| 24 | Science | 429 | Top 21% | 0.6% |
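
The 50% cutoff in the list above can be reproduced with a running sum over the ranked probabilities. The following is a minimal sketch, not the site's actual method: the function name `journals_to_reach` and the variable `probabilities` are hypothetical, and the values are the first ten percentages copied from the list.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals,
# copied from the list above, in rank order.
probabilities = [22.5, 12.5, 7.2, 7.2, 7.2, 4.0, 3.6, 3.6, 2.6, 2.1]


def journals_to_reach(probs, threshold=50.0):
    """Return the smallest number of top-ranked entries whose
    cumulative probability reaches the threshold, or None if the
    threshold is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


cutoff = journals_to_reach(probabilities)
print(f"Journals needed to reach 50% of the probability mass: {cutoff}")
```

Because the sum is taken in rank order, the cutoff is the first rank at which the running total crosses the threshold, so the cumulative mass at that rank can exceed 50%.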