
Assessing the potential of bee-collected pollen sequence data to train machine learning models for geolocation of sample origin

Hayes, R. A.; Kern, A. D.; Ponisio, L. C.

2026-04-01 bioinformatics
10.64898/2026.03.29.715128 bioRxiv

Pollen is a robust and widespread substance that captures a snapshot of a specific time and place, and pollen deposited on objects can be used to track their movement through space. Palynology, the study of pollen, is used across fields such as conservation, natural history, and forensics, where it is particularly useful for tracing the origin and movement of objects. However, pollen has remained underutilized because many pollen taxa are difficult to distinguish beyond the family level and because limited pollen reference material is available to support location predictions. Recent developments in pollen DNA metabarcoding have addressed these issues, but most of the available pollen data come from wind-pollinated species, which are widespread and therefore less informative about specific sample locations. Bee-collected pollen is an untapped resource for training predictive models to geolocate sample origin. Here we compiled bee-collected pollen DNA sequence relative abundance data from three projects in the western U.S. and assessed the accuracy of supervised machine learning models in predicting the location of sample origin based solely on pollen assemblage, without incorporating additional data. Random Forest and k-Nearest Neighbors models yielded high accuracy across all projects. We also found that models trained on taxonomically clustered pollen assigned sequence variants (ASVs) performed slightly better than those trained on raw sequence data, but the difference was minor, indicating that models trained on raw sequence data can reliably predict location and avoid the time-consuming taxonomic assignment process. Our results demonstrate the utility of repurposing bee-collected pollen for geolocation and provide a framework for employing supervised machine learning in future geolocation efforts.
Highlights
- Bee-collected pollen metabarcoding data was used to accurately predict sample origin
- Random Forest and k-Nearest Neighbors algorithms were most accurate with lowest error
- Taxonomically-classified and raw DNA sequence data training sets performed comparably
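
The abstract describes predicting a sample's site of origin from its pollen relative abundance profile using supervised models such as k-Nearest Neighbors. The sketch below illustrates only that general idea with a hand-written k-Nearest Neighbors classifier. It is not the authors' pipeline: the toy data, the site labels, the function names, and the choice of Bray-Curtis dissimilarity as the distance are all assumptions made for illustration.

```python
from collections import Counter


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (assumed metric)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def knn_predict(train_X, train_y, sample, k=3):
    """Predict a site label by majority vote among the k nearest training samples."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: bray_curtis(pair[0], sample),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each row is one sample, each column the relative
# abundance of one pollen taxon (rows sum to 1). Not data from the study.
train_X = [
    [0.70, 0.20, 0.10, 0.00],
    [0.65, 0.25, 0.05, 0.05],
    [0.60, 0.30, 0.10, 0.00],
    [0.05, 0.10, 0.25, 0.60],
    [0.00, 0.15, 0.30, 0.55],
    [0.10, 0.05, 0.20, 0.65],
]
train_y = ["site_A", "site_A", "site_A", "site_B", "site_B", "site_B"]

# A new sample whose profile resembles the site_A samples.
query = [0.62, 0.28, 0.08, 0.02]
predicted = knn_predict(train_X, train_y, query, k=3)
print(predicted)
```

In practice a library implementation with cross-validation would likely be preferable; the hand-written version is used here only to keep the example dependency-free.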

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Molecular Ecology Resources: 161 papers in training set, Top 0.1%, 14.9%
2. Environmental DNA: 49 papers in training set, Top 0.1%, 14.5%
3. PLOS ONE: 4510 papers in training set, Top 18%, 10.2%
4. Scientific Reports: 3102 papers in training set, Top 8%, 9.3%
5. Methods in Ecology and Evolution: 160 papers in training set, Top 0.5%, 6.5%
50% of probability mass above
6. PeerJ: 261 papers in training set, Top 1%, 4.4%
7. Gigabyte: 60 papers in training set, Top 0.3%, 3.3%
8. Frontiers in Plant Science: 240 papers in training set, Top 3%, 2.6%
9. Applications in Plant Sciences: 21 papers in training set, Top 0.1%, 1.9%
10. Ecological Informatics: 29 papers in training set, Top 0.4%, 1.7%
11. G3: 33 papers in training set, Top 0.3%, 1.3%
12. Ecology and Evolution: 232 papers in training set, Top 3%, 1.2%
13. Insects: 36 papers in training set, Top 0.7%, 1.2%
14. New Phytologist: 309 papers in training set, Top 4%, 1.2%
15. Molecular Ecology: 304 papers in training set, Top 3%, 1.1%
16. Genes: 126 papers in training set, Top 2%, 1.0%
17. BMC Bioinformatics: 383 papers in training set, Top 6%, 1.0%
18. BMC Genomics: 328 papers in training set, Top 4%, 1.0%
19. International Journal of Molecular Sciences: 453 papers in training set, Top 14%, 0.8%
20. BMC Ecology and Evolution: 49 papers in training set, Top 2%, 0.8%
21. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 2%, 0.8%
22. Systematic Entomology: 11 papers in training set, Top 0.1%, 0.7%
23. Plant Direct: 81 papers in training set, Top 2%, 0.7%
24. eLife: 5422 papers in training set, Top 63%, 0.5%
25. Journal of Applied Ecology: 35 papers in training set, Top 0.9%, 0.5%
26. Peer Community Journal: 254 papers in training set, Top 5%, 0.5%
27. Bioinformatics Advances: 184 papers in training set, Top 6%, 0.5%
28. Ecosphere: 53 papers in training set, Top 0.9%, 0.5%
29. Metabarcoding and Metagenomics: 12 papers in training set, Top 0.1%, 0.5%
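
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below assumes the cutoff is the smallest rank at which the cumulative probability reaches 50%; how the page actually computes it is not stated, and the function name is hypothetical.

```python
# Listed probabilities (%) for the seven highest-ranked journals, in rank order.
probs = [14.9, 14.5, 10.2, 9.3, 6.5, 4.4, 3.3]


def journals_to_reach(probs, threshold):
    """Return (rank, cumulative %) at the first rank where the running total
    reaches the threshold, or None if it is never reached."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


rank, total = journals_to_reach(probs, 50.0)
print(rank, round(total, 1))
```

With these values the running total is 48.9% after four journals and 55.4% after five, so the fifth journal is the first to bring the total past 50%.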