
RuHere (Are You Here?): An R package to obtain, validate, and clean species records using metadata and specialist range information

Ferreira Trindade, W. C.; Caron, F.

2026-02-04 ecology
10.64898/2026.02.02.703373 bioRxiv

Species occurrence data are fundamental to understanding, predicting, and conserving global biodiversity. However, biodiversity datasets remain affected by substantial data-quality issues, particularly erroneous or imprecise geographic coordinates. Most available tools for identifying problematic records rely primarily on automated spatial or metadata-based checks and rarely integrate expert-curated species range information, which can reveal introductions or geographic errors that often escape standard validation procedures.

Here, we introduce RuHere, an R package designed to manage species occurrence data, flag potential errors, and support the iterative exploration of problematic records. RuHere streamlines the data-cleaning process by integrating six main steps: (1) obtaining species occurrence records; (2) merging datasets and standardizing spatial information; (3) flagging records based on metadata; (4) flagging records using expert-derived distribution data; (5) visualizing, investigating, and summarizing flagged issues in the final datasets; and (6) exploring and reducing sampling bias.

We demonstrate the applicability of RuHere using occurrence data for a plant species (Araucaria angustifolia) and an animal species (Cyanocorax caeruleus). Nearly 75% of records were flagged as potentially problematic, including records identified exclusively by functions relying on specialist range information.

The main strengths of RuHere lie in its integrated and computationally efficient workflow, its tools for exploring and evaluating flagged records, and its ability to incorporate expert-derived distribution data to identify occurrences outside a species' known natural range. By combining metadata-based checks, coordinate validation, and specialist knowledge, RuHere provides a robust and reproducible framework for improving the quality of species occurrence datasets.
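To make step (4) of the workflow concrete, the following is a minimal Python sketch of flagging records that fall outside an expert-drawn range polygon. It is not RuHere's API or implementation (the package is written in R and its function names are not given here). The names `point_in_polygon`, `flag_outside_range`, `toy_range`, and `toy_records` are hypothetical, the coordinates are invented for illustration, and the test treats longitude and latitude as planar coordinates, which is only a rough approximation for small ranges.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def flag_outside_range(records: List[Dict], range_polygon: List[Point]) -> List[Dict]:
    """Return copies of the records with an added 'outside_range' flag."""
    return [
        {**rec, "outside_range": not point_in_polygon((rec["lon"], rec["lat"]), range_polygon)}
        for rec in records
    ]


# Hypothetical rectangular "expert range" and two invented records.
toy_range = [(-54.0, -30.0), (-48.0, -30.0), (-48.0, -24.0), (-54.0, -24.0)]
toy_records = [
    {"id": 1, "lon": -50.0, "lat": -26.0},  # inside the toy range
    {"id": 2, "lon": -40.0, "lat": -10.0},  # outside the toy range
]

flagged = flag_outside_range(toy_records, toy_range)
for rec in flagged:
    print(rec["id"], rec["outside_range"])
```

Records are flagged rather than dropped, in keeping with the abstract's emphasis on exploring and evaluating flagged records before deciding what to remove.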

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Methods in Ecology and Evolution | 160 papers in training set | Top 0.1% | 32.0%
2 | Nature Communications | 4913 papers in training set | Top 20% | 9.8%
3 | Ecography | 50 papers in training set | Top 0.1% | 7.0%
4 | Bioinformatics Advances | 184 papers in training set | Top 0.5% | 6.2%
50% of probability mass above
5 | PLOS ONE | 4510 papers in training set | Top 33% | 4.7%
6 | PLOS Computational Biology | 1633 papers in training set | Top 10% | 3.5%
7 | Diversity and Distributions | 26 papers in training set | Top 0.1% | 2.3%
8 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 29% | 2.0%
9 | Ecology and Evolution | 232 papers in training set | Top 2% | 1.7%
10 | Scientific Reports | 3102 papers in training set | Top 60% | 1.6%
11 | Ecological Informatics | 29 papers in training set | Top 0.4% | 1.6%
12 | Bioinformatics | 1061 papers in training set | Top 7% | 1.6%
13 | GigaScience | 172 papers in training set | Top 2% | 1.4%
14 | Nature Methods | 336 papers in training set | Top 5% | 1.4%
15 | Nature Ecology & Evolution | 113 papers in training set | Top 3% | 1.3%
16 | PLOS Biology | 408 papers in training set | Top 14% | 1.2%
17 | Molecular Ecology Resources | 161 papers in training set | Top 0.8% | 1.2%
18 | Applications in Plant Sciences | 21 papers in training set | Top 0.3% | 0.9%
19 | Patterns | 70 papers in training set | Top 2% | 0.9%
20 | Global Ecology and Biogeography | 41 papers in training set | Top 0.5% | 0.9%
21 | PeerJ | 261 papers in training set | Top 13% | 0.9%
22 | Scientific Data | 174 papers in training set | Top 2% | 0.9%
23 | eLife | 5422 papers in training set | Top 59% | 0.7%
24 | Conservation Biology | 14 papers in training set | Top 0.3% | 0.7%
25 | Journal of Animal Ecology | 63 papers in training set | Top 1% | 0.7%
26 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 7% | 0.6%
27 | Systematic Biology | 121 papers in training set | Top 0.5% | 0.6%
28 | Peer Community Journal | 254 papers in training set | Top 5% | 0.6%
29 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 4% | 0.6%
30 | Ecology Letters | 121 papers in training set | Top 2% | 0.6%
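
The 50% cut-off stated for the top 4 journals can be checked with a short cumulative sum. This is a sketch that assumes the final percentage listed for each journal is its predicted probability; the function name `rank_reaching` is hypothetical, and only the first six values from the list are needed.

```python
# Predicted probabilities (in %) for ranks 1 to 6, copied from the list above.
probabilities = [32.0, 9.8, 7.0, 6.2, 4.7, 3.5]


def rank_reaching(probs, threshold):
    """Return (rank, cumulative total) at which the running sum first reaches the threshold."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None  # threshold never reached with the values supplied


rank, cumulative = rank_reaching(probabilities, 50.0)
print(rank, round(cumulative, 1))
```

Under that assumption the first three journals sum to 48.8%, so the fourth is needed to cross the threshold, at which point the running total is 55.0%.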