A novel pipeline for the rapid expansion of ecological trait databases using LLMs

Ramos, R. J.; Afkhami, M. E.; Aguilar-Trigueros, C. A.; Barbour, K. M.; Chaverri, P.; Cuprewich, S. A.; Egan, C. P.; Lynn, K. M. T.; Peay, K. G.; Norros, V.; Romero-Olivares, A. L.; Ward, L.; Chaudhary, B.

2026-03-12 ecology
10.64898/2026.03.10.710865 bioRxiv

This paper presents a novel workflow leveraging Large Language Models (LLMs) to rapidly extract trait data from fungal species descriptions, addressing a significant bottleneck in ecological research. We developed and evaluated an LLM pipeline to extract morphological trait data from arbuscular mycorrhizal fungi, comparing performance against a manually curated dataset (TraitAM). Results demonstrate the potential of LLMs for automated trait data acquisition, though accuracy varies by trait and model, with systematic biases observed. This framework offers a blueprint for building trait databases across diverse taxa and domains, significantly accelerating ecological research and conservation efforts.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.
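The 50% figure can be checked by summing the listed probabilities in rank order. The following is a minimal sketch, not the site's own method: the function name `journals_to_reach` is a hypothetical helper, and the values are the first ten probabilities from the ranking on this page.

```python
# Predicted probabilities (in percent) for the ten highest-ranked journals,
# copied from the ranking on this page.
probabilities = [17.3, 14.2, 12.2, 6.7, 4.8, 4.8, 4.8, 2.6, 2.3, 2.0]


def journals_to_reach(mass_percent, probs):
    """Return how many top-ranked entries are needed to reach mass_percent.

    Hypothetical helper for illustration; returns None if the running
    total never reaches the requested mass.
    """
    running_total = 0.0
    for count, p in enumerate(probs, start=1):
        running_total += p
        if running_total >= mass_percent:
            return count
    return None


top_k = journals_to_reach(50.0, probabilities)
print(top_k)
```

Under these values the running total after the fourth journal is 17.3 + 14.2 + 12.2 + 6.7 = 50.4%, which is the first point at which it reaches 50%.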

| Rank | Journal | Papers in training set | Percentile | Probability |
|---|---|---|---|---|
| 1 | PLOS ONE | 4510 | Top 11% | 17.3% |
| 2 | PLOS Computational Biology | 1633 | Top 2% | 14.2% |
| 3 | Methods in Ecology and Evolution | 160 | Top 0.3% | 12.2% |
| 4 | Scientific Data | 174 | Top 0.2% | 6.7% |
|  | 50% of probability mass above |  |  |  |
| 5 | Applications in Plant Sciences | 21 | Top 0.1% | 4.8% |
| 6 | Ecological Informatics | 29 | Top 0.1% | 4.8% |
| 7 | Scientific Reports | 3102 | Top 25% | 4.8% |
| 8 | New Phytologist | 309 | Top 2% | 2.6% |
| 9 | GigaScience | 172 | Top 0.9% | 2.3% |
| 10 | Bioinformatics Advances | 184 | Top 2% | 2.0% |
| 11 | BMC Biology | 248 | Top 0.9% | 1.9% |
| 12 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.6% |
| 13 | Patterns | 70 | Top 1% | 1.5% |
| 14 | iScience | 1063 | Top 22% | 1.2% |
| 15 | Computational and Structural Biotechnology Journal | 216 | Top 6% | 1.2% |
| 16 | Molecular Ecology Resources | 161 | Top 0.9% | 0.9% |
| 17 | BMC Bioinformatics | 383 | Top 6% | 0.9% |
| 18 | Frontiers in Plant Science | 240 | Top 5% | 0.9% |
| 19 | ISME Communications | 103 | Top 2% | 0.8% |
| 20 | Nature Communications | 4913 | Top 61% | 0.8% |
| 21 | Communications Biology | 886 | Top 22% | 0.8% |
| 22 | Remote Sensing in Ecology and Conservation | 10 | Top 0.3% | 0.7% |
| 23 | Ecography | 50 | Top 1% | 0.7% |
| 24 | Proceedings of the National Academy of Sciences | 2130 | Top 46% | 0.7% |
| 25 | Frontiers in Microbiology | 375 | Top 10% | 0.6% |
| 26 | PLOS Neglected Tropical Diseases | 378 | Top 6% | 0.6% |
| 27 | PeerJ | 261 | Top 18% | 0.6% |
| 28 | Ecology and Evolution | 232 | Top 4% | 0.6% |
| 29 | Bioinformatics | 1061 | Top 10% | 0.6% |
| 30 | Nucleic Acids Research | 1128 | Top 20% | 0.6% |