Using natural language processing to extract plant functional traits from unstructured text

Domazetoski, V.; Kreft, H.; Bestova, H.; Wieder, P.; Koynov, R.; Zarei, A.; Weigelt, P.

2023-11-06 ecology
10.1101/2023.11.06.565787 bioRxiv
Functional plant ecology aims to understand how functional traits govern the distribution of species along environmental gradients, the assembly of communities, and ecosystem functions and services. The rapid rise of functional plant ecology has been fostered by the mobilization and integration of global trait datasets, but significant knowledge gaps remain about the functional traits of the ~380,000 vascular plant species worldwide. The acquisition of urgently needed information through field campaigns remains challenging, time-consuming and costly. An alternative and so far largely untapped resource for trait information is represented by texts in books, research articles and on the internet which can be mobilized by modern machine learning techniques. Here, we propose a natural language processing (NLP) pipeline that automatically extracts trait information from an unstructured textual description of a species and provides a confidence score. To achieve this, we employ textual classification models for categorical traits and question answering models for numerical traits. We demonstrate the proposed pipeline on five categorical traits (growth form, life cycle, epiphytism, climbing habit and life form), and three numerical traits (plant height, leaf length, and leaf width). We evaluate the performance of our new NLP pipeline by comparing results obtained using different alternative modeling approaches ranging from a simple keyword search to large language models, on two extensive databases, each containing more than 50,000 species descriptions. The final optimized pipeline utilized a transformer architecture to obtain a mean precision of 90.8% (range 81.6-97%) and a mean recall of 88.6% (77.4-97%) on the categorical traits, which is an average increase of 21.4% in precision and 57.4% in recall compared to a standard approach using regular expressions. The question answering model for numerical traits obtained a normalized mean absolute error of 10.3% averaged across all traits. The NLP pipeline we propose has the potential to facilitate the digitalization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions. Additionally, our study adds to an emerging body of NLP applications in an ecological context, opening up new opportunities for further research at the intersection of these fields.
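
To make the regular-expression baseline mentioned in the abstract more concrete, here is a minimal sketch of a keyword classifier for one categorical trait (growth form), scored with micro-averaged precision and recall. It is an illustration only, not the authors' implementation: the keyword patterns, the three example descriptions, the gold labels, and the function names are all hypothetical.

```python
import re

# Hypothetical keyword patterns per growth form; not taken from the paper.
GROWTH_FORM_PATTERNS = {
    "tree": re.compile(r"\btrees?\b", re.IGNORECASE),
    "shrub": re.compile(r"\b(?:shrubs?|subshrubs?)\b", re.IGNORECASE),
    "herb": re.compile(r"\b(?:herbs?|herbaceous)\b", re.IGNORECASE),
}


def predict_growth_forms(description):
    """Return every growth form whose keyword pattern occurs in the text."""
    return {
        label
        for label, pattern in GROWTH_FORM_PATTERNS.items()
        if pattern.search(description)
    }


def precision_recall(predicted, actual):
    """Micro-averaged precision and recall over lists of label sets."""
    true_positives = sum(len(p & a) for p, a in zip(predicted, actual))
    n_predicted = sum(len(p) for p in predicted)
    n_actual = sum(len(a) for a in actual)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    return precision, recall


# Made-up species descriptions with made-up gold labels.
descriptions = [
    "Evergreen tree to 25 m tall with smooth grey bark.",
    "Perennial herb, stems erect, often growing under trees.",
    "Woody plant to 2 m, much branched from the base.",
]
gold = [{"tree"}, {"herb"}, {"shrub"}]

predictions = [predict_growth_forms(d) for d in descriptions]
precision, recall = precision_recall(predictions, gold)
print(predictions, precision, recall)
```

In this toy data the second description triggers a spurious "tree" match (the plant merely grows under trees) and the third contains no keyword at all, so the keyword search loses both precision and recall. These are the kinds of errors that a model reading the full context may be able to avoid.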

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Applications in Plant Sciences (21 papers in training set; Top 0.1%; 18.5%)
2. PLOS ONE (4510 papers in training set; Top 16%; 12.3%)
3. Ecological Informatics (29 papers in training set; Top 0.1%; 8.2%)
4. Scientific Reports (3102 papers in training set; Top 12%; 7.1%)
5. Frontiers in Plant Science (240 papers in training set; Top 2%; 6.3%)

50% of probability mass above

6. Methods in Ecology and Evolution (160 papers in training set; Top 0.6%; 4.8%)
7. New Phytologist (309 papers in training set; Top 2%; 4.1%)
8. Plant Phenomics (17 papers in training set; Top 0.1%; 3.6%)
9. Bioinformatics Advances (184 papers in training set; Top 2%; 3.0%)
10. Plant Methods (39 papers in training set; Top 0.3%; 2.6%)
11. iScience (1063 papers in training set; Top 9%; 2.4%)
12. GigaScience (172 papers in training set; Top 1.0%; 2.1%)
13. Remote Sensing in Ecology and Conservation (10 papers in training set; Top 0.2%; 1.7%)
14. BMC Bioinformatics (383 papers in training set; Top 5%; 1.7%)
15. PLOS Computational Biology (1633 papers in training set; Top 16%; 1.7%)
16. Patterns (70 papers in training set; Top 1%; 1.2%)
17. Heliyon (146 papers in training set; Top 5%; 0.9%)
18. Nature Communications (4913 papers in training set; Top 61%; 0.8%)
19. PeerJ (261 papers in training set; Top 15%; 0.7%)
20. Scientific Data (174 papers in training set; Top 2%; 0.7%)
21. Computational and Structural Biotechnology Journal (216 papers in training set; Top 9%; 0.7%)
22. Plant Physiology (217 papers in training set; Top 3%; 0.6%)
23. Bioinformatics (1061 papers in training set; Top 10%; 0.6%)
24. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 47%; 0.6%)
25. Plant Biotechnology Journal (56 papers in training set; Top 1%; 0.6%)
26. BMC Biology (248 papers in training set; Top 6%; 0.6%)
27. eLife (5422 papers in training set; Top 61%; 0.6%)