
Advancing Plant Metabolic Research By Using Large Language Models To Expand Databases And Extract Labelled Data

Knapp, R.; Johnson, B.; Busta, L.

2024-11-06 plant biology
10.1101/2024.11.05.622126 bioRxiv

Premise: Recently, plant science has seen transformative advances in scalable data collection for sequence and chemical data. These large datasets, combined with machine learning, revealed that conducting plant metabolic research on large scales yields remarkable insights. A key next step in increasing scale has been revealed with the advent of accessible large language models, which, even in their early stages, can distill structured data from literature. This brings us closer to creating specialized databases that consolidate virtually all published knowledge on a topic. Methods: Here, we first test different combinations of prompt engineering techniques and language models in the identification of validated enzyme-product pairs. Next, we evaluate automated prompt engineering and retrieval-augmented generation applied to identifying compound-species associations. Finally, we build and determine the accuracy of a multimodal language model-based pipeline that transcribes images of tables into machine-readable formats. Results: When tuned for each specific task, these methods perform with high accuracies (80-90 percent for enzyme-product pair identification and table image transcription), or with modest accuracies (50 percent) but lower false-negative rates than previous methods (down to 40 percent from 55 percent) for compound-species pair identification. Discussion: We enumerate several suggestions for working with language models as researchers, among which is the importance of the user's domain-specific expertise and knowledge.

Significance Statement: Scientific databases have played a major role in advancing metabolic research. However, even today's advanced databases are incomplete and/or are not built to best suit certain research tasks. Here, we explored and evaluated the use of large language models and various prompt engineering techniques to expand and subset existing databases in task-specific ways. Our results illustrate the potential for high-accuracy additions and restructurings of existing databases using language models, assuming the specific methods by which the models are used are tuned and validated for the specific task. These findings are important because they outline a method by which we could greatly expand existing databases and rapidly tailor them to specific research efforts, leading to greater research productivity and effective utilization of past research findings.

All authors collected data, analyzed data, prepared the manuscript, and approved its final version. The authors declare that they have no competing interests.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.
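
The cutoff can be checked by accumulating the listed probabilities in rank order until the running total reaches 50%. The following is a minimal sketch of that arithmetic, not the site's own code: the function name `journals_to_reach` is hypothetical, and the values are the probabilities of the ten highest-ranked journals as listed on this page.

```python
from itertools import accumulate

# Probabilities (in percent) of the ten highest-ranked journals,
# copied from the list on this page in rank order.
top_probabilities = [22.0, 12.4, 12.0, 8.2, 3.9, 3.5, 2.5, 2.3, 2.0, 1.8]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached."""
    for count, running_total in enumerate(accumulate(probabilities), start=1):
        if running_total >= threshold:
            return count
    return None


print(journals_to_reach(top_probabilities, 50.0))
```

With these values the first three journals sum to roughly 46.4% and the first four to roughly 54.6%, so the threshold is crossed at the fourth journal.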

1 | Applications in Plant Sciences | 21 papers in training set | Top 0.1% | 22.0%
2 | Plant Direct | 81 papers in training set | Top 0.1% | 12.4%
3 | The Plant Journal | 197 papers in training set | Top 0.2% | 12.0%
4 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 8% | 8.2%
50% of probability mass above
5 | Plant Communications | 35 papers in training set | Top 0.3% | 3.9%
6 | Plant Physiology | 217 papers in training set | Top 1% | 3.5%
7 | GigaScience | 172 papers in training set | Top 0.8% | 2.5%
8 | The Plant Cell | 141 papers in training set | Top 1% | 2.3%
9 | Nature Plants | 84 papers in training set | Top 0.9% | 2.0%
10 | PLOS ONE | 4510 papers in training set | Top 51% | 1.8%
11 | Database | 51 papers in training set | Top 0.3% | 1.8%
12 | Plant Biotechnology Journal | 56 papers in training set | Top 0.6% | 1.7%
13 | The Plant Phenome Journal | 14 papers in training set | Top 0.1% | 1.7%
14 | PLANTS, PEOPLE, PLANET | 21 papers in training set | Top 0.4% | 1.7%
15 | New Phytologist | 309 papers in training set | Top 4% | 1.2%
16 | Metabolites | 50 papers in training set | Top 0.8% | 1.1%
17 | Plant Phenomics | 17 papers in training set | Top 0.3% | 0.9%
18 | Scientific Data | 174 papers in training set | Top 2% | 0.9%
19 | eLife | 5422 papers in training set | Top 56% | 0.8%
20 | Genome Biology | 555 papers in training set | Top 8% | 0.7%
21 | The Plant Genome | 53 papers in training set | Top 0.7% | 0.7%
22 | PLOS Computational Biology | 1633 papers in training set | Top 25% | 0.7%
23 | Nucleic Acids Research | 1128 papers in training set | Top 18% | 0.7%
24 | Molecular Systems Biology | 142 papers in training set | Top 2% | 0.7%
25 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
26 | BMC Biology | 248 papers in training set | Top 5% | 0.7%
27 | Metabolic Engineering | 68 papers in training set | Top 0.7% | 0.7%
28 | Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
29 | G3: Genes, Genomes, Genetics | 222 papers in training set | Top 1% | 0.7%
30 | Methods in Ecology and Evolution | 160 papers in training set | Top 2% | 0.7%